The pandas Data Analysis Module

2017-11-22 11:37
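I. Introduction
pandas (Python Data Analysis Library) is a data analysis module built on top of NumPy. It provides a large set of standard data models together with the tools needed to work efficiently with large datasets, and it is one of the main reasons Python has become an efficient and powerful environment for data analysis.
pandas provides three main data structures:
1) Series: a labeled one-dimensional array.
2) DataFrame: a labeled, size-mutable two-dimensional table.
3) Panel: a labeled, size-mutable three-dimensional array. (Panel was deprecated in pandas 0.20 and removed in 0.25; current versions use a DataFrame with a MultiIndex instead.)
II. Code
1. Creating a one-dimensional array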

>>> import pandas as pd
>>> import numpy as np
>>> x = pd.Series([1, 3, 5, np.nan])
>>> print(x)
0    1.0
1    3.0
2    5.0
3    NaN
dtype: float64
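The output above shows float64 even though three of the four values are integers. The following minimal sketch illustrates why: NaN is a floating-point value, so its presence promotes the whole Series. The labels "a", "b", "c" are hypothetical and only serve to show a custom index.

```python
import numpy as np
import pandas as pd

# A single NaN forces the whole Series to a floating-point dtype.
x = pd.Series([1, 3, 5, np.nan])
print(x.dtype)

# Without a missing value the same numbers are stored as integers.
y = pd.Series([1, 3, 5])
print(y.dtype)

# Custom labels (hypothetical, for illustration) replace the default 0..n-1 index.
z = pd.Series([1, 3, 5], index=["a", "b", "c"])
print(z["b"])
```

Access by label (`z["b"]`) is what distinguishes a Series from a plain NumPy array.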


2. Creating two-dimensional arrays

>>> dates = pd.date_range(start='20170101', end='20171231', freq='D')  # daily frequency
>>> print(dates)
DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04',
               '2017-01-05', '2017-01-06', '2017-01-07', '2017-01-08',
               '2017-01-09', '2017-01-10',
               ...
               '2017-12-22', '2017-12-23', '2017-12-24', '2017-12-25',
               '2017-12-26', '2017-12-27', '2017-12-28', '2017-12-29',
               '2017-12-30', '2017-12-31'],
              dtype='datetime64[ns]', length=365, freq='D')
>>> dates = pd.date_range(start='20170101', end='20171231', freq='M')  # monthly (month-end) frequency
>>> print(dates)
DatetimeIndex(['2017-01-31', '2017-02-28', '2017-03-31', '2017-04-30',
               '2017-05-31', '2017-06-30', '2017-07-31', '2017-08-31',
               '2017-09-30', '2017-10-31', '2017-11-30', '2017-12-31'],
              dtype='datetime64[ns]', freq='M')
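As a small sketch of how the `freq` argument shapes the result, the example below builds three ranges. It uses 'D' and 'MS' (month start) rather than 'M', because recent pandas releases (2.2 and later, as far as I know) prefer the alias 'ME' for month end; the variable names are illustrative only.

```python
import pandas as pd

# Daily frequency: one entry per calendar day of 2017.
days = pd.date_range(start="2017-01-01", end="2017-12-31", freq="D")
print(len(days))

# 'MS' (month start) yields the first day of each month instead of the last.
month_starts = pd.date_range(start="2017-01-01", end="2017-12-31", freq="MS")
print(month_starts[0], month_starts[-1])

# A range can also be described by a start and a number of periods.
week = pd.date_range(start="2017-01-01", periods=7, freq="D")
print(week[-1])
```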


>>> df = pd.DataFrame(np.random.randn(12, 4), index=dates, columns=list('ABCD'))
>>> print(df)
                   A         B         C         D
2017-01-31 -0.682556  0.244102  0.450855  0.236475
2017-02-28 -0.630060  0.590667  0.482438  0.225697
2017-03-31  1.066989  0.319339  1.094953  1.716053
2017-04-30  0.334944 -0.053049 -1.009493 -1.039470
2017-05-31 -0.380778 -0.044429  0.075647  0.931243
2017-06-30  0.867540  0.872197 -0.738974 -1.114596
2017-07-31  0.423371 -1.086386  0.183820 -0.438921
2017-08-31  1.285163  0.634134 -0.472973  1.281057
2017-09-30 -1.002832 -0.888122 -1.316014 -0.070637
2017-10-31  1.735617 -0.253815  0.554403  1.536211
2017-11-30  2.030384  0.667556  1.012698  0.239479
2017-12-31  2.059718 -0.089050  1.420517  0.224578


>>> df = pd.DataFrame([[np.random.randint(1, 100) for j in range(4)] for i in range(12)], index=dates, columns=list('ABCD'))
>>> print(df)
             A   B   C   D
2017-01-31  7532522
2017-02-28  70  99  70  98
2017-03-31  99  47  75  67
2017-04-30  33  70  17  49
2017-05-31  62  88  68  91
2017-06-30  19  75  18  44
2017-07-31  50  85  65  82
2017-08-31  5628776
2017-09-30  6173111
2017-10-31  8296692
2017-11-30  6359194
2017-12-31  79  58  69  33


>>> df = pd.DataFrame({'A': [np.random.randint(1, 100) for i in range(4)],
...                    'B': pd.date_range(start='20130101', periods=4, freq='D'),
...                    'C': pd.Series([1, 2, 3, 4], index=list(range(4)), dtype='float32'),
...                    'D': np.array([3] * 4, dtype='int32'),
...                    'E': pd.Categorical(["test", "train", "test", "train"]),
...                    'F': 'foo'})
>>> print(df)
    A          B    C  D      E    F
0  15 2013-01-01  1.0  3   test  foo
1  11 2013-01-02  2.0  3  train  foo
2  91 2013-01-03  3.0  3   test  foo
3  91 2013-01-04  4.0  3  train  foo


>>> df = pd.DataFrame({'A': [np.random.randint(1, 100) for i in range(4)],
...                    'B': pd.date_range(start='20130101', periods=4, freq='D'),
...                    'C': pd.Series([1, 2, 3, 4], index=['zhang', 'li', 'zhou', 'wang'], dtype='float32'),
...                    'D': np.array([3] * 4, dtype='int32'),
...                    'E': pd.Categorical(["test", "train", "test", "train"]),
...                    'F': 'foo'})
>>> print(df)
        A          B    C  D      E    F
zhang  36 2013-01-01  1.0  3   test  foo
li     86 2013-01-02  2.0  3  train  foo
zhou   10 2013-01-03  3.0  3   test  foo
wang   79 2013-01-04  4.0  3  train  foo
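When a DataFrame is built from a dict, each column keeps its own dtype and a scalar value is broadcast to every row. The sketch below mirrors the construction above with fixed values in column A (an assumption made so the result is reproducible) and inspects the resulting dtypes.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [10, 20, 30, 40],                                    # plain Python list
    "B": pd.date_range(start="2013-01-01", periods=4, freq="D"),
    "C": pd.Series([1, 2, 3, 4], dtype="float32"),
    "D": np.array([3] * 4, dtype="int32"),
    "E": pd.Categorical(["test", "train", "test", "train"]),
    "F": "foo",                                               # scalar, broadcast to every row
})

print(df.dtypes)   # one dtype per column
print(df.shape)    # (rows, columns)
```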


3. Viewing two-dimensional data

>>> df.head()  # shows the first 5 rows by default
        A          B    C  D      E    F
zhang  36 2013-01-01  1.0  3   test  foo
li     86 2013-01-02  2.0  3  train  foo
zhou   10 2013-01-03  3.0  3   test  foo
wang   79 2013-01-04  4.0  3  train  foo
>>> df.head(3)  # first 3 rows
        A          B    C  D      E    F
zhang  36 2013-01-01  1.0  3   test  foo
li     86 2013-01-02  2.0  3  train  foo
zhou   10 2013-01-03  3.0  3   test  foo
>>> df.tail(2)  # last 2 rows
       A          B    C  D      E    F
zhou  10 2013-01-03  3.0  3   test  foo
wang  79 2013-01-04  4.0  3  train  foo


4. Viewing the index, column names, and data of a DataFrame

>>> df.index
Index(['zhang', 'li', 'zhou', 'wang'], dtype='object')
>>> df.columns
Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')
>>> df.values
array([[36, Timestamp('2013-01-01 00:00:00'), 1.0, 3, 'test', 'foo'],
       [86, Timestamp('2013-01-02 00:00:00'), 2.0, 3, 'train', 'foo'],
       [10, Timestamp('2013-01-03 00:00:00'), 3.0, 3, 'test', 'foo'],
       [79, Timestamp('2013-01-04 00:00:00'), 4.0, 3, 'train', 'foo']], dtype=object)


5. Viewing summary statistics

>>> df.describe()  # mean, standard deviation, minimum, maximum, etc.
               A         C    D
count   4.000000  4.000000  4.0
mean   52.750000  2.500000  3.0
std    36.068222  1.290994  0.0
min    10.000000  1.000000  3.0
25%    29.500000  1.750000  3.0
50%    57.500000  2.500000  3.0
75%    80.750000  3.250000  3.0
max    86.000000  4.000000  3.0
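To make the rows of `describe()` less opaque, the following sketch recomputes a few of them by hand for column A of the example (values 36, 86, 10, 79). It relies on two pandas defaults: `std()` is the sample standard deviation (ddof=1) and `quantile()` interpolates linearly.

```python
import pandas as pd

a = pd.Series([36, 86, 10, 79], name="A")   # column A from the example above

print(a.mean())            # arithmetic mean
print(a.std())             # sample standard deviation (ddof=1 by default)
print(a.quantile(0.25))    # 25th percentile, linear interpolation
print(a.median())          # same as the 50% row of describe()
print(a.quantile(0.75))    # 75th percentile
```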


6. Transposing a DataFrame

>>> df.T
                 zhang                   li                 zhou  \
A                   36                   86                   10
B  2013-01-01 00:00:00  2013-01-02 00:00:00  2013-01-03 00:00:00
C                    1                    2                    3
D                    3                    3                    3
E                 test                train                 test
F                  foo                  foo                  foo

                  wang
A                   79
B  2013-01-04 00:00:00
C                    4
D                    3
E                train
F                  foo


 

7. Sorting

>>> df.sort_index(axis=0, ascending=False)  # sort by index labels along an axis
        A          B    C  D      E    F
zhou   10 2013-01-03  3.0  3   test  foo
zhang  36 2013-01-01  1.0  3   test  foo
wang   79 2013-01-04  4.0  3  train  foo
li     86 2013-01-02  2.0  3  train  foo
>>> df.sort_index(axis=1, ascending=False)
         F      E  D    C          B   A
zhang  foo   test  3  1.0 2013-01-01  36
li     foo  train  3  2.0 2013-01-02  86
zhou   foo   test  3  3.0 2013-01-03  10
wang   foo  train  3  4.0 2013-01-04  79
>>> df.sort_index(axis=0, ascending=True)
        A          B    C  D      E    F
li     86 2013-01-02  2.0  3  train  foo
wang   79 2013-01-04  4.0  3  train  foo
zhang  36 2013-01-01  1.0  3   test  foo
zhou   10 2013-01-03  3.0  3   test  foo
>>> df.sort_values(by='A')  # sort by values
        A          B    C  D      E    F
zhou   10 2013-01-03  3.0  3   test  foo
zhang  36 2013-01-01  1.0  3   test  foo
wang   79 2013-01-04  4.0  3  train  foo
li     86 2013-01-02  2.0  3  train  foo
>>> df.sort_values(by='A', ascending=False)  # descending order
        A          B    C  D      E    F
li     86 2013-01-02  2.0  3  train  foo
wang   79 2013-01-04  4.0  3  train  foo
zhang  36 2013-01-01  1.0  3   test  foo
zhou   10 2013-01-03  3.0  3   test  foo
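`sort_values` also accepts several keys, each with its own direction. The sketch below uses a reduced copy of the example frame (only columns A and E, an assumption for brevity) and also shows that sorting returns a new object rather than changing the original.

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [36, 86, 10, 79], "E": ["test", "train", "test", "train"]},
    index=["zhang", "li", "zhou", "wang"],
)

# Sort by E first, then by A in descending order within each E group.
by_two_keys = df.sort_values(by=["E", "A"], ascending=[True, False])
print(by_two_keys)

# Sorting returns a new object; the original row order is unchanged.
print(list(df.index))
```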


 

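8. Selecting data

(Column A shows different values from here on; df appears to have been regenerated with new random numbers in the original session.)

>>> df['A']  # select a column
zhang     1
li        1
zhou     60
wang     58
Name: A, dtype: int64
>>> df[0:2]  # select multiple rows with a slice
       A          B    C  D      E    F
zhang  1 2013-01-01  1.0  3   test  foo
li     1 2013-01-02  2.0  3  train  foo
>>> df.loc[:, ['A', 'C']]  # select multiple columns
        A    C
zhang   1  1.0
li      1  2.0
zhou   60  3.0
wang   58  4.0
>>> df.loc[['zhang', 'zhou'], ['A', 'D', 'E']]  # select specific rows and columns together
        A  D     E
zhang   1  3  test
zhou   60  3  test
>>> df.loc['zhang', ['A', 'D', 'E']]
A       1
D       3
E    test
Name: zhang, dtype: object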
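The selections above are all label-based. As a complementary sketch, the example below contrasts `loc` (labels) with `iloc` (integer positions) and adds a boolean mask; the frame is a reduced copy of the example with only columns A and D.

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [1, 1, 60, 58], "D": [3, 3, 3, 3]},
    index=["zhang", "li", "zhou", "wang"],
)

# Label-based: rows 'zhang' and 'zhou', column 'A'.
by_label = df.loc[["zhang", "zhou"], "A"]

# Position-based: rows 0 and 2, column 0 (the same cells).
by_position = df.iloc[[0, 2], 0]
print(by_label.equals(by_position))

# Boolean mask: rows where A is greater than 50.
large_a = df[df["A"] > 50]
print(large_a)
```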


9. Modifying and setting data

>>> df.iat[0, 2] = 3  # modify the value at a given row/column position
>>> print(df)
        A          B    C  D      E    F
zhang   1 2013-01-01  3.0  3   test  foo
li      1 2013-01-02  2.0  3  train  foo
zhou   60 2013-01-03  3.0  3   test  foo
wang   58 2013-01-04  4.0  3  train  foo
>>> df.loc[:, 'D'] = [np.random.randint(50, 60) for i in range(4)]  # replace the values of a column
>>> print(df)
        A          B    C   D      E    F
zhang   1 2013-01-01  3.0  57   test  foo
li      1 2013-01-02  2.0  52  train  foo
zhou   60 2013-01-03  3.0  57   test  foo
wang   58 2013-01-04  4.0  56  train  foo
>>> df['C'] = -df['C']  # negate the values of a column
>>> print(df)
        A          B    C   D      E    F
zhang   1 2013-01-01 -3.0  57   test  foo
li      1 2013-01-02 -2.0  52  train  foo
zhou   60 2013-01-03 -3.0  57   test  foo
wang   58 2013-01-04 -4.0  56  train  foo


10. Handling missing values

>>> df1 = df.reindex(index=['zhang', 'li', 'zhou', 'wang'], columns=list(df.columns) + ['G'])
>>> print(df1)
        A          B    C   D      E    F    G
zhang   1 2013-01-01 -3.0  57   test  foo  NaN
li      1 2013-01-02 -2.0  52  train  foo  NaN
zhou   60 2013-01-03 -3.0  57   test  foo  NaN
wang   58 2013-01-04 -4.0  56  train  foo  NaN
>>> df1.iat[0, 6] = 3  # set the value at a given position; the other elements of that column stay NaN
>>> print(df1)
        A          B    C   D      E    F    G
zhang   1 2013-01-01 -3.0  57   test  foo  3.0
li      1 2013-01-02 -2.0  52  train  foo  NaN
zhou   60 2013-01-03 -3.0  57   test  foo  NaN
wang   58 2013-01-04 -4.0  56  train  foo  NaN
>>> pd.isnull(df1)  # test for missing values; returns a boolean DataFrame
           A      B      C      D      E      F      G
zhang  False  False  False  False  False  False  False
li     False  False  False  False  False  False   True
zhou   False  False  False  False  False  False   True
wang   False  False  False  False  False  False   True
>>> df1.dropna()  # return only the rows without missing values
       A          B    C   D     E    F    G
zhang  1 2013-01-01 -3.0  57  test  foo  3.0
>>> df1['G'] = df1['G'].fillna(5)  # fill missing values with a given value (assigning back avoids chained in-place modification, which newer pandas versions warn about)
>>> print(df1)
        A          B    C   D      E    F    G
zhang   1 2013-01-01 -3.0  57   test  foo  3.0
li      1 2013-01-02 -2.0  52  train  foo  5.0
zhou   60 2013-01-03 -3.0  57   test  foo  5.0
wang   58 2013-01-04 -4.0  56  train  foo  5.0
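The three operations used above (detecting, dropping, and filling missing values) can be tried in isolation. This is a minimal sketch on a reduced frame with only columns D and G; passing a dict to `fillna` is one way to restrict the fill to a single column without modifying the original frame.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(
    {"D": [57, 52, 57, 56], "G": [3.0, np.nan, np.nan, np.nan]},
    index=["zhang", "li", "zhou", "wang"],
)

print(df1["G"].isna().sum())       # number of missing values in G

complete_rows = df1.dropna()       # keeps only rows without any NaN
filled = df1.fillna({"G": 5})      # per-column fill value; returns a new frame

print(complete_rows)
print(filled)
```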


11. Operating on data

>>> df1.mean(numeric_only=True)  # mean of the numeric columns; missing values are ignored automatically
A    30.0
C    -3.0
D    55.5
G     4.5
dtype: float64
>>> df.mean(axis=1, numeric_only=True)  # mean across each row
zhang    18.333333
li       17.000000
zhou     38.000000
wang     36.666667
dtype: float64
>>> df1.shift(1)  # shift the data down by one row
          A          B    C     D      E    F    G
zhang   NaN        NaT  NaN   NaN    NaN  NaN  NaN
li      1.0 2013-01-01 -3.0  57.0   test  foo  3.0
zhou    1.0 2013-01-02 -2.0  52.0  train  foo  5.0
wang   60.0 2013-01-03 -3.0  57.0   test  foo  5.0
>>> df1['D'].value_counts()  # count occurrences of each value (histogram-style)
57    2
56    1
52    1
Name: D, dtype: int64
>>> print(df1)
        A          B    C   D      E    F    G
zhang   1 2013-01-01 -3.0  57   test  foo  3.0
li      1 2013-01-02 -2.0  52  train  foo  5.0
zhou   60 2013-01-03 -3.0  57   test  foo  5.0
wang   58 2013-01-04 -4.0  56  train  foo  5.0


>>> df2 = pd.DataFrame(np.random.randn(10, 4))
>>> print(df2)
          0         1         2         3
0 -0.939904 -1.856658 -0.281965  0.203624
1  0.350162  0.060674 -0.914808  0.135735
2 -1.031384 -1.611274  0.341546 -0.363671
3  0.139464 -0.050959 -0.810610 -0.772648
4 -1.146810 -0.791608  1.488790 -0.490004
5 -0.100707 -0.763545 -0.071274 -0.298142
6 -0.212014  0.809709  0.693196  0.980568
7 -0.812985 -0.000325 -0.675101 -0.217394
8  0.066969 -0.084609 -0.433099  0.535616
9 -0.319120 -0.532854  1.321712 -1.751913


>>> p1 = df2[:3]
>>> print(p1)
          0         1         2         3
0 -0.939904 -1.856658 -0.281965  0.203624
1  0.350162  0.060674 -0.914808  0.135735
2 -1.031384 -1.611274  0.341546 -0.363671
>>> p2 = df2[3:7]
>>> print(p2)
          0         1         2         3
3  0.139464 -0.050959 -0.810610 -0.772648
4 -1.146810 -0.791608  1.488790 -0.490004
5 -0.100707 -0.763545 -0.071274 -0.298142
6 -0.212014  0.809709  0.693196  0.980568
>>> p3 = df2[7:]
>>> print(p3)
          0         1         2         3
7 -0.812985 -0.000325 -0.675101 -0.217394
8  0.066969 -0.084609 -0.433099  0.535616
9 -0.319120 -0.532854  1.321712 -1.751913
>>> df3 = pd.concat([p1, p2, p3])  # concatenate rows
>>> print(df3)
          0         1         2         3
0 -0.939904 -1.856658 -0.281965  0.203624
1  0.350162  0.060674 -0.914808  0.135735
2 -1.031384 -1.611274  0.341546 -0.363671
3  0.139464 -0.050959 -0.810610 -0.772648
4 -1.146810 -0.791608  1.488790 -0.490004
5 -0.100707 -0.763545 -0.071274 -0.298142
6 -0.212014  0.809709  0.693196  0.980568
7 -0.812985 -0.000325 -0.675101 -0.217394
8  0.066969 -0.084609 -0.433099  0.535616
9 -0.319120 -0.532854  1.321712 -1.751913
>>> df2 == df3
      0     1     2     3
0  True  True  True  True
1  True  True  True  True
2  True  True  True  True
3  True  True  True  True
4  True  True  True  True
5  True  True  True  True
6  True  True  True  True
7  True  True  True  True
8  True  True  True  True
9  True  True  True  True
>>> df4 = pd.DataFrame({'A': [np.random.randint(1, 5) for i in range(8)],
...                     'B': [np.random.randint(10, 15) for i in range(8)],
...                     'C': [np.random.randint(20, 30) for i in range(8)],
...                     'D': [np.random.randint(80, 100) for i in range(8)]})
>>> print(df4)
   A   B   C   D
0  4  11  24  91
1  1  13  28  95
2  2  12  27  91
3  1  12  20  87
4  3  11  24  96
5  1  13  21  99
6  3  11  22  95
7  2  13  26  98
>>> df4.groupby('A').sum()  # grouped computation
    B   C    D
A
1  38  69  281
2  25  53  189
3  22  46  191
4  11  24   91
>>> df4.groupby(['A', 'B']).mean()
         C     D
A B
1 12  20.0  87.0
  13  24.5  97.0
2 12  27.0  91.0
  13  26.0  98.0
3 11  23.0  95.5
4 11  24.0  91.0
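Beyond a single `sum()` or `mean()`, a grouped frame can apply a different aggregation to each column. The sketch below reuses the df4 values printed above and assumes a pandas version with named aggregation (0.25 or later); the output column names `rows`, `c_mean`, and `d_max` are made up for illustration.

```python
import pandas as pd

df4 = pd.DataFrame({
    "A": [4, 1, 2, 1, 3, 1, 3, 2],
    "B": [11, 13, 12, 12, 11, 13, 11, 13],
    "C": [24, 28, 27, 20, 24, 21, 22, 26],
    "D": [91, 95, 91, 87, 96, 99, 95, 98],
})

# Different aggregations per column, with explicit output names.
summary = df4.groupby("A").agg(
    rows=("B", "size"),
    c_mean=("C", "mean"),
    d_max=("D", "max"),
)
print(summary)

# Splitting and re-concatenating row slices reproduces the original frame.
rebuilt = pd.concat([df4[:3], df4[3:]])
print(rebuilt.equals(df4))
```

`DataFrame.equals` returns a single boolean, whereas `df2 == df3` in the session above returns an element-wise frame.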


12. Plotting with matplotlib

>>> import pandas as pd
>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> df = pd.DataFrame(np.random.randn(1000, 2), columns=['B', 'C']).cumsum()
>>> print(df)
             B          C
0     0.089886   0.511081
1     1.323766   1.584758
2     1.489479  -0.438671
3     0.831331  -0.398021
4    -0.248233   0.494418
5    -0.013085   0.684518
6     0.666951  -1.422161
7     1.768838  -0.658786
8     2.661080   0.648505
9     1.951751   0.836261
10    3.538785   1.657475
11    3.254034   2.052609
12    4.248620   1.568401
13    4.077173   0.055622
14    3.452590  -0.200314
15    2.627620  -0.408829
16    3.690537  -0.210440
17    3.184924   0.365447
18    3.646556  -0.150044
19    4.164563  -0.023405
20    2.391447   0.517872
21    2.865153   0.686649
22    3.623183   0.663927
23    1.545117   0.151044
24    3.595924   0.903619
25    3.013804   1.855083
26    4.438801   1.014572
27    5.155216   0.882628
28    4.431457   0.741509
29    2.841949   0.709991
..         ...        ...
970  -7.910646 -13.738689
971  -7.318091 -14.811335
972  -9.144376 -15.466873
973  -9.538658 -15.367167
974  -9.061114 -16.822726
975  -9.803798 -17.368350
976 -10.180575 -17.270180
977 -10.601352 -17.671543
978 -10.804909 -19.535919
979 -10.397964 -20.361419
980 -10.979640 -20.300267
981  -8.738223 -20.202669
982  -9.339929 -21.528973
983  -9.780686 -20.902152
984 -11.072655 -21.235735
985 -10.849717 -20.439201
986 -10.953247 -19.708973
987 -13.032707 -18.687553
988 -12.984567 -19.557132
989 -13.508836 -18.747584
990 -13.420713 -19.883180
991 -11.718125 -20.474092
992 -11.936512 -21.360752
993 -14.225655 -22.006776
994 -13.524940 -20.844519
995 -14.088767 -20.492952
996 -14.169056 -20.666777
997 -14.798708 -19.960555
998 -15.766568 -19.395622
999 -17.281143 -19.089793

[1000 rows x 2 columns]


>>> df['A'] = pd.Series(list(range(len(df))))
>>> print(df)
             B          C    A
0     0.089886   0.511081    0
1     1.323766   1.584758    1
2     1.489479  -0.438671    2
3     0.831331  -0.398021    3
4    -0.248233   0.494418    4
5    -0.013085   0.684518    5
6     0.666951  -1.422161    6
7     1.768838  -0.658786    7
8     2.661080   0.648505    8
9     1.951751   0.836261    9
10    3.538785   1.657475   10
11    3.254034   2.052609   11
12    4.248620   1.568401   12
13    4.077173   0.055622   13
14    3.452590  -0.200314   14
15    2.627620  -0.408829   15
16    3.690537  -0.210440   16
17    3.184924   0.365447   17
18    3.646556  -0.150044   18
19    4.164563  -0.023405   19
20    2.391447   0.517872   20
21    2.865153   0.686649   21
22    3.623183   0.663927   22
23    1.545117   0.151044   23
24    3.595924   0.903619   24
25    3.013804   1.855083   25
26    4.438801   1.014572   26
27    5.155216   0.882628   27
28    4.431457   0.741509   28
29    2.841949   0.709991   29
..         ...        ...  ...
970  -7.910646 -13.738689  970
971  -7.318091 -14.811335  971
972  -9.144376 -15.466873  972
973  -9.538658 -15.367167  973
974  -9.061114 -16.822726  974
975  -9.803798 -17.368350  975
976 -10.180575 -17.270180  976
977 -10.601352 -17.671543  977
978 -10.804909 -19.535919  978
979 -10.397964 -20.361419  979
980 -10.979640 -20.300267  980
981  -8.738223 -20.202669  981
982  -9.339929 -21.528973  982
983  -9.780686 -20.902152  983
984 -11.072655 -21.235735  984
985 -10.849717 -20.439201  985
986 -10.953247 -19.708973  986
987 -13.032707 -18.687553  987
988 -12.984567 -19.557132  988
989 -13.508836 -18.747584  989
990 -13.420713 -19.883180  990
991 -11.718125 -20.474092  991
992 -11.936512 -21.360752  992
993 -14.225655 -22.006776  993
994 -13.524940 -20.844519  994
995 -14.088767 -20.492952  995
996 -14.169056 -20.666777  996
997 -14.798708 -19.960555  997
998 -15.766568 -19.395622  998
999 -17.281143 -19.089793  999

[1000 rows x 3 columns]


>>> plt.figure()
<matplotlib.figure.Figure object at 0x000002A2A0B10F28>
>>> df.plot(x='A')
<matplotlib.axes._subplots.AxesSubplot object at 0x000002A2A12FE7F0>
>>> plt.show()
Result: (figure: line plot of columns B and C against A)



 

>>> df = pd.DataFrame(np.random.rand(10, 4), columns=['a', 'b', 'c', 'd'])
>>> print(df)
          a         b         c         d
0  0.504434  0.190875  0.001687  0.327372
1  0.406844  0.602029  0.912075  0.815889
2  0.828534  0.985910  0.094662  0.552089
3  0.198843  0.818785  0.750649  0.967054
4  0.498494  0.151378  0.417506  0.264438
5  0.655288  0.672788  0.088616  0.433270
6  0.493127  0.009254  0.179479  0.396655
7  0.419386  0.910986  0.020004  0.229063
8  0.671469  0.612189  0.374920  0.407093
9  0.414978  0.033499  0.756025  0.717849
>>> df.plot(kind='bar')
<matplotlib.axes._subplots.AxesSubplot object at 0x000002A2A17BD7B8>
>>> plt.show()
Result: (figure: grouped bar chart of columns a, b, c, and d)



 

>>> df = pd.DataFrame(np.random.rand(10, 4), columns=['a', 'b', 'c', 'd'])
>>> df.plot(kind='barh', stacked=True)
<matplotlib.axes._subplots.AxesSubplot object at 0x000002A2A3784390>
>>> plt.show()
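The sessions above display figures interactively with `plt.show()`. When no display is available (for example in a script run on a server), a possible variant is to select the non-interactive Agg backend and save the figure to a file. This is only a sketch: the seed, the axis label, and the file name `cumsum.png` are assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")            # non-interactive backend; no window is opened
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)   # fixed seed so the figure is reproducible
df = pd.DataFrame(rng.standard_normal((1000, 2)), columns=["B", "C"]).cumsum()
df["A"] = np.arange(len(df))

ax = df.plot(x="A")              # one line per remaining column (B and C)
ax.set_ylabel("cumulative sum")
ax.figure.savefig("cumsum.png")  # hypothetical output file name
plt.close(ax.figure)
```

`DataFrame.plot` returns the matplotlib Axes, so labels, limits, and titles can be adjusted on it before saving.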




 

 
 
 


