Python Pandas Basics 1
2014-07-23 17:20
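This Python series consists entirely of notes taken from the book "Python for Data Analysis".
1 Introduction
Pandas is a data analysis package for Python. Its main data structures are:
Series: a one-dimensional array, similar to a one-dimensional array in NumPy. Both resemble Python's built-in list. The difference is that a list may hold elements of different types, whereas an array or a Series normally stores elements of a single data type, which uses memory more efficiently and speeds up computation.
Time-Series: a Series indexed by time.
DataFrame: a two-dimensional, tabular data structure. Much of its functionality is similar to data.frame in R. A DataFrame can be thought of as a container of Series.
Panel: a three-dimensional array, which can be thought of as a container of DataFrames. (Panel was removed from pandas in version 0.25.)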
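The point about data types can be checked directly. The following is a minimal sketch, assuming a current pandas installation; the variable names are illustrative only. It shows that a Series built from same-typed values gets one compact dtype, while mixed values are still accepted but fall back to the generic object dtype.

```python
import pandas as pd

# A plain Python list may mix types freely.
mixed_list = [1, "two", 3.0]

# A Series built from integers gets a single integer dtype (typically int64).
ints = pd.Series([4, 7, -5, 3])
print(ints.dtype)

# Mixed values are accepted, but pandas stores them with the generic
# "object" dtype, which gives up the memory and speed benefits.
mixed = pd.Series(mixed_list)
print(mixed.dtype)
```

So "a single data type" is best read as the efficient case rather than a hard restriction.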
1.1 Series
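(1) Create a simple Series directly from an array. The examples in this post assume "from pandas import Series, DataFrame", "import pandas as pd", and "import numpy as np".
In [4]: obj = Series([4, 7, -5, 3])
In [5]: obj
Out[5]:
0    4
1    7
2   -5
3    3
Unlike a plain array, a Series carries an index column alongside its values.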
(2) Get the index and the values of a Series
In [6]: obj.values
Out[6]: array([ 4, 7, -5, 3])
In [7]: obj.index
Out[7]: Int64Index([0, 1, 2, 3])
(3) Specify the index while creating the Series
In [8]: obj2 = Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
In [9]: obj2
Out[9]:
d    4
b    7
a   -5
c    3
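(4) Operate on the elements of a Series. Note the difference from a NumPy array: elements are selected and assigned by index label, and "in" tests the index labels.
In [11]: obj2['a']
Out[11]: -5
In [12]: obj2['d'] = 6
In [13]: obj2[['c', 'a', 'd']]
Out[13]:
c    3
a   -5
d    6
In [18]: 'b' in obj2
Out[18]: True
In [19]: 'e' in obj2
Out[19]: False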
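As a small sketch of what label-based access implies (assuming a current pandas version and the same obj2 values as in step (3)), filtering and arithmetic keep each value attached to its label, and "in" looks at labels rather than values:

```python
import pandas as pd

obj2 = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])

# Boolean filtering keeps the label attached to each surviving value.
positive = obj2[obj2 > 0]

# Element-wise arithmetic also preserves the index.
doubled = obj2 * 2

print(positive)
print(doubled["a"])

# `in` checks the index labels (like dict keys), not the stored values.
print("b" in obj2, 7 in obj2)
```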
(5) Convert a Python dict into a Series
In [20]: sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
In [21]: obj3 = Series(sdata)
In [22]: obj3
Out[22]:
Ohio      35000
Oregon    16000
Texas     71000
Utah       5000
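(6) An important feature of Series arithmetic is that values are automatically aligned by index label before the operation is performed. Below, obj4 has the labels California, Ohio, Oregon, and Texas, with NaN for California; labels that lack a value on either side produce NaN in the result.
In [29]: obj3
Out[29]:
Ohio      35000
Oregon    16000
Texas     71000
Utah       5000
In [30]: obj4
Out[30]:
California      NaN
Ohio          35000
Oregon        16000
Texas         71000
In [31]: obj3 + obj4
Out[31]:
California       NaN
Ohio           70000
Oregon         32000
Texas         142000
Utah             NaN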
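The alignment behaviour can be reproduced with the sketch below. The construction of obj4 is an assumption (the post does not show it); here it is rebuilt from sdata with a different index so that it matches the output shown above. The add method with fill_value is one way to treat a label missing on one side as 0 instead of producing NaN.

```python
import pandas as pd

sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
obj3 = pd.Series(sdata)

# Assumed construction of obj4: same data, different index.
# "California" is not in sdata, so its value is NaN; "Utah" is left out.
obj4 = pd.Series(sdata, index=["California", "Ohio", "Oregon", "Texas"])

# Plain addition: a label missing on either side yields NaN.
total = obj3 + obj4

# add() with fill_value: a value missing on one side is treated as 0.
# A label missing on both sides (California) still yields NaN.
filled = obj3.add(obj4, fill_value=0)

print(total)
print(filled)
```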
2 DataFrame
(1) Create a DataFrame. The most common way is from a dict of equal-length lists or arrays:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = DataFrame(data)
The resulting DataFrame is given an index automatically:
In [38]: frame
Out[38]:
   pop   state  year
0  1.5    Ohio  2000
1  1.7    Ohio  2001
2  3.6    Ohio  2002
3  2.4  Nevada  2001
4  2.9  Nevada  2002
(2) Specify the columns of a DataFrame and their order:
In [39]: DataFrame(data, columns=['year', 'state', 'pop'])
Out[39]:
   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9
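The later examples use a frame2 whose definition is not shown in this post. A plausible construction, offered as an assumption that matches the printed output rather than as the author's exact code, passes both columns and index. A column name that is absent from the data (here "debt") is created and filled with NaN.

```python
import pandas as pd

data = {
    "state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada"],
    "year": [2000, 2001, 2002, 2001, 2002],
    "pop": [1.5, 1.7, 3.6, 2.4, 2.9],
}

# Assumed definition of frame2: explicit column order plus row labels.
frame2 = pd.DataFrame(
    data,
    columns=["year", "state", "pop", "debt"],  # "debt" is not a key of data
    index=["one", "two", "three", "four", "five"],
)

print(frame2.columns.tolist())
# The column that had no source data exists and is entirely NaN.
print(frame2["debt"].isna().all())
```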
(3) Add a column to a DataFrame by plain assignment, which is very blunt. A scalar is broadcast to every row, while an array must match the length of the DataFrame. Here frame2 holds the same data with the row labels one to five.
In [46]: frame2['debt'] = 16.5
In [47]: frame2
Out[47]:
       year   state  pop  debt
one    2000    Ohio  1.5  16.5
two    2001    Ohio  1.7  16.5
three  2002    Ohio  3.6  16.5
four   2001  Nevada  2.4  16.5
five   2002  Nevada  2.9  16.5
In [48]: frame2['debt'] = np.arange(5.)
In [49]: frame2
Out[49]:
       year   state  pop  debt
one    2000    Ohio  1.5     0
two    2001    Ohio  1.7     1
three  2002    Ohio  3.6     2
four   2001  Nevada  2.4     3
five   2002  Nevada  2.9     4
If the new column is a Series that does not cover every row of the DataFrame, it is aligned on the index and the missing rows are filled with NaN:
In [50]: val = Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
In [51]: frame2['debt'] = val
In [52]: frame2
Out[52]:
       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2002  Nevada  2.9  -1.7
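The NaN filling applies to a Series because a Series has labels to align on. As a hedged sketch (using a smaller, illustrative frame2 rather than the one above), a plain list of the wrong length is rejected instead of being padded:

```python
import pandas as pd

# Small illustrative frame; not the same frame2 as in the post.
frame2 = pd.DataFrame(
    {"year": [2000, 2001, 2002], "pop": [1.5, 1.7, 3.6]},
    index=["one", "two", "three"],
)

# A Series is aligned on index labels; rows without a match get NaN.
frame2["debt"] = pd.Series([-1.2, -1.7], index=["two", "three"])

# A plain list has no labels, so its length must match the frame exactly.
error_message = None
try:
    frame2["bad"] = [1.0, 2.0]
except ValueError as exc:
    error_message = str(exc)

print(frame2)
print(error_message)
```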
(4) Delete a column with the "del" keyword
In [53]: frame2['eastern'] = frame2.state == 'Ohio'
In [54]: frame2
Out[54]:
       year   state  pop  debt eastern
one    2000    Ohio  1.5   NaN    True
two    2001    Ohio  1.7  -1.2    True
three  2002    Ohio  3.6   NaN    True
four   2001  Nevada  2.4  -1.5   False
five   2002  Nevada  2.9  -1.7   False
In [55]: del frame2['eastern']
In [56]: frame2.columns
Out[56]: Index([year, state, pop, debt], dtype=object)
(5) Index objects
Any array or other sequence of labels passed when constructing a Series or DataFrame is converted internally into an Index.
In [68]: obj = Series(range(3), index=['a', 'b', 'c'])
In [69]: index = obj.index
In [70]: index
Out[70]: Index([a, b, c], dtype=object)
The Index of a Series or DataFrame is immutable: its elements cannot be modified in place, so the following raises an error:
In [72]: index[1] = 'd'
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-72-676fdeb26a68> in <module>()
----> 1 index[1] = 'd'
However, a whole new Index can be supplied (method 1):
In [73]: index = pd.Index(np.arange(3))
In [74]: obj2 = Series([1.5, -2.5, 0], index=index)
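The difference between mutating an Index and replacing it can be sketched as follows. This assumes a current pandas version, where item assignment on an Index raises TypeError; the labels "x", "y", "z" and "why" are made up for illustration.

```python
import pandas as pd

obj = pd.Series(range(3), index=["a", "b", "c"])

# In-place item assignment on an Index is rejected.
try:
    obj.index[1] = "d"
except TypeError:
    mutated = False
else:
    mutated = True

# Replacing the whole index at once is allowed; the values are untouched.
obj.index = ["x", "y", "z"]

# rename() returns a new Series with selected labels changed.
renamed = obj.rename({"y": "why"})

print(mutated)
print(obj)
print(renamed)
```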
Using "reindex" (method 2), which returns a new object conformed to the new index; labels that were not present before get NaN:
In [79]: obj = Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
In [80]: obj
Out[80]:
d    4.5
b    7.2
a   -5.3
c    3.6
In [81]: obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
In [82]: obj2
Out[82]:
a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
Or, for the columns of a DataFrame (frame here has the rows a, c, d, and its columns include Texas and California but not Utah):
In [90]: states = ['Texas', 'Utah', 'California']
In [91]: frame.reindex(columns=states)
Out[91]:
   Texas  Utah  California
a      1   NaN           2
c      4   NaN           5
d      7   NaN           8
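reindex also accepts options that control how new labels are filled. The sketch below assumes a current pandas version; the colors Series is an illustrative example, not data from this post. It shows fill_value for a constant and method="ffill" for carrying the previous value forward.

```python
import pandas as pd

obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=["d", "b", "a", "c"])

# New label "e" gets 0 instead of NaN.
filled = obj.reindex(["a", "b", "c", "d", "e"], fill_value=0)

# Forward fill: each new label takes the value of the closest earlier label.
# This requires the original index to be sorted.
colors = pd.Series(["blue", "purple", "yellow"], index=[0, 2, 4])
ffilled = colors.reindex(range(6), method="ffill")

print(filled)
print(ffilled)
```

In both cases the original objects are left unchanged, because reindex returns a new object.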
(6) An Index behaves somewhat like a set, so membership can be tested with "in"
In [76]: frame3
Out[76]:
state  Nevada  Ohio
year
2000      NaN   1.5
2001      2.4   1.7
2002      2.9   3.6
In [77]: 'Ohio' in frame3.columns
Out[77]: True
In [78]: 2003 in frame3.index
Out[78]: False
(7) Use the "drop" method to remove entries along one axis. Note that drop returns a new object and leaves the original data untouched.
Using drop on a Series:
In [94]: obj = Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
In [95]: new_obj = obj.drop('c')
In [96]: new_obj
Out[96]:
a    0
b    1
d    3
e    4
In [97]: obj.drop(['d', 'c'])
Out[97]:
a    0
b    1
e    4
Using drop on a DataFrame:
In [98]: data = DataFrame(np.arange(16).reshape((4, 4)), index=['Ohio', 'Colorado', 'Utah', 'New York'], columns=['one', 'two', 'three', 'four'])
In [99]: data.drop(['Colorado', 'Ohio'])
Out[99]:
          one  two  three  four
Utah        8    9     10    11
New York   12   13     14    15
In [100]: data.drop('two', axis=1)
Out[100]:
          one  three  four
Ohio        0      2     3
Colorado    4      6     7
Utah        8     10    11
New York   12     14    15
In [101]: data.drop(['two', 'four'], axis=1)
Out[101]:
          one  three
Ohio        0      2
Colorado    4      6
Utah        8     10
New York   12     14
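To make the "returns a new object" point concrete, here is a minimal sketch assuming a reasonably recent pandas version, where drop also accepts the index and columns keywords as an alternative to axis:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=["Ohio", "Colorado", "Utah", "New York"],
    columns=["one", "two", "three", "four"],
)

# Drop two rows and one column in a single call.
trimmed = data.drop(index=["Colorado", "Ohio"], columns=["two"])

print(trimmed)
# The original frame still has all four rows and four columns.
print(data.shape)
```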
(8) Indexing, selection, and filtering
These work much as they do for arrays. Examples follow.
(Series omitted.) For a DataFrame:
In [112]: data = DataFrame(np.arange(16).reshape((4, 4)), index=['Ohio', 'Colorado', 'Utah', 'New York'], columns=['one', 'two', 'three', 'four'])
In [113]: data
Out[113]:
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
In [115]: data[['three', 'one']]
Out[115]:
          three  one
Ohio          2    0
Colorado      6    4
Utah         10    8
New York     14   12
In [118]: data < 5
Out[118]:
            one    two  three   four
Ohio       True   True   True   True
Colorado   True  False  False  False
Utah      False  False  False  False
New York  False  False  False  False
In [119]: data[data < 5] = 0
In [120]: data
Out[120]:
          one  two  three  four
Ohio        0    0      0     0
Colorado    0    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
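Two related forms of boolean selection are worth contrasting. This is a sketch under the assumption of a current pandas version: a boolean Series filters rows, while where builds a modified copy and leaves the original frame untouched, unlike the in-place assignment above.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=["Ohio", "Colorado", "Utah", "New York"],
    columns=["one", "two", "three", "four"],
)

# A boolean Series selects the rows where the condition holds.
rows = data[data["three"] > 5]

# where() keeps values where the condition is True and substitutes 0
# elsewhere, returning a new DataFrame.
capped = data.where(data >= 5, 0)

print(rows)
print(capped)
```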
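Using the powerful "ix" indexer, which accepts both labels and integer positions (ix was deprecated in pandas 0.20 and removed in 1.0; current versions use loc and iloc instead):
In [122]: data.ix[['Colorado', 'Utah'], [3, 0, 1]]
Out[122]:
          four  one  two
Colorado     7    0    5
Utah        11    8    9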
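Since ix is not available in current pandas, the same selection can be written with loc (labels) and iloc (integer positions). The sketch below is one possible translation, not the only one; it first repeats the data[data < 5] = 0 step so that the values match the output above.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=["Ohio", "Colorado", "Utah", "New York"],
    columns=["one", "two", "three", "four"],
)
data[data < 5] = 0

# Pure label-based selection: rows and columns both given by name.
by_label = data.loc[["Colorado", "Utah"], ["four", "one", "two"]]

# Rows by label, then columns by integer position, as ix allowed in one step.
by_mixed = data.loc[["Colorado", "Utah"]].iloc[:, [3, 0, 1]]

print(by_label)
print(by_label.equals(by_mixed))
```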
Notes on the indexing options:
Type                            Notes
obj[val]                        Select a single column or a sequence of columns from the DataFrame. Special-case conveniences: boolean array (filter rows), slice (slice rows), or boolean DataFrame (set values based on some criterion).
obj.ix[val]                     Select a single row or a subset of rows from the DataFrame.
obj.ix[:, val]                  Select a single column or a subset of columns.
obj.ix[val1, val2]              Select both rows and columns.
reindex method                  Conform one or more axes to new indexes.
xs method                       Select a single row or column as a Series by label.
icol, irow methods              Select a single column or row, respectively, as a Series by integer location.
get_value, set_value methods    Select a single value by row and column label.
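Several entries in this table come from older pandas releases. As a hedged sketch of how they might be written today (icol, irow, get_value and set_value are no longer available in current versions, while xs still exists), iloc covers integer-position access and at covers single-value access by label:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=["Ohio", "Colorado", "Utah", "New York"],
    columns=["one", "two", "three", "four"],
)

row_by_label = data.xs("Utah")      # single row by label, as a Series
col_by_pos = data.iloc[:, 1]        # replaces icol(1): second column
row_by_pos = data.iloc[2]           # replaces irow(2): third row

value = data.at["Utah", "two"]      # replaces get_value("Utah", "two")
data.at["Utah", "two"] = 90         # replaces set_value("Utah", "two", 90)

print(row_by_pos.name, col_by_pos.name, value)
print(data.at["Utah", "two"])
```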