pandas working with missing data
2015-07-29 09:00
# Session assumes: import numpy as np; from numpy import nan; from numpy.random import randn;
# from pandas import DataFrame, Series, Timestamp; import matplotlib.pyplot as plt

In [1]: df = DataFrame(randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
   ...:                columns=['one', 'two', 'three'])

In [2]: df['four'] = 'bar'

In [3]: df['five'] = df['one'] > 0

In [4]: df
Out[4]:
        one       two     three four   five
a -1.420361 -0.015601 -1.150641  bar  False
c -0.798334 -0.557697  0.381353  bar  False
e  1.337122 -1.531095  1.331458  bar   True
f -0.571329 -0.026671 -1.085663  bar  False
h -1.114738 -0.058216 -0.486768  bar  False

In [5]: df2 = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

In [6]: df2
Out[6]:
        one       two     three four   five
a -1.420361 -0.015601 -1.150641  bar  False
b       NaN       NaN       NaN  NaN    NaN
c -0.798334 -0.557697  0.381353  bar  False
d       NaN       NaN       NaN  NaN    NaN
e  1.337122 -1.531095  1.331458  bar   True
f -0.571329 -0.026671 -1.085663  bar  False
g       NaN       NaN       NaN  NaN    NaN
h -1.114738 -0.058216 -0.486768  bar  False

In [11]: df2['timestamp'] = Timestamp('20120101')

In [12]: df2
Out[12]:
        one       two     three four   five  timestamp
a -1.420361 -0.015601 -1.150641  bar  False 2012-01-01
c -0.798334 -0.557697  0.381353  bar  False 2012-01-01
e  1.337122 -1.531095  1.331458  bar   True 2012-01-01
f -0.571329 -0.026671 -1.085663  bar  False 2012-01-01
h -1.114738 -0.058216 -0.486768  bar  False 2012-01-01
calculations with missing data
• When summing data, NA (missing) values will be treated as zero
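• If the data are all NA, the result will be NA (this is the behavior of the pandas version used here; since pandas 0.22 the sum of an all-NA Series is 0 unless min_count is set)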
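The two rules above can be checked with a small sketch. The Series values here are made up for illustration, and the all-NA result depends on the pandas version: the comments assume pandas 0.22 or later, where `min_count` exists.

```python
import numpy as np
import pandas as pd

# Illustrative data, not taken from the session above
s = pd.Series([1.0, np.nan, 2.0])

# By default NaN is skipped, which is equivalent to treating it as zero in a sum
total_skipping_na = s.sum()

# With skipna=False a single NaN makes the whole sum NaN
total_keeping_na = s.sum(skipna=False)

all_na = pd.Series([np.nan, np.nan])

# On pandas >= 0.22 this is 0.0; older versions returned NaN
all_na_default = all_na.sum()

# min_count=1 asks for at least one valid value, otherwise the result is NaN
all_na_min_count = all_na.sum(min_count=1)

print(total_skipping_na, total_keeping_na, all_na_default, all_na_min_count)
```

Passing `min_count=1` is one way to get the "all NA gives NA" result on current pandas versions.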
In [45]: df
Out[45]:
   one       two     three
a  NaN -0.015601 -1.150641
c  NaN -0.557697  0.381353
e  NaN  0.000000  0.000000
f  NaN  0.000000  0.000000
h  NaN -0.058216 -0.486768

In [46]: df.dropna(axis=0)
Out[46]:
Empty DataFrame
Columns: [one, two, three]
Index: []

In [47]: df.dropna(axis=1)
Out[47]:
        two     three
a -0.015601 -1.150641
c -0.557697  0.381353
e  0.000000  0.000000
f  0.000000  0.000000
h -0.058216 -0.486768
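The frame in the session above is not built in the lines shown, so here is a minimal self-contained sketch of the same `dropna` behavior. The values are rounded stand-ins chosen for illustration, not the original data.

```python
import numpy as np
import pandas as pd

# Stand-in frame: column 'one' is entirely NaN, the others are complete
df = pd.DataFrame(
    {
        "one": [np.nan, np.nan, np.nan],
        "two": [-0.02, -0.56, 0.0],
        "three": [-1.15, 0.38, 0.0],
    },
    index=["a", "c", "e"],
)

# axis=0 drops rows containing any NaN; every row has one, so nothing is left
rows_kept = df.dropna(axis=0)

# axis=1 drops columns containing any NaN; only 'one' is removed
cols_kept = df.dropna(axis=1)

print(rows_kept)
print(cols_kept)
```

Because the missing values sit in a single column, dropping along `axis=1` keeps all the rows, while dropping along `axis=0` discards everything.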
In [67]: np.random.seed(2)

Note: seed() sets the integer value from which the random number generation algorithm starts. If the same seed() value is used, the same random numbers are generated every time. If no value is set, the system picks one itself (for example from the clock), so the random numbers differ from run to run.

In [68]: ser = Series(np.arange(1, 10.1, .25)**2 + np.random.randn(37))

In [69]: bad = np.array([4, 13, 14, 15, 16, 17, 18, 20, 29])

In [70]: ser[bad] = np.nan

In [71]: methods = ['linear', 'quadratic', 'cubic']

In [72]: df = DataFrame({m: ser.interpolate(method=m) for m in methods})

In [73]: plt.figure()
Out[73]: <matplotlib.figure.Figure at 0xa8dbf22c>

In [74]: df.plot()
Out[74]: <matplotlib.axes._subplots.AxesSubplot at 0xa8da684c>
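As a minimal sketch of the two ideas in this session, the block below reseeds the generator to show that the same seed reproduces the same draws, then fills gaps in a short Series by linear interpolation. The Series values are invented for illustration, and only `method='linear'` is used because the quadratic and cubic methods additionally require SciPy.

```python
import numpy as np
import pandas as pd

# Same seed, same sequence of random numbers
np.random.seed(2)
first = np.random.randn(5)
np.random.seed(2)
second = np.random.randn(5)
same_seed_matches = np.array_equal(first, second)

# Illustrative series with gaps (not the data from the session above)
ser = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan, 6.0])

# 'linear' treats the points as equally spaced and fills along straight lines
filled = ser.interpolate(method="linear")

print(same_seed_matches)
print(filled.tolist())
```

Here the gaps between 1.0 and 4.0 and between 4.0 and 6.0 are filled with evenly spaced values.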
In [88]: d = {'a': list(range(4)), 'b': list('ab..'), 'c': ['a', 'b', nan, 'd']}

In [89]: df = DataFrame(d)

In [90]: df.replace('.', nan)
Out[90]:
   a    b    c
0  0    a    a
1  1    b    b
2  2  NaN  NaN
3  3  NaN    d

In [91]: df.replace(r'\s*\.\s*', nan, regex=True)
Out[91]:
   a    b    c
0  0    a    a
1  1    b    b
2  2  NaN  NaN
3  3  NaN    d

In [92]: df.replace(['a', '.'], ['b', nan])
Out[92]:
   a    b    c
0  0    b    b
1  1    b    b
2  2  NaN  NaN
3  3  NaN    d

In [93]: df.replace([r'\.', r'(a)'], ['dot', r'\1stuff'], regex=True)
Out[93]:
   a       b       c
0  0  astuff  astuff
1  1       b       b
2  2     dot     NaN
3  3     dot       d

In [99]: df.replace([r'\s*\.\s*', r'a|b'], nan, regex=True)
Out[99]:
   a   b    c
0  0 NaN  NaN
1  1 NaN  NaN
2  2 NaN  NaN
3  3 NaN    d

In [100]: df.replace(regex=[r'\s*\.\s*', r'a|b'], value=nan)
Out[100]:
   a   b    c
0  0 NaN  NaN
1  1 NaN  NaN
2  2 NaN  NaN
3  3 NaN    d
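A point worth spelling out for regex replacements: a backreference such as `\1` only reaches the regex engine if the replacement is written as a raw string. The sketch below rebuilds the same small frame with explicit imports and shows both a pattern-to-NaN replacement and a backreference. The variable names are chosen here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"a": list(range(4)), "b": list("ab.."), "c": ["a", "b", np.nan, "d"]}
)

# Replace a literal dot (with optional surrounding whitespace) by NaN
dots_to_nan = df.replace(r"\s*\.\s*", np.nan, regex=True)

# Raw string: \1 refers to the captured group, so 'a' becomes 'astuff'
with_backref = df.replace(r"(a)", r"\1stuff", regex=True)

# Without the r prefix, Python turns "\1" into the control character \x01
non_raw_is_control_char = "\1stuff" == "\x01stuff"

print(dots_to_nan)
print(with_backref)
print(non_raw_is_control_char)
```

The integer column `a` is left untouched in both cases, since the patterns only apply to string values.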