
pandas working with missing data

2015-07-29 09:00
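# Imports assumed by the sessions in this post (not shown in the original listing):
# import numpy as np; import matplotlib.pyplot as plt
# from numpy import nan; from numpy.random import randn
# from pandas import DataFrame, Series, Timestamp
In [1]: df = DataFrame(randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
...: columns=['one', 'two', 'three'])
In [2]: df['four'] = 'bar'
In [3]: df['five'] = df['one'] > 0
In [4]: df
Out[4]:
        one       two     three four   five
a -1.420361 -0.015601 -1.150641  bar  False
c -0.798334 -0.557697  0.381353  bar  False
e  1.337122 -1.531095  1.331458  bar   True
f -0.571329 -0.026671 -1.085663  bar  False
h -1.114738 -0.058216 -0.486768  bar  False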
In [5]: df2 = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
In [6]: df2
Out[6]:
        one       two     three four   five
a -1.420361 -0.015601 -1.150641  bar  False
b       NaN       NaN       NaN  NaN    NaN
c -0.798334 -0.557697  0.381353  bar  False
d       NaN       NaN       NaN  NaN    NaN
e  1.337122 -1.531095  1.331458  bar   True
f -0.571329 -0.026671 -1.085663  bar  False
g       NaN       NaN       NaN  NaN    NaN
h -1.114738 -0.058216 -0.486768  bar  False
In [10]: df2 = df.copy()  # assumed from the five-row output below: df2 is a fresh copy of df, not the reindexed frame
In [11]: df2['timestamp'] = Timestamp('20120101')
In [12]: df2
Out[12]:
        one       two     three four   five  timestamp
a -1.420361 -0.015601 -1.150641  bar  False 2012-01-01
c -0.798334 -0.557697  0.381353  bar  False 2012-01-01
e  1.337122 -1.531095  1.331458  bar   True 2012-01-01
f -0.571329 -0.026671 -1.085663  bar  False 2012-01-01
h -1.114738 -0.058216 -0.486768  bar  False 2012-01-01
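
The reindexing step above can be checked programmatically. The following is a minimal sketch, assuming a current pandas and NumPy install; the random values differ from the listing, and the names `missing_rows` and `na_counts` are introduced here only for illustration. It shows that labels absent from the original index come back as all-NaN rows, and that the boolean column can no longer stay `bool` once it holds missing values.

```python
import numpy as np
import pandas as pd

# Rebuild a frame shaped like the one in the session (values are random).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((5, 3)),
                  index=list("acefh"),
                  columns=["one", "two", "three"])
df["four"] = "bar"
df["five"] = df["one"] > 0

# Reindexing onto labels that do not exist yet introduces missing rows.
df2 = df.reindex(list("abcdefgh"))

# Rows that are missing in every column are the newly introduced labels.
missing_rows = df2.index[df2.isna().all(axis=1)].tolist()
# Number of missing cells per column.
na_counts = df2.isna().sum()

print(missing_rows)   # ['b', 'd', 'g']
print(na_counts)
# The bool column is upcast because bool cannot represent a missing value.
print(df["five"].dtype, "->", df2["five"].dtype)
```

`isna()` and `notna()` are the usual way to locate missing cells; comparing against `np.nan` with `==` does not work, because NaN is not equal to itself.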


calculations with missing data

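• When summing data, NA (missing) values are treated as zero.

• If the data are all NA, the result is NA. This was the behaviour of pandas when this post was written; since pandas 0.22 the sum of an all-NA Series is 0 unless min_count is passed.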
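
The two rules above can be sketched as follows, assuming pandas 0.22 or later. The variable names are illustrative only. `min_count=1` asks for at least one non-missing value, which restores the "all NA gives NA" result described in the second bullet.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 2.0])

# Missing values are skipped, which is equivalent to treating them as zero.
total = s.sum()                      # 3.0

# With skipna=False a single NaN makes the whole sum NaN.
strict_total = s.sum(skipna=False)   # nan

all_na = pd.Series([np.nan, np.nan])

# Version dependent: 0.0 on pandas >= 0.22, NaN on the older releases
# that the bullet list above describes.
default_sum = all_na.sum()

# Requiring at least one valid value yields NaN for all-NA data.
na_sum = all_na.sum(min_count=1)     # nan

print(total, strict_total, default_sum, na_sum)
```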

In [45]: df
Out[45]:
   one       two     three
a  NaN -0.015601 -1.150641
c  NaN -0.557697  0.381353
e  NaN  0.000000  0.000000
f  NaN  0.000000  0.000000
h  NaN -0.058216 -0.486768
In [46]: df.dropna(axis=0)
Out[46]:
Empty DataFrame
Columns: [one, two, three]
Index: []
In [47]: df.dropna(axis=1)
Out[47]:
        two     three
a -0.015601 -1.150641
c -0.557697  0.381353
e  0.000000  0.000000
f  0.000000  0.000000
h -0.058216 -0.486768
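
Dropping every row that contains any NA is often too aggressive, as the empty result above shows. The sketch below rebuilds the same frame by hand and tries the `how`, `thresh` and `subset` parameters of `dropna`; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "one": [np.nan] * 5,
        "two": [-0.015601, -0.557697, 0.0, 0.0, -0.058216],
        "three": [-1.150641, 0.381353, 0.0, 0.0, -0.486768],
    },
    index=list("acefh"),
)

# Default how="any": every row has a NaN in column "one", so nothing survives.
rows_any = df.dropna(axis=0)

# Dropping columns instead removes only the all-NaN column "one".
cols_any = df.dropna(axis=1)

# how="all" drops a row only if every value in it is missing.
rows_all = df.dropna(how="all")

# thresh=2 keeps rows with at least two non-missing values.
rows_thresh = df.dropna(thresh=2)

# subset restricts the check to the listed columns.
rows_subset = df.dropna(subset=["two", "three"])

print(len(rows_any), list(cols_any.columns), len(rows_all),
      len(rows_thresh), len(rows_subset))
```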


In [67]: np.random.seed(2)
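seed() sets the integer that initializes the random number generator.
Calling it with the same value produces the same random numbers on
every run. If it is not set, the system picks a starting value itself
(for example from the clock), so each run produces different numbers.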
In [68]: ser = Series(np.arange(1, 10.1, .25)**2 + np.random.randn(37))
In [69]: bad = np.array([4, 13, 14, 15, 16, 17, 18, 20, 29])
In [70]: ser[bad] = np.nan
In [71]: methods = ['linear', 'quadratic', 'cubic']
In [72]: df = DataFrame({m: ser.interpolate(method=m) for m in methods})
In [73]: plt.figure()
Out[73]: <matplotlib.figure.Figure at 0xa8dbf22c>
In [74]: df.plot()
Out[74]: <matplotlib.axes._subplots.AxesSubplot at 0xa8da684c>
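
A smaller example makes the seeding and interpolation steps above easier to verify by hand. This is a sketch under stated assumptions: it uses only `method='linear'`, because the `'quadratic'` and `'cubic'` methods in the session additionally require SciPy to be installed, and the short series and variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Re-seeding with the same value reproduces the same draws.
np.random.seed(2)
first = np.random.randn(3)
np.random.seed(2)
second = np.random.randn(3)
same_draws = np.array_equal(first, second)   # True

# A short series with one two-element gap and one single gap.
ser = pd.Series([0.0, np.nan, np.nan, 3.0, np.nan, 5.0])

# Linear interpolation fills each gap on a straight line between neighbours.
linear = ser.interpolate(method="linear")

# limit=1 fills at most one consecutive NaN per gap, going forward.
limited = ser.interpolate(method="linear", limit=1)

print(same_draws)
print(linear.tolist())    # [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
print(limited.tolist())   # [0.0, 1.0, nan, 3.0, 4.0, 5.0]
```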


In [88]: d = {'a': list(range(4)), 'b': list('ab..'), 'c': ['a', 'b', nan, 'd']}
In [89]: df = DataFrame(d)
In [90]: df.replace('.', nan)
Out[90]:
   a    b    c
0  0    a    a
1  1    b    b
2  2  NaN  NaN
3  3  NaN    d
In [91]: df.replace(r'\s*\.\s*', nan, regex=True)
Out[91]:
   a    b    c
0  0    a    a
1  1    b    b
2  2  NaN  NaN
3  3  NaN    d
In [92]: df.replace(['a', '.'], ['b', nan])
Out[92]:
   a    b    c
0  0    b    b
1  1    b    b
2  2  NaN  NaN
3  3  NaN    d
In [93]: df.replace([r'\.', r'(a)'], ['dot', r'\1stuff'], regex=True)  # raw string so \1 is a back-reference, not the control character \x01
Out[93]:
   a       b       c
0  0  astuff  astuff
1  1       b       b
2  2     dot     NaN
3  3     dot       d
In [99]: df.replace([r'\s*\.\s*', r'a|b'], nan, regex=True)
Out[99]:
   a    b    c
0  0  NaN  NaN
1  1  NaN  NaN
2  2  NaN  NaN
3  3  NaN    d
In [100]: df.replace(regex=[r'\s*\.\s*', r'a|b'], value=nan)
Out[100]:
   a    b    c
0  0  NaN  NaN
1  1  NaN  NaN
2  2  NaN  NaN
3  3  NaN    d
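
The replace calls above can be collected into one self-contained script. This is a minimal sketch, assuming a current pandas; the nested-dict rule and the `'missing'` placeholder are hypothetical additions for illustration, not part of the session. Writing the replacement as a raw string keeps `\1` as a regex back-reference.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": list(range(4)),
                   "b": list("ab.."),
                   "c": ["a", "b", np.nan, "d"]})

# Literal match: only cells exactly equal to "." become NaN.
literal = df.replace(".", np.nan)

# Regex lists are applied pairwise; r"\1stuff" re-inserts the captured "a".
tagged = df.replace([r"\.", r"(a)"], ["dot", r"\1stuff"], regex=True)

# A nested dict limits a rule to one column (hypothetical rule set).
per_column = df.replace({"b": {".": "missing"}})

print(literal)
print(tagged)
print(per_column)
```

Column `a` holds integers, so none of the string patterns touch it.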