Python | A Simple Data Analysis
2017-02-09 18:08
Python and R are the two main tools for data analysis. This post walks through a case I worked on while learning Python.
Step 1: Set the working directory
# encoding: utf8
import os
os.chdir("G:\\Anaconda3\\Scripts\\lecture01\\Feature_engineering_and_model_tuning\\Feature-engineering_and_Parameter_Tuning_XGBoost")
Step 2: Load the packages
import pandas as pd
import numpy as np
# IPython/Jupyter magic; it only works inside a notebook session
%matplotlib inline
Step 3: Load the data
# Load the data
train = pd.read_csv('Train.csv', encoding="ISO-8859-1")
test = pd.read_csv('Test.csv', encoding="ISO-8859-1")
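The post does not say why `encoding="ISO-8859-1"` is passed. Presumably some text fields contain bytes that are not valid UTF-8. Here is a minimal sketch of that situation; the employer name and the in-memory file are made up for illustration.

```python
import io
import pandas as pd

# A tiny in-memory "CSV file" whose bytes are ISO-8859-1 encoded (hypothetical content)
raw = "ID,Employer_Name\n1,Café Ltd\n".encode("ISO-8859-1")

# Telling pandas the right encoding recovers the accented character
df = pd.read_csv(io.BytesIO(raw), encoding="ISO-8859-1")
print(df)

# The same bytes are not valid UTF-8, which is what a UTF-8 read would trip over
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print("decodes as UTF-8:", utf8_ok)
```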
Step 4: Inspect the data
Dimensions
train.shape, test.shape
((87020, 26), (37717, 24))
Data types
# Look at the basic layout of the data
train.dtypes
ID object
Gender object
City object
Monthly_Income int64
DOB object
Lead_Creation_Date object
Loan_Amount_Applied float64
Loan_Tenure_Applied float64
Existing_EMI float64
Employer_Name object
Salary_Account object
Mobile_Verified object
Var5 int64
Var1 object
Loan_Amount_Submitted float64
Loan_Tenure_Submitted float64
Interest_Rate float64
Processing_Fee float64
EMI_Loan_Submitted float64
Filled_Form object
Device_Type object
Var2 object
Source object
Var4 int64
LoggedIn int64
Disbursed int64
dtype: object
Preview the data
# Look at the first 5 rows
train.head(5)
Merge the data
# Combine train and test into a single data set, tagging each row with its origin
train['source'] = 'train'
test['source'] = 'test'
data = pd.concat([train, test], ignore_index=True)
data.shape
(124737, 27)
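The `source` column exists so that the combined frame can be split apart again after the shared preprocessing. Below is a minimal sketch on toy frames (the values are made up; only the column names come from the data set). It also shows why a column that exists only in train ends up with one missing value per test row after the concatenation.

```python
import pandas as pd

# Toy stand-ins for Train.csv and Test.csv (made-up values)
train = pd.DataFrame({"ID": ["a", "b", "c"],
                      "Monthly_Income": [100, 200, 300],
                      "Disbursed": [0, 1, 0]})
test = pd.DataFrame({"ID": ["d", "e"],
                     "Monthly_Income": [150, 250]})

train["source"] = "train"
test["source"] = "test"
data = pd.concat([train, test], ignore_index=True)

# Disbursed is absent from test, so the test rows get NaN in that column
n_missing_target = int(data["Disbursed"].isnull().sum())

# After preprocessing, split back using the flag and drop the helper columns
train_again = data.loc[data["source"] == "train"].drop(columns="source")
test_again = data.loc[data["source"] == "test"].drop(columns=["source", "Disbursed"])

print(data.shape, n_missing_target)
```

This matches the missing-value table in the post, where `Disbursed` and `LoggedIn` each show 37717 missing values, the number of test rows.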
Check for anomalies
Missing values
# Count the missing values in each column
data.isnull().sum()
City 1401
DOB 0
Device_Type 0
Disbursed 37717
EMI_Loan_Submitted 84901
Employer_Name 113
Existing_EMI 111
Filled_Form 0
Gender 0
ID 0
Interest_Rate 84901
Lead_Creation_Date 0
Loan_Amount_Applied 111
Loan_Amount_Submitted 49535
Loan_Tenure_Applied 111
Loan_Tenure_Submitted 49535
LoggedIn 37717
Mobile_Verified 0
Monthly_Income 0
Processing_Fee 85346
Salary_Account 16801
Source 0
Var1 0
Var2 0
Var4 0
Var5 0
source 0
dtype: int64
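Raw counts are easier to judge next to the share of rows they represent. The sketch below builds that summary on a toy frame (made-up values; the `summary` table and its column names are my own choice for illustration, not part of the original post).

```python
import numpy as np
import pandas as pd

# Toy frame with a few gaps (made-up values)
df = pd.DataFrame({
    "Existing_EMI": [0.0, np.nan, 500.0, 0.0],
    "Processing_Fee": [np.nan, np.nan, np.nan, 1000.0],
    "Gender": ["Male", "Female", "Male", "Female"],
})

null_counts = df.isnull().sum()   # number of missing values per column
null_share = df.isnull().mean()   # fraction of rows that are missing per column

summary = pd.DataFrame({"n_missing": null_counts, "share_missing": null_share})
summary = summary.sort_values("n_missing", ascending=False)
print(summary)
```

A column that is missing in only a small fraction of rows is a reasonable candidate for simple imputation, while one that is mostly empty may call for a different treatment.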
- Distinct values in each column
var = ['Gender', 'Salary_Account', 'Mobile_Verified', 'Var1', 'Filled_Form', 'Device_Type', 'Var2', 'Source']
for v in var:
    print('\nDistinct values and counts for column %s\n' % v)
    print(data[v].value_counts())
Distinct values and counts for column Gender
Male 71398
Female 53339
Name: Gender, dtype: int64
Distinct values and counts for column Salary_Account
HDFC Bank 25180
ICICI Bank 19547
State Bank of India 17110
Axis Bank 12590
Citibank 3398
Kotak Bank 2955
IDBI Bank 2213
Punjab National Bank 1747
Bank of India 1713
Bank of Baroda 1675
Standard Chartered Bank 1434
Canara Bank 1385
Union Bank of India 1330
Yes Bank 1120
ING Vysya 996
Corporation bank 948
Indian Overseas Bank 901
State Bank of Hyderabad 854
Indian Bank 773
Oriental Bank of Commerce 761
IndusInd Bank 711
Andhra Bank 706
Central Bank of India 648
Syndicate Bank 614
Bank of Maharasthra 576
HSBC 474
State Bank of Bikaner & Jaipur 448
Karur Vysya Bank 435
State Bank of Mysore 385
Federal Bank 377
Vijaya Bank 354
Allahabad Bank 345
UCO Bank 344
State Bank of Travancore 333
Karnataka Bank 279
United Bank of India 276
Dena Bank 268
Saraswat Bank 265
State Bank of Patiala 263
South Indian Bank 223
Deutsche Bank 176
Abhyuday Co-op Bank Ltd 161
The Ratnakar Bank Ltd 113
Tamil Nadu Mercantile Bank 103
Punjab & Sind bank 84
J&K Bank 78
Lakshmi Vilas bank 69
Dhanalakshmi Bank Ltd 66
State Bank of Indore 32
Catholic Syrian Bank 27
India Bulls 21
B N P Paribas 15
Firstrand Bank Limited 11
GIC Housing Finance Ltd 10
Bank of Rajasthan 8
Kerala Gramin Bank 4
Industrial And Commercial Bank Of China Limited 3
Ahmedabad Mercantile Cooperative Bank 1
Name: Salary_Account, dtype: int64
Distinct values and counts for column Mobile_Verified
Y 80928
N 43809
Name: Mobile_Verified, dtype: int64
Distinct values and counts for column Var1
HBXX 84901
HBXC 12952
HBXB 6502
HAXA 4214
HBXA 3042
HAXB 2879
HBXD 2818
HAXC 2171
HBXH 1387
HCXF 990
HAYT 710
HAVC 570
HAXM 386
HCXD 348
HCYS 318
HVYS 252
HAZD 161
HCXG 114
HAXF 22
Name: Var1, dtype: int64
Distinct values and counts for column Filled_Form
N 96740
Y 27997
Name: Filled_Form, dtype: int64
Distinct values and counts for column Device_Type
Web-browser 92105
Mobile 32632
Name: Device_Type, dtype: int64
Distinct values and counts for column Var2
B 53481
G 47338
C 20366
E 1855
D 918
F 770
A 9
Name: Var2, dtype: int64
Distinct values and counts for column Source
S122 55249
S133 42900
S159 7999
S143 6140
S127 2804
S137 2450
S134 1900
S161 1109
S151 1018
S157 929
S153 705
S144 447
S156 432
S158 294
S123 112
S141 83
S162 60
S124 43
S150 19
S160 11
S136 5
S138 5
S155 5
S139 4
S129 4
S135 2
S142 1
S140 1
S154 1
S125 1
S130 1
S126 1
S132 1
S131 1
Name: Source, dtype: int64
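Listings this long are hard to scan. A compact alternative is to count the distinct values per column and see how dominant the most frequent value is. This is a minimal sketch on a toy frame (made-up rows; `top_share` is a name I introduce here for illustration).

```python
import pandas as pd

# Toy frame reusing a few of the column names above (made-up rows)
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Male"],
    "Source": ["S122", "S133", "S122", "S159"],
    "Var2": ["B", "G", "B", "B"],
})

# Number of distinct values in each column
n_distinct = df.nunique()

# Share of rows taken by the most frequent value in each column
top_share = {col: df[col].value_counts(normalize=True).iloc[0] for col in df.columns}

print(n_distinct)
print(top_share)
```

Columns with many rare levels, such as `Source` and `Salary_Account` above, produce many sparse columns if they are one-hot encoded as they are.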
Single-feature analysis
Count the distinct values
# Handle the City field: count how many distinct values it takes
len(data['City'].unique())
Drop the attribute
data.drop('City',axis=1,inplace=True)
Fill missing values with the median
# Fill the gaps with the median (only a few values are missing)
data['Loan_Amount_Applied'] = data['Loan_Amount_Applied'].fillna(data['Loan_Amount_Applied'].median())
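To see what the median fill does, here is a minimal sketch on a toy series (made-up loan amounts). `median()` skips missing values by default, so the fill value is computed from the observed entries only.

```python
import numpy as np
import pandas as pd

# Toy column with two gaps (made-up amounts)
loan = pd.Series([100000.0, np.nan, 300000.0, 200000.0, np.nan],
                 name="Loan_Amount_Applied")

# Median of the observed values 100000, 200000, 300000
median_value = loan.median()

# Assign the result back instead of calling fillna(..., inplace=True) on a selected column
filled = loan.fillna(median_value)

print(median_value)
print(filled)
```

The median is less sensitive than the mean to a handful of very large loan amounts, which is one common reason to prefer it for skewed monetary columns.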
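One-hot encode the data set
# var_to_encode is not defined in this post; it must hold the names of the categorical columns to encode
data = pd.get_dummies(data, columns=var_to_encode)
data.columns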
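Since `var_to_encode` is never defined in the post, the sketch below assumes a hypothetical two-column list to show what `pd.get_dummies` produces. The rows are made up; only the column names and category labels come from the output above.

```python
import pandas as pd

# Toy frame (made-up rows)
df = pd.DataFrame({
    "Device_Type": ["Web-browser", "Mobile", "Web-browser"],
    "Filled_Form": ["N", "Y", "N"],
    "Monthly_Income": [20000, 35000, 50000],
})

# Hypothetical choice of columns to encode; the original list is not shown in the post
var_to_encode = ["Device_Type", "Filled_Form"]

# Each listed column is replaced by one indicator column per distinct value
encoded = pd.get_dummies(df, columns=var_to_encode)

print(list(encoded.columns))
```

Columns that are not listed (here `Monthly_Income`) pass through unchanged, and each indicator column is named `<column>_<value>`.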