
Python | A Simple Data Analysis

2017-02-09 18:08
Python and R are the two workhorses of data analysis. This post walks through a case I worked on while learning data analysis with Python.

Step 1: Set the working directory

#encoding:utf8
import os
os.chdir("G:\\Anaconda3\\Scripts\\lecture01\\Feature_engineering_and_model_tuning\\Feature-engineering_and_Parameter_Tuning_XGBoost")
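Calling os.chdir changes the working directory for the whole process, so every relative path used afterwards depends on it. As an alternative sketch, the data folder can be kept in a variable and joined with the file names. `DATA_DIR` and the temporary directory below are illustrative stand-ins, not the folder used in this post.

```python
import tempfile
from pathlib import Path

# Hypothetical stand-in for the project folder; a temporary directory is used
# so that the snippet runs on any machine.
DATA_DIR = Path(tempfile.mkdtemp())

# Build full paths explicitly instead of relying on the current directory.
train_path = DATA_DIR / "Train.csv"
test_path = DATA_DIR / "Test.csv"

print(train_path.name, test_path.name)
```

On Windows, a raw string such as r"G:\some\folder" avoids having to double every backslash.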


Step 2: Load the packages

import pandas as pd
import numpy as np
%matplotlib inline


Step 3: Load the data

# Load the training and test sets
train = pd.read_csv('Train.csv',encoding = "ISO-8859-1")
test = pd.read_csv('Test.csv',encoding = "ISO-8859-1")
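The encoding argument matters when a file contains non-ASCII characters that were not saved as UTF-8. The sketch below uses a made-up two-line CSV held in memory (the employer name is invented) to show that bytes written as ISO-8859-1 read back correctly with that encoding but are not valid UTF-8.

```python
import io
import pandas as pd

# Toy CSV with a non-ASCII character, encoded as ISO-8859-1 (Latin-1).
raw = "ID,Employer_Name\nID001,Café Ltd\n".encode("ISO-8859-1")

# Reading with the matching encoding recovers the original text.
df = pd.read_csv(io.BytesIO(raw), encoding="ISO-8859-1")
print(df.loc[0, "Employer_Name"])

# The same bytes cannot be decoded as UTF-8.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print(utf8_ok)
```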


Step 4: Inspect the data

Dimensions

train.shape, test.shape


((87020, 26), (37717, 24))

Data types

# Check the data type of each column
train.dtypes


ID object
Gender object
City object
Monthly_Income int64
DOB object
Lead_Creation_Date object
Loan_Amount_Applied float64
Loan_Tenure_Applied float64
Existing_EMI float64
Employer_Name object
Salary_Account object
Mobile_Verified object
Var5 int64
Var1 object
Loan_Amount_Submitted float64
Loan_Tenure_Submitted float64
Interest_Rate float64
Processing_Fee float64
EMI_Loan_Submitted float64
Filled_Form object
Device_Type object
Var2 object
Source object
Var4 int64
LoggedIn int64
Disbursed int64
dtype: object
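A dtype listing like this is often used to separate numeric columns from non-numeric ones before further processing. A minimal sketch on a made-up three-column frame (the values are invented; only the column names come from the listing):

```python
import pandas as pd

# Toy frame that mimics a few of the columns listed above.
toy = pd.DataFrame({
    "Gender": ["Male", "Female"],
    "Monthly_Income": [20000, 35000],
    "Existing_EMI": [0.0, 1500.0],
})

# Numeric columns (int64, float64) versus everything else.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
non_numeric_cols = toy.select_dtypes(exclude="number").columns.tolist()
print(numeric_cols, non_numeric_cols)
```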

Preview the data

# Look at the first 5 rows
train.head(5)


Merge the datasets

# Combine train and test into a single DataFrame
train['source']= 'train'
test['source'] = 'test'
data=pd.concat([train, test],ignore_index=True)
data.shape


(124737, 27)
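The combined frame has 27 columns because the two extra training columns are kept and the source tag is added. The sketch below, on made-up toy frames, shows two consequences of this pattern: columns that exist only in the training set become NaN for the test rows, and the source tag lets the two sets be separated again after preprocessing.

```python
import pandas as pd

# Toy stand-ins for the real files; the test set has no target column.
train_toy = pd.DataFrame({"ID": ["a", "b"], "Disbursed": [0, 1]})
test_toy = pd.DataFrame({"ID": ["c"]})

train_toy["source"] = "train"
test_toy["source"] = "test"
combined = pd.concat([train_toy, test_toy], ignore_index=True)

# Columns missing from one frame are filled with NaN for its rows.
n_missing_target = combined["Disbursed"].isnull().sum()

# The tag recovers the original split once preprocessing is done.
train_back = combined[combined["source"] == "train"].drop(columns="source")
test_back = combined[combined["source"] == "test"].drop(columns=["source", "Disbursed"])
print(n_missing_target, train_back.shape, test_back.shape)
```

This is why Disbursed and LoggedIn each show 37717 missing values in the null count that follows: that is exactly the number of test rows.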

Check for anomalies

Missing values

data.apply(lambda x: sum(x.isnull()))


City 1401
DOB 0
Device_Type 0
Disbursed 37717
EMI_Loan_Submitted 84901
Employer_Name 113
Existing_EMI 111
Filled_Form 0
Gender 0
ID 0
Interest_Rate 84901
Lead_Creation_Date 0
Loan_Amount_Applied 111
Loan_Amount_Submitted 49535
Loan_Tenure_Applied 111
Loan_Tenure_Submitted 49535
LoggedIn 37717
Mobile_Verified 0
Monthly_Income 0
Processing_Fee 85346
Salary_Account 16801
Source 0
Var1 0
Var2 0
Var4 0
Var5 0
source 0
dtype: int64
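Raw counts are easier to judge as a share of all rows. For example, 84901 missing out of 124737 rows means roughly 68% of Interest_Rate is empty, while 111 missing is under 0.1%. A small sketch on made-up values:

```python
import numpy as np
import pandas as pd

# Toy frame with one mostly-empty column and one complete column.
toy = pd.DataFrame({
    "Interest_Rate": [np.nan, 13.5, np.nan, np.nan],
    "Monthly_Income": [20000, 35000, 18000, 50000],
})

missing_count = toy.isnull().sum()   # number of missing values per column
missing_share = toy.isnull().mean()  # fraction of missing values per column
print(missing_share)
```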

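- Inspect the distinct values in each column

var = ['Gender','Salary_Account','Mobile_Verified','Var1','Filled_Form','Device_Type','Var2','Source']
for v in var:
    # Print every distinct value of the column together with its frequency
    print('\nDistinct values and their counts for column %s\n' % v)
    print(data[v].value_counts())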


Distinct values and their counts for column Gender

Male 71398
Female 53339
Name: Gender, dtype: int64

Distinct values and their counts for column Salary_Account

HDFC Bank 25180
ICICI Bank 19547
State Bank of India 17110
Axis Bank 12590
Citibank 3398
Kotak Bank 2955
IDBI Bank 2213
Punjab National Bank 1747
Bank of India 1713
Bank of Baroda 1675
Standard Chartered Bank 1434
Canara Bank 1385
Union Bank of India 1330
Yes Bank 1120
ING Vysya 996
Corporation bank 948
Indian Overseas Bank 901
State Bank of Hyderabad 854
Indian Bank 773
Oriental Bank of Commerce 761
IndusInd Bank 711
Andhra Bank 706
Central Bank of India 648
Syndicate Bank 614
Bank of Maharasthra 576
HSBC 474
State Bank of Bikaner & Jaipur 448
Karur Vysya Bank 435
State Bank of Mysore 385
Federal Bank 377
Vijaya Bank 354
Allahabad Bank 345
UCO Bank 344
State Bank of Travancore 333
Karnataka Bank 279
United Bank of India 276
Dena Bank 268
Saraswat Bank 265
State Bank of Patiala 263
South Indian Bank 223
Deutsche Bank 176
Abhyuday Co-op Bank Ltd 161
The Ratnakar Bank Ltd 113
Tamil Nadu Mercantile Bank 103
Punjab & Sind bank 84
J&K Bank 78
Lakshmi Vilas bank 69
Dhanalakshmi Bank Ltd 66
State Bank of Indore 32
Catholic Syrian Bank 27
India Bulls 21
B N P Paribas 15
Firstrand Bank Limited 11
GIC Housing Finance Ltd 10
Bank of Rajasthan 8
Kerala Gramin Bank 4
Industrial And Commercial Bank Of China Limited 3
Ahmedabad Mercantile Cooperative Bank 1
Name: Salary_Account, dtype: int64

Distinct values and their counts for column Mobile_Verified

Y 80928
N 43809
Name: Mobile_Verified, dtype: int64

Distinct values and their counts for column Var1

HBXX 84901
HBXC 12952
HBXB 6502
HAXA 4214
HBXA 3042
HAXB 2879
HBXD 2818
HAXC 2171
HBXH 1387
HCXF 990
HAYT 710
HAVC 570
HAXM 386
HCXD 348
HCYS 318
HVYS 252
HAZD 161
HCXG 114
HAXF 22
Name: Var1, dtype: int64

Distinct values and their counts for column Filled_Form

N 96740
Y 27997
Name: Filled_Form, dtype: int64

Distinct values and their counts for column Device_Type

Web-browser 92105
Mobile 32632
Name: Device_Type, dtype: int64

Distinct values and their counts for column Var2

B 53481
G 47338
C 20366
E 1855
D 918
F 770
A 9
Name: Var2, dtype: int64

Distinct values and their counts for column Source

S122 55249
S133 42900
S159 7999
S143 6140
S127 2804
S137 2450
S134 1900
S161 1109
S151 1018
S157 929
S153 705
S144 447
S156 432
S158 294
S123 112
S141 83
S162 60
S124 43
S150 19
S160 11
S136 5
S138 5
S155 5
S139 4
S129 4
S135 2
S142 1
S140 1
S154 1
S125 1
S130 1
S126 1
S132 1
S131 1
Name: Source, dtype: int64
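Note that value_counts skips missing values by default, so the 16801 missing Salary_Account entries found in the null check do not appear in the listing above. A small sketch on a made-up Series shows how dropna=False makes them visible:

```python
import numpy as np
import pandas as pd

# Toy column with one missing entry.
accounts = pd.Series(
    ["HDFC Bank", "ICICI Bank", "HDFC Bank", np.nan], name="Salary_Account"
)

counts_default = accounts.value_counts()               # NaN is left out
counts_with_nan = accounts.value_counts(dropna=False)  # NaN gets its own row
print(counts_with_nan)
```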

Single-feature analysis

Count the distinct values

# Handle the City column
len(data['City'].unique())
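Because City contains missing values, the number returned by len(unique()) includes NaN as one of the values. The sketch below, on a made-up Series with invented city names, contrasts it with nunique(), which leaves NaN out by default.

```python
import numpy as np
import pandas as pd

# Toy column with two real cities and one missing entry.
city = pd.Series(["Delhi", "Mumbai", np.nan, "Delhi"], name="City")

n_with_nan = len(city.unique())  # NaN counts as one of the distinct values
n_without_nan = city.nunique()   # NaN is excluded by default
print(n_with_nan, n_without_nan)
```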


Drop the column

data.drop('City',axis=1,inplace=True)


Fill missing values with the median

# Fill the missing values with the median (only a few values are missing)
data['Loan_Amount_Applied'] = data['Loan_Amount_Applied'].fillna(data['Loan_Amount_Applied'].median())
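As a minimal illustration of what this step does, here is the same median fill on a made-up four-row frame. The amounts are invented; median() ignores NaN when it is computed.

```python
import numpy as np
import pandas as pd

# Toy column with a single missing amount.
toy = pd.DataFrame({"Loan_Amount_Applied": [100000.0, np.nan, 300000.0, 500000.0]})

median_value = toy["Loan_Amount_Applied"].median()  # computed over non-missing values
toy["Loan_Amount_Applied"] = toy["Loan_Amount_Applied"].fillna(median_value)
print(median_value, toy["Loan_Amount_Applied"].isnull().sum())
```

Assigning the result back to the column, rather than calling fillna with inplace=True on a selected column, keeps the update explicit and does not depend on whether the selection is a view or a copy.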


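One-hot encode the dataset

# Assumption: var_to_encode is never defined in this post; here it reuses the categorical columns in var
var_to_encode = var
data = pd.get_dummies(data, columns=var_to_encode)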
data.columns
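To make the effect of get_dummies concrete, the sketch below encodes one categorical column of a made-up frame. Each distinct value becomes its own indicator column named column_value, and the columns that are not encoded are kept unchanged.

```python
import pandas as pd

# Toy frame with one categorical and one numeric column.
toy = pd.DataFrame({
    "Device_Type": ["Web-browser", "Mobile", "Web-browser"],
    "Monthly_Income": [20000, 35000, 18000],
})

encoded = pd.get_dummies(toy, columns=["Device_Type"])
print(encoded.columns.tolist())
```

A column with many distinct values, such as Source above, produces one indicator column per value, so the width of the frame grows accordingly.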