
Kaggle/Titanic: Analysis and Modeling in Python

2017-07-29 21:28
Titanic is Kaggle's introductory project. This article follows https://www.kaggle.com/startupsci/titanic/titanic-data-science-solutions as a learning exercise.

1.Workflow stages

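The complete workflow has 7 stages. Kaggle already provides stages 1 and 2. Most of the remaining work is data preparation, the so-called "feature engineering", and exploring the data through plots is an essential skill along the way.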

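What does "Wrangle" mean here? It refers to cleaning and reshaping raw data into a form that is suitable for analysis.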

Question or problem definition.
Acquire training and testing data.
Wrangle, prepare, cleanse the data.
Analyze, identify patterns, and explore the data.
Model, predict and solve the problem.
Visualize, report, and present the problem solving steps and final solution.
Supply or submit the results.
2. Analyze by describing data

Early exploration of the dataset with pandas can answer the following questions:

Which features are available in the dataset?

Which features are categorical?

Which features are numerical?

Which features are mixed data types?

Which features may contain errors or typos?

Which features contain blank, null or empty values?

What are the data types for various features?

What is the distribution of numerical feature values across the samples?

What is the distribution of categorical features?
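
Most of the questions above can be answered with a handful of pandas calls. The following is a minimal sketch, not the referenced notebook's code: the name `train_df` and the rows are made up for illustration, and only the column names follow the Kaggle Titanic training file.

```python
import pandas as pd

# Made-up rows that only mimic the Titanic column layout (illustration only).
train_df = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5],
    "Survived": [0, 1, 1, 1, 0],
    "Pclass": [3, 1, 3, 1, 3],
    "Sex": ["male", "female", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, None],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05],
    "Cabin": [None, "C85", None, "C123", None],
    "Embarked": ["S", "C", "S", "S", None],
})

# Which features are available in the dataset?
feature_names = list(train_df.columns)

# Which features are numerical, and which are not (text/categorical)?
numeric_cols = train_df.select_dtypes(include="number").columns.tolist()
categorical_cols = train_df.select_dtypes(exclude="number").columns.tolist()

# Which features contain null values, and how many?
null_counts = train_df.isnull().sum()

# Distribution of numerical feature values (count, mean, std, quartiles).
numeric_summary = train_df[numeric_cols].describe()

# Distribution of categorical features (count, unique, top, freq).
categorical_summary = train_df[categorical_cols].describe()

print(feature_names)
print(null_counts[null_counts > 0])
print(numeric_summary)
print(categorical_summary)
```

On the real data, `train_df` would instead come from `pd.read_csv` on the downloaded training file. Mixed-type features and typos still need a manual look at sample rows, for example with `train_df.head()`.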

3. Assumptions based on data analysis
Building on "Analyze by describing data", form assumptions in the following categories.
Correlating feature: in this dataset, for example, females have a higher survival rate.

Completing feature

Correcting feature

Creating new feature

4. Analyze by pivoting features | Analyze by visualizing data
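Section 3 and section 4 must be considered and carried out together; these two steps give a deeper understanding of each feature in the data.
They also help decide which features are useful and which are useless and can be dropped.
The assumptions must be backed by evidence from this step; tables and histograms are both good ways to "pivot" the data and reveal its patterns.
When a feature is categorical, use a table to pivot the data.
When a feature is numerical, use a histogram to pivot the data.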
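
The two rules above can be sketched as follows. This is only an illustration under stated assumptions: the rows are made up, the age bins are arbitrary, and `np.histogram` stands in for a plotted histogram so that the example runs without a plotting backend.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0, 0, 1],
    "Sex": ["male", "female", "female", "female",
            "male", "male", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0, 54.0, 4.0, 20.0],
})

# Categorical feature: a table of the mean survival rate per category.
sex_table = (
    df[["Sex", "Survived"]]
    .groupby("Sex", as_index=False)
    .mean()
    .sort_values(by="Survived", ascending=False)
)

# Numerical feature: histogram counts of Age, split by outcome.
bins = [0, 20, 40, 60]  # arbitrary bin edges chosen for this sketch
survived_counts, _ = np.histogram(df.loc[df["Survived"] == 1, "Age"], bins=bins)
died_counts, _ = np.histogram(df.loc[df["Survived"] == 0, "Age"], bins=bins)

print(sex_table)
print("survived per age bin:", survived_counts)
print("died per age bin:", died_counts)
```

Comparing the two count arrays bin by bin plays the same role as looking at two histograms side by side.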

4.1 Correlating feature
Correlating numerical features

Correlating numerical and ordinal features

Correlating categorical features
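
When two features are correlated with the target at the same time, a two-way table is one option. The sketch below uses made-up rows and `pd.pivot_table`; it is an assumption about how such a check could look, not the referenced notebook's code.

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0, 1, 0],
    "Pclass": [3, 1, 3, 1, 3, 1, 2, 2],
    "Sex": ["male", "female", "female", "female",
            "male", "male", "female", "male"],
})

# Ordinal feature against the target: survival rate per ticket class.
pclass_rate = df.groupby("Pclass")["Survived"].mean()

# Two features at once: survival rate per (Pclass, Sex) cell.
rate_by_class_and_sex = pd.pivot_table(
    df, values="Survived", index="Pclass", columns="Sex", aggfunc="mean"
)

print(pclass_rate)
print(rate_by_class_and_sex)
```

A cell with very few rows gives an unreliable rate, so on real data it helps to check the group sizes as well (for example with `aggfunc="count"`).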

5. Wrangle data
This step is the actual "feature engineering"; sections 2, 3 and 4 above only analyze the features.

Correcting by dropping features

Creating new feature extracting from existing

Converting a categorical feature

Completing a numerical continuous feature

Create new feature combining existing features

Completing a categorical feature

Converting categorical feature to numeric

Quick completing and converting a numeric feature
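
The wrangling steps listed above can be sketched in one pass over a small frame. Everything specific here is an assumption made for illustration: the rows and names are made up, the numeric codes are arbitrary, and filling Age with the median of its Sex/Pclass group, Embarked with the most frequent port, and Fare with the overall median are simply common choices.

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "Name": ["Doe, Mr. John", "Roe, Mrs. Jane", "Poe, Miss. Ann",
             "Low, Master. Tim", "Fox, Mr. Sam", "Ray, Mr. Max"],
    "Sex": ["male", "female", "female", "male", "male", "male"],
    "Pclass": [3, 1, 3, 3, 3, 3],
    "Age": [22.0, 38.0, 26.0, 2.0, None, 30.0],
    "SibSp": [1, 1, 0, 3, 0, 0],
    "Parch": [0, 0, 0, 1, 0, 0],
    "Ticket": ["T1", "T2", "T3", "T4", "T5", "T6"],
    "Fare": [7.25, 71.28, 7.92, 21.07, None, 8.46],
    "Cabin": [None, "C85", None, None, None, None],
    "Embarked": ["S", "C", "S", "S", None, "Q"],
})

# Correcting by dropping features judged not useful.
df = df.drop(columns=["Ticket", "Cabin"])

# Creating a new feature extracted from an existing one: the title in Name.
df["Title"] = df["Name"].str.extract(r" ([A-Za-z]+)\.", expand=False)

# Converting a categorical feature to numeric codes.
df["Sex"] = df["Sex"].map({"female": 1, "male": 0}).astype(int)

# Completing a numerical continuous feature: median Age of the same Sex/Pclass group.
group_median_age = df.groupby(["Sex", "Pclass"])["Age"].transform("median")
df["Age"] = df["Age"].fillna(group_median_age)

# Creating new features by combining existing ones.
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1
df["IsAlone"] = (df["FamilySize"] == 1).astype(int)

# Completing a categorical feature with its most frequent value.
most_common_port = df["Embarked"].dropna().mode()[0]
df["Embarked"] = df["Embarked"].fillna(most_common_port)

# Converting the categorical feature to numeric.
df["Embarked"] = df["Embarked"].map({"S": 0, "C": 1, "Q": 2}).astype(int)

# Quick completing of a numeric feature with the overall median.
df["Fare"] = df["Fare"].fillna(df["Fare"].median())

print(df)
```

On the real data the same transformations have to be applied to both the training and the test set, so that the model sees identical columns in each.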