机器学习实践:Multiple Features Of One Input
2017-02-23 23:08
411 查看
这个数据集合是在华盛顿大学公开课中下载的,是华盛顿州king county一段时间的房屋交易详情。用到一半才知道这只是某连续12个月的。
Canvas is accessible via web browser at the URL: http://localhost:42164/index.html Opening Canvas in default web browser.
[5 rows x 3 columns]
观察到date是datetime类型,而且我只想用到年份和月份,所以剔除掉不需要的部分,之后还要划分训练集和测试集
[5 rows x 3 columns]
按照年月分组计算每组的平均值
[5 rows x 2 columns]
画出来图像看看
[13 rows x 2 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
![](https://oscdn.geek-share.com/Uploads/Images/Content/201909/10/43d44e5d940faf64ec05f18df22b0d73)
有点尴尬,本来以为数据有好几年的,结果只有一年的,实在是尴尬。硬着头皮按照之前的想法做吧
[5 rows x 3 columns]
![](https://oscdn.geek-share.com/Uploads/Images/Content/201909/10/def9e91c72cf0760324130d37ae63802)
然后试试内建的线性回归方法
![](https://oscdn.geek-share.com/Uploads/Images/Content/201909/10/825b8ea722b4afdd2f8f72897cea88f6)
yi=w0+w1xi+w2sin(2πxi/12−Φ)+ϵi
即
yi=w0+w1xi+w2sin(2πxi/12)+w3cos(2πxi/12)+ϵi
參差平方和
RSS=∑i=1N[yi−(w0+w1xi+w2sin(2πxi/12)+w3cos(2πxi/12))]2
求出梯度
为方便表示,令A=yi−(w0+w1xi+w2sin(2πxi/12)+w3cos(2πxi/12))
∇RSS=⎡⎣⎢⎢⎢⎢⎢⎢−2∑A−2∑Axi−2∑Asin(πxi/6)−2∑Acos(πxi/6)⎤⎦⎥⎥⎥⎥⎥⎥
首先计算y^i
计算RSS
计算RSS的梯度
迭代计算参数
![](https://oscdn.geek-share.com/Uploads/Images/Content/201909/10/232ba3be1753173e2abf47cbb2df7e75)
![](https://oscdn.geek-share.com/Uploads/Images/Content/201909/10/f06a5575c0356ef0bc346a9b3a8aae6d)
从上边的图形中可以看出新模型对训练集的拟合是非常好的
[5 rows x 2 columns]
![](https://oscdn.geek-share.com/Uploads/Images/Content/201909/10/b49af84387147b80d635638ebcc44a37)
这里看模型表现的不怎么样,主要原因应该是数据太少了,而且用的是每个月份所有房子的平均价格(不管大小、区位、档次等因素),如果有类似单位面积价格的话更好一些。另外,如果数据量大,每个月都有各种各样的房子成交(类似于训练集的数据量),那么表现也会好一些吧。
选房子的总价做预测不是很合理,如果有单位面积价格可能更好
数据量较小
启动graphlab
import graphlab as gl
加载数据
sale = gl.SFrame('kc_house_data.gl')
sale.show(view='Line Chart', x='date', y='price')
Canvas is accessible via web browser at the URL: http://localhost:42164/index.html Opening Canvas in default web browser.
数据预处理
打算只用时间和平均成交价格学习一下,所以先把这两个量整理一下data = sale.select_columns(['id', 'date', 'price'])
data.head((5))
id | date | price |
---|---|---|
7129300520 | 2014-10-13 00:00:00+00:00 | 221900.0 |
6414100192 | 2014-12-09 00:00:00+00:00 | 538000.0 |
5631500400 | 2015-02-25 00:00:00+00:00 | 180000.0 |
2487200875 | 2014-12-09 00:00:00+00:00 | 604000.0 |
1954400510 | 2015-02-18 00:00:00+00:00 | 510000.0 |
观察到date是datetime类型,而且我只想用到年份和月份,所以剔除掉不需要的部分,之后还要划分训练集和测试集
from datetime import date data['date'] = data['date'].apply(lambda x:date(x.year, x.month, 1)) data.head((5))
id | date | price |
---|---|---|
7129300520 | 2014-10-01 00:00:00 | 221900.0 |
6414100192 | 2014-12-01 00:00:00 | 538000.0 |
5631500400 | 2015-02-01 00:00:00 | 180000.0 |
2487200875 | 2014-12-01 00:00:00 | 604000.0 |
1954400510 | 2015-02-01 00:00:00 | 510000.0 |
train_data, test_data = data.random_split(0.8, 0)
按照年月分组计算每组的平均值
import graphlab.aggregate as agg
mean_price_month = tr 4000 ain_data.groupby(key_columns=['date'], operations={'mean_price':agg.MEAN('price')}) mean_price_month.head((5))
date | mean_price |
---|---|
2014-10-01 00:00:00 | 536548.106171 |
2015-03-01 00:00:00 | 546654.241562 |
2014-09-01 00:00:00 | 522656.637439 |
2014-08-01 00:00:00 | 535107.794344 |
2014-12-01 00:00:00 | 522391.405724 |
画出来图像看看
import matplotlib.pyplot as plt %matplotlib inline
#首先需要根据时间顺序对表进行排序 mean_price_month = mean_price_month.sort(['date']) mean_price_month
date | mean_price |
---|---|
2014-05-01 00:00:00 | 549571.932862 |
2014-06-01 00:00:00 | 556720.479011 |
2014-07-01 00:00:00 | 546336.921229 |
2014-08-01 00:00:00 | 535107.794344 |
2014-09-01 00:00:00 | 522656.637439 |
2014-10-01 00:00:00 | 536548.106171 |
2014-11-01 00:00:00 | 523900.917319 |
2014-12-01 00:00:00 | 522391.405724 |
2015-01-01 00:00:00 | 523216.857143 |
2015-02-01 00:00:00 | 500171.623 |
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
plt.plot(mean_price_month['date'], mean_price_month['mean_price'], '.')
有点尴尬,本来以为数据有好几年的,结果只有一年的,实在是尴尬。硬着头皮按照之前的想法做吧
内建的模型拟合
处理datetime类型不方便,转换成整数类型,从2014-05开始累计mean_price_month['mon_count'] = (mean_price_month['date'].apply(lambda x:x.year)-2014)*12 + mean_price_month['date'].apply(lambda x: x.month)-5 mean_price_month.head((5))
date | mean_price | mon_count |
---|---|---|
2014-05-01 00:00:00 | 549571.932862 | 0 |
2014-06-01 00:00:00 | 556720.479011 | 1 |
2014-07-01 00:00:00 | 546336.921229 | 2 |
2014-08-01 00:00:00 | 535107.794344 | 3 |
2014-09-01 00:00:00 | 522656.637439 | 4 |
model1 = gl.regression.create(mean_price_month, target='mean_price', features=['mon_count']) plt.plot(mean_price_month['mon_count'], mean_price_month['mean_price'], '.', mean_price_month['mon_count'], model1.predict(mean_price_month), '-')
Boosted trees regression:
--------------------------------------------------------
Number of examples : 13
Number of features : 1
Number of unpacked features : 1
+-----------+--------------+--------------------+---------------+
| Iteration | Elapsed Time | Training-max_error | Training-rmse |
+-----------+--------------+--------------------+---------------+
| 1 | 0.000348 | 415092.625000 | 388280.468750 |
| 2 | 0.000616 | 307045.250000 | 280394.250000 |
| 3 | 0.000691 | 229096.781250 | 202668.718750 |
| 4 | 0.000758 | 172862.531250 | 146742.203125 |
| 5 | 0.000824 | 132293.531250 | 106596.460938 |
| 6 | 0.000914 | 103025.906250 | 77906.953125 |
+-----------+--------------+--------------------+---------------+
然后试试内建的线性回归方法
model2 = gl.linear_regression.create(mean_price_month, target='mean_price', features=['mon_count']) plt.plot(mean_price_month['mon_count'], mean_price_month['mean_price'], '.', mean_price_month['mon_count'], model2.predict(mean_price_month), '-')
Linear regression:
--------------------------------------------------------
Number of examples : 13
Number of features : 1
Number of unpacked features : 1
Number of coefficients : 2
Starting Newton Method
--------------------------------------------------------
+-----------+----------+--------------+--------------------+---------------+
| Iteration | Passes | Elapsed Time | Training-max_error | Training-rmse |
+-----------+----------+--------------+--------------------+---------------+
| 1 | 2 | 0.000243 | 37346.856866 | 18013.646218 |
+-----------+----------+--------------+--------------------+---------------+
SUCCESS: Optimal solution found.
梯度下降法来实现自己的模型
将价格随着月份的变化看作是由两部分组成的,首先是一个线性部分,然后是一个周期性变化的三角函数部分yi=w0+w1xi+w2sin(2πxi/12−Φ)+ϵi
即
yi=w0+w1xi+w2sin(2πxi/12)+w3cos(2πxi/12)+ϵi
參差平方和
RSS=∑i=1N[yi−(w0+w1xi+w2sin(2πxi/12)+w3cos(2πxi/12))]2
求出梯度
为方便表示,令A=yi−(w0+w1xi+w2sin(2πxi/12)+w3cos(2πxi/12))
∇RSS=⎡⎣⎢⎢⎢⎢⎢⎢−2∑A−2∑Axi−2∑Asin(πxi/6)−2∑Acos(πxi/6)⎤⎦⎥⎥⎥⎥⎥⎥
首先计算y^i
from math import * import numpy as np def get_haty(W, X): w0, w1, w2, w3 = W return w0 + w1*X + w2*np.sin(X*pi/6) + w3*np.cos(X*pi/6)
计算RSS
def get_RSS(W, X, Y): return sum((Y-get_haty(W,X))*((Y-get_haty(W,X))))
计算RSS的梯度
def get_RSS_grad(W, X, Y): a = -2*sum(Y - get_haty(W, X)) b = -2*sum((Y - get_haty(W, X))*X) c = -2*sum((Y - get_haty(W, X))*np.sin(pi*X/6)) d = -2*sum((Y - get_haty(W, X))*np.cos(pi*X/6)) return np.array([a, b, c, d])
迭代计算参数
def get_W(ini_W, X, Y, stop_grad, step, max_ite=1000000): W = ini_W ite = 0 grad = get_RSS_grad(W, X, Y) RSS = [get_RSS(W, X, Y)] while sum(grad*grad) > stop_grad*stop_grad and ite < max_ite: ite = ite+1 W = W - step*grad#如果必要可以给step加上一个越来越大的分母来控制学习率 grad = get_RSS_grad(W, X, Y) RSS.append(get_RSS(W, X, Y)) return W, RSS, ite
W = np.array([520000, 3000, -21000, 21000]) plt.plot(mean_price_month['mon_count'], mean_price_month['mean_price'], '.', mean_price_month['mon_count'], get_haty(W,mean_price_month['mon_count']), '-')
W, RSS, ite = get_W(W, np.array(mean_price_month['mon_count']), np.array(mean_price_month['mean_price']), 0.1, 0.0015 ) #学习率和初始位置的选择很重要,花费了较长的时间 print ite print RSS[-100:] plt.plot(mean_price_month['mon_count'], mean_price_month['mean_price'], '.', mean_price_month['mon_count'], get_haty(W,mean_price_month['mon_count']), '-')
0 [1484680834.0960073]
从上边的图形中可以看出新模型对训练集的拟合是非常好的
在测试集上实验
train_data, test_data = data.random_split(0.8, 0) test_data = test_data.groupby(key_columns=['date'], operations={'mean_price':agg.MEAN('price')}) test_data = test_data.sort('date') test_data.head((5))
date | mean_price |
---|---|
2014-05-01 00:00:00 | 542100.963173 |
2014-06-01 00:00:00 | 563056.417234 |
2014-07-01 00:00:00 | 538206.346793 |
2014-08-01 00:00:00 | 541864.872396 |
2014-09-01 00:00:00 | 557384.857567 |
test_data['mon_count'] = (test_data['date'].apply(lambda x:x.year)-2014)*12 + test_data['date'].apply(lambda x: x.month)-5
plt.plot(test_data['mon_count'], test_data['mean_price'], '.', test_data['mon_count'], get_haty(W, test_data['mon_count']), '-') print get_RSS(W, np.array(test_data['mon_count']), np.array(test_data['mean_price']))
2498013785.02
这里看模型表现的不怎么样,主要原因应该是数据太少了,而且用的是每个月份所有房子的平均价格(不管大小、区位、档次等因素),如果有类似单位面积价格的话更好一些。另外,如果数据量大,每个月都有各种各样的房子成交(类似于训练集的数据量),那么表现也会好一些吧。
总结
数据时间跨度太小选房子的总价做预测不是很合理,如果有单位面积价格可能更好
数据量较小
相关文章推荐
- 机器学习规则:ML工程最佳实践----rule_of_ml section 3【翻译】
- reason=>"Expected one of #, input, filter, output at line 1, column 1 (byte 1) after "}
- 机器学习实践初步总结
- 机器学习实践笔记3(树和随机森林)
- TensorFlow框架(5)之机器学习实践
- IllegalBlockSizeException: Input length must be multiple of 8 when decrypting with padded cipher
- 京东电商广告和推荐的机器学习系统实践
- 机器学习实践中的 7 种常见错误
- 京东广告推荐机器学习系统实践
- 人肉工程在机器学习实践中的作用
- 机器学习之多变量线性回归(Linear Regression with multiple variables)
- 数学之路(机器学习实践指南)-文本挖掘与NLP(6)
- 台湾大学林轩田机器学习基石课程学习笔记13 -- Hazard of Overfitting
- 'One year master program of embedded systems' 实践项目
- 机器学习实践 学习笔记1 Machine learning basics
- How to update multiple columns of one table using values from another table?
- 林轩田之机器学习课程笔记( combining predictive features之 adaptive boosting)(32之24)
- 数学之路(机器学习实践指南)-深度学习(1)-caffe
- javax.crypto.IllegalBlockSizeException: Input length must be multiple of 16 when decrypting with padded cipher--转载
- 林轩田之机器学习课程笔记( distilling hidden features之deep learning)(32之29)