
Andrew Ng Machine Learning Notes - Week 1 - Univariate Linear Regression

2017-08-19 15:07

1. Introduction

1.1 Welcome

What is Machine Learning

Grew out of work in AI (machine learning originated in the field of artificial intelligence)

New capability for computers (ML has developed into a new capability for computers)

Examples (machine learning applications):

Database mining

Large datasets from growth of automation/web

E.g., web click data, medical records, biology, engineering

Applications that can't be programmed by hand

E.g., autonomous helicopter, handwriting recognition, most of Natural Language Processing (NLP), Computer Vision

Self-customizing programs

E.g., Amazon, Netflix, iTunes Genius product recommendations

Understanding human learning (used to understand human learning and the brain)

1.2 What is ML?

There is no precise, generally accepted definition yet.

Arthur Samuel (1959) defined ML as: "Field of study that gives computers the ability to learn without being explicitly programmed." (A field of research that gives computers the ability to learn on their own, rather than relying on explicit programming.)

Tom Mitchell (1998) gave a well-posed learning problem definition: "A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E." (A computer program is said to learn from experience E with respect to task T if its performance on T, as measured by P, improves with experience E.)

Example: playing checkers.

E = the experience of playing many games of checkers

T = the task of playing checkers.

P = the probability that the program will win the next game.

ML algorithms (categories of machine learning):

Supervised learning (we teach the computer how to learn)

Unsupervised learning (the computer learns by itself)

1.3 Supervised learning

Example 1:

Housing price prediction:



A student collected some housing price data from Portland, Oregon.

Plot the data on a coordinate system, with the horizontal axis showing the house size in square feet and the vertical axis showing the price in thousands of dollars.

Based on this data, suppose a friend of yours owns a 750-square-foot house and wants to sell it; how much could it sell for?



We can draw a straight line that matches the data as closely as possible; from it, a 750-square-foot house would sell for about $150,000.

There may be a better fit, for example a quadratic function; using that fit, the same 750-square-foot house would sell for about $200,000.

This is an example of supervised learning.

In supervised learning, we are given a data set and already know what our correct output should look like, having the idea that there is a relationship between the input and the output.

Supervised learning: the algorithm is given a data set that already contains the correct answers, and it learns to produce more correct answers.

In a regression problem, we are trying to predict results within a continuous output, meaning that we are trying to map input variables to some continuous function.

This example is a regression problem: when the value to be predicted is continuous, such as the housing price above, the problem is a regression problem.

Example 2:

Predicting from medical records whether a tumor of a known size is malignant or benign.



The horizontal axis shows the tumor size; on the vertical axis, 1 and 0 indicate whether or not the tumor is malignant.

We can use different symbols for benign and malignant tumors: benign tumors are marked with O, malignant ones with X.

In a classification problem, we are instead trying to predict results in a discrete output. In other words, we are trying to map input variables into discrete categories.

This example is a classification problem: when the value to be predicted is discrete, i.e., one of a set of labels, the problem is a classification problem.

If we add age as a feature:



We only need to draw a line separating the two kinds of tumors.

In other machine learning problems we usually have many more features, such as age, clump thickness, uniformity of tumor cell size, uniformity of cell shape, and so on.

Using the support vector machine algorithm, a computer can handle an effectively infinite number of features.

1.4 Unsupervised learning



Unsupervised learning allows us to approach problems with little or no idea what our results should look like. We can derive structure from data where we don’t necessarily know the effect of the variables.

In unsupervised learning, the data set has no labels at all.

We can derive this structure by clustering the data based on relationships among the variables in the data.

An unsupervised learning algorithm might divide the data into two distinct clusters; this is why it is called a clustering algorithm.

With unsupervised learning there is no feedback based on the prediction results.

Examples:

Grouping Google News stories

Organizing large computer clusters: determining which machines tend to work well together

Social network analysis: automatically identifying groups of friends

Market segmentation from customer databases

The cocktail party problem: separating overlapping voices using two microphones

2. Model and Cost Function

2.1 Model Representation

m denotes the number of training examples

x denotes the features / input variables

y denotes the target / output variable

$(x, y)$ denotes one training example

$(x^{(i)}, y^{(i)})$ denotes the $i$-th training example

h denotes the function output by the learning algorithm, also called the hypothesis

How a supervised learning algorithm works:



Here we have our training set, for example of housing prices. We feed it to the learning algorithm, which does its work and outputs a function, conventionally denoted by a lowercase h and called the hypothesis. Given an input value x, h produces the corresponding y value, the price of the house; that is, h is a function mapping from x to y.

One possible representation is $h_\theta(x) = \theta_0 + \theta_1 x$. Because there is only one feature / input variable, this kind of problem is called univariate (single-variable) linear regression.

2.2 Cost Function

We can measure the accuracy of our hypothesis function by using a cost function. This takes an average difference (actually a fancier version of an average) of all the results of the hypothesis with inputs from x’s and the actual output y’s.

$J(\theta_0, \theta_1) = \frac{1}{2m}\sum_{i=1}^{m}\left(\hat{y}_i - y_i\right)^2 = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x_i) - y_i\right)^2$

To break it apart, it is $\frac{1}{2}\bar{x}$ where $\bar{x}$ is the mean of the squares of $h_\theta(x_i) - y_i$, or the difference between the predicted value and the actual value.

This function is otherwise called the "Squared error function", or "Mean squared error" (MSE). The mean is halved ($\frac{1}{2}$) as a convenience for the computation of gradient descent, as the derivative term of the square function will cancel out the $\frac{1}{2}$ term.

Halving the mean makes the gradient descent computation convenient: differentiating the squared term produces a factor of 2 that cancels the $\frac{1}{2}$.
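To make the formula concrete, here is a minimal Octave sketch that evaluates $J(\theta_0, \theta_1)$ on a toy data set (the values of x, y, and the parameters are made-up assumptions, not course data):

% Toy training set (made-up values): inputs x and targets y
x = [1.0; 2.0; 3.0];
y = [1.5; 3.5; 4.5];
m = length(y);            % number of training examples

% Candidate parameters for the hypothesis h_theta(x) = theta0 + theta1 * x
theta0 = 0.5;
theta1 = 1.4;
h = theta0 + theta1 * x;  % predictions for all m examples at once

% Squared error cost J(theta0, theta1)
J = (1 / (2 * m)) * sum((h - y) .^ 2)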

What the cost function looks like:



We can see that in three-dimensional space there is a point that minimizes $J(\theta_0, \theta_1)$.

When the cost function has more than two parameters, it can no longer be visualized this way.

A contour plot (or contour figure) is effectively a cross-section view of the cost function's surface; a sketch of how to draw both plots follows below.
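As a sketch (same made-up toy data as above; the grid ranges are arbitrary assumptions), the surface and contour plots can be reproduced in Octave like this:

% Toy data (made-up values, as in the cost-function sketch above)
x = [1.0; 2.0; 3.0];
y = [1.5; 3.5; 4.5];
m = length(y);

% Evaluate J over a grid of (theta0, theta1) values
theta0_vals = linspace(-2, 4, 100);
theta1_vals = linspace(-1, 4, 100);
J_vals = zeros(length(theta0_vals), length(theta1_vals));
for i = 1:length(theta0_vals)
  for j = 1:length(theta1_vals)
    h = theta0_vals(i) + theta1_vals(j) * x;
    J_vals(i, j) = (1 / (2 * m)) * sum((h - y) .^ 2);
  end
end

% Surface plot, then a contour plot (a cross-section view of the same surface)
figure; surf(theta0_vals, theta1_vals, J_vals');
xlabel('theta_0'); ylabel('theta_1');
figure; contour(theta0_vals, theta1_vals, J_vals');
xlabel('theta_0'); ylabel('theta_1');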



What we really need is an efficient algorithm that automatically finds the parameters $\theta_0$ and $\theta_1$ that minimize the cost function $J$.

3. Parameter Learning

3.1 Gradient Descent

Gradient descent is an algorithm for finding the minimum of a function; we will use it to minimize the cost function $J(\theta_0, \theta_1)$.

Have some function: $J(\theta_0, \theta_1)$

Want: $\min_{\theta_0, \theta_1} J(\theta_0, \theta_1)$

Start with some $\theta_0, \theta_1$

Keep changing $\theta_0, \theta_1$ to reduce $J(\theta_0, \theta_1)$ until we hopefully end up at a minimum

The idea behind gradient descent: we start with some random combination of parameters $(\theta_0, \theta_1, \ldots, \theta_n)$, compute the cost function, and then look for the next parameter combination that decreases the cost function the most. We keep doing this until we reach a local minimum. Because we have not tried every combination of parameters, we cannot be sure that the local minimum we reach is the global minimum; choosing a different initial combination of parameters may lead to a different local minimum.



The gradient descent algorithm is: repeat until convergence, update $\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$ (for $j = 0$ and $j = 1$).



$\theta_0$ and $\theta_1$ must be updated simultaneously.



Here $\alpha$ is the learning rate; it determines how large a step we take downhill in the direction that decreases the cost function the most.

When people talk about gradient descent, they mean the simultaneous update, as in the sketch below.
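A minimal Octave sketch of one simultaneous-update step (the data and $\alpha$ are made-up assumptions; the point is that both temporaries are computed from the old parameters before either is overwritten):

% Toy data and settings (made-up values)
x = [1.0; 2.0; 3.0];
y = [1.5; 3.5; 4.5];
m = length(y);
theta0 = 0; theta1 = 0;
alpha = 0.1;

% One gradient descent step with SIMULTANEOUS update:
h = theta0 + theta1 * x;                             % predictions at the OLD parameters
temp0 = theta0 - alpha * (1/m) * sum(h - y);         % new theta0, not yet assigned
temp1 = theta1 - alpha * (1/m) * sum((h - y) .* x);  % new theta1, still uses the old theta0
theta0 = temp0;                                      % only now overwrite the parameters
theta1 = temp1;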

To illustrate with a single parameter:



As we approach the minimum, the derivative becomes smaller and smaller, approaching zero, so even with $\alpha$ held fixed the step size automatically shrinks, until the steps become very small and we have converged to the local minimum.

If $\alpha$ is too small, gradient descent takes a long time to reach the minimum.

If $\alpha$ is too large, gradient descent may fail to converge, or even diverge; the small experiment below shows both behaviors.
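A tiny Octave experiment on a made-up one-parameter cost $J(\theta) = \theta^2$ (minimum at $\theta = 0$, derivative $2\theta$), illustrating both failure modes:

% Gradient descent on J(theta) = theta^2, whose derivative is 2*theta
theta = 1.0;
alpha = 0.01;                        % too small: converges, but very slowly
for iter = 1:10
  theta = theta - alpha * 2 * theta;
end
theta                                % approx 0.82 after 10 steps: still far from 0

theta = 1.0;
alpha = 1.5;                         % too large: each step overshoots the minimum
for iter = 1:10
  theta = theta - alpha * 2 * theta;
end
theta                                % 1024: the iterates diverge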



3.2 Gradient Descent for Linear Regression

The figure compares the gradient descent algorithm with the linear regression model:



Applying gradient descent to our earlier linear regression problem, the key is to work out the derivative of the cost function, i.e.:

$\frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) = \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)$

$\frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) = \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right) x^{(i)}$



The algorithm can then be rewritten as: repeat until convergence {

$\theta_0 := \theta_0 - \alpha \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)$

$\theta_1 := \theta_1 - \alpha \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right) x^{(i)}$

} with $\theta_0$ and $\theta_1$ updated simultaneously.



This is sometimes called batch gradient descent (see the sketch below), indicating that every step of gradient descent uses all of the training examples. The name emphasizes that we consider the whole "batch" of training examples. There are in fact other types of gradient descent that are not "batch": instead of considering the entire training set, each step looks only at a small subset of it. We will cover those methods later in the course.
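Putting the pieces together, a minimal batch gradient descent sketch in Octave for univariate linear regression (the data, $\alpha$, and iteration count are made-up assumptions; note that every iteration sums over all m examples, which is what "batch" refers to):

% Toy training set (made-up values, roughly y = 2x)
x = [1.0; 2.0; 3.0; 4.0];
y = [2.0; 4.1; 5.9; 8.2];
m = length(y);

theta0 = 0; theta1 = 0;
alpha = 0.05;
num_iters = 1000;

for iter = 1:num_iters
  h = theta0 + theta1 * x;                            % predictions on ALL m examples
  temp0 = theta0 - alpha * (1/m) * sum(h - y);        % derivative w.r.t. theta0
  temp1 = theta1 - alpha * (1/m) * sum((h - y) .* x); % derivative w.r.t. theta1
  theta0 = temp0;                                     % simultaneous update
  theta1 = temp1;
end

fprintf('Learned hypothesis: h(x) = %.3f + %.3f * x\n', theta0, theta1);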

4. Linear Algebra Review

4.1 Matrices and Vectors

Matrices are 2-dimensional arrays:

$\begin{bmatrix} a & b & c \\ d & e & f \\ g & h & i \\ j & k & l \end{bmatrix}$

The above matrix has four rows and three columns, so it is a 4 x 3 matrix.

The dimension of a matrix is the number of rows × the number of columns.

A vector is a matrix with one column and many rows:

$\begin{bmatrix} w \\ x \\ y \\ z \end{bmatrix}$

Notation and terms:

$A_{ij}$ refers to the element in the $i$th row and $j$th column of matrix $A$.

A vector with ‘n’ rows is referred to as an ‘n’-dimensional vector.

$v_i$ refers to the element in the $i$th row of the vector.

In general, all our vectors and matrices will be 1-indexed. Note that for some programming languages, the arrays are 0-indexed. (That is, whether the first element is indexed from 0 or from 1.)

Matrices are usually denoted by uppercase names while vectors are lowercase.

"Scalar" means that an object is a single value, not a vector or matrix.

$\mathbb{R}$ refers to the set of scalar real numbers.

$\mathbb{R}^n$ refers to the set of $n$-dimensional vectors of real numbers.

% The ; denotes we are going back to a new row.
A = [1, 2, 3; 4, 5, 6; 7, 8, 9; 10, 11, 12]

% Initialize a vector
v = [1;2;3]

% Get the dimension of the matrix A where m = rows and n = columns
[m,n] = size(A)

% You could also store it this way
dim_A = size(A)

% Get the dimension of the vector v
dim_v = size(v)

% Now let's index into the 2nd row 3rd column of matrix A
A_23 = A(2,3)


A =

1    2    3
4    5    6
7    8    9
10   11   12

v =

1
2
3

m =  4
n =  3
dim_A =

4   3

dim_v =

3   1

A_23 =  6


4.2 Addition and Scalar Multiplication

$\begin{bmatrix} a & b \\ c & d \end{bmatrix} + \begin{bmatrix} w & x \\ y & z \end{bmatrix} = \begin{bmatrix} a+w & b+x \\ c+y & d+z \end{bmatrix}$

$\begin{bmatrix} a & b \\ c & d \end{bmatrix} - \begin{bmatrix} w & x \\ y & z \end{bmatrix} = \begin{bmatrix} a-w & b-x \\ c-y & d-z \end{bmatrix}$

$\begin{bmatrix} a & b \\ c & d \end{bmatrix} \ast x = \begin{bmatrix} a \ast x & b \ast x \\ c \ast x & d \ast x \end{bmatrix}$

$\begin{bmatrix} a & b \\ c & d \end{bmatrix} / x = \begin{bmatrix} a/x & b/x \\ c/x & d/x \end{bmatrix}$

% Initialize matrix A and B
A = [1, 2, 4; 5, 3, 2]
B = [1, 3, 4; 1, 1, 1]

% Initialize constant s
s = 2

% See how element-wise addition works
add_AB = A + B

% See how element-wise subtraction works
sub_AB = A - B

% See how scalar multiplication works
mult_As = A * s

% Divide A by s
div_As = A / s

% What happens if we have a Matrix + scalar?
add_As = A + s


A =

1   2   4
5   3   2

B =

1   3   4
1   1   1

s =  2
add_AB =

2   5   8
6   4   3

sub_AB =

0  -1   0
4   2   1

mult_As =

2    4    8
10    6    4

div_As =

0.50000   1.00000   2.00000
2.50000   1.50000   1.00000

add_As =

3   4   6
7   5   4


4.3 Matrix-Vector Multiplication

$\begin{bmatrix} a & b \\ c & d \\ e & f \end{bmatrix} \ast \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} a \ast x + b \ast y \\ c \ast x + d \ast y \\ e \ast x + f \ast y \end{bmatrix}$

% Initialize matrix A
A = [1, 2, 3; 4, 5, 6;7, 8, 9]

% Initialize vector v
v = [1; 1; 1]

% Multiply A * v
Av = A * v


A =

1   2   3
4   5   6
7   8   9

v =

1
1
1

Av =

6
15
24


4.4 Matrix-Matrix Multiplication

$\begin{bmatrix} a & b \\ c & d \\ e & f \end{bmatrix} \ast \begin{bmatrix} w & x \\ y & z \end{bmatrix} = \begin{bmatrix} a \ast w + b \ast y & a \ast x + b \ast z \\ c \ast w + d \ast y & c \ast x + d \ast z \\ e \ast w + f \ast y & e \ast x + f \ast z \end{bmatrix}$

% Initialize a 3 by 2 matrix
A = [1, 2; 3, 4;5, 6]

% Initialize a 2 by 1 matrix
B = [1; 2]

% We expect a resulting matrix of (3 by 2)*(2 by 1) = (3 by 1)
mult_AB = A*B

% Make sure you understand why we got that result


A =

1   2
3   4
5   6

B =

1
2

mult_AB =

5
11
17


4.5 Matrix Multiplication Properties

Matrices are not commutative: $A \ast B \neq B \ast A$ (multiplication does not satisfy the commutative law)

Matrices are associative: $(A \ast B) \ast C = A \ast (B \ast C)$ (multiplication satisfies the associative law)

The identity matrix, when multiplied by any matrix of the same dimensions, results in the original matrix. It's just like multiplying numbers by 1. The identity matrix simply has 1's on the diagonal (upper left to lower right diagonal) and 0's elsewhere.

$\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$

The identity matrix is usually denoted $I$ (or $E$), and it does commute: $AI = IA = A$.

% Initialize random matrices A and B
A = [1,2;4,5]
B = [1,1;0,2]

% Initialize a 2 by 2 identity matrix
I = eye(2)

% The above notation is the same as I = [1,0;0,1]

% What happens when we multiply I*A ?
IA = I*A

% How about A*I ?
AI = A*I

% Compute A*B
AB = A*B

% Is it equal to B*A?
BA = B*A

% Note that IA = AI but AB != BA


A =

1   2
4   5

B =

1   1
0   2

I =

Diagonal Matrix

1   0
0   1

IA =

1   2
4   5

AI =

1   2
4   5

AB =

1    5
4   14

BA =

5    7
8   10


4.6 Inverse and Transpose

Matrix inverse: if $A$ is an $m \times m$ (square) matrix and its inverse exists, then:

$AA^{-1} = A^{-1}A = I$

Matrix transpose: let $A$ be an $m \times n$ matrix (i.e., $m$ rows and $n$ columns) whose element in row $i$, column $j$ is $a(i,j)$, i.e., $A = (a(i,j))$.

The transpose of $A$ is defined as the $n \times m$ matrix $B$ satisfying $b(i,j) = a(j,i)$ (the element in row $i$, column $j$ of $B$ is the element in row $j$, column $i$ of $A$), written $A^T = B$.

Intuitively, the transpose is obtained by mirroring all elements of $A$ across the 45-degree ray extending down and to the right from the element in row 1, column 1.

$A = \begin{bmatrix} a & b \\ c & d \\ e & f \end{bmatrix}$

$A^T = \begin{bmatrix} a & c & e \\ b & d & f \end{bmatrix}$

$A_{ij} = A^T_{ji}$

% Initialize matrix A
A = [1,2,0;0,5,6;7,0,9]

% Transpose A
A_trans = A'

% Take the inverse of A
A_inv = inv(A)

% What is A^(-1)*A?
A_invA = inv(A)*A


A =

1   2   0
0   5   6
7   0   9

A_trans =

1   0   7
2   5   0
0   6   9

A_inv =

0.348837  -0.139535   0.093023
0.325581   0.069767  -0.046512
-0.271318   0.108527   0.038760

A_invA =

1.00000  -0.00000   0.00000
0.00000   1.00000  -0.00000
-0.00000   0.00000   1.00000