P3 Supervised Learning
correct answers
In every example in our data set, we are given the "correct answer" that we would have liked the algorithm to predict for that example, such as the price of the house or whether the tumor is malignant or benign.
Regression: predict a continuous valued output (e.g. price)
Examples: predicting house prices, diagnosing breast cancer
Classification: discrete valued output (0 or 1). E.g. breast cancer: use tumor size and age to predict malignant or benign.
infinite number of features
SVM: allows the computer to deal with an infinite number of features while avoiding memory errors.
Unsupervised Learning
We are not told the categories or cluster assignments in advance.
clustering
cocktail party problem

Advantages of using Octave.
Model representation
m: number of training examples
$x^{(i)}$: "input" variable / feature of the $i$-th training example
$y^{(i)}$: "output" variable / "target" variable of the $i$-th training example
Cost function
Hypothesis: $h_\theta(x)=\theta_0+\theta_1x$
Cost function: $J(\theta_0,\theta_1)=\frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)^2$
contour plot
Find $\theta_0$, $\theta_1$ that minimize $J$.
Gradient descent
Goal: find $\theta_0$, $\theta_1$ that minimize $J$.

Starting from different initial points, gradient descent can converge to different local optima.

Gradient descent requires updating $\theta_0$ and $\theta_1$ simultaneously. The term after $\alpha$ is the partial derivative of $J(\theta_0,\theta_1)$, i.e. the slope of the tangent line at the current $\theta_0$, $\theta_1$. Learning rate $\alpha$: the larger it is, the larger each gradient-descent step.

As we approach a local minimum, gradient descent automatically takes smaller steps, so there is no need to keep decreasing $\alpha$ over time.

Derivation of the gradient descent algorithm

Substituting the linear-regression cost function $J$ into the gradient descent formula yields the update rules for $\theta_0$ and $\theta_1$; when taking the derivative, the exponent 2 of the squared error cancels against the factor $\frac{1}{2}$, so the square disappears.
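For reference, the resulting update rules from the lecture (repeated until convergence, with simultaneous updates):

$$\theta_0 := \theta_0 - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)$$

$$\theta_1 := \theta_1 - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)x^{(i)}$$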

A convex function has only a global optimum and no other local optima.
Batch gradient descent algorithm
batch: each step of gradient descent uses all the training examples
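A minimal Octave sketch of batch gradient descent for univariate linear regression; the function and variable names (x, y, alpha, num_iters) are mine, and x, y are assumed to be column vectors of length m:

```octave
% Batch gradient descent for h_theta(x) = theta0 + theta1*x.
% Each iteration uses ALL m training examples ("batch").
function [theta0, theta1] = batch_gradient_descent(x, y, alpha, num_iters)
  m = length(y);
  theta0 = 0; theta1 = 0;
  for iter = 1:num_iters
    h = theta0 + theta1 * x;            % predictions for all m examples
    grad0 = (1/m) * sum(h - y);         % dJ/dtheta0
    grad1 = (1/m) * sum((h - y) .* x);  % dJ/dtheta1
    % simultaneous update of both parameters
    theta0 = theta0 - alpha * grad0;
    theta1 = theta1 - alpha * grad1;
  end
end
```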
3-1 Linear Algebra review: matrices and vectors
matrix: rectangular array of numbers
dimension of matrix: number of rows × number of columns
vector: an n×1 matrix
3-2 Linear Algebra review: Addition and scalar multiplication
Matrix addition/subtraction, and multiplication/division by a scalar, work just like ordinary arithmetic: element by element.
3-3 Linear Algebra review: matrix-vector multiplication

Substituting the data into a matrix-vector product lets us evaluate the hypothesis for all examples at once.
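A small Octave sketch with made-up numbers and parameters, applying $h_\theta(x)=\theta_0+\theta_1x$ to several house sizes at once:

```octave
sizes = [2104; 1416; 1534; 852];       % example house sizes
theta = [-40; 0.25];                   % hypothetical parameters [theta0; theta1]
X = [ones(length(sizes), 1), sizes];   % prepend a column of ones for theta0
predictions = X * theta;               % one matrix-vector product gives all predictions
```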

3-3 Linear Algebra review: matrix-matrix multiplication

Similarly, several competing hypotheses can be applied to all examples at once with a single matrix-matrix product.

3-4 Linear Algebra review: matrix multiplication properties
Matrix multiplication is not commutative.
Matrix multiplication is associative.
Identity matrix: $AI = IA = A$; $I$ is the identity matrix, and its dimension is usually implicit.
Inverse matrix

3-4 Linear Algebra review: Inverse and transpose


4-1 Linear Regression with multiple variables: Multiple variables
Hypothesis for multivariate linear regression: $h_\theta(x)=\theta_0+\theta_1x_1+\theta_2x_2+\cdots+\theta_nx_n=\theta^Tx$

4-2 Linear Regression with multiple variables: Gradient descent for multiple variables
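With the design matrix $X$ (one row per example, first column all ones), the update for every parameter can be written in vectorized form. A hedged Octave sketch, assuming X, y, theta, and alpha are already defined:

```octave
% One vectorized gradient-descent step for multivariate linear regression.
% X: m x (n+1) design matrix, y: m x 1 targets, theta: (n+1) x 1 parameters.
m = size(X, 1);
grad  = (1/m) * X' * (X * theta - y);  % gradient of J(theta)
theta = theta - alpha * grad;          % simultaneous update of every theta_j
```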

4-2 Linear Regression with multiple variables: Gradient descent in practice I: Feature Scaling
Feature scaling
idea: make sure features are on a similar scale
If two features differ by orders of magnitude in range, the contours of the cost function become very elongated ellipses; gradient descent then oscillates in tiny steps and takes a very long time to converge. As a rule of thumb, keep every feature roughly within the range -3 to 3.


Mean normalization: $x := \frac{x-\mu}{\max-\min}$
Feature scaling does not need to be exact; its only purpose is to make gradient descent converge faster.
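A possible Octave sketch of mean normalization applied per feature (assumes X holds one feature per column; uses max-min as the scale, as above):

```octave
% Mean normalization: x := (x - mu) / (max - min), per feature (column).
mu = mean(X);                 % 1 x n row vector of column means
range = max(X) - min(X);      % 1 x n row vector of column ranges
X_norm = (X - mu) ./ range;   % broadcasting applies this to every row
```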
4-3 Linear Regression with multiple variables: Gradient descent in practice II: Learning rate
For linear regression, if the learning rate is small enough, $J(\theta)$ decreases after every iteration.
Learning rate too large: $J(\theta)$ increases or oscillates and fails to converge.
Learning rate too small: convergence is very slow.

When choosing the learning rate, plot $J(\theta)$ against the number of iterations. Trick: try candidate values spaced roughly 3× apart (…, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, …); after finding values that are too small and too large, pick a suitable learning rate near the largest value that still converges.
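A sketch of how such a sweep might look in Octave, plotting $J(\theta)$ per iteration for each candidate; it assumes X and y exist and reuses the vectorized update above (variable names are mine):

```octave
alphas = [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1];
num_iters = 50;
m = size(X, 1);
figure; hold on;
for a = alphas
  theta = zeros(size(X, 2), 1);
  J_history = zeros(num_iters, 1);
  for iter = 1:num_iters
    theta = theta - a * (1/m) * X' * (X * theta - y);
    J_history(iter) = (1/(2*m)) * sum((X * theta - y).^2);
  end
  plot(1:num_iters, J_history);   % J should decrease every iteration if a is small enough
end
hold off;
```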
4-4 Linear regression with multiple variables:Features and polynomial regression
Different kinds of functions can be chosen to fit the curve, e.g. by building polynomial features such as $x^2$ and $x^3$ from an existing feature.
4-5 Linear regression with multiple variables: Normal equation (a direct solution method, in contrast to iterative methods)
Set the derivative of $J(\theta)$ with respect to each $\theta_j$ to 0 and solve.
$\theta$ = pinv(X'X)X'y, where X' is the transpose of X and pinv computes the (pseudo-)inverse.
The normal equation does not require feature scaling.
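In Octave, a short sketch of the whole step (assumes the raw features are in a matrix data and the targets in a column vector y):

```octave
m = size(data, 1);
X = [ones(m, 1), data];          % add the intercept column; no feature scaling needed
theta = pinv(X' * X) * X' * y;   % normal equation: closed-form solution of dJ/dtheta = 0
```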


4-5 Linear regression with multiple variables: Normal equation and non-invertibility
$\theta$ = pinv(X'X)X'y. If X'X is not invertible, it is a singular/degenerate matrix.
Octave: pinv() and inv() both compute a matrix inverse, but pinv() also handles singular matrices by returning the pseudo-inverse.
Possible reasons X'X is not invertible: redundant (linearly dependent) features, e.g. the same size expressed in both feet² and m²; or too many features (m ≤ n), in which case delete some features or use regularization.
Logistic regression: Classification
Linear regression is not well suited to classification.
Classification: y = 0 or y = 1
With linear regression, h(x) can be > 1 or < 0.
Logistic regression: $0 \le h_\theta(x) \le 1$, used for classification.
Logistic regression: Hypothesis representation
want 0<=h(x)<=1
sigmoid function = logistic function


$h_\theta(x)=g(\theta^Tx)$
$z=\theta^Tx$
$g(z)=\frac{1}{1+e^{-z}}$
$h_\theta(x)=P(y=1\mid x;\theta)=1-P(y=0\mid x;\theta)$
$P(y=0\mid x;\theta)+P(y=1\mid x;\theta)=1$
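A minimal Octave sketch of the sigmoid/logistic hypothesis, vectorized so z may be a scalar, vector, or matrix:

```octave
function g = sigmoid(z)
  % Logistic function g(z) = 1 / (1 + e^(-z)); output always lies in (0, 1).
  g = 1 ./ (1 + exp(-z));
end

% Logistic regression hypothesis: h_theta(x) = g(theta' * x), e.g.
% h = sigmoid(X * theta);   % probability that y = 1 for each example
```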
Logistic regression: Decision boundary
Predict y = 1 when $\theta^Tx \ge 0$;
predict y = 0 when $\theta^Tx < 0$.

Non-linear decision boundaries

Logistic regression: simplified cost function and gradient descent
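For reference, the simplified (cross-entropy) cost function and the gradient descent update from this lecture:

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log h_\theta(x^{(i)}) + \left(1-y^{(i)}\right)\log\left(1-h_\theta(x^{(i)})\right)\right]$$

$$\theta_j := \theta_j - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)x_j^{(i)}$$

The update looks identical to linear regression, but here $h_\theta(x)=g(\theta^Tx)$.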


Logistic regression: Advanced optimization
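Advanced optimizers (conjugate gradient, BFGS, L-BFGS) only need a function that returns the cost and its gradient. A hedged Octave sketch in the style used in the course, where costFunction is a placeholder you would implement yourself and n is assumed to be the number of features:

```octave
% costFunction(theta, X, y) must return [J, grad] for a given theta.
options = optimset('GradObj', 'on', 'MaxIter', 100);
initial_theta = zeros(n + 1, 1);
[optTheta, functionVal, exitFlag] = ...
    fminunc(@(t)(costFunction(t, X, y)), initial_theta, options);
```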

The problem of overfitting
Underfitting, high bias: the model is too simple or uses too few features.
Overfitting, high variance: a complicated function that creates a lot of unnecessary curves and angles unrelated to the data.
Adding many high-order polynomial terms forces the fit: the cost function on the training data becomes very small, but the model does not generalize.
Addressing overfitting (too many features combined with too few training examples causes overfitting):
- reduce number of features
    - manually select which features to keep
    - model selection algorithm (later in course)
- regularization
    - keep all the features, but reduce magnitude/values of parameters $\theta$
    - works well when we have a lot of features, each of which contributes a bit to predicting y
Regularization: cost function
To avoid overfitting (often caused by high-order polynomial terms), a regularization term $\lambda\sum_j\theta_j^2$ is added to the cost function. Because training minimizes the cost, a larger $\lambda$ pushes the parameters $\theta_j$ toward 0 (with a very large $\lambda$ they become almost exactly 0).
The smaller $\theta_1,\theta_2,\ldots,\theta_n$ are, the smoother (simpler) the fitted curve; by convention $\theta_0$ is not penalized.
The regularization term balances two goals: fitting the training data well and keeping the parameter values $\theta$ as small as possible.
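The regularized cost function for linear regression (the penalty sum starts at $j=1$, so $\theta_0$ is not penalized):

$$J(\theta)=\frac{1}{2m}\left[\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)^2+\lambda\sum_{j=1}^{n}\theta_j^2\right]$$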



Regularized linear regression
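For reference, the gradient descent update with regularization (for $j\ge1$) and the regularized normal equation:

$$\theta_j := \theta_j\left(1-\alpha\frac{\lambda}{m}\right)-\alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)x_j^{(i)}$$

$$\theta=\left(X^TX+\lambda L\right)^{-1}X^Ty,\qquad L=\mathrm{diag}(0,1,1,\ldots,1)$$

With $\lambda>0$, $X^TX+\lambda L$ is always invertible, which also fixes the non-invertibility issue above.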



Week 4 model representation I


$a_i^{(j)}$ = "activation" of unit $i$ in layer $j$
$Θ^{(j)}$=matrix of weights controlling function mapping from layer j to layer j+1
dim($\Theta^{(j)}$) = $s_{j+1} \times (s_j + 1)$
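A hedged Octave sketch of forward propagation for one example x (a column vector) in a 3-layer network, with weight matrices Theta1 and Theta2 of the dimensions given above; it reuses the sigmoid function sketched earlier:

```octave
a1 = [1; x];                 % input layer plus bias unit
z2 = Theta1 * a1;            % Theta1 is s2 x (s1 + 1)
a2 = [1; sigmoid(z2)];       % layer-2 activations plus bias unit
z3 = Theta2 * a2;            % Theta2 is s3 x (s2 + 1)
a3 = sigmoid(z3);            % h_Theta(x) = a3, the output layer
```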

Week 5 Neural Network Learning: Unrolling Parameters
The optimization functions require the parameters to be a single vector, but in a neural network the $\Theta^{(j)}$ are matrices rather than vectors; they therefore have to be unrolled into a vector thetaVec (and the gradients likewise), then reshaped back when needed.
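The Octave pattern from the course for unrolling and reshaping; the dimensions 10×11 and 1×11 are only an example:

```octave
% Unroll the weight matrices (and gradients D1, D2) into single long vectors.
thetaVec = [Theta1(:); Theta2(:)];
DVec     = [D1(:); D2(:)];

% Reshape back inside the cost function (example: Theta1 is 10x11, Theta2 is 1x11).
Theta1 = reshape(thetaVec(1:110), 10, 11);
Theta2 = reshape(thetaVec(111:121), 1, 11);
```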


Week 5 Neural Network Learning: Gradient checking
Purpose: catch bugs in the gradient (backpropagation) code.
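A sketch of the numerical check: compare the backpropagation gradient with a two-sided difference approximation. Here J is assumed to be a function handle returning the cost for an unrolled theta vector:

```octave
EPSILON = 1e-4;
gradApprox = zeros(size(theta));
for i = 1:numel(theta)
  thetaPlus  = theta;  thetaPlus(i)  = thetaPlus(i)  + EPSILON;
  thetaMinus = theta;  thetaMinus(i) = thetaMinus(i) - EPSILON;
  gradApprox(i) = (J(thetaPlus) - J(thetaMinus)) / (2 * EPSILON);
end
% gradApprox should be very close to the unrolled backpropagation gradient DVec;
% turn gradient checking off before actual training, because it is slow.
```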
Week 5 Random Initialization
The theta parameters must be initialized to random values; if all weights started out equal (e.g. all zeros), every hidden unit would compute the same function (symmetry). Random initialization performs symmetry breaking.
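The usual Octave pattern for initializing weights in a small symmetric range; INIT_EPSILON and the 10×11 size are illustrative:

```octave
INIT_EPSILON = 0.12;
% Each weight is drawn uniformly from [-INIT_EPSILON, INIT_EPSILON].
Theta1 = rand(10, 11) * (2 * INIT_EPSILON) - INIT_EPSILON;
```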
