Regression
Background: using Pokemon Go as an example, we study regression by modeling the CP value of a Pokemon after evolution.
The training data consists of ten different Pokemon, each with its initial CP value and various other attributes.
Steps:
① Model: a set of functions
linear model $y = b + w \cdot x_{cp}$, where $b$ is the bias and $w$ is the weight
② Goodness of function: use the training data to define a loss function $L$ that evaluates how good $f$ is.
$L(f) = \sum_{n=1}^{10}(\hat{y}^n - f(x_{cp}^n))^2$, input a function, output how bad it is.
sum the squared errors over all examples; $\hat{y}^n$ denotes the true value of the $n$-th example
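As a minimal sketch, the loss can be computed directly from the definition above; the ten $(x_{cp}, \hat{y})$ pairs below are made up for illustration, since the actual training data is not listed here:

```python
# Minimal sketch of the loss L(f): the sum of squared errors over the
# ten training examples. The (x_cp, y_hat) pairs are hypothetical;
# the real values would come from the collected Pokemon data.
x_cp  = [10, 25, 40, 55, 70, 100, 150, 230, 300, 420]   # initial CP (made up)
y_hat = [35, 60, 95, 130, 160, 230, 330, 520, 650, 900] # evolved CP (made up)

def loss(b, w):
    """L(f) = sum_n (y_hat^n - (b + w * x_cp^n))^2"""
    return sum((y - (b + w * x)) ** 2 for x, y in zip(x_cp, y_hat))
```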
③ Best function: gradient descent
pick the best function $f^* = \arg\min_f L(f)$; since $f$ is determined by $w$ and $b$, this amounts to searching over $w, b$
How to find the best function?
Gradient Descent
1. One parameter
premise: consider $L$ with only one parameter, $w$
steps:
1. randomly pick an initial value $w^0$
2. compute the derivative at $w^0$: $$\frac{dL}{dw}\Big|_{w=w^0}$$
3. update: $$w^1 = w^0 -\eta\frac{dL}{dw}\Big|_{w=w^0}$$ where $\eta$ is the learning rate (step size)
The one-dimensional gradient descent process:
Because the result depends on the choice of initial point, gradient descent often finds only a local optimum rather than the global optimum. For linear regression, however, the loss surface is convex, so any local optimum is also the global optimum.
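A minimal sketch of the one-parameter case, assuming the bias is fixed at $b = 0$ so that only $w$ is learned; it reuses the hypothetical data from the first sketch, and the derivative follows from the squared-error loss:

```python
# One-parameter gradient descent: fix b = 0 and learn only w.
# dL/dw = sum_n 2 * (y_hat^n - w * x_cp^n) * (-x_cp^n)
eta = 1e-7   # learning rate (assumed; kept small because x_cp is unnormalized)
w = 0.0      # step 1: an initial value w^0 (fixed here instead of random)
for step in range(10000):
    grad = sum(2 * (y - w * x) * (-x) for x, y in zip(x_cp, y_hat))  # step 2
    w = w - eta * grad                                               # step 3
print(w)     # converges to the w that minimizes the squared error
```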
2. Two parameters
pre: $f^* = \arg\min_f L(f) \;\Rightarrow\; w^*, b^* = \arg\min_{w,b} L(w,b)$
steps:
1. randomly pick initial values $w^0, b^0$
2. compute the partial derivatives at $(w^0, b^0)$: $$\frac{\partial L}{\partial w}\Big|_{w=w^0,b=b^0},\quad \frac{\partial L}{\partial b}\Big|_{w=w^0,b=b^0}$$
3. update both parameters: $$w^1 = w^0 -\eta\frac{\partial L}{\partial w}\Big|_{w=w^0,b=b^0},\quad b^1 = b^0 -\eta\frac{\partial L}{\partial b}\Big|_{w=w^0,b=b^0}$$
gradient: $$\nabla L(w, b)=\left[\frac{\partial L}{\partial w},\frac{\partial L}{\partial b}\right]^T$$
The two-dimensional gradient descent process:
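A minimal sketch of the two-parameter updates, again with the hypothetical data from the first sketch; both partial derivatives come from the squared-error loss:

```python
# Two-parameter gradient descent: learn b and w together.
# dL/db = sum_n 2 * (y_hat^n - (b + w * x_cp^n)) * (-1)
# dL/dw = sum_n 2 * (y_hat^n - (b + w * x_cp^n)) * (-x_cp^n)
eta = 1e-7
b, w = 0.0, 0.0                                  # step 1: initial values b^0, w^0
for step in range(100000):
    err = [y - (b + w * x) for x, y in zip(x_cp, y_hat)]
    grad_b = sum(2 * e * (-1) for e in err)                 # step 2
    grad_w = sum(2 * e * (-x) for e, x in zip(err, x_cp))
    b, w = b - eta * grad_b, w - eta * grad_w               # step 3
# with a single small learning rate and unnormalized inputs, b moves much
# more slowly than w -- one motivation for adaptive learning rates
print(b, w)
```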
3. When the gradient converges to zero
This can happen at local minima or saddle points, not only at the global minimum.
Therefore, convergence alone does not mean that the global minimum, or even a local minimum, has been found.
Model Generalization
Comparing the training error and generalization (testing) error of models of degree one through five, the relationship between them is as follows:
A more complex model contains the simpler models as special cases, so it is more likely to contain the optimal solution;
however, its generalization performance can be poor.
redesign the model: $y = b+ \sum w_i x_i$, to take more hidden factors into account; squared terms of the attributes can also be included in the model
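A minimal sketch of a redesigned model with a squared term, continuing with the hypothetical data above; numpy's least-squares solver (np.linalg.lstsq) stands in for gradient descent here, since both minimize the same squared-error loss:

```python
import numpy as np

# Redesigned model: y = b + w1 * x_cp + w2 * x_cp^2.
X = np.array([[1.0, x, x ** 2] for x in x_cp])  # columns: bias, x_cp, x_cp^2
y = np.array(y_hat, dtype=float)
params, *_ = np.linalg.lstsq(X, y, rcond=None)  # closed-form least squares
b, w1, w2 = params
print(b, w1, w2)  # the quadratic model can only lower the training error,
                  # but may raise the testing error (overfitting)
```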
Overfitting
def: a more complex model does not always lead to better performance on the testing data.
solution: regularization, i.e. add $\lambda\sum(w_i)^2$ to the loss
- Regularization
1. original loss:$$L = \sum_n (\hat{y}^n - (b+\sum w_ix_i))^2,$$which only measures the error between the predicted and true values
2. improved loss:$$L = \sum_n (\hat{y}^n - (b+\sum w_ix_i))^2+\lambda\sum(w_i)^2,$$which also penalizes the size of the weights. The smaller the weights, the less the output is affected by changes in the input, i.e. the smoother the function: if each input shifts by $\Delta x_i$, the output shifts by only $\sum w_i \Delta x_i$:$$y+\sum w_i \Delta x_i = b + \sum w_i(x_i + \Delta x_i)$$Therefore, the larger $\lambda$ is, the more the loss favors smoothness and the larger the training error becomes (fitting the training data matters relatively less), as shown in the figure below.
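A minimal sketch of the effect of $\lambda$, with the bias fixed at $b=0$ so the regularized minimizer has the closed form $w^* = \sum x\hat{y} / (\sum x^2 + \lambda)$; it reuses the hypothetical data from the first sketch:

```python
# Regularized loss: squared error plus lambda * w^2 (the bias b is
# conventionally not regularized, since it does not affect smoothness).
def regularized_loss(b, w, lam):
    err = sum((y - (b + w * x)) ** 2 for x, y in zip(x_cp, y_hat))
    return err + lam * w ** 2

for lam in [0, 1e3, 1e5, 1e7]:
    # closed-form minimizer with b fixed at 0: w* = sum(x*y) / (sum(x^2) + lam)
    w_star = sum(x * y for x, y in zip(x_cp, y_hat)) / (sum(x * x for x in x_cp) + lam)
    train_err = regularized_loss(0.0, w_star, 0.0)  # squared-error part only
    print(lam, w_star, train_err)  # larger lambda -> smaller w*, larger training error
```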