不喜欢学习数学英语而产生的眼高手低的典范

# definition

  • Field of study that gives computers the ability to learn without being explicitly programmed Arthur Samuel(1959)
  • A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E. Tom Mitchell

# Supervised learning(teach the machine to do sth)

  • the “right answers” data sets are given
  • regression (回归) problem(predict continuous value)
  • classification problem(discrete valued output(0 or 1 or 2,…))
  • can predict according to infinite features

# Notations:

  • m=Number of training examples

  • x’s=“input” variable/features

  • y’s=“output” variable/features

  • (x,y)=one training example

  • (x(i),y(i))=theithtrainingexample(x^{(i)},y^{(i)})=the\quad i^{th}\quad training\quad example

  • h=hypothesis, mapping from x’s to y’s

    eg.hθ(x)=θ0+θ1x(univariablelinearregression)eg.h_\theta(x)=\theta_0+\theta_1x\quad(univariable\quad linear\quad regression)

    1.png

# Univariable linear regression

  • Goal:tominimizeθ0,θ1J(θ0,θ1),whereJ(θ0,θ1)=12mi=1m(hθ(x(i))y(i))2(costfunction)Goal:to\quad minimize_{\theta_0,\theta_1}J(\theta_0,\theta_1),\\ where\quad J(\theta_0,\theta_1)=\frac{1}{2m}\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})^2(cost\quad function)\\

  • J is called cost function or squared error cost function

  • squared error cost function is the most used function for linear regression problem and works pretty well

# Gradient descent

  • to minimize function J

  • Outline:

    • Startwithsomeθ0,θ1Start\quad with\quad some\quad \theta_0,\theta_1

    • keepchangingθ0,θ1toreduceJ(θ0,θ1)keep\quad changing\quad \theta_0,\theta_1\quad to\quad reduce\quad J(\theta_0,\theta_1)

    • until we hopefully end up at a minimum

  • repeatuntilconvergence{θ0=θ0αθ0J(θ0,θ1)θ1=θ1αθ1J(θ0,θ1)}α is  called  learning  raterepeat\quad until\quad convergence\{\\ \theta_0'=\theta_0-\alpha\frac{\partial}{\partial\theta_0}J(\theta_0,\theta_1)\\ \theta_1'=\theta_1-\alpha\frac{\partial}{\partial\theta_1}J(\theta_0,\theta_1)\\ \}\\ \alpha \ is\; called\;learning\; rate

  • Noticing that theta0 and theta1 are simultaneously updated

  • θ0=1mi=1m(hθ(x(i))y(i))θ1=1mi=1m(hθ(x(i))y(i))x(i)\frac{\partial}{\partial\theta_0}=\frac{1}{m}\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})\\ \frac{\partial}{\partial\theta_1}=\frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})\cdot x^{(i)}

  • univariable linear regression is a convex function(a bowl shape function ), so it has no local optima other than the global optimum

  • "Batch" Gradient Descent : Each step of gradient descent uses all the training examples.(from 1 to m)

# Multivariate regression problem

  • n=number of features(variables)

  • xj(i)=theithtrainingexamplethejthfeaturex^{(i)}_j=the\quad i^{th}\quad training\quad example\quad the\quad j^{th}\quad feature

  • hθ(x)=i=0nθixi(definingx0=1)X(i)=[x0(i)x1(i)...xn(i)]Rn+1θ=[θ0θ1...θn]Rn+1hθ(x)=θTX(i)h_\theta(x)=\sum_{i=0}^n\theta_ix_i\quad (defining\quad x_0=1)\\ X^{(i)}=\begin{bmatrix} x_0^{(i)}\\ x_1^{(i)}\\ ...\\ x_n^{(i)}\\ \end{bmatrix} \in\mathbb{R}^{n+1} \quad \theta=\begin{bmatrix} \theta_0\\ \theta_1\\ ...\\ \theta_n \end{bmatrix} \in\mathbb{R}^{n+1}\\ h_\theta(x)=\theta^TX^{(i)}\\

  • J(θ)=J(θ0,θ1,...,θn)=12mi=1m(hθ(x(i))y(i))2J(\theta)=J(\theta_0,\theta_1,...,\theta_n)=\frac{1}{2m}\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})^2

  • Gradient descent:

    θj=θjα1mi=1m(hθ(xj(i))y(i))xj(i)θ=θαmXT(XθY)(wirte  in  matrix)\theta_j'=\theta_j-\alpha\frac{1}{m}\sum_{i=1}^m(h_\theta(x_j^{(i)})-y^{(i)})\cdot x^{(i)}_j\\ \theta'=\theta-\frac{\alpha}{m}*X^T(X\theta-Y)\quad(wirte\;in\;matrix)

# Practical tricks for making gradient descent work well

  • Feature Scaling(faster)

    • Idea: Make sure features are on a similar scale
    • *Get every feature into **approximately(not necessarily exact)*a [-1, 1] range

    E.g.

x1=sizes(02000feet2),x2=numberofbedrooms(15)x1=size2000,x2=#bedrooms5x1=size10002000,x2=#bedrooms25so  that  x1  and  x2  are  both  in  [0.5,0.5]x_1=sizes(0-2000feet^2),x2=number\,of\,bedrooms(1-5)\\ x_1'=\frac{size}{2000},x_2'=\frac{\#bedrooms}{5}\\ x_1''=\frac{size-1000}{2000},x_2''=\frac{\#bedrooms-2}{5}\\ so\;that\;x_1\;and\;x_2\;are\;both\;in\;[-0.5, 0.5]\\

let  xi=xixiˉmax{xi}min{xi}where  xiˉ=1mj=1mxi(j)let\;x_i'=\frac{x_i-\bar{x_i}}{max\{x_i\}-min\{x_i\}}\quad where\;\bar{x_i}=\frac{1}{m}\sum_{j=1}^mx^{(j)}_i

  • Choose learning rate properly

    • Declare convergence if your cost function decreases by less than 1e-3(?) in one iteration
    • If your cost function doesn’t decrease on every iteration, use smaller learning rate
  • Try to choose …, 0.001, 0.01, 0.1, 1,…

  • Choose appropriate features

  • Think about what features really determine the result in particular problem

  • Polynomial regression

    • tryhθ(x)=θ0+θ1(size)+θ2sizewhere  size=lengthwidthbetterthanθ0+θ1length+θ2widthtry\quad h_\theta(x)=\theta_0+\theta_1(size)+\theta_2\sqrt{size}\quad where\;size=length*width\\ better\quad than\quad \theta_0+\theta_1*length+\theta_2*width\\

    • pay attention to features’ scale

      hθ(x)=θ0+θ1size1000+θ2size32ifsizerangesin[1,1000],100032h_\theta(x)=\theta_0+\theta_1\frac{size}{1000}+\theta_2\frac{\sqrt{size}}{32}\\ if\quad size\quad ranges\quad in\quad [1,1000],\sqrt{1000}≈32

# Normal equation

  • X=[x0(1)=1x1(1)...xn(1)x0(2)=1x1(2)...xn(2)............x0(m)=1x1(m)...xn(m)]y=[y(1)y(2)...y(m)]J(θ)=12mi=1m(j=0nxj(i)θjy(i))2=12m(Xθy)T(Xθy)=12m(θTXTyT)(Xθy)=12m(θTXTXθθTXTyyTXθ+yTy)dJ(θ)dθ=12m(dθTdθXTXθ+(θTXTX)TdθdθdθTdθXTy(yTX)Tdθdθ)=12m(IXTXθ+XTXθTIXTyXTy)=12m((XTX+XTX)θ(XTy+XTy))so  if  dJ(θ)dθ=0XTXθXTy=0soθ=(XTX)1XTyin  fact  AX  is  also  a  vectord[y1y2...ym]d[x1x2...xn]=d[y1y2...ym]d[x1x2...xn]=[y1x1y2x1...ymx1y1x2y2x2...ymx2............y1xny2xn...ymxn]so  we  have  dXTdX=dXdX=I,dAXdX=ATdABdX=dAdXB+AdBdXwhile  X  is  a  vector  and  A  is  1×n  and  B  is  n×1X=\begin{bmatrix} x_0^{(1)}=1&x_1^{(1)}&...&x_n^{(1)}\\ x_0^{(2)}=1&x_1^{(2)}&...&x_n^{(2)}\\ ...&...&...&...\\ x_0^{(m)}=1&x_1^{(m)}&...&x_n^{(m)} \end{bmatrix}y=\begin{bmatrix} y^{(1)}\\ y^{(2)}\\ ...\\ y^{(m)} \end{bmatrix}\\ J(\theta)=\frac{1}{2m}\sum_{i=1}^m(\sum_{j=0}^nx_j^{(i)}\theta_j-y^{(i)})^2\\ =\frac{1}{2m}(X\theta-y)^T(X\theta-y)\\ =\frac{1}{2m}(\theta^TX^T-y^T)(X\theta-y)\\ =\frac{1}{2m}(\theta^TX^TX\theta-\theta^TX^Ty-y^TX\theta+y^Ty)\\ \frac{dJ(\theta)}{d\theta}=\frac{1}{2m}(\frac{d\theta^T}{d\theta}X^TX\theta+(\theta^TX^TX)^T\frac{d\theta}{d\theta}-\frac{d\theta^T}{d\theta}X^Ty-(y^TX)^T\frac{d\theta}{d\theta})\\ =\frac{1}{2m}(IX^TX\theta+X^TX\theta^TI-X^Ty-X^Ty)\\ =\frac{1}{2m}((X^TX+X^TX)\theta-(X^Ty+X^Ty))\\ so\;if\;\frac{dJ(\theta)}{d\theta}=0\\ X^TX\theta-X^Ty=0\\ so\quad \theta=(X^TX)^{-1}X^Ty\\**************************\\ in\;fact\;AX\;is\;also\;a\;vector\\ \frac{d\begin{bmatrix}y_1&y_2&...&y_m\end{bmatrix}}{d\begin{bmatrix}x_1\\x_2\\...\\x_n\end{bmatrix}}=\frac{d\begin{bmatrix}y_1\\y_2\\...\\y_m\end{bmatrix}}{d\begin{bmatrix}x_1\\x_2\\...\\x_n\end{bmatrix}}= \begin{bmatrix} \frac{\partial y_1}{\partial x_1}&\frac{\partial y_2}{\partial x_1}&...&\frac{\partial y_m}{\partial x_1}\\ \frac{\partial y_1}{\partial x_2}&\frac{\partial y_2}{\partial x_2}&...&\frac{\partial y_m}{\partial x_2}\\ ...&...&...&...\\ \frac{\partial y_1}{\partial x_n}&\frac{\partial y_2}{\partial x_n}&...&\frac{\partial y_m}{\partial x_n} \end{bmatrix}\\ so\;we\;have\;\frac{dX^T}{dX}=\frac{dX}{dX}=I,\frac{dAX}{dX}=A^T\\\frac{dAB}{dX}=\frac{dA}{dX}B+A\frac{dB}{dX}\quad while\;X\;is\;a\;vector\;and\;A\;is\;1\times n\;and\;B\;is\;n\times 1\\

  • When  (XTX)  is  noninvertible?Redundant  features(linearly  dependent)E.g.  x1=size  in  feet2,x2=size  in  m2,so  x13.282x2Too  many  features(E.g.  mn)When\;(X^TX)\;is\;non-invertible?\\ *Redundant\;features(linearly\;dependent)\\ E.g.\;x_1=size\;in\;feet^2,x_2=size\;in\;m^2,so\;x_1\equiv3.28^2x_2\\ *Too\;many\;features(E.g.\;m\leq n)

# Logistic Regression Classification

# binary class classification

  • y{0,1}y\in\{0,1\}

    0 refers to the absence of something while 1 refers to the presence of something

# logistic regression model
  • want  0hθ(x)1so  let  hθ(x(i))=g(θTX(i)),where  g(z)=11+ezhθ(x)=estimated  probability  that  y=1  on  input  xhθ(x)=P(y=1x;θ)P(y=1x;θ)>0.50hθ(x(i))want \; 0\leq h_\theta(x)\leq 1\\ so \; let\; h_\theta(x^{(i)})=g(\theta^TX^{(i)}),where\;g(z)=\frac{1}{1+e^{-z}}\\ h_\theta(x)=estimated\;probability\; that\; y=1\; on\; input\; x\\ h_\theta(x)=P(y=1|x;\theta)\\ P(y=1|x;\theta)>0.5\Leftrightarrow 0\leq h_\theta(x^{(i)})

  • Non-linear decision boundaries

    hθ(x)=g(θ0+θ1x1+θ2x2+θ3x12+θ4x22)h_\theta(x)=g(\theta_0+\theta_1x_1+\theta_2x_2+\theta_3x_1^2+\theta_4x_2^2)

    • the parameter theta determines the decision boundary
  • Cost function

    J(θ)=1mi=1mCost(hθ(x(i)),y(i))If  we  choose  Cost(hθ(x(i)),y(i))=12(hθ(x(i))y(i))2  like  linear  regression  problemthen  J(θ)  would  be  a  nonconvex  functionJ(\theta)=\frac{1}{m}\sum_{i=1}^mCost(h_\theta(x^{(i)}),y^{(i)})\\ If\;we\;choose\;Cost(h_\theta(x^{(i)}),y^{(i)})=\frac{1}{2}(h_\theta(x^{(i)})-y^{(i)})^2\;like\;linear\;regression\;problem\\ then\;J(\theta)\;would\;be\;a\;non-convex\;function

    Cost(hθ(x(i),y(i)))={log(hθ(x(i)))if  y=1log(1hθ(x(i)))if  y=0=y(i)log(hθ(x(i)))(1y(i))log(1hθ(x(i)))Cost(h_\theta(x^{(i)},y^{(i)}))=\begin{cases}-log(h_\theta(x^{(i)}))\quad if\;y=1\\-log(1-h_\theta(x^{(i)}))\quad if\;y=0\end{cases}\\ =-y^{(i)}log(h_\theta(x^{(i)}))-(1-y^{(i)})log(1-h_\theta(x^{(i)}))

    • The cost function derived from Maximum Likelihood Estimation and has a nice property that it’s convex

    J(θ)=1mi=1m[y(i)log(hθ(x(i)))+(1y(i))log(1hθ(x(i)))]minθJ(θ)Repeat{θj=θjαθjJ(θ)=θjαmi=1m(hθ(x(i))y(i))xj(i)}J(\theta)=-\frac{1}{m}\sum_{i=1}^m[y^{(i)}log(h_\theta(x^{(i)}))+(1-y^{(i)})log(1-h_\theta(x^{(i)}))]\\ min_\theta J(\theta)\\ Repeat\{\\ \theta_j'=\theta_j-\alpha\frac{\partial}{\partial\theta_j}J(\theta)\\ =\theta_j-\frac{\alpha}m\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}\\ \}\\

  • A  simple  proof(provided  by  SuBonan):sigmond(x)=sigmond(x)[1sigmond(x)]hθ(x(i))=sigmond(j=0nxj(i)θj)sohθ(x(i))θj=sigmond(j=0nxj(i)θj)[1sigmond(j=0nxj(i)θj)]xj(i)=hθ(x(i))[1hθ(x(i))]xj(i)whileCost(hθ(x(i)),y(i))=yln(hθ(x(i)))(1y)ln(1hθ(x(i)))soCost(hθ(x(i)),y)θj=yhθ(x(i))hθ(x(i))θj+1y1hθ(x(i))hθ(x(i))θj=[hθ(x(i))y]xj(i)J(θ)=1mi=1mCost(hθ(x(i)),y(i))soJ(θ)θj=1mi=1mθjCost(hθ(x(i)),y(i))=1mi=1m[hθ(x(i))y(i)]xj(i)A\;simple\;proof(provided\;by\;SuBonan):\\ sigmond'(x)=sigmond(x)[1-sigmond(x)]\\ h_\theta(x^{(i)})=sigmond(\sum_{j=0}^nx^{(i)}_j\theta_j)\\ so\quad \frac{\partial h_\theta(x^{(i)})}{\partial\theta_j}=sigmond(\sum_{j=0}^nx_j^{(i)}\theta_j)[1-sigmond(\sum_{j=0}^nx_j^{(i)}\theta_j)]x_j^{(i)}\\ =h_\theta(x^{(i)})[1-h_\theta(x^{(i)})]x_j^{(i)}\\ while\quad Cost(h_\theta(x^{(i)}),y^{(i)})=-yln(h_\theta(x^{(i)}))-(1-y)ln(1-h_\theta(x^{(i)}))\\ so\quad \frac{\partial Cost(h_\theta(x^{(i)}),y)}{\partial \theta_j}=\frac{-y}{h_\theta(x^{(i)})}\cdot \frac{\partial h_\theta(x^{(i)})}{\partial\theta_j}+\frac{1-y}{1-h_\theta(x^{(i)})}\cdot\frac{\partial h_\theta(x^{(i)})}{\partial\theta_j}\\ =[h_\theta(x^{(i)})-y]x_j^{(i)}\\ J(\theta)=\frac{1}{m}\sum_{i=1}^mCost(h_\theta(x^{(i)}),y^{(i)})\\ so\quad \frac{\partial J(\theta)}{\partial \theta_j}=\frac{1}{m}\sum_{i=1}^m\frac{\partial}{\partial\theta_j}Cost(h_\theta(x^{(i)}),y^{(i)})\\ =\frac{1}{m}\sum_{i=1}^m[h_\theta(x^{(i)})-y^{(i)}]x_j^{(i)}

  • So the vectorized implementation is:

    θ=θαmXT(sigmond(Xθ)Y)\theta'=\theta-\frac{\alpha}{m}X^T(sigmond(X\theta)-Y)

  • Advanced Optimization

    • Conjugate gradient
    • BFGS
    • L-BFGS
    • All algorithms above has following advantages:
      • No need to manually pick learning rate
      • Often faster than gradient descent

# Multiclass classfication

y{0,1,2,3,...}y\in\{0,1,2,3,...\}

# One-vs-all(one-vs-rest)
  • hθ(i)(X)=P(y=iX;θ)so  given  X  find  maxihθ(i)(X)h_\theta^{(i)}(X)=P(y=i|X;\theta)\\ so\;given\;X\;find\;max_ih_\theta^{(i)}(X)

# The problem of overfitting

2

  • Overfitting: If we have too many features, the learned hypothesis may fit the training set very well, but fail to generalize to new examples(predict prices on new examples).

  • Adressing overfitting:

    • Reduce features:

      • Manually select which features to keep.
      • Model selection algorithm
    • Regularization:

      • Keep all the features, but reduce magnitude/values of parameters theta.
      • Works well when we have a lot of features, each of which contributes a bit to predicting y.

# Cost Function

  • hθ(x)=θ0+θ1x+θ2x2+θ3x3Cost  function=1mi=1m[hθ(x)y]2+1000θ32h_\theta(x)=\theta_0+\theta_1x+\theta_2x^2+\theta_3x^3\\ Cost\;function=\frac{1}{m}\sum_{i=1}^m[h_\theta(x)-y]^2+1000\theta_3^2

    So that we can avoid theta3 being too large by minimizing cost function

  • But we don’t know which parameter theta to shrink in advance, so we choose to shrink all of them:

    J(θ)=12m[i=1m[hθ(x(i))y(i)]2+λj=1nθj2]λ=regularization  parameterJ(\theta)=\frac{1}{2m}[\sum_{i=1}^m[h_\theta(x^{(i)})-y^{(i)}]^2+\lambda\sum_{j=1}^n\theta_j^2]\\ \lambda=regularization\;parameter

    In practice, whether or not you choose to shrink theta0 makes very little difference to the results

# Regularized Linear Regression

  • Gradient descent:

θ=θαm[XT(XθY)+λθ]\theta'=\theta-\frac{\alpha}{m}[X^T(X\theta-Y)+\lambda\theta]

  • Normal equation:

    θ=(XTX+λIn+1)1XTyand  XTX+λIn+1  will  be  always  invertible\theta=(X^TX+\lambda I_{n+1})^{-1}X^Ty\\ and\;X^TX+\lambda I_{n+1}\;will\;be\;always\;invertible

# Regularized Logistic Regression

  • Gradient descent:

    θ=θαmXT(sigmond(Xθ)Y)+λαmθ\theta'=\theta-\frac{\alpha}{m}X^T(sigmond(X\theta)-Y)+\frac{\lambda\alpha}{m}\theta

# Neural Network

# Notations

ai(j)="activation"  of  unit  i  in  layer  jΘ(j)=matrix  of  weights  controlling  function  mapping  from  layer  jto  layer  j+1the  dimension  of  Θ(j)  is  sj+1×(sj+1)sj  is  the  number  of  units  in  layer  ja_i^{(j)}="activation"\;of\;unit\;i\;in\;layer\;j\\ \varTheta^{(j)}=matrix\;of\;weights\;controlling\;function\;mapping\;from\;layer\;j\\to\;layer\;j+1\\ the\;dimension\;of\;\varTheta^{(j)}\;is\;s_{j+1}\times(s_j+1)\\ s_j\;is\;the\;number\;of\;units\;in\;layer\;j

# Vectorized implementation

a(2)=sigmoid(Θ(1)X)Add  a0(2)=1a(3)=sigmoid(Θ(2)a(2))...output  layer:[P{output=1}P{output=2}...P{output=numOfLabels}]=sigmoid(Θ(n)a(n))but  P  is  just  prediction  not  the  answer,  soi=1numOfLabelsP{output=i}  does  not  necessarily  equal  to  1a^{(2)}=sigmoid(\varTheta^{(1)}X)\\ Add\;a_0^{(2)}=1\\ a^{(3)}=sigmoid(\varTheta^{(2)}a^{(2)})\\ ...\\ output\;layer:\\ \begin{bmatrix} P\{output=1\}\\ P\{output=2\}\\ ...\\ P\{output=numOfLabels\} \end{bmatrix}=sigmoid(\varTheta^{(n)}a^{(n)})\\ but\;P\;is\;just\;prediction\;not\;the\;answer,\;so\\ \sum_{i=1}^{numOfLabels}P\{output=i\}\;does\;not\;necessarily\;equal\;to\;1

# Cost Function

K=numOfLabelsL=numOfLayerslethΘ(x)RKrepresent  the  outputand  (hΘ(x))k  represent  the  ith  output  in  output  layerJ(Θ)=1m[i=1mk=1Kyk(i)log(hΘ(x(i))k)+(1yk(i))log(1hΘ(x(i))k)]+λ2ml=1L1i=2sl+1j=1sl+1(Θji(l))2K=numOfLabels\quad L=numOfLayers\\ let\quad h_\varTheta(x)\in\mathbb{R}^{K}\quad represent\;the\;output\\ and\;(h_\varTheta(x))_k\;represent\;the\;i^{th}\;output\;in\;output\;layer\\ J(\varTheta)=-\frac{1}{m}[\sum_{i=1}^m\sum_{k=1}^Ky_k^{(i)}log(h_\varTheta(x^{(i)})_k)+(1-y_k^{(i)})log(1-h_\varTheta(x^{(i)})_k)]\\+\frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=2}^{s_{l+1}}\sum_{j=1}^{s_l+1}(\varTheta_{ji}^{(l)})^2

# Backpropagation Algorithm

  • Gradient computation

    Supposing we have only one training example (m=1)

so  x,y  substitute  for  x(i),y(i)Intuition:δj(l)="error"  of  node  j  in  layer  lso\;x,y\;substitute\;for\;x^{(i)},y^{(i)}\\ Intuition:\delta_j^{(l)}="error"\;of\;node\;j\;in\;layer\;l\\

If the network has only four layers:

δ(4)=a(4)yδ(3)=(Θ(3))Tδ(4).sigmoid(Θ(2)a(2))=(Θ(3))Tδ(4).(sigmoid(Θ(2)a(2)).(1sigmoid(Θ(2)a(2)))=(Θ(3))Tδ(4).(a(3).(1a(3)))δ(2)=(Θ(2))Tδ(3).sigmoid(Θ(1)a(1))=(Θ(2))Tδ(3).(a(2).(1a(2)))and  input  layer  dont  have  δ(1)\delta^{(4)}=a^{(4)}-y\\ *\delta^{(3)}=(\varTheta^{(3)})^T\delta^{(4)}.*sigmoid'(\varTheta^{(2)}a^{(2)})\\ =(\varTheta^{(3)})^T\delta^{(4)}.*(sigmoid(\varTheta^{(2)}a^{(2)}).*(1-sigmoid(\varTheta^{(2)}a^{(2)}))\\ =(\varTheta^{(3)})^T\delta^{(4)}.*(a^{(3)}.*(1-a^{(3)}))\\ *\delta^{(2)}=(\varTheta^{(2)})^T\delta^{(3)}.*sigmoid'(\varTheta^{(1)}a^{(1)})\\ =(\varTheta^{(2)})^T\delta^{(3)}.*(a^{(2)}.*(1-a^{(2)}))\\ and\;input\;layer\;don't\;have\;\delta^{(1)}

Set  Δij(l)=0  (for  all  i,j,l)For  i=1  to  mSet  a(1)=x(i)forward  propagation  to  compute  a(l),l=2,3,...,LUsing  y(i),  compute  δ(L)=a(L)y(i)Compute  δ(L1),δ(L2),...,δ(2)Δ(l):=Δ(l)+δ(l+1)(a(l))TDij(l)=1mΔij(l)+λmΘij(l)  if  j0Dij(l)=1mΔij(l)  if  j=0Θij(l)J(Θ)=Dij(l)Θ(l)J(Θ)=D(l)Set\;\Delta_{ij}^{(l)}=0\;(for\;all\;i,j,l)\\ For\;i=1\;to\;m\\ Set\;a^{(1)}=x^{(i)}\\ forward\;propagation\;to\;compute\;a^{(l)},l=2,3,...,L\\ Using\;y^{(i)},\;compute\;\delta^{(L)}=a^{(L)}-y^{(i)}\\ Compute\;\delta^{(L-1)},\delta^{(L-2)},...,\delta^{(2)}\\ \Delta^{(l)}:=\Delta^{(l)}+\delta^{(l+1)}(a^{(l)})^T\\ D_{ij}^{(l)}=\frac{1}{m}\Delta_{ij}^{(l)}+\frac{\lambda}{m}\varTheta_{ij}^{(l)}\;if\;j\neq0\\ D_{ij}^{(l)}=\frac{1}{m}\Delta_{ij}^{(l)}\;if\;j=0\\ \frac{\partial}{\partial\varTheta_{ij}^{(l)}}J(\varTheta)=D_{ij}^{(l)}\\ \frac{\partial}{\partial\varTheta^{(l)}}J(\varTheta)=D^{(l)}

  • Gradient Checking

let  θ="unrolled"  Θ(l)θ=[Θ(1)(:);Θ(2)(:);...;Θ(L)(:)]Rnso  J(Θ(1),...,Θ(L))=J(θ)=J([θ1θ2...θn])and  θiJ(θ)=J([θ1...θi+ϵ...θn])J([θ1...θiϵ...θn])2ϵlet\;\theta="unrolled"\;\Theta^{(l)}\\ \theta=[\Theta^{(1)}(:);\Theta^{(2)}(:);...;\Theta^{(L)}(:)]\in\mathbb{R}^{n}\\ so\;J(\Theta^{(1)},...,\Theta^{(L)})=J(\theta)=J([\theta_1\quad\theta_2\quad...\quad\theta_n])\\ and\;\frac{\partial}{\partial\theta_i}J(\theta)=\frac{J([\theta_1\quad...\quad\theta_i+\epsilon\quad...\quad\theta_n])-J([\theta_1\quad...\quad\theta_i-\epsilon\quad...\quad\theta_n])}{2\epsilon}

1
2
3
4
5
6
7
for i = 1 : n
thetaPlus = theta;
thetaPlus(i) = thetaPlus(i) + EPSILON;
thetaMinus = theta;
thetaMinus(i) = thetaMinus(i) - EPSILON;
gradApprox(i) = (J(thetaPlus) - J(thetaMinus)) / (2 * EPSILON);
end;

Check that gradAppprox ≈ DVec, while epsilon = 1e-4

  • Implementation Note

    • Implement backprop to compute DVec
    • Implement numerical gradient check to compute gradApprox
    • Make sure they give similar values
    • Turn off gradient checking. Using backprop code for learning
      • numerical gradient checking is computenionally expensive compared with backprop
  • Random Initialization

    E.g.

    Theta1 = rand(10, 11) * (2 * INIT_EPSILON) - INIT_EPSILON;

    In order to break symmetry

# Bias and Variance

Take regularized linear regression problem as an example

Split the training examples into three parts randomly:

  • training set (to train and modify parameters) (around 60%)

    J(θ)=12m[i=1m[hθ(x(i))y(i)]2+λj=1nθj2]We  train  the  model  through  minimizing  J(θ)Jtrain(θ)=12m[i=1m[hθ(x(i))y(i)]2We  use  Jtrain(θ)  to  evaluate  how  well  our  model  fits  the  training  set(we  now  only  care  about  how  well  our  model  fits  the  training  set)J(\theta)=\frac{1}{2m}[\sum_{i=1}^m[h_\theta(x^{(i)})-y^{(i)}]^2+\lambda\sum_{j=1}^n\theta_j^2]\\ We\;train\;the\;model\;through\;minimizing\;J(\theta)\\ J_{train}(\theta)=\frac{1}{2m}[\sum_{i=1}^m[h_\theta(x^{(i)})-y^{(i)}]^2\\ We\;use\;J_{train}(\theta)\;to\;evaluate\;how\;well\;our\;model\;fits\;the\;training\;set\\ (we\;now\;only\;care\;about\;how\;well\;our\;model\;fits\;the\;training\;set)\\

  • validation set *(to prevent overfitting and help choose model(like degree of polynomial)) * (around 20%)

    Jcv(θ)=12m[i=1m[hθ(x(i))y(i)]2We  use  Jcv(θ)  to  prevent  overfitting.After  a  iteration,if  J(θ)  and  Jtrain(θ)  decreases  but  Jcv(θ)  increases,that  means  overfitting  problem.J_{cv}(\theta)=\frac{1}{2m}[\sum_{i=1}^m[h_\theta(x^{(i)})-y^{(i)}]^2\\ We\;use\;J_{cv}(\theta)\;to\;prevent\;overfitting.\\ After\;a\;iteration,if\;J(\theta)\;and\;J_{train}(\theta)\;decreases\;but\;J_{cv}(\theta)\;increases,\\ that\;means\;overfitting\;problem.

  • test set (only after finishing training to test the performance of the model) (around 20%)

    Jtest(θ)=12m[i=1m[hθ(x(i))y(i)]2After  all  trainings,we  use  Jtest(θ)  to  evaluate  the  performance  of  the  modelJ_{test}(\theta)=\frac{1}{2m}[\sum_{i=1}^m[h_\theta(x^{(i)})-y^{(i)}]^2\\ After\;all\;trainings,we\;use\;J_{test}(\theta)\;to\;evaluate\;the\;performance\;of\;the\;model

3

Problems:

  • High Bias(underfit)

    Jtrain(Θ)  and  Jcv(Θ)  are  both  highJ_{train}(\Theta)\;and\;J_{cv}(\Theta)\;are\;both\;high

4

  • High Variance(overfit)

    Jtrain(Θ)  will  be  lowJcv(Θ)>>Jtrain(Θ)J_{train}(\Theta)\;will\;be\;low\\ J_{cv}(\Theta)>>J_{train}(\Theta)

5

  • Solutions

    • Get more training examples to fix high variance
    • Try smaller sets of features to fix high variance
    • Try getting additional features to fix high bias
    • Try adding polynomial features to fix high bias
    • Try decreasing lambda to fix high bias
    • Try increasing lambda to fix high variance

# Error metrics for skewed classes

If class y=1 only counts for 0.5%, and y=0 counts for 99.5%, then y=1 is called a skewed class.

and predicting y=0 all the time may low the error rate, but it turns out not to be a good idea

predicted class↓/actual class→ 1 0
1 true positive false positive
0 false negative true negative

Precision=num  of  True  positivenum  of  Ture  positive+num  of  False  positiveRecall=num  of  True  positivenum  of  True  positive+num  of  False  negativeon  validation  setPrecision=\frac{num\;of\;True\;positive}{num\;of\;Ture\;positive+num\;of\;False\;positive}\\ Recall=\frac{num\;of\;True\;positive}{num\;of\;True\;positive+num\;of\;False\;negative}\\ *on\;validation\;set

  • Suppose we want to predict y=1 only if very confident

    0.9hθ(x)predict  y=1hθ(x)<0.9predict  y=00.9\leq h_\theta(x)\quad predict\;y=1\\ h_\theta(x)<0.9\quad predict\;y=0

    Then the precision will be higher and recall will be lower.

  • Suppose we want to avoid missing too many case of y=0

    0.3hθ(x)predict  y=1hθ(x)<0.3predict  y=00.3\leq h_\theta(x)\quad predict\;y=1\\ h_\theta(x)<0.3\quad predict\;y=0

    Then the precision will be lower and recall will be higher.

  • F1  Score:2×PrecisionRecallPrecision+RecallF_1\;Score:2\times\frac{Precision*Recall}{Precision+Recall}

    We use F1 Score to evaluate the performance of the algorithm on skewed classes.

# Support Vector machine(SVM)

6

minθ1m[i=1my(i)(log(hθ(x(i))))+(1y(i))(log(1hθ(x(i))))]+λ2mj=1nθj2Replace  log(z),log(1z)  with  cost1(z),cost0(z):minθ1m[i=1my(i)(cost1(hθ(x(i))))+(1y(i))(cost0(hθ(x(i)))]+λ2mj=1nθj2=minθi=1m[y(i)(cost1(hθ(x(i))))+(1y(i))(cost0(hθ(x(i)))]+λ2j=1nθj2=minθCi=1m[y(i)(cost1(hθ(x(i))))+(1y(i))(cost0(hθ(x(i)))]+12j=1nθj2min_\theta\frac{1}{m}[\sum_{i=1}^my^{(i)}(-log(h_\theta(x^{(i)})))+(1-y^{(i)})(-log(1-h_\theta(x^{(i)})))]+\frac{\lambda}{2m}\sum_{j=1}^n\theta_j^2\\ Replace\;-log(z),-log(1-z)\;with\;cost_1(z),cost_0(z):\\ min_\theta\frac{1}{m}[\sum_{i=1}^my^{(i)}(cost_1(h_\theta(x^{(i)})))+(1-y^{(i)})(cost_0(h_\theta(x^{(i)}))]+\frac{\lambda}{2m}\sum_{j=1}^n\theta_j^2\\ =min_\theta\sum_{i=1}^m[y^{(i)}(cost_1(h_\theta(x^{(i)})))+(1-y^{(i)})(cost_0(h_\theta(x^{(i)}))]+\frac{\lambda}{2}\sum_{j=1}^n\theta_j^2\\ =min_\theta C\sum_{i=1}^m[y^{(i)}(cost_1(h_\theta(x^{(i)})))+(1-y^{(i)})(cost_0(h_\theta(x^{(i)}))]+\frac{1}{2}\sum_{j=1}^n\theta_j^2

Large C: Lower bias, high variance

Small C: Higher bias, low variance

# Kernels

Given x, compute some new features depending on proximity to landmarks.

f1=similarity(x,l(1))=exl(1)22σ2=exp(j=1n(xjlj(1))22σ2)f2=similarity(x,l(2))=exl(2)22σ2=exp(j=1n(xjlj(2))22σ2)......hθ(x)=θ0+θ1f1+θ2f2+...predict  y=1  if  hθ(x)0  and  0  otherwisef_1=similarity(x,l^{(1)})=e^{-\frac{|x-l^{(1)}|^2}{2\sigma^2}}=exp(-\frac{\sum_{j=1}^n(x_j-l_j^{(1)})^2}{2\sigma^2})\\ f_2=similarity(x,l^{(2)})=e^{-\frac{|x-l^{(2)}|^2}{2\sigma^2}}=exp(-\frac{\sum_{j=1}^n(x_j-l_j^{(2)})^2}{2\sigma^2})\\ ......\\ h_\theta(x)=\theta_0+\theta_1f_1+\theta_2f_2+...\\ predict\;y=1\;if\;h_\theta(x)\geq 0\;and\;0\;otherwise

  • f is called Gaussian kernel. when |x - l|≈0, f≈1, and |x - l|=∞, f ≈0

  • x which is near l tends to be predicted positive.

We  choose  l(i)=x(i),i=1,2,3,...,mf1(i)=sim(x(i),x(1))f2(i)=sim(x(i),x(2))...fi(i)=sim(x(i),x(i))=1...fm(i)=sim(x(i),x(m))f(i)=[f0(i)f1(i)...fm(m)]Rm+1We\;choose\;l^{(i)}=x^{(i)},i=1,2,3,...,m\\ f_1^{(i)}=sim(x^{(i)},x^{(1)})\\ f_2^{(i)}=sim(x^{(i)},x^{(2)})\\ ...\\ f_i^{(i)}=sim(x^{(i)},x^{(i)})=1\\ ...\\ f_m^{(i)}=sim(x^{(i)},x^{(m)})\\ f^{(i)}=\begin{bmatrix}f_0^{(i)}\\f_1^{(i)}\\...\\f_m^{(m)}\end{bmatrix}\in\mathbb{R}^{m+1}

minθCi=1m[y(i)(cost1(θTf(i)))+(1y(i))(cost0(θTf(i)))]+12j=1mθj2nowθRm+1Large  σ2:Features  fi  vary  more  smoothly.High  bias,lower  varianceSmall  σ2:Features  fi  vary  less  smoothly.Lower  bias,Higher  variancemin_\theta C\sum_{i=1}^m[y^{(i)}(cost_1(\theta^Tf^{(i)}))+(1-y^{(i)})(cost_0(\theta^Tf^{(i)}))]+\frac{1}{2}\sum_{j=1}^m\theta_j^2\\ now\quad\theta\in\mathbb{R}^{m+1}\\ Large\;\sigma^2:Features\;f_i\;vary\;more\;smoothly.High\;bias,lower\;variance\\ Small\;\sigma^2:Features\;f_i\;vary\;less\;smoothly.Lower\;bias,Higher\;variance

  • If n is large (relative to m), Use logistic regression, or SVM without a kernel.

  • If n is small, m is intermediate, Use SVM with Gaussian kernel.

  • If n is small, m is large (n=1~1000, m=50000+), Create/add more features, then use logistic regression or SVM without a kernel.

  • Neural network likely to work well for most of these settings, but may be slower to train.

# Unsupervised learning

  • no given values or classification

  • Cluster data(genes) or Non-cluster data(Cocktail party problem)

  • find some “structure” of the dataset

# Clustering algorithm: K-means algorithm

Repeat:

1. Randomly choose two cluster centroids. The red one and the blue one.

2. Cluster assignment: color each of the data points red or blue depending on which cluster centroid it is more closer to.

3. Move blue cluster centroid to the average point of all blue points.

Move red cluster centroid to the average point of all red points.

More formally:

  • Input:

    • K (number of clusters)
    • Training set{x1, x2, …, xm}
  • Process:

    Randomly  initialize  K  cluster  centroids  μ1,μ2,...,μKRnRepeat:{for  i=1  to  mc(i)=index(from  1  to  K)  of  cluster  centroid  closest  to  x(i)(minc(i)=kx(i)μk2)for  k=1  to  Kμk=average  (mean)  of  points  assigned  to  cluser  k(μk=1mi=1m,c(i)=kx(i)Rn)}Randomly\;initialize\;K\;cluster\;centroids\;\mu_1,\mu_2,...,\mu_K\in\mathbb{R}^n\\ Repeat:\{\\ for\;i=1\;to\;m\\ c^{(i)}=index(from\;1\;to\;K)\;of\;cluster\;centroid\;closest\;to\;x^{(i)}\\ (min_{c^{(i)}=k}|x^{(i)}-\mu_k|^2)\\ for\;k=1\;to\;K\\ \mu_k=average\;(mean)\;of\;points\;assigned\;to\;cluser\;k\\ (\mu_k=\frac{1}{m}\sum_{i=1}^{m,c^{(i)=k}}x^{(i)}\in\mathbb{R}^n)\\ \}\\

Optimization objective(cost function):

J(c(1),...,c(m),μ1,...,μk)=1mi=1mx(i)μc(i)2J(c^{(1)},...,c^{(m)},\mu_1,...,\mu_k)=\frac{1}{m}\sum_{i=1}^m|x^{(i)}-\mu_{c^{(i)}}|^2\\

Randomly initialize and run K-means several times and find the lowest J

Elbow methord

7

*how to choose the number of cluster

# Dimensionality reduction:Principal Component Analysis Algorithm

  • Data preprocessing:

    replace  each  xj(i)  with  xj(i)1mi=1mxj(i)maxi=1:m{xj(i)}mini=1:m{xj(i)}Σ=1mi=1m(x(i))(x(i))TRn×ncompute  eigenvectors  of  matrix  Sigma  by[U,S,V]=svd(Σ)(Singular  value  decomposition)replace\;each\;x_j^{(i)}\;with\;\frac{x_j^{(i)}-\frac{1}{m}\sum_{i=1}^mx_j^{(i)}}{max_{i=1:m}\{x_j^{(i)}\}-min_{i=1:m}\{x_j^{(i)}\}}\\ \Sigma=\frac{1}{m}\sum_{i=1}^m(x^{(i)})(x^{(i)})^T\in\mathbb{R}^{n\times n}\\ compute\;'eigenvectors'\;of\;matrix\;Sigma\;by\\ [U,S,V]=svd(\Sigma)(Singular\;value\;decomposition)\\

  • PCA

    [U,S,V]=svd(Σ)U=[...u(1)u(2)u(3)...u(n)...]Ureduce=U(:,1:k)=[...u(1)u(2)u(3)...u(k)...]Rn×kz(i)=UreduceTx(i)Rk×1xapprox(i)=Ureducez(i)Rn×1(noting  that  UTU1...)[U,S,V]=svd(\Sigma)\\ U=\begin{bmatrix} |&|&|&...&|\\ u^{(1)}&u^{(2)}&u^{(3)}&...&u^{(n)}\\ |&|&|&...&|\\ \end{bmatrix}\\ U_{reduce}=U(:,1:k)=\begin{bmatrix} |&|&|&...&|\\ u^{(1)}&u^{(2)}&u^{(3)}&...&u^{(k)}\\ |&|&|&...&|\\ \end{bmatrix}\in\mathbb{R}^{n\times k}\\ z^{(i)}=U_{reduce}^Tx^{(i)}\in\mathbb{R}^{k\times1}\\ x_{approx}^{(i)}=U_{reduce}z^{(i)}\in\mathbb{R}^{n\times1}\\ (noting\;that\;U^T\neq U^{-1}...)