1 Backpropagation
In the previous post we introduced how to compute individual gradients. Building on that, this post explains how to apply the chain rule iteratively to compute the gradient of an expression, i.e. backpropagation. This is essential for understanding, developing, designing, and debugging Neural Networks.
The main problem this post addresses is the following.
The core problem of backpropagation: given a function $f(x)$, where $x$ is a vector of inputs, how do we compute the gradient of $f$ with respect to $x$, i.e. $\nabla f(x)$?
In the neural network setting, $f$ corresponds to the loss function and $x$ consists of the training data $(x_i, y_i), i = 1 \ldots N$, the weights $W$, and the biases $b$. Since the training data are fixed, we usually only compute the gradients with respect to $W$ and $b$ and use them to update these parameters so as to reduce the loss.
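As a minimal sketch of how these gradients drive the parameter update (the shapes and the step_size value below are illustrative assumptions, not from the original):

import numpy as np

# hypothetical parameter shapes, just to illustrate the update step
W = np.random.randn(10, 3072)
b = np.random.randn(10)

# dW and db would be produced by backpropagation; zeros are placeholders here
dW = np.zeros_like(W)
db = np.zeros_like(b)

step_size = 1e-3          # illustrative learning rate
W -= step_size * dW       # move the weights against the gradient
b -= step_size * db       # move the biases against the gradient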
1.1 Introduction
1.1.1 Simple Interpretation of the Gradient
Some simple gradients are summarized below.
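For example, for the add, multiply, and max gates:

$$f(x, y) = x + y \;\Rightarrow\; \frac{\partial f}{\partial x} = 1,\ \frac{\partial f}{\partial y} = 1$$

$$f(x, y) = xy \;\Rightarrow\; \frac{\partial f}{\partial x} = y,\ \frac{\partial f}{\partial y} = x$$

$$f(x, y) = \max(x, y) \;\Rightarrow\; \frac{\partial f}{\partial x} = \mathbb{1}(x \ge y),\ \frac{\partial f}{\partial y} = \mathbb{1}(y > x)$$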
1.1.2 Compound Expression, Chain Rule, Backpropagation
The gradient of a compound expression follows the chain rule.
(Figure: Chain Rule)
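As a small worked example of chaining local gradients, consider $f(x, y, z) = (x + y)\,z$ (the function and the values here are chosen purely for illustration):

# forward pass for f(x, y, z) = (x + y) * z
x, y, z = -2, 5, -4
q = x + y            # q = 3
f = q * z            # f = -12

# backward pass: apply the chain rule from the output back to the inputs
dfdq = z             # df/dq = z = -4
dfdz = q             # df/dz = q = 3
dfdx = 1.0 * dfdq    # dq/dx = 1, so df/dx = -4
dfdy = 1.0 * dfdq    # dq/dy = 1, so df/dy = -4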
1.2 Intuition
Backpropagation is an elegant local process. The basic unit is the gate. During the forward pass, each gate computes its output value $G_{out_i}$ and its local gradient $\frac{\partial G_{out_i}}{\partial G_{in_i}}$ from its known inputs $G_{in_i}$. During the backward pass, these previously computed local gradients are chained together by multiplication to obtain the gradient of the final output with respect to the initial input, $\frac{\partial G_{out_N}}{\partial G_{in_0}}$.
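A minimal sketch of this local view in code (the MultiplyGate class and its forward/backward interface are illustrative, not a fixed API):

class MultiplyGate:
    """A single gate: the forward pass computes the output from the inputs;
    the backward pass chains the upstream gradient with the local gradients."""
    def forward(self, a, b):
        self.a, self.b = a, b   # remember inputs for the backward pass
        return a * b
    def backward(self, dout):
        # local gradients: d(ab)/da = b, d(ab)/db = a, each chained with dout
        da = self.b * dout
        db = self.a * dout
        return da, db

gate = MultiplyGate()
out = gate.forward(3.0, -4.0)   # forward pass: out = -12.0
da, db = gate.backward(1.0)     # backward pass: da = -4.0, db = 3.0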
1.3 Best Practice
1.3.1 Modularity: Sigmoid Example
We can modularize several gates into a single gate; here we use the sigmoid function as an example.
(Figure: Modular Gate)
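As a sketch, a 2D sigmoid neuron $f(w, x) = \sigma(w_0 x_0 + w_1 x_1 + w_2)$ can be backpropagated with the sigmoid treated as a single gate, using $\frac{d\sigma(x)}{dx} = (1 - \sigma(x))\,\sigma(x)$ (the weights and inputs below are made-up example values):

import math

w = [2, -3, -3]   # example weights (last entry is the bias)
x = [-1, -2]      # example inputs

# forward pass
dot = w[0]*x[0] + w[1]*x[1] + w[2]
f = 1.0 / (1 + math.exp(-dot))   # sigmoid activation

# backward pass through the neuron
ddot = (1 - f) * f               # local gradient of the sigmoid gate
dx = [w[0] * ddot, w[1] * ddot]              # backprop into x
dw = [x[0] * ddot, x[1] * ddot, 1.0 * ddot]  # backprop into w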
1.3.2 Staged Computation
The following expression is just an exercise in using intermediate (staged) values for backpropagation; the forward pass below computes $f(x, y) = \frac{x + \sigma(y)}{\sigma(x) + (x + y)^2}$, where $\sigma$ denotes the sigmoid function. Intermediate values greatly simplify the computation, without the need to work out the gradient of every single tiny gate.
import math

x = 3 # example values
y = -4
# forward pass
sigy = 1.0 / (1 + math.exp(-y)) # sigmoid in numerator #(1)
num = x + sigy # numerator #(2)
sigx = 1.0 / (1 + math.exp(-x)) # sigmoid in denominator #(3)
xpy = x + y #(4)
xpysqr = xpy**2 #(5)
den = sigx + xpysqr # denominator #(6)
invden = 1.0 / den #(7)
f = num * invden # done! #(8)
# backprop f = num * invden
dnum = invden # gradient on numerator #(8)
dinvden = num #(8)
# backprop invden = 1.0 / den
dden = (-1.0 / (den**2)) * dinvden #(7)
# backprop den = sigx + xpysqr
dsigx = (1) * dden #(6)
dxpysqr = (1) * dden #(6)
# backprop xpysqr = xpy**2
dxpy = (2 * xpy) * dxpysqr #(5)
# backprop xpy = x + y
dx = (1) * dxpy #(4)
dy = (1) * dxpy #(4)
# backprop sigx = 1.0 / (1 + math.exp(-x))
dx += ((1 - sigx) * sigx) * dsigx # Notice += !! #(3)
# backprop num = x + sigy
dx += (1) * dnum #(2)
# This follows the multivariable chain rule in calculus: if a variable branches out to different parts of the circuit, the gradients that flow back into it add up.
dsigy = (1) * dnum #(2)
# backprop sigy = 1.0 / (1 + math.exp(-y))
dy += ((1 - sigy) * sigy) * dsigy #(1)
# done! phew
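One way to sanity-check the analytic gradients above is a centered-difference numerical gradient (a verification sketch; f below simply re-implements the forward pass as one function):

import math

def f(x, y):
    sigy = 1.0 / (1 + math.exp(-y))
    sigx = 1.0 / (1 + math.exp(-x))
    return (x + sigy) / (sigx + (x + y)**2)

h = 1e-5
x, y = 3, -4
dx_num = (f(x + h, y) - f(x - h, y)) / (2 * h)   # should closely match dx
dy_num = (f(x, y + h) - f(x, y - h)) / (2 * h)   # should closely match dy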
1.4 Patterns in Backward Flow
(Figure: Gate Patterns)
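In brief, the usual patterns are: the add gate distributes the upstream gradient unchanged to all of its inputs, the max gate routes it entirely to the largest input, and the multiply gate gives each input the other input's value times the upstream gradient. A small sketch with made-up values:

dout = 2.0            # upstream gradient arriving at each gate
x, y = 3.0, -4.0      # example gate inputs

# add gate: distributes dout unchanged to both inputs
dx_add, dy_add = dout, dout

# max gate: routes dout to the larger input, 0 to the other
dx_max = dout if x >= y else 0.0
dy_max = dout if y > x else 0.0

# multiply gate: swaps the input values as local gradients
dx_mul, dy_mul = y * dout, x * dout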
1.5 Gradients for Vectorized Operations
So far we have only considered single variables, but the same concepts extend smoothly to multiple variables (matrices and vectors).
import numpy as np

# forward pass
W = np.random.randn(5, 10)
X = np.random.randn(10, 3)
D = W.dot(X)
# now suppose we had the gradient on D from above in the circuit
dD = np.random.randn(*D.shape) # same shape as D
dW = dD.dot(X.T) #.T gives the transpose of the matrix
dX = W.T.dot(dD)
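A useful sanity check on vectorized gradients is dimension analysis: the gradient on a variable always has the same shape as the variable itself. Continuing the example above:

assert dW.shape == W.shape   # (5, 10)
assert dX.shape == X.shape   # (10, 3)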
2 Summary
(Figure: Linear Classification Summary)