1 Backpropagation

In the previous post we looked at how to compute individual gradients. Building on that, this post shows how to apply the chain rule repeatedly to compute the gradient of an expression, i.e. backpropagation. This matters a great deal for understanding, developing, designing, and debugging Neural Networks.

The main problem this post addresses is the following.

The core problem of backpropagation: given a function $f(x)$, where $x$ is a vector of inputs, how do we compute the gradient of $f$ with respect to $x$, i.e. $\nabla f(x)$?

In the setting of neural networks, $f$ is the loss function and $x$ consists of the training data $(x_i, y_i),\ i = 1 \ldots N$, the weights $W$, and the biases $b$. Since the training data are fixed, we usually only need the gradients with respect to $W$ and $b$, and we use them to update those parameters so as to reduce the loss.
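As a minimal sketch of that update step (the array shapes, the names dW, db, and learning_rate below are illustrative assumptions, not values from this post), a single gradient-descent step looks like this:

import numpy as np

# hypothetical parameters and gradients, only to illustrate the update rule
W = np.random.randn(10, 5)
b = np.random.randn(5)
dW = np.random.randn(10, 5)  # gradient of the loss with respect to W
db = np.random.randn(5)      # gradient of the loss with respect to b

learning_rate = 1e-3
W -= learning_rate * dW  # step against the gradient to decrease the loss
b -= learning_rate * db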

1.1 Introduction

1.1.1 Simple Interpretation of the Gradient

Some simple gradients are summarized below.
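As a short recap (these three gates are my own choice of standard examples), the kinds of single-gate derivatives in question are:

$$f(x,y) = xy:\quad \frac{\partial f}{\partial x} = y,\qquad \frac{\partial f}{\partial y} = x$$

$$f(x,y) = x + y:\quad \frac{\partial f}{\partial x} = 1,\qquad \frac{\partial f}{\partial y} = 1$$

$$f(x,y) = \max(x,y):\quad \frac{\partial f}{\partial x} = \mathbb{1}(x \ge y),\qquad \frac{\partial f}{\partial y} = \mathbb{1}(y \ge x)$$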

1.1.2 Compound Expressions, Chain Rule, Backpropagation

The gradient of a compound expression follows the chain rule.
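As a minimal sketch (the example $f(x,y,z) = (x+y)\,z$ and the inputs below are assumptions chosen for illustration), the gradient of a two-gate expression is obtained by multiplying the local gradients along the path:

x, y, z = -2.0, 5.0, -4.0  # example inputs

# forward pass
q = x + y   # add gate, q = 3
f = q * z   # multiply gate, f = -12

# backward pass: apply the chain rule from the output back to the inputs
dfdz = q            # df/dz = q
dfdq = z            # df/dq = z
dfdx = 1.0 * dfdq   # dq/dx = 1, chained with df/dq
dfdy = 1.0 * dfdq   # dq/dy = 1, chained with df/dq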


1.2 Intuition

Backpropagation is an elegant local process. The basic unit is the gate. During the forward pass, each gate computes its output value $G_{out_i}$ and its local gradient $\frac{\partial G_{out_i}}{\partial G_{in_i}}$ from the known inputs $G_{in_i}$. During the backward pass, these previously computed local gradients are chained together by multiplication to obtain the gradient of the final output with respect to the initial input, $\frac{\partial G_{out_N}}{\partial G_{in_0}}$.
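One way to make the gate-as-local-unit idea concrete is a tiny class with a forward and a backward method. This is only a sketch under my own naming (MultiplyGate is not an API from any library):

class MultiplyGate:
    """Computes out = a * b and remembers its inputs for the backward pass."""
    def forward(self, a, b):
        self.a, self.b = a, b
        return a * b

    def backward(self, dout):
        # local gradients (d(out)/da = b, d(out)/db = a) chained with the
        # upstream gradient dout by multiplication
        da = self.b * dout
        db = self.a * dout
        return da, db

gate = MultiplyGate()
out = gate.forward(2.0, -3.0)  # forward pass: out = -6.0
da, db = gate.backward(1.0)    # backward pass: da = -3.0, db = 2.0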

1.3 Best Practice

1.3.1 Modularity: Sigmoid Example

Several gates can be packaged into a single modular gate; here we take the sigmoid function as the example.
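A minimal sketch of this, treating the whole neuron $\sigma(w_0 x_0 + w_1 x_1 + w_2)$ as a single gate and using the fact that $\frac{d\sigma(z)}{dz} = \sigma(z)\,(1 - \sigma(z))$ (the example weights and inputs are assumptions):

import math

w = [2.0, -3.0, -3.0]  # example weights; the last entry plays the role of a bias
x = [-1.0, -2.0]       # example inputs

# forward pass
dot = w[0] * x[0] + w[1] * x[1] + w[2]
f = 1.0 / (1 + math.exp(-dot))  # sigmoid output

# backward pass through the entire sigmoid gate in one step
ddot = (1 - f) * f                            # sigmoid's local gradient, written via its output
dx = [w[0] * ddot, w[1] * ddot]               # gradients on the inputs
dw = [x[0] * ddot, x[1] * ddot, 1.0 * ddot]   # gradients on the weights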


1.3.2 Staged Computation

The expression below, $f(x,y) = \frac{x + \sigma(y)}{\sigma(x) + (x + y)^2}$ (where $\sigma$ is the sigmoid function), is just an exercise in using intermediate (staged) values for backpropagation. Intermediate values greatly simplify the computation; there is no need to work out the gradient of every tiny gate separately.

import math

x = 3 # example values
y = -4

# forward pass
sigy = 1.0 / (1 + math.exp(-y)) # sigmoid in numerator   #(1)
num = x + sigy # numerator                               #(2)
sigx = 1.0 / (1 + math.exp(-x)) # sigmoid in denominator #(3)
xpy = x + y                                              #(4)
xpysqr = xpy**2                                          #(5)
den = sigx + xpysqr # denominator                        #(6)
invden = 1.0 / den                                       #(7)
f = num * invden # done!                                 #(8)

# backprop f = num * invden
dnum = invden # gradient on numerator                             #(8)
dinvden = num                                                     #(8)
# backprop invden = 1.0 / den 
dden = (-1.0 / (den**2)) * dinvden                                #(7)
# backprop den = sigx + xpysqr
dsigx = (1) * dden                                                #(6)
dxpysqr = (1) * dden                                              #(6)
# backprop xpysqr = xpy**2
dxpy = (2 * xpy) * dxpysqr                                        #(5)
# backprop xpy = x + y
dx = (1) * dxpy                                                   #(4)
dy = (1) * dxpy                                                   #(4)
# backprop sigx = 1.0 / (1 + math.exp(-x))
dx += ((1 - sigx) * sigx) * dsigx # Notice += !!                  #(3)
# backprop num = x + sigy
dx += (1) * dnum                                                  #(2)
# This follows the multivariable chain rule in Calculus: if a variable
# branches out to different parts of the circuit, the gradients that flow
# back to it add up.
dsigy = (1) * dnum                                                #(2)
# backprop sigy = 1.0 / (1 + math.exp(-y))
dy += ((1 - sigy) * sigy) * dsigy                                 #(1)
# done! phew
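As an optional sanity check (my own addition, not part of the exercise; f_check and h are hypothetical names), the analytic gradients above can be compared against a centered numerical difference:

def f_check(x, y):
    # same function as above, recomputed from scratch
    return (x + 1.0 / (1 + math.exp(-y))) / (1.0 / (1 + math.exp(-x)) + (x + y)**2)

h = 1e-5
num_dx = (f_check(x + h, y) - f_check(x - h, y)) / (2 * h)
num_dy = (f_check(x, y + h) - f_check(x, y - h)) / (2 * h)
print(num_dx, dx)  # the numerical and analytic gradients should nearly agree
print(num_dy, dy)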

1.4 Patterns in Backward Flow

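The recurring patterns are easy to state: the add gate distributes the upstream gradient unchanged to all of its inputs, the max gate routes it entirely to the input that was the maximum, and the multiply gate gives each input the other input times the upstream gradient. A minimal sketch (the example values are assumptions):

x, y = 3.0, -4.0
dout = 2.0  # some upstream gradient arriving at the gate's output

# add gate: distributes the gradient equally to both inputs
dx_add, dy_add = 1.0 * dout, 1.0 * dout

# max gate: routes the gradient only to the larger input
dx_max = dout if x >= y else 0.0
dy_max = dout if y > x else 0.0

# multiply gate: each input receives the other input times the gradient
dx_mul, dy_mul = y * dout, x * dout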

1.5 Gradients for Vectorized Operations

So far we have only dealt with single variables, but the same ideas extend smoothly to multiple variables (matrices).

import numpy as np

# forward pass
W = np.random.randn(5, 10)
X = np.random.randn(10, 3)
D = W.dot(X)

# now suppose we had the gradient on D from above in the circuit
dD = np.random.randn(*D.shape) # same shape as D
dW = dD.dot(X.T) #.T gives the transpose of the matrix
dX = W.T.dot(dD)
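A handy check here (my own note) is that each gradient must have the same shape as the variable it belongs to, which is often enough to decide where the transpose goes and on which side to multiply:

assert dW.shape == W.shape  # (5, 10): dD (5, 3) dotted with X.T (3, 10)
assert dX.shape == X.shape  # (10, 3): W.T (10, 5) dotted with dD (5, 3)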

2 Summary

In short: backpropagation computes gradients by a purely local recipe. During the forward pass every gate computes its output and its local gradient; during the backward pass those local gradients are chained by multiplication, and gradients that reach the same variable along different branches add up. Staging the computation with intermediate values keeps the backward pass simple, and the same recipe carries over to vectorized (matrix) operations, where keeping the shapes consistent is the main bookkeeping concern.
