Deeplearning Algorithms tutorial

Google's AI sits at the global forefront, with image recognition, speech recognition, and autonomous driving already deployed. Baidu has, in effect, taken up the banner of AI in China, covering autonomous driving, intelligent assistants, image recognition, and many other areas. Apple has also begun to embrace machine learning across the board, entering the smart speaker market and building workstation-class Macs. Meanwhile, Tencent's deep learning platform Mariana already powers WeChat's speech-recognition voice input, its open speech platform, and long-press voice-to-text conversion, and is starting to be applied to image recognition in WeChat. All of the world's top ten technology companies are pushing hard on both AI research and its applications. Getting started is difficult, but once you are in, mastery is not far away!

Machine learning comes in three main flavors: supervised learning, unsupervised learning, and semi-supervised learning.

(1) Supervised learning: a function is learned from a given training set so that, when new data arrives, the function can predict the corresponding output. The training set must contain both inputs and outputs, i.e. features and targets, and the targets are labeled. Established supervised algorithms include classifiers such as naive Bayes, SVM, ID3, C4.5 and other decision-tree methods, as well as today's highly popular artificial neural networks, e.g. BP (backpropagation) networks, RBF networks, Hopfield networks, deep belief networks, and convolutional neural networks. An artificial neural network analyzes data in a way loosely modeled on the human brain: it consists of an input (visible) layer, hidden layers, and an output layer, each made up of neurons whose activations (on or off) are determined by the data. Supervised algorithms can also perform regression; the most commonly cited example is logistic regression (which, strictly speaking, is a regression-style model used for classification).

(2) Unsupervised learning: in contrast to supervised learning, the class labels of the training set are unknown, and the number (or set) of classes to learn may not be known in advance. Common unsupervised algorithms include clustering and association-rule mining, e.g. K-means and the Apriori algorithm.

(3) Semi-supervised learning: falls between supervised and unsupervised learning; the EM algorithm is a typical example.

Research in machine learning today proceeds mainly along three lines: (1) task-oriented studies, which analyze and improve the performance of learning systems on a fixed set of tasks; (2) cognitive modeling, which studies the human learning process and simulates it computationally; and (3) theoretical analysis, which explores the space of possible algorithms and domain-independent learning methods at the theoretical level.

Multilayer Perceptron (MLP)

A multilayer perceptron (MLP) is a feedforward artificial neural network that maps a set of input vectors to a set of output vectors. An MLP can be viewed as a directed graph made up of several layers of nodes, with each layer fully connected to the next. Apart from the input nodes, every node is a neuron (or processing unit) with a nonlinear activation function. A supervised learning technique called backpropagation is commonly used to train MLPs. The MLP is a generalization of the perceptron and overcomes the perceptron's inability to recognize data that is not linearly separable.

If every neuron's activation function is linear, then an MLP with any number of layers can be reduced to an equivalent single-layer perceptron.
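To see why, here is a small NumPy sketch (the layer shapes and random values are purely illustrative) showing that two stacked linear layers collapse into a single linear map:

```python
import numpy

rng = numpy.random.RandomState(0)

x = rng.randn(5)                          # illustrative input, D = 5
W1, b1 = rng.randn(4, 5), rng.randn(4)    # first linear "layer"
W2, b2 = rng.randn(3, 4), rng.randn(3)    # second linear "layer"

# two stacked linear layers ...
two_layers = W2.dot(W1.dot(x) + b1) + b2

# ... are exactly one equivalent linear layer
W, b = W2.dot(W1), W2.dot(b1) + b2
one_layer = W.dot(x) + b

print(numpy.allclose(two_layers, one_layer))  # True
```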

In fact, an MLP can use any form of activation function, such as a step function or the logistic sigmoid function, but to allow effective learning with backpropagation, the activation function must be restricted to differentiable ones. Because they are smoothly differentiable, S-shaped functions, in particular the hyperbolic tangent (tanh) and the logistic function, are widely adopted as activation functions.
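For reference, a minimal NumPy sketch of these two S-shaped activations and of the derivatives that backpropagation relies on (the formulas are standard; the helper names are ours):

```python
import numpy

def tanh(a):
    """Hyperbolic tangent: maps R to (-1, 1)."""
    return numpy.tanh(a)

def tanh_grad(a):
    """Derivative of tanh: 1 - tanh(a)^2."""
    return 1.0 - numpy.tanh(a) ** 2

def sigmoid(a):
    """Logistic sigmoid: maps R to (0, 1)."""
    return 1.0 / (1.0 + numpy.exp(-a))

def sigmoid_grad(a):
    """Derivative of the logistic sigmoid: sigmoid(a) * (1 - sigmoid(a))."""
    s = sigmoid(a)
    return s * (1.0 - s)
```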

The backpropagation algorithm normally used to train MLPs is a standard supervised learning method in pattern recognition, and it remains an active research topic in computational neuroscience and parallel distributed processing. MLPs have been shown to be universal function approximators, and they can be used to fit complex functions or to solve classification problems.

MLPs were a very popular machine learning method in the 1980s, with a wide range of applications such as speech recognition, image recognition, and machine translation, but from the 1990s onward they faced strong competition from the much simpler support vector machine. More recently, thanks to the success of deep learning, MLPs have regained attention.

A multilayer perceptron (or artificial neural network, ANN) with a single hidden layer is usually depicted as follows:

[Figure 1: a multilayer perceptron with a single hidden layer]

Formally, a single-hidden-layer MLP is a function f: R^D -> R^L, where D is the size of the input vector x and L is the size of the output vector f(x). In matrix notation it reads f(x) = G(b^{(2)} + W^{(2)}(s(b^{(1)} + W^{(1)} x))), where b^{(1)} and b^{(2)} are bias vectors, W^{(1)} and W^{(2)} are weight matrices, and G and s are activation functions.

The vector h(x) = Phi(x) = s(b^{(1)} + W^{(1)} x) constitutes the hidden layer. W^{(1)} is the weight matrix connecting the input vector to the hidden layer (of shape D_h x D in the notation above, so that W^{(1)} x is well defined; the code below stores it transposed, as a matrix of shape (n_in, n_out)). W^{(1)}_i holds the weights from the input units to the i-th hidden unit. The usual choice for s is tanh, with tanh(a) = (e^a - e^{-a}) / (e^a + e^{-a}), or the logistic sigmoid function, sigmoid(a) = 1 / (1 + e^{-a}).

We use tanh here because it typically trains faster (and sometimes also helps avoid poor local optima). Both tanh and sigmoid are scalar-to-scalar functions, but they extend naturally to vectors and tensors by being applied element-wise (each element is transformed separately, producing an output of the same shape).

The output vector is then obtained as o(x) = G(b^{(2)} + W^{(2)} h(x)).

We already used this form when classifying MNIST digits with logistic regression. As before, in the multi-class case, class-membership probabilities can be obtained by choosing G to be the softmax function.
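To make the notation concrete, here is a minimal NumPy forward pass for a single-hidden-layer MLP with s = tanh and G = softmax. The sizes and random parameters are illustrative only; the Theano code below builds the same computation symbolically and, following its own convention, stores each weight matrix transposed as (n_in, n_out).

```python
import numpy

rng = numpy.random.RandomState(1234)
D, D_h, L = 784, 500, 10                      # illustrative input/hidden/output sizes

# illustrative parameters theta = {W1, b1, W2, b2}
W1, b1 = rng.uniform(-0.1, 0.1, (D_h, D)), numpy.zeros(D_h)
W2, b2 = rng.uniform(-0.1, 0.1, (L, D_h)), numpy.zeros(L)

def softmax(a):
    e = numpy.exp(a - a.max())                # shifted for numerical stability
    return e / e.sum()

def f(x):
    h = numpy.tanh(b1 + W1.dot(x))            # hidden layer h(x) = s(b1 + W1 x)
    return softmax(b2 + W2.dot(h))            # output o(x) = G(b2 + W2 h(x))

probs = f(rng.randn(D))                       # class-membership probabilities
print(probs.sum())                            # ~1.0
```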

To train an MLP, we learn all of the model's parameters; here we use stochastic gradient descent with minibatches. The set of parameters to learn is theta = {W^{(2)}, b^{(2)}, W^{(1)}, b^{(1)}}.

The gradients of the loss with respect to these parameters, d(loss)/d(theta), can be obtained with the backpropagation algorithm (a special case of the chain rule of differentiation); conveniently, Theano performs this differentiation automatically.
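As a toy illustration of this automatic differentiation, independent of the MLP code below, the following sketch lets Theano derive the gradient of a small quadratic cost and applies plain gradient-descent updates (the cost and the learning rate are made up for the example):

```python
import theano
import theano.tensor as T

w = theano.shared(0.0, name='w')          # a single trainable parameter
x = T.scalar('x')
cost = (w * x - 1.0) ** 2                 # toy cost, minimized at w = 1/x

g_w = T.grad(cost, w)                     # Theano derives d(cost)/d(w) symbolically

train = theano.function(
    inputs=[x],
    outputs=cost,
    updates=[(w, w - 0.1 * g_w)]          # one gradient-descent step per call
)

for _ in range(100):
    train(2.0)
print(w.get_value())                      # close to 0.5
```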

Going from logistic regression to a multilayer perceptron, we focus here on the single-hidden-layer case. We start by writing a class for a single hidden layer; building the MLP is then just a matter of stacking a logistic regression layer on top of it.

```python
class HiddenLayer(object):
    def __init__(self, rng, input, n_in, n_out, W=None, b=None,
                 activation=T.tanh):
        """
        Typical hidden layer of a MLP: units are fully-connected and have
        sigmoidal activation function. Weight matrix W is of shape (n_in,n_out)
        and the bias vector b is of shape (n_out,).

        NOTE : The nonlinearity used here is tanh

        Hidden unit activation is given by: tanh(dot(input,W) + b)

        :type rng: numpy.random.RandomState
        :param rng: a random number generator used to initialize weights

        :type input: theano.tensor.dmatrix
        :param input: a symbolic tensor of shape (n_examples, n_in)

        :type n_in: int
        :param n_in: dimensionality of input

        :type n_out: int
        :param n_out: number of hidden units

        :type activation: theano.Op or function
        :param activation: Non linearity to be applied in the hidden
                           layer
        """
        self.input = input
```

The initial values of the weights of hidden layer i should be sampled uniformly from a symmetric interval that depends on the activation function.

For tanh, the interval is [-sqrt(6 / (n_in + n_out)), sqrt(6 / (n_in + n_out))], where n_in is the number of units in the previous layer and n_out is the number of units in layer i.

For the sigmoid function, the interval is [-4 * sqrt(6 / (n_in + n_out)), 4 * sqrt(6 / (n_in + n_out))].

This initialization ensures that, early in training, each neuron operates in a regime of its activation function where information can easily be propagated both upward (from inputs to outputs) and backward (from outputs to inputs).
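A minimal NumPy sketch of this sampling scheme, mirroring what the HiddenLayer code in the next section does (the layer sizes are illustrative):

```python
import numpy

rng = numpy.random.RandomState(1234)
n_in, n_out = 784, 500                    # illustrative fan-in / fan-out

bound = numpy.sqrt(6. / (n_in + n_out))   # symmetric bound for tanh units

W_tanh = rng.uniform(low=-bound, high=bound, size=(n_in, n_out))
W_sigmoid = 4 * rng.uniform(low=-bound, high=bound, size=(n_in, n_out))  # 4x wider for sigmoid
```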

Application Example

  1. """
  2. This tutorial introduces the multilayer perceptron using Theano.
  3. A multilayer perceptron is a logistic regressor where
  4. instead of feeding the input to the logistic regression you insert a
  5. intermediate layer, called the hidden layer, that has a nonlinear
  6. activation function (usually tanh or sigmoid) . One can use many such
  7. hidden layers making the architecture deep. The tutorial will also tackle
  8. the problem of MNIST digit classification.
  9. .. math::
  10. f(x) = G( b^{(2)} + W^{(2)}( s( b^{(1)} + W^{(1)} x))),
  11. References:
  12. - textbooks: "Pattern Recognition and Machine Learning" -
  13. Christopher M. Bishop, section 5
  14. """
  15. from __future__ import print_function
  16. __docformat__ = 'restructedtext en'
  17. import os
  18. import sys
  19. import timeit
  20. import numpy
  21. import theano
  22. import theano.tensor as T
  23. from logistic_sgd import LogisticRegression, load_data
  24. # start-snippet-1
  25. class HiddenLayer(object):
  26. def __init__(self, rng, input, n_in, n_out, W=None, b=None,
  27. activation=T.tanh):
  28. """
  29. Typical hidden layer of a MLP: units are fully-connected and have
  30. sigmoidal activation function. Weight matrix W is of shape (n_in,n_out)
  31. and the bias vector b is of shape (n_out,).
  32. NOTE : The nonlinearity used here is tanh
  33. Hidden unit activation is given by: tanh(dot(input,W) + b)
  34. :type rng: numpy.random.RandomState
  35. :param rng: a random number generator used to initialize weights
  36. :type input: theano.tensor.dmatrix
  37. :param input: a symbolic tensor of shape (n_examples, n_in)
  38. :type n_in: int
  39. :param n_in: dimensionality of input
  40. :type n_out: int
  41. :param n_out: number of hidden units
  42. :type activation: theano.Op or function
  43. :param activation: Non linearity to be applied in the hidden
  44. layer
  45. """
  46. self.input = input
  47. # end-snippet-1
  48. # `W` is initialized with `W_values` which is uniformely sampled
  49. # from sqrt(-6./(n_in+n_hidden)) and sqrt(6./(n_in+n_hidden))
  50. # for tanh activation function
  51. # the output of uniform if converted using asarray to dtype
  52. # theano.config.floatX so that the code is runable on GPU
  53. # Note : optimal initialization of weights is dependent on the
  54. # activation function used (among other things).
  55. # For example, results presented in [Xavier10] suggest that you
  56. # should use 4 times larger initial weights for sigmoid
  57. # compared to tanh
  58. # We have no info for other function, so we use the same as
  59. # tanh.
  60. if W is None:
  61. W_values = numpy.asarray(
  62. rng.uniform(
  63. low=-numpy.sqrt(6. / (n_in + n_out)),
  64. high=numpy.sqrt(6. / (n_in + n_out)),
  65. size=(n_in, n_out)
  66. ),
  67. dtype=theano.config.floatX
  68. )
  69. if activation == theano.tensor.nnet.sigmoid:
  70. W_values *= 4
  71. W = theano.shared(value=W_values, name='W', borrow=True)
  72. if b is None:
  73. b_values = numpy.zeros((n_out,), dtype=theano.config.floatX)
  74. b = theano.shared(value=b_values, name='b', borrow=True)
  75. self.W = W
  76. self.b = b
  77. lin_output = T.dot(input, self.W) + self.b
  78. self.output = (
  79. lin_output if activation is None
  80. else activation(lin_output)
  81. )
  82. # parameters of the model
  83. self.params = [self.W, self.b]
  84. # start-snippet-2
  85. class MLP(object):
  86. """Multi-Layer Perceptron Class
  87. A multilayer perceptron is a feedforward artificial neural network model
  88. that has one layer or more of hidden units and nonlinear activations.
  89. Intermediate layers usually have as activation function tanh or the
  90. sigmoid function (defined here by a ``HiddenLayer`` class) while the
  91. top layer is a softmax layer (defined here by a ``LogisticRegression``
  92. class).
  93. """
  94. def __init__(self, rng, input, n_in, n_hidden, n_out):
  95. """Initialize the parameters for the multilayer perceptron
  96. :type rng: numpy.random.RandomState
  97. :param rng: a random number generator used to initialize weights
  98. :type input: theano.tensor.TensorType
  99. :param input: symbolic variable that describes the input of the
  100. architecture (one minibatch)
  101. :type n_in: int
  102. :param n_in: number of input units, the dimension of the space in
  103. which the datapoints lie
  104. :type n_hidden: int
  105. :param n_hidden: number of hidden units
  106. :type n_out: int
  107. :param n_out: number of output units, the dimension of the space in
  108. which the labels lie
  109. """
  110. # Since we are dealing with a one hidden layer MLP, this will translate
  111. # into a HiddenLayer with a tanh activation function connected to the
  112. # LogisticRegression layer; the activation function can be replaced by
  113. # sigmoid or any other nonlinear function
  114. self.hiddenLayer = HiddenLayer(
  115. rng=rng,
  116. input=input,
  117. n_in=n_in,
  118. n_out=n_hidden,
  119. activation=T.tanh
  120. )
  121. # The logistic regression layer gets as input the hidden units
  122. # of the hidden layer
  123. self.logRegressionLayer = LogisticRegression(
  124. input=self.hiddenLayer.output,
  125. n_in=n_hidden,
  126. n_out=n_out
  127. )
  128. # end-snippet-2 start-snippet-3
  129. # L1 norm ; one regularization option is to enforce L1 norm to
  130. # be small
  131. self.L1 = (
  132. abs(self.hiddenLayer.W).sum()
  133. + abs(self.logRegressionLayer.W).sum()
  134. )
  135. # square of L2 norm ; one regularization option is to enforce
  136. # square of L2 norm to be small
  137. self.L2_sqr = (
  138. (self.hiddenLayer.W ** 2).sum()
  139. + (self.logRegressionLayer.W ** 2).sum()
  140. )
  141. # negative log likelihood of the MLP is given by the negative
  142. # log likelihood of the output of the model, computed in the
  143. # logistic regression layer
  144. self.negative_log_likelihood = (
  145. self.logRegressionLayer.negative_log_likelihood
  146. )
  147. # same holds for the function computing the number of errors
  148. self.errors = self.logRegressionLayer.errors
  149. # the parameters of the model are the parameters of the two layer it is
  150. # made out of
  151. self.params = self.hiddenLayer.params + self.logRegressionLayer.params
  152. # end-snippet-3
  153. # keep track of model input
  154. self.input = input
  155. def test_mlp(learning_rate=0.01, L1_reg=0.00, L2_reg=0.0001, n_epochs=1000,
  156. dataset='mnist.pkl.gz', batch_size=20, n_hidden=500):
  157. """
  158. Demonstrate stochastic gradient descent optimization for a multilayer
  159. perceptron
  160. This is demonstrated on MNIST.
  161. :type learning_rate: float
  162. :param learning_rate: learning rate used (factor for the stochastic
  163. gradient
  164. :type L1_reg: float
  165. :param L1_reg: L1-norm's weight when added to the cost (see
  166. regularization)
  167. :type L2_reg: float
  168. :param L2_reg: L2-norm's weight when added to the cost (see
  169. regularization)
  170. :type n_epochs: int
  171. :param n_epochs: maximal number of epochs to run the optimizer
  172. :type dataset: string
  173. :param dataset: the path of the MNIST dataset file from
  174. http://www.iro.umontreal.ca/~lisa/deep/data/mnist/mnist.pkl.gz
  175. """
  176. datasets = load_data(dataset)
  177. train_set_x, train_set_y = datasets[0]
  178. valid_set_x, valid_set_y = datasets[1]
  179. test_set_x, test_set_y = datasets[2]
  180. # compute number of minibatches for training, validation and testing
  181. n_train_batches = train_set_x.get_value(borrow=True).shape[0] // batch_size
  182. n_valid_batches = valid_set_x.get_value(borrow=True).shape[0] // batch_size
  183. n_test_batches = test_set_x.get_value(borrow=True).shape[0] // batch_size
  184. ######################
  185. # BUILD ACTUAL MODEL #
  186. ######################
  187. print('... building the model')
  188. # allocate symbolic variables for the data
  189. index = T.lscalar() # index to a [mini]batch
  190. x = T.matrix('x') # the data is presented as rasterized images
  191. y = T.ivector('y') # the labels are presented as 1D vector of
  192. # [int] labels
  193. rng = numpy.random.RandomState(1234)
  194. # construct the MLP class
  195. classifier = MLP(
  196. rng=rng,
  197. input=x,
  198. n_in=28 * 28,
  199. n_hidden=n_hidden,
  200. n_out=10
  201. )
  202. # start-snippet-4
  203. # the cost we minimize during training is the negative log likelihood of
  204. # the model plus the regularization terms (L1 and L2); cost is expressed
  205. # here symbolically
  206. cost = (
  207. classifier.negative_log_likelihood(y)
  208. + L1_reg * classifier.L1
  209. + L2_reg * classifier.L2_sqr
  210. )
  211. # end-snippet-4
  212. # compiling a Theano function that computes the mistakes that are made
  213. # by the model on a minibatch
  214. test_model = theano.function(
  215. inputs=[index],
  216. outputs=classifier.errors(y),
  217. givens={
  218. x: test_set_x[index * batch_size:(index + 1) * batch_size],
  219. y: test_set_y[index * batch_size:(index + 1) * batch_size]
  220. }
  221. )
  222. validate_model = theano.function(
  223. inputs=[index],
  224. outputs=classifier.errors(y),
  225. givens={
  226. x: valid_set_x[index * batch_size:(index + 1) * batch_size],
  227. y: valid_set_y[index * batch_size:(index + 1) * batch_size]
  228. }
  229. )
  230. # start-snippet-5
  231. # compute the gradient of cost with respect to theta (sorted in params)
  232. # the resulting gradients will be stored in a list gparams
  233. gparams = [T.grad(cost, param) for param in classifier.params]
  234. # specify how to update the parameters of the model as a list of
  235. # (variable, update expression) pairs
  236. # given two lists of the same length, A = [a1, a2, a3, a4] and
  237. # B = [b1, b2, b3, b4], zip generates a list C of same size, where each
  238. # element is a pair formed from the two lists :
  239. # C = [(a1, b1), (a2, b2), (a3, b3), (a4, b4)]
  240. updates = [
  241. (param, param - learning_rate * gparam)
  242. for param, gparam in zip(classifier.params, gparams)
  243. ]
  244. # compiling a Theano function `train_model` that returns the cost, but
  245. # in the same time updates the parameter of the model based on the rules
  246. # defined in `updates`
  247. train_model = theano.function(
  248. inputs=[index],
  249. outputs=cost,
  250. updates=updates,
  251. givens={
  252. x: train_set_x[index * batch_size: (index + 1) * batch_size],
  253. y: train_set_y[index * batch_size: (index + 1) * batch_size]
  254. }
  255. )
  256. # end-snippet-5
  257. ###############
  258. # TRAIN MODEL #
  259. ###############
  260. print('... training')
  261. # early-stopping parameters
  262. patience = 10000 # look as this many examples regardless
  263. patience_increase = 2 # wait this much longer when a new best is
  264. # found
  265. improvement_threshold = 0.995 # a relative improvement of this much is
  266. # considered significant
  267. validation_frequency = min(n_train_batches, patience // 2)
  268. # go through this many
  269. # minibatche before checking the network
  270. # on the validation set; in this case we
  271. # check every epoch
  272. best_validation_loss = numpy.inf
  273. best_iter = 0
  274. test_score = 0.
  275. start_time = timeit.default_timer()
  276. epoch = 0
  277. done_looping = False
  278. while (epoch < n_epochs) and (not done_looping):
  279. epoch = epoch + 1
  280. for minibatch_index in range(n_train_batches):
  281. minibatch_avg_cost = train_model(minibatch_index)
  282. # iteration number
  283. iter = (epoch - 1) * n_train_batches + minibatch_index
  284. if (iter + 1) % validation_frequency == 0:
  285. # compute zero-one loss on validation set
  286. validation_losses = [validate_model(i) for i
  287. in range(n_valid_batches)]
  288. this_validation_loss = numpy.mean(validation_losses)
  289. print(
  290. 'epoch %i, minibatch %i/%i, validation error %f %%' %
  291. (
  292. epoch,
  293. minibatch_index + 1,
  294. n_train_batches,
  295. this_validation_loss * 100.
  296. )
  297. )
  298. # if we got the best validation score until now
  299. if this_validation_loss < best_validation_loss:
  300. #improve patience if loss improvement is good enough
  301. if (
  302. this_validation_loss < best_validation_loss *
  303. improvement_threshold
  304. ):
  305. patience = max(patience, iter * patience_increase)
  306. best_validation_loss = this_validation_loss
  307. best_iter = iter
  308. # test it on the test set
  309. test_losses = [test_model(i) for i
  310. in range(n_test_batches)]
  311. test_score = numpy.mean(test_losses)
  312. print((' epoch %i, minibatch %i/%i, test error of '
  313. 'best model %f %%') %
  314. (epoch, minibatch_index + 1, n_train_batches,
  315. test_score * 100.))
  316. if patience <= iter:
  317. done_looping = True
  318. break
  319. end_time = timeit.default_timer()
  320. print(('Optimization complete. Best validation score of %f %% '
  321. 'obtained at iteration %i, with test performance %f %%') %
  322. (best_validation_loss * 100., best_iter + 1, test_score * 100.))
  323. print(('The code for file ' +
  324. os.path.split(__file__)[1] +
  325. ' ran for %.2fm' % ((end_time - start_time) / 60.)), file=sys.stderr)
  326. if __name__ == '__main__':
  327. test_mlp()
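Note that this listing imports LogisticRegression and load_data from a companion logistic_sgd module, so that file must be available on the Python path, and load_data expects the mnist.pkl.gz dataset from the URL given in the docstring. Executing the file directly calls test_mlp(), which trains the model with early stopping and prints the validation and test errors as training proceeds.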