Deep Learning Algorithms Tutorial

Google's AI is among the world leaders, with image recognition, speech recognition, and autonomous driving already deployed in practice. Baidu has in effect taken up the banner of AI in China, covering autonomous driving, intelligent assistants, image recognition, and many other areas. Apple has begun to embrace machine learning across the board, entering the smart home speaker market and building workstation-class Macs. Tencent's deep learning platform Mariana already powers WeChat's speech recognition products (the voice input method, the open speech platform, and long-press voice-message-to-text) and is beginning to be applied to image recognition in WeChat. All of the world's top ten technology companies are pushing hard on both AI theory and applications. Getting started is difficult, but once you are in, the experts are not far away!

Machine learning comes in three main forms: supervised learning, unsupervised learning, and semi-supervised learning.

(1) Supervised learning: learn a function from a given training set so that, when new data arrives, the function can predict the corresponding output. The training set must contain both inputs and outputs, that is, features and targets, and the targets are labeled. Established supervised algorithms include classifiers such as naive Bayes, SVM, ID3, C4.5, and other classification decision trees, as well as today's popular artificial neural networks: BP networks, RBF networks, Hopfield networks, deep belief networks, convolutional neural networks, and so on. An artificial neural network analyzes data in a way loosely modeled on the human brain: it has a visible (input) layer, hidden layers, and an output layer, each made up of neurons whose state, on or off, depends on the data. Supervised algorithms can also be used for regression, the most common example being logistic regression (which, strictly speaking, is a regression model used for classification).

(2) Unsupervised learning: in contrast to supervised learning, the class labels of the training set are unknown, and the number (or set) of classes to be learned may not be known in advance. Common unsupervised algorithms include clustering and association-rule mining, for example k-means and Apriori.

(3) Semi-supervised learning: sits between supervised and unsupervised learning; the EM algorithm is a commonly cited example.

Research in machine learning today proceeds mainly along three lines: 1) task-oriented research, which builds and analyzes learning systems that improve performance on a predetermined set of tasks; 2) cognitive modeling, which studies the human learning process and simulates it computationally; 3) theoretical analysis, which explores the space of possible learning algorithms and methods independently of any application domain.

Autoencoder

The autoencoder is an unsupervised learning algorithm, used mainly for dimensionality reduction and feature extraction. In deep learning, an autoencoder can be used before training starts to determine initial values for the weight matrices.

A weight matrix in a neural network can be viewed as a feature transformation of the input: the data is first encoded into another representation, and subsequent learning builds on that representation. When we initialize the weights, however, we know neither what role the initial values will play during training nor how the weights will change as training proceeds. A sensible goal, then, is that when the freshly initialized weight matrix encodes the data, the encoded data should preserve the main features of the original data. How do we measure whether the encoded data retains reasonably complete information? Answer: if the encoded data can easily be decoded back into the original data, we consider the information well preserved.

In an autoencoder, the original input x is weighted (W, b) and mapped through a sigmoid to produce y; y is then weighted and mapped back in the reverse direction to produce a reconstruction z.

Both weight sets (W, b) are trained by repeated iteration to minimize an error function, that is, to keep z as close to x as possible, ideally reconstructing x perfectly.

When that happens, we can say the first, forward set of weights (W, b) has succeeded: it has learned the key features of the input, for otherwise the reconstruction could not be so faithful.
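To make this concrete, here is a minimal numpy sketch (an illustration, not code from the original text): the input x is encoded as y = sigmoid(x·W1 + b1), decoded as z = sigmoid(y·W2 + b2), and both weight sets are adjusted by gradient descent on the squared reconstruction error. All sizes and names (W1, b1, W2, b2) are arbitrary choices for the example.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.RandomState(0)
n, d, h, lr = 256, 20, 8, 1.0       # samples, input dim, hidden dim, step size
X = rng.rand(n, d)                  # toy data in [0, 1]

# two weight sets: (W1, b1) encodes, (W2, b2) decodes
W1 = rng.normal(scale=0.1, size=(d, h)); b1 = np.zeros(h)
W2 = rng.normal(scale=0.1, size=(h, d)); b2 = np.zeros(d)

for step in range(2000):
    Y = sigmoid(X @ W1 + b1)        # encode: y = sigmoid(W x + b)
    Z = sigmoid(Y @ W2 + b2)        # decode: z = sigmoid(W' y + b')
    err = Z - X                     # we want z to approximate x
    # backpropagate the squared error through both sigmoid layers
    d2 = err * Z * (1 - Z)
    d1 = (d2 @ W2.T) * Y * (1 - Y)
    W2 -= lr * Y.T @ d2 / n; b2 -= lr * d2.mean(axis=0)
    W1 -= lr * X.T @ d1 / n; b1 -= lr * d1.mean(axis=0)
    if step % 500 == 0:
        print(step, np.mean(err ** 2))  # reconstruction error should fall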

Figure 1: Autoencoder

This procedure is interesting. First, it computes the error and updates the parameters without using any data labels, so it is unsupervised learning.

Second, using what amounts to a neural network with two hidden layers, it extracts features from the samples in a simple, brute-force way.

This two-hidden-layer design is debatable. The earliest autoencoders did use two weight sets (W, b), but Vincent's 2010 paper investigated the question and found that a single W suffices: setting W' = W^T, where W and W' are called tied weights, experiments show that a separately trained W' is just along for the ride and entirely unnecessary.

The reverse reconstruction matrix is reminiscent of a matrix inverse: if W^{-1} = W^T, then W is an orthogonal matrix. In other words, W can be trained toward an approximately orthogonal matrix.

Since W' is redundant, it has no further role once training ends. Forward propagation uses only W, which in effect pre-encodes the input before handing it to the next layer. That is why the model is called an autoencoder rather than an auto encoder-decoder.
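A sketch of the tied-weights variant described above, with the same toy setup as the previous snippet: the second matrix disappears and the decoder simply reuses the transpose of the encoder matrix, so W receives gradient contributions from both passes. Only the two bias vectors remain separate.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.RandomState(0)
n, d, h, lr = 256, 20, 8, 1.0
X = rng.rand(n, d)

# tied weights: decode with W' = W^T, so there is no second matrix to train
W = rng.normal(scale=0.1, size=(d, h))
b1 = np.zeros(h); b2 = np.zeros(d)

for step in range(2000):
    Y = sigmoid(X @ W + b1)         # encode with W
    Z = sigmoid(Y @ W.T + b2)       # decode with W^T
    err = Z - X
    d2 = err * Z * (1 - Z)
    d1 = (d2 @ W) * Y * (1 - Y)
    # W accumulates gradients from the encoding and the decoding pass
    W -= lr * (X.T @ d1 + d2.T @ Y) / n
    b1 -= lr * d1.mean(axis=0); b2 -= lr * d2.mean(axis=0)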

An autoencoder effectively creates one hidden layer, so a simple idea is to place it at the front of a deep network as a first-stage filter on the raw signal, reducing dimensionality and extracting features. This does raise a question, though: an autoencoder can be viewed as a nonlinear, souped-up version of PCA, yet the benefits of PCA are built on reducing dimensionality.

Now consider a CNN-style architecture, where the number of neurons per layer grows as the layers advance. If we pretrain it with autoencoders, aren't we increasing the dimensionality? Is that really safe?

Experimental results reported in the literature suggest that autoencoders do reasonably well even when they increase dimensionality. A likely reason is that a nonlinear network is very expressive: although the number of neurons grows, the contribution of each individual neuron is diluted.

At the same time, stochastic gradient descent gives the subsequent supervised learning a good starting point. On the whole, the increase in dimensionality does more good than harm.
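As a sketch of how such pretraining feeds into supervised learning (this continues the first numpy snippet above, whose trained encoder weights W1, b1 it reuses; the labels t are synthetic and purely illustrative): the encoder becomes the first layer of a small classifier, and the whole network is then fine-tuned with a logistic loss.

# continues the first snippet: W1, b1 are the encoder weights trained there
t = (X.mean(axis=1, keepdims=True) > 0.5).astype(float)  # synthetic labels
V = rng.normal(scale=0.1, size=(h, 1)); c = np.zeros(1)  # new output layer

for step in range(2000):
    H = sigmoid(X @ W1 + b1)        # pretrained layer as feature extractor
    p = sigmoid(H @ V + c)          # logistic output
    g = (p - t) / n                 # gradient of the mean cross-entropy loss
    dH = (g @ V.T) * H * (1 - H)    # fine-tune the pretrained layer as well
    V -= lr * H.T @ g; c -= lr * g.sum(axis=0)
    W1 -= lr * X.T @ dH; b1 -= lr * dH.sum(axis=0)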

Application Example

from __future__ import division
import tensorflow as tf
import numpy as np
import logging
import json
import os


class TextAutoencoder(object):
    """
    Class that encapsulates the encoder-decoder architecture to
    reconstruct pieces of text.
    """

    def __init__(self, lstm_units, embeddings, go, train=True,
                 train_embeddings=False, bidirectional=True):
        """
        Initialize the encoder/decoder and creates Tensor objects

        :param lstm_units: number of LSTM units
        :param embeddings: numpy array with initial embeddings
        :param go: index of the GO symbol in the embedding matrix
        :param train_embeddings: whether to adjust embeddings during training
        :param bidirectional: whether to create a bidirectional autoencoder
            (if False, a simple linear LSTM is used)
        """
        # EOS and GO share the same symbol. Only GO needs to be embedded, and
        # only EOS exists as a possible network output
        self.go = go
        self.eos = go

        self.bidirectional = bidirectional
        self.vocab_size = embeddings.shape[0]
        self.embedding_size = embeddings.shape[1]
        self.global_step = tf.Variable(0, name='global_step', trainable=False)

        # the sentence is the object to be memorized
        self.sentence = tf.placeholder(tf.int32, [None, None], 'sentence')
        self.sentence_size = tf.placeholder(tf.int32, [None],
                                            'sentence_size')
        self.l2_constant = tf.placeholder(tf.float32, name='l2_constant')
        self.clip_value = tf.placeholder(tf.float32, name='clip')
        self.learning_rate = tf.placeholder(tf.float32, name='learning_rate')
        self.dropout_keep = tf.placeholder(tf.float32, name='dropout_keep')

        self.decoder_step_input = tf.placeholder(tf.int32,
                                                 [None],
                                                 'prediction_step')

        name = 'decoder_fw_step_state_c'
        self.decoder_fw_step_c = tf.placeholder(tf.float32,
                                                [None, lstm_units], name)
        name = 'decoder_fw_step_state_h'
        self.decoder_fw_step_h = tf.placeholder(tf.float32,
                                                [None, lstm_units], name)
        self.decoder_bw_step_c = tf.placeholder(tf.float32,
                                                [None, lstm_units],
                                                'decoder_bw_step_state_c')
        self.decoder_bw_step_h = tf.placeholder(tf.float32,
                                                [None, lstm_units],
                                                'decoder_bw_step_state_h')

        with tf.variable_scope('autoencoder') as self.scope:
            self.embeddings = tf.Variable(embeddings, name='embeddings',
                                          trainable=train_embeddings)

            initializer = tf.glorot_normal_initializer()
            self.lstm_fw = tf.nn.rnn_cell.LSTMCell(lstm_units,
                                                   initializer=initializer)
            self.lstm_bw = tf.nn.rnn_cell.LSTMCell(lstm_units,
                                                   initializer=initializer)

            embedded = tf.nn.embedding_lookup(self.embeddings, self.sentence)
            embedded = tf.nn.dropout(embedded, self.dropout_keep)

            # encoding step
            if bidirectional:
                bdr = tf.nn.bidirectional_dynamic_rnn
                ret = bdr(self.lstm_fw, self.lstm_bw,
                          embedded, dtype=tf.float32,
                          sequence_length=self.sentence_size,
                          scope=self.scope)
            else:
                ret = tf.nn.dynamic_rnn(self.lstm_fw, embedded,
                                        dtype=tf.float32,
                                        sequence_length=self.sentence_size,
                                        scope=self.scope)
            _, self.encoded_state = ret
            if bidirectional:
                encoded_state_fw, encoded_state_bw = self.encoded_state

                # set the scope name used inside the decoder.
                # maybe there's a more elegant way to do it?
                fw_scope_name = self.scope.name + '/fw'
                bw_scope_name = self.scope.name + '/bw'
            else:
                encoded_state_fw = self.encoded_state
                fw_scope_name = self.scope

            self.scope.reuse_variables()

            # generate a batch of embedded GO
            # sentence_size has the batch dimension
            go_batch = self._generate_batch_go(self.sentence_size)
            embedded_eos = tf.nn.embedding_lookup(self.embeddings,
                                                  go_batch)
            embedded_eos = tf.reshape(embedded_eos,
                                      [-1, 1, self.embedding_size])
            decoder_input = tf.concat([embedded_eos, embedded], axis=1)

            # decoding step
            # We give the same inputs to the forward and backward LSTMs,
            # but each one has its own hidden state
            # their outputs are concatenated and fed to the softmax layer
            if bidirectional:
                outputs, _ = tf.nn.bidirectional_dynamic_rnn(
                    self.lstm_fw, self.lstm_bw, decoder_input,
                    self.sentence_size, encoded_state_fw, encoded_state_bw)

                # concat fw and bw outputs
                outputs = tf.concat(outputs, -1)
            else:
                outputs, _ = tf.nn.dynamic_rnn(
                    self.lstm_fw, decoder_input, self.sentence_size,
                    encoded_state_fw)

            self.decoder_outputs = outputs

        # now project the outputs to the vocabulary
        with tf.variable_scope('projection') as self.projection_scope:
            # logits has shape (batch, max_sentence_size, vocab_size)
            self.logits = tf.layers.dense(outputs, self.vocab_size)

        # tensors for running a model
        embedded_step = tf.nn.embedding_lookup(self.embeddings,
                                               self.decoder_step_input)
        state_fw = tf.nn.rnn_cell.LSTMStateTuple(self.decoder_fw_step_c,
                                                 self.decoder_fw_step_h)
        state_bw = tf.nn.rnn_cell.LSTMStateTuple(self.decoder_bw_step_c,
                                                 self.decoder_bw_step_h)
        with tf.variable_scope(fw_scope_name, reuse=True):
            ret_fw = self.lstm_fw(embedded_step, state_fw)
        step_output_fw, self.decoder_fw_step_state = ret_fw

        if bidirectional:
            with tf.variable_scope(bw_scope_name, reuse=True):
                ret_bw = self.lstm_bw(embedded_step, state_bw)
            step_output_bw, self.decoder_bw_step_state = ret_bw
            step_output = tf.concat(axis=1, values=[step_output_fw,
                                                    step_output_bw])
        else:
            step_output = step_output_fw

        with tf.variable_scope(self.projection_scope, reuse=True):
            self.projected_step_output = tf.layers.dense(step_output,
                                                         self.vocab_size)

        if train:
            self._create_training_tensors()

    def _create_training_tensors(self):
        """
        Create member variables related to training.
        """
        eos_batch = self._generate_batch_go(self.sentence_size)
        eos_batch = tf.reshape(eos_batch, [-1, 1])
        decoder_labels = tf.concat([self.sentence, eos_batch], -1)

        projection_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES,
                                            scope=self.projection_scope.name)
        # a bit ugly, maybe we should improve this?
        projection_w = [var for var in projection_vars
                        if 'kernel' in var.name][0]
        projection_b = [var for var in projection_vars
                        if 'bias' in var.name][0]

        # set the importance of each time step
        # 1 if before sentence end or EOS itself; 0 otherwise
        max_len = tf.shape(self.sentence)[1]
        mask = tf.sequence_mask(self.sentence_size + 1, max_len + 1,
                                tf.float32)
        num_actual_labels = tf.reduce_sum(mask)
        projection_w_t = tf.transpose(projection_w)

        # reshape to have batch and time steps in the same dimension
        decoder_outputs2d = tf.reshape(self.decoder_outputs,
                                       [-1, tf.shape(self.decoder_outputs)[-1]])
        labels = tf.reshape(decoder_labels, [-1, 1])
        sampled_loss = tf.nn.sampled_softmax_loss(
            projection_w_t, projection_b, labels, decoder_outputs2d, 100,
            self.vocab_size)

        masked_loss = tf.reshape(mask, [-1]) * sampled_loss
        self.loss = tf.reduce_sum(masked_loss) / num_actual_labels

        optimizer = tf.train.AdamOptimizer(self.learning_rate)
        gradients, v = zip(*optimizer.compute_gradients(self.loss))
        gradients, _ = tf.clip_by_global_norm(gradients, self.clip_value)
        self.train_op = optimizer.apply_gradients(zip(gradients, v),
                                                  global_step=self.global_step)

    def get_trainable_variables(self):
        """
        Return all trainable variables inside the model
        """
        return tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES)

    def train(self, session, save_path, train_data, valid_data,
              batch_size, epochs, learning_rate, dropout_keep,
              clip_value, report_interval):
        """
        Train the model

        :param session: tensorflow session
        :param train_data: Dataset object with training data
        :param valid_data: Dataset object with validation data
        :param batch_size: batch size
        :param learning_rate: initial learning rate
        :param dropout_keep: the probability that each LSTM input/output is kept
        :param epochs: how many epochs to train for
        :param clip_value: value to clip tensor norm during training
        :param save_path: folder to save the model
        :param report_interval: report after that many batches
        """
        saver = tf.train.Saver(self.get_trainable_variables(),
                               max_to_keep=1)

        best_loss = 10000
        accumulated_loss = 0
        batch_counter = 0
        num_sents = 0

        # get all data at once. we need all matrices with the same size,
        # or else they don't fit the placeholders
        # train_sents, train_sizes = train_data.join_all(self.go,
        #                                                self.num_time_steps,
        #                                                shuffle=True)
        # del train_data  # save memory...
        valid_sents, valid_sizes = valid_data.join_all(self.go,
                                                       shuffle=True)
        train_data.reset_epoch_counter()
        feeds = {self.clip_value: clip_value,
                 self.dropout_keep: dropout_keep,
                 self.learning_rate: learning_rate}

        while train_data.epoch_counter < epochs:
            batch_counter += 1
            train_sents, train_sizes = train_data.next_batch(batch_size)
            feeds[self.sentence] = train_sents
            feeds[self.sentence_size] = train_sizes

            _, loss = session.run([self.train_op, self.loss], feeds)

            # multiply by len because some batches may be smaller
            # (due to bucketing), then take the average
            accumulated_loss += loss * len(train_sents)
            num_sents += len(train_sents)

            if batch_counter % report_interval == 0:
                avg_loss = accumulated_loss / num_sents
                accumulated_loss = 0
                num_sents = 0

                # we can't use all the validation at once, since it would
                # take too much memory. running many small batches would
                # instead take too much time. So let's just sample it.
                sample_indices = np.random.randint(0, len(valid_data),
                                                   5000)
                validation_feeds = {
                    self.sentence: valid_sents[sample_indices],
                    self.sentence_size: valid_sizes[sample_indices],
                    self.dropout_keep: 1}

                loss = session.run(self.loss, validation_feeds)
                msg = '%d epochs, %d batches\t' % (train_data.epoch_counter,
                                                   batch_counter)
                msg += 'Avg batch loss: %f\t' % avg_loss
                msg += 'Validation loss: %f' % loss
                if loss < best_loss:
                    best_loss = loss
                    self.save(saver, session, save_path)
                    msg += '\t(saved model)'
                logging.info(msg)

    def save(self, saver, session, directory):
        """
        Save the autoencoder model and metadata to the specified
        directory.
        """
        model_path = os.path.join(directory, 'model')
        saver.save(session, model_path)
        metadata = {'vocab_size': self.vocab_size,
                    'embedding_size': self.embedding_size,
                    'num_units': self.lstm_fw.output_size,
                    'go': self.go,
                    'bidirectional': self.bidirectional
                    }
        metadata_path = os.path.join(directory, 'metadata.json')
        # text mode: json.dump writes str objects
        with open(metadata_path, 'w') as f:
            json.dump(metadata, f)

    @classmethod
    def load(cls, directory, session, train=False):
        """
        Load an instance of this class from a previously saved one.

        :param directory: directory with the model files
        :param session: tensorflow session
        :param train: if True, also create training tensors
        :return: a TextAutoencoder instance
        """
        model_path = os.path.join(directory, 'model')
        metadata_path = os.path.join(directory, 'metadata.json')
        with open(metadata_path, 'r') as f:
            metadata = json.load(f)
        dummy_embeddings = np.empty((metadata['vocab_size'],
                                     metadata['embedding_size'],),
                                    dtype=np.float32)

        ae = TextAutoencoder(metadata['num_units'], dummy_embeddings,
                             metadata['go'], train=train,
                             bidirectional=metadata['bidirectional'])
        vars_to_load = ae.get_trainable_variables()
        if not train:
            # if not flagged for training, the embeddings won't be in
            # the list
            vars_to_load.append(ae.embeddings)

        saver = tf.train.Saver(vars_to_load)
        saver.restore(session, model_path)

        return ae

    def encode(self, session, inputs, sizes):
        """
        Run the encoder to obtain the encoded hidden state

        :param session: tensorflow session
        :param inputs: 2-d array with the word indices
        :param sizes: 1-d array with size of each sentence
        :return: a 2-d numpy array with the hidden state
        """
        feeds = {self.sentence: inputs,
                 self.sentence_size: sizes,
                 self.dropout_keep: 1}
        state = session.run(self.encoded_state, feeds)
        if self.bidirectional:
            state_fw, state_bw = state
            return np.hstack((state_fw.c, state_bw.c))
        return state.c

    def run(self, session, inputs, sizes):
        """
        Run the autoencoder with the given data

        :param session: tensorflow session
        :param inputs: 2-d array with the word indices
        :param sizes: 1-d array with size of each sentence
        :return: a 2-d array (batch, output_length) with the answer
            produced by the autoencoder. The output length is not
            fixed; it stops after producing EOS for all items in the
            batch or reaching two times the maximum number of time
            steps in the inputs.
        """
        feeds = {self.sentence: inputs,
                 self.sentence_size: sizes,
                 self.dropout_keep: 1}

        state = session.run(self.encoded_state, feeds)
        if self.bidirectional:
            state_fw, state_bw = state
        else:
            state_fw = state

        time_steps = 0
        max_time_steps = 2 * len(inputs[0])
        answer = []
        input_symbol = self.go * np.ones_like(sizes, dtype=np.int32)

        # this array controls which sequences have already been finished by
        # the decoder, i.e., for which ones it already produced the END symbol
        sequences_done = np.zeros_like(sizes, dtype=bool)

        while True:
            # we could use tensorflow's rnn_decoder, but this gives us
            # finer control
            feeds = {self.decoder_fw_step_c: state_fw.c,
                     self.decoder_fw_step_h: state_fw.h,
                     self.decoder_step_input: input_symbol,
                     self.dropout_keep: 1}

            if self.bidirectional:
                feeds[self.decoder_bw_step_c] = state_bw.c
                feeds[self.decoder_bw_step_h] = state_bw.h
                ops = [self.projected_step_output,
                       self.decoder_fw_step_state,
                       self.decoder_bw_step_state]
                outputs, state_fw, state_bw = session.run(ops, feeds)
            else:
                ops = [self.projected_step_output,
                       self.decoder_fw_step_state]
                outputs, state_fw = session.run(ops, feeds)

            input_symbol = outputs.argmax(1)
            answer.append(input_symbol)

            # use an "additive" or in order to avoid infinite loops
            sequences_done |= (input_symbol == self.eos)

            if sequences_done.all() or time_steps > max_time_steps:
                break
            else:
                time_steps += 1

        # stack the per-step symbols into shape (batch, output_length)
        return np.vstack(answer).T

    def _generate_batch_go(self, like):
        """
        Generate a 1-d tensor with copies of EOS as big as the batch size.

        :param like: a tensor whose shape the returned embeddings should match
        :return: a tensor with shape as `like`
        """
        ones = tf.ones_like(like)
        return ones * self.go
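A hypothetical usage sketch for the class above (TensorFlow 1.x graph mode). The vocabulary size, embedding matrix, and token ids are random stand-ins chosen for the example; model.train() would additionally need the Dataset helper objects referred to in its docstring, which are not shown here.

import numpy as np
import tensorflow as tf

# random stand-ins for a real vocabulary and pretrained embeddings
vocab_size, embedding_size, go_index = 1000, 64, 0
embeddings = np.random.uniform(-0.1, 0.1,
                               (vocab_size, embedding_size)).astype(np.float32)

model = TextAutoencoder(lstm_units=128, embeddings=embeddings,
                        go=go_index, train=False, bidirectional=True)

# toy batch: two zero-padded sentences of word ids plus their true lengths
sents = np.array([[4, 9, 7, 2], [5, 3, 2, 0]], dtype=np.int32)
sizes = np.array([4, 3], dtype=np.int32)

with tf.Session() as session:
    session.run(tf.global_variables_initializer())
    # fixed-size sentence representations from the encoder alone
    states = model.encode(session, sents, sizes)
    # greedy decoding of the same batch back into word ids
    decoded = model.run(session, sents, sizes)
    # model.train(session, save_path, train_data, valid_data, ...) would
    # also require Dataset objects providing join_all() / next_batch()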