1-3 Example: Modeling Procedure for Texts

1. Data Preparation

The task on the IMDB dataset is to predict the sentiment label of a movie review from its text.

There are 20,000 text reviews in the training dataset and 5,000 in the testing dataset, each split evenly between positive and negative labels.

Pre-processing a text dataset is relatively involved: it includes word segmentation (needed for Chinese only, so not relevant to this demonstration), vocabulary construction, encoding, sequence padding, and data pipeline construction.

There are two popular methods of text preparation in TensorFlow.

The first one is to construct a text data generator using Tokenizer from tf.keras.preprocessing together with tf.keras.utils.Sequence.

The second one is to use tf.data.Dataset together with the pre-processing layer tf.keras.layers.experimental.preprocessing.TextVectorization.

The first method is more complex; the second is the native TensorFlow approach and is simpler. We demonstrate the second method in this example, but a rough sketch of the first follows for reference.
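The sketch below is illustrative only and is not used in the rest of the section; the tab-separated (label, text) file layout matches the pipeline built afterwards, while the generator class and its parameters are assumptions.

```python
# Hedged sketch of the first approach: Tokenizer + a tf.keras.utils.Sequence generator
import math
import pandas as pd
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

df = pd.read_csv("../data/imdb/train.csv", sep="\t", header=None,
                 names=["label", "text"])

tokenizer = Tokenizer(num_words=10000)  # keep the 10000 most frequent words
tokenizer.fit_on_texts(df["text"])

class TextSequence(tf.keras.utils.Sequence):
    """Yields (padded token ids, labels) batches suitable for model.fit()."""
    def __init__(self, texts, labels, batch_size=20, max_len=200):
        self.texts, self.labels = texts, labels
        self.batch_size, self.max_len = batch_size, max_len
    def __len__(self):
        return math.ceil(len(self.texts) / self.batch_size)
    def __getitem__(self, idx):
        sl = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
        seqs = tokenizer.texts_to_sequences(self.texts[sl])
        return pad_sequences(seqs, maxlen=self.max_len), self.labels[sl]

train_gen = TextSequence(df["text"].tolist(), df["label"].values)
```

Everything below uses the second method.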


```python
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import tensorflow as tf
from tensorflow.keras import models,layers,preprocessing,optimizers,losses,metrics
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
import re,string

train_data_path = "../data/imdb/train.csv"
test_data_path = "../data/imdb/test.csv"

MAX_WORDS = 10000  # Consider the 10000 words with the highest frequency of appearance
MAX_LEN = 200      # For each sample, preserve the first 200 words
BATCH_SIZE = 20

# Constructing the data pipeline
def split_line(line):
    # Each line has the layout: label<TAB>text
    arr = tf.strings.split(line,"\t")
    label = tf.expand_dims(tf.cast(tf.strings.to_number(arr[0]),tf.int32),axis = 0)
    text = tf.expand_dims(arr[1],axis = 0)
    return (text,label)

ds_train_raw = tf.data.TextLineDataset(filenames = [train_data_path]) \
    .map(split_line,num_parallel_calls = tf.data.experimental.AUTOTUNE) \
    .shuffle(buffer_size = 1000).batch(BATCH_SIZE) \
    .prefetch(tf.data.experimental.AUTOTUNE)

ds_test_raw = tf.data.TextLineDataset(filenames = [test_data_path]) \
    .map(split_line,num_parallel_calls = tf.data.experimental.AUTOTUNE) \
    .batch(BATCH_SIZE) \
    .prefetch(tf.data.experimental.AUTOTUNE)

# Constructing the dictionary
def clean_text(text):
    lowercase = tf.strings.lower(text)
    stripped_html = tf.strings.regex_replace(lowercase, '<br />', ' ')
    cleaned_punctuation = tf.strings.regex_replace(stripped_html,
         '[%s]' % re.escape(string.punctuation),'')
    return cleaned_punctuation

vectorize_layer = TextVectorization(
    standardize=clean_text,
    split = 'whitespace',
    max_tokens=MAX_WORDS-1,  # Leave one item for the placeholder
    output_mode='int',
    output_sequence_length=MAX_LEN)

ds_text = ds_train_raw.map(lambda text,label: text)
vectorize_layer.adapt(ds_text)
print(vectorize_layer.get_vocabulary()[0:100])

# Word encoding
ds_train = ds_train_raw.map(lambda text,label:(vectorize_layer(text),label)) \
    .prefetch(tf.data.experimental.AUTOTUNE)
ds_test = ds_test_raw.map(lambda text,label:(vectorize_layer(text),label)) \
    .prefetch(tf.data.experimental.AUTOTUNE)
```
```
[b'the', b'and', b'a', b'of', b'to', b'is', b'in', b'it', b'i', b'this', b'that', b'was', b'as', b'for', b'with', b'movie', b'but', b'film', b'on', b'not', b'you', b'his', b'are', b'have', b'be', b'he', b'one', b'its', b'at', b'all', b'by', b'an', b'they', b'from', b'who', b'so', b'like', b'her', b'just', b'or', b'about', b'has', b'if', b'out', b'some', b'there', b'what', b'good', b'more', b'when', b'very', b'she', b'even', b'my', b'no', b'would', b'up', b'time', b'only', b'which', b'story', b'really', b'their', b'were', b'had', b'see', b'can', b'me', b'than', b'we', b'much', b'well', b'get', b'been', b'will', b'into', b'people', b'also', b'other', b'do', b'bad', b'because', b'great', b'first', b'how', b'him', b'most', b'dont', b'made', b'then', b'them', b'films', b'movies', b'way', b'make', b'could', b'too', b'any', b'after', b'characters']
```
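As a quick sanity check (not part of the original example), one raw batch can be passed through the fitted layer to compare a review with its integer encoding:

```python
# Peek at one raw example and its integer encoding (illustrative check only)
for text, label in ds_train_raw.take(1):
    print(text[0])                   # the raw review string
    print(vectorize_layer(text)[0])  # its MAX_LEN integer token ids
```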

2. Model Definition

Usually there are three ways of modeling with the Keras APIs: sequential modeling with the Sequential() function, arbitrary modeling with the functional API, and customized modeling by inheriting from the base class Model.

In this example, we use customized modeling by inheriting from the base class Model.
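For comparison only, the same architecture written the first way might look like the sketch below; the hyperparameters mirror the subclassed model that is actually trained in this example.

```python
# Illustrative sketch: the same CNN expressed with Sequential()
tf.keras.backend.clear_session()
model_seq = models.Sequential([
    layers.Embedding(MAX_WORDS, 7, input_length=MAX_LEN),
    layers.Conv1D(16, kernel_size=5, activation="relu"),
    layers.MaxPool1D(),
    layers.Conv1D(128, kernel_size=2, activation="relu"),
    layers.MaxPool1D(),
    layers.Flatten(),
    layers.Dense(1, activation="sigmoid")
])
model_seq.summary()
```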

```python
# Actually, modeling with Sequential() or the functional API should be prioritized.
tf.keras.backend.clear_session()

class CnnModel(models.Model):
    def __init__(self):
        super(CnnModel, self).__init__()

    def build(self,input_shape):
        self.embedding = layers.Embedding(MAX_WORDS,7,input_length=MAX_LEN)
        self.conv_1 = layers.Conv1D(16, kernel_size= 5,name = "conv_1",activation = "relu")
        self.pool = layers.MaxPool1D()
        self.conv_2 = layers.Conv1D(128, kernel_size=2,name = "conv_2",activation = "relu")
        self.flatten = layers.Flatten()
        self.dense = layers.Dense(1,activation = "sigmoid")
        super(CnnModel,self).build(input_shape)

    def call(self, x):
        x = self.embedding(x)
        x = self.conv_1(x)
        x = self.pool(x)
        x = self.conv_2(x)
        x = self.pool(x)
        x = self.flatten(x)
        x = self.dense(x)
        return(x)

model = CnnModel()
model.build(input_shape =(None,MAX_LEN))
model.summary()
```
  1. Model: "cnn_model"
  2. _________________________________________________________________
  3. Layer (type) Output Shape Param #
  4. =================================================================
  5. embedding (Embedding) multiple 70000
  6. _________________________________________________________________
  7. conv_1 (Conv1D) multiple 576
  8. _________________________________________________________________
  9. max_pooling1d (MaxPooling1D) multiple 0
  10. _________________________________________________________________
  11. conv_2 (Conv1D) multiple 4224
  12. _________________________________________________________________
  13. flatten (Flatten) multiple 0
  14. _________________________________________________________________
  15. dense (Dense) multiple 6145
  16. =================================================================
  17. Total params: 80,945
  18. Trainable params: 80,945
  19. Non-trainable params: 0
  20. _________________________________________________________________
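As a quick check on the summary, the parameter counts follow directly from the layer shapes: the embedding holds 10000 × 7 = 70,000 weights; conv_1 has (5 × 7) × 16 + 16 = 576 parameters; conv_2 has (2 × 16) × 128 + 128 = 4,224; and since the convolutions and poolings shrink the sequence length from 200 to 48 (200 → 196 → 98 → 97 → 48), the flattened vector has 48 × 128 = 6,144 features, giving the dense layer 6,144 + 1 = 6,145 parameters and 80,945 in total.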

3. Model Training

There are three usual ways to train a model: the built-in method fit, the built-in method train_on_batch, and a customized training loop. Here we use the customized training loop.
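For reference, the simplest of the three would be the built-in fit method; a minimal sketch, reusing the ds_train and ds_test pipelines from above, is shown below. It is not the approach used in this example.

```python
# Hedged sketch: training with the built-in fit() instead of a custom loop
model.compile(optimizer=optimizers.Nadam(),
              loss=losses.BinaryCrossentropy(),
              metrics=[metrics.BinaryAccuracy()])
history = model.fit(ds_train, validation_data=ds_test, epochs=6)
```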

```python
# Time stamp
@tf.function
def printbar():
    ts = tf.timestamp()
    today_ts = tf.timestamp()%(24*60*60)

    # +8 shifts to UTC+8 (Beijing time)
    hour = tf.cast(today_ts//3600+8,tf.int32)%tf.constant(24)
    minute = tf.cast((today_ts%3600)//60,tf.int32)
    second = tf.cast(tf.floor(today_ts%60),tf.int32)

    def timeformat(m):
        if tf.strings.length(tf.strings.format("{}",m))==1:
            return(tf.strings.format("0{}",m))
        else:
            return(tf.strings.format("{}",m))

    timestring = tf.strings.join([timeformat(hour),timeformat(minute),
                timeformat(second)],separator = ":")
    tf.print("=========="*8+timestring)
```
```python
optimizer = optimizers.Nadam()
loss_func = losses.BinaryCrossentropy()

train_loss = metrics.Mean(name='train_loss')
train_metric = metrics.BinaryAccuracy(name='train_accuracy')

valid_loss = metrics.Mean(name='valid_loss')
valid_metric = metrics.BinaryAccuracy(name='valid_accuracy')

@tf.function
def train_step(model, features, labels):
    with tf.GradientTape() as tape:
        predictions = model(features,training = True)
        loss = loss_func(labels, predictions)
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))

    train_loss.update_state(loss)
    train_metric.update_state(labels, predictions)

@tf.function
def valid_step(model, features, labels):
    predictions = model(features,training = False)
    batch_loss = loss_func(labels, predictions)
    valid_loss.update_state(batch_loss)
    valid_metric.update_state(labels, predictions)

def train_model(model,ds_train,ds_valid,epochs):
    for epoch in tf.range(1,epochs+1):

        for features, labels in ds_train:
            train_step(model,features,labels)

        for features, labels in ds_valid:
            valid_step(model,features,labels)

        # The logs template should be modified according to the metric
        logs = 'Epoch={},Loss:{},Accuracy:{},Valid Loss:{},Valid Accuracy:{}'

        if epoch%1==0:
            printbar()
            tf.print(tf.strings.format(logs,
            (epoch,train_loss.result(),train_metric.result(),valid_loss.result(),valid_metric.result())))
            tf.print("")

        train_loss.reset_states()
        valid_loss.reset_states()
        train_metric.reset_states()
        valid_metric.reset_states()

train_model(model,ds_train,ds_test,epochs = 6)
```
```
================================================================================13:54:08
Epoch=1,Loss:0.442317516,Accuracy:0.7695,Valid Loss:0.323672801,Valid Accuracy:0.8614

================================================================================13:54:20
Epoch=2,Loss:0.245737702,Accuracy:0.90215,Valid Loss:0.356488883,Valid Accuracy:0.8554

================================================================================13:54:32
Epoch=3,Loss:0.17360799,Accuracy:0.93455,Valid Loss:0.361132562,Valid Accuracy:0.8674

================================================================================13:54:44
Epoch=4,Loss:0.113476314,Accuracy:0.95975,Valid Loss:0.483677238,Valid Accuracy:0.856

================================================================================13:54:57
Epoch=5,Loss:0.0698405355,Accuracy:0.9768,Valid Loss:0.607856631,Valid Accuracy:0.857

================================================================================13:55:15
Epoch=6,Loss:0.0366807655,Accuracy:0.98825,Valid Loss:0.745884955,Valid Accuracy:0.854
```

4. Model Evaluation

A model trained with the customized loop has not been compiled, so the method model.evaluate(ds_valid) cannot be applied to it directly.

```python
def evaluate_model(model,ds_valid):
    for features, labels in ds_valid:
        valid_step(model,features,labels)
    logs = 'Valid Loss:{},Valid Accuracy:{}'
    tf.print(tf.strings.format(logs,(valid_loss.result(),valid_metric.result())))

    valid_loss.reset_states()
    train_metric.reset_states()
    valid_metric.reset_states()
```
```python
evaluate_model(model,ds_test)
```
```
Valid Loss:0.745884418,Valid Accuracy:0.854
```
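Alternatively, a model trained with a custom loop can be compiled afterwards, which makes the built-in evaluation available; a minimal sketch:

```python
# Hedged sketch: compile the already-trained model so evaluate() can be used
model.compile(loss=losses.BinaryCrossentropy(),
              metrics=[metrics.BinaryAccuracy()])
model.evaluate(ds_test)
```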

5. Model Application

Below are the available methods:

  • model.predict(ds_test)
  • model(x_test)
  • model.call(x_test)
  • model.predict_on_batch(x_test)

We recommend the method model.predict(ds_test), since it can be applied to both a Dataset and a Tensor.

```python
model.predict(ds_test)
```
```
array([[0.7864823 ],
       [0.9999901 ],
       [0.99944776],
       ...,
       [0.8498302 ],
       [0.13382755],
       [1.        ]], dtype=float32)
```
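The outputs are sigmoid probabilities of the positive class; to obtain hard 0/1 sentiment labels they can be thresholded at 0.5, as in this small sketch:

```python
# Convert sigmoid probabilities into 0/1 sentiment labels (illustrative)
probs = model.predict(ds_test)               # numpy array of shape (n_samples, 1)
pred_labels = (probs > 0.5).astype("int32")  # 1 = positive, 0 = negative
print(pred_labels[:5])
```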
```python
for x_test,_ in ds_test.take(1):
    print(model(x_test))
    # Identical expressions:
    # print(model.call(x_test))
    # print(model.predict_on_batch(x_test))
```
```
tf.Tensor(
[[7.8648227e-01]
 [9.9999011e-01]
 [9.9944776e-01]
 [3.7153201e-09]
 [9.4462049e-01]
 [2.3522753e-04]
 [1.2044354e-04]
 [9.3752089e-07]
 [9.9996352e-01]
 [9.3435925e-01]
 [9.8746723e-01]
 [9.9908626e-01]
 [4.1563155e-08]
 [4.1808244e-03]
 [8.0184749e-05]
 [8.3910513e-01]
 [3.5167937e-05]
 [7.2113985e-01]
 [4.5228912e-03]
 [9.9942589e-01]], shape=(20, 1), dtype=float32)
```

6. Model Saving

Saving the model in the native TensorFlow way (the SavedModel format) is recommended.

```python
model.save('../data/tf_model_savedmodel', save_format="tf")
print('export saved model.')

model_loaded = tf.keras.models.load_model('../data/tf_model_savedmodel')
model_loaded.predict(ds_test)
```
```
array([[0.7864823 ],
       [0.9999901 ],
       [0.99944776],
       ...,
       [0.8498302 ],
       [0.13382755],
       [1.        ]], dtype=float32)
```
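If only the parameters are needed rather than the whole computation graph, a weights-only checkpoint is a lighter alternative; a minimal sketch, where the checkpoint path is an assumption:

```python
# Hedged sketch: weights-only saving; the model class must be rebuilt to load
model.save_weights('../data/tf_model_weights.ckpt', save_format="tf")

model_new = CnnModel()
model_new.build(input_shape=(None, MAX_LEN))
model_new.load_weights('../data/tf_model_weights.ckpt')
model_new.predict(ds_test)
```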

Please leave comments in the WeChat official account "Python与算法之美" (Elegance of Python and Algorithms) if you want to discuss the content with the author. The author will do their best to reply, given the limited time available.

You are also welcome to join the group chat with the other readers by replying 加群 (join group) in the WeChat official account.
