Computing Text Semantic Similarity with Word2Vec

This example shows how to complete a text semantic similarity computation "end to end" with PaddleHub.

I. Prepare the Text Data

Each item below is a pair of Chinese questions (sentence A and sentence B, separated here by " / "). Pairs 1 and 3 are semantically similar, while the two sentences in pair 2 are unrelated.

  1. 驾驶违章一次扣12分用两个驾驶证处理可以吗 / 一次性扣12分的违章,能用不满十二分的驾驶证扣分吗
  2. 水果放冰箱里储存好吗 / 中国银行纪念币网上怎么预约
  3. 电脑反应很慢怎么办 / 反应速度慢,电脑总是卡是怎么回事

II. Word Segmentation

Use the PaddleHub LAC (Lexical Analysis of Chinese) Module to segment the text data into words.

```python
# coding:utf-8
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""similarity between two sentences"""
import numpy as np
from scipy.spatial import distance
from paddlehub.reader.tokenization import load_vocab
import paddle.fluid as fluid
import paddlehub as hub

raw_data = [
    ["驾驶违章一次扣12分用两个驾驶证处理可以吗", "一次性扣12分的违章,能用不满十二分的驾驶证扣分吗"],
    ["水果放冰箱里储存好吗", "中国银行纪念币网上怎么预约"],
    ["电脑反应很慢怎么办", "反应速度慢,电脑总是卡是怎么回事"],
]

lac = hub.Module(name="lac")

processed_data = []
for text_pair in raw_data:
    inputs = {"text": text_pair}
    # Segment both sentences of a pair in one batch;
    # set use_gpu=False if no GPU is available.
    results = lac.lexical_analysis(data=inputs, use_gpu=True, batch_size=2)
    data = []
    for result in results:
        # Join the tokens of each sentence with spaces.
        data.append(" ".join(result["word"]))
    processed_data.append(data)
```
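To sanity-check the segmentation before moving on, you can print `processed_data`. The tokenization in the comment below is only indicative, since the exact token boundaries depend on the LAC model:

```python
for pair in processed_data:
    print(pair)
# One entry per original pair, tokens joined by spaces, e.g.:
# ['电脑 反应 很 慢 怎么办', '反应 速度 慢 , 电脑 总是 卡 是 怎么 回事']
```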

III. Compute Text Semantic Similarity

Replace each word in the segmented text with its word id, then feed the id sequences into the word2vec_skipgram Module: the word embeddings fetched for each sentence are summed into a sentence embedding, and the semantic similarity of the two texts is the cosine similarity of the two sentence embeddings.
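In notation: if the word vectors of sentence A are $w_1, \dots, w_m$ and those of sentence B are $v_1, \dots, v_n$, the script below computes

$$\mathrm{sim}(A, B) = \frac{\left(\sum_i w_i\right) \cdot \left(\sum_j v_j\right)}{\left\lVert \sum_i w_i \right\rVert \, \left\lVert \sum_j v_j \right\rVert},$$

which is exactly `1 - distance.cosine(sent_emb_a, sent_emb_b)`.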

```python
def convert_tokens_to_ids(vocab, text):
    """Convert a space-separated, segmented sentence into a list of word ids."""
    wids = []
    tokens = text.split(" ")
    for token in tokens:
        wid = vocab.get(token, None)
        # Use "is None" (rather than "not wid") so a valid word id of 0 is kept.
        if wid is None:
            wid = vocab["unknown"]
        wids.append(wid)
    return wids


module = hub.Module(name="word2vec_skipgram")
inputs, outputs, program = module.context(trainable=False)
vocab = load_vocab(module.get_vocab_path())

word_ids = inputs["word_ids"]
embedding = outputs["word_embs"]

place = fluid.CPUPlace()
exe = fluid.Executor(place)
feeder = fluid.DataFeeder(feed_list=[word_ids], place=place)

for item in processed_data:
    text_a = convert_tokens_to_ids(vocab, item[0])
    text_b = convert_tokens_to_ids(vocab, item[1])

    # Look up the embedding of every word in sentence A.
    vecs_a, = exe.run(
        program,
        feed=feeder.feed([[text_a]]),
        fetch_list=[embedding.name],
        return_numpy=False)
    vecs_a = np.array(vecs_a)

    # Look up the embedding of every word in sentence B.
    vecs_b, = exe.run(
        program,
        feed=feeder.feed([[text_b]]),
        fetch_list=[embedding.name],
        return_numpy=False)
    vecs_b = np.array(vecs_b)

    # Sum the word embeddings into one sentence embedding per text.
    sent_emb_a = np.sum(vecs_a, axis=0)
    sent_emb_b = np.sum(vecs_b, axis=0)

    # scipy's distance.cosine returns the cosine *distance*,
    # so the similarity is 1 minus it.
    cos_sim = 1 - distance.cosine(sent_emb_a, sent_emb_b)

    print("text_a: %s; text_b: %s; cosine_similarity: %.5f" %
          (item[0], item[1], cos_sim))
```
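As a cross-check, `1 - distance.cosine(a, b)` is just the normalized dot product of the two vectors. A minimal sketch verifying this with plain NumPy (the vectors `a` and `b` are made-up illustrations, not real sentence embeddings):

```python
import numpy as np
from scipy.spatial import distance

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 5.9])

# Cosine similarity computed directly from the definition.
manual = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
assert np.isclose(1 - distance.cosine(a, b), manual)
print("cosine similarity: %.5f" % manual)
```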