Word Break

  • tags: [DP_Sequence]

Question

    Given a string s and a dictionary of words dict, determine if s can be
    segmented into a space-separated sequence of one or more dictionary words.
    For example, given
    s = "leetcode",
    dict = ["leet", "code"].
    Return true because "leetcode" can be segmented as "leet code".

Solution

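This is a single-sequence (DP_Sequence) DP problem. The four elements of single-sequence dynamic programming give roughly the following:

  1. State: f[i] indicates whether the first i characters can be segmented into dictionary words.
  2. Function: f[i] = or{f[j], j < i, letters in [j+1, i] can be found in dict}. That is, f[i] is true as soon as some index j < i has f[j] true and the characters at positions j+1 through i form a word in the dictionary; otherwise f[i] is false. This can be implemented either top-down or bottom-up.
  3. Initialization: f[0] = true. The array has length s.length + 1, which makes the empty prefix easy to handle.
  4. Answer: f[s.length]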

Because dictionary words are usually not very long, the bottom-up approach is the more efficient choice when s is long.
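For comparison with the bottom-up solutions that follow, here is a minimal sketch of the top-down (memoized) form of the same recurrence. The function name `word_break_top_down` and the use of `functools.lru_cache` are choices made for this illustration, not part of the original solutions.

```python
from functools import lru_cache


def word_break_top_down(s, word_dict):
    """Top-down sketch: f(i) is True if the first i characters can be segmented."""
    words = set(word_dict)
    if not s:
        return True
    if not words:
        return False
    max_word_len = max(len(w) for w in words)

    @lru_cache(maxsize=None)
    def f(i):
        # Base case: the empty prefix is always segmentable (f[0] = true).
        if i == 0:
            return True
        # Try split points j from i - 1 down to the furthest one that still
        # keeps the candidate word s[j:i] within max_word_len characters.
        for j in range(i - 1, max(i - max_word_len, 0) - 1, -1):
            if s[j:i] in words and f(j):
                return True
        return False

    return f(len(s))
```

Calling `word_break_top_down("leetcode", ["leet", "code"])` returns `True`. Note that the recursion can go as deep as the length of s, so for very long inputs this sketch may hit Python's recursion limit, which is one practical reason to prefer the bottom-up form.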

Python

    class Solution:
        # @param s, a string
        # @param wordDict, a set<string>
        # @return a boolean
        def wordBreak(self, s, wordDict):
            if not s:
                return True
            if not wordDict:
                return False
            max_word_len = max(len(w) for w in wordDict)
            # can_break[k] is True when the first k characters can be segmented
            can_break = [True]
            for i in range(len(s)):
                can_break.append(False)
                for j in range(i, -1, -1):
                    # optimize for too long interval
                    if i - j + 1 > max_word_len:
                        break
                    if can_break[j] and s[j:i + 1] in wordDict:
                        can_break[i + 1] = True
                        break
            return can_break[-1]

C++

    #include <algorithm>
    #include <string>
    #include <unordered_set>
    #include <vector>
    using namespace std;

    class Solution {
    public:
        bool wordBreak(string s, unordered_set<string>& wordDict) {
            if (s.empty()) return true;
            if (wordDict.empty()) return false;
            // get the max word length of wordDict
            int max_word_len = 0;
            for (unordered_set<string>::iterator it = wordDict.begin();
                 it != wordDict.end(); ++it) {
                // cast so that both arguments of max() have the same type
                max_word_len = max(max_word_len, static_cast<int>(it->size()));
            }
            vector<bool> can_break(s.size() + 1, false);
            can_break[0] = true;
            const int n = static_cast<int>(s.size());
            for (int i = 1; i <= n; ++i) {
                for (int j = i - 1; j >= 0; --j) {
                    // optimize for too long interval
                    if (i - j > max_word_len) break;
                    if (can_break[j] &&
                        wordDict.find(s.substr(j, i - j)) != wordDict.end()) {
                        can_break[i] = true;
                        break;
                    }
                }
            }
            return can_break[n];
        }
    };

Java

    import java.util.Set;

    public class Solution {
        public boolean wordBreak(String s, Set<String> wordDict) {
            if (s == null || s.length() == 0) return true;
            if (wordDict == null || wordDict.isEmpty()) return false;
            // get the max word length of wordDict
            int max_word_len = 0;
            for (String word : wordDict) {
                max_word_len = Math.max(max_word_len, word.length());
            }
            boolean[] can_break = new boolean[s.length() + 1];
            can_break[0] = true;
            for (int i = 1; i <= s.length(); i++) {
                for (int j = i - 1; j >= 0; j--) {
                    // optimize for too long interval
                    if (i - j > max_word_len) break;
                    String word = s.substring(j, i);
                    if (can_break[j] && wordDict.contains(word)) {
                        can_break[i] = true;
                        break;
                    }
                }
            }
            return can_break[s.length()];
        }
    }

Source Code Analysis

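In dynamic languages such as Python there is no need to pre-allocate an array of a fixed size; the list grows by appending, so the loop index i is one less than in the C++ and Java versions. The state transition is solved bottom-up, and the dictionary is traversed once up front to find the maximum word length, which the inner loop then uses to stop early.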
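To make the index convention concrete, the following sketch fills the table with the same 1-based indexing as the C++ and Java versions and returns the whole table instead of only the last entry. The helper name `break_table` is hypothetical and exists only for this illustration.

```python
def break_table(s, word_dict):
    """Return the full DP table; table[i] covers the first i characters of s."""
    words = set(word_dict)
    # default=0 keeps the sketch safe for an empty dictionary
    max_word_len = max((len(w) for w in words), default=0)
    can_break = [True] + [False] * len(s)
    for i in range(1, len(s) + 1):
        for j in range(i - 1, -1, -1):
            # stop once the candidate s[j:i] is longer than any dictionary word
            if i - j > max_word_len:
                break
            if can_break[j] and s[j:i] in words:
                can_break[i] = True
                break
    return can_break


print(break_table("leetcode", ["leet", "code"]))
# [True, False, False, False, True, False, False, False, True]
```

Entries 4 and 8 become true because "leet" ends at position 4 and "code" extends a segmentable prefix to position 8; the final entry is the answer.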

Complexity Analysis

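  1. Finding the maximum word length in the dictionary is bounded by the dictionary size times the maximum word length, $$O(L_D \cdot L_w)$$. In practice it is $$O(L_D)$$ here, since Python, C++ and Java all report string length in constant time.
  2. Looking up a word in the dictionary takes $$O(1)$$ on average (hash table), treating the cost of building and hashing a substring of length at most $$L_w$$ as constant.
  3. There are two nested for loops, and the inner loop exits once the candidate exceeds the maximum word length, so in the worst case the two loops run $$O(n L_w)$$ iterations.
  4. The total time complexity is therefore approximately $$O(n L_w)$$.
  5. A boolean array of nearly the same length as the string and a temporary word are used, so the space complexity is approximately $$O(n)$$.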
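As a rough check of the $$O(n L_w)$$ bound on the nested loops, the sketch below counts how many candidate substrings the inner loop actually tests and compares that with n times the maximum word length. The helper name `count_inner_iterations` is hypothetical; the input is chosen so that no word ever matches, which means the inner loop never exits early on a hit.

```python
def count_inner_iterations(s, word_dict):
    """Return (candidates tested, n * max_word_len) for the bottom-up DP."""
    words = set(word_dict)
    max_word_len = max((len(w) for w in words), default=0)
    can_break = [True] + [False] * len(s)
    iterations = 0
    for i in range(1, len(s) + 1):
        for j in range(i - 1, -1, -1):
            # pruning step: candidates longer than max_word_len are skipped
            if i - j > max_word_len:
                break
            iterations += 1
            if can_break[j] and s[j:i] in words:
                can_break[i] = True
                break
    return iterations, len(s) * max_word_len


print(count_inner_iterations("b" * 10, ["aaa"]))
# (27, 30)
```

With n = 10 and a maximum word length of 3, the loop tests 1 + 2 + 3 * 8 = 27 candidates, which stays within the bound of 30.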