The Chinese Dictionary

Chinese is a complex language (aren’t they all :-( ). The written form uses characters, often loosely called “pictograms”, instead of an alphabet. This written form has evolved over time, and fairly recently split into two forms: “traditional” Chinese as used in Taiwan and Hong Kong, and “simplified” Chinese as used in mainland China. While most of the characters are the same, about 1,000 are different. Thus a Chinese dictionary will often have two written forms of the same character.

Most Westerners, like me, can’t understand these characters. So there is a “Latinised” form called Pinyin, which writes the characters phonetically using the Latin alphabet. It isn’t quite the plain Latin alphabet, because Chinese is a tonal language and the Pinyin form has to show the tones (with marks much like the accents in French and other European languages). So a typical dictionary has to show four things: the traditional form, the simplified form, the Pinyin and the English. For example,

Traditional   Simplified   Pinyin   English
好            好           hǎo      good

But again there is a little complication. There is a free Chinese/English dictionary and, even better, you can download it as a UTF-8 file, which Go is well suited to handle. In this file the Chinese characters are written in Unicode but the Pinyin is not: although there are Unicode characters for letters such as ‘ǎ’, many dictionaries, including this one, use the plain Latin ‘a’ and place the tone number at the end of the word. Here it is the third tone, so “hǎo” is written as “hao3”. This makes it possible for those who only have US keyboards and no Unicode editor to still communicate in Pinyin.
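To make the difference between the two spellings concrete, here is a small sketch that counts bytes and characters in the strings from the example. The helper name describe is made up for this illustration; only len and the standard unicode/utf8 package are used.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// describe is a hypothetical helper for this illustration: it reports
// how many bytes and how many Unicode characters (runes) s contains.
func describe(s string) (nBytes, nRunes int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	// the character, the accented Pinyin, and the numbered Pinyin
	for _, s := range []string{"好", "hǎo", "hao3"} {
		b, r := describe(s)
		fmt.Printf("%q: %d bytes, %d runes\n", s, b, r)
	}
}
```

The numbered form “hao3” is plain ASCII, one byte per character, while the accented ‘ǎ’ and the Chinese character each take several bytes in UTF-8. Go strings hold either form without any special treatment.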

This data format mismatch is not a big deal: it just means that somewhere between the original text dictionary and the display in the browser, the data has to be massaged from one form into the other. Go templates allow this to be done by defining a custom formatting function, so I chose that route. Alternatives would have been to do the conversion as the dictionary is read in, or in the JavaScript that displays the final characters.
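As a rough sketch of the template route, assuming the text/template package of the current standard library, a function can be registered under a name and then applied in a pipeline. The names markTone and render are invented for this illustration, and markTone is only a stand-in that wraps its input in brackets, not a real Pinyin formatter.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// markTone is a stand-in for a real Pinyin formatter (an assumption for
// this sketch): it only brackets the text so the effect is visible.
func markTone(s string) string {
	return "[" + strings.TrimSpace(s) + "]"
}

// render is a hypothetical helper that runs a one-line template over
// pinyin, with markTone registered under the template name "pinyin".
func render(pinyin string) (string, error) {
	funcs := template.FuncMap{"pinyin": markTone}
	// Funcs must be called before Parse so the name is known to the parser
	t, err := template.New("entry").Funcs(funcs).Parse("{{. | pinyin}}")
	if err != nil {
		return "", err
	}
	var out strings.Builder
	if err := t.Execute(&out, pinyin); err != nil {
		return "", err
	}
	return out.String(), nil
}

func main() {
	out, err := render("hao3")
	if err != nil {
		fmt.Println("template error:", err)
		return
	}
	fmt.Println(out)
}
```

The attraction of this design is that the dictionary data stays exactly as it was read from the file, and the conversion happens only at the point where the text is displayed.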

The code for the Pinyin formatter is given below. Please don’t bother reading it unless you are really interested in knowing the rules for Pinyin formatting.

    package pinyin

    import (
        "io"
        "strings"
    )

    // PinyinFormatter rewrites numbered Pinyin such as "hao3" as accented
    // Pinyin such as "hǎo" and writes the result to w.
    func PinyinFormatter(w io.Writer, format string, value ...interface{}) {
        line := value[0].(string)
        words := strings.Fields(line)
        for n, word := range words {
            // convert "u:" to "ü" if present
            word = strings.Replace(word, "u:", "ü", 1)
            // get last character, will be the tone if present
            chars := []rune(word)
            tone := chars[len(chars)-1]
            if tone == '5' {
                // neutral tone: drop the number, add no accent
                words[n] = string(chars[:len(chars)-1])
                continue
            }
            if tone < '1' || tone > '4' {
                // no tone number: keep the word, including any "ü"
                words[n] = word
                continue
            }
            words[n] = addAccent(word, int(tone))
        }
        line = strings.Join(words, " ")
        w.Write([]byte(line))
    }

    // Each map takes a tone digit ('1' to '4') to the accented vowel.
    // The third tone uses the caron (ǎ), not the breve (ă).
    var (
        // maps 'a1' to '\u0101' etc
        aAccent = map[int]rune{
            '1': '\u0101',
            '2': '\u00e1',
            '3': '\u01ce',
            '4': '\u00e0'}
        eAccent = map[int]rune{
            '1': '\u0113',
            '2': '\u00e9',
            '3': '\u011b',
            '4': '\u00e8'}
        iAccent = map[int]rune{
            '1': '\u012b',
            '2': '\u00ed',
            '3': '\u01d0',
            '4': '\u00ec'}
        oAccent = map[int]rune{
            '1': '\u014d',
            '2': '\u00f3',
            '3': '\u01d2',
            '4': '\u00f2'}
        uAccent = map[int]rune{
            '1': '\u016b',
            '2': '\u00fa',
            '3': '\u01d4',
            '4': '\u00f9'}
        üAccent = map[int]rune{
            '1': 'ǖ',
            '2': 'ǘ',
            '3': 'ǚ',
            '4': 'ǜ'}
    )

    // addAccent moves the trailing tone digit of word onto the right vowel.
    // Based on "Where do the tone marks go?"
    // at http://www.pinyin.info/rules/where.html
    func addAccent(word string, tone int) string {
        end := len(word) - 1 // byte index of the trailing tone digit

        // 'a' and 'e' always take the tone mark
        if n := strings.Index(word, "a"); n != -1 {
            return word[:n] + string(aAccent[tone]) + word[n+1:end]
        }
        if n := strings.Index(word, "e"); n != -1 {
            return word[:n] + string(eAccent[tone]) + word[n+1:end]
        }
        // in "ou" the 'o' takes the tone mark
        if n := strings.Index(word, "ou"); n != -1 {
            return word[:n] + string(oAccent[tone]) + "u" + word[n+2:end]
        }

        // otherwise put the tone on the last vowel
        chars := []rune(word)
        chars = chars[:len(chars)-1] // drop the tone digit
        for m := len(chars) - 1; m >= 0; m-- {
            switch chars[m] {
            case 'i':
                chars[m] = iAccent[tone]
            case 'o':
                chars[m] = oAccent[tone]
            case 'u':
                chars[m] = uAccent[tone]
            case 'ü':
                chars[m] = üAccent[tone]
            default:
                continue // not a vowel, keep looking
            }
            break // accent placed
        }
        return string(chars)
    }

How this is used is illustrated by the function lookupWord. This is called in response to an HTML Form request to find the English words in a dictionary.

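    func lookupWord(rw http.ResponseWriter, req *http.Request) {
        word := req.FormValue("word")
        words := d.LookupEnglish(word)
        // adapt the formatter so templates can call it as "pinyin" (needs "bytes")
        funcs := template.FuncMap{"pinyin": func(s string) string {
            var buf bytes.Buffer
            pinyin.PinyinFormatter(&buf, "", s)
            return buf.String()
        }}
        // the template name must match the base name of the parsed file
        t, err := template.New("DictionaryEntry.html").Funcs(funcs).
            ParseFiles("html/DictionaryEntry.html")
        if err != nil {
            http.Error(rw, err.Error(), http.StatusInternalServerError)
            return
        }
        t.Execute(rw, words)
    }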

The HTML code is

    <html>
      <body>
        <table border="1">
          <tr>
            <th>Traditional</th>
            <th>Simplified</th>
            <th>Pinyin</th>
            <th>English</th>
          </tr>
          {{range .Entries}}
          <tr>
            <td>{{.Traditional}}</td>
            <td>{{.Simplified}}</td>
            <td>{{.Pinyin | pinyin}}</td>
            <td>
              <pre>
    {{range .Translations}}{{. | html}}
    {{end}}</pre>
            </td>
          </tr>
          {{end}}
        </table>
      </body>
    </html>

The Dictionary type

The text file containing the dictionary has lines of the form

traditional simplified [pinyin] /translation/translation/…/

For example,

好 好 [hao3] /good/well/proper/good to/easy to/very/so/(suffix indicating completion or readiness)/
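To show how such a line falls apart into its four parts, here is a small sketch. It is not the parser used in the dictionary package: the function name splitEntry and the ok result are my own for this illustration, and it assumes Go 1.18 or later for strings.Cut.

```go
package main

import (
	"fmt"
	"strings"
)

// splitEntry is a hypothetical helper for this illustration. It splits a
// line of the form "trad simp [pinyin] /trans/trans/.../" and reports
// ok == false if the line does not have that shape.
func splitEntry(line string) (trad, simp, pinyin string, translations []string, ok bool) {
	head, rest, found := strings.Cut(strings.TrimSpace(line), " [")
	if !found {
		return "", "", "", nil, false
	}
	trad, simp, found = strings.Cut(head, " ")
	if !found {
		return "", "", "", nil, false
	}
	pinyin, rest, found = strings.Cut(rest, "]")
	if !found {
		return "", "", "", nil, false
	}
	// drop the leading and trailing "/" before splitting
	rest = strings.Trim(strings.TrimSpace(rest), "/")
	return trad, simp, pinyin, strings.Split(rest, "/"), true
}

func main() {
	line := "好 好 [hao3] /good/well/proper/good to/easy to/very/so/(suffix indicating completion or readiness)/"
	trad, simp, pinyin, translations, ok := splitEntry(line)
	if !ok {
		fmt.Println("malformed line")
		return
	}
	fmt.Println(trad, simp, pinyin)
	for _, t := range translations {
		fmt.Println(" ", t)
	}
}
```

Returning an ok flag is one way to let the caller skip malformed lines rather than crash on them.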

We store each line as an Entry within the dictionary package:

    type Entry struct {
        Traditional  string
        Simplified   string
        Pinyin       string
        Translations []string
    }

The dictionary itself is just a slice of pointers to these entries:

    type Dictionary struct {
        Entries []*Entry
    }

Building the dictionary is easy enough. Just read each line and break the line into its various bits using simple string methods. Then add the line to the dictionary slice.

Looking up entries in this dictionary is straightforward: just search through until we find the appropriate key. There are about 100,000 entries in this dictionary: brute force by a linear search is fast enough. If it were necessary, faster storage and search mechanisms could easily be used.
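If the linear search ever did become too slow, one possible faster mechanism is a map from each English translation to the entries that contain it. The following is only a sketch of that idea, not part of the dictionary package: buildEnglishIndex is a name I have made up here, and the Entry type is repeated so that the example stands on its own.

```go
package main

import "fmt"

// Entry mirrors the Entry type of the dictionary package.
type Entry struct {
	Traditional  string
	Simplified   string
	Pinyin       string
	Translations []string
}

// buildEnglishIndex is a hypothetical helper: it maps every English
// translation to all the entries that list it, so that a lookup becomes
// a single map access instead of a scan over every entry.
func buildEnglishIndex(entries []*Entry) map[string][]*Entry {
	index := make(map[string][]*Entry)
	for _, e := range entries {
		for _, t := range e.Translations {
			index[t] = append(index[t], e)
		}
	}
	return index
}

func main() {
	entries := []*Entry{
		{"好", "好", "hao3", []string{"good", "well", "proper"}},
	}
	index := buildEnglishIndex(entries)
	for _, e := range index["well"] {
		fmt.Println(e.Traditional, e.Simplified, e.Pinyin)
	}
	// a word that is not in the dictionary simply gives no entries
	fmt.Println(len(index["no such word"]))
}
```

The cost is extra memory and the time to build the index once at startup. Similar maps could be built for the Pinyin and simplified fields.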

The original dictionary grows by people on the Web adding in entries as they see fit. Consequently it isn’t that well organised and contains repetitions and multiple entries. So looking up any word - either by Pinyin or by English - may return multiple matches. To cater for this, each lookup returns a “mini dictionary”, just those lines in the full dictionary that match.

The Dictionary code is

    package dictionary

    import (
        "bufio"
        "os"
        "strings"
    )

    // Entry is one line of the dictionary file.
    type Entry struct {
        Traditional  string
        Simplified   string
        Pinyin       string
        Translations []string
    }

    func (de Entry) String() string {
        str := de.Traditional + " " + de.Simplified + " " + de.Pinyin
        for _, t := range de.Translations {
            str += "\n " + t
        }
        return str
    }

    // Dictionary is a list of entries: either the full dictionary or the
    // "mini dictionary" returned by a lookup.
    type Dictionary struct {
        Entries []*Entry
    }

    func (d *Dictionary) String() string {
        str := ""
        for _, de := range d.Entries {
            str += de.String() + "\n"
        }
        return str
    }

    // LookupPinyin returns the entries whose Pinyin is exactly py.
    func (d *Dictionary) LookupPinyin(py string) *Dictionary {
        newD := new(Dictionary)
        for _, de := range d.Entries {
            if de.Pinyin == py {
                newD.Entries = append(newD.Entries, de)
            }
        }
        return newD
    }

    // LookupEnglish returns the entries with eng among their translations.
    func (d *Dictionary) LookupEnglish(eng string) *Dictionary {
        newD := new(Dictionary)
        for _, de := range d.Entries {
            for _, e := range de.Translations {
                if e == eng {
                    newD.Entries = append(newD.Entries, de)
                    break // add each entry only once
                }
            }
        }
        return newD
    }

    // LookupSimplified returns the entries whose simplified form is simp.
    func (d *Dictionary) LookupSimplified(simp string) *Dictionary {
        newD := new(Dictionary)
        for _, de := range d.Entries {
            if de.Simplified == simp {
                newD.Entries = append(newD.Entries, de)
            }
        }
        return newD
    }

    // Load reads the dictionary file at path, replacing any existing entries.
    func (d *Dictionary) Load(path string) {
        f, err := os.Open(path)
        if err != nil {
            println(err.Error())
            os.Exit(1)
        }
        defer f.Close()
        r := bufio.NewReader(f)

        v := make([]*Entry, 0, 100000)
        for {
            line, err := r.ReadString('\n')
            line = strings.TrimSpace(line)
            // skip blank lines and comments
            if line != "" && line[0] != '#' {
                trad, simp, pinyin, translations := parseDictEntry(line)
                v = append(v, &Entry{
                    Traditional:  trad,
                    Simplified:   simp,
                    Pinyin:       pinyin,
                    Translations: translations})
            }
            if err != nil {
                break // end of file, or a read error
            }
        }
        d.Entries = v
    }

    // parseDictEntry splits one well-formed dictionary line into its parts.
    func parseDictEntry(line string) (string, string, string, []string) {
        // format is
        // trad simp [pinyin] /trans/trans/.../
        tradEnd := strings.Index(line, " ")
        trad := line[0:tradEnd]
        line = strings.TrimSpace(line[tradEnd:])

        simpEnd := strings.Index(line, " ")
        simp := line[0:simpEnd]
        line = strings.TrimSpace(line[simpEnd:])

        // skip the opening "[" and stop before the closing "]"
        pinyinEnd := strings.Index(line, "]")
        pinyin := line[1:pinyinEnd]
        line = strings.TrimSpace(line[pinyinEnd+1:])

        translations := strings.Split(line, "/")
        // includes empty at start and end, so
        translations = translations[1 : len(translations)-1]
        return trad, simp, pinyin, translations
    }