UTF-16 and Go

UTF-16 represents text as a sequence of 16-bit unsigned integers. The package unicode/utf16 is designed to manage such sequences. To convert a normal Go string, that is a UTF-8 string, into UTF-16, you first extract the code points by converting it to a []rune and then use utf16.Encode to produce a slice of type []uint16. Code points above U+FFFF do not fit in 16 bits, so each is encoded as a pair of 16-bit values (a surrogate pair).

Similarly, to decode a slice of UTF-16 values into a Go string, you use utf16.Decode to convert it into code points of type []rune and then convert those to a string. The following code fragment illustrates this:

  str := "百度一下,你就知道"
  shorts := utf16.Encode([]rune(str)) // []uint16: the UTF-16 values
  runes := utf16.Decode(shorts)       // []rune: the code points again
  str = string(runes)
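
As a minimal sketch of the same round trip in a complete program, the following prints the 16-bit values produced for a short string. The helper name encodeHex and the sample string are assumptions made for illustration. The last character lies above U+FFFF, so it shows up as two 16-bit values rather than one.

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// encodeHex is a hypothetical helper: it returns the UTF-16 values of s
// formatted as four-digit hexadecimal numbers.
func encodeHex(s string) string {
	return fmt.Sprintf("%04x", utf16.Encode([]rune(s)))
}

// roundTrip encodes s to UTF-16 and decodes it back to a Go string.
func roundTrip(s string) string {
	return string(utf16.Decode(utf16.Encode([]rune(s))))
}

func main() {
	str := "a€😀"
	fmt.Println(encodeHex(str))        // [0061 20ac d83d de00]
	fmt.Println(roundTrip(str) == str) // true
}
```

Three code points become four 16-bit values here, so the length of the []uint16 is not in general the number of characters.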

Clients and servers need to apply these conversions, as appropriate, when they read and write 16-bit integers, as the UTF-16 client and server below show.

Little-endian and big-endian

Unfortunately, there is a little devil lurking behind UTF-16. It is basically an encoding of characters into 16-bit short integers. The big question is: for each short, how is it written as two bytes? The most significant byte first (big-endian), or the least significant byte first (little-endian)? Either way is fine, as long as the receiver uses the same convention as the sender.
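
To make the two conventions concrete, here is a small sketch that uses the standard encoding/binary package to lay out one 16-bit value both ways. The helper name bothOrders and the sample value 0x00e9 (the code point of "é") are chosen only for illustration.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// bothOrders is a hypothetical helper: it returns the two possible
// byte layouts of a single 16-bit value.
func bothOrders(v uint16) (big, little [2]byte) {
	binary.BigEndian.PutUint16(big[:], v)       // most significant byte first
	binary.LittleEndian.PutUint16(little[:], v) // least significant byte first
	return big, little
}

func main() {
	big, little := bothOrders(0x00e9)
	fmt.Printf("big-endian:    % x\n", big[:])    // big-endian:    00 e9
	fmt.Printf("little-endian: % x\n", little[:]) // little-endian: e9 00
}
```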

Unicode has addressed this with a special character known as the BOM (byte order mark). This is a zero-width non-printing character, so you never see it in text. Its value, 0xfeff, is chosen so that you can tell the byte order: the byte-swapped value 0xfffe is guaranteed never to be a valid character, so the two cases cannot be confused.

  • In big-endian order it is FE FF
  • In little-endian order it is FF FE

Some text starts with the BOM as its first character. The reader can then examine the first two bytes to determine which byte order has been used.
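
The check a reader performs can be sketched as follows, assuming the text starts with the standard BOM U+FEFF (bytes FE FF in big-endian order, FF FE in little-endian order). The function name orderName is hypothetical.

```go
package main

import "fmt"

// orderName is a hypothetical helper: it reports the byte order implied by
// the first two bytes of b, assuming the text starts with the BOM U+FEFF.
func orderName(b []byte) string {
	switch {
	case len(b) < 2:
		return "unknown" // too short to hold a BOM
	case b[0] == 0xfe && b[1] == 0xff:
		return "big-endian"
	case b[0] == 0xff && b[1] == 0xfe:
		return "little-endian"
	default:
		return "unknown" // no BOM present
	}
}

func main() {
	fmt.Println(orderName([]byte{0xfe, 0xff, 0x00, 0x6a})) // big-endian
	fmt.Println(orderName([]byte{0xff, 0xfe, 0x6a, 0x00})) // little-endian
	fmt.Println(orderName([]byte{0x00, 0x6a}))             // unknown
}
```

When no BOM is present the byte order has to be agreed on by some other means, which is why the sketch reports "unknown" rather than guessing.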

UTF-16 client and server

Using the BOM convention, we can write a server that prepends a BOM and writes a string in UTF-16 as

  /* UTF16 Server
   */
  package main

  import (
      "net"
      "fmt"
      "os"
      "unicode/utf16"
  )

  // BOM is the Unicode byte order mark, U+FEFF.
  const BOM = '\ufeff'

  func main() {
      service := "0.0.0.0:1210"
      tcpAddr, err := net.ResolveTCPAddr("tcp", service)
      checkError(err)
      listener, err := net.ListenTCP("tcp", tcpAddr)
      checkError(err)
      for {
          conn, err := listener.Accept()
          if err != nil {
              continue
          }
          str := "j'ai arrêté"
          shorts := utf16.Encode([]rune(str))
          writeShorts(conn, shorts)
          conn.Close() // we're finished
      }
  }

  // writeShorts sends the BOM and then each short, all in big-endian order.
  func writeShorts(conn net.Conn, shorts []uint16) {
      buf := make([]byte, 0, 2+2*len(shorts))
      // send the BOM as first two bytes: FE FF
      buf = append(buf, byte(BOM>>8), byte(BOM&255))
      for _, v := range shorts {
          buf = append(buf, byte(v>>8), byte(v&255))
      }
      // a failed write means the client has gone away; nothing more to do
      conn.Write(buf)
  }

  func checkError(err error) {
      if err != nil {
          fmt.Println("Fatal error ", err.Error())
          os.Exit(1)
      }
  }

while a client that reads a byte stream, extracts and examines the BOM and then decodes the rest of the stream is

  /* UTF16 Client
   */
  package main

  import (
      "fmt"
      "io"
      "net"
      "os"
      "unicode/utf16"
  )

  func main() {
      if len(os.Args) != 2 {
          fmt.Println("Usage: ", os.Args[0], "host:port")
          os.Exit(1)
      }
      service := os.Args[1]
      conn, err := net.Dial("tcp", service)
      checkError(err)
      shorts := readShorts(conn)
      runes := utf16.Decode(shorts)
      fmt.Println(string(runes))
  }

  func readShorts(conn net.Conn) []uint16 {
      // read everything until the server closes the connection
      // (io.ReadAll needs Go 1.16 or later)
      buf, err := io.ReadAll(conn)
      checkError(err)
      if len(buf) < 2 {
          fmt.Println("No BOM received")
          return nil
      }
      data := buf[2:] // the bytes after the BOM
      shorts := make([]uint16, len(data)/2)
      if buf[0] == 0xfe && buf[1] == 0xff {
          // big endian
          for i := range shorts {
              shorts[i] = uint16(data[2*i])<<8 | uint16(data[2*i+1])
          }
      } else if buf[0] == 0xff && buf[1] == 0xfe {
          // little endian
          for i := range shorts {
              shorts[i] = uint16(data[2*i+1])<<8 | uint16(data[2*i])
          }
      } else {
          // unknown byte order
          fmt.Println("Unknown order")
          return nil
      }
      return shorts
  }

  func checkError(err error) {
      if err != nil {
          fmt.Println("Fatal error ", err.Error())
          os.Exit(1)
      }
  }
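
The same framing (a BOM followed by 16-bit values) can be tried out without a network connection. The following is a sketch under stated assumptions, not a definitive implementation: it assumes the standard BOM U+FEFF, and the helper names encodeWithBOM and decodeWithBOM are hypothetical. It uses the ByteOrder values from the standard encoding/binary package in place of hand-written shifts.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"unicode/utf16"
)

// encodeWithBOM is a hypothetical helper: it returns s as UTF-16 bytes in
// the chosen byte order, preceded by the BOM U+FEFF.
func encodeWithBOM(s string, bigEndian bool) []byte {
	var order binary.ByteOrder = binary.LittleEndian
	if bigEndian {
		order = binary.BigEndian
	}
	shorts := append([]uint16{0xfeff}, utf16.Encode([]rune(s))...)
	out := make([]byte, 2*len(shorts))
	for i, v := range shorts {
		order.PutUint16(out[2*i:], v)
	}
	return out
}

// decodeWithBOM is a hypothetical helper: it uses the leading BOM to pick
// the byte order and decodes the remaining bytes into a Go string.
func decodeWithBOM(b []byte) (string, error) {
	if len(b) < 2 || len(b)%2 != 0 {
		return "", errors.New("need a BOM and an even number of bytes")
	}
	var order binary.ByteOrder
	switch {
	case b[0] == 0xfe && b[1] == 0xff:
		order = binary.BigEndian
	case b[0] == 0xff && b[1] == 0xfe:
		order = binary.LittleEndian
	default:
		return "", errors.New("no BOM found")
	}
	shorts := make([]uint16, 0, len(b)/2-1)
	for i := 2; i < len(b); i += 2 {
		shorts = append(shorts, order.Uint16(b[i:]))
	}
	return string(utf16.Decode(shorts)), nil
}

func main() {
	for _, bigEndian := range []bool{true, false} {
		wire := encodeWithBOM("j'ai arrêté", bigEndian)
		str, err := decodeWithBOM(wire)
		fmt.Printf("% x ... -> %q %v\n", wire[:4], str, err)
	}
}
```

Because the decoder takes the byte order from the BOM, it recovers the same string from either layout, which is the point of sending the BOM in the first place.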