Parsing XML

Go has an XML parser in the encoding/xml package, which is created using NewDecoder. This takes an io.Reader as parameter and returns a pointer to a Decoder. The main method of this type is Token, which returns the next token in the input stream, or a nil token and io.EOF at the end of the input. The token is one of the types StartElement, EndElement, CharData, Comment, ProcInst or Directive.

The types are

StartElement

The type StartElement is a structure with two fields, which in turn use the types Name and Attr:

  type StartElement struct {
      Name Name
      Attr []Attr
  }

  type Name struct {
      Space, Local string
  }

  type Attr struct {
      Name  Name
      Value string
  }
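
To see how these fields fit together, the following is a minimal sketch that finds the first start tag in a string and lists its attributes. The helper name firstStartElement and the sample element with its type and primary attributes are made up for illustration; only NewDecoder, Token and the struct fields come from the package.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// firstStartElement (a hypothetical helper) returns the first
// StartElement token found in doc.
func firstStartElement(doc string) (xml.StartElement, error) {
	decoder := xml.NewDecoder(strings.NewReader(doc))
	for {
		token, err := decoder.Token()
		if err != nil {
			// io.EOF here means no start tag was found
			return xml.StartElement{}, err
		}
		if start, ok := token.(xml.StartElement); ok {
			return start, nil
		}
	}
}

func main() {
	// The attributes on this element are invented for the example.
	doc := `<email type="work" primary="yes">jan@newmarch.name</email>`
	start, err := firstStartElement(doc)
	if err != nil {
		fmt.Println("Error:", err)
		return
	}
	fmt.Println("element:", start.Name.Local)
	for _, attr := range start.Attr {
		fmt.Printf("  %s = %s\n", attr.Name.Local, attr.Value)
	}
}
```

Name.Space is empty here because the sample uses no namespaces; with a namespace prefix it would hold the namespace the prefix resolves to.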

EndElement

This is also a structure:

  type EndElement struct {
      Name Name
  }

CharData

This type represents character data, such as the text content enclosed by a tag, and is a simple type:

  type CharData []byte

Comment

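A Comment represents an XML comment of the form <!--comment-->. The bytes do not include the <!-- and --> markers. It is also a simple type:

  type Comment []byte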

ProcInst

A ProcInst represents an XML processing instruction of the form <?target inst?>

  type ProcInst struct {
      Target string
      Inst   []byte
  }

Directive

A Directive represents an XML directive of the form <!text>. The bytes do not include the <! and > markers.

  type Directive []byte
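
The last three types are easy to overlook, so here is a small sketch that prints what each of them carries. The helpers describe and describeAll, and the sample document, are assumptions made for this example and are not part of the package.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// describe reports the contents of a ProcInst, Directive or Comment token.
// It returns "" for any other kind of token.
func describe(token xml.Token) string {
	switch t := token.(type) {
	case xml.ProcInst:
		return fmt.Sprintf("ProcInst target=%q inst=%q", t.Target, string(t.Inst))
	case xml.Directive:
		return fmt.Sprintf("Directive %q", string(t))
	case xml.Comment:
		return fmt.Sprintf("Comment %q", string(t))
	}
	return ""
}

// describeAll tokenizes doc and collects the non-empty descriptions.
func describeAll(doc string) ([]string, error) {
	decoder := xml.NewDecoder(strings.NewReader(doc))
	var out []string
	for {
		token, err := decoder.Token()
		if err == io.EOF {
			return out, nil
		}
		if err != nil {
			return out, err
		}
		if s := describe(token); s != "" {
			out = append(out, s)
		}
	}
}

func main() {
	// A made-up document, used only to exercise the three token types.
	doc := `<?xml version="1.0"?><!DOCTYPE note><!-- a reminder --><note/>`
	lines, err := describeAll(doc)
	if err != nil {
		fmt.Println("Error:", err)
		return
	}
	for _, line := range lines {
		fmt.Println(line)
	}
}
```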

A program to print out the tree structure of an XML document is:

  /* Parse XML
   */
  package main

  import (
      "encoding/xml"
      "fmt"
      "io"
      "os"
  )

  func main() {
      if len(os.Args) != 2 {
          fmt.Println("Usage: ", os.Args[0], "file")
          os.Exit(1)
      }

      // Open the file and let the decoder read from it directly
      file, err := os.Open(os.Args[1])
      checkError(err)
      defer file.Close()

      parser := xml.NewDecoder(file)
      depth := 0
      for {
          token, err := parser.Token()
          if err == io.EOF {
              break // end of input
          }
          checkError(err) // any other error means the XML is malformed

          // Print each token indented by its depth in the tree
          switch t := token.(type) {
          case xml.StartElement:
              printElmt(t.Name.Local, depth)
              depth++
          case xml.EndElement:
              depth--
              printElmt(t.Name.Local, depth)
          case xml.CharData:
              printElmt("\""+string(t)+"\"", depth)
          case xml.Comment:
              printElmt("Comment", depth)
          case xml.ProcInst:
              printElmt("ProcInst", depth)
          case xml.Directive:
              printElmt("Directive", depth)
          default:
              fmt.Println("Unknown")
          }
      }
  }

  // printElmt prints s preceded by one space per level of depth
  func printElmt(s string, depth int) {
      for n := 0; n < depth; n++ {
          fmt.Print(" ")
      }
      fmt.Println(s)
  }

  func checkError(err error) {
      if err != nil {
          fmt.Println("Fatal error ", err.Error())
          os.Exit(1)
      }
  }

Note that the parser includes all CharData, including the whitespace between tags.

If we run this program against the person data structure given earlier, it produces

  person
  "
  "
  name
  "
  "
  family
  " Newmarch "
  family
  "
  "
  personal
  " Jan "
  personal
  "
  "
  name
  "
  "
  email
  "
  jan@newmarch.name
  "
  email
  "
  "
  email
  "
  j.newmarch@boxhill.edu.au
  "
  email
  "
  "
  person
  "
  "

Note that as no DTD or other XML specification has been used, the tokenizer correctly reports all the white space. A DTD may specify that the whitespace can be ignored, but without it that assumption cannot be made.
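
If an application knows that whitespace-only text is not significant, it has to discard it itself. The following is one possible sketch, assuming that trimming leading and trailing whitespace is acceptable for the data in hand; the helper name textNodes and the sample document are made up for illustration.

```go
package main

import (
	"bytes"
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// textNodes (a hypothetical helper) returns the text of every CharData
// token in doc, trimmed, skipping tokens that hold only whitespace.
func textNodes(doc string) ([]string, error) {
	decoder := xml.NewDecoder(strings.NewReader(doc))
	var texts []string
	for {
		token, err := decoder.Token()
		if err == io.EOF {
			return texts, nil
		}
		if err != nil {
			return texts, err
		}
		if data, ok := token.(xml.CharData); ok {
			trimmed := bytes.TrimSpace(data)
			if len(trimmed) > 0 {
				// string() copies the bytes, so the result stays
				// valid after the next call to Token
				texts = append(texts, string(trimmed))
			}
		}
	}
}

func main() {
	// A small document invented for this example.
	doc := "<name>\n  <family> Newmarch </family>\n  <personal> Jan </personal>\n</name>\n"
	texts, err := textNodes(doc)
	if err != nil {
		fmt.Println("Error:", err)
		return
	}
	for _, text := range texts {
		fmt.Printf("%q\n", text)
	}
}
```

This is a decision made by the application, not by the parser: in a document with mixed content, the whitespace may matter and should not be trimmed.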

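There is a potential trap in using this parser. It re-uses its internal buffer for the byte slices in tokens, so the bytes in a token remain valid only until the next call to Token. Once you see a token you need to copy its value if you want to refer to it later. Go has methods such as

  func (c CharData) Copy() CharData

to make a copy of the data, and the function CopyToken does the same for any token.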