Parsing HTML, XML, and JSON in Ruby


Generally speaking, the easiest way to parse HTML and XML in Ruby is the Nokogiri library.

  • To install Nokogiri:
        gem install nokogiri


Here we’ll use Nokogiri to list the entries in our table of contents.

Using CSS selectors

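    require 'nokogiri'
    require 'open-uri'

    # The page URL is empty in the source; supply the target URL here.
    page = Nokogiri::HTML(URI.open(""))
    # Print each table-of-contents entry; squeeze(" ") collapses repeated spaces only.
    page.css(".book .book-summary ul.summary li a, .book .book-summary ul.summary li span").each { |node| puts node.text.strip.squeeze(" ").gsub("\n", '') }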


For XML, we’ll show two approaches: REXML from the standard library and the external Nokogiri library.

We have the following XML file, file.xml:

    <?xml version="1.0"?>
    <collection shelf="New Arrivals">
      <movie title="Enemy Behind">
        <type>War, Thriller</type>
        <format>DVD</format>
        <year>2003</year>
        <rating>PG</rating>
        <stars>10</stars>
        <description>Talk about a US-Japan war</description>
      </movie>
      <movie title="Transformers">
        <type>Anime, Science Fiction</type>
        <format>DVD</format>
        <year>1989</year>
        <rating>R</rating>
        <stars>8</stars>
        <description>A scientific fiction</description>
      </movie>
      <movie title="Trigun">
        <type>Anime, Action</type>
        <format>DVD</format>
        <episodes>4</episodes>
        <rating>PG</rating>
        <stars>10</stars>
        <description>Vash the Stampede!</description>
      </movie>
      <movie title="Ishtar">
        <type>Comedy</type>
        <format>VHS</format>
        <rating>PG</rating>
        <stars>2</stars>
        <description>Viewable boredom</description>
      </movie>
    </collection>


    require 'rexml/document'

    file = "file.xml"
    # Parse the file contents into a REXML document
    xmldoc = REXML::Document.new(File.read(file))

    # Get the root element
    root = xmldoc.root
    puts "Root element : " + root.attributes["shelf"]

    # List of movie titles
    xmldoc.elements.each("collection/movie") do |e|
      puts "Movie Title : " + e.attributes["title"]
    end

    # List of movie types
    xmldoc.elements.each("collection/movie/type") do |e|
      puts "Movie Type : " + e.text
    end

    # List of movie descriptions
    xmldoc.elements.each("collection/movie/description") do |e|
      puts "Movie Description : " + e.text
    end

    # List of movie stars
    xmldoc.elements.each("collection/movie/stars") do |e|
      puts "Movie Stars : " + e.text
    end
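
Not every movie in the file has the same child elements ("Trigun", for instance, has `<episodes>` but no `<year>`), so collecting fields per movie needs to tolerate missing children. The following is a minimal sketch using a trimmed inline copy of the file; the `movies` array and its hash keys are illustrative choices, not part of REXML.

```ruby
require 'rexml/document'

# A trimmed copy of the collection; "Trigun" has no <year> element.
xml = <<~XML
  <?xml version="1.0"?>
  <collection shelf="New Arrivals">
    <movie title="Enemy Behind">
      <year>2003</year>
      <stars>10</stars>
    </movie>
    <movie title="Trigun">
      <episodes>4</episodes>
      <stars>10</stars>
    </movie>
  </collection>
XML

xmldoc = REXML::Document.new(xml)

# Build one hash per movie; &. yields nil when a child element is absent.
movies = xmldoc.elements.to_a("collection/movie").map do |movie|
  {
    title: movie.attributes["title"],
    year:  movie.elements["year"]&.text,
    stars: movie.elements["stars"].text.to_i
  }
end

movies.each { |m| puts "#{m[:title]} (#{m[:year] || 'n/a'}): #{m[:stars]} stars" }
```

Grouping the fields by movie keeps each title together with its own values, which the separate per-element loops do not.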


    require 'nokogiri'

    # Parse the XML file (Slop takes the document contents, not a file name)
    file = "file.xml"
    doc = Nokogiri::Slop(File.read(file))

    puts doc.search("type").map { |t| t.text }          # List of types
    puts doc.search("format").map { |f| f.text }        # List of formats
    puts doc.search("year").map { |y| y.text }          # List of years
    puts doc.search("rating").map { |r| r.text }        # List of ratings
    puts doc.search("stars").map { |s| s.text }         # List of stars
    puts doc.search("description").map { |d| d.text }   # List of descriptions


Assume you have a small vulnerability database in a JSON file, vulnerabilities.json, as follows:

    {
      "Vulnerability":
      [
        {
          "name": "SQLi",
          "details:":
          {
            "full_name": "SQL injection",
            "description": "An injection attack wherein an attacker can execute malicious SQL statements",
            "references": [
              "",
              ""
            ],
            "type": "web"
          }
        }
      ]
    }

To parse it

    require 'json'
    # JSON.parse expects the JSON text itself, so read the file first
    vuln_json = JSON.parse(File.read('vulnerabilities.json'))

Returns a hash

    {"Vulnerability"=>
      [{"name"=>"SQLi",
        "details:"=>
         {"full_name"=>"SQL injection",
          "description"=>"An injection attack wherein an attacker can execute malicious SQL statements",
          "references"=>["", ""],
          "type"=>"web"}}]}

Now you can retrieve any data as you would with any other hash

    vuln_json["Vulnerability"].each { |vuln| puts vuln['name'] }
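
Nested values can be reached the same way. The sketch below parses a shortened inline copy of the database and uses `Hash#dig` to walk into it. Note that the key in this file is literally `"details:"`, with a trailing colon inside the string, so it has to be spelled that way; the variable names are illustrative.

```ruby
require 'json'

# Shortened inline copy of the database (reference URLs are empty, as in the file).
raw = <<~JSON
  {
    "Vulnerability": [
      {
        "name": "SQLi",
        "details:": {
          "full_name": "SQL injection",
          "references": ["", ""],
          "type": "web"
        }
      }
    ]
  }
JSON

vuln_json = JSON.parse(raw)

# dig walks hash keys and array indexes, returning nil if any step is missing.
full_name = vuln_json.dig("Vulnerability", 0, "details:", "full_name")
missing   = vuln_json.dig("Vulnerability", 0, "details", "full_name") # no colon, so nil

# Select entries by a nested field.
web_names = vuln_json["Vulnerability"]
              .select { |v| v.dig("details:", "type") == "web" }
              .map { |v| v["name"] }

puts full_name
puts missing.inspect
puts web_names.inspect
```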

If you want to add to this database, just create a hash with the same structure.

    xss = {"name"=>"XSS", "details:"=>{"full_name"=>"Cross-Site Scripting", "description"=>"A type of computer security vulnerability typically found in web applications", "references"=>["", ""], "type"=>"web"}}

You can convert it to JSON by calling the `.to_json` method

    xss.to_json
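
To actually extend the database, the new hash can be appended to the `"Vulnerability"` array and the whole structure written back to disk. The following is a minimal sketch with shortened entries; it writes to a temporary directory as a stand-in for the real vulnerabilities.json, and `reloaded` is an illustrative name.

```ruby
require 'json'
require 'tmpdir'

# Shortened stand-ins for the parsed database and the new entry.
vuln_json = { "Vulnerability" => [{ "name" => "SQLi", "details:" => { "type" => "web" } }] }
xss       = { "name" => "XSS", "details:" => { "type" => "web" } }

# Append the new entry to the in-memory database.
vuln_json["Vulnerability"] << xss

# Write the database out as indented JSON, then read it back to check the round trip.
reloaded = Dir.mktmpdir do |dir|
  path = File.join(dir, "vulnerabilities.json") # temporary stand-in for the real file
  File.write(path, JSON.pretty_generate(vuln_json))
  JSON.parse(File.read(path))
end

puts reloaded["Vulnerability"].map { |v| v["name"] }.inspect
```

`JSON.pretty_generate` produces indented output that is easier to review by hand than the single-line string returned by `to_json`.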