Description: Parsing HTML, XML, and JSON in Ruby

Parsing HTML, XML, JSON

Generally speaking, the easiest way to parse HTML and XML in Ruby is the Nokogiri library.

  • To install Nokogiri
        gem install nokogiri

HTML

Here we’ll use Nokogiri to list the table of contents from http://rubyfu.net/content/

Using CSS selectors

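    require 'nokogiri'
    require 'open-uri'

    # URI.open is provided by open-uri; plain Kernel#open no longer opens URLs on Ruby 3.0+
    page = Nokogiri::HTML(URI.open("http://rubyfu.net/content/"))
    links = page.css(".book .book-summary ul.summary li a, .book .book-summary ul.summary li span")
    # Trim each entry, drop newlines, and collapse runs of spaces only
    links.each { |node| puts node.text.strip.gsub("\n", '').squeeze(' ') }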

Returns

    RubyFu
    Module 0x0 | Introduction
    0.1. Contribution
    0.2. Beginners
    0.3. Required Gems
    1. Module 0x1 | Basic Ruby Kung Fu
    1.1. String
    1.1.1. Conversion
    1.1.2. Extraction
    1.2. Array
    2. Module 0x2 | System Kung Fu
    2.1. Command Execution
    2.2. File manipulation
    2.2.1. Parsing HTML, XML, JSON
    2.3. Cryptography
    2.4. Remote Shell
    2.4.1. Ncat.rb
    2.5. VirusTotal
    3. Module 0x3 | Network Kung Fu
    3.1. Ruby Socket
    3.2. FTP
    3.3. SSH
    3.4. Email
    3.4.1. SMTP Enumeration
    3.5. Network Scanning
    .
    .
    ..snippet..
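
The text clean-up applied to each link can be tried on its own, without fetching a page. The sketch below assumes a made-up raw string and a hypothetical helper named tidy; neither comes from the page being scraped. It also shows why passing " " to String#squeeze matters: with no argument, squeeze collapses every run of repeated characters, including doubled letters inside words.

```ruby
# Raw text as it might come out of an HTML node: padded, with an embedded newline.
# (Illustrative value, not taken from the real page.)
raw = "  2.2.   File   manipulation\n  "

# With no argument, squeeze collapses doubled letters as well as spaces.
damaged = "Beginners".squeeze
puts damaged

# Hypothetical helper: trim, drop newlines, then collapse runs of spaces only.
def tidy(text)
  text.strip.gsub("\n", "").squeeze(" ")
end

puts tidy(raw)
puts tidy("0.2.  Beginners")
```

Restricting squeeze to spaces keeps entries such as "Beginners" or "SSH" intact while still normalizing the layout whitespace.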

XML

There are two ways we’d like to show here: REXML, which ships with Ruby (as a bundled gem since Ruby 3.0), and the external Nokogiri library.

We have the following XML file (file.xml)

    <?xml version="1.0"?>
    <collection shelf="New Arrivals">
      <movie title="Enemy Behind">
        <type>War, Thriller</type>
        <format>DVD</format>
        <year>2003</year>
        <rating>PG</rating>
        <stars>10</stars>
        <description>Talk about a US-Japan war</description>
      </movie>
      <movie title="Transformers">
        <type>Anime, Science Fiction</type>
        <format>DVD</format>
        <year>1989</year>
        <rating>R</rating>
        <stars>8</stars>
        <description>A scientific fiction</description>
      </movie>
      <movie title="Trigun">
        <type>Anime, Action</type>
        <format>DVD</format>
        <episodes>4</episodes>
        <rating>PG</rating>
        <stars>10</stars>
        <description>Vash the Stampede!</description>
      </movie>
      <movie title="Ishtar">
        <type>Comedy</type>
        <format>VHS</format>
        <rating>PG</rating>
        <stars>2</stars>
        <description>Viewable boredom</description>
      </movie>
    </collection>

REXML

    require 'rexml/document'
    include REXML

    # Read the XML file and build a document from its contents
    xmlfile = File.read("file.xml")
    xmldoc  = Document.new(xmlfile)

    # Get the root element
    root = xmldoc.root
    puts "Root element : " + root.attributes["shelf"]

    # List of movie titles
    xmldoc.elements.each("collection/movie") do |e|
      puts "Movie Title : " + e.attributes["title"]
    end

    # List of movie types
    xmldoc.elements.each("collection/movie/type") do |e|
      puts "Movie Type : " + e.text
    end

    # List of movie descriptions
    xmldoc.elements.each("collection/movie/description") do |e|
      puts "Movie Description : " + e.text
    end

    # List of movie stars
    xmldoc.elements.each("collection/movie/stars") do |e|
      puts "Movie Stars : " + e.text
    end
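
Beyond printing values one path at a time, it can be handy to collect each movie into a hash and filter on it. The following is a minimal sketch that assumes a trimmed inline copy of the XML (so it runs without file.xml) and uses REXML::XPath.match to select the movie elements; the variable names and the threshold of 8 stars are illustrative choices.

```ruby
require 'rexml/document'

# Trimmed inline copy of the sample data, so no file is needed.
xml = <<~XML_TEXT
  <collection shelf="New Arrivals">
    <movie title="Enemy Behind"><rating>PG</rating><stars>10</stars></movie>
    <movie title="Transformers"><rating>R</rating><stars>8</stars></movie>
    <movie title="Ishtar"><rating>PG</rating><stars>2</stars></movie>
  </collection>
XML_TEXT

doc = REXML::Document.new(xml)

# Build one hash per movie from its title attribute and child elements.
movies = REXML::XPath.match(doc, "//movie").map do |movie|
  {
    title:  movie.attributes["title"],
    rating: movie.elements["rating"].text,
    stars:  movie.elements["stars"].text.to_i
  }
end

# Keep only the titles with at least 8 stars (arbitrary threshold).
top_rated = movies.select { |m| m[:stars] >= 8 }.map { |m| m[:title] }
puts top_rated
```

Converting to plain hashes early keeps the rest of a script independent of the XML library in use.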

Nokogiri

    require 'nokogiri'

Slop

    require 'nokogiri'

    # Parse the XML file in Slop mode
    file = File.read("file.xml")
    doc  = Nokogiri::Slop(file)

    puts doc.search("type").map { |t| t.text }          # List of types
    puts doc.search("format").map { |f| f.text }        # List of formats
    puts doc.search("year").map { |y| y.text }          # List of years
    puts doc.search("rating").map { |r| r.text }        # List of ratings
    puts doc.search("stars").map { |s| s.text }         # List of stars
    puts doc.search("description").map { |d| d.text }   # List of descriptions

JSON

Assume you have a small vulnerability database in a JSON file (vulnerabilities.json) like the following

    {
      "Vulnerability":
      [
        {
          "name": "SQLi",
          "details:":
          {
            "full_name": "SQL injection",
            "description": "An injection attack wherein an attacker can execute malicious SQL statements",
            "references": [
              "https://www.owasp.org/index.php/SQL_Injection",
              "https://cwe.mitre.org/data/definitions/89.html"
            ],
            "type": "web"
          }
        }
      ]
    }

To parse it

    require 'json'
    vuln_json = JSON.parse(File.read('vulnerabilities.json'))

Returns a hash

    {"Vulnerability"=>
      [{"name"=>"SQLi",
        "details:"=>
         {"full_name"=>"SQL injection",
          "description"=>"An injection attack wherein an attacker can execute malicious SQL statements",
          "references"=>["https://www.owasp.org/index.php/SQL_Injection", "https://cwe.mitre.org/data/definitions/89.html"],
          "type"=>"web"}}]}

Now you can retrieve the data as you would with any hash

    vuln_json["Vulnerability"].each { |vuln| puts vuln['name'] }
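
Nested values can be reached the same way. The sketch below assumes a shortened inline copy of the record (so it runs without vulnerabilities.json) and uses Hash#dig, which returns nil instead of raising when a key along the path is missing. The "cvss" and "score" keys are hypothetical and only there to show the missing-key case.

```ruby
require 'json'

# Shortened inline copy of one record from the database.
raw = <<~JSON_TEXT
  {
    "Vulnerability": [
      {
        "name": "SQLi",
        "details:": {
          "full_name": "SQL injection",
          "references": ["https://cwe.mitre.org/data/definitions/89.html"],
          "type": "web"
        }
      }
    ]
  }
JSON_TEXT

vuln_json = JSON.parse(raw)
first = vuln_json["Vulnerability"].first

# The key is literally "details:" (with a trailing colon), as in the file.
full_name = first.dig("details:", "full_name")
first_ref = first.dig("details:", "references", 0)

# Hypothetical keys that do not exist: dig returns nil rather than raising.
missing = first.dig("details:", "cvss", "score")

puts full_name
puts first_ref
p missing
```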

If you want to add to this database, just create a hash with the same structure.

    xss = {
      "name" => "XSS",
      "details:" => {
        "full_name"   => "Cross Site Scripting",
        "description" => "A type of computer security vulnerability typically found in web applications",
        "references"  => ["https://www.owasp.org/index.php/Cross-site_Scripting_(XSS)", "https://cwe.mitre.org/data/definitions/79.html"],
        "type"        => "web"
      }
    }

You can convert it to JSON with the `.to_json` method

    xss.to_json
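
To actually extend the database, the new entry has to be appended to the list under "Vulnerability" and the whole structure serialized again. The following is a minimal sketch that assumes cut-down stand-ins for the hashes (the db variable is hypothetical) and uses JSON.pretty_generate for readable output; the write back to disk is left as a comment so the example has no side effects.

```ruby
require 'json'

# Cut-down stand-ins for the parsed database and the new entry.
db  = { "Vulnerability" => [{ "name" => "SQLi" }] }
xss = { "name" => "XSS", "details:" => { "full_name" => "Cross Site Scripting", "type" => "web" } }

# Append the new record to the existing list.
db["Vulnerability"] << xss

# pretty_generate emits indented JSON, easier to read and diff than to_json output.
json_text = JSON.pretty_generate(db)
puts json_text

# To persist the change, the text could be written back, for example:
# File.write('vulnerabilities.json', json_text)

# Parsing the generated text gives back an equal hash.
round_trip = JSON.parse(json_text)
names = round_trip["Vulnerability"].map { |v| v["name"] }
puts names
```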