Use DOM methods to navigate a document

Problem

You have a HTML document that you want to extract data from. You know generally the structure of the HTML document.

Solution

Use the DOM-like methods available after parsing HTML into a Document.

  1. File input = new File("/tmp/input.html");
  2. Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
  3. Element content = doc.getElementById("content");
  4. Elements links = content.getElementsByTag("a");
  5. for (Element link : links) {
  6. String linkHref = link.attr("href");
  7. String linkText = link.text();
  8. }

Description

Elements provide a range of DOM-like methods to find elements, and extract and manipulate their data. The DOM getters are contextual: called on a parent Document they find matching elements under the document; called on a child element they find elements under that child. In this way you can winnow in on the data you want.

Finding elements

Element data

Manipulating HTML and text