Example program: list links

This example program demonstrates how to fetch a page from a URL; extract links, images, and other pointers; and examine their URLs and text.

Specify the URL to fetch as the program's sole argument.

    package org.jsoup.examples;

    import org.jsoup.Jsoup;
    import org.jsoup.helper.Validate;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    import java.io.IOException;

    /**
     * Example program to list links from a URL.
     */
    public class ListLinks {
        public static void main(String[] args) throws IOException {
            Validate.isTrue(args.length == 1, "usage: supply url to fetch");
            String url = args[0];
            print("Fetching %s...", url);

            Document doc = Jsoup.connect(url).get();
            Elements links = doc.select("a[href]");      // anchors that have an href
            Elements media = doc.select("[src]");        // any element with a src attribute
            Elements imports = doc.select("link[href]"); // stylesheets, icons, and similar

            print("\nMedia: (%d)", media.size());
            for (Element src : media) {
                if (src.normalName().equals("img"))
                    print(" * %s: <%s> %sx%s (%s)",
                            src.tagName(), src.attr("abs:src"), src.attr("width"), src.attr("height"),
                            trim(src.attr("alt"), 20));
                else
                    print(" * %s: <%s>", src.tagName(), src.attr("abs:src"));
            }

            print("\nImports: (%d)", imports.size());
            for (Element link : imports) {
                print(" * %s <%s> (%s)", link.tagName(), link.attr("abs:href"), link.attr("rel"));
            }

            print("\nLinks: (%d)", links.size());
            for (Element link : links) {
                print(" * a: <%s> (%s)", link.attr("abs:href"), trim(link.text(), 35));
            }
        }

        private static void print(String msg, Object... args) {
            System.out.println(String.format(msg, args));
        }

        // Shortens s to at most width characters, ending with "." when cut.
        private static String trim(String s, int width) {
            if (s.length() > width)
                return s.substring(0, width - 1) + ".";
            else
                return s;
        }
    }

org/jsoup/examples/ListLinks.java
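
The program reads every URL through the `abs:` attribute prefix (`abs:src`, `abs:href`), which asks jsoup to resolve the attribute value against the document's base URI so that relative links print as absolute URLs. As a rough analogy only, the sketch below does the same kind of resolution with `java.net.URI` from the standard library. It is not jsoup's implementation; the class name `AbsUrlSketch`, the `resolve` helper, and the sample attribute values are hypothetical.

```java
import java.net.URI;

// Standard-library analogy for what the "abs:" prefix achieves.
// AbsUrlSketch and resolve are hypothetical names, not part of jsoup.
class AbsUrlSketch {
    /** Resolves an attribute value against the page's base URI. */
    static String resolve(String baseUri, String attrValue) {
        return URI.create(baseUri).resolve(attrValue).toString();
    }

    public static void main(String[] args) {
        String base = "http://news.ycombinator.com/";
        // Hypothetical attribute values as they might appear in a page's source.
        System.out.println(resolve(base, "newest")); // relative path
        System.out.println(resolve(base, "/rss"));   // root-relative path
        System.out.println(resolve(base, "http://ycombinator.com/images/y18.gif")); // already absolute
    }
}
```

Relative and root-relative values are joined to the base, while a value that is already absolute comes back unchanged.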

Example output (trimmed)

    Fetching http://news.ycombinator.com/...

    Media: (38)
     * img: <http://ycombinator.com/images/y18.gif> 18x18 ()
     * img: <http://ycombinator.com/images/s.gif> 10x1 ()
     * img: <http://ycombinator.com/images/grayarrow.gif> x ()
     * img: <http://ycombinator.com/images/s.gif> 0x10 ()
     * script: <http://www.co2stats.com/propres.php?s=1138>
     * img: <http://ycombinator.com/images/s.gif> 15x1 ()
     * img: <http://ycombinator.com/images/hnsearch.png> x ()
     * img: <http://ycombinator.com/images/s.gif> 25x1 ()
     * img: <http://mixpanel.com/site_media/images/mixpanel_partner_logo_borderless.gif> x (Analytics by Mixpan.)

    Imports: (2)
     * link <http://ycombinator.com/news.css> (stylesheet)
     * link <http://ycombinator.com/favicon.ico> (shortcut icon)

    Links: (141)
     * a: <http://ycombinator.com> ()
     * a: <http://news.ycombinator.com/news> (Hacker News)
     * a: <http://news.ycombinator.com/newest> (new)
     * a: <http://news.ycombinator.com/newcomments> (comments)
     * a: <http://news.ycombinator.com/leaders> (leaders)
     * a: <http://news.ycombinator.com/jobs> (jobs)
     * a: <http://news.ycombinator.com/submit> (submit)
     * a: <http://news.ycombinator.com/x?fnid=JKhQjfU7gW> (login)
     * a: <http://news.ycombinator.com/vote?for=1094578&dir=up&whence=%6e%65%77%73> ()
     * a: <http://www.readwriteweb.com/archives/facebook_gets_faster_debuts_homegrown_php_compiler.php?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+readwriteweb+%28ReadWriteWeb%29&utm_content=Twitter> (Facebook speeds up PHP)
     * a: <http://news.ycombinator.com/user?id=mcxx> (mcxx)
     * a: <http://news.ycombinator.com/item?id=1094578> (9 comments)
     * a: <http://news.ycombinator.com/vote?for=1094649&dir=up&whence=%6e%65%77%73> ()
     * a: <http://groups.google.com/group/django-developers/msg/a65fbbc8effcd914> ("Tough. Django produces XHTML.")
     * a: <http://news.ycombinator.com/user?id=andybak> (andybak)
     * a: <http://news.ycombinator.com/item?id=1094649> (3 comments)
     * a: <http://news.ycombinator.com/vote?for=1093927&dir=up&whence=%6e%65%77%73> ()
     * a: <http://news.ycombinator.com/x?fnid=p2sdPLE7Ce> (More)
     * a: <http://news.ycombinator.com/lists> (Lists)
     * a: <http://news.ycombinator.com/rss> (RSS)
     * a: <http://ycombinator.com/bookmarklet.html> (Bookmarklet)
     * a: <http://ycombinator.com/newsguidelines.html> (Guidelines)
     * a: <http://ycombinator.com/newsfaq.html> (FAQ)
     * a: <http://ycombinator.com/newsnews.html> (News News)
     * a: <http://news.ycombinator.com/item?id=363> (Feature Requests)
     * a: <http://ycombinator.com> (Y Combinator)
     * a: <http://ycombinator.com/w2010.html> (Apply)
     * a: <http://ycombinator.com/lib.html> (Library)
     * a: <http://www.webmynd.com/html/hackernews.html> ()
     * a: <http://mixpanel.com/?from=yc> ()
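
The text in parentheses at the end of each image and link line passes through the program's `trim` helper, which is why an entry such as `(Analytics by Mixpan.)` ends in a period. The sketch below copies that helper into a standalone class so its behavior can be checked without fetching a page. The class name `TrimSketch` and the sample inputs are assumptions made for illustration; in particular, the full alt text of the Mixpanel image is a guess.

```java
// Standalone copy of the trim helper from ListLinks.
// TrimSketch is a hypothetical name used only for this illustration.
class TrimSketch {
    /** Returns s unchanged if it fits in width; otherwise cuts it to width characters, the last being ".". */
    static String trim(String s, int width) {
        if (s.length() > width)
            return s.substring(0, width - 1) + ".";
        else
            return s;
    }

    public static void main(String[] args) {
        // Assumed alt text, 21 characters long, cut to 19 characters plus "."
        System.out.println(trim("Analytics by Mixpanel", 20)); // Analytics by Mixpan.
        // Short link text passes through untouched
        System.out.println(trim("Hacker News", 35));           // Hacker News
        // Empty text stays empty, which prints as "()" in the listing
        System.out.println(trim("", 35).isEmpty());            // true
    }
}
```

A trimmed result is never longer than `width` characters, since the helper keeps `width - 1` characters of the original and appends a single period.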