3.6 Downloading from the Internet

The Internet is without a doubt the largest resource for data. This data comes in various forms and is served over various protocols. The command-line tool cURL (Stenberg 2012) can be considered the command line’s Swiss Army knife when it comes to downloading data from the Internet.

When you access a URL, which stands for uniform resource locator, through your browser, the downloaded data is interpreted for you. For example, an HTML file is rendered as a web page, an MP3 file may be played automatically, and a PDF file may be downloaded or opened by a viewer. When you use cURL to access a URL, however, the data is downloaded as is and printed to standard output. Other command-line tools may then be used to process this data further.
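You can see this for yourself without a network connection by pointing cURL at a local file using a file:// URL and piping the result into another tool. This is only an illustrative sketch; the file name and path are made up:

```shell
# Create a small local file to act as the "remote" resource
# (the path /tmp/lines.txt is just an example).
printf 'one\ntwo\nthree\n' > /tmp/lines.txt

# curl prints the downloaded data to standard output,
# so it can be piped into any other command-line tool.
curl -s file:///tmp/lines.txt | wc -l
```

Here wc -l counts the lines that curl downloaded, which shows that curl's output behaves just like that of any other command-line tool.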

The easiest invocation of cURL is to simply specify a URL as a command-line argument. For example, to download the book Adventures of Huckleberry Finn by Mark Twain from Project Gutenberg, we can run the following command:

  $ curl -s http://www.gutenberg.org/files/76/76-0.txt | head -n 10
  The Project Gutenberg EBook of Adventures of Huckleberry Finn, Complete
  by Mark Twain (Samuel Clemens)
  This eBook is for the use of anyone anywhere at no cost and with almost
  no restrictions whatsoever. You may copy it, give it away or re-use
  it under the terms of the Project Gutenberg License included with this
  eBook or online at www.gutenberg.net

By default, cURL outputs a progress meter that shows the download rate and the expected time of completion. If you are piping the output directly into another command-line tool, such as head, be sure to specify the -s command-line argument, which stands for silent, so that the progress meter is disabled. Compare, for example, the output of the previous command with that of the following one:

  $ curl http://www.gutenberg.org/files/76/76-0.txt | head -n 10
    % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                   Dload  Upload   Total   Spent    Left  Speed
    0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  The Project Gutenberg EBook of Adventures of Huckleberry Finn, Complete
  by Mark Twain (Samuel Clemens)
  This eBook is for the use of anyone anywhere at no cost and with almost
  no restrictions whatsoever. You may copy it, give it away or re-use
  it under the terms of the Project Gutenberg License included with this
  eBook or online at www.gutenberg.net

Note that the output of the second command, where we do not disable the progress meter, contains the progress meter mixed in with the data, and possibly an error message because head closes the pipe after ten lines. The data is printed to standard output, whereas the progress meter and error messages are printed to standard error. For this reason, if you save the data to a file, you do not necessarily need to specify the -s option:

  $ curl http://www.gutenberg.org/files/76/76-0.txt > data/finn.txt
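Because the data goes to standard output and the progress meter, together with any error messages, goes to standard error, redirecting standard output captures the data only. The following offline sketch, which uses made-up file:// URLs and paths under /tmp, makes the separation visible. It also shows -S (or --show-error), which re-enables error messages when -s is used:

```shell
# Create a local file to "download" (the path is just an example).
printf 'hello\n' > /tmp/example.txt

# The data goes to standard output and the progress meter to standard
# error, so redirecting standard output captures the data only.
curl file:///tmp/example.txt > /tmp/copy.txt 2> /tmp/meter.txt

# -s silences both the progress meter and error messages;
# adding -S (--show-error) brings back the error messages.
curl -sS file:///tmp/no-such-file 2> /tmp/error.txt || true
```

After running this, /tmp/copy.txt is an exact copy of the data, while the error message about the missing file ends up in /tmp/error.txt.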

You can also save the data by explicitly specifying the output file with the -o option:

  $ curl -s http://www.gutenberg.org/files/76/76-0.txt -o data/finn.txt

When downloading data from the Internet, the URL will most likely use the HTTP or HTTPS protocol. Downloading from an FTP server, where FTP stands for File Transfer Protocol, works in exactly the same way. When the URL is password protected, you can specify a username and a password as follows:

  $ curl -u username:password ftp://host/file

If the specified URL is a directory, curl will list the contents of that directory.

When you access a shortened URL, such as one that starts with http://bit.ly/ or http://t.co/, your browser automatically redirects you to the correct location. With curl, however, you need to specify the -L or --location option in order to be redirected:

  $ curl -L j.mp/locatbbar

If you do not specify the -L or --location option, you may get something like:

  $ curl j.mp/locatbbar
  <html>
  <head>
  <title>bit.ly</title>
  </head>
  <body>
  <a href="http://en.wikipedia.org/wiki/List_of_countries_and_territories_by_border/area_ratio">moved here</a>
  </body>
By specifying the -I or --head option, curl fetches only the HTTP header of the response:

  $ curl -I j.mp/locatbbar
  HTTP/1.1 301 Moved Permanently
  Server: nginx
  Date: Wed, 21 May 2014 18:50:28 GMT
  Content-Type: text/html; charset=utf-8
  Connection: keep-alive
  Cache-Control: private; max-age=90
  Content-Length: 175
  Location: http://en.wikipedia.org/wiki/List_of_countries_and_territories_by_bo
  Mime-Version: 1.0
  Set-Cookie: _bit=537cf574-002ba-07d79-2e1cf10a;domain=.j.mp;expires=Mon Nov 17

The first line indicates the HTTP status code, which is 301 (moved permanently) in this case. You can also see the location this URL redirects to: http://en.wikipedia.org/wiki/List_of_countries_and_territories_by_border/area_ratio. Inspecting the header and getting the status code is a useful debugging tool in case curl does not give you the expected result. Other common HTTP status codes include 404 (not found) and 403 (forbidden). This page lists all HTTP status codes: http://en.wikipedia.org/wiki/List_of_HTTP_status_codes.
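If you are only interested in the status code itself, curl can print it on its own with the -w (or --write-out) option. The sketch below is built on assumptions: it assumes Python 3 is available to serve the current directory on an arbitrarily chosen port (8913), purely so that there is something local to talk to:

```shell
# Start a throwaway web server on localhost (assumes Python 3 is installed).
python3 -m http.server 8913 > /dev/null 2>&1 &
server_pid=$!
sleep 1

# -o /dev/null discards the body; -w '%{http_code}' prints the status code.
ok=$(curl -s -o /dev/null -w '%{http_code}' http://localhost:8913/)
missing=$(curl -s -o /dev/null -w '%{http_code}' http://localhost:8913/no-such-page)

kill "$server_pid"
echo "$ok $missing"
```

The first request should report 200 (OK) and the second 404 (not found), which makes this a quick way to check URLs from a script.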

To conclude this section, cURL is a straightforward command-line tool for downloading data from the Internet. Its three most common command-line arguments are -s to suppress the progress meter, -u to specify a username and password, and -L to automatically follow redirects. See its man page for more information.