Chapter 3 Obtaining Data - 3.4 Converting Microsoft Excel Spreadsheets - 《Data Science at the Command Line》

3.4 Converting Microsoft Excel Spreadsheets

3.4 Converting Microsoft Excel Spreadsheets

For many people, Microsoft Excel offers an intuitive way to work with small data sets and perform calculations on them. As a result, a lot of data is embedded into Microsoft Excel spreadsheets. These spreadsheets are, depending on the extension of the filename, stored in either a proprietary binary format (.xls) or as a collection of compressed XML files (.xlsx). In both cases, the data is not readily usable by most command-line tools. It would be a shame if we could not use those valuable data sets just because they are stored this way.

Luckily, there is a command-line tool called in2csv (Groskopf 2014 b), which is able to convert Microsoft Excel spreadsheets to CSV files. CSV stands for comma-separated values. Working with CSV can be tricky because it lacks a formal specification. RFC 4180 defines the CSV format according to the following three points:

Each record is located on a separate line, delimited by a line break (CRLF). For example:

aaa,bbb,ccc CRLF
zzz,yyy,xxx CRLF

The last record in the file may or may not have an ending line break. For example:

aaa,bbb,ccc CRLF
zzz,yyy,xxx

There maybe an optional header line appearing as the first line of the file with the same format as normal record lines. This header will contain names corresponding to the fields in the file and should contain the same number of fields as the records in the rest of the file (the presence or absence of the header line should be indicated via the optional header parameter of this MIME type). For example:

field_name,field_name,field_name CRLF
aaa,bbb,ccc CRLF
zzz,yyy,xxx CRLF

Let’s demonstrate in2csv using a spreadsheet that contains the top 250 movies from the Internet Movie Database (IMDb). The file is named imdb-250.xlsx and can be obtained from http://www.overthinkingit.com/2011/10/11/imdb-top-250-movies-4th-edition/2. To extract its data, we invoke in2csv as follows:

$ cd book/ch03
$ in2csv data/imdb-250.xlsx > data/imdb-250.csv

The format of the file is automatically determined by the extension, .xlsx in this case. If we were to pipe the data into in2csv, we would have to specify the format explicitly. Let’s look at the data:

$ in2csv imdb-250.xlsx | head | cut -c1-80
Title,title trim,Year,Rank,Rank (desc),Rating,New in 2011 from 2010?,2010 rank,R
Sherlock Jr. (1924),SherlockJr.(1924),1924,221,30,8,y,n/a,n/a,
The Passion of Joan of Arc (1928),ThePassionofJoanofArc(1928),1928,212,39,8,y,n/
His Girl Friday (1940),HisGirlFriday(1940),1940,250,1,8,y,n/a,n/a,
Tokyo Story (1953),TokyoStory(1953),1953,248,3,8,y,n/a,n/a,
The Man Who Shot Liberty Valance (1962),TheManWhoShotLibertyValance(1962),1962,2
Persona (1966),Persona(1966),1966,200,51,8,y,n/a,n/a,
Stalker (1979),Stalker(1979),1979,243,8,8,y,n/a,n/a,
Fanny and Alexander (1982),FannyandAlexander(1982),1982,210,41,8,y,n/a,n/a,
Beauty and the Beast (1991),BeautyandtheBeast(1991),1991,249,2,8,y,n/a,n/a,

As you can see, CSV by default is not too readable. You can pipe the data to a tool called csvlook (Groskopf 2014 d), which will nicely format the data into a table. Here, we’ll display a subset of the columns using csvcut such that the table fits on the page:

$ in2csv data/imdb-250.xlsx | head | csvcut -c Title,Year,Rating | csvlook
|------------------------------------------+------+---------|
|  Title                                   | Year | Rating  |
|------------------------------------------+------+---------|
|  Sherlock Jr. (1924)                     | 1924 | 8       |
|  The Passion of Joan of Arc (1928)       | 1928 | 8       |
|  His Girl Friday (1940)                  | 1940 | 8       |
|  Tokyo Story (1953)                      | 1953 | 8       |
|  The Man Who Shot Liberty Valance (1962) | 1962 | 8       |
|  Persona (1966)                          | 1966 | 8       |
|  Stalker (1979)                          | 1979 | 8       |
|  Fanny and Alexander (1982)              | 1982 | 8       |
|  Beauty and the Beast (1991)             | 1991 | 8       |
|------------------------------------------+------+---------|

A spreadsheet can contain multiple worksheets. By default, in2csv extracts the first worksheet. To extract a different worksheet, you need to pass the name of worksheet to the —sheet option.

The tools in2csv, csvcut, and csvlook are actually part of Csvkit, which is collection of command-line tools to work with CSV data. Csvkit will be used quite often in this book because it has so many valuable tools. If you’re running the Data Science Toolbox, you already have Csvkit installed. Otherwise, see the appendix for instructions on how to install it.

An alternative approach to in2csv is to open the spreadsheet in Microsoft Excel or an open source variant such as LibreOffice Calc, and manually export it to CSV. While this works as a one-off solution, the disadvantage is that it does not scale well to multiple files and is not automatable. Furthermore, when you are working on the command line of a remote server, chances are that you don’t have such an application available.