Chapter 3 Obtaining Data - 3.5 Querying Relational Databases - 《Data Science at the Command Line》

3.5 Querying Relational Databases

3.5 Querying Relational Databases

Most companies store their data in a relational database. Examples of relational databases are MySQL, PostgreSQL, and SQLite. These databases all have a slightly different way of interfacing with them. Some provide a command-line tool or a command-line interface, while others do not. Moreover, they are not very consistent when it comes to their usage and output.

Fortunately, there is a command-line tool called sql2csv, which is part of the Csvkit suite. Because it leverages the Python SQLAlchemy package, we only have to use one tool to execute queries on many different databases through a common interface, including MySQL, Oracle, PostgreSQL, SQLite, Microsoft SQL Server, and Sybase. The output of sql2csv is, as its name suggests, in CSV format.

We can obtain data from relational databases by executing a SELECT query on them. (sql2csv also support INSERT, UPDATE, and DELETE queries, but that’s not the purpose of this chapter.) To select a specific set of data from an SQLite database named iris.db, sql2csv can be invoked as follows:

$ sql2csv --db 'sqlite:///data/iris.db' --query 'SELECT * FROM iris '\
> 'WHERE sepal_length > 7.5'
sepal_length,sepal_width,petal_length,petal_width,species
7.6,3.0,6.6,2.1,Iris-virginica
7.7,3.8,6.7,2.2,Iris-virginica
7.7,2.6,6.9,2.3,Iris-virginica
7.7,2.8,6.7,2.0,Iris-virginica
7.9,3.8,6.4,2.0,Iris-virginica
7.7,3.0,6.1,2.3,Iris-virginica

Here, we are selecting all rows where sepal_length is larger than 7.5. The —db option specifies the database URL, of which the typical form is: dialect+driver://username:password@host:port/database.