1.2 Data Science is OSEMN

The field of data science is still in its infancy, and as such, there exist various definitions of what it encompasses. Throughout this book we employ a very practical definition by Mason and Wiggins (2010). They define data science according to the following five steps: (1) obtaining data, (2) scrubbing data, (3) exploring data, (4) modeling data, and (5) interpreting data. Together, these steps form the OSEMN model (pronounced “awesome”). This definition serves as the backbone of this book because each step, except step 5 (interpreting data), has its own chapter. Below we explain what each step entails.

Although the five steps are discussed in a linear and incremental fashion, in practice it is very common to move back and forth between them or to perform multiple steps at the same time. Doing data science is an iterative and non-linear process. For example, once you have modeled your data and looked at the results, you may decide to go back to the scrubbing step to adjust the features of the data set.

1.2.1 Obtaining Data

Without any data, there is little data science you can do. So the first step is to obtain data. Unless you are fortunate enough to already possess data, you may need to do one or more of the following:

  • Download data from another location (e.g., a webpage or server)
  • Query data from a database or API (e.g., MySQL or Twitter)
  • Extract data from another file (e.g., an HTML file or spreadsheet)
  • Generate data yourself (e.g., reading sensors or taking surveys)

In Chapter 3, we discuss several methods for obtaining data using the command line. The obtained data will most likely be in plain text, CSV, JSON, or HTML/XML format. The next step is to scrub this data.

1.2.2 Scrubbing Data

It is not uncommon that the obtained data has missing values, inconsistencies, errors, weird characters, or uninteresting columns. In that case, you have to scrub, or clean, the data before you can do anything interesting with it. Common scrubbing operations include:

  • Filtering lines
  • Extracting certain columns
  • Replacing values
  • Extracting words
  • Handling missing values
  • Converting data from one format to another

While we data scientists love to create exciting data visualizations and insightful models (steps 3 and 4), usually much effort goes into obtaining and scrubbing the required data first (steps 1 and 2). In Data Jujitsu, Patil (2012) states that “80% of the work in any data project is in cleaning the data”. In Chapter 5, we demonstrate how the command line can help accomplish such data scrubbing operations.
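To give a rough idea of what such operations can look like, here is a minimal sketch that combines three of them (filtering lines, extracting columns, and handling missing values) using only standard Unix tools. The sample data, the variable names raw and scrubbed, and the choice to mark empty fields as NA are assumptions made for this illustration. Note that splitting on commas like this does not handle CSV files with quoted fields.

```shell
# Hypothetical sample data, for illustration only.
raw='id,name,city,score
1,alice,Amsterdam,7.5
2,bob,,6.0
# this is a comment line
3,carol,New York,'

# 1. Filter lines: drop lines that start with "#".
# 2. Extract columns: keep id, city, and score (fields 1, 3, and 4).
# 3. Handle missing values: replace empty fields with NA (an assumed convention).
scrubbed=$(printf '%s\n' "$raw" |
  grep -v '^#' |
  cut -d, -f1,3,4 |
  awk 'BEGIN { FS = OFS = "," } { for (i = 1; i <= NF; i++) if ($i == "") $i = "NA"; print }')

printf '%s\n' "$scrubbed"
```

Each tool in the pipeline performs one small operation, so individual steps can be added, removed, or reordered as the data requires.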

1.2.3 Exploring Data

Once you have scrubbed your data, you are ready to explore it. This is where it gets interesting, because here you really get to know your data. In Chapter 7 we show you how the command line can be used to:

  • Look at your data
  • Derive statistics from your data
  • Create interesting visualizations

Command-line tools introduced in Chapter 7 include: csvstat (Groskopf 2014a), feedgnuplot (Kogan 2014), and Rio (Janssens 2014e).
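The first two activities can also be approximated with standard Unix tools alone. The sketch below is only an illustration and does not use the tools named above: the sample data and the variable names data, stats, and counts are assumptions, and it relies on head, awk, cut, sort, and uniq rather than csvstat.

```shell
# Hypothetical sample data (city and temperature), for illustration only.
data='city,temp
Amsterdam,12
Berlin,15
Amsterdam,14
Paris,17
Berlin,13
Amsterdam,16'

# Look at the data: print the header and the first two records.
printf '%s\n' "$data" | head -n 3

# Derive statistics: count, minimum, maximum, and mean of the temp column.
stats=$(printf '%s\n' "$data" |
  awk -F, 'NR > 1 {
    v = $2 + 0                         # force numeric interpretation
    sum += v
    if (n == 0 || v < min) min = v
    if (n == 0 || v > max) max = v
    n++
  }
  END { printf "n=%d min=%d max=%d mean=%.1f", n, min, max, sum / n }')
printf '%s\n' "$stats"

# Count how often each city occurs, most frequent first.
counts=$(printf '%s\n' "$data" | tail -n +2 | cut -d, -f1 | sort | uniq -c | sort -rn)
printf '%s\n' "$counts"
```

Dedicated tools report far more than this, but a quick pipeline is often enough for a first look at a new data set.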

1.2.4 Modeling Data

If you want to explain the data or predict what will happen, you probably want to create a statistical model of your data. Techniques to create a model include clustering, classification, regression, and dimensionality reduction. The command line is not suitable for implementing a new model from scratch. It is, however, very useful to be able to build a model from the command line. In Chapter 9 we will introduce several command-line tools that either build a model locally or employ an API to perform the computation in the cloud.

1.2.5 Interpreting Data

The final and perhaps most important step in the OSEMN model is interpreting data. This step involves:

  • Drawing conclusions from your data
  • Evaluating what your results mean
  • Communicating your results

To be honest, the computer is of little use here, and the command line does not really come into play at this stage. Once you have reached this step, it is up to you. This is the only step in the OSEMN model that does not have its own chapter. Instead, we kindly refer you to Thinking with Data by Shron (2014).