Chapter 9 Modeling Data

In this chapter we’re going to perform the fourth and last step of the OSEMN model that we can do on a computer: modeling data. Generally speaking, to model data is to create an abstract or higher-level description of your data. Just like with creating visualizations, it’s like taking a step back from the individual data points.

However, visualizations, on the one hand, are characterized by shapes, positions, and colors such that we can interpret them by looking at them. Models, on the other hand, are internally characterized by a bunch of numbers, which means that computers can use them, for example, to make predictions about a new data points. (We can still visualize models so that we can try to understand them and see how they are performing.)

In this chapter we’ll consider four common types of algorithms to model data:

  • Dimensionality reduction.
  • Clustering.
  • Regression.
  • Classification.

These four algorithms come from the field of machine learning. As such, we’re going to change our vocabulary a bit. Let’s assume that we have a CSV file, also known as a data set. Each row, except for the header, is considered to be a data point. For simplicity we assume that each column that contains numerical values is an input feature. If a data point also contains a non-numerical field, such as the species column in the Iris data set, then that is known as the data point’s label.

The first two types of algorithms (dimensionality reduction and clustering) are most often unsupervised, which means that they create a model based on the features of the data set only. The last two types of algorithms (regression and classification) are by definition supervised algorithms, which means that they also incorporate the labels into the model.

This is by no means an introduction to machine learning. That implies that we must skim over many details. We strongly advise that you become familiar with an algorithm before applying it blindly to your data.