Chapter 9 Modeling Data - 9.3 Dimensionality Reduction with Tapkee - 《Data Science at the Command Line》

9.3 Dimensionality Reduction with Tapkee

9.3 Dimensionality Reduction with Tapkee

The goal of dimensionality reduction is to map high-dimensional data points onto a lower dimensional mapping. The challenge is to keep similar data points close together on the lower-dimensional mapping. As we’ve seen in the previous section, our wine data set contained 13 features. We’ll stick with two dimensions because that’s straight forward to visualize.

Dimensionality reduction is often regarded as being part of exploring step. It’s useful for when there are too many features for plotting. You could do a scatter-plot matrix, but that only shows you two features at a time. It’s also useful as a pre-processing step for other machine learning algorithms.

Most dimensionality reduction algorithms are unsupervised. This means that they don’t employ the labels of the data points in order to construct the lower-dimensional mapping.

In this section we’ll look at two techniques: PCA, which stands for Principal Components Analysis (Pearson 1901) and t-SNE, which stands for t-distributed Stochastic Neighbor Embedding (Maaten and Hinton 2008).

9.3.1 Introducing Tapkee

Tapkee is a C++ template library for dimensionality reduction (Lisitsyn, Widmer, and Garcia 2013). The library contains implementations of many dimensionality reduction algorithms, including:

Locally Linear Embedding
Isomap
Multidimensional scaling
PCA
t-SNETapkee’s website: http://tapkee.lisitsyn.me/, contains more information about these algorithms. Although Tapkee is mainly a library that can be included in other applications, it also offers a command-line tool. We’ll use this to perform dimensionality reduction on our wine data set.

9.3.2 Installing Tapkee

If you aren’t running the Data Science Toolbox, you’ll need to download and compile Tapkee yourself. First make sure that you have CMake installed. On Ubuntu, you simply run:

$ sudo apt-get install cmake

Please consult Tapkee’s website for instructions for other operating systems. Then execute the following commands to download the source and compile it:

$ curl -sL https://github.com/lisitsyn/tapkee/archive/master.tar.gz > \
> tapkee-master.tar.gz
$ tar -xzf tapkee-master.tar.gz
$ cd tapkee-master
$ mkdir build && cd build
$ cmake ..
$ make

This creates a binary executable named tapkee.

9.3.3 Linear and Non-linear Mappings

First, we’ll scale the features using standardization such that each feature is equally important. This generally leads to better results when applying machine learning algorithms.

To scale we use a combination of cols and Rio:

$ < wine-both.csv cols -C type Rio -f scale > wine-both-scaled.csv

Now we apply both dimensionality reduction techniques and visualize the mapping using Rio-scatter:

$ < wine-both-scaled.csv cols -C type body tapkee --method pca |
> header -r x,y,type | Rio-scatter x y type |
> tee tapkee-wine-pca.png | display

Figure 9.1: PCA

$ < wine-both-scaled.csv cols -C type body tapkee --method t-sne |
> header -r x,y,type | Rio-scatter x y type |
> tee tapkee-wine-t-sne.png | display

Figure 9.2: t-SNE

Note that there’s not a single GNU core util (i.e., classic command-line tool) in this one-liner. Now that’s the power of the command line!