Chapter 10 Conclusion - 10.1 Let’s Recap - 《Data Science at the Command Line》

10.1 Let’s Recap

This book explored the power of using the command line to perform data science tasks. It is striking that the challenges posed by this relatively young field can be tackled with such a time-tested technology. We hope that you now see what the command line is capable of. Its many tools offer all sorts of possibilities that are well suited to the variety of tasks that data science encompasses.

There are many definitions of data science. In Chapter 1, we introduced the OSEMN model as defined by Mason and Wiggins, because it is a very practical one that translates to very specific tasks. The acronym OSEMN stands for obtaining, scrubbing, exploring, modeling, and interpreting data. Chapter 1 also explained why the command line is well suited to these data science tasks.

In Chapter 2, we explained how you can set up your own Data Science Toolbox and install the bundle that is associated with this book. Chapter 2 also provided an introduction to the essential tools and concepts of the command line.

The OSEMN model chapters—Chapter 3 (obtaining), Chapter 5 (scrubbing), Chapter 7 (exploring), and Chapter 9 (modeling)—focused on performing those practical tasks using the command line. We haven’t devoted a chapter to the fifth step, interpreting data, because, quite frankly, the computer, let alone the command line, is of very little use here. We have, however, provided some pointers for further reading on this topic.

In the three intermezzo chapters, we looked at some broader topics of doing data science at the command line that are not specific to any one step. In Chapter 4, we explained how you can turn one-liners and existing code into reusable command-line tools. In Chapter 6, we described how you can manage your data workflow using a command-line tool called Drake. In Chapter 8, we demonstrated how ordinary command-line tools and pipelines can be run in parallel using GNU Parallel. These topics can be applied at any point in your data workflow.

It is impossible to demonstrate all command-line tools that are available and relevant for doing data science. New command-line tools are created on a daily basis. As you may have come to understand by now, this book is more about the idea of using the command line than about giving you an exhaustive list of tools.