7.4 Creating Visualizations

In this section we discuss how to create visualizations at the command line. We’ll be looking at two software packages: Gnuplot and ggplot2. First, we’ll introduce both packages. Then, we’ll demonstrate how to create several types of visualizations using both of them.

7.4.1 Introducing Gnuplot and Feedgnuplot

The first software package for creating visualizations that we discuss in this chapter is Gnuplot. Gnuplot has been around since 1986. Despite its age, its visualization capabilities are quite extensive, so it’s impossible to do it justice here. There are other good resources available, including Gnuplot in Action by Janert (2009).

To demonstrate its flexibility (and its archaic notation), consider Example 7.2, which is copied from the Gnuplot website (http://gnuplot.sourceforge.net/demo/histograms.6.gnu).

Example 7.2 (Creating a histogram using Gnuplot)

    # set terminal pngcairo transparent enhanced font "arial,10" fontscale 1.0 size
    # set output 'histograms.6.png'
    set border 3 front linetype -1 linewidth 1.000
    set boxwidth 0.75 absolute
    set style fill solid 1.00 border lt -1
    set grid nopolar
    set grid noxtics nomxtics ytics nomytics noztics nomztics \
    nox2tics nomx2tics noy2tics nomy2tics nocbtics nomcbtics
    set grid layerdefault linetype 0 linewidth 1.000, linetype 0 linewidth 1.000
    set key outside right top vertical Left reverse noenhanced autotitles columnhead
    set style histogram columnstacked title offset character 0, 0, 0
    set datafile missing '-'
    set style data histograms
    set xtics border in scale 1,0.5 nomirror norotate offset character 0, 0, 0 auto
    set xtics norangelimit
    set xtics ()
    set ytics border in scale 0,0 mirror norotate offset character 0, 0, 0 autojust
    set ztics border in scale 0,0 nomirror norotate offset character 0, 0, 0 autoju
    set cbtics border in scale 0,0 mirror norotate offset character 0, 0, 0 autojus
    set rtics axis in scale 0,0 nomirror norotate offset character 0, 0, 0 autojust
    set title "Immigration from Northern Europe\n(columstacked histogram)"
    set xlabel "Country of Origin"
    set ylabel "Immigration by decade"
    set yrange [ 0.00000 : * ] noreverse nowriteback
    i = 23
    plot 'immigration.dat' using 6 ti col, '' using 12 ti col, '' using 13 ti c

Note that the listing is trimmed to 80 characters wide, so some lines are cut off. The script generates the following image:

Figure 7.1: Immigration Plot by Gnuplot

Gnuplot is different from most command-line tools we’ve been using for two reasons. First, it uses a script instead of command-line arguments. Second, its output is typically written to a file rather than printed to standard output.

One great advantage of Gnuplot being around for so long, and the main reason we’ve included it in this book, is that it’s able to produce visualizations for the command line. That is, it’s able to print its output to the terminal without the need for a graphical user interface (GUI). Even then, you would need to set up a script.

Luckily, there is a command-line tool called feedgnuplot (Kogan 2014), which can help us with setting up a script for Gnuplot. feedgnuplot is entirely configurable through command-line arguments. Plus, it reads from standard input. After we have introduced ggplot2, we’re going to create a few visualizations using feedgnuplot.

One great feature of feedgnuplot that we would like to mention here is that it allows you to plot streaming data. The following is a snapshot of a continuously updated plot based on random input data:

    $ while true; do echo $RANDOM; done | sample -d 10 | feedgnuplot --stream \
    > --terminal 'dumb 80,25' --lines --xlen 10
    30000 ++-----+------------+-------------+-------------+------------+-----++
    | + * + + + |
    | : ** : ******* : *
    25000 ++.................*.*..........................*.....*............+*
    | : *: * : *: * : *|
    | : *: * : *: * : *|
    | : * : * : * : * : * |
    20000 ++................*....*......................*.........*.........*++
    | : * : * : * : * : * |
    | : * : * : * : * : * |
    15000 ++....**.........*.......*..................*............*.......*.++
    | **** :* * : * : * : * : * |
    ** :* * : * **** * : * : * |
    10000 ++.......*......*.........*....**....*.....*..............*.....*..++
    | : * * : * ** : * * : * : * |
    | : * * : ** : ** * : * : * |
    | : * * : : * : * : * |
    5000 ++..........*..*.........................*..................*.*....++
    | : * * : : : *:* |
    | + ** + + + * |
    0 ++-----+------*-----+-------------+-------------+------------*-----++
    2350 2352 2354 2356 2358

7.4.2 Introducing ggplot2

A more modern software package for creating visualizations is ggplot2, which is an implementation of the grammar of graphics in R (Wickham 2009).

Thanks to the grammar of graphics and using sensible defaults, ggplot2 commands tend to be very short and expressive. When used through Rio, this is a very convenient way of creating visualizations from the command line.

To demonstrate its expressiveness, we’ll recreate the histogram plot generated above by Gnuplot, with the help of Rio. Because Rio expects the data set to be comma-delimited, and because ggplot2 expects the data in long format, we first need to scrub and transform the data a little bit:

    $ < data/immigration.dat sed -re '/^#/d;s/\t/,/g;s/,-,/,0,/g;s/Region/'\
    > 'Period/' | tee data/immigration.csv | head | cut -c1-80
    Period,Austria,Hungary,Belgium,Czechoslovakia,Denmark,France,Germany,Greece,Irel
    1891-1900,234081,181288,18167,0,50231,30770,505152,15979,388416,651893,26758,950
    1901-1910,668209,808511,41635,0,65285,73379,341498,167519,339065,2045877,48262,1
    1911-1920,453649,442693,33746,3426,41983,61897,143945,184201,146181,1109524,4371
    1921-1930,32868,30680,15846,102194,32430,49610,412202,51084,211234,455315,26948,
    1931-1940,3563,7861,4817,14393,2559,12623,144058,9119,10973,68028,7150,4740,3960
    1941-1950,24860,3469,12189,8347,5393,38809,226578,8973,19789,57661,14860,10100,1
    1951-1960,67106,36637,18575,918,10984,51121,477765,47608,43362,185491,52277,2293
    1961-1970,20621,5401,9192,3273,9201,45237,190796,85969,32966,214111,30606,15484,

The sed expression consists of four parts, delimited by semicolons:

  • Remove lines that start with #.

  • Convert tabs to commas.

  • Change dashes (missing values) into zeros.

  • Change the feature name Region into Period.
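To see the four substitutions at work without the real data file, they can be tried on a tiny made-up sample. The three input lines below are hypothetical and only mimic the shape of immigration.dat (a comment line, tab-separated fields, and a dash for a missing value); the sketch also assumes GNU sed, which understands \t in a pattern.

```shell
# Hypothetical sample in the shape of immigration.dat: a '#' comment line,
# tab-separated fields, and '-' marking a missing value.
sample=$(printf '# comment line\nRegion\tAustria\tCzechoslovakia\tDenmark\n1891-1900\t234081\t-\t50231\n')

# The same four parts as in the text, written one per -e for readability.
# Assumes GNU sed (for \t in the pattern).
cleaned=$(printf '%s\n' "$sample" | sed -E \
  -e '/^#/d' \
  -e 's/\t/,/g' \
  -e 's/,-,/,0,/g' \
  -e 's/Region/Period/')

printf '%s\n' "$cleaned"
```

Keep in mind that s/,-,/,0,/g only replaces a dash that has a comma on both sides. A missing value in the last column, or two missing values next to each other, would not be fully converted, which may or may not matter for a given data set.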

We then select only the columns that matter using csvcut and subsequently convert the data from a wide format to a long one using Rio and the melt function, which is part of the R package reshape2:

    $ < data/immigration.csv csvcut -c Period,Denmark,Netherlands,Norway,\
    > Sweden | Rio -re 'melt(df, id="Period", variable.name="Country", '\
    > 'value.name="Count")' | tee data/immigration-long.csv | head | csvlook
    |------------+-------------+--------|
    |  Period    | Country     | Count  |
    |------------+-------------+--------|
    |  1891-1900 | Denmark     | 50231  |
    |  1901-1910 | Denmark     | 65285  |
    |  1911-1920 | Denmark     | 41983  |
    |  1921-1930 | Denmark     | 32430  |
    |  1931-1940 | Denmark     | 2559   |
    |  1941-1950 | Denmark     | 5393   |
    |  1951-1960 | Denmark     | 10984  |
    |  1961-1970 | Denmark     | 9201   |
    |  1891-1900 | Netherlands | 26758  |
    |------------+-------------+--------|
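To make the wide-to-long conversion more concrete, here is a rough awk approximation of what the melt call produces. It is only a sketch and not how Rio or reshape2 work internally: the function name wide_to_long is hypothetical, the output column names are hard-coded, and it assumes simple CSV without quoted fields. The sample values are taken from the first two rows shown above.

```shell
# wide_to_long: hypothetical helper. Reads simple CSV (no quoted fields)
# whose first column is the id and prints id,Country,Count rows.
wide_to_long() {
  awk -F, -v OFS=, '
    NR == 1 {
      # Remember the column names and print the new header.
      for (c = 2; c <= NF; c++) name[c] = $c
      ncol = NF
      print $1, "Country", "Count"
      next
    }
    { nrow++; for (c = 1; c <= NF; c++) cell[nrow, c] = $c }
    END {
      # Emit one block of rows per original column, as in the melt output.
      for (c = 2; c <= ncol; c++)
        for (r = 1; r <= nrow; r++)
          print cell[r, 1], name[c], cell[r, c]
    }'
}

long=$(printf 'Period,Denmark,Netherlands\n1891-1900,50231,26758\n1901-1910,65285,48262\n' | wide_to_long)
printf '%s\n' "$long"
```

Each wide row is split into one row per country, so two periods and two countries yield four data rows after the header.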

Now we can use Rio again, this time with an expression that builds up a ggplot2 visualization:

    $ < data/immigration-long.csv Rio -ge 'g + geom_bar(aes(Country, Count,'\
    > ' fill=Period), stat="identity") + scale_fill_brewer(palette="Set1") '\
    > '+ labs(x="Country of origin", y="Immigration by decade", title='\
    > '"Immigration from Northern Europe\n(columstacked histogram)")' | display

Figure 7.2: Immigration plot by Rio and ggplot2

The -g command-line argument indicates that Rio should load the ggplot2 package. The output is an image in PNG format. You can either view the PNG image via display, which is part of ImageMagick (LLC 2009), or redirect the output to a PNG file. If you’re on a remote terminal, you probably won’t be able to see any graphics. A workaround for this is to start a webserver from a particular directory:

    $ python -m SimpleHTTPServer 8000

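Make sure that you have access to the port (8000 in this case). If you save the PNG image to the directory from which the webserver was launched, then you can access the image from your browser at http://localhost:8000/file.png. Note that SimpleHTTPServer is a Python 2 module; on Python 3 the equivalent command is python3 -m http.server 8000.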

7.4.3 Histograms

Using Rio:

    $ < data/tips.csv Rio -ge 'g+geom_histogram(aes(bill))' | display

Figure 7.3: Histogram

Using feedgnuplot:

    $ < data/tips.csv csvcut -c bill | feedgnuplot --terminal 'dumb 80,25' \
    > --histogram 0 --with boxes --ymin 0 --binwidth 1.5 --unset grid --exit
    25 ++----+------+-----+--***-+-----+------+-----+------+-----+------+----++
    + + + +*** * + + + + + + + +
    | * * * |
    | *** * * * |
    20 ++ * * * * * ++
    | **** * * * * |
    | * ** *** * * *** |
    | * ** * * * * * * |
    15 ++ * ** * * * * * * ++
    | * ** * * * * * * |
    | * ** * * * * * * |
    | * ** * * * * * * *** |
    10 ++ * ** * * * *** *** * ++
    | * ** * * * * * * * * |
    | *** ** * * * * * * * ***** *** |
    | * * ** * * * * * * * * * *** * |
    5 ++ *** * ** * * * * * * * * * * * * *** ++
    | * * * ** * * * * * * * * * * * * *** * |
    | * * * ** * * * * * * * * * * * *** * ******** *** *** |
    + ***+*** * * ** *+* * * * * * * * * *+* * *+** * *+* ***+* * * *** +
    0 ++-***+***********************************************-*****-***-***--++
    0 5 10 15 20 25 30 35 40 45 50 55

7.4.4 Bar Plots

Using Rio:

    $ < data/tips.csv Rio -ge 'g+geom_bar(aes(factor(size)))' | display

Figure 7.4: Bar Plot

Using feedgnuplot:

    $ < data/tips.csv csvcut -c size | header -d | feedgnuplot --terminal \
    > 'dumb 80,25' --histogram 0 --with boxes --unset grid --exit
    160 ++--------+----***********----+---------+---------+---------+--------++
    + + * + * + + + + +
    140 ++ * * ++
    | * * |
    | * * |
    120 ++ * * ++
    | * * |
    100 ++ * * ++
    | * * |
    | * * |
    80 ++ * * ++
    | * * |
    60 ++ * * ++
    | * * |
    | * * |
    40 ++ * ********************* ++
    | * * * * |
    20 ++ * * * * ++
    | * * * * |
    + *********** + * + * + ********************* +
    0 ++---*************************************************************---++
    0 1 2 3 4 5 6 7

7.4.5 Density Plots

Using Rio:

    $ < data/tips.csv Rio -ge 'g+geom_density(aes(tip / bill * 100, fill=sex), '\
    > 'alpha=0.3) + xlab("percent")' | display

Figure 7.5: Density Plot

Since feedgnuplot cannot generate density plots, it’s best to just generate a histogram.
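One way to get such a histogram is to compute the tip percentage first and then hand the values to feedgnuplot. The following is a sketch under a few assumptions: the helper name tip_percent is hypothetical, the CSV is assumed to have header fields named bill and tip and no quoted fields, and the sample rows are made up for illustration. The feedgnuplot call is shown only as a comment; it reuses the options from the histogram example, with a bin width of 1 chosen arbitrarily.

```shell
# tip_percent: hypothetical helper. Reads CSV with a header containing
# "bill" and "tip" columns (assumed, no quoted fields) and prints
# tip / bill * 100 for each row, one value per line.
tip_percent() {
  awk -F, '
    NR == 1 { for (c = 1; c <= NF; c++) col[$c] = c; next }
    $(col["bill"]) > 0 { printf "%.2f\n", $(col["tip"]) / $(col["bill"]) * 100 }'
}

# Made-up sample rows, used only to show the output format.
printf 'bill,tip,sex\n20.00,3.00,Male\n10.00,2.50,Female\n' | tip_percent

# With feedgnuplot installed, the values could then be binned, for example:
#   < data/tips.csv tip_percent | feedgnuplot --terminal 'dumb 80,25' \
#     --histogram 0 --with boxes --ymin 0 --binwidth 1 --unset grid --exit
```

Unlike the ggplot2 version, this sketch does not split the distribution by sex; that would require filtering the rows per group before plotting.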

7.4.6 Box Plots

Using Rio:

    $ < data/tips.csv Rio -ge 'g+geom_boxplot(aes(time, bill))' | display

Figure 7.6: Box Plot

Drawing a box plot is unfortunately not possible with feedgnuplot.

7.4.7 Scatter Plots

Using Rio:

    $ < data/tips.csv Rio -ge 'g+geom_point(aes(bill, tip, color=time))' | display

Figure 7.7: Scatter Plot

Using feedgnuplot:

    $ < data/tips.csv csvcut -c bill,tip | tr , ' ' | header -d | feedgnuplot \
    > --terminal 'dumb 80,25' --points --domain --unset grid --exit --style 'pt' '14'
    10 ++----+------+-----+------+-----+------+-----+------+-----+------+A---++
    + + + + + + + + + + + +
    9 ++ A ++
    | |
    8 ++ ++
    | A |
    | |
    7 ++ A A ++
    | A A |
    6 ++ A A A ++
    | A A |
    5 ++ A A A A A AA A AA A A A ++
    | A A A A |
    4 ++ A A AAAA AAA A A A A A ++
    | A AAAAA AAA AA A A |
    | A AAAAAAA AA A A AA A AA |
    3 ++ A AAAAAAAAAAA A A AA AA A ++
    | AAAAAAA AA A A A A A |
    2 ++ AA AAAAAAAAA A A A AA A A A ++
    + + AAAAAAAA +A AA+ + A + + + + + +
    1 ++--A-+A-A---+--AA-+--A---+-----+------+--A--+------+-----+------+----++
    0 5 10 15 20 25 30 35 40 45 50 55

7.4.8 Line Graphs

Using Rio:

    $ < data/immigration-long.csv Rio -ge 'g+geom_line(aes(x=Period, '\
    > 'y=Count, group=Country, color=Country)) + theme(axis.text.x = '\
    > 'element_text(angle = -45, hjust = 0))' | display

Figure 7.8: Line Graph

Using feedgnuplot:

    $ < data/immigration.csv csvcut -c Period,Denmark,Netherlands,Norway,Sweden |
    > header -d | tr , ' ' | feedgnuplot --terminal 'dumb 80,25' --lines \
    > --autolegend --domain --legend 0 "Denmark" --legend 1 "Netherlands" \
    > --legend 2 "Norway" --legend 3 "Sweden" --xlabel "Period" --unset grid --exit
    250000 ++-----%%%-------+-------+--------+-------+-------+--------+------++
    + %%%% + % + + + + + Denmark+****** +
    |%% % Netherlands ###### |
    | % Norway $$$$$$ |
    200000 ++ % Sweden %%%%%%++
    | $ % |
    | $ $ % |
    | $ $ % |
    150000 ++ $$ $ % ++
    | $ $ % |
    | $ $ % |
    100000 ++$ $ % ++
    |$ $ %%%%%%%%%% |
    | $ % |
    | *********** $$$$$$$$$$$% |
    50000 +**** #########** $%% ####### ++
    | #### ******** $$% ### ## |
    |## ******## ##$$$$$$$$$$$$# |
    + + + + **###########$$************* +
    0 ++------+--------+-------+--------*************---+--------+------++
    1890 1900 1910 1920 1930 1940 1950 1960 1970
    Period

7.4.9 Summary

Both Rio with ggplot2 and feedgnuplot with Gnuplot have their advantages. The plots generated by Rio are of much higher quality, and ggplot2 offers a consistent syntax that lends itself well to the command line. The only downside is that the output is not viewable from the command line. This is where feedgnuplot may come in handy. Each feedgnuplot invocation takes roughly the same command-line arguments. As such, it would be straightforward to create a small Bash script that makes generating plots from and for the command line even easier. After all, with the command line having such a low resolution, we don’t need a lot of flexibility.
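As a minimal sketch of what such a Bash helper could look like, the function below fills in the options that every feedgnuplot example in this section repeats (--terminal 'dumb 80,25', --unset grid, and --exit) and passes any remaining arguments through unchanged. The function name termplot and the TERMPLOT_SIZE variable are hypothetical names chosen for this illustration.

```shell
# termplot: hypothetical wrapper around feedgnuplot. It supplies the options
# shared by the examples in this section; all other arguments are passed on.
# TERMPLOT_SIZE is an assumed variable name for overriding the plot size.
termplot() {
  feedgnuplot --terminal "dumb ${TERMPLOT_SIZE:-80,25}" --unset grid --exit "$@"
}

# Example invocations (these need feedgnuplot and the data of this chapter):
#   < data/tips.csv csvcut -c bill | termplot --histogram 0 --with boxes \
#     --ymin 0 --binwidth 1.5
#   < data/tips.csv csvcut -c size | header -d | termplot --histogram 0 --with boxes
```

Placing the function in a shell startup file, or turning it into an executable script on the PATH, would make it available in every session.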