3.3 Decompressing Files

If the original data set is very large or is a collection of many files, the file you obtain may be a (compressed) archive. Data sets that contain many repeated values (such as the words in a text file or the keys in a JSON file) are especially well suited for compression.
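As a rough illustration of why repetition helps, the following sketch compresses 100,000 bytes of a repeated key and 100,000 bytes of random data with gzip and prints both compressed sizes. The key "timestamp" and the variable names are made up for this example, and the exact sizes will vary from run to run.

```shell
#!/usr/bin/env bash
# 100,000 bytes of one repeated key, loosely like the keys in a JSON file.
repetitive_size=$(yes '"timestamp":' | head -c 100000 | gzip -c | wc -c)

# 100,000 bytes of random data, which offers no repetition to exploit.
random_size=$(head -c 100000 /dev/urandom | gzip -c | wc -c)

echo "repetitive: $repetitive_size bytes, random: $random_size bytes"
```

The repetitive input should shrink to a small fraction of its original size, while the random input stays at roughly 100,000 bytes.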

Common file extensions of compressed archives are: .tar.gz, .zip, and .rar. To decompress these, you would use the command-line tools tar (Bailey, Eggert, and Poznyakoff 2014), unzip (Smith 2009), and unrar (Asselstine, Scheurer, and Winkelmann 2014), respectively. There are a few more, less common file extensions, for which you would need yet other tools. For example, to extract a file named logs.tar.gz, you would use:

  $ cd ~/book/ch03
  $ tar -xzvf data/logs.tar.gz

tar is notorious for its many options. In this case, the four options x, z, v, and f specify that tar should extract files from an archive, use gzip as the decompression algorithm, be verbose, and read from the file data/logs.tar.gz. In time, you’ll get used to typing these four characters, but there’s a more convenient way.
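To try these options without touching real data, the following sketch builds a small sample archive in a temporary directory, lists its contents, and then extracts it. The directory layout and the file name app.log are made up for illustration. It also uses three options not discussed above: c (create an archive), t (list the contents instead of extracting), and C (change to the given directory first).

```shell
#!/usr/bin/env bash
# Throwaway working directory; all names below are illustrative only.
workdir=$(mktemp -d)
mkdir -p "$workdir/logs" "$workdir/out"
echo "first line" > "$workdir/logs/app.log"

# c = create, z = compress with gzip, f = write to the named archive file.
tar -czf "$workdir/logs.tar.gz" -C "$workdir" logs

# t = list the contents of the archive without extracting anything.
listing=$(tar -tzf "$workdir/logs.tar.gz")
echo "$listing"

# x = extract, v = verbose; -C extracts into the given directory.
tar -xzvf "$workdir/logs.tar.gz" -C "$workdir/out"
```

Listing an archive with t before extracting it is a cheap way to check what it will write to disk. The temporary directory can be removed with rm -rf "$workdir" afterwards.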

Rather than remembering the different command-line tools and their options, you can use a handy script called unpack (Brisbin 2013), which decompresses many different formats. unpack looks at the extension of the file that you want to decompress, and calls the appropriate command-line tool.

The unpack tool is part of the Data Science Toolbox. Remember that you can look up how it can be installed in the appendix. Example 3.1 shows the source of unpack. Although Bash scripting is not the focus of this book, it’s still useful to take a moment to figure out how it works.

Example 3.1 (Decompress various file formats)

  #!/usr/bin/env bash
  # unpack: Extract common file formats

  # Display usage if no parameters given
  if [[ -z "$1" ]]; then
    echo " ${0##*/} <archive> - extract common file formats"
    exit 1
  fi

  # Required program(s)
  req_progs=(7z unrar unzip)
  for p in "${req_progs[@]}"; do
    hash "$p" 2>/dev/null || \
      { echo >&2 " Required program \"$p\" not installed."; exit 1; }
  done

  # Test if file exists
  if [ ! -f "$1" ]; then
    echo "File \"$1\" doesn't exist"
    exit 1
  fi

  # Extract file by using extension as reference.
  # More specific patterns (*.tar.gz) must come before general ones (*.gz).
  case "$1" in
    *.7z )      7z x "$1" ;;
    *.tar.bz2 ) tar xvjf "$1" ;;
    *.bz2 )     bunzip2 "$1" ;;
    *.deb )     ar vx "$1" ;;
    *.tar.gz )  tar xvzf "$1" ;;
    *.gz )      gunzip "$1" ;;
    *.tar )     tar xvf "$1" ;;
    *.tbz2 )    tar xvjf "$1" ;;
    *.tar.xz )  tar xvf "$1" ;;
    *.tgz )     tar xvzf "$1" ;;
    *.rar )     unrar x "$1" ;;
    *.zip )     unzip "$1" ;;
    *.Z )       uncompress "$1" ;;
    * )         echo " Unsupported file format" ;;
  esac
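
One detail of Example 3.1 that is easy to miss is that a case statement runs only the first pattern that matches, which is why *.tar.gz is listed before *.gz. The sketch below isolates that behavior. pick_tool is a hypothetical helper, not part of unpack: it only prints the name of the tool that such a dispatch would select and does not decompress anything.

```shell
#!/usr/bin/env bash
# pick_tool (hypothetical): report which tool an extension-based
# dispatch would choose for the file name given as the first argument.
pick_tool() {
  case "$1" in
    *.tar.gz | *.tgz ) echo "tar" ;;
    *.gz )             echo "gunzip" ;;
    *.zip )            echo "unzip" ;;
    * )                echo "unsupported" ;;
  esac
}

pick_tool logs.tar.gz   # prints "tar": *.tar.gz matches before *.gz is tried
pick_tool notes.gz      # prints "gunzip"
pick_tool data.csv      # prints "unsupported"
```

If the *.gz line came first, logs.tar.gz would match it and be passed to gunzip, which would leave an uncompressed logs.tar behind rather than the extracted files.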

Now, to decompress this same file, you would simply use:

  $ unpack data/logs.tar.gz