4.1 Load and Save Files

Having only data inside Julia programs and not being able to load or save it would be very limiting. Therefore, we start by mentioning how to store files to and load files from disk. We focus on CSV, see Section 4.1.1, and Excel, see Section 4.1.2, file formats since those are the most common data storage formats for tabular data.

4.1.1 CSV

Comma-separated values (CSV) files are are very effective way to store tables. CSV files have two advantages over other data storage files. First, it does exactly what the name indicates it does, namely storing values by separating them using commas ,. This acronym is also used as the file extension. So, be sure that you save your files using the “.csv” extension such as “myfile.csv.” To demonstrate how a CSV file looks, we can add the CSV.jl package using the Pkg REPL mode (Section 3.5.4.2):

  1. julia> ]
  2. pkg> add CSV

and load it via:

  1. using CSV

We can now use our previous data:

  1. grades_2020()
namegrade_2020
Sally1.0
Bob5.0
Alice8.5
Hank4.0

and read it from a file after writing it:

  1. function write_grades_csv()
  2. path = "grades.csv"
  3. CSV.write(path, grades_2020())
  4. end
  1. path = write_grades_csv()
  2. read(path, String)
  1. name,grade_2020
  2. Sally,1.0
  3. Bob,5.0
  4. Alice,8.5
  5. Hank,4.0

Here, we also see the second benefit of CSV data format: the data can be read by using a simple text editor. This differs from many alternative data formats which require proprietary software, e.g. Excel.

This works wonders, but what if our data contains commas , as values? If we were to naively write data with commas, it would make the files very hard to convert back to a table. Luckily, CSV.jl handles this for us automatically. Consider the following data with commas ,:

  1. function grades_with_commas()
  2. df = grades_2020()
  3. df[3, :name] = "Alice,"
  4. df
  5. end
  6. grades_with_commas()
Table 5: Grades with commas.
namegrade_2020
Sally1.0
Bob5.0
Alice,8.5
Hank4.0

If we write this, we get:

  1. function write_comma_csv()
  2. path = "grades-commas.csv"
  3. CSV.write(path, grades_with_commas())
  4. end
  5. path = write_comma_csv()
  6. read(path, String)
  1. name,grade_2020
  2. Sally,1.0
  3. Bob,5.0
  4. "Alice,",8.5
  5. Hank,4.0

So, CSV.jl adds quotation marks " around the comma-containing values. Another common way to solve this problem is to write the data to a tab-separated values (TSV) file format. This assumes that the data doesn’t contain tabs, which holds in most cases.

Also, note that TSV files can also be read using a simple text editor, and these files use the “.tsv” extension.

  1. function write_comma_tsv()
  2. path = "grades-comma.tsv"
  3. CSV.write(path, grades_with_commas(); delim='\t')
  4. end
  5. read(write_comma_tsv(), String)
  1. name grade_2020
  2. Sally 1.0
  3. Bob 5.0
  4. Alice, 8.5
  5. Hank 4.0

Text file formats like CSV and TSV files can also be found that use other delimiters, such as semicolons “;” spaces “ ,” or even something as unusual as “π.”

  1. function write_space_separated()
  2. path = "grades-space-separated.csv"
  3. CSV.write(path, grades_2020(); delim=' ')
  4. end
  5. read(write_space_separated(), String)
  1. name grade_2020
  2. Sally 1.0
  3. Bob 5.0
  4. Alice 8.5
  5. Hank 4.0

By convention, it’s still best to give files with special delimiters, such as “;” the “.csv” extension.

Loading CSV files using CSV.jl is done in a similar way. You can use CSV.read and specify in what kind of format you want the output. We specify a DataFrame.

  1. path = write_grades_csv()
  2. CSV.read(path, DataFrame)
namegrade_2020
Sally1.0
Bob5.0
Alice8.5
Hank4.0

Conveniently, CSV.jl will automatically infer column types for us:

  1. path = write_grades_csv()
  2. df = CSV.read(path, DataFrame)
  1. 4×2 DataFrame
  2. Row name grade_2020
  3. String7 Float64
  4. ─────┼─────────────────────
  5. 1 Sally 1.0
  6. 2 Bob 5.0
  7. 3 Alice 8.5
  8. 4 Hank 4.0

It works even for far more complex data:

  1. my_data = """
  2. a,b,c,d,e
  3. Kim,2018-02-03,3,4.0,2018-02-03T10:00
  4. """
  5. path = "my_data.csv"
  6. write(path, my_data)
  7. df = CSV.read(path, DataFrame)
  1. 1×5 DataFrame
  2. Row a b c d e
  3. String3 Date Int64 Float64 DateTime
  4. ─────┼──────────────────────────────────────────────────────────
  5. 1 Kim 2018-02-03 3 4.0 2018-02-03T10:00:00

These CSV basics should cover most use cases. For more information, see the CSV.jl documentation and especially the CSV.File constructor docstring.

4.1.2 Excel

There are multiple Julia packages to read Excel files. In this book, we will only look at XLSX.jl, because it is the most actively maintained package in the Julia ecosystem that deals with Excel data. As a second benefit, XLSX.jl is written in pure Julia, which makes it easy for us to inspect and understand what’s going on under the hood.

Load XLSX.jl via

  1. using XLSX:
  2. eachtablerow,
  3. readxlsx,
  4. writetable

To write files, we define a little helper function for data and column names:

  1. function write_xlsx(name, df::DataFrame)
  2. path = "$name.xlsx"
  3. data = collect(eachcol(df))
  4. cols = names(df)
  5. writetable(path, data, cols)
  6. end

Now, we can easily write the grades to an Excel file:

  1. function write_grades_xlsx()
  2. path = "grades"
  3. write_xlsx(path, grades_2020())
  4. "$path.xlsx"
  5. end

When reading it back, we will see that XLSX.jl puts the data in a XLSXFile type and we can access the desired sheet much like a Dict:

  1. path = write_grades_xlsx()
  2. xf = readxlsx(path)
  1. XLSXFile("grades.xlsx") containing 1 Worksheet
  2. sheetname size range
  3. -------------------------------------------------
  4. Sheet1 5x2 A1:B5
  1. xf = readxlsx(write_grades_xlsx())
  2. sheet = xf["Sheet1"]
  3. eachtablerow(sheet) |> DataFrame
namegrade_2020
Sally1.0
Bob5.0
Alice8.5
Hank4.0

Notice that we cover just the basics of XLSX.jl but more powerful usage and customizations are available. For more information and options, see the XLSX.jl documentation.

4.1 Load and Save Files - 图1 Support this project
CC BY-NC-SA 4.0 Jose Storopoli, Rik Huijzer, Lazaro Alonso