Reading CSV files in R and Racket

2019 Jun 09 @travishinkelman.com

In a previous post, I wrote about reading and writing data to file while retaining the structure and attributes of the data (i.e., data serialization). However, I more commonly pass data around as text files (usually, CSV files). For this post, I created an example CSV file with a tool for generating test data, which allows for including different data types (e.g., dates, integers, names, phone numbers) in the output file.

R

In base R, read.csv [1] includes sensible defaults that make reading a CSV file relatively straightforward.

> str(read.csv("example.csv"))
'data.frame':	25 obs. of  7 variables:
 $ date   : Factor w/ 25 levels "01/23/2054","01/30/2023",..: 23 3 21 6 11 9 19 13 24 2 ...
 $ integer: int  -958838 -165090 -207296 316236 722897 -903116 -350444 -498413 63304 -644668 ...
 $ float  : num  5.90e+11 5.28e+11 9.94e+10 5.90e+11 -3.09e+11 ...
 $ bool   : Factor w/ 2 levels "false","true": 2 2 1 1 2 2 1 1 1 1 ...
 $ char   : Factor w/ 20 levels ")","$","0","1",..: 17 4 18 19 5 3 20 9 2 11 ...
 $ word   : Factor w/ 25 levels "ca","da","doptah",..: 24 4 3 8 10 6 21 7 18 20 ...
 $ yn     : Factor w/ 2 levels "N","Y": 2 2 2 1 1 2 1 1 2 2 ...

The str function shows the structure of an R object. For a data.frame object, each column in the data.frame is previewed in a row with the column name, followed by the column type, followed by the values for the first few rows in that column. Most of the columns in example.csv were converted to factors by read.csv. The factor levels are encoded as integers (e.g., "true" and "false" as 1 and 2). I won't re-hash the explanation for this default behavior, but I now typically set stringsAsFactors = FALSE to read non-numeric data as strings.

> str(read.csv("example.csv", stringsAsFactors = FALSE))
'data.frame':	25 obs. of  7 variables:
 $ date   : chr  "12/25/2060" "02/18/2023" "11/27/2064" "06/28/2044" ...
 $ integer: int  -958838 -165090 -207296 316236 722897 -903116 -350444 -498413 63304 -644668 ...
 $ float  : num  5.90e+11 5.28e+11 9.94e+10 5.90e+11 -3.09e+11 ...
 $ bool   : chr  "true" "true" "false" "false" ...
 $ char   : chr  "u" "1" "w" "x" ...
 $ word   : chr  "zosu" "em" "doptah" "gutijenel" ...
 $ yn     : chr  "Y" "Y" "Y" "N" ...

However, I'm even more likely to use the read_csv function from the readr package.

> library(readr)
> str(read_csv("example.csv"))
Parsed with column specification:
cols(
  date = col_character(),
  integer = col_double(),
  float = col_double(),
  bool = col_logical(),
  char = col_character(),
  word = col_character(),
  yn = col_character()
)
Classes ‘spec_tbl_df’, ‘tbl_df’, ‘tbl’ and 'data.frame':	25 obs. of  7 variables:
 $ date   : chr  "12/25/2060" "02/18/2023" "11/27/2064" "06/28/2044" ...
 $ integer: num  -958838 -165090 -207296 316236 722897 ...
 $ float  : num  5.90e+11 5.28e+11 9.94e+10 5.90e+11 -3.09e+11 ...
 $ bool   : logi  TRUE TRUE FALSE FALSE TRUE TRUE ...
 $ char   : chr  "u" "1" "w" "x" ...
 $ word   : chr  "zosu" "em" "doptah" "gutijenel" ...
 $ yn     : chr  "Y" "Y" "Y" "N" ...
 - attr(*, "spec")=
  .. cols(
  ..   date = col_character(),
  ..   integer = col_double(),
  ..   float = col_double(),
  ..   bool = col_logical(),
  ..   char = col_character(),
  ..   word = col_character(),
  ..   yn = col_character()
  .. )

readr flexibly parses many data types. In our example file, readr converts the bool column of "true" and "false" to R's logical type represented as TRUE and FALSE. readr is also capable of automatically parsing dates, but, apparently, not when the dates are in the mm/dd/yyyy format [2]. Interestingly, readr parses the integer column in this example file as a double rather than integer, perhaps because the integer values are so large.

If you are dealing with large datasets with slow read times, fread from the data.table package and the vroom package provide fast reading of delimited files. Jim Hester, maintainer of both the readr and vroom packages, made a helpful video comparing the performance of the readr, vroom, and data.table packages on a variety of tasks related to reading delimited files.

Racket

The csv-reading package provides a high-level convenience function, csv->list, for reading a CSV file.

> (require csv-reading)
> (call-with-input-file "example.csv" csv->list)
'(("date" " integer" " float" " bool" " char" " word" " yn")
  ("12/25/2060" "-958838" "590131109036.032" "true" "u" "zosu" "Y")
  ("02/18/2023" "-165090" "528052918838.8864" "true" "1" "em" "Y")
  ("11/27/2064" "-207296" "99397538938.88" "false" "w" "doptah" "Y")
  ("06/28/2044" "316236" "590216172037.7344" "false" "x" "gutijenel" "N")
  ("08/12/2045" "722897" "-309360363516.7232" "true" "7" "hueh" "N")
  ("07/23/1904" "-903116" "-509332808531.968" "true" "0" "fufcora" "Y")
  ("11/16/2021" "-350444" "-660010367870.5664" "false" "Y" "we" "N")
  ("08/30/1906" "-498413" "464406135282.0736" "false" "E" "gilveodo" "N")
  ("12/28/2059" "63304" "-89202181223.2192" "false" "$" "teocon" "Y")
  ("01/30/2023" "-644668" "777959066776.3712" "false" "J" "vettocfa" "Y")
  ("11/15/1952" "-157838" "-367124812713.1648" "false" "9" "ponposa" "N")
  ("07/14/1943" "526756" "-838433449495.7568" "true" "p" "ni" "Y")
  ("09/21/2032" "-397651" "492526730687.2832" "false" "9" "fa" "N")
  ("10/31/1997" "127459" "586856424642.9696" "true" "N" "rifot" "N")
  ("03/30/1971" "-430726" "816426901097.6768" "false" "C" "wihpubun" "N")
  ("12/10/2002" "986178" "390724723094.3232" "false" "0" "ca" "N")
  ("01/23/2054" "-761264" "167446627483.648" "false" "B" "ku" "Y")
  ("07/16/1999" "373063" "-487648905710.7968" "false" ")" "da" "Y")
  ("08/25/2041" "326420" "382617699273.9328" "false" "L" "kegtizub" "Y")
  ("07/29/1984" "335329" "-511355736870.0928" "false" "M" "kiili" "N")
  ("10/24/2027" "-149539" "244888640893.7472" "true" "g" "ufhuk" "N")
  ("12/30/2009" "258694" "-261971544060.7232" "false" "g" "zemegrif" "N")
  ("11/22/1920" "-553309" "-687846506836.7872" "false" "m" "hetige" "Y")
  ("09/04/2007" "-686574" "779963704593.6128" "false" "L" "zule" "N")
  ("03/07/2060" "372912" "605244557046.5792" "true" "N" "sez" "Y"))

csv->list reads all values as strings into a list [3] (i.e., no automated parsing of values). I think that reading CSV files provides an informative contrast between the R and Racket communities [4]. By and large, R users are not programmers but end users who want to expeditiously perform tasks related to cleaning, analyzing, and visualizing their data. There is a large, and growing, industry around building R packages and tools that facilitate those end users. My early impression of the Racket community is that packages are generally written at a lower level of abstraction than R packages because the target audience is other programmers [5]. I think this discussion on the racket-users mailing list illustrates this difference reasonably well.

Our example data was handled well by csv->list, but if your data is more unruly, then you can turn to the lower-level functions in the csv-reading package, which involves making a csv-reader and then iteratively calling that reader.

(define make-example-csv-reader
  (make-csv-reader-maker
   '((separator-chars #\,))))

(define next-row
  (make-example-csv-reader (open-input-file "example.csv")))

The reader function, next-row, returns each row until it has reached the end of the file and then returns a null list ('()).

> (next-row)
'("date" " integer" " float" " bool" " char" " word" " yn")
> (next-row)
'("12/25/2060" "-958838" "590131109036.032" "true" "u" "zosu" "Y")

We can use this behavior to create a function for previewing a CSV file that uses our previously defined make-example-csv-reader function and then iterates through the file with a recursive function.

(define (preview-csv filename [rows 6])
  (define next-row
    (make-example-csv-reader (open-input-file filename)))
  (define (iterate iter [result '()])
    (define nr (next-row))
    (if (or (< iter 1) (empty? nr))
        (reverse result)
        (iterate (- iter 1) (cons nr result))))
  (iterate rows))

preview-csv allows the user to specify the maximum number of rows to preview (with a default of 6).

> (preview-csv "example.csv")
'(("date" " integer" " float" " bool" " char" " word" " yn")
  ("12/25/2060" "-958838" "590131109036.032" "true" "u" "zosu" "Y")
  ("02/18/2023" "-165090" "528052918838.8864" "true" "1" "em" "Y")
  ("11/27/2064" "-207296" "99397538938.88" "false" "w" "doptah" "Y")
  ("06/28/2044" "316236" "590216172037.7344" "false" "x" "gutijenel" "N")
  ("08/12/2045" "722897" "-309360363516.7232" "true" "7" "hueh" "N"))

With a small modification, we can change preview-csv to a function that reads the whole file.

(define (read-csv filename)
  (define next-row
    (make-example-csv-reader (open-input-file filename)))
  (define (iterate [result '()])
    (define nr (next-row))
    (if (empty? nr)
        (reverse result)
        (iterate (cons nr result))))
  (iterate))

And, for this example, read-csv returns the same result as csv->list.

> (equal? (call-with-input-file "example.csv" csv->list) (read-csv "example.csv"))
#t

[1] read.csv is a wrapper function to read.table, which adds type conversions (and other conveniences) to the scan function.

[2] readr correctly parses dates stored as strings in the ISO 8601 format.

[3] A Racket list is a linked list, which is different than the list data structure in R (see also).

[4] Of course, it is also relevant that the central data structure in R (i.e., data frame) is tabular. That is not true of Racket but there have been a couple of recent discussions on the racket-users mailing list (here and here) about tabular data structures in Racket. And there is a data frame package that implements a data frame data structure in Racket.

[5] This comparison is possibly made more stark by my choice to learn Racket rather than Clojure, which have different philosophies on how to build a community.