Data serialization in R and Racket

2019 Mar 26 @travishinkelman.com

When programming in R, I generally pass data around by reading and writing text files (typically, CSV files). The ubiquity of CSV files means that many different types of software will open them easily (e.g., Notepad, Excel, TextEdit, etc.). However, if the data structure is not flat or contains other attributes, then writing to CSV requires flattening and/or dropping attributes. The general solution to writing data to a file while retaining structure and attributes is serialization.

R

In R, I use readRDS and saveRDS for reading and writing serialized objects. Let's pull a nested list from a recent post as an example.

nested_list = list("female" = list("0-1" = 0, 
                                   "1-2" = 0,
                                   "2-3" = 0.15,
                                   "3-4" = 0.4,
                                   "4+" = 0.8),
                   "male" = list("0-1" = 0, 
                                 "1-2" = 0.1,
                                 "2-3" = 0.4,
                                 "3-4" = 0.6,
                                 "4+" = 0.9))

saveRDS(nested_list, "list.rds")                # write list object to file
identical(nested_list, readRDS("list.rds"))     # read list object from file and compare to original

Even with a flat object, serialization can be useful. For example, we could write a matrix to CSV with write.table but it won't be exactly the same when reading from the CSV file [1]. The approach with saveRDS/readRDS is more straightforward.

mat = matrix(data = 0, nrow = 10, ncol = 5)

write.table(mat, "matrix.csv", row.names = FALSE, col.names = FALSE)
mat_csv = as.matrix(read.table("matrix.csv"))
identical(mat, mat_csv)

saveRDS(mat, "matrix.rds")
mat_rds = readRDS("matrix.rds")
identical(mat, mat_rds)

Racket

Racket has a serialization library for handling this type of task. The following code is based on this Stack Overflow answer and uses a hash table from the same post as the nested list above.

#lang racket

(require racket/serialize)

(define nested-hash
  (hash "female" (hash "0-1" 0
                       "1-2" 0
                       "2-3" 0.15
                       "3-4" 0.4
                       "4+" 0.8)
        "male" (hash "0-1" 0
                     "1-2" 0.1
                     "2-3" 0.4
                     "3-4" 0.6
                     "4+" 0.9)))

; define function for saving data to a rkdt file                     
(define (save-rktd data path)
  (if (serializable? data)
      (with-output-to-file path
        (lambda () (write (serialize data)))
        #:exists 'replace)                       
      (error "Data is not serializable")))
      
; define function for reading data from a rkdt file      
(define (read-rktd path)
  (with-input-from-file path
    (lambda () (deserialize (read)))))
    
(save-rktd nested-hash "hash.rktd")            ; write hash table to file
(equal? nested-hash (read-rktd "hash.rktd"))   ; read hash table from file and compare to original

The save-rktd function first checks if the data structure is serializable and returns an error if it is not. The with-output-to-file function handles opening and closing a port to the file. While the port is open, the anonymous function (specified with (lambda)) indicates that the data should first be serialized and then written to the output port. The keyword argument #:exists is set to 'replace because overwriting the existing file is familiar to me from saveRDS.

In read-rktd, a port is opened and closed with with-input-from-file and the anonymous function first reads the file and then deserializes the data. The save-rktd and read-rktd functions are then applied in the same way as saveRDS and readRDS.

save-rktd and read-rktd should work on any data structures that are serializable. Here is an example with a "matrix" comprised of a vector of vectors.

(define matrix
  (for/vector ([i (in-range 10)])
    (make-vector 5 0)))

(save-rktd matrix "matrix.rktd")
(equal? matrix (read-rktd "matrix.rktd"))

One key difference between the R functions and my Racket versions is that the R functions apply compression by default and no compression is applied in the Racket functions. In fact, relative to writing/reading CSV files, using saveRDS and readRDS provides the benefits of small size on disk and fast read/write operations in addition to retaining data structure and attributes.


[1] I didn't spend much time trying to work through how the objects are different or how to make them the same.