Data serialization in R and Racket
When programming in R, I generally pass data around by reading and writing text files (typically, CSV files). The ubiquity of CSV files means that many different types of software will open them easily (e.g., Notepad, Excel, TextEdit, etc.). However, if the data structure is not flat or contains other attributes, then writing to CSV requires flattening and/or dropping attributes. The general solution to writing data to a file while retaining structure and attributes is serialization.
R
In R, I use readRDS
and saveRDS
for reading and writing serialized objects. Let's pull a nested list from a recent post as an example.
nested_list = list("female" = list("0-1" = 0,
"1-2" = 0,
"2-3" = 0.15,
"3-4" = 0.4,
"4+" = 0.8),
"male" = list("0-1" = 0,
"1-2" = 0.1,
"2-3" = 0.4,
"3-4" = 0.6,
"4+" = 0.9))
saveRDS(nested_list, "list.rds") # write list object to file
identical(nested_list, readRDS("list.rds")) # read list object from file and compare to original
Even with a flat object, serialization can be useful. For example, we could write a matrix to CSV with write.table
but it won't be exactly the same when reading from the CSV file [1]. The approach with saveRDS
/readRDS
is more straightforward.
mat = matrix(data = 0, nrow = 10, ncol = 5)
write.table(mat, "matrix.csv", row.names = FALSE, col.names = FALSE)
mat_csv = as.matrix(read.table("matrix.csv"))
identical(mat, mat_csv)
saveRDS(mat, "matrix.rds")
mat_rds = readRDS("matrix.rds")
identical(mat, mat_rds)
Racket
Racket has a serialization library for handling this type of task. The following code is based on this Stack Overflow answer and uses a hash table from the same post as the nested list above.
#lang racket
(require racket/serialize)
(define nested-hash
(hash "female" (hash "0-1" 0
"1-2" 0
"2-3" 0.15
"3-4" 0.4
"4+" 0.8)
"male" (hash "0-1" 0
"1-2" 0.1
"2-3" 0.4
"3-4" 0.6
"4+" 0.9)))
; define function for saving data to a rkdt file
(define (save-rktd data path)
(if (serializable? data)
(with-output-to-file path
(lambda () (write (serialize data)))
#:exists 'replace)
(error "Data is not serializable")))
; define function for reading data from a rkdt file
(define (read-rktd path)
(with-input-from-file path
(lambda () (deserialize (read)))))
(save-rktd nested-hash "hash.rktd") ; write hash table to file
(equal? nested-hash (read-rktd "hash.rktd")) ; read hash table from file and compare to original
The save-rktd
function first checks if the data structure is serializable and returns an error if it is not. The with-output-to-file
function handles opening and closing a port to the file. While the port is open, the anonymous function (specified with (lambda)
) indicates that the data should first be serialized and then written to the output port. The keyword argument #:exists
is set to 'replace
because overwriting the existing file is familiar to me from saveRDS
.
In read-rktd
, a port is opened and closed with with-input-from-file
and the anonymous function first reads the file and then deserializes the data. The save-rktd
and read-rktd
functions are then applied in the same way as saveRDS
and readRDS
.
save-rktd
and read-rktd
should work on any data structures that are serializable. Here is an example with a "matrix" comprised of a vector of vectors.
(define matrix
(for/vector ([i (in-range 10)])
(make-vector 5 0)))
(save-rktd matrix "matrix.rktd")
(equal? matrix (read-rktd "matrix.rktd"))
One key difference between the R functions and my Racket versions is that the R functions apply compression by default and no compression is applied in the Racket functions. In fact, relative to writing/reading CSV files, using saveRDS
and readRDS
provides the benefits of small size on disk and fast read/write operations in addition to retaining data structure and attributes.
[1] I didn't spend much time trying to work through how the objects are different or how to make them the same.