Data transformation in Scheme

2023 Jul 07 Updated 2024 Sep 17 @travishinkelman.com

I have done some recent work on my dataframe library for Scheme (R6RS) and thought I would run through the examples in the Data Transformation chapter of R for Data Science (R4DS). In this post, I won't reproduce any of the R code and will provide limited commentary on the Scheme code (which is also available via this gist).

Setup

The nycflights13::flights dataset is used for all of the examples shown below. I've written it to a file and posted it here.

(import (dataframe))

(define flights (tsv->dataframe "nycflights.tsv"))

The flights dataset has 336,776 rows and 19 columns. Datasets of this size strain the dataframe library and provide a suboptimal experience, especially compared to R. Skipped sections indicate that the dataframe library has no equivalent capabilities.

3.1 Introduction

3.1.3

(-> flights
    (dataframe-filter* (dest) (string=? dest "IAH")) 
    (dataframe-aggregate*
     (year month day)
     (arr_delay (arr_delay) (inexact (mean arr_delay))))
    (dataframe-display))

3.2 Rows

3.2.1

Piping all of the output to a dataframe-display will often not show clearly the operation was successful. The one liners below show that the operation yielded the expected result. Need to remove 'na from the dep_delay column or the > operation will fail. The version with dataframe-remove-na is more readable, but requires two passes through the data.

;; filter only version
(define delayed-flights
  (-> flights
      (dataframe-filter*
       (dep_delay)
       (and (not (na? dep_delay))
            (> dep_delay 120)))))

;; remove na first
(define delayed-flights
  (-> flights
      (dataframe-remove-na 'dep_delay)
      (dataframe-filter* (dep_delay) (> dep_delay 120))))

(apply min ($ delayed-flights 'dep_delay))

(-> flights
    (dataframe-remove-na 'month 'day)
    (dataframe-filter* (month day) (and (= month 1) (= day 1)))
    (dataframe-glimpse))

(define jan-feb-flights
  (-> flights
      (dataframe-remove-na 'month)
      (dataframe-filter* (month) (or (= month 1) (= month 2)))))

(remove-duplicates ($ jan-feb-flights 'month))

(-> flights
    (dataframe-remove-na 'month)
    (dataframe-filter* (month) (member month '(1 2)))
    (dataframe-glimpse))

3.2.3

(-> flights
    (dataframe-remove-na 'year 'month 'day 'dep_time)
    (dataframe-sort* (< year) (< month) (< day) (< dep_time))
    (dataframe-glimpse))

(-> flights
    (dataframe-remove-na 'dep_delay)
    (dataframe-sort* (> dep_delay))
    (dataframe-glimpse))

3.2.4

dplyr::distinct has a parameter, .keep_all, that keeps all columns for the first occurrence of each unique combo. dataframe-unique does not have that functionality.

(-> flights
    (dataframe-unique)
    (dataframe-glimpse))

(-> flights
    (dataframe-select* origin dest)
    (dataframe-unique)
    (dataframe-glimpse))

The dataframe library does not include a procedure comparable to dplyr::count, but the same result can be obtained with dataframe-aggregate*.

(-> flights
    (dataframe-aggregate*
     (origin dest)
     (n (origin) (length origin)))
    (dataframe-sort*
     (> n))
    (dataframe-display))

3.3 Columns

3.3.1

(-> flights
    (dataframe-remove-na 'dep_delay 'arr_delay 'distance 'air_time)
    (dataframe-modify*
     (gain (dep_delay arr_delay) (- dep_delay arr_delay))
     (speed (distance air_time) (* (/ distance air_time) 60)))
    (dataframe-glimpse))

dataframe-modify* has no equivalent to the .before and .after parameters of dplyr::mutate. Using a dataframe-select to get focal columns but this drops all other columns.

(-> flights
    (dataframe-remove-na 'dep_delay 'arr_delay 'distance 'air_time)
    (dataframe-modify*
     (gain (dep_delay arr_delay) (- dep_delay arr_delay))
     (speed (distance air_time) (inexact (* (/ distance air_time) 60))))
    (dataframe-select* year month day gain speed)
    (dataframe-display))

3.3.2

dplyr::select has many wonderful helper functions to do complicated select operations easily. dataframe-select* is very simple.

(-> flights
    (dataframe-select* year month day)
    (dataframe-display))

3.3.3

(-> flights
    (dataframe-rename* (tailnum tail_num))
    (dataframe-glimpse))

3.4 The Pipe

(-> flights
    (dataframe-remove-na 'dest 'distance 'air_time)
    (dataframe-filter* (dest) (string=? dest "IAH"))
    (dataframe-modify*
     (speed
      (distance air_time)
      (inexact (* (/ distance air_time) 60))))
    (dataframe-select*
     year month day dep_time carrier flight speed)
    (dataframe-sort*
     (> speed))
    (dataframe-display))

(dataframe-display
 (dataframe-sort*
  (dataframe-select*
   (dataframe-modify*
    (dataframe-filter*
     (dataframe-remove-na flights 'dest 'distance 'air_time)
     (dest) (string=? dest "IAH"))
    (speed
     (distance air_time)
     (inexact (* (/ distance air_time) 60))))
   year month day dep_time carrier flight speed)
  (> speed)))

(define flights1
  (dataframe-filter*
   (dataframe-remove-na flights 'dest 'distance 'air_time)
   (dest distance air_time)
   (string=? dest "IAH")))

(define flights2
  (dataframe-modify*
   flights1
   (speed
    (distance air_time)
    (inexact (* (/ distance air_time) 60)))))

(define flights3
  (dataframe-select*
   flights2
   year month day dep_time carrier flight speed))

(dataframe-display
 (dataframe-sort* flights3 (> speed)))

3.5 Groups

3.5.2

dplyr::group_by automatically sorts by grouping variables, but that step needs to be done explicitly in dataframe-aggregate*.

(-> flights
    (dataframe-aggregate*
     (month)
     (avg_delay (dep_delay) (inexact (mean dep_delay))))
    (dataframe-sort*
     (< month))
    (dataframe-display))

(-> flights
    (dataframe-remove-na 'dep_delay)
    (dataframe-aggregate*
     (month)
     (avg_delay (dep_delay) (inexact (mean dep_delay)))
     (n (dep_delay) (length dep_delay)))
    (dataframe-sort*
     (< month))
    (dataframe-display))

4.5.5

dataframe-aggregate* struggles when splitting a large dataframe into many groups (in this case, 365). The operation to summarize monthly results (not shown) is reasonably quick. In this case, I didn't bother to filter because I knew there were no missing values for year, month, and day.

(define daily_flights
  (dataframe-aggregate*
   flights
   (year month day)
   (n (dep_time) (length dep_time))))