Dataframe manipulation in Scheme
This is yet another post where I try out my dataframe
library for Scheme (R6RS) on examples found in blog posts. This blog post demonstrates data manipulation with dplyr
(R). I provide only the Scheme code and some commentary. You will have to click through to the original post to see how it compares to R.
Library and Data
First, we import the dataframe
library and the shuffle
procedure from chez-stats
. The original post uses the Palmer Penguins dataset, which I wrote to a csv file from R. We read the data from file and select only the columns used in this post. dataframe-display
is similar to head
in R and dataframe-glimpse
is similar to str
, but dataframe
currently provides no functionality comparable to summary
in R.
(import (dataframe)
(only (chez-stats)
shuffle))
(define dat
(-> (csv->dataframe "penguins.csv")
(dataframe-select* species body_mass_g sex year)))
(dataframe-display dat)
dim: 344 rows x 4 cols
species body_mass_g sex year
<str> <num> <str> <num>
Adelie 3750 male 2007.
Adelie 3800 female 2007.
Adelie 3250 female 2007.
Adelie na na 2007.
Adelie 3450 female 2007.
Adelie 3650 male 2007.
Adelie 3625 female 2007.
Adelie 4675 male 2007.
Adelie 3475 na 2007.
Adelie 4250 na 2007.
(dataframe-glimpse dat)
dim: 344 rows x 4 cols
species <str> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, ...
body_mass_g <num> 3750, 3800, 3250, na, 3450, 3650, 3625, 4675, 3475, ...
sex <str> male, female, female, na, female, male, female, male, ...
year <num> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, ...
Filter Observations
The dataframe
library includes some handling of missing values, represented as 'na
, but nothing currently that will automatically drop 'na
values while filtering as in dplyr::filter
. A couple of options for handling 'na
are presented in the code below. In the original blog post, they built up to a compound filter with pipe operators, but here I will just cut to the chase and show the final version.
;; this version is more readable, but requires two passes through the dataframe
;; i.e., first with dataframe-remove-na and then with dataframe-filter
(-> dat
(dataframe-remove-na 'body_mass_g 'sex)
(dataframe-filter*
(body_mass_g sex)
(and (> body_mass_g 4000)
(string=? sex "female")))
(dataframe-display))
;; this version is more awkward to read (and write)
;; but does all the filtering in one pass
(-> dat
(dataframe-filter*
(body_mass_g sex)
(and (and (not (na? body_mass_g))
(> body_mass_g 4000))
(and (not (na? sex))
(string=? sex "female"))))
(dataframe-display))
dim: 58 rows x 4 cols
species body_mass_g sex year
<str> <num> <str> <num>
Gentoo 4500. female 2007.
Gentoo 4450. female 2007.
Gentoo 4550. female 2007.
Gentoo 4800. female 2007.
Gentoo 4400. female 2007.
Gentoo 4650. female 2007.
Gentoo 4650. female 2007.
Gentoo 4200. female 2007.
Gentoo 4150. female 2007.
Gentoo 4800. female 2007.
Extract Observations
The dataframe
library doesn't include a slice function, but we can accomplish similar functionality with other dataframe
procedures. You can slice by row indices using dataframe-ref
. dataframe-head
is similar to slice_head
. dataframe-tail
works similarly to list-tail
in Scheme but differently than slice_tail
in dplyr
and requires jumping through some extra hoops to replicate slice_tail
.
(dataframe-display (dataframe-ref dat '(1 4 36)))
dim: 3 rows x 4 cols
species body_mass_g sex year
<str> <num> <str> <num>
Adelie 3800. female 2007.
Adelie 3450. female 2007.
Adelie 3950. male 2007.
(dataframe-display (dataframe-head dat 3))
dim: 3 rows x 4 cols
species body_mass_g sex year
<str> <num> <str> <num>
Adelie 3750. male 2007.
Adelie 3800. female 2007.
Adelie 3250. female 2007.
(dataframe-display (dataframe-tail dat (- (car (dataframe-dim dat)) 3)))
dim: 3 rows x 4 cols
species body_mass_g sex year
<str> <num> <str> <num>
Chinstrap 3775. male 2009.
Chinstrap 4100. male 2009.
Chinstrap 3775. female 2009.
It takes considerably more code to replicate slice_min
and slice_max
from dplyr
.
;; equivalent of `slice_min` example in original post
(-> (dataframe-filter*
dat
(body_mass_g)
(and (not (na? body_mass_g))
(<= body_mass_g (quantile ($ dat 'body_mass_g) 0.25 7))))
(dataframe-sort* (< body_mass_g))
(dataframe-display))
;; previous version only works with filter as the first step in the pipe
;; this version works with filter step anywhere in the pipe
(-> dat
(dataframe-remove-na 'body_mass_g)
(->>
((lambda (dfx)
(dataframe-filter*
dfx
(body_mass_g)
(<= body_mass_g (quantile ($ dfx 'body_mass_g) 0.25 7))))))
(dataframe-sort* (< body_mass_g))
(dataframe-display))
dim: 89 rows x 4 cols
species body_mass_g sex year
<str> <num> <str> <num>
Chinstrap 2700. female 2008.
Adelie 2850. female 2008.
Adelie 2850. female 2008.
Adelie 2900. female 2008.
Adelie 2900. female 2008.
Adelie 2900. female 2009.
Chinstrap 2900. female 2007.
Adelie 2925. female 2009.
Adelie 2975. na 2007.
Adelie 3000. female 2007.
;; equivalent of `slice_max` example in original post
(-> (dataframe-filter*
dat
(body_mass_g)
(and (not (na? body_mass_g))
(>= body_mass_g (quantile ($ dat 'body_mass_g) 0.75 7))))
(dataframe-sort* (< body_mass_g))
(dataframe-display))
dim: 90 rows x 4 cols
species body_mass_g sex year
<str> <num> <str> <num>
Gentoo 4750. male 2008.
Gentoo 4750. female 2008.
Gentoo 4750. female 2008.
Gentoo 4750. female 2009.
Gentoo 4750. female 2009.
Adelie 4775. male 2009.
Gentoo 4800. female 2007.
Gentoo 4800. female 2007.
Chinstrap 4800. male 2008.
Gentoo 4850. female 2008.
Sample Observations
The equivalent of sample_n
and sample_frac
don't currently exist in dataframe
, but it doesn't take much code to reproduce that functionality. Here, we lean on shuffle
from chez-stats
to randomly arrange the indices and then call dataframe-ref
.
(define (get-first-n lst n)
(let loop ([lst lst]
[out '()]
[count 0])
(if (or (null? lst) (= count n))
(reverse out)
(loop (cdr lst) (cons (car lst) out) (add1 count)))))
(define (sample-n df n)
(let* ([ind1 (iota (car (dataframe-dim df)))]
[ind2 (get-first-n (shuffle ind1) n)])
(dataframe-ref df ind2)))
(define (sample-frac df frac)
(let* ([rows (car (dataframe-dim df))]
[n (floor (inexact (* rows frac)))])
(sample-n df n)))
(dataframe-display (sample-n dat 3))
dim: 3 rows x 4 cols
species body_mass_g sex year
<str> <num> <str> <num>
Adelie 4250. male 2009.
Gentoo 5250. male 2009.
Gentoo 4600. female 2008.
(dataframe-display (sample-frac dat 1/2))
dim: 172 rows x 4 cols
species body_mass_g sex year
<str> <num> <str> <num>
Adelie 3325. female 2007.
Adelie 3275. female 2009.
Adelie 3475. female 2009.
Chinstrap 3650. male 2007.
Gentoo 4550. female 2007.
Chinstrap 3675. female 2009.
Adelie 4600. male 2007.
Gentoo 4650. female 2007.
Chinstrap 3325. female 2009.
Adelie 3075. female 2009.
Sort Observations
Sorting also requires explicitly handling missing values. dataframe-sort
first sorts based on the expression on the left and then moves right through all other expressions (in this case, it means first sorting by body mass and then sex).
(-> dat
(dataframe-remove-na 'body_mass_g 'sex)
(dataframe-sort* (< body_mass_g) (string<? sex))
(dataframe-display))
dim: 333 rows x 4 cols
species body_mass_g sex year
<str> <num> <str> <num>
Chinstrap 2700. female 2008.
Adelie 2850. female 2008.
Adelie 2850. female 2008.
Adelie 2900. female 2008.
Adelie 2900. female 2008.
Adelie 2900. female 2009.
Chinstrap 2900. female 2007.
Adelie 2925. female 2009.
Adelie 3000. female 2007.
Adelie 3000. female 2009.
Select Variables
dplyr
's select
provides particularly rich functionality for selecting columns from a dataframe. In the dataframe
library, you can only select or drop columns by exact column names.
(dataframe-display (dataframe-select* dat body_mass_g year))
dim: 344 rows x 2 cols
body_mass_g year
<num> <num>
3750 2007.
3800 2007.
3250 2007.
na 2007.
3450 2007.
3650 2007.
3625 2007.
4675 2007.
3475 2007.
4250 2007.
(dataframe-display (dataframe-drop* dat body_mass_g year))
dim: 344 rows x 2 cols
species sex
<str> <str>
Adelie male
Adelie female
Adelie female
Adelie na
Adelie female
Adelie male
Adelie female
Adelie male
Adelie na
Adelie na
;; re-arrange column order
(dataframe-display (dataframe-select* dat sex species body_mass_g year))
dim: 344 rows x 4 cols
sex species body_mass_g year
<str> <str> <num> <num>
male Adelie 3750 2007.
female Adelie 3800 2007.
female Adelie 3250 2007.
na Adelie na 2007.
female Adelie 3450 2007.
male Adelie 3650 2007.
female Adelie 3625 2007.
male Adelie 4675 2007.
na Adelie 3475 2007.
na Adelie 4250 2007.
Rename Variables
Renaming columns works similarly to dplyr
, but the order is (old-name new-name)
not new_name = old_name
.
(dataframe-display
(dataframe-rename* dat (body_mass_g body_mass) (year study_year)))
dim: 344 rows x 4 cols
species body_mass sex study_year
<str> <num> <str> <num>
Adelie 3750 male 2007.
Adelie 3800 female 2007.
Adelie 3250 female 2007.
Adelie na na 2007.
Adelie 3450 female 2007.
Adelie 3650 male 2007.
Adelie 3625 female 2007.
Adelie 4675 male 2007.
Adelie 3475 na 2007.
Adelie 4250 na 2007.
Create or Modify Variables
Again, we have to explicitly handle 'na
values because /
and <
don't handle symbols and dataframe
doesn't provide alternatives. In the second example, we can create a new column that is not based on any columns in the dataframe by not specifying any columns ()
to use in the expression.
(dataframe-display
(dataframe-modify*
dat
(body_mass_kg
(body_mass_g)
(if (na? body_mass_g) 'na (inexact (/ body_mass_g 1000))))))
dim: 344 rows x 5 cols
species body_mass_g sex year body_mass_kg
<str> <num> <str> <num> <num>
Adelie 3750 male 2007. 3.75
Adelie 3800 female 2007. 3.8
Adelie 3250 female 2007. 3.25
Adelie na na 2007. na
Adelie 3450 female 2007. 3.45
Adelie 3650 male 2007. 3.65
Adelie 3625 female 2007. 3.625
Adelie 4675 male 2007. 4.675
Adelie 3475 na 2007. 3.475
Adelie 4250 na 2007. 4.25
(dataframe-display
(dataframe-modify*
dat
(id () (iota (car (dataframe-dim dat))))))
dim: 344 rows x 5 cols
species body_mass_g sex year id
<str> <num> <str> <num> <num>
Adelie 3750 male 2007. 0.
Adelie 3800 female 2007. 1.
Adelie 3250 female 2007. 2.
Adelie na na 2007. 3.
Adelie 3450 female 2007. 4.
Adelie 3650 male 2007. 5.
Adelie 3625 female 2007. 6.
Adelie 4675 male 2007. 7.
Adelie 3475 na 2007. 8.
Adelie 4250 na 2007. 9.
(dataframe-display
(dataframe-modify*
dat
(body_mass_cat
(body_mass_g)
(if (na? body_mass_g)
'na
(if (> body_mass_g 4000) "High" "Low")))))
dim: 344 rows x 5 cols
species body_mass_g sex year body_mass_cat
<str> <num> <str> <num> <str>
Adelie 3750 male 2007. Low
Adelie 3800 female 2007. Low
Adelie 3250 female 2007. Low
Adelie na na 2007. na
Adelie 3450 female 2007. Low
Adelie 3650 male 2007. Low
Adelie 3625 female 2007. Low
Adelie 4675 male 2007. High
Adelie 3475 na 2007. Low
Adelie 4250 na 2007. High
(dataframe-display
(dataframe-modify*
dat
(body_mass_cat
(body_mass_g)
(cond [(na? body_mass_g) 'na]
[(< body_mass_g 3500) "Low"]
[(> body_mass_g 4750) "High"]
[else "Medium"]))))
dim: 344 rows x 5 cols
species body_mass_g sex year body_mass_cat
<str> <num> <str> <num> <str>
Adelie 3750 male 2007. Medium
Adelie 3800 female 2007. Medium
Adelie 3250 female 2007. Low
Adelie na na 2007. na
Adelie 3450 female 2007. Low
Adelie 3650 male 2007. Medium
Adelie 3625 female 2007. Medium
Adelie 4675 male 2007. Medium
Adelie 3475 na 2007. Low
Adelie 4250 na 2007. Medium
Identify Distinct Values
(-> dat
(dataframe-select* species)
(dataframe-unique)
(dataframe-display))
dim: 3 rows x 1 cols
species
<str>
Adelie
Gentoo
Chinstrap
(-> dat
(dataframe-select* species year)
(dataframe-unique)
(dataframe-display))
dim: 9 rows x 2 cols
species year
<str> <num>
Adelie 2007.
Adelie 2008.
Adelie 2009.
Gentoo 2007.
Gentoo 2008.
Gentoo 2009.
Chinstrap 2007.
Chinstrap 2008.
Chinstrap 2009.
Aggregate Observations
(dataframe-display
(dataframe-aggregate*
dat
(species sex)
(mean (body_mass_g) (inexact (mean body_mass_g)))
(sd (body_mass_g) (standard-deviation body_mass_g))))
dim: 8 rows x 4 cols
species sex mean sd
<str> <str> <num> <num>
Adelie male 4043.4932 346.8116
Adelie female 3368.8356 269.3801
Adelie na 3540.0000 477.1661
Gentoo female 4679.7414 281.5783
Gentoo male 5484.8361 313.1586
Gentoo na 4587.5000 338.1937
Chinstrap female 3527.2059 285.3339
Chinstrap male 3938.9706 362.1376
(dataframe-display
(dataframe-aggregate*
dat
(species)
(n_obs (species) (length species))))
dim: 3 rows x 2 cols
species n_obs
<str> <num>
Adelie 152.
Gentoo 124.
Chinstrap 68.