Future
This section is a work in progress.
The ultimate goal for REAM is to make distributing and reusing data easy. To do so, we need to think of REAM not just as a data serialization format but as a programming language, and of REAM datasets as libraries with well-defined APIs for external projects to use.
First, the language. To make REAM datasets easy to reuse, the language itself should encourage or enforce good practices. References and data filters reduce repetition. Inline documentation makes generating human-readable documentation a trivial task. Templating and static typing help validate schemas.
Second, the tooling. REAM datasets can't be easily distributed and reused without a package manager and a package registry. Building these is a notoriously difficult task, and I don't know whether this project will survive long enough for a package manager to become a real need. Still, I think it is important to keep distribution in mind when designing the language, not treat it as an afterthought. At the very least, I should provide a boilerplate directory structure so users can import datasets with `git submodule add`.
Finally, the ecosystem. The number one reason most users would even consider REAM is not the language itself but the quality datasets in the registry that they can easily install and build new datasets upon. For that to happen, the registry should offer quality datasets covering popular variables that almost every dataset depends on, such as country codes and annual GDP.

(Even though this sounds like a proposal for a standard library, what I have in mind is closer to what oh-my-zsh is to Zsh.)
Motivation
If I want to do matrix calculation in Python, I do not need to implement a linear algebra library from scratch. I can use `numpy`:

- Download `numpy`:

  ```bash
  pip install --user numpy
  ```

- Import `numpy`:

  ```python
  import numpy as np
  import numpy.linalg as la
  ```

- Use `numpy`:

  ```python
  mat_A_inv = la.inv(mat_A)
  ```
If I want to create fancy plots in R, I do not need to write a graphing library from scratch. I can use `ggplot2`:

- Download `ggplot2`:

  ```r
  install.packages("ggplot2")
  ```

- Import `ggplot2`:

  ```r
  library(ggplot2)
  ```

- Use `ggplot2`:

  ```r
  plot_1 = ggplot(data = dat) + geom_point(aes(x = x, y = y))
  ```
If I want to add country GDP as a control variable in my dataset, I don't need to calculate GDP for each country myself. I can use GDP data from World Bank Open Data:

- Download dataset
  - Google `World Bank GDP` and click on the first result (assuming you have ad-block installed)
  - Download the data in CSV format
  - Unzip `API_NY.GDP.MKTP.CD_DS2_en_csv_v2_1678496.zip`
- Import dataset
  - Read the dataset:

    ```r
    wb = read.csv("./API_NY.GDP.MKTP.CD_DS2_en_csv_v2_1678496.csv")
    ```

  - See error messages:

    ```
    Error in read.table(file = file, header = header, sep = sep, quote = quote,  :
      more columns than column names
    ```

  - Open the CSV file in a text editor. Discover the actual dataset starts on line 5.
  - Reread the dataset:

    ```r
    wb = read.csv("./API_NY.GDP.MKTP.CD_DS2_en_csv_v2_1678496.csv", skip = 4, header = T)
    ```
- Use dataset
  - Write function `get_gdp` to extract GDP:

    ```r
    get_gdp = function(country, year) {
      col_i = grep(paste0('X', year), names(wb))
      gdp = wb[wb$Country.Name == country,][col_i]
      return(gdp[1,1])
    }
    ```

  - Apply the function by row:

    ```r
    my_data$GDP = apply(my_data, 1, function(row) get_gdp(row['country'], row['year']))
    ```
  - Discover Ivory Coast has `NA` GDP. Oh, it's called "Cote d'Ivoire" in the World Bank's dataset.
  - Figure out all the name differences between the two datasets and write a "dictionary" for translation:

    ```r
    country_dict = list(
      "Brunei" = "Brunei Darussalam",
      "Ivory Coast" = "Cote d'Ivoire",
      "Russia" = "Russian Federation"
      # You get the idea
    )
    ```

  - Modify `get_gdp`:

    ```r
    get_gdp = function(my_country, year) {
      col_i = grep(paste0('X', year), names(wb))
      # Fall back to the original name if the country is not in the dictionary
      wb_country = country_dict[[my_country]]
      if (is.null(wb_country)) wb_country = my_country
      gdp = wb[wb$Country.Name == wb_country,][col_i]
      return(gdp[1,1])
    }
    ```

  - Apply the function again:

    ```r
    my_data$GDP = apply(my_data, 1, function(row) get_gdp(row['country'], row['year']))
    ```
There should be an easier way to import existing datasets. We need a package manager for data.
It's not unheard of to download datasets with package managers.
Besides the built-in datasets in R, you can download quite a few datasets from CRAN using `install.packages`, including example datasets bundled with libraries (`diamonds` in `ggplot2`), datasets packaged as libraries (`titanic`), and wrappers around data APIs (`censusapi`
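For example (a minimal sketch; `diamonds` ships with `ggplot2`, and `titanic_train` is, to my knowledge, the data frame exported by the `titanic` package):

```r
# Datasets already distributed through CRAN's package manager
install.packages(c("ggplot2", "titanic"))

library(ggplot2)
head(diamonds)        # example dataset bundled with a library

library(titanic)
head(titanic_train)   # dataset packaged as a library of its own
```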
But my ideal package manager is more than a downloader.
Dependency
Consider Alesina et al.'s (2003) study on ethnic, linguistic, and religious fractionalization. To calculate ethnic fractionalization indices for each country, the authors compiled a list of ethnic groups worldwide by consulting six types of sources:
- Encyclopedia Britannica (EB)
- CIA World Factbook (CIA)
- Scarritt and Mozaffar (1999) (SM)
- Levinson (1998) (LEV)
- World Directory of Minorities (WDM)
- National census data (CENSUS)
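For reference, the fractionalization index computed from such group lists is one minus the Herfindahl index of group shares, i.e. the probability that two randomly drawn individuals from a country belong to different groups. A minimal sketch in R (the shares below are made up):

```r
# Fractionalization: 1 minus the Herfindahl index of group population shares
fractionalization = function(shares) {
  1 - sum(shares^2)
}

fractionalization(c(0.6, 0.3, 0.1))  # hypothetical country with three groups -> 0.54
```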
If you plot the dependency graph, it'll look like the following:
Alesina et al. also calculated fractionalization indices for language and religion based on data from the Encyclopedia Britannica. Let's update the dependency graph:
To study the effects of fractionalization, variables from past studies, including Easterly and Levine (1997) and La Porta et al. (1997), are added:
The fractionalization indices are compared with existing measures:
We can continue to expand the dependency graph for each of the dependencies. After adding a few of Easterly and Levine's dependencies, we get:
Eventually, all relevant data are extracted from the dependencies, manually or through scripts, into the aggregated dataset.
The practice of copying dependencies to your own project is known as vendoring in programming. Vendoring is not necessarily a bad thing, but we do lose some information along the way.
(TODO: discuss pros and cons of vendoring)
Updating dependencies
Let's zoom in on the dependency graph and focus on ethnic data.
If researchers plan to reproduce the research with updated dependencies, how would they update the dataset?
If the original dataset was created by manually copying and pasting data from dependencies, they'll probably have to repeat the process.
If the dataset was created by extracting data from dependencies with scripts, rerunning the scripts will regenerate the dataset, but only if the schemas of all dependencies remain the same. Otherwise, the schema changes have to be analyzed and custom migration scripts run before the original scripts, as sketched below.
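To make this concrete with the World Bank example from earlier (a minimal sketch; the list of required columns is an assumption about what the original extraction script relies on), a rerun might first verify that the updated dependency still matches the expected schema:

```r
# Before rerunning the original extraction script on an updated dependency,
# check that the columns the script relies on still exist.
wb = read.csv("./API_NY.GDP.MKTP.CD_DS2_en_csv_v2_1678496.csv", skip = 4, header = TRUE)

required_cols = c("Country.Name", paste0("X", 2000:2019))  # assumed schema
missing_cols = setdiff(required_cols, names(wb))

if (length(missing_cols) > 0) {
  # The schema changed: a migration step is needed before the original script can run.
  stop("Dependency schema changed; missing columns: ",
       paste(missing_cols, collapse = ", "))
}
```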
(To be continued)