Origin

REAM was created as a quick solution for managing my undergraduate thesis data. To understand certain REAM design decisions, it might help to first understand the problems I was trying to solve.

In summary, I was searching for an alternative to CSV, but after trying YAML and TOML I decided to create my own language, one that caters to my needs and preferences. As I developed the language and its tools, I started reflecting on my experience dealing with social science data and gradually added more features.

Spreadsheet/CSV

I did not want to use a spreadsheet application to organize my undergraduate thesis data, because:

(1) I use Linux, and Microsoft Excel doesn't run natively in Linux,

(2) LibreOffice Calc, a popular Excel alternative in Linux, somehow kept crashing on my machine, and,

(3) I like Neovim so much that I want to edit everything in my text editor, including datasets (probably the main motivation).

I tried editing raw CSV files in Neovim, but even with plugins such as csv.vim it wasn't a smooth experience. That's when I started exploring other data serialization languages.

In addition to being editor-friendly, I wanted the solution to work well with Git or provide a native way to do version control. The solution also had to support comments, since my work consisted mostly of zipping together two larger datasets, and I wanted my matching rationales to be well documented. This was largely influenced by my work on the Composition of Religious and Ethnic Groups (CREG) Project at the Cline Center for Advanced Social Research, where we carefully document thousands of matches when zipping dozens of datasets.

YAML

I first tried YAML. I had used YAML for other projects and found it a pleasant format to work with. Each key-value pair occupies its own line, so it's easy to version control.

YAML is also good at representing nested data structures, so instead of:

row:
    - country_name: Belgium
      country_capital: Brussel
      country_population: 11433256
      country_euro_zone: true
      language_name: Dutch
      language_size: 0.59

    - country_name: Belgium
      country_capital: Brussel
      country_population: 11433256
      country_euro_zone: true
      language_name: French
      language_size: 0.4

    - country_name: Belgium
      country_capital: Brussel
      country_population: 11433256
      country_euro_zone: true
      language_name: German
      language_size: 0.01

I can write:

Country:
    - name: Belgium
      capital: Brussel
      population: 11433256
      # data from 2019, retrieved from World Bank
      euro_zone: true
      # joined in 1999
      Languages:
          - name: Dutch
            size: 0.59
          - name: French
            size: 0.4
          - name: German
            size: 0.01

With this structure, we adhere to the single-source-of-truth principle: the attributes of Belgium only ever appear once in the dataset. To "flatten" the nested data structure, all we need is a few lines of code (not the most elegant solution, but it works).
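The flattening step can be sketched in a few lines of Python. This assumes the nested YAML above has already been parsed (e.g. with PyYAML) into plain dicts and lists; the country_ and language_ prefixes are added here to avoid name collisions in the flat rows:

```python
# What a YAML parser would produce for the nested example above.
nested = {
    "Country": [
        {
            "name": "Belgium",
            "capital": "Brussel",
            "population": 11433256,
            "euro_zone": True,
            "Languages": [
                {"name": "Dutch", "size": 0.59},
                {"name": "French", "size": 0.4},
                {"name": "German", "size": 0.01},
            ],
        }
    ]
}

rows = []
for country in nested["Country"]:
    # country-level attributes, minus the nested Languages list
    base = {f"country_{k}": v for k, v in country.items() if k != "Languages"}
    for lang in country["Languages"]:
        # one flat row per (country, language) pair
        rows.append({**base, **{f"language_{k}": v for k, v in lang.items()}})

print(rows[0]["language_name"])  # Dutch
```

Each country's attributes are written once in the source but copied into every flat row, which is exactly the shape a CSV export needs.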

Another way to organize the data is with anchors and aliases, which would look like:

Country:
    - &Belgium
      country_name: Belgium
      country_capital: Brussel
      country_population: 11433256
      country_euro_zone: true

Languages:
    - <<: *Belgium
      language_name: Dutch
      language_size: 0.59
    - <<: *Belgium
      language_name: French
      language_size: 0.4
    - <<: *Belgium
      language_name: German
      language_size: 0.01

Besides being modular, this method uses native YAML features and can easily be converted to CSV. However, the key names are verbose to avoid name collisions, hence the country_ and language_ prefixes.
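The CSV conversion is direct because a YAML loader (PyYAML, for instance) resolves the `<<: *Belgium` merge keys while parsing, leaving every entry under Languages as a flat record. A sketch using only the standard library, with two of the post-resolution records written out as literals:

```python
import csv
import io

# Two Languages entries as they look after the loader has resolved
# the `<<: *Belgium` merge keys: every country_* field is copied in.
records = [
    {"country_name": "Belgium", "country_capital": "Brussel",
     "country_population": 11433256, "country_euro_zone": True,
     "language_name": "Dutch", "language_size": 0.59},
    {"country_name": "Belgium", "country_capital": "Brussel",
     "country_population": 11433256, "country_euro_zone": True,
     "language_name": "French", "language_size": 0.4},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(records[0]))
writer.writeheader()
writer.writerows(records)
csv_text = buf.getvalue()
print(csv_text)
```

Since the records are already flat, no custom flattening code is needed; `csv.DictWriter` does all the work.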

With YAML, my dataset could either be nested but with everything in a single tree, or modular but with everything flattened. How could I get a data structure that is both nested and modular?

(Spoiler: by making the language component-based)

TOML

My next candidate was TOML. It had been gaining popularity, and in my opinion it is easier to learn than YAML (the YAML spec is 84 pages long!). We can rewrite the previous example as:

[[Country]]
name = "Belgium"
capital = "Brussel"
population = 11433256
  # data from 2019, retrieved from World Bank
euro_zone = true
  # joined in 1999

[[Country.Language]]
name = "Dutch"
size = 0.59

[[Country.Language]]
name = "French"
size = 0.4

[[Country.Language]]
name = "German"
size = 0.01

I prefer TOML's syntax for arrays of tables to YAML's indentation-based approach. But I dislike quoting strings, and TOML lacks references, which were explicitly rejected early in its design.

One thing I like about TOML is its issue tracker. The discussions helped me understand its design rationale and taught me that TOML was not a good fit for my use case. TOML's arrays of tables require full path names, so the names may be verbose if a table is several levels deep. While searching for proposals to make table names more concise, I found a post stating that the language discourages deeply nested structures, which was exactly what I wanted to do. I gave up on TOML soon after reading the post and moved on.

(A new issue was opened in October 2020 to collect ideas on nested structures, but it will probably take a long time before the community reaches consensus and implements new syntax.)

Documentation

I wanted my inline documentation to be easy to read. I was comfortable reading raw monospace text files; still, I preferred reading HTML or PDF files when possible. To do so with YAML or TOML, I would have had to write a documentation generator, either building on an existing implementation or writing one from scratch. In that case, I'd much rather write an implementation of my own language, one that caters to my needs and preferences.

If I don't want to convert a data serialization language to a markup language, why not serialize a markup language? All I have to do[1] is write a decoder for this new language, and let third-party file format converters produce the HTML and PDF files.

REAM features Markdown-like syntax, in particular Pandoc-flavored Markdown. The plan was to use Pandoc as my documentation generator and ream-python to serialize the data. That's why in earlier versions numbers were surrounded by dollar signs, so that they rendered as inline math notation, and boolean values were surrounded by backticks, so that they rendered as inline code. The language is not indentation-based simply because in Pandoc, as in other Markdown flavors, a 4-space indentation indicates a code block, and I wanted more than two levels of nesting.[2] Empty lines are optional because Pandoc requires an empty line before block quotes, which I used for annotations.

[1]: If you really think about it, mapping a subset of a serialization language to a subset of a markup language is way easier than the opposite. I don't know why I thought the latter would be easier.

[2]: I came so close to making REAM indentation-based, which would have been a big mistake.