Entry

An entry is a collection of variables. An entry class is preceded by one or more pound signs #, in the form of:

# <entry class>
- <key 1>: <value 1>
- <key 2>: <value 2>
...
- <key n>: <value n>

The # Example lines in the previous examples declare Level-1 Entries, as denoted by the single leading pound sign. Every REAM file starts with a Level-1 Entry and contains exactly one Level-1 Entry.

Entries are useful when describing an object with multiple attributes:

# Country
- name: Belgium
- capital: Brussels
- population: 11433256
- euro_zone: TRUE

Here we define an object of class Country whose name is Belgium, whose capital is Brussels, whose population is 11433256, and which is part of the euro zone.
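To make the mapping from text to object concrete, here is a minimal Python sketch, not part of REAM itself, that reads a flat entry into a class name and a dictionary of variables. The function name parse_entry is hypothetical, and values are kept as plain strings because this section does not say how the compiler types them.

```python
def parse_entry(text):
    """Parse one flat entry into (class_name, variables).

    Illustrative only: this is not the REAM parser.
    """
    class_name = None
    variables = {}
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("#"):
            # Header line: drop the pound signs to get the class name.
            class_name = line.lstrip("#").strip()
        elif line.startswith("- ") and ":" in line:
            # Variable line: split on the first colon only.
            key, value = line[2:].split(":", 1)
            variables[key.strip()] = value.strip()
    return class_name, variables


source = """\
# Country
- name: Belgium
- capital: Brussels
- population: 11433256
- euro_zone: TRUE
"""

country_class, country = parse_entry(source)
print(country_class)          # Country
print(country["capital"])     # Brussels
```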

Let's annotate the dataset:

# Country
- name: Belgium
  > short for the Kingdom of Belgium
- capital: Brussels
- population: 11433256
  > data from 2019; retrieved from World Bank
- euro_zone: TRUE
  > joined in 1999

Entries can have zero variables:

# Country

Keys should be unique within an entry. The following code is meant to raise an error:

# Country
- name: Belgium
- language: Dutch
- language: French
- language: German
NOTE

The current parser does not check for duplicate keys yet, so technically this is still valid. This rule will be enforced in future versions.
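Until the parser enforces this rule, a check along the following lines could be run separately. This is a rough Python sketch under the assumption that variable lines always look like "- key: value"; the helper name duplicate_keys is hypothetical and is not part of any REAM tool.

```python
from collections import Counter


def duplicate_keys(entry_lines):
    """Return the keys that appear more than once in a single entry."""
    keys = []
    for raw in entry_lines:
        line = raw.strip()
        if line.startswith("- ") and ":" in line:
            # Keep only the key, i.e. the text before the first colon.
            keys.append(line[2:].split(":", 1)[0].strip())
    counts = Counter(keys)
    return [key for key, n in counts.items() if n > 1]


lines = [
    "# Country",
    "- name: Belgium",
    "- language: Dutch",
    "- language: French",
    "- language: German",
]
dups = duplicate_keys(lines)
print(dups)  # ['language']
```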

Subentry

Entries can be nested, and the level of the entry is denoted by the number of leading #. So a Level-1 Entry takes the form of # <Level 1 Entry Class>, and a Level-2 Entry takes the form of ## <Level 2 Entry Class>, and so forth.
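Reading the level off a header line amounts to counting its leading pound signs. The Python sketch below shows one way to do that; header_level is a hypothetical helper written for this illustration, not a function of the REAM parser.

```python
def header_level(line):
    """Split an entry header into (level, class_name)."""
    stripped = line.strip()
    # The level is the number of leading pound signs.
    level = len(stripped) - len(stripped.lstrip("#"))
    if level == 0:
        raise ValueError("not an entry header: " + repr(line))
    return level, stripped[level:].strip()


print(header_level("# Country"))     # (1, 'Country')
print(header_level("## Language"))   # (2, 'Language')
```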

Examples:

# Country
- name: Belgium

## Language
- name: Dutch

## Language
- name: French

## Language
- name: German

The # Country entry has one variable name and three Level-2 child entries ## Language.

The three ## Language subentries are also terminal nodes, as they do not contain any subentry. When compiling the dataset, the parser looks for all terminal nodes in the REAM file and flattens the data structure. Thus the previous example produces a dataset with three rows (one for each terminal node) and two columns (one for each variable).

Note that variable keys are scoped, so ## Language is allowed to have a variable with the key name even though its parent entry # Country also contains a variable with the same key.
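The flattening step can be sketched in Python as follows. This is an illustration of the idea, not the REAM compiler: the function names parse and flatten are hypothetical, the input is assumed to be correctly nested, and the Class.key column names are an assumption made here only to keep the two scoped name variables apart (they would collide if an entry were nested under another entry of the same class).

```python
def parse(text):
    """Return a list of (level, class_name, variables) in document order."""
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("#"):
            level = len(line) - len(line.lstrip("#"))
            entries.append((level, line[level:].strip(), {}))
        elif line.startswith("- ") and ":" in line and entries:
            key, value = line[2:].split(":", 1)
            entries[-1][2][key.strip()] = value.strip()
    return entries


def flatten(entries):
    """Emit one row per terminal node, including its ancestors' variables."""
    rows = []
    path = []  # the current entry and its ancestors, outermost first
    for i, (level, cls, variables) in enumerate(entries):
        del path[level - 1:]  # drop entries at the same or a deeper level
        path.append((cls, variables))
        next_level = entries[i + 1][0] if i + 1 < len(entries) else 0
        if next_level <= level:  # no child follows: this is a terminal node
            row = {}
            for anc_cls, anc_vars in path:
                for key, value in anc_vars.items():
                    row[anc_cls + "." + key] = value  # hypothetical column name
            rows.append(row)
    return rows


source = """\
# Country
- name: Belgium

## Language
- name: Dutch

## Language
- name: French

## Language
- name: German
"""

rows = flatten(parse(source))
for row in rows:
    print(row)
# {'Country.name': 'Belgium', 'Language.name': 'Dutch'}
# {'Country.name': 'Belgium', 'Language.name': 'French'}
# {'Country.name': 'Belgium', 'Language.name': 'German'}
```

The sketch yields three rows and two columns, matching the shape described above.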

Entries must be nested in order: Level-2 Entries can only be nested in a Level-1 Entry, Level-3 Entries can only be nested in a Level-2 Entry, and so forth. Compare the datasets compiled from the following two examples with the one from the previous example:

# Country
- name: Belgium

## Language
- name: Dutch
  > This is in a Level 2 Entry

### Language
- name: French
  > This is in a Level 3 Entry

### Language
- name: German
  > This is in a Level 3 Entry
And the second, where only German is a Level-3 Entry, nested under French:
# Country
- name: Belgium

## Language
- name: Dutch
  > This is in a Level 2 Entry

## Language
- name: French
  > This is in a Level 2 Entry

### Language
- name: German
  > This is in a Level 3 Entry

The differences between the three schemas are visualized below. The terminal nodes are colored yellow.

[Figure: tree diagrams of the three schemas]
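The same comparison can be made without the figure by listing which headers are terminal nodes in each schema. The Python sketch below works on the header lines alone. The helper terminal_headers is hypothetical, and raising an error on a skipped level is one possible way to enforce the nesting rule, not a description of what the REAM parser does.

```python
def terminal_headers(headers):
    """Return the indices of headers that have no child entry."""
    levels = [len(h) - len(h.lstrip("#")) for h in headers]
    # Enforce in-order nesting: a child is exactly one level deeper.
    for prev, cur in zip(levels, levels[1:]):
        if cur > prev + 1:
            raise ValueError("entry levels must not skip a level")
    terminals = []
    for i, level in enumerate(levels):
        next_level = levels[i + 1] if i + 1 < len(levels) else 0
        if next_level <= level:  # the next header is not a child
            terminals.append(i)
    return terminals


# Header lines of the three schemas, in the order they were introduced.
first = ["# Country", "## Language", "## Language", "## Language"]
second = ["# Country", "## Language", "### Language", "### Language"]
third = ["# Country", "## Language", "## Language", "### Language"]

print(terminal_headers(first))   # [1, 2, 3]
print(terminal_headers(second))  # [2, 3]
print(terminal_headers(third))   # [1, 3]
```

Under the one-row-per-terminal-node rule, the first schema therefore gives three rows, while the other two give two rows each.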

An entry can contain subentries of different classes:

# Country
- name: Belgium

## City
- name: Brussels

## Language
- name: Dutch

Also, entries of the same class need not have identical variables, nor the same variable order.

# Country
- name: Belgium

## Language
- name: Dutch
- size: 0.59

## Language
- size: 0.4
- name: French

## Language
- name: German

Observe that the order of the variables is preserved by default.

The datasets compiled from the last two examples are not very useful for analysis. To compile analysis-ready datasets, we should specify the schema of the dataset in the codebook (not yet implemented).