Bret Alexander Beheim

Capturing Hierarchical Data in Plain Text

A practical method of entering structured data into a computer is to type it directly into a text file. Simply fire up your favorite CSV editor [1], sit down with your source material, and start typing. Unlike boutique database software, this method can be used immediately and with minimal training. In fact, I know several researchers who skip paper forms altogether and do it during their field interviews.

Plaintext data entry works great, unless the data are hierarchical. I’m thinking of one of three situations:

  1. A question might have multiple answers, e.g. “What did you buy at the store today?”, “Who are your closest friends?”, “What fish did you catch?”
  2. A set of questions are to be grouped together as repeating sub-components, e.g. a list of recent trips might include columns called “outbound destination”, “outbound travel time”, “outbound start time”, “outbound average speed”, etc. with “inbound” equivalents.
  3. A set of questions basically describe a complete subtable e.g. a household interview that asks for a list of residents, including name, age, sex, etc. Here there are both a variable number of ‘entities’ (as in 1) and each ‘entity’ has repeated sub-components (as in 2).

Hierarchical data should ultimately be stored in a relational database with a normalized schema. Programs like Microsoft Access or FileMaker can do this, but both cost money and require specialized training and setup time for each project. They also require a schema defined before any data are collected, which is optimistic.

A plaintext approach can work, but the existing solutions are usually unsatisfactory. A common fix is to create sets of repeating columns, e.g. in a household census file having resident1_name, resident1_age, resident1_sex, resident2_name, …, etc in a wide format. This leads to ugly, mostly empty tables that require substantial reshaping for data normalization.

Recently I’ve been playing with the idea of typing into a data interchange format. Unlike wide-form tables, these formats were designed to store structured data in plain text so it could pass from place to place, e.g. from a web server to a browser.

So the idea, then, is that you’d type each interview into a text file using an interchange format, then use standard parsing tools [2] to pull them into a relational database.

The first format I tried is JavaScript Object Notation (JSON), whose syntax is defined by five elegant diagrams. For example, in building a database of 19th-century scientists, a JSON record might look like so:

{
  "name": "Charles Robert Darwin",
  "sex": "male",
  "dob": "1809-02-12",
  "dod": "1882-04-19",
  "nationality": "British",
  "spouses": "Emma Wedgwood",
  "children": [
    "William Erasmus Darwin",
    "Anne Elizabeth Darwin",
    "Mary Eleanor Darwin",
    "Henrietta Emma Darwin",
    "George Howard Darwin",
    "Elizabeth Darwin",
    "Francis Darwin",
    "Leonard Darwin",
    "Horace Darwin",
    "Charles Waring Darwin"
  ],
  "published works": [
    {
      "name": "On the Origin of Species by Means of Natural Selection, or the Preservation of Favoured Races in the Struggle for Life",
      "year": "1859"
    },
    {
      "name": "The Descent of Man, and Selection in Relation to Sex",
      "year": "1871"
    },
    {
      "name": "The Expression of the Emotions in Man and Animals",
      "year": "1872"
    }
  ]
}

JSON syntax is organized around three concepts:

  1. objects, which are unordered collections of "name": value pairs wrapped in curly braces,
  2. arrays, which are ordered lists of values wrapped in square brackets, and
  3. values, which can be strings in double quotes, numbers, true, false, null, or nested objects and arrays.

The language allows arbitrarily complex hierarchies because JSON values can include other JSON objects or arrays.

However, looking at the example above, you can see that JSON is not really right for manual data entry, especially in a real-time interview. Lots of quotation marks, braces, brackets, and commas have to be typed exactly right, or else the record won’t parse.

Enter YAML, a simplified sister format intended for direct human editing. For example, the above in YAML would be:

  name: Charles Robert Darwin
  sex: male
  dob: 1809-02-12
  dod: 1882-04-19
  nationality: British
  spouses: Emma Wedgwood
  children:
  - William Erasmus Darwin
  - Anne Elizabeth Darwin
  - Mary Eleanor Darwin
  - Henrietta Emma Darwin
  - George Howard Darwin
  - Elizabeth Darwin
  - Francis Darwin
  - Leonard Darwin
  - Horace Darwin
  - Charles Waring Darwin
  published works:
  - name: On the Origin of Species by Means of Natural Selection, or the Preservation
      of Favoured Races in the Struggle for Life
    year: 1859
  - name: The Descent of Man, and Selection in Relation to Sex
    year: 1871
  - name: The Expression of the Emotions in Man and Animals
    year: 1872

YAML’s syntax replaces almost all of the JSON markup with clever line breaks and indentation, in readable, logical ways. Moreover, any valid JSON can be written as YAML, so converting between the two is not hard.
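
For instance, converting a hand-typed record from YAML to JSON takes only a few lines. Here is a minimal sketch in R, assuming the Darwin record above is saved as darwin.yaml (a hypothetical file name) and that the yaml and jsonlite packages are installed:

  library(yaml)
  library(jsonlite)

  record <- yaml.load_file("darwin.yaml")      # hand-typed YAML becomes a nested R list
  write_json(record, "darwin.json",            # ...and goes back out as equivalent JSON
             auto_unbox = TRUE, pretty = TRUE)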

The cost of this is fragility; unlike JSON, YAML depends on exact indentation, and some programs (e.g. Notepad.exe) will try to reformat your files. [3]

Does this mean you can simply do away with relational databases altogether? That’s probably a terrible idea. [4] The goal, rather, is to route the hierarchical data into a proper relational database. So my current pipeline, with step 3 sketched in R below, is:

  1. Type the interview data into YAML.
  2. Automatically convert this into JSON for archiving.
  3. Parse the JSON files into a relational database using R scripts.
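
Step 3 might look something like the following rough sketch, assuming the archived JSON files sit in a records folder (a hypothetical layout) and use the field names from the Darwin example above; each resulting data frame then maps onto one table of the relational database.

  library(jsonlite)

  files <- list.files("records", pattern = "\\.json$", full.names = TRUE)

  people <- list()
  children <- list()
  works <- list()

  for (f in files) {
    rec <- read_json(f)                             # nested list, one per record
    id  <- tools::file_path_sans_ext(basename(f))   # use the file name as a key

    # one row per person
    people[[id]] <- data.frame(person_id = id, name = rec$name,
                               sex = rec$sex, dob = rec$dob)

    # one row per child, keyed back to the person
    if (length(rec$children) > 0)
      children[[id]] <- data.frame(person_id = id,
                                   child_name = unlist(rec$children))

    # one row per published work
    if (length(rec$`published works`) > 0)
      works[[id]] <- data.frame(person_id = id,
                                title = sapply(rec$`published works`, `[[`, "name"),
                                year  = sapply(rec$`published works`, `[[`, "year"))
  }

  people_tbl   <- do.call(rbind, people)
  children_tbl <- do.call(rbind, children)
  works_tbl    <- do.call(rbind, works)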

That’s all.

Endnotes



  1. Excel is ubiquitous, and the spreadsheet structure makes it easier to type with your eyes off the screen. Free alternatives include OpenOffice and Comma Chameleon. However, provided you are careful with formatting, there’s nothing stopping you from filling out a CSV table directly in a text editor, e.g. Notepad++ (Windows) or Sublime Text (OS X / Linux). The latter program is also interesting because it uses JSON to store all user settings. ↩︎

  2. For R users, the jsonlite package provides a fromJSON function which works well. The yaml package provides similar functions, but JSON is better for storage. ↩︎

  3. Specifically, if you open a non-Windows text file in Notepad, it will not detect the Linux/Mac standard line break, so the whole file displays as a single line. Since YAML depends on exact line breaks, it can become unreadable; JSON does not have this problem. ↩︎

  4. As outlined by one developer’s hard-earned experience here. ↩︎