Bret Alexander Beheim

Text Mining the Human Behavioral Ecology Bibliography

Thanks to Kermyt Anderson, the Human Behavioral Ecology Bibliography (HBEB) has recently gone online. Eleanor Power has taken this text file and parsed it into a bibtex file, and from there it can be turned into an Endnote library or other bibliographic database format.

{{ tweet 874664941076328448 }}

{{ tweet 875261184114442240 }}

I’m curious to know more about HBEB’s use as a dataset in its own right, and since I’ve been doing similar work with our field data here at MPI-EVA’s new Department of Human Behavior, Ecology and Culture, I figured I could turn the lens on us at the same time as I showcase the methods to interested parties.

Since this dataset will likely grow over time, having everything in one R script also means the analysis can be re-run automatically as HBEB grows. So here goes:

Initial Processing

First, we need a tool that can parse *.bib files and load them into an R environment in a format we can manipulate. A Google search reveals a solution: the ReadBib function in package RefManageR.

    library(RefManageR)

    hbeb_data <- ReadBib("./data/Human Behavioral Ecology Bibliography.bib", 
        check=FALSE)
    ## Unrecognized macro in Reyes-Garc{\'\i}a, Victoria
    ## Unrecognized macro in Manuel J. Mac{\'\i}a
    ## Unrecognized macro in Mart{\'\i} Orta-Mart{\'\i}nez
    ## Unrecognized macro in V. Reyes-Garc{\'\i}a
    ## Unrecognized macro in Victoria Reyes-Garc{\'\i}a
    ## Unrecognized macro in Martha Fr{\'\i}as-Armenta

As ReadBib loads the file into R, it displays some messages about inconsistencies in the bib syntax. We can reroute those messages to a file using sink calls, and inspect them later at our leisure.

    zz <- file("./data/messages.txt", open = "wt")
    sink(file=zz, type="message")
    hbeb_data <- ReadBib("./data/Human Behavioral Ecology Bibliography.bib", check=FALSE)
    sink(type="message")  # undo the message diversion before closing the file
    close(zz)

Having loaded the data, I need to turn it into one or more data frames to be able to do anything with it. I’m not familiar with the BibEntry class the data is now in, but it inherits the same behavior as basic R lists. I’m going to apply some list methods I use in my [YAML to JSON to CSV pipeline]({% post_url 2017-02-12-hierarchical-data-in-plain-text %}) for processing field interviews.

    nullToNA <- function(x) {
        x[sapply(x, is.null)] <- NA
        return(x)
    }

    hbeb_df <- data.frame(
        key=unlist(hbeb_data$key),
        year=unlist(hbeb_data$year), 
        pages=unlist(nullToNA(hbeb_data$pages)), 
        volume=unlist(nullToNA(hbeb_data$volume)), 
        journal=unlist(nullToNA(hbeb_data$journal)), 
        title=unlist(nullToNA(hbeb_data$title)),
        type=unlist(nullToNA(hbeb_data$bibtype)),
        publisher=unlist(nullToNA(hbeb_data$publisher)),
        doi=unlist(nullToNA(hbeb_data$doi)),
        url=unlist(nullToNA(hbeb_data$url)),
        booktitle=unlist(nullToNA(hbeb_data$booktitle))
    )

    for(i in 1:ncol(hbeb_df)) hbeb_df[,i] <- as.character(hbeb_df[,i])

To describe what the code block above is doing: the unlist function corrals the variables into simple vectors, and the data.frame function column-binds them into a data frame. I wrote a new function, nullToNA, so missing values are not dropped during this operation.

The result is a data frame with 1255 rows (the same length as the hbeb_data list) and the 11 columns we defined above. Each entry contains data associated with that citation: the year, the title, the DOI, etc., with the exception of the authors. That part is complicated because it lends itself to a [hierarchical data model]({% post_url 2017-02-12-hierarchical-data-in-plain-text %}), so I will save it for last.
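As a quick sanity check, the dimensions of the flattened table should line up with the length of the parsed BibEntry object (the exact counts will of course change as HBEB grows):

    # sanity check: one row per citation, one column per extracted field
    dim(hbeb_df)       # 1255 rows, 11 columns at the time of writing
    length(hbeb_data)  # should equal nrow(hbeb_df)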

    barplot(table(hbeb_df$year), main="count by year of publication", 
        border=NA, col=barplot_color)

    abline(h=seq(0, 100, by=20), lty=2, col=col_alpha("gray", 0.6))
    type_tab <- prop.table(table(hbeb_df$type))
    type_tab <- floor(100*type_tab)

Of the 1255 publications in HBEB, 82 percent are articles, 15 percent are chapters in edited volumes, and 1 percent are books. Looking at the number of entries in HBEB through time, there are clearly fewer publications after about 2005, with counts steadily declining up to the present.

Now that the data are in a usable (for me) format, we can start asking more pointed questions.

What journals most commonly publish human behavioral ecology articles?

If we count up the number of unique journals mentioned in the database, e.g. length(unique(hbeb_df$journal)), we initially see 184 names. But this is misleading: the list contains many alternative spellings of the same journal, which will make it difficult to accurately count up appearances.¹ For example, the Proceedings of the Royal Society appears in the database under at least six different spellings.
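A quick way to see the scale of the problem (just a grep over the raw strings, before any cleaning) is to pull out every name that mentions the Royal Society:

    # list every raw spelling that mentions the Royal Society
    sort(grep("Royal Society", unique(hbeb_df$journal), value = TRUE))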

There’s probably a more elegant approach to data cleaning, but in situations like this, I just rename the entries directly.

Let’s make a backup copy of the journal column, then use iconv and gsub to strip out stray characters.

    hbeb_df$journal_og <- hbeb_df$journal
    hbeb_df$journal <- iconv(hbeb_df$journal, to = "utf-8")
    hbeb_df$journal <- gsub("\\{", "", hbeb_df$journal)
    hbeb_df$journal <- gsub("\\}", "", hbeb_df$journal)
    hbeb_df$journal <- gsub("\"", "", hbeb_df$journal)
    hbeb_df$journal <- gsub("\\\\", "", hbeb_df$journal)
    hbeb_df$journal <- gsub("/'", "", hbeb_df$journal)

Now for the renaming:

    old_names <- c("Proceedings of the Royal Society B: Biological Science", 
        "Proceedings of the Royal Society B: Biological Sciences",
        "Proceedings of the Royal Society Biology",
        "Proceedings of the Royal Society of London: Biological Sciences",
        "Proceedings of the Royal Society, Series B",
        "roceedings of the Royal Society, Series B")

    hbeb_df$journal[hbeb_df$journal %in% 
        old_names] <- "Proceedings of the Royal Society B"

I’ll do a similar operation for PNAS and a few other journals in the Rmarkdown code, but for clarity it is not printed here.
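Since the pattern is identical each time, the repeated renames could be wrapped in a small helper; this rename_journal function is just a convenience sketch, not part of the original Rmarkdown:

    # hypothetical convenience wrapper for the repeated rename pattern
    rename_journal <- function(df, old_names, new_name) {
        df$journal[df$journal %in% old_names] <- new_name
        df
    }

    # e.g. hbeb_df <- rename_journal(hbeb_df, pnas_variants,
    #     "Proceedings of the National Academy of Sciences")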

One remaining problem is the same journal name appearing with inconsistent capitalization (this one is actually funny). We fix it by standardizing capitalization, using a function called cw which I got here.
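The cw code itself isn’t reproduced in this post, but under the assumption that it simply title-cases a string, a minimal stand-in looks something like this:

    # minimal stand-in for cw: capitalize the first letter of each word
    cw <- function(x) {
        if (is.na(x)) return(NA)
        s <- strsplit(tolower(x), " ")[[1]]
        paste(toupper(substring(s, 1, 1)), substring(s, 2), sep = "", collapse = " ")
    }

    # e.g. cw("CURRENT ANTHROPOLOGY") returns "Current Anthropology"
    hbeb_df$journal <- sapply(hbeb_df$journal, cw)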

With the journal names at least somewhat cleaned of duplicates, we can count 165 unique journals listed. We can see the most popular ten as a horizontal barplot:

    par(mar=c(5.1, 20, 4.1, 2.1))

    top_ten_journals <- rev(sort(table(as.character(hbeb_df$journal[hbeb_df$type=="Article"]))))[1:10]

    barplot( rev(top_ten_journals), horiz=TRUE, 
        las=1, xlab="pub count", border=NA, col=barplot_color)

    abline(v=seq(0, 150, by=50), lty=2, col=col_alpha("gray", 0.6))

The most popular journal in the database is Human Nature (165 publications), followed by Evolution And Human Behavior (160 publications) and Current Anthropology (76 publications). In total, these ten journals account for 63 percent of all articles in HBEB.
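For the record, that 63 percent figure is just the ten journal counts over the total number of articles, roughly:

    # share of all articles accounted for by the ten most common journals
    n_articles <- sum(hbeb_df$type == "Article")
    round(100 * sum(top_ten_journals) / n_articles)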

Identifying keywords in titles

I’ve always wanted to make a word cloud, and the title of each citation can be mined for common words. I’m going to use the packages and methods outlined here.

    # load needed libraries
    library(tm)
    library(SnowballC)

    # remove strange characters from the title input
    title_strings <- iconv(hbeb_df$title, to = "utf-8")
    title_strings <- gsub("\\{", "", title_strings)
    title_strings <- gsub("\\}", "", title_strings)
    title_strings <- gsub("\"", "", title_strings)
    title_strings <- gsub("\\\\", "", title_strings)
    title_strings <- gsub("/'", "", title_strings)

    # use tm's corpus class to process the text
    d_cor <- Corpus(VectorSource(title_strings))
    d_cor <- tm_map(d_cor, PlainTextDocument)
    d_cor <- tm_map(d_cor, removePunctuation)
    d_cor <- tm_map(d_cor, tolower)
    d_cor <- tm_map(d_cor, content_transformer(gsub), 
        pattern = "evolutionary", replacement = "evolution")
    d_cor <- tm_map(d_cor, removeWords, c('the', 'among', 
        'human', 'evolution', stopwords('english')))
    d_cor <- Corpus(VectorSource(d_cor))

Note that I took out the words “the”, “among”, “human”, “evolution”, and “evolutionary” (the last folded into “evolution” before removal), since those are the most common and least informative. Now we make a word cloud with the wordcloud function in the wordcloud library.

    # plot
    library(wordcloud)
    library(RColorBrewer)

    cloud_cols <- brewer.pal(8, "Dark2")

    wordcloud(d_cor, max.words=100, random.order=FALSE, colors=cloud_cols)

Identifying authors through relational tables

The author data introduce a common complication in behavioral databases: they do not fit easily into a flat-file configuration, because each citation has a variable number of authors. You might be tempted to simply glom them onto the hbeb_df table as author_given_name_1, author_family_name_1, author_given_name_2, author_family_name_2, etc., but this is a terrible idea and you should stop doing it. That habit comes from a tradition of manipulating data tables by hand, and wanting to be able to see all the entries in a wide format.

We will see there is one citation with 23 authors; creating dozens or hundreds of mostly-empty columns is madness. Instead, we make the data relational, creating an authors subtable linked to the main table by the citation key. Most importantly, this is how we actually want to use the data in analyses.
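To make that concrete: once the authors subtable exists (it is built just below), linking it back to the main table is a single merge on the citation key. A sketch of the kind of query this enables, using the Anderson entry we will meet in a moment:

    # join the authors subtable back onto the citation-level table by key
    author_pubs <- merge(authors, hbeb_df, by.x = "citation_key", by.y = "key")

    # e.g. which journals a particular family name has published in
    sort(table(author_pubs$journal[author_pubs$family_name == "Anderson"]))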

To build that subtable, we first need to see how RefManageR stores the author information. By inspection:

    str(hbeb_data[[1]]$author)
    ## Class 'person'  hidden list of 1
    ##  $ :List of 5
    ##   ..$ given  : chr [1:2] "Kermyt" "G."
    ##   ..$ family : chr "Anderson"
    ##   ..$ role   : NULL
    ##   ..$ email  : NULL
    ##   ..$ comment: NULL

The person class in RefManageR is essentially a list with five fixed slots. The following extraction code is quick and dirty but should serve. I begin by initializing a new data frame, which will grow vertically, one row added for each author on each citation. To keep things sane, I’ll record the citation key of the paper in each of its authors’ rows.

    authors <- data.frame( citation_key = character(0), 
        given_name = character(0), family_name = character(0),
        author_number = integer(0) )

    for(i in 1:length(hbeb_data)){
        if(!is.null(hbeb_data[[i]]$author)){ # skip if citation has no author
            for(j in 1:length(hbeb_data[[i]]$author)){
                add <- data.frame(
                    citation_key = hbeb_data[[i]]$key, 
                    given_name = paste(hbeb_data[[i]]$author[j]$given, collapse=" "), 
                    family_name = hbeb_data[[i]]$author[j]$family,
                    author_number = j
                )
                authors <- rbind(add, authors)
            }
        }
    }

    for(i in 1:ncol(authors)) authors[,i] <- as.character(authors[,i])

Now we have a long-oriented table with 2942 author-citation entries, and (unlike a wide table) we can elegantly calculate useful statistics from this relational data structure. First, let’s count up how many coauthors there are on each citation:

    hbeb_df$author_count <- table(authors$citation_key)[hbeb_df$key]

    par(mar=c(5.1, 4.1, 4.1, 2.1))
    par(mfrow=c(2,2))
    barplot(prop.table(table(hbeb_df$author_count)), las=1, 
        xlab="number of coauthors", ylab="frequency", 
        main="all publications", border=NA, col=barplot_color, 
        xlim=c(0, 7.5), ylim=c(0, 0.45))
    barplot(
        prop.table(table(hbeb_df$author_count[hbeb_df$type=="Article"])), 
        las=1, xlab="number of coauthors", ylab="frequency", 
        main="articles", border=NA, col=barplot_color, 
        xlim=c(0, 7.5), ylim=c(0, 0.45))
    barplot(
        prop.table(table(hbeb_df$author_count[hbeb_df$type=="Book"])), 
        las=1, xlab="number of coauthors", ylab="frequency", 
        main="books", border=NA, col=barplot_color, ylim=c(0, 0.45))
    barplot(prop.table(table(hbeb_df$author_count[hbeb_df$type=="InCollection"])), 
        las=1, xlab="number of coauthors", ylab="frequency", 
        main="chapters", border=NA, col=barplot_color, 
        xlim=c(0, 7.5), ylim=c(0, 0.45))

Before counting unique authors, though, we must solve a common problem. Consider the following entries for Peter J. Richerson:

    unique(authors[authors$family_name=="Richerson", c("given_name", "family_name")])
    ##      given_name family_name
    ## 39     Peter J.   Richerson
    ## 603       Peter   Richerson
    ## 2099      P. J.   Richerson
    ## 2468         P.   Richerson

The given names are not standardized, so the same person can appear under multiple versions of their name.

I’m going to try a quick match using the first letter of given_name. This can be dangerous, since it will conflate two different people who share a family name and a first initial, but (hopefully) it works for now.

    authors$author_key <- paste(
        substr(authors$given_name, 1, 1), authors$family_name
    )

The author_key variable will allow us to create a unique table of authors, and count up how many times each appears.

    author_pk <- unique(authors$author_key)
    count <- as.numeric(table(authors$author_key)[author_pk])
    given_name <- authors$given_name[match(author_pk, authors$author_key)]
    family_name <- authors$family_name[match(author_pk, authors$author_key)]

    author_register <- data.frame( author_pk, given_name, family_name, count )

    author_register$count_firstauthor <- NA
    for(i in 1:nrow(author_register)){
        author_register$count_firstauthor[i] <- sum(
            authors$author_key == author_register$author_pk[i] 
            & authors$author_number==1
        )
    }

And we have our answer: there are (approximately) 1137 unique authors in the HBEB. As with the journals, we can calculate how many times each author appears in the database. Below I display a sorted count of the top forty authors in HBEB, by the number of publications listed.

    par(mar=c(5.1, 10, 4.1, 2.1))

    o <- rev(order(author_register$count))
    author_register <- author_register[o,]

    barplot( author_register$count[1:40], horiz=TRUE, 
        las=1, xlab="pub count", border=NA, 
        col=barplot_color, names.arg=author_register$author_pk[1:40])
    abline(v=seq(0, 50, by=10), lty=2, col=col_alpha("gray", 0.6))
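The author_register also carries the count_firstauthor column computed above, so the same ranking can be done by first-authored publications; for example:

    # same idea, ranked by first-authored publications instead
    o_first <- rev(order(author_register$count_firstauthor))
    head(author_register[o_first, c("author_pk", "count_firstauthor")], 10)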

Footnotes



  1. We can also spot errors, e.g. the same book appears twice under the keys CronkLeech2013 and CronkLeech2012.