A colleague doing a social network analysis posed the following problem:1

In my field interviews, I had people name friends they could go to for support. Most friends are the same ethnicity as my interviewees, but some are members of nearby minority groups. I’m trying to see if the women at my field site have a lesser/greater percent of friends from other ethnic groups than men. I was wondering if there’s some way to do this in STAN in R, or some other way that you might know of how to deal with this…

While the question is asking about proportions, in my experience, it is better to use counts, if you can. Only counts can properly take sample sizes into consideration.

However, it is not accurate to just change question to, “Do women in the sample have a greater number of coethnic friends than men?” It could be that women might have a higher number, on average, because women have more friends in general. The spirit of the original question requires we condition the analysis on the total number of friends first, or, “Do men have more/fewer minority friends, for a given number of friends, than women?”

To feel confident we’re on the right track statistically, we will use a valuable trick. I will generate real-looking data from a stochastic process that embodies my assumptions about the real data, analyze it, and confirm that I can recover signals I inject into the data with my statistical methods.

Here our generating process will be as follows: for our sample of people, we specify how many friends each will have. Then, of those friends, we will choose how many will be co-ethnics, and incorporate the original question by adding sex of the respondant as a factor in this process.

In the real data, I am told, there are about 250 interviews, and everyone names at least 1 friend, roughly symmetrical about a mean of 15, with a standard deviation of 5. Coding this into an R notebook, we have:

  n_sample <- 250
  n_friends <- round( rnorm(n_sample, 15, 5) )
  n_friends[n_friends < 1] <- 1 # hack so everyone has some friends
  n_friends <- as.integer(n_friends)

The distribution here is, to some extent, outside the scope of the question. We do not need to worry about whether, for example, men or women have more friends in general, so long as our methods condition on the number of friends for each person.

Now we have a vector of 250 values, each representing the number of friends for that interview. Next we must determine how many of those are minority-group friends. We can do this by imaging each of the friends in the count is a ‘trial’, and is either a coethnic or not with some fixed probability. In other words, the process is Bernoulli.

We choose a baseline number of minority friends at about 10%, which on the logit scale is about -2.2 log odds units.2

To generate this, I will treat each friend as a Bernoulli trial, and they will be a minority with probability prob_minority_friend, a value which differs for male and female interviewees. As R code:

  female <- rbinom(n_sample, 1, 0.5)
  a <- (-2.2)
  b <- (-0.25)

  logistic <- function(x) exp(x) / (1 + exp(x))
  prob_minority_friend <- logistic( a + b*female )
  n_minority_friends <- rbinom( n_sample, n_friends, prob_minority_friend )

We have set a log-odds difference of 0.25 between women and men, so the odds ratio is 0.78, and the data should show men to have more minority friends than women, conditional on any number of friends. If we can correctly recover that, we have a stats model we can take to the real data.

One value of simulating data ourselves is that it encourages us to see the real data in a new light. The way we coded this focuses not on the percentage of coethnics among men and women, but rather, the probability. It is a subtle difference from the original question about percentages, which cannot properly account for variation in the total number of friends.

The process model has, in fact, already specified the statistical model, save for the priors on the parameters. We will do this by a Binomial regression model with a logit link. Using Richard McElreath’s map model syntax, this would be:

  d <- data.frame( n_friends, n_minority_friends, female )
  d$n_friends <- as.integer(d$n_friends) # necessary for dbinom

  library(rethinking) # available at xcelab.net/rm/software

  model1 <- alist(
    n_minority_friends ~ dbinom(n_friends, p),
    logit(p) <- a + b*female,
    a ~ dnorm(0, 1),
    b ~ dnorm(0, 1)

  fit1 <- map(model1, data=d)

Summarizing the model fits using precis, we have:

##    Mean StdDev  5.5% 94.5%
## a -2.22   0.08 -2.35 -2.09
## b -0.20   0.11 -0.39 -0.02

The model has successfully recovered the signal, estimating the parameter to be about -0.2 with a standard error of 0.11, clearly distinct from a zero effect. This is good evidence women have a higher percentage of coethnic ties than men do.

On the probability scale, for females it is a chance of 0.08, give or take 0.01, while for males, 0.10, also give or take 0.01.3 Granted a difference this small might not matter much in the grand scheme of things, but it is valuable to be able to say what it is.

Of course, with real data there are innumerable ways this result could be due to confounding, so a proper analysis would include more predictor variables, and also allow for individual variability through varying effect intercepts.4 We can also use this approach to perform a power analysis, by varying the n_sample values and the strength of the true b parameter, and seeing how that affects our output.

Still, this kind of validation analysis has wide applicability, and I try to use it whenever I can to check my understanding.


  1. modified slightly for clarity 

  2. Confirm that logistic(-2.2) = 0.10, where logistic(x) = exp(x)/(1 + exp(x))

  3. Confirm logistic(a) and logistic(a + b) from the map parameter estimates. 

  4. Varying effects require we replace map with map2stan, so we can run a true MCMC analysis.