Empirical Bayes for multiple sample sizes
03 May 2017

Here’s a data problem I encounter all the time. Let’s say I’m running a website where users can submit movie ratings on a continuous 1-10 scale. For the sake of argument, let’s say that the users who rate each movie are an unbiased random sample from the population of users. I’d like to compute the average rating for each movie so that I can create a ranked list of the best movies.
Take a look at my data:
Figure 1. Each circle represents a movie rating by a user. Diamonds represent sample means for each movie.
I’ve got two big problems here. First, nobody is using my website. And second, I’m not sure if I can trust these averages. The movie at the top of the rankings was only rated by two users, and I’ve never even heard of it! Maybe the movie really is good. But maybe it just got lucky. Maybe just by chance, the two users who gave it ratings happened to be the users who liked it to an unusual extent. It would be great if there were a way to adjust for this.
In particular, I would like a method that will give me an estimate closer to the movie’s true mean (i.e. the mean rating it would get if an infinite number of users rated it). Intuitively, movies with mean ratings at the extremes should be nudged, or “shrunk”, towards the center. And intuitively, movies with low sample sizes should be shrunk more than the movies with large sample sizes.
Figure 2. Each circle represents a movie rating by a user. Diamonds represent sample means for each movie. Arrows point towards shrunken estimates of each movie’s mean rating. Shrunken estimates are obtained using the MSS James-Stein Estimator, described in more detail below.
As common a problem as this is, there aren’t a lot of accessible resources describing how to solve it. There are tons of excellent blog posts about the Beta distribution, which is useful if you want to estimate the true proportion of successes among a set of binary events. This works well in the case of Rotten Tomatoes, where one might want to know the true fraction of “thumbs up” judgments. But in my case, I’m dealing with a continuous 1-10 scale, not thumbs up / thumbs down judgments. The Beta-binomial distribution will be of little use.
Many resources mention the James-Stein Estimator, which provides a way to shrink the mean estimates only when the variances of those means can be assumed to be equal. That assumption usually only holds when the sample sizes of each group are equal. But in most real world examples, the sample sizes (and thus the variances of the means) are not equal. When that happens, it’s a lot less clear what to do.
After doing a lot of digging and asking some very helpful folks on Twitter, I found several solutions. For many of the solutions, I ran simulations to determine which worked best. This blog post is my attempt to summarize what I learned. Along the way, we’ll cover the original James-Stein Estimator, two extensions to the James-Stein Estimator, Markov Chain Monte Carlo (MCMC) methods, and several other strategies.
Before diving in, I want to include a list of symbol definitions I’ll be using because – side rant – it sure would be great if all stats papers did this, given that literally every paper I read used its own idiosyncratic notations! I’ll define everything again in the text, but this is just here for reference:
| Symbol | Definition |
| --- | --- |
| $k$ | The number of groups. |
| $\theta_i$ | The true mean of a group. |
| $\bar{x}_i$ | The sample mean of a group. The MLE estimate of $\theta_i$. |
| $\sigma_i^2$ | The true variance of observations within a group. |
| $\sigma^2$ | The true variance of observations within a group, if we assume all groups have the same variance. |
| $s_i^2$ | The sample variance of a group. The MLE estimate of $\sigma_i^2$. |
| $n_i$ | The number of observations in a group. |
| $n$ | The number of observations in a group, if we assume all groups have the same size. |
| $\varepsilon_i^2$ | The true variance of a group’s mean. If each group has the same variance of observations, then $\varepsilon_i^2 = \sigma^2 / n_i$. |
| $\varepsilon^2$ | Like $\varepsilon_i^2$, but if we assume all groups have the same sample size and observation variance, then $\varepsilon^2 = \sigma^2 / n$. |
| $\hat{\varepsilon}_i^2$ | Estimate of $\varepsilon_i^2$. |
| $\hat{\varepsilon}^2$ | Estimate of $\varepsilon^2$. |
| $\mu$ | The true mean of the $\theta_i$’s. |
| $\bar{X}$ | The sample mean of the sample means. |
| $\tau^2$ | The true variance of the $\theta_i$’s. |
| $\hat{\tau}^2$ | Estimate of $\tau^2$. |
| $\hat{B}_i$ | Estimate of the best term for weighting $\bar{X}$ relative to $\bar{x}_i$ when estimating a group’s mean. |
| $\hat{B}$ | Estimate of the best term for weighting $\bar{X}$ relative to $\bar{x}_i$, if we assume all groups have the same sample size and observation variance. |
| $\hat{\theta}_i$ | Estimate of a true group mean. Its value depends on the method we use. |
| $\alpha$ | Shape parameter for the Gamma distribution from which sample sizes are drawn. |
| $\beta$ | Scale parameter for the Gamma distribution from which sample sizes are drawn. |
| | Parameters of the distribution from which the group observation variances $\sigma_i^2$ are drawn, in simulations in which those variances are allowed to differ across groups. |
Quick Analytic Solutions
Our goal is to find a better way of estimating $\theta_i$, the true mean of each group. Instead of using the sample mean $\bar{x}_i$ on its own, wouldn’t it be better to use some weighted average of it and $\bar{X}$, the mean of the sample means?

$$\hat{\theta}_i = \hat{B}\,\bar{X} + (1 - \hat{B})\,\bar{x}_i$$
This seems very sensible. We want something that is in between the sample mean (which is probably too extreme) and the mean of means. But how do we know what value to use for the weight $\hat{B}$?
The classic answer is the James-Stein Estimator, which applies when every group has the same sample size $n$ and the same known variance of observations $\sigma^2$, so that every sample mean has the same variance $\varepsilon^2 = \sigma^2 / n$. Let’s start by defining $\hat{B}$ as:

$$\hat{B} = \frac{(k-3)\,\varepsilon^2}{\sum_{i=1}^{k} \left(\bar{x}_i - \bar{X}\right)^2}$$
This formula seems really weird and arbitrary, but it begins to make more sense if we rearrange it a bit and sweep that pesky $k-3$ under the rug, replacing it with $k-1$:

$$\hat{B} \approx \frac{\varepsilon^2}{\frac{1}{k-1}\sum_{i=1}^{k}\left(\bar{x}_i - \bar{X}\right)^2} = \frac{\hat{\varepsilon}^2}{\hat{\varepsilon}^2 + \hat{\tau}^2}$$
Before getting to why this makes sense, I should explain the last step above. The denominator $\frac{1}{k-1}\sum_i\left(\bar{x}_i - \bar{X}\right)^2$ is the observed variance of the sample means. Sample means vary around $\bar{X}$ for two reasons: the true group means really do differ (with variance $\tau^2$), and each sample mean is corrupted by sampling noise (with variance $\varepsilon^2$). In expectation, the observed variance of the sample means is therefore $\varepsilon^2 + \tau^2$, which is why we can treat the denominator as an estimate $\hat{\varepsilon}^2 + \hat{\tau}^2$.
Anyway, back to the result. This result is actually pretty neat. When we estimate a group’s mean, we shrink its sample mean towards the mean of sample means by the factor $\hat{B}$: the estimated fraction of the variance in sample means that is due to sampling noise rather than to true differences between groups. If the sample means are mostly noise ($\hat{B}$ close to 1), we shrink almost all the way to $\bar{X}$. If they are mostly signal ($\hat{B}$ close to 0), we barely shrink at all.
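For concreteness, here’s a minimal NumPy sketch of the equal-sample-size recipe exactly as written above. The function name and interface are just for illustration, not part of any library.

```python
import numpy as np

def james_stein_equal_n(xbar, eps2):
    """Classic James-Stein estimates of k group means (k > 3), assuming every
    sample mean has the same known variance eps2 (i.e. equal n and equal sigma^2).

    xbar: 1-D array of the k sample means.
    """
    xbar = np.asarray(xbar, dtype=float)
    k = len(xbar)
    grand_mean = xbar.mean()                 # mean of the sample means
    ss = ((xbar - grand_mean) ** 2).sum()    # spread of the sample means
    B = min(1.0, (k - 3) * eps2 / ss)        # shrinkage factor, clipped at 1
    return B * grand_mean + (1.0 - B) * xbar
```

For example, `james_stein_equal_n([8.5, 5.1, 6.2, 4.0, 7.3], eps2=1.0)` pulls the extreme means part of the way towards their average.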
The James-Stein Estimator is so widely applicable that many other fields have discovered it independently. In the image processing literature, it is a special case of the Wiener Filter, assuming that both the signal and the additive noise are Gaussian. In the insurance world, actuaries call it the Bühlmann model. And in animal breeding, early researchers called it Best Linear Unbiased Prediction or BLUP (technically the Empirical BLUP). The BLUP approach is so useful, in fact, that it has received the highly coveted endorsement of the National Swine Improvement Federation.
While the original James-Stein formula is useful, the big limitation is that it only works when we believe that all groups have the same variance of their sample means, $\varepsilon^2$, which in practice means the same sample size $n$ and the same observation variance $\sigma^2$. What should we do when the sample sizes differ?
The Multi Sample Size James-Stein Estimator
The most natural extension of James-Stein is to define each group’s own shrinkage factor $\hat{B}_i$, based on that group’s estimated variance of its mean, $\hat{\varepsilon}_i^2 = s_i^2 / n_i$:

$$\hat{B}_i = \frac{\hat{\varepsilon}_i^2}{\hat{\varepsilon}_i^2 + \hat{\tau}^2}$$
The denominator can just be estimated as the variance across group sample means.
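For concreteness, here’s a minimal NumPy sketch of this estimator. The function name and interface are just for illustration, and the Python and R implementations linked below may differ in their details.

```python
import numpy as np

def mss_james_stein(samples):
    """MSS James-Stein estimates of the group means.

    samples: list of 1-D arrays, one per group, with unequal sizes allowed
    (each group needs at least two observations).
    """
    xbar = np.array([np.mean(s) for s in samples])       # group sample means
    s2 = np.array([np.var(s, ddof=1) for s in samples])  # group sample variances
    n = np.array([len(s) for s in samples])
    grand_mean = xbar.mean()                             # mean of the sample means
    eps2_hat = s2 / n                                    # estimated variance of each group's mean
    denom = xbar.var(ddof=1)                             # variance across sample means ~ eps2 + tau2
    B = np.clip(eps2_hat / denom, 0.0, 1.0)              # per-group shrinkage factor
    return B * grand_mean + (1.0 - B) * xbar
```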
As reasonable as this approach sounds, it somehow didn’t feel totally kosher to me. But when I looked into the literature, it seems like most researchers basically said “yup, that sounds pretty reasonable”.
To test this approach, I ran some simulations on 1000 artificial datasets. Each dataset involved 25 groups with sample sizes drawn from a Gamma distribution with shape parameter $\alpha$ and scale parameter $\beta$. True group means were drawn from a normal distribution, and the observations within each group were drawn from a normal distribution centered on the group’s true mean.
For each dataset I computed the Mean Squared Error (MSE) between the vector of true group means and the vector of estimated group means. I then averaged the MSEs across datasets. This process was repeated for a variety of different values of $\sigma$, the within-group standard deviation of the observations.
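Here’s a rough sketch of that kind of simulation loop. The structure (25 groups, Gamma-distributed sample sizes, normal means and observations, MSE averaged over 1000 datasets) follows the description above, but the specific parameter values are placeholders, not the ones used to make the figures.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_dataset(k=25, alpha=2.0, beta=10.0, mu=5.0, tau=1.0, sigma=2.0):
    """One artificial dataset: k groups, Gamma(alpha, beta) sample sizes,
    normal true means, normal observations around each true mean.
    All parameter values here are illustrative placeholders."""
    n = np.maximum(rng.gamma(alpha, beta, size=k).round().astype(int), 2)
    theta = rng.normal(mu, tau, size=k)                    # true group means
    samples = [rng.normal(theta[i], sigma, size=n[i]) for i in range(k)]
    return theta, samples

def mse(true_means, estimates):
    return float(np.mean((np.asarray(true_means) - np.asarray(estimates)) ** 2))

# Average MSE over many datasets for one estimator. Here we score the MLE
# (raw sample means); swap in any estimator from this post instead.
mses = []
for _ in range(1000):
    theta, samples = simulate_dataset()
    estimates = [np.mean(s) for s in samples]
    mses.append(mse(theta, estimates))
print(np.mean(mses))
```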
Figure 3. Mean Squared Error between true group means and estimated group means. In this simulation, the within-group variance $\sigma^2$ is the same for all groups.
As expected, the MSS James-Stein Estimator outperformed the MLE, with lower MSEs particularly for high values of $\sigma$, where the sample means are noisiest and shrinkage helps the most.
The Multi Sample Size Pooled James-Stein Estimator
One thing that’s a little weird about the MSS James-Stein Estimator is that even though we know all the groups should have the same within-group variance $\sigma^2$, we never use that fact: each group’s $\hat{\varepsilon}_i^2$ is computed from that group’s own sample variance $s_i^2$. A more efficient approach is to pool the observations from all groups into a single estimate $\hat{\sigma}^2$, and then set each group’s $\hat{\varepsilon}_i^2 = \hat{\sigma}^2 / n_i$. I’ll call this the MSS Pooled James-Stein Estimator.
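The change relative to the previous sketch is small: estimate one pooled $\hat{\sigma}^2$ and divide it by each group’s $n_i$. Again, this is only an illustrative sketch with my own naming.

```python
import numpy as np

def mss_pooled_james_stein(samples):
    """Like the MSS James-Stein sketch above, but pools all observations to
    estimate a single within-group variance sigma^2."""
    xbar = np.array([np.mean(s) for s in samples])
    n = np.array([len(s) for s in samples])
    grand_mean = xbar.mean()
    # Pooled within-group variance: squared deviations from each group's own
    # mean, summed over all groups, divided by the total degrees of freedom.
    ss_within = sum(((np.asarray(s) - np.mean(s)) ** 2).sum() for s in samples)
    sigma2_pooled = ss_within / (n.sum() - len(samples))
    eps2_hat = sigma2_pooled / n              # groups share sigma^2 but not n
    denom = xbar.var(ddof=1)                  # variance across sample means, as before
    B = np.clip(eps2_hat / denom, 0.0, 1.0)
    return B * grand_mean + (1.0 - B) * xbar
```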
Figure 4. Mean Squared Error between true group means and estimated group means. In this simulation, the within-group variance $\sigma^2$ is the same for all groups.
This works a bit better. By obtaining more accurate estimates of each group’s $\varepsilon_i^2$, the variance of its sample mean, the MSS Pooled James-Stein Estimator produces more accurate estimates of the group means.
Of course, this only works better because we created the simulation data in such a way that all groups have the same $\sigma^2$. What happens if we instead generate data where each group has its own observation variance $\sigma_i^2$?
Figure 5. Mean Squared Error between true group means and estimated group means. In this simulation, each group has its own variance parameter $\sigma_i^2$.
In this case the plain MSS James-Stein Estimator comes out ahead of the pooled version, since pooling forces a single shared variance onto groups that do not actually share one. Which method you choose should therefore depend on whether you think your groups have similar or different variances of their observations. Here’s an interim summary of the methods covered so far.
Summary of analytic solutions
All of these estimators define the estimate of a group’s mean as a weighted average of the group’s sample mean and the mean of sample means:

$$\hat{\theta}_i = \hat{B}_i\,\bar{X} + \left(1 - \hat{B}_i\right)\bar{x}_i$$

They differ only in how $\hat{B}_i$ is computed. Make sure to clip each $\hat{B}_i$ to lie between 0 and 1.

Maximum Likelihood Estimation (MLE)

$$\hat{B}_i = 0$$

so the estimate is just the sample mean $\bar{x}_i$.

MSS James-Stein Estimator

$$\hat{B}_i = \frac{\hat{\varepsilon}_i^2}{\hat{\varepsilon}_i^2 + \hat{\tau}^2}$$

where $\hat{\varepsilon}_i^2 = s_i^2 / n_i$ and the denominator $\hat{\varepsilon}_i^2 + \hat{\tau}^2$ is estimated as the variance across the group sample means, $\frac{1}{k-1}\sum_j\left(\bar{x}_j - \bar{X}\right)^2$.

MSS Pooled James-Stein Estimator

$$\hat{B}_i = \frac{\hat{\varepsilon}_i^2}{\hat{\varepsilon}_i^2 + \hat{\tau}^2}$$

where $\hat{\varepsilon}_i^2 = \hat{\sigma}^2 / n_i$, with $\hat{\sigma}^2$ the pooled variance of observations across all groups, and the denominator again estimated as the variance across the group sample means.
Implementations in Python and R are available here.
A Bayesian interpretation of the analytic solutions
So far, the analytic approaches make sense directionally. As described above, our estimate of a group’s mean is a weighted average of the group’s sample mean and the mean of sample means, with more weight placed on the sample mean when it is more reliable.
But is this really the best weighting? Why use a ratio of variances instead of, say, a ratio of standard deviations? Why not use something else entirely?
It turns out this formula falls out naturally from Bayes Law. Imagine for a moment that we already know the prior distribution of the true group means, $\theta_i \sim \mathcal{N}(\mu, \tau^2)$, as well as $\varepsilon_i^2$, the variance of each group’s sample mean. Treating the sample mean $\bar{x}_i$ as a noisy observation of $\theta_i$, the posterior mean of $\theta_i$ is the standard normal-normal result:

$$\mathbb{E}\left[\theta_i \mid \bar{x}_i\right] = \frac{\frac{\mu}{\tau^2} + \frac{\bar{x}_i}{\varepsilon_i^2}}{\frac{1}{\tau^2} + \frac{1}{\varepsilon_i^2}}$$

(Note that the Wikipedia page for this result uses its own notation, which differs from the notation in this post.)

If we multiply all terms in the numerator and denominator by $\tau^2 \varepsilon_i^2$, we get:

$$\mathbb{E}\left[\theta_i \mid \bar{x}_i\right] = \frac{\varepsilon_i^2\,\mu + \tau^2\,\bar{x}_i}{\varepsilon_i^2 + \tau^2}$$

Or equivalently,

$$\mathbb{E}\left[\theta_i \mid \bar{x}_i\right] = B_i\,\mu + \left(1 - B_i\right)\bar{x}_i$$

where

$$B_i = \frac{\varepsilon_i^2}{\varepsilon_i^2 + \tau^2}$$
This looks familiar! It is basically the MSS James-Stein estimator. The only difference is that in the pure Bayesian approach you must somehow know the prior parameters $\mu$ and $\tau^2$, as well as the $\varepsilon_i^2$’s, in advance. The Empirical Bayes move is to estimate those quantities from the data itself, which is exactly what the analytic solutions above do.
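As a quick sanity check of the formula, here is a toy calculation with made-up numbers, in the spirit of the movie example:

```python
# Posterior mean of one group's true mean under a normal prior N(mu, tau2)
# and a normally distributed sample mean with variance eps2_i = sigma2 / n_i.
# All numbers are made up for illustration.
mu, tau2 = 5.0, 1.0            # prior over true group means
xbar_i, sigma2, n_i = 8.5, 4.0, 2
eps2_i = sigma2 / n_i          # variance of this group's sample mean (2.0)
B_i = eps2_i / (eps2_i + tau2)                  # shrinkage factor (2/3)
posterior_mean = B_i * mu + (1 - B_i) * xbar_i
print(posterior_mean)          # ~6.17: an 8.5 rating from two users is pulled far toward 5
```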
Hierarchical Modeling with MCMC
In previous sections we looked at some analytic solutions. While these solutions have the advantage of being quick to calculate, they have the disadvantage of being less accurate than they could be. For more accuracy, we can turn to Hierarchical Model estimation using Markov Chain Monte Carlo (MCMC) methods. MCMC is an iterative process for approximate Bayesian inference. While it is slower than analytical approximations, it tends to be more accurate and has the added benefit of giving you the full posterior distribution. I’m not an expert in how it works internally, but this post looks like a good place to start.
To implement this, I first defined a Hierarchical Model of my data. The model is a description of how I think the data is generated: True means are sampled from a normal distribution, and observations are sampled from a normal distribution centered around the true mean of each group. Of course, I know exactly how my data was generated, because I was the one who generated it! The key thing to understand though is that the Hierarchical Model does not contain any information about the value of the parameters. It’s the MCMC’s job to figure that out. In particular, I used PyStan’s MCMC implementation to fit the parameters of the model based on my data, although I later learned that it would be even easier to use bambi.
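If you want to see what such a model looks like, here is a minimal sketch of a hierarchical normal model in Stan, fit with PyStan 2. It follows the generative story above (normal true means, normal observations, a shared observation variance), leaves the priors implicit, and is only meant as a starting point; the actual code used for the figures is linked at the end of the post.

```python
import numpy as np
import pystan  # PyStan 2.x interface

model_code = """
data {
  int<lower=1> N;                  // total number of observations
  int<lower=1> K;                  // number of groups
  int<lower=1, upper=K> group[N];  // group index of each observation
  vector[N] y;                     // observed ratings
}
parameters {
  real mu;                         // mean of the true group means
  real<lower=0> tau;               // sd of the true group means
  real<lower=0> sigma;             // shared within-group sd
  vector[K] theta;                 // true group means
}
model {
  theta ~ normal(mu, tau);
  y ~ normal(theta[group], sigma);
}
"""

def fit_group_means(samples):
    """samples: list of 1-D arrays, one per group. Returns posterior means of theta."""
    y = np.concatenate([np.asarray(s, dtype=float) for s in samples])
    group = np.concatenate([np.full(len(s), i + 1) for i, s in enumerate(samples)])
    data = {"N": len(y), "K": len(samples), "group": group.astype(int), "y": y}
    fit = pystan.StanModel(model_code=model_code).sampling(data=data, iter=2000, chains=4)
    return fit.extract()["theta"].mean(axis=0)   # posterior mean for each group
```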
Figure 6. Mean Squared Error between true group means and estimated group means. In this simulation, the within-group variance $\sigma^2$ is the same for all groups.
For simulated data with shared $\sigma^2$, the MCMC estimates were more accurate than those of the analytic solutions.
If you don’t care about speed and are willing to write the Stan code, then this is probably your best option. It’s also good to learn about MCMC methods, since they can be applied to more complicated models with multiple variables. But if you just want a quick estimate of group means, then one of the analytic solutions above makes more sense.
Other solutions
There are several other solutions that I did not include in my simulations.
- Regularization. Pick the set of $\hat{\theta}_i$’s that minimize the penalized squared error $\sum_i \sum_j \left(x_{ij} - \hat{\theta}_i\right)^2 + \lambda \sum_i \left(\hat{\theta}_i - \bar{X}\right)^2$, where $x_{ij}$ is observation $j$ in group $i$. Use cross-validation to choose the best $\lambda$. This will probably work pretty well, although it takes a bit more work and time than the analytic solutions described above. (A sketch of this idea appears after this list.)
- Mixed Models. Over in Mixed Models World, there’s a whole ’nother literature on how to shrink estimates depending on sample size. They are especially suited for the movie ratings situation because in addition to shrinking the means, they can also correct for rater bias. I don’t really understand all the math behind Mixed Models, but I was able to use lme4 to estimate group means in simulated data under the assumption that the group means are a random effect. This gave me slightly different results compared to the James-Stein / Empirical Bayes approach. I would love it if some expert who understood this could write an accessible and authoritative blog post on the differences between Mixed Models and Empirical Bayes. The closest I could find was this comment by David Harville.
- Efron and Morris’ generalization of James-Stein to unequal sample sizes (Section 3 of their paper). I thought this paper was difficult to read. A more accessible presentation can be found at the end of this column. The Efron and Morris approach is a numerical solution that seemed to work reasonably well when I played around with it, but I didn’t take it very far. If you want to implement it, be sure to prevent any variance estimates from falling below zero. If one of them does, just set it to zero and then compute your estimates of the means. That being said, I feel like if you’re going to go with a numerical solution, you may as well just go with MCMC.
- Double Shrinkage. When we think that different groups not only have different sample sizes, but also different $\sigma_i^2$’s, we are faced with an interesting conundrum. As shown above, the MSS James-Stein Estimator outperforms the MSS Pooled James-Stein Estimator because it computes a $\hat{\varepsilon}_i^2$ specific to each group. However, these estimates of group variances are probably noisy! Just as we don’t trust the raw estimates of group sample means, why should we trust the raw estimates of group sample variances? One way to address this is to use Zhao’s Double Shrinkage Estimator, which shrinks not only the means but also the variances.
- Kleinman’s weighted moment estimator. Apparently this was motivated by groups of proportions (i.e. the Rotten Tomatoes case), but the estimator can be applied generally.
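As a rough illustration of the regularization idea in the first bullet above, here is one way to read it as penalized least squares, which happens to have a closed form per group. The penalty toward $\bar{X}$ and the naming are mine; $\lambda$ would be chosen by cross-validation.

```python
import numpy as np

def regularized_group_means(samples, lam):
    """Minimize sum_i sum_j (x_ij - theta_i)^2 + lam * sum_i (theta_i - Xbar)^2.
    Setting the derivative to zero gives theta_i = (n_i * xbar_i + lam * Xbar) / (n_i + lam)."""
    xbar = np.array([np.mean(s) for s in samples])
    n = np.array([len(s) for s in samples])
    grand_mean = xbar.mean()
    return (n * xbar + lam * grand_mean) / (n + lam)
```

Note that larger $\lambda$ shrinks every group harder toward $\bar{X}$, and groups with small $n_i$ are shrunk the most, which is the same qualitative behavior as the James-Stein estimators above.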
Conclusion
Which method you choose depends on your situation. If you want a simple and computationally fast estimate, and if you don’t want to assume that the group variances $\sigma_i^2$ are all the same, use the MSS James-Stein Estimator. If you are comfortable assuming a shared variance, the MSS Pooled James-Stein Estimator will do a bit better. And if you want the most accurate estimates and don’t mind the extra computation and code, fit a Hierarchical Model with MCMC.
Acknowledgements
Special thanks to the many people who responded to my original question on Twitter, including: Sean Taylor, John Myles White, David Robinson, Joe Blitzstein, Manjari Narayan, Otis Anderson, Alex Peysakhovich, Nathaniel Bechhofer, James Neufeld, Andrew Mercer, Patrick Perry, Ronald Richman, Timothy Sweetster, Tal Yarkoni and Alex Coventry. Special thanks also to Marika Inhoff and Leo Pekelis for many discussions.
All code used in this blog post is available on GitHub.