# Part 1: A bipartisan list of people who are bad for America

Imagine that alien researchers visited America to learn about our political culture. If they wrote a report to send back to their planet, I imagine it would look something like this:

Americans have split themselves up into two opposing political tribes. Most people who associate with these groups are well-intentioned, but occasionally some members of a tribe do something bad or say something dumb. Whenever this happens, members of the opposite side feel good about themselves.

Certain writers and media personalities have learned to exploit this fact for personal gain. They have found that they can maximize their TV ratings and social media points by writing news stories that either cherry pick the worst actions of the other side or which interpret the other side’s actions in the least charitable way possible. As a result, news readers have developed increasingly distorted beliefs about their political opponents. The civic culture of the society is broken.

Below is a bipartisan list of people who are stoking partisan outrage for personal gain. Some of them do it for retweets, some of them do it for TV ratings, and some of them – still culpable – do it because they have entered a filter bubble themselves, fueling their own distorted and harmful sense of mission.

• Sean Davis (The Federalist) His Twitter account is deliberately uncharitable.
• Sopan Deb (New York Times) During the presidential campaign, his Twitter feed was nonstop, “look what this stupid Trump supporter said”.
• Stephen Miller (The Wilderness, ex-NRO)
• Chris Cillizza (The Washington Post)
• Sean Hannity (Fox News)
• Tucker Carlson (Fox News)
• Samantha Bee I know, she’s a comedian. I like jokes. But given how many people get their news from selectively edited comedy shows, it’s fair to say that comedians bear some responsibility.
• John Oliver (HBO) It pains me to include him on this list, since he is funny and since his show also includes some constructive policy explainers. But much of the content is selectively edited clips that paint a very distorted picture of the other side.
• Greg Gutfeld (Fox News comedian)

It doesn’t matter if some of the people on this list do accurate reporting. What matters is that their reporting is selective. It doesn’t matter if some of the people on this list support some good policy ideas. What matters is that listening to them will destroy your brain’s ability to understand where the other side is coming from. And it doesn’t matter if one side is more filter-bubbled than the other. Both sides are badly filter-bubbled. Avoiding the people in this list is a good place to start.

In Part 2, I’ll post a bipartisan list of people who argue in good faith.

# Three questions for social scientists: Internet virtue edition

This isn’t news to anybody, but the internet is changing our culture. Recently, I’ve been thinking about how it has changed our moral culture, and I realized that most of our beliefs on this topic are weirdly in tension with one another. Below are three questions that I feel are very much unresolved. I don’t have any good answers to them, and so I think they might be good topics for social science research.

#### 1. Moral Substitution and Moral Licensing versus Moral Contagion

When people do the Ice Bucket Challenge or put a Pride symbol on their profile avatar, they are sometimes accused of virtue signalling, a derogatory term akin to moral grandstanding. Virtue signallers are said to care more about showcasing their virtue than about creating real change.

Virtue signalling is bad, allegedly, for two reasons: First, instead of performing truly impactful moral acts, virtue signallers spend more time performing easy and symbolic acts. This could be called moral substitution. Second, after doing something good, people often feel like they’ve earned enough virtue points that they can get away with doing something bad. This well-studied phenomenon is called moral licensing.

While there are some clear ways that virtue signalling can be bad, there is another way in which it is good. Doing good things makes other people more likely to do good things. This process, known as moral contagion, was famously demonstrated in the Milgram experiments. Participants in those experiments who saw other participants behave morally were dramatically more likely to behave morally as well.

If the social science research is right, then we can conclude that putting a Pride symbol on your avatar make you behave worse (via moral licensing and moral substitution), but it makes other people behave better (via moral contagion). This leaves a couple of open questions:

First, how do the pros and cons balance out? Perhaps your Pride avatar is net positive if you have a large audience on Facebook, but net negative if you have a small audience. And second, how does moral contagion work with symbolic acts? Does the Pride avatar just make other people add Pride symbols to their avatars? Or does it make them behave more ethically in real and impactful ways?

We are beginning to get some quantitative answers to these questions. Clever research from Linda Skitka and others has shown that committing a moral act makes you about 40% less likely to commit another moral act later in the day (moral licensing), whereas hearing about someone else’s moral act makes you about 25% more likely to commit a moral act later in the day (moral contagion), although the latter finding fell short of statistical significance. More research is needed though, particularly when it comes to social media and symbolic virtue signalling.

#### 2. Slacktivism versus Violent Revolution

This question is more for political scientists.

Many people are concerned that the internet encourages slacktivism, a phenomenon closely related to moral substitution. It’s easier to slap a Pride symbol on your Facebook than to engage in real activism. In this way, the internet is really a tool of the already powerful.

On the other hand, some people are concerned that the internet cultivates violent radicalism. Online filter bubbles create anger and online networks create alliances, ultimately leading to violent rhetoric and homegrown terrorism. Many observers already sense the undercurrents of violent revolution.

How can we be worried that the internet is causing both slacktivism and violent radicalism? One possibility is that we only need to worry about slacktivism, and that the violent rhetoric isn’t actually violent – it’s just rhetoric. But the other possibility is that the internet has made slacktivists out of people who otherwise wouldn’t be doing anything at all, and it has made violent radicals out of people who would otherwise be mere activists. I’m not sure what the answer is, but it would be useful to understand this more.

#### 3. Political Correctness: Overton Windows versus Wolf Crying

Perhaps because of filter bubbles on both sides of the political spectrum, the term “political correctness” is back with a vengeance. Leaving aside the question of whether political correctness is good or bad, it would be interesting to understand whether it is effective. On the one hand, political correctness may help define an Overton Window, setting useful bounds around opinions that can be aired in polite company. But on the other hand, if the enforcers squeeze the boundaries too much, imposing stricter and stricter controls on the range of acceptable discourse, they risk undermining their own credibility by “crying wolf”. For what it’s worth, many internet trolls credit their success to a perception that so-called social justice warriors overplayed their cards. I’m not sure how much to believe them, but it seems possible.

Just in terms of effectiveness, is there a point at which political correctness starts to backfire? And more broadly, what is the optimal level of political correctness for a society? Neither of these questions seems easy to answer, but I would love to learn more.

# Keyboard shortcuts I couldn't live without

Keyboard shortcuts are interesting. Even though I know they are almost always worth learning, I often find myself shying away from the uncomfortable task of actually learning them. But after years of clumsily reaching for the mouse while my colleagues looked at me with a kindly sense of pity, I have slowly accumulated enough keyboard tricks that I’d like to share them. This set is probably far from optimal, and different people have their own solutions, but it has worked well for me. Before jumping in, here’s a reference table of key symbols and their common names:

Key Symbol Key Name
Command
Shift
Control
Alt/Option
Enter
Table 1. Key symbol map.

### Sublime Text

Sublime has lots of great shortcuts. My favorites is ⌘D, which allows you to sequentially select exact matches of highlighted text. Once the matches are selected, you can simultaneously edit them with a multi-cursor. If you want to select all matches simultaneously, rather than sequentially, you can use ⌃⌘G.

Figure 1. Demonstration of ⌘D, ⌘←, ⌘→, ⌘A and ⌘KU in Sublime Text.

### Chrome

With the exception of scrolling and link clicking, everything you do in Chrome should be done with keyboard only. If you’re new to this, I’d recommend starting with ⌘L, ⌘T, ⌘W and then expanding from there. Special bonus: if you ever accidentally close a tab, you can reopen it with ⌘⇧T.

Figure 2. The "No Touching" Chrome Zone. Your mouse should never come anywhere near here.

### Mac OS X

On Mac OS X, ⌘ Tab switches between applications, and ⌘ switches between windows of the same application. The ⌘+ and ⌘- shortcuts change the display size of text and other items. You can jump to the beginning of a line with ⌘← or ⌃A, and to the end of a line with ⌘→ or ⌃E. In Terminal, you can delete to the beginning of a line with ⌃U.

For easy window snapping, use the Spectacle app. Because Spectacle’s default mappings conflict with Chrome’s tab switching shortcuts, I’d recommend setting the four main screen position shortcuts to ⌘⌃←, ⌘⌃→, ⌘⌃↑, ⌘⌃↓, and eliminating all the other shortcuts, except the full screen shortcut, which should be set to ⌘⌃F.

Figure 3. Arranging windows with custom shortcuts in Spectacle.

### Gmail

If you’re using a mouse on Gmail, you’re doing it wrong. With the exception of a few word processing operations, literally everything you do in Gmail should be done with keyboard only. Compose, Reply, Reply All, Forward, Send, Search, Navigate, Open, Inbox, Sent, Drafts, Archive. All of these should be done with the keyboard. To enable these shortcuts, you must go into your Settings and navigate to the General tab. Once shortcuts have been enabled, you can see a list of all them by typing ?.

Figure 4. A small sample of things you can do on Gmail without ever touching your mouse.

With shortcuts similar to Gmail’s, you can jump to different pages using only the keyboard: gh brings you to the Home Timeline and gu lets you jump to another user’s profile. The most useful shortcut is probably ., which loads any new tweets that are waiting for you at the top of the Timeline. You can see a list of all shortcuts by typing ?.

### JetBrains

JetBrains products like DataGrip, PyCharm, and IntelliJ offer plenty of keyboard shortcuts. My favorites are ⌃G, for sequential highlighting, and ⌥⌥↓ and ⌥⌥↑ for multi-line cursors.

### Jupyter

Jupyter has tons of essential keyboard shortcuts that can be found by typing ? while in command mode. In addition, it’s possible to get Sublime-style text editing by following the instructions described here.

Figure 5. Common Jupyter workflow done entirely with the keyboard, with help from some Sublime-style editing: c and v to copy and paste a cell, ⌘D for multiple selection, ⌘→ to jump to the end of line, dd` to delete a cell.

# Learning by flip-flopping

I recently came across Artir’s Pyramid of Economic Insight and Virtue. It’s not actually a pyramid, but is instead a riff on the Expanding Brain meme. Check it out:

What’s interesting about Artir’s Pyramid is that at every step, the position flip-flops from the previous step. This isn’t just a dialogue between two sides. It is a description of the belief sequence that people traverse as they learn more about an issue. We might call this learning by flip-flopping.

This got me thinking: In what other issues do people go through a sequence of flip-flops as they learn more about it? In this blog post, I’d like to suggest a few.

Let me stress that in presenting these I don’t necessarily think that the “highest” levels in these examples are correct, nor do I think I have a strong understanding on many of these issues. It’s just something that’s fun to think about.

### Increasing the minimum wage

This is arguably a special case of Artir’s Pyramid and is probably the canonical example of learning by flip-flopping.

When it comes to the minimum wage debate and other debates, I often sometimes see a Stage 2 person talking to someone they believe is at Stage 1 but who is in fact at Stage 3.

Further reading on the minimum wage: Card and Krueger, criticism of Card and Krueger’s data, another case against Card and Krueger, two better studies.

### How to deal with a recession

Recession flip-flopping is less related to Artir’s Pyramid, but is still quite common. I may be bungling some of the later stages here, as my macro knowledge is mostly cobbled together from parody rap videos, so I welcome any suggestions for additional further reading.

Further reading on recessions: The government is not a household, Keynesian economics, boom and bust cycles, and a wonderful book by Tim Harford.

I’d like to keep this blog post as value-judgment free as possible, but I’ll make a special exception for this one. The 140 character limit is no longer a good idea, and Stage 3 is the correct stage.

### The meaning of life

David Chapman writes about how STEM-trained people should think about meaning. Extending Robert Kegen’s theory of human development, he believes that most STEM-trained people can find meaning in ideological rationalism (Stage 4) but, upon finding that rationality does not provide any meaning, they become in danger of falling into the Nihilism trap (Stage 4.5). Chapman claims that there is a Stage 5, sometimes called meta-rationality or fluidity, in which meaning can once again be found. You can read more about it on his blog.

What other examples of learning by flip-flopping are out there?

UPDATE: John McDonnell points me towards Hegelian Dialectic.

# Empirical Bayes for multiple sample sizes

Here’s a data problem I encounter all the time. Let’s say I’m running a website where users can submit movie ratings on a continuous 1-10 scale. For the sake of argument, let’s say that the users who rate each movie are an unbiased random sample from the population of users. I’d like to compute the average rating for each movie so that I can create a ranked list of the best movies.

Take a look at my data:

Figure 1. Each circle represents a movie rating by a user. Diamonds represent sample means for each movie.

I’ve got two big problems here. First, nobody is using my website. And second, I’m not sure if I can trust these averages. The movie at the top of the rankings was only rated by two users, and I’ve never even heard of it! Maybe the movie really is good. But maybe it just got lucky. Maybe just by chance, the two users who gave it ratings happened to be the users who liked it to an unusual extent. It would be great if there were a way to adjust for this.

In particular, I would like a method that will give me an estimate closer to the movie’s true mean (i.e. the mean rating it would get if an infinite number of users rated it). Intuitively, movies with mean ratings at the extremes should be nudged, or “shrunk”, towards the center. And intuitively, movies with low sample sizes should be shrunk more than the movies with large sample sizes.

Figure 2. Each circle represents a movie rating by a user. Diamonds represent sample means for each movie. Arrows point towards shrunken estimates of each movie's mean rating. Shrunken estimates are obtained using the MSS James-Stein Estimator, described in more detail below.

As common a problem as this is, there aren’t a lot of accessible resources describing how to solve it. There are tons of excellent blog posts about the Beta-binomial distribution, which is useful if you wish to estimate the true fraction of an occurrence among some events. This works well in the case of Rotten Tomatoes, where one might want to know the true fraction of “thumbs up” judgments. But in my case, I’m dealing with a continuous 1-10 scale, not thumbs up / thumbs down judgments. The Beta-binomial distribution will be of little use.

Many resources mention the James-Stein Estimator, which provides a way to shrink the mean estimates only when the variances of those means can be assumed to be equal. That assumption usually only holds when the sample sizes of each group are equal. But in most real world examples, the sample sizes (and thus the variances of the means) are not equal. When that happens, it’s a lot less clear what to do.

After doing a lot of digging and asking some very helpful folks on Twitter, I found several solutions. For many of the solutions, I ran simulations to determine which worked best. This blog post is my attempt to summarize what I learned. Along the way, we’ll cover the original James-Stein Estimator, two extensions to the James-Stein Estimator, Markov Chain Monte Carlo (MCMC) methods, and several other strategies.

Before diving in, I want to include a list of symbol definitions I’ll be using because – side rant – it sure would be great if all stats papers did this, given that literally every paper I read used its own idiosyncratic notations! I’ll define everything again in the text, but this is just here for reference:

Symbol Definition
$k$ The number of groups.
$\theta_i$ The true mean of a group.
$x_i$ The sample mean of a group. The MLE estimate of $\theta_i$.
$\epsilon^{2}_i$ The true variance of observations within a group.
$\epsilon^2$ The true variance of observations within a group if we assume all groups have the same variance.
$s^{2}_i$ The sample variance of a group. The MLE estimate of $\epsilon^{2}_i$.
$n_i$ The number of observations in a group.
$n$ The number of observations in a group, if we assume all groups have the same size.
$\sigma^{2}_i$ The true variance of a group's mean. If each group has the same variance of observations, then $\sigma^{2}_i = \epsilon^{2} / n_i$. If each group has different variances of observations, then $\sigma^{2}_i = \epsilon^{2}_i / n_i$.
$\sigma^{2}$ Like $\sigma^{2}_i$, but if we assume all groups had the same variance of the mean. Equal to $\epsilon^{2} / n$.
$\hat{\sigma^{2}_i}$ Estimate of $\sigma^{2}_i$.
$\hat{\sigma^{2}}$ Estimate of $\sigma^{2}$.
$\mu$ The true mean of the $\theta_i$'s (the true group means). The mean of the distribution from which the $\theta_i$'s are drawn.
$\overline{X}$ The sample mean of the sample means.
$\tau^{2}$ The true variance of the $\theta_i$'s (the true group means). The variance of the distribution from which the $\theta_i$'s are drawn.
$\hat{\tau^{2}}$ Estimate of $\tau^{2}$.
$\hat{B}$ Estimate of the best term for weighting $x_i$ and $\overline{X}$ when calculating $\hat{\theta_i}$. Assumes each group has the same $\sigma^2$.
$\hat{B_i}$ Estimate of the best term for weighting $x_i$ and $\overline{X}$ when calculating $\hat{\theta_i}$. Does not assume that all group's have the same $\sigma^{2}_i$.
$\hat{\theta_i}$ Estimate of a true group means. Its value depends on the method we use.
$k_{\Gamma}$ Shape parameter for the Gamma distribution from which sample sizes are drawn.
$\theta_{\Gamma}$ Scale parameter for the Gamma distribution from which sample sizes are drawn.
$\mu_v$ In simulations in which group observation variances $\epsilon^{2}_i$ are allowed to vary, this is the mean parameter of the log-normal distribution from which the $\epsilon^{2}_i$'s are drawn.
$\tau^{2}_v$ In simulations in which group observation variances $\epsilon^{2}_i$ are allowed to vary, this is the variance parameter of the log-normal distribution from which the $\epsilon^{2}_i$'s are drawn.

### Quick Analytic Solutions

Our goal is to find a better way of estimating $\theta_i$, the true mean of a group. A common theme in many of the papers I read is that good estimates of $\theta_i$ are usually weighted averages of the group’s sample mean $x_i$ and the global mean of all group means $\overline{X}$. Let’s call this weighting factor $\hat{B_i}$.

This seems very sensible. We want something that is in between the sample mean (which is probably too extreme) and the mean of means. But how do we know what value to use for $\hat{B_i}$? Different methods exist, and each leads to different results.

Let’s start by defining $\sigma^{2}_i$, the true variance of a group’s mean. This is equivalent to $\epsilon^{2}_i / n_i$, where $\epsilon^{2}_i$ is the true variance of the observations within that group, and $n_i$ is the sample size of the group. According to the original James-Stein approach, if we assume that all the group means have the same known variance $\hat{\sigma^2}$, which would usually only happen if the groups all had the same sample size, then we can define a common $\hat{B}$ for all groups as:

This formula seems really weird and arbitrary, but it begins to make more sense if we rearrange it a bit and sweep that pesky $\left(k-3\right)$ under the rug and replace it with a $(k-1)$. Sorry hardliners!

Before getting to why this makes sense, I should explain the last step above. The denominator $\sum{(x_i - \overline{X})^2}/\left(k-1\right)$ is the observed variance of the observed sample means. This variance comes from two sources: $\tau^2$ is the true variance in the true means and $\sigma^2$ is the true variance caused by the fact that each $x_i$ is computed from a sample. Since variances add, the total variance of the observed means is $\tau^{2} + \sigma^{2}$.

Anyway, back to the result. This result is actually pretty neat. When we estimate a $\theta_i$, the weight that we place on the global mean $\overline{X}$ is the fraction of total variance in the means that is caused by within-group sampling variance. In other words, when the sample mean comes with high uncertainty, we should weight the global mean more. When the sample mean comes with low uncertainty, we should weight the global mean less. At least directionally, this formula makes sense. Later in this blog post, we’ll see how it falls naturally out of Bayes Law.

The James-Stein Estimator is so widely applicable that many other fields have discovered it independently. In the image processing literature, it is a special case of the Wiener Filter, assuming that both the signal and the additive noise are Gaussian. In the insurance world, actuaries call it the Bühlmann model. And in animal breeding, early researchers called it the Best Unbiased Linear Prediction or BLUP (technically the Empirical BLUP). The BLUP approach is so useful, in fact, that it has received the highly coveted endorsement of the National Swine Improvement Federation.

While the original James-Stein formula is useful, the big limitation is that it only works when we believe that all groups have the same $\sigma^2$. In cases where we have different sample sizes, each group will have it’s own $\sigma^{2}_i$. (Recall that $\sigma^{2}_i = \epsilon^{2}_i / n_i$.) We’re going to want to shrink some groups more than others, and the original James-Stein estimator does not allow this. In the following sections, we’ll look at a couple of extensions to the James-Stein estimator. These extensions have analogues in the Bühlmann model and BLUP literature.

#### The Multi Sample Size James-Stein Estimator

The most natural extension of James-Stein is to define each group’s $\hat{\sigma^{2}_i}$ as the squared standard error of the group’s mean. This allows us to estimate a weighting factor $\hat{B_i}$ tailored to each group. Let’s call this the Multi Sample Size James-Stein Estimator, or MSS James-Stein Estimator.

The denominator can just be estimated as the variance across group sample means.

As reasonable as this approach sounds, it somehow didn’t feel totally kosher to me. But when I looked into the literature, it seems like most researchers basically said “yup, that sounds pretty reasonable”.

To test this approach, I ran some simulations on 1000 artificial datasets. Each dataset involved 25 groups with sample sizes drawn from a Gamma distribution $\Gamma(k_{\Gamma}=1.5,\theta_{\Gamma}=10)$. True group means ($\theta_i$’s) were sampled from a Normal distribution $\mathcal{N}(\mu, \tau^2)$. Observations within each group were sampled from $\mathcal{N}(\theta_i, \epsilon^2)$, where $\epsilon$ was shared between groups.

For each dataset I computed the Mean Squared Error (MSE) between the vector of true group means and the vector of estimated group means. I then averaged the MSEs across datasets. This process was repeated for a variety of different values of $\epsilon$ and for two different estimators: The MSS James-Stein Estimator and the Maximum Likelihood Estimator (MLE). To compute the MLE, I just used $x_i$ as my $\hat{\theta_i}$ estimate.

Figure 3. Mean Squared Error between true group means and estimated group means. In this simulation, the $\epsilon$ parameter for within-group variance of observations is shared by all groups.

As expected, the MSS James-Stein Estimator outperformed the MLE, with lower MSEs particularly for high values of $\epsilon$. This make sense. When the raw sample means are noisy, the MLE should be especially untrustworthy and it makes sense to pull extreme estimates back towards the global mean.

#### The Multi Sample Size Pooled James-Stein Estimator

One thing that’s a little weird about the MSS James-Stein Estimator is that even though we know all the groups should have the same within-group variance $\epsilon^2$, we still estimate each group’s standard error separately. Given what we know, it might make more sense to pool the data from all groups to estimate a common $\epsilon^2$. Then we can estimate each group’s $\sigma^{2}_i$ as $\epsilon^2 / n_i$. Let’s call this approach the MSS Pooled James-Stein Estimator.

Figure 4. Mean Squared Error between true group means and estimated group means. In this simulation, the $\epsilon$ parameter for within-group variance of observations is shared by all groups.

This works a bit better. By obtaining more accurate estimates of each group’s $\sigma^{2}_i$, we are able to find a more appropriate shrinking factor $B_i$ for each group.

Of course, this only works better because we created the simulation data in such a way that all groups have the same $\epsilon^2$. But if we run a different set of simulations, in which each group’s $\epsilon_i$ is drawn from a log-normal distribution $ln\mathcal{N}\left(\mu_v, \tau^{2}_v\right)$, we obtain the reverse results. The MSS James-Stein Estimator, which estimates a separate $\hat{\epsilon^{2}_i}$ for each group, does a better job than the MSS Pooled James-Stein Estimator. This makes sense.

Figure 5. Mean Squared Error between true group means and estimated group means. In this simulation, each group has its own variance parameter $\epsilon^2$ for the observations within the group. These parameters are sampled from a log-normal distribution $ln\mathcal{N}\left(\mu_v, \tau^{2}_v\right)$. For simplicity, the two parameters of this distribution are always set to be identical, and are shown on the horizontal axis.

Which method you choose should depend on whether you think your groups have similar or different variances of their observations. Here’s an interim summary of the methods covered so far.

#### Summary of analytic solutions

All of these estimators define $\hat{\theta_i}$ as a weighted average of the group sample mean $x_i$ and the mean of group sample means $\overline{X}$. $$\hat{\theta_i} = \left(1-\hat{B_i}\right) x_i + \hat{B_i} \overline{X}$$ Make sure to clip $\hat{B_i}$ to the range [0, 1].

1. Maximum Likelihood Estimation (MLE)
$\hat{B_i} = 0$
2. MSS James-Stein Estimator
$\hat{B_i} = \frac{s^{2}_i/n_i}{\sum\frac{(x_i - \overline{X})^2}{k-1}}$
where $s_i$ is the standard deviation of observations with a group.
3. MSS Pooled James-Stein Estimator
$\hat{B_i} = \frac{s^{2}_p/n_i}{\sum\frac{(x_i - \overline{X})^2}{k-1}}$
where $s^{2}_p$ is the pooled estimate of variance.
Implementations in Python and R are available here.

#### A Bayesian interpretation of the analytic solutions

So far, the analytic approaches make sense directionally. As described above, our estimate of $\theta_i$ should be a weighted average of $x_i$ and $\overline{X}$, where the weight depends on the ratio of sample mean variance to total variance of the means.

But is this really the best weighting? Why use a ratio of variances instead of, say, a ratio of standard deviations? Why not use something else entirely?

It turns out this formula falls out naturally from Bayes Law. Imagine for a moment that we already know the prior distribution $\mathcal{N}\left(\mu, \tau^2\right)$ over the $\theta_i$’s. And imagine we know the likelihood function for a group mean is $\mathcal{N}\left(x_i, \epsilon^{2}_i/n_i\right)$. According to the Wikipedia page on conjugate priors, the posterior distribution for the group mean is itself a Gaussian distribution with mean:

(Note that the Wikipedia page uses the symbol ‘$x_i$’ to refer to observations, whereas this blog post will always use the term to refer to the sample mean, including in the equation above. Also note that Wikipedia refers to the variance of observations within a group as ‘$\sigma^2$’ whereas this blog post uses $\epsilon^{2}_i$.)

If we multiply all terms in the numerator and denominator by $\frac{\tau^2 \epsilon^{2}_i}{n_i}$, we get:

Or equivalently,

where

This looks familiar! It is basically the MSS James-Stein estimator. The only difference is that in the pure Bayesian approach you must somehow know $\mu$, $\tau^2$, and $\sigma^{2}_i$ in advance. In the MSS James-Stein approach, you estimate those parameters from the data itself. This is the key insight in Empirical Bayes: Use priors to keep your estimates under control, but obtain the priors empirically from the data itself.

### Hierarchical Modeling with MCMC

In previous sections we looked at some analytic solutions. While these solutions have the advantage of being quick to calculate, they have the disadvantage of being less accurate than they could be. For more accuracy, we can turn to Hierarchical Model estimation using Markov Chain Monte Carlo (MCMC) methods. MCMC is an iterative process for approximate Bayesian inference. While it is slower than analytical approximations, it tends to be more accurate and has the added benefit of giving you the full posterior distribution. I’m not an expert in how it works internally, but this post looks like a good place to start.

To implement this, I first defined a Hierarchical Model of my data. The model is a description of how I think the data is generated: True means are sampled from a normal distribution, and observations are sampled from a normal distribution centered around the true mean of each group. Of course, I know exactly how my data was generated, because I was the one who generated it! The key thing to understand though is that the Hierarchical Model does not contain any information about the value of the parameters. It’s the MCMC’s job to figure that out. In particular, I used PyStan’s MCMC implementation to fit the parameters of the model based on my data, although I later learned that it would be even easier to use bambi.

Figure 6. Mean Squared Error between true group means and estimated group means. In this simulation, the $\epsilon$ parameter for within-group variance of observations is shared by all groups.

For simulated data with shared $\epsilon^2$, MCMC did well, outperforming both the MSS James-Stein estimator and the MSS Pooled James-Stein estimator.

If you don’t care about speed and are willing to write the Stan code, then this is probably your best option. It’s also good to learn about MCMC methods, since they can be applied to more complicated models with multiple variables. But if you just want a quick estimate of group means, then one of the analytic solutions above makes more sense.

### Other solutions

There are several other solutions that I did not include in my simulations.

1. Regularization. Pick the set of $\hat{\theta_i}$’s that minimize $\sum{(\hat{\theta_i} - x_i)^2} + \lambda \sum{(\hat{\theta_i} - \overline{X})^2}$. Use cross-validation to choose the best $\lambda$. This will probably work pretty well, although it takes a bit more work and time than the analytic solutions described above.

2. Mixed Models. Over in Mixed Models World, there’s a whole ’nother literature on how to shrink estimates depending on sample size. They are especially suited for the movie ratings situation because in addition to shrinking the means, they also can correct for rater bias. I don’t really understand all the math behind Mixed Models, but I was able to use lme4 to estimate group means in simulated data under the assumption that the group means are a random effect. This gave me slightly different results compared to the James-Stein / Empirical Bayes approach. I would love if some expert who understood this could write an accessible and authoritative blog post on the differences between Mixed Models and Empirical Bayes. The closest I could find was this comment by David Harville.

3. Efron and Morris’ generalization of James-Stein to unequal sample sizes (Section 3 of their paper). I thought this paper was difficult to read. A more accessible presentation can be found at the end of this column. The Efron and Morris approach is a numerical solution that seemed to work reasonably well when I played around with it, but I didn’t take it very far. If you want to implement it, be sure to prevent any variance estimates from falling below zero. If one of them does, just set it to zero and then compute your estimates of the means. That being said, I feel like if you’re going to go with a numerical solution, you may as well just go with MCMC.

4. Double Shrinkage. When we think that different groups not only have different sample sizes, but also different $\epsilon_i$’s, we are faced with an interesting conundrum. As shown above, the MSS James-Stein Estimator outperforms the MSS Pooled James-Stein Estimator, because it computes $\hat{\epsilon_i}$’s specific to each group. However, these estimates of group variances are probably noisy! Just like we don’t trust the raw estimates of group sample means, why should we trust the raw estimates of group sample variances? One way to address this is to use Zhao’s Double Shinkage Estimator, which not only shrinks the means, but also shrinks the variances.

5. Kleinman’s weighted moment estimator. Apparently this was motivated by groups of proportions (i.e. the Rotten Tomatoes case), but the estimator can be applied generally.

### Conclusion

Which method you choose depends on your situation. If you want a simple and computationally fast estimate, and if you don’t want to assume that the group variances $\epsilon^{2}_i$ are identical, I would recommend either the MSS James-Stein Estimator or the Double Shrinkage Estimator if you can get it to work. If you want a fast estimate and can assume all groups share the same $\epsilon^{2}$, I’d recommend the MSS Pooled James-Stein Estimator. If you don’t care about speed or code complexity, I’d recommend MCMC, Mixed Models, or regularization with a cross-validated penalty term.

### Acknowledgements

Special thanks to the many people who responded to my original question on Twitter, including: Sean Taylor, John Myles White, David Robinson, Joe Blitzstein, Manjari Narayan, Otis Anderson, Alex Peysakhovich, Nathaniel Bechhofer, James Neufeld, Andrew Mercer, Patrick Perry, Ronald Richman, Timothy Sweetster, Tal Yarkoni and Alex Coventry. Special thanks also to Marika Inhoff and Leo Pekelis for many discussions.

All code used in this blog post is available on GitHub.