Here’s a data problem I encounter all the time. Let’s say I’m running a website where users can submit movie ratings on a continuous 1-10 scale. For the sake of argument, let’s say that the users who rate each movie are an unbiased random sample from the population of users. I’d like to compute the average rating for each movie so that I can create a ranked list of the best movies.
Take a look at my data:
Figure 1. Each circle represents a movie rating by a user. Diamonds represent sample means for each movie.
I’ve got two big problems here. First, nobody is using my website. And second, I’m not sure if I can trust these averages. The movie at the top of the rankings was only rated by two users, and I’ve never even heard of it! Maybe the movie really is good. But maybe it just got lucky. Maybe just by chance, the two users who gave it ratings happened to be the users who liked it to an unusual extent. It would be great if there were a way to adjust for this.
In particular, I would like a method that will give me an estimate closer to the movie’s true mean (i.e. the mean rating it would get if an infinite number of users rated it). Intuitively, movies with mean ratings at the extremes should be nudged, or “shrunk”, towards the center. And intuitively, movies with low sample sizes should be shrunk more than the movies with large sample sizes.
Figure 2. Each circle represents a movie rating by a user. Diamonds represent sample means for each movie. Arrows point towards shrunken estimates of each movie's mean rating. Shrunken estimates are obtained using the MSS James-Stein Estimator, described in more detail below.
As common a problem as this is, there aren’t a lot of accessible resources describing how to solve it. There are tons of excellent blog posts about the Beta-binomial distribution, which is useful if you wish to estimate the true fraction of an occurrence among some events. This works well in the case of Rotten Tomatoes, where one might want to know the true fraction of “thumbs up” judgments. But in my case, I’m dealing with a continuous 1-10 scale, not thumbs up / thumbs down judgments. The Beta-binomial distribution will be of little use.
Many resources mention the James-Stein Estimator, which provides a way to shrink the mean estimates only when the variances of those means can be assumed to be equal. That assumption usually only holds when the sample sizes of each group are equal. But in most real world examples, the sample sizes (and thus the variances of the means) are not equal. When that happens, it’s a lot less clear what to do.
After doing a lot of digging and asking some very helpful folks on Twitter, I found several solutions. For many of the solutions, I ran simulations to determine which worked best. This blog post is my attempt to summarize what I learned. Along the way, we’ll cover the original James-Stein Estimator, two extensions to the James-Stein Estimator, Markov Chain Monte Carlo (MCMC) methods, and several other strategies.
Before diving in, I want to include a list of symbol definitions I’ll be using because – side rant – it sure would be great if all stats papers did this, given that literally every paper I read used its own idiosyncratic notations! I’ll define everything again in the text, but this is just here for reference:
Symbol
Definition
$ k $
The number of groups.
$ \theta_i $
The true mean of a group.
$ x_i $
The sample mean of a group. The MLE estimate of $ \theta_i $.
$ \epsilon^{2}_i $
The true variance of observations within a group.
$ \epsilon^2 $
The true variance of observations within a group if we assume all groups have the same variance.
$ s^{2}_i $
The sample variance of a group. The MLE estimate of $ \epsilon^{2}_i $.
$ n_i $
The number of observations in a group.
$ n $
The number of observations in a group, if we assume all groups have the same size.
$ \sigma^{2}_i $
The true variance of a group's mean. If each group has the same variance of observations, then $ \sigma^{2}_i = \epsilon^{2} / n_i $. If each group has different variances of observations, then $ \sigma^{2}_i = \epsilon^{2}_i / n_i $.
$ \sigma^{2} $
Like $ \sigma^{2}_i $, but if we assume all groups had the same variance of the mean. Equal to $ \epsilon^{2} / n $.
$ \hat{\sigma^{2}_i} $
Estimate of $ \sigma^{2}_i $.
$ \hat{\sigma^{2}} $
Estimate of $ \sigma^{2} $.
$ \mu $
The true mean of the $ \theta_i $'s (the true group means). The mean of the distribution from which the $ \theta_i $'s are drawn.
$ \overline{X} $
The sample mean of the sample means.
$ \tau^{2} $
The true variance of the $ \theta_i $'s (the true group means). The variance of the distribution from which the $ \theta_i $'s are drawn.
$ \hat{\tau^{2}} $
Estimate of $ \tau^{2} $.
$ \hat{B} $
Estimate of the best term for weighting $ x_i $ and $ \overline{X} $ when calculating $ \hat{\theta_i} $. Assumes each group has the same $ \sigma^2 $.
$ \hat{B_i} $
Estimate of the best term for weighting $ x_i $ and $ \overline{X} $ when calculating $ \hat{\theta_i} $. Does not assume that all group's have the same $ \sigma^{2}_i $.
$ \hat{\theta_i} $
Estimate of a true group means. Its value depends on the method we use.
$ k_{\Gamma} $
Shape parameter for the Gamma distribution from which sample sizes are drawn.
$ \theta_{\Gamma} $
Scale parameter for the Gamma distribution from which sample sizes are drawn.
$ \mu_v $
In simulations in which group observation variances $ \epsilon^{2}_i $ are allowed to vary, this is the mean parameter of the log-normal distribution from which the $ \epsilon^{2}_i $'s are drawn.
$ \tau^{2}_v $
In simulations in which group observation variances $ \epsilon^{2}_i $ are allowed to vary, this is the variance parameter of the log-normal distribution from which the $ \epsilon^{2}_i $'s are drawn.
Quick Analytic Solutions
Our goal is to find a better way of estimating $ \theta_i $, the true mean of a group. A commonthemeinmanyofthepapers I read is that good estimates of $ \theta_i $ are usually weighted averages of the group’s sample mean $ x_i $ and the global mean of all group means $ \overline{X} $. Let’s call this weighting factor $ \hat{B_i} $.
This seems very sensible. We want something that is in between the sample mean (which is probably too extreme) and the mean of means. But how do we know what value to use for $ \hat{B_i} $? Different methods exist, and each leads to different results.
Let’s start by defining $ \sigma^{2}_i $, the true variance of a group’s mean. This is equivalent to $ \epsilon^{2}_i / n_i $, where $ \epsilon^{2}_i $ is the true variance of the observations within that group, and $ n_i $ is the sample size of the group. According to the original James-Stein approach, if we assume that all the group means have the same known variance $ \hat{\sigma^2} $, which would usually only happen if the groups all had the same sample size, then we can define a common $ \hat{B} $ for all groups as:
This formula seems really weird and arbitrary, but it begins to make more sense if we rearrange it a bit and sweep that pesky $ \left(k-3\right) $ under the rug and replace it with a $ (k-1) $. Sorry hardliners!
Before getting to why this makes sense, I should explain the last step above. The denominator $ \sum{(x_i - \overline{X})^2}/\left(k-1\right) $ is the observed variance of the observed sample means. This variance comes from two sources: $ \tau^2 $ is the true variance in the true means and $ \sigma^2 $ is the true variance caused by the fact that each $ x_i $ is computed from a sample. Since variances add, the total variance of the observed means is $ \tau^{2} + \sigma^{2} $.
Anyway, back to the result. This result is actually pretty neat. When we estimate a $ \theta_i $, the weight that we place on the global mean $ \overline{X} $ is the fraction of total variance in the means that is caused by within-group sampling variance. In other words, when the sample mean comes with high uncertainty, we should weight the global mean more. When the sample mean comes with low uncertainty, we should weight the global mean less. At least directionally, this formula makes sense. Later in this blog post, we’ll see how it falls naturally out of Bayes Law.
The James-Stein Estimator is so widely applicable that many other fields have discovered it independently. In the image processing literature, it is a special case of the Wiener Filter, assuming that both the signal and the additive noise are Gaussian. In the insurance world, actuaries call it the Bühlmann model. And in animal breeding, early researchers called it the Best Unbiased Linear Prediction or BLUP (technically the Empirical BLUP). The BLUP approach is so useful, in fact, that it has received the highly coveted endorsement of the National Swine Improvement Federation.
While the original James-Stein formula is useful, the big limitation is that it only works when we believe that all groups have the same $ \sigma^2 $. In cases where we have different sample sizes, each group will have it’s own $ \sigma^{2}_i $. (Recall that $ \sigma^{2}_i = \epsilon^{2}_i / n_i $.) We’re going to want to shrink some groups more than others, and the original James-Stein estimator does not allow this. In the following sections, we’ll look at a couple of extensions to the James-Stein estimator. These extensions have analogues in the Bühlmann model and BLUP literature.
The Multi Sample Size James-Stein Estimator
The most natural extension of James-Stein is to define each group’s $ \hat{\sigma^{2}_i} $ as the squared standard error of the group’s mean. This allows us to estimate a weighting factor $ \hat{B_i} $ tailored to each group. Let’s call this the Multi Sample Size James-Stein Estimator, or MSS James-Stein Estimator.
The denominator can just be estimated as the variance across group sample means.
As reasonable as this approach sounds, it somehow didn’t feel totally kosher to me. But when I looked into the literature, it seems like most researchers basically said “yup, that soundspretty reasonable”.
To test this approach, I ran some simulations on 1000 artificial datasets. Each dataset involved 25 groups with sample sizes drawn from a Gamma distribution $ \Gamma(k_{\Gamma}=1.5,\theta_{\Gamma}=10) $. True group means ($ \theta_i $’s) were sampled from a Normal distribution $ \mathcal{N}(\mu, \tau^2) $. Observations within each group were sampled from $ \mathcal{N}(\theta_i, \epsilon^2) $, where $ \epsilon $ was shared between groups.
For each dataset I computed the Mean Squared Error (MSE) between the vector of true group means and the vector of estimated group means. I then averaged the MSEs across datasets. This process was repeated for a variety of different values of $ \epsilon $ and for two different estimators: The MSS James-Stein Estimator and the Maximum Likelihood Estimator (MLE). To compute the MLE, I just used $ x_i $ as my $ \hat{\theta_i} $ estimate.
Figure 3. Mean Squared Error between true group means and estimated group means. In this simulation, the $ \epsilon $ parameter for within-group variance of observations is shared by all groups.
As expected, the MSS James-Stein Estimator outperformed the MLE, with lower MSEs particularly for high values of $ \epsilon $. This make sense. When the raw sample means are noisy, the MLE should be especially untrustworthy and it makes sense to pull extreme estimates back towards the global mean.
The Multi Sample Size Pooled James-Stein Estimator
One thing that’s a little weird about the MSS James-Stein Estimator is that even though we know all the groups should have the same within-group variance $ \epsilon^2 $, we still estimate each group’s standard error separately. Given what we know, it might make more sense to pool the data from all groups to estimate a common $ \epsilon^2 $. Then we can estimate each group’s $ \sigma^{2}_i $ as $ \epsilon^2 / n_i $. Let’s call this approach the MSS Pooled James-Stein Estimator.
Figure 4. Mean Squared Error between true group means and estimated group means. In this simulation, the $ \epsilon $ parameter for within-group variance of observations is shared by all groups.
This works a bit better. By obtaining more accurate estimates of each group’s $ \sigma^{2}_i $, we are able to find a more appropriate shrinking factor $ B_i $ for each group.
Of course, this only works better because we created the simulation data in such a way that all groups have the same $ \epsilon^2 $. But if we run a different set of simulations, in which each group’s $ \epsilon_i $ is drawn from a log-normal distribution $ ln\mathcal{N}\left(\mu_v, \tau^{2}_v\right) $, we obtain the reverse results. The MSS James-Stein Estimator, which estimates a separate $ \hat{\epsilon^{2}_i} $ for each group, does a better job than the MSS Pooled James-Stein Estimator. This makes sense.
Figure 5. Mean Squared Error between true group means and estimated group means. In this simulation, each group has its own variance parameter $ \epsilon^2 $ for the observations within the group. These parameters are sampled from a log-normal distribution $ ln\mathcal{N}\left(\mu_v, \tau^{2}_v\right) $. For simplicity, the two parameters of this distribution are always set to be identical, and are shown on the horizontal axis.
Which method you choose should depend on whether you think your groups have similar or different variances of their observations. Here’s an interim summary of the methods covered so far.
Summary of analytic solutions
All of these estimators define $ \hat{\theta_i} $ as a weighted average of the group sample mean $ x_i $ and the mean of group sample means $ \overline{X} $.
$$ \hat{\theta_i} = \left(1-\hat{B_i}\right) x_i + \hat{B_i} \overline{X} $$
Make sure to clip $ \hat{B_i} $ to the range [0, 1].
where $ s^{2}_p $ is the pooled estimate of variance.
Implementations in Python and R are available here.
A Bayesian interpretation of the analytic solutions
So far, the analytic approaches make sense directionally. As described above, our estimate of $ \theta_i $ should be a weighted average of $ x_i $ and $ \overline{X} $, where the weight depends on the ratio of sample mean variance to total variance of the means.
But is this really the best weighting? Why use a ratio of variances instead of, say, a ratio of standard deviations? Why not use something else entirely?
It turns out this formula falls out naturally from Bayes Law. Imagine for a moment that we already know the prior distribution $ \mathcal{N}\left(\mu, \tau^2\right) $ over the $ \theta_i $’s. And imagine we know the likelihood function for a group mean is $ \mathcal{N}\left(x_i, \epsilon^{2}_i/n_i\right) $. According to the Wikipedia page on conjugate priors, the posterior distribution for the group mean is itself a Gaussian distribution with mean:
(Note that the Wikipedia page uses the symbol ‘$ x_i $’ to refer to observations, whereas this blog post will always use the term to refer to the sample mean, including in the equation above. Also note that Wikipedia refers to the variance of observations within a group as ‘$ \sigma^2 $’ whereas this blog post uses $ \epsilon^{2}_i $.)
If we multiply all terms in the numerator and denominator by $ \frac{\tau^2 \epsilon^{2}_i}{n_i} $, we get:
Or equivalently,
where
This looks familiar! It is basically the MSS James-Stein estimator. The only difference is that in the pure Bayesian approach you must somehow know $ \mu $, $ \tau^2 $, and $ \sigma^{2}_i $ in advance. In the MSS James-Stein approach, you estimate those parameters from the data itself. This is the key insight in Empirical Bayes: Use priors to keep your estimates under control, but obtain the priors empirically from the data itself.
Hierarchical Modeling with MCMC
In previous sections we looked at some analytic solutions. While these solutions have the advantage of being quick to calculate, they have the disadvantage of being less accurate than they could be. For more accuracy, we can turn to Hierarchical Model estimation using Markov Chain Monte Carlo (MCMC) methods. MCMC is an iterative process for approximate Bayesian inference. While it is slower than analytical approximations, it tends to be more accurate and has the added benefit of giving you the full posterior distribution. I’m not an expert in how it works internally, but this post looks like a good place to start.
To implement this, I first defined a Hierarchical Model of my data. The model is a description of how I think the data is generated: True means are sampled from a normal distribution, and observations are sampled from a normal distribution centered around the true mean of each group. Of course, I know exactly how my data was generated, because I was the one who generated it! The key thing to understand though is that the Hierarchical Model does not contain any information about the value of the parameters. It’s the MCMC’s job to figure that out. In particular, I used PyStan’s MCMC implementation to fit the parameters of the model based on my data, although I later learned that it would be even easier to use bambi.
Figure 6. Mean Squared Error between true group means and estimated group means. In this simulation, the $ \epsilon $ parameter for within-group variance of observations is shared by all groups.
For simulated data with shared $ \epsilon^2 $, MCMC did well, outperforming both the MSS James-Stein estimator and the MSS Pooled James-Stein estimator.
If you don’t care about speed and are willing to write the Stan code, then this is probably your best option. It’s also good to learn about MCMC methods, since they can be applied to more complicated models with multiple variables. But if you just want a quick estimate of group means, then one of the analytic solutions above makes more sense.
Other solutions
There are several other solutions that I did not include in my simulations.
Regularization. Pick the set of $ \hat{\theta_i} $’s that minimize . Use cross-validation to choose the best $ \lambda $. This will probably work pretty well, although it takes a bit more work and time than the analytic solutions described above.
Mixed Models. Over in Mixed Models World, there’s a whole ’nother literature on how to shrink estimates depending on sample size. They are especially suited for the movie ratings situation because in addition to shrinking the means, they also can correct for rater bias. I don’t really understand all the math behind Mixed Models, but I was able to use lme4 to estimate group means in simulated data under the assumption that the group means are a random effect. This gave me slightly different results compared to the James-Stein / Empirical Bayes approach. I would love if some expert who understood this could write an accessible and authoritative blog post on the differences between Mixed Models and Empirical Bayes. The closest I could find was this comment by David Harville.
Efron and Morris’ generalization of James-Stein to unequal sample sizes (Section 3 of their paper). I thought this paper was difficult to read. A more accessible presentation can be found at the end of this column. The Efron and Morris approach is a numerical solution that seemed to work reasonably well when I played around with it, but I didn’t take it very far. If you want to implement it, be sure to prevent any variance estimates from falling below zero. If one of them does, just set it to zero and then compute your estimates of the means. That being said, I feel like if you’re going to go with a numerical solution, you may as well just go with MCMC.
Double Shrinkage. When we think that different groups not only have different sample sizes, but also different $ \epsilon_i $’s, we are faced with an interesting conundrum. As shown above, the MSS James-Stein Estimator outperforms the MSS Pooled James-Stein Estimator, because it computes $ \hat{\epsilon_i} $’s specific to each group. However, these estimates of group variances are probably noisy! Just like we don’t trust the raw estimates of group sample means, why should we trust the raw estimates of group sample variances? One way to address this is to use Zhao’s Double Shinkage Estimator, which not only shrinks the means, but also shrinks the variances.
Kleinman’s weighted moment estimator. Apparently this was motivated by groups of proportions (i.e. the Rotten Tomatoes case), but the estimator can be applied generally.
Conclusion
Which method you choose depends on your situation. If you want a simple and computationally fast estimate, and if you don’t want to assume that the group variances $ \epsilon^{2}_i $ are identical, I would recommend either the MSS James-Stein Estimator or the Double Shrinkage Estimator if you can get it to work. If you want a fast estimate and can assume all groups share the same $ \epsilon^{2} $, I’d recommend the MSS Pooled James-Stein Estimator. If you don’t care about speed or code complexity, I’d recommend MCMC, Mixed Models, or regularization with a cross-validated penalty term.
As a data scientist, a big part of my job involves picking metrics to optimize and thinking about how to do things as efficiently as possible. With these types of questions on my mind, I recently discovered a totally fascinating book about about economic problems in the USSR and the team of data-driven economists and computer scientists who wanted to solve them. The book is called Red Plenty. It’s actually written as a novel, weirdly, but it nevertheless presents an accurate economic history of the USSR. It draws heavily on an earlier book from 1973 called Planning Problems in the USSR, which I also picked up. As I read these books, I couldn’t help but notice some parallels with planning in any modern organization. In what will be familiar to any data scientist today, the second book even includes a quote from a researcher who complained that 90% of his time was spent cleaning the data, and only 10% of his time was spent doing actual modeling!
Beyond all the interesting parallels to modern data science and operations research, these books helped me understand a lot of interesting things I previously knew very little about, such as linear programming, price equilibria, and Soviet history. This blog post is about I learned, and ends with some surprising-to-me speculation about whether the technical challenges that the brought down the Soviet Union would be as much of a problem in the future.
Balance sheets and manual calculation: Kind of a trainwreck
The main task in the centrally planned Soviet economy was to allocate resources so that a desired assortment of goods and services was produced. Every year, certain target outputs for each good were established. Armed with estimates of the available input resources, central administrators used balance sheets to set plans for every factory, specifying exactly how much input commodities each factory would receive, and how much output it should produce. Up through the 1960s, this was always done by manual calculation. Since there were hundreds of thousands of commodities, and since the supply chains had many dependency steps, it was impossible to compute the full balance sheets for the economy. The administrators therefore decided to make some simplifying assumptions. As a result of these these simplifying assumptions, resource allocation became a bit of a trainwreck. Below are a few of the simplifications and their consequences.
Dimensionality reduction by removing variables. Because there were too many commodities to track, administrators often limited their analysis to the 10,000 most important commodities in the economy. But when the production of those commodities were planned, there was often a hidden shortage of commodities whose output was not planned centrally but which were used as inputs to one of the 10,000 planned products. Factories that depended on those commodities often sat idle for months as they waited for the shortages to end.
Dimensionality reduction by aggregation. Apparently, steel tubes can come in thousands of different types. They can come in different lengths, different shapes, and different compositions. To reduce the dimensionality of the problem, administrators would often track the total tonnage of a few broad classes of steel tubes in the models, rather than using a more detailed classification scheme. While their models successfully balanced the tonnage of tubes for the broad categories (the output in tons of tube-producing factories matched the input requirements in tons of tube-consuming factories), there were constant surpluses of some specific types of tubes, and shortages of other specific types of tubes. In particular, since tonnage was used as a metric, tube-producing factories were overly incentivized to make easy-to-produce thick tubes. As a result, thin tubes were always in short supply.
Propagating adjustments only a few degrees back. Let’s say that during balance calculations, the administrators realized they needed to bump up the target output of one commodity. If they did that, it was also necessary to bump up the output targets of commodities that were input into the target commodity. But if they did that, they also needed to bump up the output targets of commodities that fed into those commodities, and so on! This involved a crazy amount of extra hand calculations every time they needed make an adjustment. To simplify things, the administrators typically made adjustments to the first-order suppliers, without making the necessary adjustments to the suppliers of the suppliers. This of course led to critical shortages of input commodities, which again led to idle factories.
Figure 1. Some example inputs and outputs in the Soviet economy in 1951, described in units of weight. This summary shows an extreme dimensionality reduction, more extreme than was ever used in planning. In this diagram, most commodities are excluded and each displayed commodity collapses across multiple different product types. Multiple steps in the supply chain are collapsed into a single step. (Source: CIA)
Even if the administrators could get the accounting correct, which they couldn’t, their attempts to allocate resources would still be far from optimal. In the steel industry, for example, some factories were better at producing some types of tubes whereas others were better at producing other types of tubes. Since there were thousands of different factories and tube types, it was non-trivial to decide how to best distribute resources and output requirements, and it was not immediately obvious which factories should be expanded and which should be closed down.
Supply chain optimizations
In the late 1960’s, a group of economists and computer scientists known as the “optimal planners” began to push for a better way of doing things. The group argued that a technique called linear programming, invented by Leonid Kantorovich, could optimally solve the problems with the supply chain. At a minimum, since the process could be computerized, it would be possible to perform more detailed calculations than could be done by hand, with less dimensionality reduction. But more importantly, linear programming allowed you to optimize arbitrary objective functions given certain constraints. In the case of the supply chain, it showed you how to efficiently allocate resources, identifying efficient factories that should get more input commodities, and inefficient factories that should be shut down.
Figure 2. Leonid Kantorovich, inventor of linear programming and winner of the 1975 Nobel Prize in Economics.
The optimal planners had some success here. For example, in the steel industry, about 60,000 consumers requested 10,000 different types of products from 500 producers. The producers were not equally efficient in their production. Some producers were efficient for some types of steel products, but less efficient for other types of steel products. Given the total amount of each product requested, and given the constraints of how much each factory can produce, the goal was decide how much each factory should produce of each type of product. If we simplify the problem by just asking how much each factory should produce without considering how the products will be distributed to the consuming factories, this becomes a straightforward application of the Optimal Assignment Problem, a well-studied example in linear programming. If we additionally want to optimize distribution, taking into account the distance-dependent costs of shipments from one factory to another, the problem becomes more complicated but is still doable. The problem becomes similar to the Transportation Problem, another well-studied example in linear programming, but in this case generalized to multiple commodities instead of just one.
By introducing linear programming, the optimal planners were modestly successful at improving the efficiency of some industries, but their effect was limited. First, political considerations prevented many of the recommendations surfaced by the model from being implemented. Cement factories that were known to be too inefficient or too far away from consumers were allowed to remain open even though the optimal solution recommended that they be closed. Second, since the planners were only allowed to work in certain narrow parts of the economy, they never had an opportunity to propagate their recommendations back in the supply chain, although one could imagine extending the models to do so. Third, and perhaps most importantly, the value of each commodity was set by old-school administrators in an unprincipled way, and so the optimal planners were forced to optimize objective functions that didn’t even make sense.
Ideas about optimizing the entire economy
While the optimal planners were able to improve the efficiency of a few industries, they had more ambitious plans. They believed they could use linear programming to optimize the entire economy and outperform capitalist societies. Doing so involved more than just scaling out the supply chain optimizations adopted by certain industries. It involved shadow prices and interest rates, and a few other things I’ll admit I don’t totally understand. But while I don’t really understand the implementation, I feel like the broader goal of the planners is easier to understand and explain:
Basically, in a completely free market, at least under certain assumptions, prices are supposed to converge to what’s called a General Equilibrium. The equilibrium prices have a some nice properties. They balance aggregate supply and demand, so that no commodities are in shortage or surplus. They are also Pareto efficient, which means that nobody in the economy can be made better off without making someone else worse off.
The optimal planners thought that they could do better. In particular, they pointed to two problems with capitalism: First, prices in a capitalist society were determined by individual agents using trial and error to guess the best price. Surely these agents, who had imperfect information, were not picking the exactly optimal prices. In contrast, a central planner using optimal computerized methods could pick prices that hit the equilibrium more exactly. Second, and more importantly, capitalism targeted an objective function that — while Pareto efficient — was not socially optimal. Because of huge differences in wealth, some people were able to obtain far more goods and services than other people. The optimal planners proposed using linear programming to optimize an objective function that would be more socially optimal. For example, it could aim to distribute goods more equitably. It could prioritize certain socially valuable goods (e.g. books) over socially destructive goods (e.g. alcohol). It could prioritize sectors that provide benefits over longer time horizons (e.g. heavy industry). And it could include constraints to ensure full employment.
What happened
None of this ever really happened. The ambitious ideas of the optimal planners were never adopted, and by the 1970s it was clear that living standards in the USSR were falling further behind those of the West. Perhaps things would have been better if the optimal planners got their way, but it seems like the consensus is that their plans would have failed even if they were implemented. Below are some of the main problems that would have been encountered.
Computational complexity. As described in a wonderful blog post by Cosma Shalizi, the number of calculations needed to solve a linear programming problem is: , where is the number of products, is the number of constraints, and is how much error you are willing to tolerate. Since the number of products, , was in the millions, and since the complexity was proportional to , it would have been practically impossible for the Soviets to compute a solution to their planning problem with sufficient detail (although see below). Any attempt to reduce the dimensionality would lead to the same perverse incentives and shortages that bedeviled earlier systems driven by hand calculations.
Data quality. The optimal planners thought that optimal computer methods could find prices that more exactly approximated equilibrium than could be done in a market economy, where fallible human actors guessed at prices by trial and error. The reality, however, would have been the exact opposite. Individual actors in a market economy understand their local needs and constraints pretty well, whereas central planners have basically no idea what’s going on. For example, central planners don’t have good information on when a factory fails to receive a shipment and they don’t have an accurate sense for how much more efficient some devices are than others. Even worse, in order to obtain more resources, factory managers in the USSR routinely lied to the central planners about their production capabilities. The situation became so bad that, according toone of the deep state secrets of the USSR, central planners preferred to use the CIA’s analyses of certain Russian commodities rather than reports from local Party bosses! This is especially crazy if you consider that the CIA described its own data as being of “debilitatingly” poor quality.
Nonlinearities. The optimal planners assumed linearity, such that the cost for a factory producing its 1000th widget was assumed to be the same as the cost for producing its first widget. In the real world, this is obviously false, as there are increasing returns to scale. It’s possible to model increasing returns to scale, but it becomes harder to solve computationally.
Choosing an objective function. Choosing what the society should value is really a political problem, and Cosma Shalizi does a very nice job describing why it would be so hard to come to agreement.
Incentives for innovation. The central planners couldn’t determine resource allocation for products that didn’t exist yet, and more importantly neither they nor the factories had much incentive to invent new products. That’s why the Soviet Union remained so focused on the steel/coal/cement economy while Western nations shifted their focus to plastics and microelectronics.
Political resistance. As described in a previous example, the model-based recommendations to shut down certain factories were ignored for political reasons. It is likely that many recommendations for the broader economy would have been ignored as well. For example, if a computer recommended that the price of heating oil should be doubled in the winter, how many politicians would let that happen?
Could this work in the future?
Had the optimal planners’ ideas been adopted at the time, they would have failed. But what about the future? In a hundred years, could we have the technical capability to pull off a totally planned economy? I did some poking around the internet and found, somewhat to my surprise, that the answer is actually… maybe. It turns out that two of the most serious problems with central planning could have technological solutions that may seem far-fetched but are perhaps not impossible:
Let’s start with computational complexity. As described above and in Cosma Shalizi’s post, the number of steps required to solve a linear programming problem with products and constraints is proportional to . The USSR had about 12 million types of goods. If you cross them over about 1000 possible locations, that gives you 12 billion variables, which according to Cosma would correspond to an optimization problem that would take a thousand years to solve on a modern desktop computer. However, if Moore’s Law holds up, it would be possible in 100 years to solve this problem reasonably quickly. It’s also worth pointing out that the economy’s input-output matrix is sparse, since not every product depends on every other product as input. It may be possible that someone might develop a faster algorithm that leverages this sparsity, although Cosma is somewhat skeptical that this could happen. [In an earlier version of this post, I discussed a sparsity-based proposal that supposedly brought things down to complexity. This was apparently a red herring that doesn’t actually solve the optimization problem.]
As described earlier, the second serious issue with a centrally planned economy was data quality: Central planners’ knowledge about the input requirements and output capabilities of individual factories was simply not as good as the people actually working in the factory. While this was certainly the case in the Soviet Union, one can’t help but wonder about technological improvements in supply chain management. Imagine if every product had a GPS device to track its location, with other sensors and cameras to determine product quality. Already Amazon is moving in that direction for pretty much all consumer goods, and one could imagine a world where demand could be measured with the Internet of Things. Whether a government would be able to harness this data as competently as Amazon is doubtful, and it’s obviously worth asking whether we would ever want a government to be using that type of data. But from a technical point of view it’s interesting to think about how the data quality issues that destroyed the USSR could be much less serious in the future.
All that being said, it’s still unclear to me how an objective function could be chosen in way that would democratically satisfy people, how innovation could be incentivized, or how political freedoms could be preserved. In terms of freedom, things were obviously not-so-hot in the Soviet Union. If you’d like to read more about this, you should definitely check out Red Plenty. It was one of the weirdest and most interesting books I have read.
Last week on Marginal Revolution, there was a link to a wonderful paper comparing the policy opinions of economic experts to those of the general public. The paper, by Paola Sapienza and Luigi Zingales, found some pretty significant discrepancies between the two groups. The authors attributed this difference to the degree of trust each group put in the implicit assumptions embedded into the economists’ answers. It’s an excellent and fascinating read. However, like pretty much all academic economics papers, it displays its data in cumbersome text tables rather than figures. In this blog post, I’ve created some figures from the data that I think are easier to read than tables.
For each policy question, the paper provides two numbers. One is the percentage of economic experts who agree with the policy position. The second is the percentage of general public respondents who agree with the policy position. This sounds like a good opportunity for a scatter plot:
There’s a light negative correlation (r = -0.47) between how much economic experts and the general public agree with these positions. As shown in the bottom right, the vast majority of economists prefer a carbon tax to car emissions standards, whereas the vast majority of the general public prefers the reverse. As shown in the top left, the vast majority of the general public believes that requiring the government to “Buy American” is an effective way to improve manufacturing employment, whereas the majority of economists do not. For the exact wording of the questions, see the Appendix to the original paper.
Another way to visualize the same data is with a slopegraph. Below, the left column shows all the policy positions ranked by agreement from economic experts. The right column shows the same policy positions ranked by agreement from the general public. This type of plot vividly shows how unanimous the experts are on a few beliefs: It’s very hard to predict the stock market, we’re on the left side of the Laffer Curve, and the US economy is fiscally unsustainable without healthcare cuts or taxes hikes.
Slopegraphs are useful in this context because they create less text overlap than scatter plots. With the scatter plot above, I had to use interactive mouseover events to selectively show the text for individual data points. This wouldn’t be possible in a publication, since most economics journals still require static PDFs.
Even with a slopegraph, there is still some risk of overlapping text. To keep this from happening, I added some light repulsion to the data points.
For those interested, the experts in this dataset come from the Economic Expert Panel, a diverse set of economists comprising Democrats, Republicans, and Independents. This panel is the same panel that generates the data used in my Which Famous Economist website.
[Python code for the slope graph. Scatter plot is in page source.]
One of the great developments in product design has been the adoption of A/B testing. Instead of just guessing what is best for your customers, you can offer a product variant to a subset of customers and measure how well it works. While undeniably useful, A/B testing is sometimes said to encourage too much “hill climbing”, an incremental and short-sighted style of product development that emphasizes easy and immediate wins.
Discussion around hill climbing can sometimes get a bit vague, so I thought I would make some animations that describe four distinct pitfalls that can emerge from an overreliance on hill climbing.
1. Local maxima
If you climb hills incrementally, you may end up in a local maximum and miss out on an opportunity to land on a global maximum with much bigger reward. Concerns about local maxima are often wrapped up in critiques of incrementalism.
Local maxima and global maxima can be illustrated with hill diagrams like the one above. The horizontal axis represents product space collapsed into a single dimension. In reality, of course, there are many dimensions that a product could explore.
2. Emergent maxima
If you run short A/B tests, or A/B tests that do not fully capture network effects, you might not realize that a change that initially seems bad may be good in the long run. This idea, which is distinct from concerns about incrementalism, can be described with a dynamic reward function animation. As before, the horizontal axis is product space. Each video frame represents a time step, and the vertical axis represents immediate, measurable reward.
When a product changes, the intial effect is negative. But eventually, customers begin to enjoy the new version, as shown by changes in the reward function. By waiting at a position that initially seemed negative, you are able to discover an emergent mountain, and receive greater reward than you would have from short-term optimization.
3. Novelty effects
Short-term optimization can be bad, not only because it prevents discovery of emergent mountains, but also because some hills can be transient. One way a hill can disappear is through novelty effects, where a shiny new feature can be engaging to customers in the short term, but uninteresting or even negative in the long term.
4. Loss of differentiation
Another way a hill can disappear is through loss of differentiation from more dominant competitors. Your product may occupy a market niche. If you try to copy your competitor, you may initially see some benefits. But at some point, your customers may leave because not much separates you from your more dominant competitor. Differentiation matters in some dimensions more than others.
You can think of an industry as a dynamic ecosystem where each company has its own reward function. When one company moves, it changes its own reward function as well as the reward functions of other companies. If this sounds like biology, you’re not mistaken. The dynamics here are similar to evolutionary fitness landscapes.
While all of the criticisms of hill climbing have obvious validity, I think it is easy for people to overreact to them. Here are some caveats in defense of hill climbing:
The plots above probably exaggerate the magnitude and frequency with which reward functions change.
There is huge uncertainty and disagreement about what future landscapes will look like. In most cases, it’s better to explore regions that increase (rather than decrease) reward, making sure to run long term experiments when needed.
The space is high dimensional. Even if your product is at a local maximum in one dimension, there are many other dimensions to explore and measure.
We may overestimate the causal relationship between bold product moves and company success. Investors often observe that companies who don’t make bold changes are doomed to fail. While I don’t doubt that there is some causation here, I think there is also some reverse causation. Bold changes require lots of resources. Maybe it’s mostly the success-bound companies who have enough resources to afford the bold changes.
Special thanks to Marika Inhoff, John McDonnell and Alex Rosner for comments on a draft of this post.
Jupyter notebooks are great because they allow you to easily present interactive figures. In addition, these notebooks include the figures and code in a single file, making it easy for others to reproduce your results. Sometimes though, you may want to present a cleaner report to an audience who may not care about the code. This blog post shows how to make code visibility optional, and how to remove various Jupyter elements to get a clean presentation.
On the top is a typical Jupyter presentation with code and some extra elements. Below that is a more polished version that removes some of the extra elements and makes code visibility optional with a button.
To make the code optionally visible, available at the click of a button, include a raw cell at the beginning of your notebook containing the JavaScript and HTML below. This code sample is inspired by a Stack Overflow post, but makes a few improvements such as using a raw cell so that the button position stays fixed, changing the button text depending on state, and displaying gradual transitions so the user understands what is happening.
It’s pretty straightforward to remove the extra elements like the header, footer, and prompt numbers. That being said, you may want to still include some attribution to the Jupyter project and to your free hosting service. To do all of this, just include a raw cell at the end of your notebook with some more JavaScript.
One shortcoming with what we have so far is that users may still see some code or other unwanted elements while the page is loading. This can be especially problematic if you have a long presentation with many plots. To avoid this problem, add a raw cell at the very top of your notebook containing a preloader. This example preloader includes an animation that signals to users that the page is still loading. It heavily inspired by this preloader created by @mimoYmima.
To work with these notebooks, you can clone my GitHub repository. While the notebooks render correctly on nbviewer (unpolished, polished), they do not render correctly on the GitHub viewer.