Variance after scaling and summing: One of the most useful facts from statistics

What do $ R^2 $, laboratory error analysis, ensemble learning, meta-analysis, and financial portfolio risk all have in common? The answer is that they all depend on a fundamental principle of statistics that is not as widely known as it should be. Once this principle is understood, a lot of stuff starts to make more sense.

Here’s a sneak peek at what the principle is:

$$ \sigma_{p}^{2} = \sum\limits_{i} \sum\limits_{j} w_i w_j \sigma_i \sigma_j \rho_{ij} $$

Don’t worry if the formula doesn’t yet make sense! We’ll work our way up to it slowly, taking pit stops along the way at simpler formulas that are useful on their own. As we work through these principles, we’ll encounter lots of neat applications and explainers.

This post consists of three parts:

  • Part 1: Sums of uncorrelated random variables: Applications to social science and laboratory error analysis
  • Part 2: Weighted sums of uncorrelated random variables: Applications to machine learning and scientific meta-analysis
  • Part 3: Correlated variables and Modern Portfolio Theory

Part 1: Sums of uncorrelated random variables: Applications to social science and laboratory error analysis

Let’s start with some simplifying conditions and assume that we are dealing with uncorrelated random variables. If you take two of them and add them together, the variance of their sum will equal the sum of their variances. This is amazing!

To demonstrate this, I’ve written some Python code that generates three arrays, each of length 1 million. The first two arrays contain samples from two normal distributions with variances 9 and 16, respectively. The third array is the sum of the first two arrays. As shown in the simulation, its variance is 25, which is equal to the sum of the variances of the first two arrays (9 + 16).

from numpy.random import randn
import numpy as np
n = 1000000
x1 = np.sqrt(9) * randn(n) # 1M samples from normal distribution with variance=9
print(x1.var()) # 9
x2 = np.sqrt(16) * randn(n) # 1M samples from normal distribution with variance=16
print(x2.var()) # 16
xp = x1 + x2
print(xp.var()) # 25

This fact was first discovered in 1853 and is known as Bienaymé’s Formula. While the code example above shows the sum of two random variables, the formula can be extended to multiple random variables as follows:

If $ X_p $ is a sum of uncorrelated random variables $ X_1 ... X_n $, then the variance of $ X_p $ will be $$ \sigma_{p}^{2} = \sum{\sigma^2_i} $$ where each $ X_i $ has variance $ \sigma_i^2 $.

What does the $ p $ stand for in $ X_p $? It stands for portfolio, which is just one of the many applications we’ll see later in this post.

Why this is useful

Bienaymé’s result is surprising and unintuitive. But since it’s such a simple formula, it is worth committing to memory, especially because it sheds light on so many other principles. Let’s look at two of them.

Understanding $ R^2 $ and “variance explained”

Psychologists often talk about “within-group variance”, “between-group variance”, and “variance explained”. What do these terms mean?

Imagine a hypothetical study that measured the extraversion of 10 boys and 10 girls, where extraversion is measured on a 10-point scale (Figure 1. Orange bars). The boys have a mean extraversion of 4.4 and the girls have a mean extraversion of 5.0. In addition, the overall variance of the data is 2.5. We can decompose this variance into two parts:

  • Between-group variance: Create a 20-element array where every boy is assigned to the mean boy extraversion of 4.4, and every girl is assigned to the mean girl extraversion of 5.0. The variance of this array is 0.9. (Figure 1. Blue bars).
  • Within-group variance: Create a 20-element array of the amount each child’s extraversion deviates from the mean value for their sex. Some of these values will be negative and some will be positive. The variance of this array is 1.6. (Figure 1. Pink bars).
Figure 1: Decomposition of extraversion scores (orange) into between-group variance (blue) and within-group variance (pink).

If you add these arrays together, the resulting array will represent the observed data (Figure 1. Orange bars). The variance of the observed array is 2.5, which is exactly what is predicted by Bienaymé’s Formula. It is the sum of the variances of the two component arrays (0.9 + 1.6). Psychologists might say that sex “explains” 0.9/2.5 = 36% of the extraversion variance. Equivalently, a model of extraversion that uses sex as the only predictor would have an $ R^2 $ of 0.36.
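This decomposition is easy to check in code. The scores below are made up for illustration (so they won’t reproduce the exact 0.9 / 1.6 / 2.5 numbers above), but the variance identity holds for any data, because the within-group deviations sum to zero inside each group:

```python
import numpy as np

# Hypothetical extraversion scores (made-up numbers for illustration)
boys = np.array([4.4, 3.0, 5.6, 2.8, 6.2, 4.1, 3.5, 5.0, 4.9, 4.5])
girls = np.array([5.0, 6.3, 3.9, 4.2, 6.8, 5.5, 4.4, 5.7, 3.6, 4.6])
scores = np.concatenate([boys, girls])

# Between-group array: every child replaced by their group mean
between = np.concatenate([np.full(10, boys.mean()), np.full(10, girls.mean())])
# Within-group array: each child's deviation from their group mean
within = scores - between

print(np.allclose(scores, between + within))       # True: the arrays sum to the data
print(scores.var(), between.var() + within.var())  # equal: the variances add
r_squared = between.var() / scores.var()           # fraction of variance "explained" by sex
print(r_squared)
```

The last line is exactly the “variance explained” quantity: the between-group variance as a fraction of the total.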

Error propagation in laboratories

If you ever took a physics lab or chemistry lab back in college, you may remember having to perform error analysis, in which you calculated how errors would propagate through one noisy measurement after another.

Physics textbooks often say that standard deviations add in “quadrature”, which just means that if you are trying to estimate some quantity that is the sum of two other measurements, and if each measurement has some error with standard deviation $ \sigma_1 $ and $ \sigma_2 $ respectively, the final standard deviation would be $ \sigma_p = \sqrt{\sigma_1^2 + \sigma_2^2} $. I think it’s probably easier to just use variances, as in the Bienaymé Formula, with $ \sigma_{p}^{2} = \sigma_1^2 + \sigma_2^2 $.

For example, imagine you are trying to estimate the height of two boxes stacked on top of each other (Figure 2). One box has a height of 1 meter with variance $ \sigma^2_1 $ = 0.01, and the other has a height of 2 meters with variance $ \sigma^2_2 $ = 0.01. Let’s further assume, perhaps optimistically, that these errors are independent. That is, if the measurement of the first box is too high, it’s not any more likely that the measurement of the second box will also be too high. If we can make these assumptions, then the total height of the two boxes will be 3 meters with variance $ \sigma^2_p $ = 0.02.

Figure 2: Two boxes stacked on top of each other. The height of each box is measured with some variance (uncertainty). The total height is the sum of the individual heights, and the total variance (uncertainty) is the sum of the individual variances.
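A quick simulation of the stacked boxes (assuming normally distributed, independent measurement errors) confirms the arithmetic:

```python
from numpy.random import randn
import numpy as np
n = 1000000
box1 = 1 + np.sqrt(0.01) * randn(n)  # 1M simulated measurements of the 1 m box
box2 = 2 + np.sqrt(0.01) * randn(n)  # 1M simulated measurements of the 2 m box
total = box1 + box2
print(total.mean())  # ~3.0 meters
print(total.var())   # ~0.02, the sum of the individual variances
```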

There is a key difference between the extraversion example and the stacked boxes example. In the extraversion example, we added two arrays that each had an observed sample variance. In the stacked boxes example, we added two scalar measurements, where the variance of these measurements refers to our measurement uncertainty. Since both cases have a meaningful concept of ‘variance’, the Bienaymé Formula applies to both.

Part 2: Weighted sums of uncorrelated random variables: Applications to machine learning and scientific meta-analysis

Let’s now move on to the case of weighted sums of uncorrelated random variables. But before we get there, we first need to understand what happens to variance when a random variable is scaled.

If $ X_p $ is defined as $ X $ scaled by a factor of $ w $, then the variance $ X_p $ will be $$ \sigma_{p}^{2} = w^2 \sigma^2 $$ where $ \sigma^2 $ is the variance of $ X $.

This means that if a random variable is scaled by $ w $, its variance is scaled by $ w^2 $. Let’s see this in code.

from numpy.random import randn
import numpy as np
n = 1000000
baseline_var = 10
w = 0.7
x1 = np.sqrt(baseline_var) * randn(n) # Array of 1M samples from normal distribution with variance=10
print(x1.var()) # 10
xp = w * x1 # Scale this by w=0.7
print(w**2 * baseline_var) # 4.9 (predicted variance)
print(xp.var()) # 4.9 (empirical variance) 

To gain some intuition for this rule, it’s helpful to think about outliers. We know that outliers have a huge effect on variance. That’s because the formula used to compute variance, $ \sum{\frac{(x_i - \bar{x})^2}{n-1}} $, squares all the deviations, and so we get really big variances when we square large deviations. With that as background, let’s think about what happens if we scale our data by 2. The outliers will spread out twice as far, which means they will have even more than twice as much impact on the variance. Similarly, if we multiply our data by 0.5, we will squash the most “damaging” part of the outliers, and so we will reduce our variance by more than a factor of two.

While the above principle is pretty simple, things start to get interesting when you combine it with the Bienaymé Formula in Part I:

If $ X_p $ is a weighted sum of uncorrelated random variables $ X_1 ... X_n $, then the variance of $ X_p $ will be $$ \sigma_{p}^{2} = \sum{w^2_i \sigma^2_i} $$ where each $ w_i $ is a weight on $ X_i $, and each $ X_i $ has its own variance $ \sigma_i^2 $.

The above formula shows what happens when you scale and then sum random variables. The final variance is the weighted sum of the original variances, where the weights are squares of the original weights. Let’s see how this can be applied to machine learning.

An ensemble model with equal weights

Imagine that you have built two separate models to predict car prices. While the models are unbiased, they have variance in their errors. That is, sometimes a model prediction will be too high, and sometimes a model prediction will be too low. Model 1 has a mean squared error (MSE) of \$1,000 and Model 2 has an MSE of \$2,000.

A valuable insight from machine learning is that you can often create a better model by simply averaging the predictions of other models. Let’s demonstrate this with simulations below.

from numpy.random import randn
import numpy as np
n = 1000000
actual = 20000 + 5000 * randn(n)
errors1 = np.sqrt(1000) * randn(n)
print(errors1.var()) # 1000
errors2 = np.sqrt(2000) * randn(n)
print(errors2.var()) # 2000

# Note that this section could be replaced with 
# errors_ensemble = 0.5 * errors1 + 0.5 * errors2
preds1 = actual + errors1
preds2 = actual + errors2
preds_ensemble = 0.5 * preds1 + 0.5 * preds2
errors_ensemble = preds_ensemble - actual

print(errors_ensemble.var()) # 750. Lower than variance of component models!

As shown in the code above, even though a good model (Model 1) was averaged with an inferior model (Model 2), the resulting Ensemble model’s MSE of \$750 is better than either of the models individually.

The benefits of ensembling follow directly from the weighted sum formula we saw above, $ \sigma_{p}^{2} = \sum{w^2_i \sigma^2_i} $. To understand why, it’s helpful to think of models not as generating predictions, but rather as generating errors. Since averaging the predictions of the models corresponds to averaging the errors of the models, we can treat each model’s array of errors as samples of a random variable whose variance can be plugged in to the formula. Assuming the models are unbiased (i.e. the errors average to about zero), the formula tells us the expected MSE of the ensemble predictions. In the example above, the MSE would be

$$ \sigma_{p}^{2} = 0.5^2 \times 1000 + 0.5^2 \times 2000 = 750 $$

which is exactly what we observed in the simulations.

(For a totally different intuition of why ensembling works, see this blog post that I co-wrote for my company, Opendoor.)

An ensemble model with Inverse Variance Weighting

In the example above, we obtained good results by using an equally-weighted average of the two models. But can we do better?

Yes we can! Since Model 1 was better than Model 2, we should probably put more weight on Model 1. But of course we shouldn’t put all our weight on it, because then we would throw away the demonstrably useful information from Model 2. The optimal weight must be somewhere in between 50% and 100%.

An effective way to find the optimal weight is to build another model on top of these models. However, if you can make certain assumptions (unbiased and uncorrelated errors), there’s an even simpler approach that is great for back-of-the-envelope calculations and for understanding the principles behind ensembling.

To find the optimal weights (assuming unbiased and uncorrelated errors), we need to minimize the variance of the ensemble errors $ \sigma_{p}^{2} = \sum{w^2_i \sigma^2_i} $ with the constraint that $ \sum{w_i} = 1 $.

It turns out that the variance-minimizing weight for a model should be proportional to the inverse of its variance: $$ w_i = \frac{1 / \sigma^2_i}{\sum_j{1 / \sigma^2_j}} $$

When we apply this method, we obtain optimal weights of $ w_1 $ = 0.67 and $ w_2 $ = 0.33. These weights give us an ensemble error variance of

$$ \sigma_{p}^{2} = 0.67^2 \times 1000 + 0.33^2 \times 2000 \approx 667 $$

which is significantly better than the $750 variance we were getting with equal weighting.

This method is called Inverse Variance Weighting, and allows you to assign the right amount of weight to each model, depending on its error.
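The calculation is short enough to sketch directly:

```python
import numpy as np

variances = np.array([1000.0, 2000.0])             # error variances (MSEs) of Models 1 and 2
weights = (1 / variances) / (1 / variances).sum()  # inverse-variance weights
print(weights)                                     # [0.667, 0.333]
ensemble_var = (weights**2 * variances).sum()
print(ensemble_var)                                # ~667, better than 750 from equal weights
```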

Inverse Variance Weighting is not just useful as a way to understand Machine Learning ensembles. It is also one of the core principles in scientific meta-analysis, which is popular in medicine and the social sciences. When multiple scientific studies attempt to estimate some quantity, and each study has a different sample size (and hence variance of their estimate), a meta-analysis should weight the high sample size studies more. Inverse Variance Weighting is used to determine those weights.

Part 3: Correlated variables and Modern Portfolio Theory

Let’s imagine we now have three unbiased models with the following MSEs:

  • Model 1: MSE = 1000
  • Model 2: MSE = 1000
  • Model 3: MSE = 2000

By Inverse Variance Weighting, we should assign more weight to the first two models, with $ w_1 = 0.4 $, $ w_2 = 0.4 $, and $ w_3 = 0.2 $.

But what happens if Model 1 and Model 2 have correlated errors? For example, whenever Model 1’s predictions are too high, Model 2’s predictions tend to also be too high. In that case, maybe we don’t want to give so much weight to Models 1 and 2, since they provide somewhat redundant information. Instead we might want to diversify our ensemble by increasing the weight on Model 3, since it provides new independent information.

To determine how much weight to put on each model, we first need to determine how much total variance there will be if the errors are correlated. To do this, we need to borrow a formula from the financial literature, which extends the formulas we’ve worked with before. This is the formula we’ve been waiting for.

If $ X_p $ is a weighted sum of (correlated or uncorrelated) random variables $ X_1 ... X_n $, then the variance of $ X_p $ will be $$ \sigma_{p}^{2} = \sum\limits_{i} \sum\limits_{j} w_i w_j \sigma_i \sigma_j \rho_{ij} $$ where each $ w_i $ and $ w_j $ are weights assigned to $ X_i $ and $ X_j $, where each $ X_i $ and $ X_j $ have standard deviations $ \sigma_i $ and $ \sigma_j $, and where the correlation between $ X_i $ and $ X_j $ is $ \rho_{ij} $.

There’s a lot to unpack here, so let’s take this step by step.

  • $ \sigma_i \sigma_j \rho_{ij} $ is a scalar quantity representing the covariance between $ X_i $ and $ X_j $.
  • If none of the variables are correlated with each other, then all the cases where $ i \neq j $ will go to zero, and the formula reduces to $ \sigma_{p}^{2} = \sum{w^2_i \sigma^2_i} $, which we have seen before.
  • The more that two variables $ X_i $ and $ X_j $ are correlated, the more the total variance increases.
  • If two variables $ X_i $ and $ X_j $ are anti-correlated, then the total variance decreases, since $ \rho_{ij} $ is negative.
  • This formula can be rewritten in more compact notation as $ \sigma_{p}^{2} = w^\top \Sigma w $, where $ w $ is the weight vector, and $ \Sigma $ is the covariance matrix (not a summation sign!)

If you skimmed the bullet points above, go back and re-read them! They are super important.

To find the set of weights that minimize variance in the errors, you must minimize the above formula, with the constraint that $ \sum{w_i} = 1 $. One way to do this is to use a numerical optimization method. In practice, however, it is more common to just find weights by building another model on top of the base models.

Regardless of how the weights are found, it will usually be the case that if Models 1 and 2 are correlated, the optimal weights will reduce redundancy and put lower weight on these models than simple Inverse Variance Weighting would suggest.

Applications to financial portfolios

The formula above was discovered by economist Harry Markowitz in his Modern Portfolio Theory, which describes how an investor can optimally trade off between expected returns and expected risk, often measured as variance. In particular, the theory shows how to maximize expected return given a fixed variance, or minimize variance given a fixed expected return. We’ll focus on the latter.

Imagine you have three stocks to put in your portfolio. You plan to sell them at time $ T $, at which point you expect that Stock 1 will have gone up by 5%, with some uncertainty. You can describe your uncertainty as variance, and in the case of Stock 1, let’s say $ \sigma^2_1 = 1 $. This stock, as well as Stocks 2 and 3, are summarized in the table below:

Stock ID | Expected Return | Expected Risk ($ \sigma^2 $)
1 | 5.0 | 1.0
2 | 5.0 | 1.0
3 | 5.0 | 2.0

This financial example should remind you of ensembling in machine learning. In the case of ensembling, we wanted to minimize variance of the weighted sum of error arrays. In the case of financial portfolios, we want to minimize the variance of the weighted sum of scalar financial returns.

As before, if there are no correlations between the expected returns (i.e. if Stock 1 exceeding 5% return does not imply that Stock 2 or Stock 3 will exceed 5% return), then the total variance in the portfolio will be $ \sigma_{p}^{2} = \sum{w^2_i \sigma^2_i} $, and we can use Inverse Variance Weighting to obtain weights $ w_1=0.4, w_2=0.4, w_3=0.2 $.

However, sometimes stocks have correlated expected returns. For example, if two of the stocks are in oil companies, then one stock exceeding 5% implies the other is also likely to exceed 5%. When this happens, the total variance becomes

$$ \sigma_{p}^{2} = \sum\limits_{i} \sum\limits_{j} w_i w_j \sigma_i \sigma_j \rho_{ij} $$

as we saw before in the ensemble example. Since this includes an additional positive term for the covariance between Stocks 1 and 2, the expected variance is higher than in the uncorrelated case, assuming the correlations are positive. To reduce this variance, we should put less weight on Stocks 1 and 2 than we would otherwise.
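To make the effect concrete, here is a small sketch computing the portfolio variance for the fixed weights above, with and without an assumed correlation of 0.5 between Stocks 1 and 2:

```python
import numpy as np

w = np.array([0.4, 0.4, 0.2])            # inverse-variance weights from above
sd = np.sqrt(np.array([1.0, 1.0, 2.0]))  # standard deviations of the three stocks

def portfolio_var(rho12):
    rho = np.eye(3)
    rho[0, 1] = rho[1, 0] = rho12
    cov = np.outer(sd, sd) * rho
    return w @ cov @ w

print(portfolio_var(0.0))  # 0.40: the uncorrelated case
print(portfolio_var(0.5))  # 0.56: the correlation adds 2 * w1 * w2 * sd1 * sd2 * rho12
```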

While the example above focused on minimizing the variance of a financial portfolio, you might also be interested in having a portfolio with high return. Modern Portfolio Theory describes how a portfolio can reach any arbitrary point on the efficient frontier of variance and return, but that’s outside the scope of this blog post. And as you might expect, financial markets can be more complicated than Modern Portfolio Theory suggests, but that’s also outside scope.


That was a long post, but I hope that the principles described have been informative. It may be helpful to summarize them in backwards order, starting with the most general principle.

If $ X_p $ is a weighted sum of (correlated or uncorrelated) random variables $ X_1 ... X_n $, then the variance of $ X_p $ will be $$ \sigma_{p}^{2} = \sum\limits_{i} \sum\limits_{j} w_i w_j \sigma_i \sigma_j \rho_{ij} $$ where each $ w_i $ and $ w_j $ are weights assigned to $ X_i $ and $ X_j $, where each $ X_i $ and $ X_j $ have standard deviations $ \sigma_i $ and $ \sigma_j $, and where the correlation between $ X_i $ and $ X_j $ is $ \rho_{ij} $. The term $ \sigma_i \sigma_j \rho_{ij} $ is a scalar quantity representing the covariance between $ X_i $ and $ X_j $.

If none of the variables are correlated, then all the cases where $ i \neq j $ go to zero, and the formula reduces to $$ \sigma_{p}^{2} = \sum{w^2_i \sigma^2_i} $$ And finally, if we are computing a simple sum of random variables where all the weights are 1, then the formula reduces to $$ \sigma_{p}^{2} = \sum{\sigma^2_i} $$

Using your ears and head to escape the Cone Of Confusion

One of the coolest things I ever learned about sensory physiology is how the auditory system is able to locate sounds. To determine whether a sound is coming from the right or left, the brain uses inter-ear differences in amplitude and timing. As shown in the figure below, if the sound is louder in the right ear compared to the left ear, it’s probably coming from the right side. The smaller that difference is, the closer the sound is to the midline (i.e. the vertical plane going from your front to your back). Similarly, if the sound arrives at your right ear before the left ear, it’s probably coming from the right. The smaller the timing difference, the closer it is to the midline. There’s a fascinating literature on the neural mechanisms behind this.

Inter-ear loudness and timing differences are pretty useful, but unfortunately they still leave a lot of ambiguity. For example, a sound from your front right will have the exact same loudness differences and timing differences as a sound from your back right.

Not only does this system leave ambiguities between front and back, it also leaves ambiguities between up and down. In fact, there is an entire cone of confusion that cannot be disambiguated by this system. Sound from all points along the surface of the cone will have the same inter-ear loudness differences and timing differences.

While this system leaves a cone of confusion, humans are still able to determine the location of sounds from different points on the cone, at least to some extent. How are we able to do this?

Amazingly, we are able to do this because of the shape of our ears and heads. When sound passes through our ears and head, certain frequencies are attenuated more than others. Critically, the attenuation pattern is highly dependent on sound direction.

This location-dependent attenuation pattern is called a Head-related transfer function (HRTF) and in theory this could be used to disambiguate locations along the cone of confusion. An example of someone’s HRTF is shown below, with frequency on the horizontal axis and polar angle on the vertical axis. Hotter colors represent less attenuation (i.e. more power). If your head and ears gave you this HRTF, you might decide a sound is coming from the front if it has more high frequency power than you’d expect.

HRTF image from Simon Carlile's Psychoacoustics chapter in The Sonification Handbook.

This system sounds good in theory, but do we actually use these cues in practice? In 1988, Frederic Wightman and Doris Kistler performed an ingenious set of experiments (1, 2) to show that people really do use HRTFs to infer location. First, they measured the HRTF of each participant by putting a small microphone in their ears and playing sounds from different locations. Next they created a digital filter for each location and each participant. That is to say, these filters implemented each participant’s HRTF. Finally, they placed headphones on the listeners and played sounds to them, each time passing the sound through one of the digital filters. Amazingly, participants were able to correctly guess the “location” of the sound, depending on which filter was used, even though the sound was coming from headphones. They were also much better at sound localization when using their own HRTF, rather than someone else’s HRTF.

Further evidence for this hypothesis comes from Hofman et al., 1998, who showed that by using putty to reshape people’s ears, they were able to change the HRTFs and thus disrupt sound localization. Interestingly, people were able to quickly relearn how to localize sound with their new HRTFs.

Image from Hofman et al., 1998.

A final fun fact: to improve the sound localization of humanoid robots, researchers in Japan attached artificial ears to the robot heads and implemented some sophisticated algorithms to infer sound location. Here are some pictures of the robots.

Their paper is kind of ridiculous and has some questionable justifications for not just using microphones in multiple locations, but I thought it was fun to see these principles being applied.


Hyperbolic discounting — The irrational behavior that might be rational after all

When I was in grad school I occasionally overheard people talk about how humans do something called “hyperbolic discounting”. Apparently, hyperbolic discounting was considered irrational under standard economic theory.

I recently decided to learn what hyperbolic discounting was all about, so I set out to write this blog post. I have to admit that hyperbolic discounting has been pretty hard for me to understand, but I think I now finally have a good enough handle on it to write about it. Along the way, I learned something interesting: Hyperbolic discounting might be rational after all.

Rational and irrational discounting

If I offered you $50 now or $100 in 6 months, which would you pick? It’s not crazy to choose the $50 now. One reason is that it’s a safer bet. If you had chosen the delayed $100, there’s a risk that I might forget about the deal when the time came to pay [personal note: I wouldn’t], and you would never get your money. Another reason is that if you invest the $50 now, you might be able to make up some of the remainder in interest.

Valuing immediate money more than future money is a rational behavior known as discounting. Everybody has their own discount factor. Some people might value money in 6 months at, say, 75% of what they’d value it today. Others might value it at 90%.

In the early 1980s, psychologist George Ainslie discovered something peculiar. He found that while a lot of people would prefer $50 immediately rather than $100 in 6 months, they would not prefer $50 in 3 months rather than $100 in 9 months. These two different scenarios are shown in the diagram below, where the green checks indicate the options that people tended to choose.

If you think about it, there’s something inconsistent about this behavior. The two scenarios are actually identical, just shifted by 3 months, and yet the same people behave differently depending on when the scenario would be presented. If we waited 3 months and then asked them again if they would prefer $50 immediately or $100 in 6 months, their original response to Scenario 1 implies they would prefer the fast money (i.e. they would have a high discount rate), but their original response to Scenario 2 implies they would prefer the delayed money (i.e. they would have a low discount rate). In other words, their present self today would make a decision in Scenario 2 that three months from now they will have regretted making. This behavior is time-inconsistent and is therefore considered irrational according to standard economic theory.

The usual way to achieve rational time-consistent discounting is with an exponential discount curve, where the value of receiving something at future time $ \tau $ is a fixed fraction $ s(\tau) = e^{-k\tau} $ of its present value, and where $ k $ is the constant discount rate.

With an exponential curve, a dollar delayed by six months is always worth the same fixed fraction of a dollar at the baseline date, no matter what the baseline date is. If you are indifferent between $50 now and $100 in 6 months, you should also be indifferent between $50 in 3 months and $100 in 9 months. The discount rate is constant.

In contrast to an exponential curve, humans tend to show a hyperbolic discount curve, which is considered irrational according to standard economic theory. Confusingly, the hyperbolic discount curve is not defined by one of the hyperbolic functions you may remember from high school. Instead, it is defined as follows, where $ \tau $ is the relative time from now and $ k $ is a free parameter: $$ s(\tau) = \frac{1}{1 + k\tau} $$

Whereas an exponential curve has a constant discount rate, a hyperbolic discount curve has a higher discount rate in the near future and lower discount rate in the distant future. That’s why the participants in Ainslie’s experiment cared more about the delay from 0 to 6 months than about the same delay from 3 to 9 months.
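The time-inconsistency is easy to see numerically. A small sketch, using the two discount curves above with an arbitrary illustrative value of $ k $:

```python
import numpy as np

k = 0.15  # discount parameter per month (an arbitrary illustrative value)

def exponential(tau):
    return np.exp(-k * tau)

def hyperbolic(tau):
    return 1 / (1 + k * tau)

# Exponential discounting is time-consistent: shifting both options by
# 3 months leaves the relative value of a 6-month delay unchanged.
print(exponential(6) / exponential(0), exponential(9) / exponential(3))  # identical ratios

# Hyperbolic discounting is not: the same 6-month delay is penalized
# less heavily when it starts 3 months from now.
print(hyperbolic(6) / hyperbolic(0), hyperbolic(9) / hyperbolic(3))  # second ratio is larger
```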

The apparent rationality of an exponential function is often presented either with a hazard rate interpretation or an interest rate interpretation. It turns out, however, that both of these interpretations make implausible assumptions about the world. The rest of this blog post describes how, if we make some more plausible assumptions, the hyperbolic discount function becomes rational.

Hazard Rate interpretation

According to the hazard-based interpretation of discounting, you should prefer immediate money to future money because there is a risk that the future money will never be delivered. This interpretation is more common in the animal behavior literature. (Pigeons, rats, and monkeys all show hyperbolic discounting.)

Apparent rationality of exponential discounting

Imagine that there could be an event at some future time that would cause you to no longer receive your reward. Perhaps the person who owes you the money could die or lose their assets. Assuming a constant hazard rate $ \lambda $ (i.e. assuming the event is equally likely to happen at any time), the probability that the event has not happened by time $ \tau $ is:

$$ s(\tau) = e^{-\lambda\tau} $$

The function $ s(\tau) $ is called a “survival” function because it describes the probability that the deal is still “alive” by time $ \tau $. The parameter $ \lambda $ is the constant hazard rate.

A key insight here is that the survival function can also be interpreted as a discount function. If someone says you can receive your reward at time $ \tau $ if you’re willing to wait for it, and if there’s only a 60% chance you’ll actually receive it, you should value that offer at 60% of the current value of the reward. Since the survival function is exponential, and since the survival function is the discount function, your discount function is also exponential, at least according to standard economic theory.

Rationality of hyperbolic discounting

The problem with assuming an exponential survival function is that it assumes you know the hazard rate $ \lambda $. In most situations, you don’t know exactly what the hazard rate is. Instead you have uncertainty around the parameter. Souza (2015) proposes an exponential prior distribution over the hazard rate.

The exponential prior is shown in the graph below. Although this curve looks like a discount function, it is not. It is a distribution over a parameter.

According to Souza, if you average over all the possible exponential survival functions generated by this prior, you get a hyperbolic survival function, or equivalently, a hyperbolic discount function. Let’s see this in action.

In the plot below, I’ve drawn 30 exponential survival functions in blue, each with a hazard rate $ \lambda $ sampled from the prior distribution defined above. The pink curve is the mean of all of them. Notice that whereas the individual survival curves are all exponential, their mean is hyperbolic, with a distant future characterized by a flat slope and relatively high survival values.
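Souza’s result can be checked numerically: averaging exponential survival curves whose hazard rates are drawn from an exponential prior with mean $ k $ reproduces the hyperbolic curve $ 1/(1+k\tau) $. The value of $ k $ below is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
k = 0.1  # mean of the exponential prior over hazard rates (an assumed value)
lambdas = rng.exponential(scale=k, size=100000)  # hazard rates sampled from the prior

taus = np.linspace(0, 100, 11)
# Average the exponential survival curves e^(-lambda * tau) over the prior...
mixture = np.exp(-np.outer(lambdas, taus)).mean(axis=0)
# ...and compare to the hyperbolic curve 1 / (1 + k * tau)
hyperbolic = 1 / (1 + k * taus)
print(np.abs(mixture - hyperbolic).max())  # ~0: the mixture is hyperbolic
```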

As a consequence, the further you look into the future, the lower your discount rate. This is exactly what hyperbolic discounting is and exactly what humans do. For the choice between $50 now and $100 in 6 months, we apply a heavy discount rate. For the choice between $50 in 3 months and $100 in 9 months, we apply a lighter discount rate.

For a more qualitative intuition, you can think of it this way: When you make a deal with someone, you don’t know what the hazard rate is going to be. But if the deal is still alive after 80 months, then your hazard rate is probably favorable and thus the deal is likely to still be alive after 100 months. You can therefore have a light discount rate between 80 months and 100 months.

Interest Rate interpretation

While hazard functions are one way to explain discounting, another common explanation involves interest rates. A dollar now is worth more than a dollar in a year, because if you take the dollar now and invest it, it will be worth more in a year.

Apparent rationality of exponential discounting

If the interest rate is 5%, you should be indifferent between $100 today and $105 in a year, since you could just invest the $100 today and get the same amount in a year. If the interest rate is constant, the value of the dollar will rise exponentially, according to $ v(\tau) = e^{0.05\tau} $, where $ v $ is the value. To rationally maintain indifference between a dollar and its equivalent value in the future after investment, your discount function should decay exponentially, according to $ s(\tau) = e^{-0.05\tau} $.

Rationality of hyperbolic discounting

The problem with this story is that it only works if you assume that the interest rate is constant. In the real world, the interest rate fluctuates.

Before taking on the fluctuating interest rate scenario, let’s first take on a different assumption that is still somewhat simplified. Let’s assume that the interest rate is constant but we don’t know what it is, just as we didn’t know what the hazard rate was in the previous interpretation. With this assumption, the justification for hyperbolic discounting becomes similar to the explanation in the blue and pink plots above. When you do a probability-weighted average over these decaying exponential curves, you get a hyperbolic function.

The previous paragraph assumed that the interest rate was constant but unknown. In the real world, the interest rate is known but fluctuates over time. Farmer and Geanakoplos (2009) showed that if you assume that interest rate fluctuations follow a geometric random walk, hyperbolic discounting becomes optimal, at least asymptotically as $ \tau \rightarrow \infty $. In the near future, you know the interest rate with reasonable certainty and should therefore discount with an exponential curve. But as you look further into the future, your uncertainty about the interest rate increases and you should therefore discount with a hyperbolic curve.

Is the geometric random walk a process that was cherry-picked by the authors to produce this outcome? Not really. Newell and Pizer (2003) studied US bond rates in the 19th and 20th centuries and found that the geometric random walk provided a better fit than any of the other interest rate models tested.


When interpreting discounting as a survival function, a hyperbolic discounting function is rational if you introduce uncertainty into the hazard parameter via an exponential prior (Souza, 2015). When interpreting the discount rate as an interest rate, a hyperbolic discounting function is asymptotically rational if you introduce uncertainty in the interest rate via a geometric random walk (Farmer and Geanakoplos, 2009).


Religions as firms

I recently came across a magazine that helps pastors manage the financial and operational challenges of church management. The magazine is called Church Executive.

Readers concerned about seasonal effects on tithing can learn how to “sustain generosity” during the weaker summer months. Technology like push notifications and text messages is encouraged as a way to remind people to tithe. There is also some emphasis on messaging, as pastors are told to “make sure your generosity-focused sermons are hitting home with your audience”.

Churches need money to stay active, and it’s natural that pastors would want to maintain a healthy cash flow. But the brazen language of Church Executive reminded me of the language of profit-maximizing firms. This got me thinking: In what other ways do religions act like businesses?

This post is my attempt to understand religions as if they were businesses. This isn’t a perfect metaphor. Most religious leaders are motivated by genuine beliefs, and few are motivated primarily by profit. But it can still be instructive to view religions through the lens of business and economics, if only as an exercise. After working through this myself, I feel like I have a better understanding of why religions act the way they do.


As with any business, one of the most pressing concerns of a religion is competition. According to sociologist Carl Bankston, the set of religions can be described as a marketplace of competing firms that vie for customers. Religious consumers can leave one church to go to another. To hedge their bets on the afterlife, some consumers may even belong to several churches simultaneously, in a strategy that has been described as “portfolio diversification”.

One way that a religion can ward off competitors is to prohibit its members from following them. The Bible is insistent on this point, with 26 separate verses banning idolatry. Other religions have been able to eliminate competition entirely by forming state-sponsored monopolies.


Just like a business, religions need to determine how to price their product. According to economists Laurence Iannaccone and Feler Bose, the optimal pricing strategy for a religion depends on whether it is proselytizing or non-proselytizing.

Non-proselytizing religions like Judaism and Hinduism earn much of their income from membership fees. While exceptions are often made for people who are too poor to pay, and while donations are still accepted, the explicit nature of the membership fees helps these religions avoid having too many free riders.

Proselytizing religions like Christianity are different. Because of their strong emphasis on growth, they are willing to forgo explicit membership fees and instead rely more on donations that are up to the member’s discretion. Large donations from wealthy individuals can cross-subsidize the membership of those who make smaller donations. Even free riders who make no donations at all may be worthwhile, since they may attract more members in the future.

Surge Pricing

Like Uber, some religions raise the price during periods of peak demand. While attendance at Jewish synagogue for a regular Shabbat service is normally free, attendance during one of the High Holidays typically requires a payment for seating, in part to ensure space for everyone.

Surge pricing makes sense for non-proselytizing religions such as Judaism, but it does not make sense for proselytizing religions such as Christianity, which views the higher demand during peak season as an opportunity to convert newcomers and to reactivate lapsed members. Thus, Christian churches tend to expand seating and schedule extra services during Christmas and Easter, rather than charging fees.

Product Quality

Just as business consumers will pay higher prices for better products, consumers of polytheistic religions will pay higher “prices” for gods with more wide-ranging powers. Even today, some American megachurches have found success with the prosperity gospel, which emphasizes that God can make you wealthy.

Of course, not all religious consumers will prefer the cheap promises of the prosperity gospel. For many religions, product quality is defined primarily by community, a sense of meaning, and in some cases the promise of an afterlife.

Software Updates

A good business should be constantly updating its product to fix bugs and to respond to changes in consumer preference or government regulation. Some religions do the same thing, via the process of continuous revelation from their deity. Perhaps no church exemplifies this better than the Church of Jesus Christ of Latter-day Saints.

For most of the history of the Mormon Church, individuals of African descent were prohibited from serving as priests. By the 1960s, as civil rights protests against the church received media attention, the policy became increasingly untenable. On June 1, 1978, Mormon leaders reported that God had instructed them to update the policy and allow black priests. This event was known as the 1978 Revelation on the Priesthood.

In the late 19th Century, when the Mormon Church was under intense pressure from the US Government regarding polygamy, the Church president claimed to receive a revelation from Jesus Christ asking him to prohibit it. This revelation, known as the 1890 Revelation, overturned the previous 1843 Revelation which allowed polygamy.

While frequent updates usually make sense in business, they don’t always make sense in religion. Most religions have a fairly static doctrine, as the prospect of future updates undermines the authority of current doctrine.

Growth and marketing

Instead of focusing only on immediate profitability, many businesses invest in user growth. As mentioned earlier, many religions are willing to cross-subsidize participation from new members, especially young members, with older members bearing most of the costs.

Christianity’s concept of a heaven and hell encouraged its members to convert their friends and family. In some ways, this is reminiscent of viral marketing.

International expansion

Facebook and Netflix both experienced rapid adoption, starting with a U.S. audience. But as U.S. growth began to slow down, both companies needed to look towards international expansion.

A similar thing happened with the Mormon church. By the 20th century, U.S. growth was driven only by increasing family sizes, so the church turned towards international expansion.

The graph below shows similar US and international growth curves for Netflix and the Church of Jesus Christ of Latter-day Saints.[1,2,3,4]


Like any company, most religions try to maintain a good brand. But unlike businesses, most religions do not have brand protection, and thus their brands can be co-opted by other religions. Marketing from Mormons and from Jehovah’s Witnesses tends to emphasize the good brand of Jesus Christ, even though most mainstream Christians regard these churches as heretical.

One of the most interesting risks to brands is genericide, in which a popular trademark becomes synonymous with the general class of product, thereby diluting its distinctive meaning. Famous examples of generic trademarks include Kleenex and Band-Aid. Amazingly, genericide can also happen to religious deities. The ancient Near East god El began as a distinct god with followers, but gradually became a generic name for “God” and eventually merged with the Hebrew god Yahweh.

Mergers and spin-offs

In business, companies can spin off other companies or merge with other companies. But with rare exceptions, religions only seem to have spin-offs. Why do religions hardly ever merge with other religions? My guess is that since there is no protection for religious intellectual property, religions can acquire the intellectual property of another religion without requiring a merger. Religions can simply copy each other’s ideas.

Another reason that religious mergers are rare is that religions are strongly tied to personal identity and tap into tribal thinking. When WhatsApp was acquired, its leadership was happy to adopt new identities as Facebook employees. But it is far less likely that members of, say, the Syriac Catholic Church would ever tolerate merging into the rival Syriac Maronite Church, even if it might provide them with economies of scale and more political power.

On Twitter, I asked why there are so few religious mergers and got lots of interesting responses. Some people pointed out that reconciliation of doctrine could undermine the authority of the leaders, and that there is little benefit from economies of scale. Others noted that religious mergers aren’t that rare: Hinduism and Judaism may have begun as mergers of smaller religions, many Christian traditions involve mergers with religions they replaced, and even today Hinduism continues to be a merging of various sects.

It’s worth repeating that economic explanations aren’t always great at describing the conscious motivations of religious individuals, who generally have sincere beliefs. Nevertheless, economic reasoning does a decent job of predicting the behavior of systems, and it’s been pretty interesting to learn how religion is no exception.


Part 2: A bipartisan list of people who argue in good faith

In Part 1, I posted a bipartisan list of people who are bad for America. Those people present news stories that cherry-pick the worst actions from the other side so that they can get higher TV ratings and more social media points.

Here in Part 2, I post a list of people who don’t do that, at least for the most part. This isn’t a list of centrists. If anything, it is a more politically diverse list than the list in Part 1. This is a list of people who usually make good-faith attempts to persuade others about their point of view.

  • Megan McArdle (Twitter, Bloomberg) – Moderately libertarian ideas presented to a diverse audience
  • Noah Smith (Twitter, Bloomberg) – Center-left economics
  • Ross Douthat (Twitter, NYT) – Social conservatism presented to a left-of-center audience
  • Noam Chomsky (Website)
  • Conor Friedersdorf (The Atlantic)
  • Ben Sasse — Has the third-most conservative voting record in the Senate but never caricatures the other side and is very concerned about filter bubbles.
  • Julia Galef (Twitter) – Has some great advice for understanding the other side
  • Nicky Case (Twitter)
  • Fareed Zakaria (Washington Post) – Center-left foreign policy
  • Eli Lake (Twitter, Bloomberg) – Hawkish foreign policy
  • Kevin Drum (Mother Jones) – Center-left blogger who writes in good faith
  • John Carl Baker (Twitter) – One of the few modern socialists I have found who avoids in-group snark.
  • Michael Dougherty (Twitter, The Week)
  • Reihan Salam (Twitter, NRO)
  • Avik Roy (Twitter, NRO) – Conservative health care
  • Ezra Klein (Vox, early days at the American Prospect) – While at the American Prospect, Ezra did an amazing job trying to persuade people about the benefits of Obamacare. Vox, the explainer site that he started, sometimes slips into red meat clickbait. But to its credit, Vox has managed to reach a wide audience with mostly explainer content.

Reading the people on this list with an open mind will broaden your worldview.