Coronavirus and fragmented data pipelines

Anybody looking at coronavirus data right now must feel very confused. The UK has a daily case count 60 times higher than Australia's. Italy has a case fatality rate 3 times higher than nearby Greece's and 12 times higher than Pakistan's. These heterogeneities seem massive and could reveal critical insights about the disease. But as any epidemiologist will readily acknowledge, these statistics are badly confounded by inconsistent reporting protocols and variable testing capacity, both of which could be driving inter-country differences of 10x or more.

In this blog post, I want to talk about some of the enormous issues with data quality and why we have them. And I want to describe how we can go beyond merely acknowledging the biases, and instead focus on the policy changes that can fix them.

Testing count

The first issue is that different countries and different states vary greatly in how much testing they do. Iceland has tested 10% of its population, whereas India has only tested 0.1% of its population. Obviously these differences in testing will drive differences in reported case counts.

Given differences in economic development, it’s understandable that different countries will have different testing counts. What’s not acceptable is that so many countries don’t report their testing counts! They just report the number of positive tests! Even among the countries that report both numbers, many greatly underreport the testing count because they rely on commercial labs that do not provide these records.

This isn’t just an issue for developing countries. Looking just at the United States, economically advanced states like California, Washington, and New York have not regularly reported their total number of people tested. New York has, at various times, started and stopped reporting negative results.

If we wish to make sensible comparisons of infection rates across regions, it is critically important to know how many people were tested in each region, especially when testing rates may differ across regions by 10x or 100x. I do not want to be too critical here, but it is astonishing to me that we cannot produce this data.

Mix shift in reasons for testing

It is not enough to report just the number of tests and the number of positive results. To estimate the true infection rates, we must also know why each test was done. Imagine one country only tests symptomatic people, whereas another country tests symptomatic people and high-risk non-symptomatic people (e.g. health care workers), and that both countries do the same number of tests. Even so, you can’t estimate the true infection count by simply dividing the positive tests by the testing rate. For a reasonable estimate, you need to divide the positives in each stratum by the testing rate in that stratum, and then sum them up, as sketched below.
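To make the stratification concrete, here is a minimal sketch in Python with entirely made-up numbers. The point is the arithmetic, not the data: each stratum's positives get scaled by that stratum's own testing rate, and strata with no tests stay explicitly unknown rather than silently disappearing.

strata = {
    # stratum: (population size, number tested, number positive) -- made-up numbers
    "symptomatic":            (50_000,  20_000, 4_000),
    "high_risk_asymptomatic": (100_000, 10_000,   300),
    "general_asymptomatic":   (850_000,      0,     0),   # untested stratum
}

# Naive estimate: pool all positives over all tests, ignoring who got tested.
total_tests = sum(tested for _, tested, _ in strata.values())
total_positives = sum(pos for _, _, pos in strata.values())
print(f"Naive positive rate: {total_positives / total_tests:.1%}")

# Stratified estimate: scale each stratum's positive rate up to that stratum's
# population, assuming tested people are representative within their stratum.
estimated_infections = 0
for name, (population, tested, positives) in strata.items():
    if tested == 0:
        print(f"No tests in stratum '{name}'; its infections remain unknown.")
        continue
    estimated_infections += (positives / tested) * population

print(f"Stratified estimate (tested strata only): {estimated_infections:,.0f} infections")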

The CDC case report form has a field for why each test was done (Figure 1). Unfortunately, many states do not use this form and instead use their own form. (To be fair, some states like Washington use a more comprehensive form, but there is still a lack of standardization.) Moreover, none of the states to my knowledge report the total number of people in each stratum (tested and untested), which would be necessary to do a stratified analysis.

Figure 1: A section from the CDC case report form which could be used for stratification.


Why don’t we do this already?

None of this is groundbreaking stuff. Epidemiologists know about stratification and are keenly aware of the limitations of crude infection counts and crude case fatality rates.

This isn’t a knowledge problem. Instead, this is caused by some combination of three factors.

First, there is a natural tendency among data professionals to focus more on modeling than on upstream data quality issues. My own field of data science is certainly guilty of this. I am not saying data quality has received no attention. I am just saying that if epidemiology is anything like data science, then data quality issues get less attention than they deserve.

Second, our public health infrastructure is fragmented in a very particular way. Internationally, the WHO has no jurisdiction over individual countries and can only ask them to “consider reporting” their data. In the US, the Tenth Amendment leaves most public health work to the states rather than the federal government, and the CDC therefore cannot compel states to report data or to use its forms. This is a classic collective action problem, and one that people have warned about for decades.

Third, it’s really hard. It is currently quite onerous to fill out a case report form for negative test results, although one could imagine a world where negative results were easy to report.

The three levels of data org maturity, and a dream for the future

In my field of data science, you can determine the maturity of data organizations by their outlook on upstream data quality issues. There are roughly three levels.

Table 1: Levels of data maturity for a data organization.


Our public health infrastructure is at Level II. My strong belief, and I mean this in as constructive a way as possible, is that Level II is unacceptable. We need to be at Level III before the next pandemic hits.

I want us to live in a world where every country reports positive case counts and total test counts to the WHO, stratified by test reason, and using standardized easy-to-use technology. I want to live in a world where every state reports the same information to the CDC.

How do we get there? Here are some thoughts.

  • A cultural shift among researchers towards doing the hard work on upstream data quality issues. And yes, cultural shifts are possible.
  • Financial incentives for better case reporting. For example, while the CDC cannot legally compel states to do adequate reporting, it does provide financial assistance via the ELC Cooperative Agreement. The CDC could use this program as a financial carrot for better reporting, similar to how they have used other cooperative agreements to incentivize participation standards in cancer registries. One can imagine similar financial arrangements at the international level.
  • Technology. Even today many health providers fax case reports to state agencies, where someone then converts the data by hand into a CDC report. The CDC is aware of the need for interoperable technology, but clearly this work needs to be prioritized and accelerated.
  • Funding for public health agencies to make this all possible.

In the meantime, please consider supporting Our World In Data or following the Covid Tracking Project’s recommendation to contact your state public health authority. Both of these organizations have made a herculean effort to compile incomplete data.

   

The shower problem

Attention mathematicians and computer scientists: I’ve got a problem for you, and I don’t know the solution.

Here’s the setup: You’re at your friend’s place and you need to take a shower. The shower knob is unlabeled. One direction is hot and the other direction is cold, and you don’t know which is which.

You turn it to the left. It’s cold. You wait.

At what point do you switch over to the right?

The baseline shower problem

Let’s make this more explicit.

  • Your goal is to find a policy that minimizes the expected amount of time it takes to get hot water flowing out of the shower head. To simplify things, assume that the water coming out of the head is either hot or cold, and that the lukewarm transition time is effectively zero.
  • You know that the shower has a Time-To-Hot constant called $\tau$. This value is defined as the time it takes for hot water to arrive, assuming you have turned the knob to the hot direction and keep it there.
  • The constant $\tau$ is a fixed property of the shower and is sampled once from a known distribution. You have certain knowledge of the distribution, but you don’t know $\tau$.
  • The shower is memoryless, such that every time you turn the knob to the hot direction, it will take $\tau$ seconds until the hot water arrives, regardless of your prior actions. Every time you turn it to the cold direction, only cold water will come out.
Figure 1. Unbeknownst to the user, the hot water direction is to the left, with a time constant τ of 100 seconds. The user knows the probability distribution over τ, and follows a strategy of eliminating segments of that distribution. They initially guess the correct direction, but give up too soon at 75 seconds. They then spend another 75 seconds on the rightwards direction. Finally, they return to the left direction, passing through the 75 seconds they had already eliminated, and then finally getting the hot water after 100 seconds on the left direction. In all, it takes the user 250 seconds to find the hot water.


I don’t know how to solve this problem. But as a starting point I realize it’s possible to keep track of the probability that the hot direction is to the left or to the right. In the animation above, the probability that the hot direction is to the right is just the unexplored white area under the right curve, divided by the total unexplored white area of both curves.
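Here is a minimal sketch of that bookkeeping. The lognormal prior below is just a placeholder (the real problem comes with its own known distribution over $\tau$); the function simply computes the "unexplored area" ratio described above.

import numpy as np
from scipy import stats

# Placeholder prior over the Time-To-Hot constant tau, discretized on a grid.
tau_grid = np.linspace(0.1, 600, 6000)                 # seconds
prior = stats.lognorm(s=0.5, scale=100).pdf(tau_grid)  # stand-in distribution
prior /= prior.sum()                                   # normalize on the grid

def p_hot_is_right(explored_left, explored_right):
    # If hot were on a given side, tau would have to exceed the time already
    # spent fruitlessly on that side -- the "unexplored area" under that curve.
    survive_left = prior[tau_grid > explored_left].sum()
    survive_right = prior[tau_grid > explored_right].sum()
    return survive_right / (survive_left + survive_right)

# After 75 fruitless seconds on the left and none on the right:
print(p_hot_is_right(75, 0))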

But how do you turn that into a policy for exploring the space? Does anybody know?

Submissions

If you would like to submit a proposal, please report your average duration for the sample of 20,000 $\tau$’s provided here. Currently, Cameron Davidson-Pilon is in the lead with an average duration of 111.365 seconds.
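For anyone who wants to tinker before making a real submission, here is the rough shape of an evaluation harness. The lognormal sample is only a stand-in for the provided $\tau$'s (swap in the real file), the 50/50 prior on the hot direction is an assumption, and the alternating doubling schedule is deliberately naive.

import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the provided sample; replace with the real 20,000 tau's.
taus = rng.lognormal(mean=np.log(100), sigma=0.5, size=20_000)

def time_to_hot(tau, hot_is_left, thresholds):
    # Total seconds until hot water for a policy that alternates sides,
    # spending thresholds[i] seconds on the i-th visit (starting on the left).
    # The shower is memoryless: each visit to the hot side needs a full tau
    # seconds of continuous waiting.
    elapsed = 0.0
    on_left = True
    for budget in thresholds:
        if on_left == hot_is_left and tau <= budget:
            return elapsed + tau          # hot water arrives mid-visit
        elapsed += budget                 # give up and switch sides
        on_left = not on_left
    return np.inf                         # schedule exhausted (shouldn't happen here)

thresholds = [75 * 2 ** i for i in range(10)]   # naive doubling schedule

hot_is_left = rng.random(taus.size) < 0.5       # assume hot direction is 50/50
durations = [time_to_hot(t, h, thresholds) for t, h in zip(taus, hot_is_left)]
print(f"Average duration: {np.mean(durations):.1f} seconds")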

Bonus problem: Plumbing realities and the elusive “Middle Solution”

The baseline shower problem assumes a simplified version of reality, where the shower is memoryless and there is only a single pipe. If you want a harder problem, I have written a comment below that describes some of the plumbing realities, including lag and the existence of separate hot and cold pipes. The comment explores the tantalizing possibility that we’ve all been fiddling with our showers wrong this whole time. Instead of swinging the knob between one extreme and the other, what if the optimal solution is to start by putting the knob in the middle? To read more, see the comment below.

   

Optimizing sample sizes in A/B testing, Part III: Aggregate time-discounted expected lift

This is Part III of a three part blog post on how to optimize your sample size in A/B testing. Make sure to read Part I and Part II if you haven't already.

In Part II, we learned how, before the experiment even starts, we can estimate $\hat{L}$, the expected post-experiment lift: a probability-weighted average of outcomes.

In Part III, we’ll discuss how to estimate what is perhaps the most important per-unit cost of experimentation: the forfeited benefits that are lost by delayed shipment. This leads to something I think is incredibly cool: A formula for the aggregate time-discounted expected post-experiment lift as a function of sample size. We call this quantity $\hat{L}_a$. The formula for $\hat{L}_a$ allows you to pick optimal sample sizes specific to your business circumstances. We’ll cover two examples in Python, one where you are testing a continuous variable, and one where you are testing a binary variable (as in conversion rate experiments).

As usual, the focus will be on choosing a sample size at the beginning of the experiment and committing to it, not on dynamically updating the sample size as the experiment proceeds.

A quick modification from Part II

In Part II, we saw that if you ship whichever version (A or B) does best in the experiment, your business will on average experience a post-experiment per-user lift of

$$\hat{L} = \frac{\sigma_\Delta^2}{\sqrt{2\pi \left( \sigma_\Delta^2 + \frac{2\sigma_X^2}{n} \right)}}$$

where $\sigma_\Delta^2$ is the variance on your normally distributed zero-mean prior for $\mu_B - \mu_A$, $\sigma_X^2$ is the within-group variance, and $n$ is the per-bucket sample size.

Because Part III is primarily concerned with the duration of the experiment, we’re going to modify the formula to be time-dependent. As a simplifying assumption we’re going to make sessions, rather than users, the unit of analysis. We’ll also assume that you have a constant number of sessions per day. This changes the definition of $\hat{L}$ to a post-experiment per-session lift, and the formula becomes

$$\hat{L} = \frac{\sigma_\Delta^2}{\sqrt{2\pi \left( \sigma_\Delta^2 + \frac{2\sigma_X^2}{m\tau} \right)}}$$

where $m$ is the sessions per day for each bucket, and $\tau$ is the duration of the experiment in days.

Time Discounting

The formula above shows that larger sample sizes result in higher $ \hat{L} $, since larger samples make it more likely you will ship the better version. But as with all things in life, there are costs to increasing your sample size. In particular, the larger your sample size, the longer you have to wait to ship the winning bucket. This is bad because lift today is much more valuable than the same lift a year from now.

How much more valuable is lift today versus lift a year from now? A common way to quantify this is with exponential discounting, such that the weights (or “discount factors”) on future lift follow the form

$$e^{-rt}$$

where $ r $ is a discount rate. For startup teams, the annual discount rate might be quite large, like 0.5 or even 1.0, which would correspond to a daily discount rate $r$ of 0.5/365 or 1.0/365, respectively. Figure 1 shows an example of a discount rate of 1.0/365.

Figure 1.


Aggregate time-discounted expected lift: Visual Intuition

Take a look at Figure 2, below. It shows an experiment that is planned to run for $\tau = 60$ days. The top panel shows $\hat{L}$, which we have now defined as the expected per-session lift. While the experiment is running, $\hat{L} = 0$, since our prior is that $\Delta$ is sampled from a normal distribution with mean zero. But once the experiment finishes and we launch the winning bucket, we should begin to reap our expected per-session lift.

The middle panel shows our discount function.

The bottom panel shows our time-discounted lift, defined as the product of the lift in the top panel and the time discount in the middle panel. (We can also multiply it by $M$, the number of post-experiment sessions per day, which for simplicity we set to 1 here.) The aggregate time-discounted expected lift, $\hat{L}_a$, is the area under the curve.

Figure 2.
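To make "area under the curve" concrete, here is a small numerical sketch: compute the expected per-session lift from the formula above, treat it as zero while the experiment runs, discount each post-launch day, and add everything up. The parameter names are mine, and the daily sum only approximates the integral, so it should land close to (but not exactly on) the closed form derived in the next section.

import numpy as np

# Brute-force version of the area under the bottom panel of Figure 2.
def agg_lift_numeric(var_D, var_X, m, tau, r, M, horizon_days=50 * 365):
    # Expected per-session lift once the winner ships (zero while the experiment runs)
    L_hat = var_D / np.sqrt(2 * np.pi * (var_D + 2 * var_X / (m * tau)))
    days = np.arange(tau, horizon_days)           # days after shipping
    discounted_lift = L_hat * M * np.exp(-r * days)
    return discounted_lift.sum()                  # the area under the curve

print(agg_lift_numeric(var_D=1, var_X=100, m=100, tau=20, r=1 / 365, M=200))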


Now let’s see what happens with different experiment durations. Figure 3 shows that the longer you plan to run your experiment, the higher $\hat{L}$ will be (top panel). But due to time discounting, (middle panel), the area under the time-discounted lift curve (bottom panel) is low for overly large sample sizes. There is an optimal duration of the experiment (in this case, $\tau = 24$ days), that maximizes $\hat{L}_a$, the area under the curve.

Figure 3.


Aggregate time-discounted expected lift: Formula

The aggregate time-discounted expected lift $\hat{L}_a$, i.e. the area under the curve in the bottom panel of Figure 3, is:

$$\hat{L}_a = \frac{M e^{-r\tau}}{r} \cdot \frac{\sigma_\Delta^2}{\sqrt{2\pi \left( \sigma_\Delta^2 + \frac{2\sigma_X^2}{m\tau} \right)}}$$

where $ \tau $ is the duration of the experiment and $M$ is the number of post-experiment sessions per day. See the Appendix for a derivation.
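For concreteness, here is a minimal sketch of this closed form in Python. It is my own reading of the formula above rather than the notebook's get_agg_lift_via_closed_form (though it takes the same style of arguments), and the grid search at the end stands in for something like find_optimal_tau.

import numpy as np

def agg_lift_closed_form(var_D, var_X, m, tau, r, M):
    # Aggregate time-discounted expected lift L_a for an experiment of tau days,
    # with m sessions/day/bucket, daily discount rate r, and M post-experiment
    # sessions per day. A sketch of the formula above, not the notebook's code.
    L_hat = var_D / np.sqrt(2 * np.pi * (var_D + 2 * var_X / (m * tau)))
    return M * L_hat * np.exp(-r * tau) / r

# Scan over durations to find the tau that maximizes L_a (cf. Figure 4).
taus = np.arange(1, 365)
la = agg_lift_closed_form(var_D=1, var_X=100, m=100, tau=taus, r=1 / 365, M=200)
print("Optimal duration (days):", taus[np.argmax(la)])

For the parameters of Example 1 below, this scan lands on an optimal duration of about 18 days, consistent with the example.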

There are two things to note about this formula.

  1. Increasing the number of bucketed sessions per day, $m$, always increases $\hat{L}_a$.
  2. Increasing the duration of the experiment, $\tau$, may or may not help. Its impact is controlled by competing forces in the numerator and denominator. In the numerator, higher $\tau$ decreases $\hat{L}_a$ by delaying shipment. In the denominator, higher $\tau$ increases $\hat{L}_a$ by making it more likely you will ship the superior version.

Optimizing sample size

At long last, we can answer the question, “How long should we run this experiment?”. A nice way to do it is to plot $\hat{L}_a$ as a function of $\tau$. Below we see what this looks like for one set of parameters. Here the optimal duration is 38 days.

Figure 4.


Note also that a set of simulated experiment and post-experiment periods (in blue) confirms the predictions of the closed-form solution (in gray). See the notebook for details.

Examples in Python

Example 1: Continuous variable metric

Let’s say you want to run an experiment comparing two different versions of a website, and your main metric is revenue per session. You know in advance that the within-group variance of this metric is $\sigma_X^2 = 100$. You don’t know which version is better but you have a prior that the true difference in means is normally distributed with variance $\sigma_\Delta^2 = 1$. You have 200 sessions per day and plan to bucket 100 sessions into Version A and 100 sessions into Version B, running the experiment for $\tau=20$ days. Your discount rate is fairly aggressive at 1.0 annually, or $r = 1/365$ per day. Using the function in the notebook, you can find $\hat{L}_a$ with this command:

get_agg_lift_via_closed_form(var_D=1, var_X=100, m=100, tau=20, r=1/365, M=200)
# returns 26298

You can also use the find_optimal_tau function to determine the optimal duration, which in this case is $\tau=18$.

Example 2: Conversion rates

Let’s say your main metric is conversion rate. You think that on average conversion rates will be about 10%, and that the difference in conversion rates between buckets will be normally distributed with variance $\sigma_\Delta^2 = 0.01^2$. Using the normal approximation of the binomial distribution, you can use p*(1-p) for var_X.

p = 0.1
get_agg_lift_via_closed_form(var_D=0.01**2, var_X=p*(1-p), m=100, tau=25, r=1/365, M=200)
# returns 207

You can also use the find_optimal_tau function to determine the optimal duration, which in this case is $\tau=49$.

FAQ

Q: Has there been any similar work on this?

A: As I was writing this, I came across a fantastic in-press paper by Elea Feit and Ron Berman. The paper is exceptionally clear and I would recommend reading it. Like this blog post, Feit and Berman argue that it doesn’t make any sense to pick sample sizes based on statistical significance and power thresholds. Instead they recommend profit-maximizing sample sizes. They independently come to the same formula for $ \hat{L} $ as I do (see the right addend in their Equation 9, making sure to substitute my $\frac{\sigma_\Delta^2}{2}$ for their $\sigma^2$). Where they differ is that they assume there is a fixed pool of $N$ users that can only experience the product once. In their setup, you can allocate $n_1$ users to Bucket A and $n_2$ users to Bucket B. Once you have identified the winning bucket, you ship that version to the remaining $N-n_1-n_2$ users. Your expected profit is determined by the total expected lift from those users. My experience in industry differs from this setup. In my experience there is no constraint that you can only show the product once to a fixed set of users. Instead, there is often an indefinitely increasing pool of new users, and once you ship the winning bucket you can ship it to everyone, including users who already participated in the experiment. To me, the main constraint in industry is therefore time discounting, rather than a finite pool of users.

Q: In addition to the lift from shipping a winning bucket, doesn’t experimentation also help inform us about the types of products that might work in the future? And if so, doesn’t this mean we should run experiments longer than recommended by your formula for $\hat{L}_a$?

A: Yes, experimentation can teach lessons that are generalizable beyond the particular product being tested. This is an advantage of high powered experimentation not included in my framework.

Q: What about novelty effects?

A: Yup, that’s a real concern not covered by my framework. You probably want to know a somewhat long term impact of your product, which means you should probably run the experiment for longer than recommended by my framework.

Q: If some users can show up in multiple sessions, doesn’t bucketing by session violate independence assumptions?

A: Yeah, so this is tricky. For many companies, there is a distribution of user activity, where some users come for many sessions per week and other users come for only one session at most. Modeling this would make the framework significantly more complicated, so I tried to simplify things by making sessions the unit of analysis.

Q: Is there anything else on your blog vaguely related to this topic?

A: I’m glad you asked!

Appendix

The aggregate time-discounted expected lift $\hat{L}_a$ is

$$\hat{L}_a = \int_\tau^\infty \hat{L} \, M e^{-rt} \, dt$$

where $\hat{L}$ is the expected per-session lift, $M$ is the number of post-experiment sessions per day, $r$ is the discount rate, and $ \tau $ is the duration of the experiment. Solving the integral gives:

$$\hat{L}_a = \frac{M \hat{L} \, e^{-r\tau}}{r}$$

Plugging in our previously solved value of $\hat{L}$ gives

$$\hat{L}_a = \frac{M e^{-r\tau}}{r} \cdot \frac{\sigma_\Delta^2}{\sqrt{2\pi \left( \sigma_\Delta^2 + \frac{2\sigma_X^2}{m\tau} \right)}}$$

   

Optimizing sample sizes in A/B testing, Part II: Expected lift

This is Part II of a three-part blog post on how to optimize your sample size in A/B testing. Make sure to read Part I if you haven't already.

In this blog post (Part II), I describe what I think is an incredibly cool business-focused formula that quantifies how much you can benefit from increasing your sample size. It is, in short, an average of the value of all possible outcomes of the experiment, weighted by their probabilities. This post starts off kind of dry, but if you can make it through the first section, it gets a lot easier.

Outcome probabilities

Imagine you are comparing two versions of a website. You currently are on version A, but you would like to compare it to version B. Imagine you are measuring some random variable $X$, which might represent something like clicks per user or page views per user. The goal of the experiment is to determine which version of the website has a higher mean value of $X$.

This blog post aims to quantify the benefit of experimentation as an average of the value of all possible outcomes, weighted by their probabilities. To do that, we first need to describe the probabilities of all the different outcomes. An outcome consists of two parts: a true difference in means, $\Delta$, defined as

$$\Delta = \mu_B - \mu_A$$

and an experimentally observed difference in means, $\delta$, defined as

$$\delta = \overline{X}_B - \overline{X}_A$$
Let’s start with $\Delta$. While you don’t yet know which version of the website is better (that’s what the experiment is for!), you have a sense for how important the product change is. You can therefore create a normally distributed prior on $\Delta$ with mean zero and variance $ \sigma_\Delta^2 $.

Figure 1.


Next, let’s consider $\delta$, your experimentally observed difference in means. It will be a noisy estimate of $\Delta$. Let’s assume you have previously measured the variance of $X$ to be $ \sigma_X^2 $. It is reasonable to assume that within each group in the experiment, and for any particular $\Delta$, the variance of $X$ will still be $ \sigma_X^2$. You should therefore believe that for any particular $\Delta$, the observed difference in means $\delta$ will be sampled from a normal distribution $\mathcal{N}(\Delta, \sigma_c^2)$, where

$$\sigma_c^2 = \frac{2\sigma_X^2}{n}$$

and where $n$ is the sample size in each of the two buckets. If that doesn’t make sense, check out this video.

Collectively, this all forms a bivariate normal distribution of outcomes, shown below.

Figure 2. Probabilities of possible outcomes, based on your prior beliefs. The horizontal axis is the true difference in means, and the vertical axis is the observed difference in means.
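As a sketch, this joint distribution can be written down directly: $\Delta$ has variance $\sigma_\Delta^2$, and $\delta$ equals $\Delta$ plus sampling noise with variance $\sigma_c^2$, so their covariance is $\sigma_\Delta^2$. The parameter values below are arbitrary.

import numpy as np
from scipy import stats

var_delta_prior = 1.0        # sigma_Delta^2, prior variance on the true difference
var_x = 100.0                # sigma_X^2, within-group variance
n = 1000                     # per-bucket sample size
var_c = 2 * var_x / n        # sigma_c^2, sampling variance of the observed difference

# delta = Delta + noise, so Cov(Delta, delta) = Var(Delta).
cov = np.array([
    [var_delta_prior, var_delta_prior],
    [var_delta_prior, var_delta_prior + var_c],
])
outcomes = stats.multivariate_normal(mean=[0, 0], cov=cov)

# e.g. probability of a "bad ship": observed delta > 0 but true Delta < 0
samples = outcomes.rvs(size=200_000, random_state=0)
bad_ship = np.mean((samples[:, 1] > 0) & (samples[:, 0] < 0))
print(f"P(ship B when A was truly better): {bad_ship:.3f}")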


To gain some more intuition about this, take a look at Figure 3. As sample size increases, $ \sigma^2_c $ decreases.

Figure 3.


Outcome lifts 

Now that we know the probabilities of all the different outcomes, we next need to estimate how much per-user lift, $l$, we will gain from each possible outcome, assuming we follow a policy of shipping whichever bucket (A or B) looked better in the experiment.

  • In cases where $\delta > 0$ and $\Delta > 0$, you would ship B and your post-experiment per-user lift will be positively valued at $l = \Delta$.
  • In cases where $\delta > 0$ and $\Delta < 0$, you would ship B, but unfortunately your post-experiment per-user lift will be negatively valued at $l = \Delta$, since $\Delta$ is negative.
  • In cases where $\delta < 0$, you would keep A in production, and your post-experiment lift would be zero.

A heatmap of the per-user lifts ($l$) for each outcome is shown in the plot below. Good outcomes, where shipping B was the right decision, are shown in blue. Bad outcomes, where shipping B was the wrong decision, are shown in red. There are two main ways to get a neutral outcome, shown in white. Either you keep A (bottom segment), in which case there is zero lift, or you ship B where B is only negligibly different from A (vertical white stripe).

Figure 4. Heatmap of possible outcomes, where the color scale represents the lift, $l$. The horizontal axis is the true difference in means, and the vertical axis is the observed difference in means.


Probability-weighted Outcome Lifts

At this point, we know the probability of each outcome, and we know the post-experiment per-user lift of each outcome. To determine how much lift we can expect, on average, by shipping the winning bucket of an experiment, we need to compute a probability-weighted average of the outcome lifts. Let’s start by looking at this visually and then later we’ll get into the math.

As shown in Figure 5, if we multiply the bivariate normal distribution (left) by the lift map (center), we can obtain the probability-weighted lift of each outcome (right).

Figure 5.


The good outcomes contribute more than the bad outcomes, simply because a good outcome is more likely than a bad outcome. To put it differently, experimentation will on average give you useful information.

To gain some more intuition on this, it is helpful to see this plot for different sample sizes. As sample size increases, the probability-weighted contribution of bad outcomes gets smaller and smaller. 

Figure 6.


Computing the expected post-experiment per-user lift

We’re almost there! To determine the expected post-experiment lift from shipping the winning bucket, we need to compute a probability-weighted average of all the post-experiment lifts. In other words, we need to sum up all the probability-weighted post-experiment lifts in the right panel of Figure 5. The formula for doing this is shown below. A derivation can be found in the Appendix.

$$\hat{L} = \frac{\sigma_\Delta^2}{\sqrt{2\pi \left( \sigma_\Delta^2 + \frac{2\sigma_X^2}{n} \right)}}$$

There are three things to notice about this formula.

  • As $n$ increases, $\hat{L}$ increases. This makes sense. The larger the sample size, the more likely it is that you’ll ship the winning bucket.
  • As the within-group variance $\sigma_X^2$ increases, $\hat{L}$ decreases. That’s because a high within-group variance makes experiments less informative – they’re more likely to give you the wrong answer.
  • As the variance of the prior on $\Delta$ increases, $\hat{L}$ increases. This also makes sense. The more impactful (positive or negative) you think the product change might be, the more value you will get from experimentation.

You can try this out using the get_lift_via_closed_form function in the Notebook.
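For reference, here is a minimal sketch of that closed form. It is my own translation of the formula above, not the notebook's get_lift_via_closed_form, and the example call just reuses the parameters from the continuous-variable demonstration below at one arbitrary sample size.

import numpy as np

def lift_closed_form(var_D, var_X, n):
    # Expected post-experiment per-user lift, L_hat, from shipping whichever
    # bucket wins an experiment with n users per bucket (formula above).
    return var_D / np.sqrt(2 * np.pi * (var_D + 2 * var_X / n))

print(lift_closed_form(var_D=2, var_X=100, n=500))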

Demonstration via simulation

In the previous section, we derived a formula for $\hat{L}$. Should you trust a formula you found on a random internet blog? Yes! Let’s put the formula to the test, by comparing its predictions to actual simulations.

First, let’s consider the case where the outcome is a continuous variable, such as the number of clicks. Let’s set $ \sigma_\Delta^2 = 2 $ and $ \sigma_X^2 = 100 $. We then measure $\hat{L}$ for a range of sample sizes, using both the closed-form solution and simulations. To see how we determine $\hat{L}$ for simulations, refer to the box below.

Procedure for finding $\hat{L}$ with simulations

Loop through thousands of simulated experiments. For each experiment, do the following:

  1. Sample a true group difference $\Delta$ from $\mathcal{N}(0, \sigma_\Delta^2)$
  2. Sample an $X$ for each of the $n$ users in buckets A and B, using Normal distributions $\mathcal{N}(-\frac{\Delta}{2}, \sigma_X^2)$ and $\mathcal{N}(\frac{\Delta}{2}, \sigma_X^2)$, respectively.
  3. Compute $ \delta = \overline{X}_B - \overline{X}_A $.
  4. If $\delta \leq 0$, stick with A and accrue zero lift.
  5. If $\delta > 0$, ship B and accrue the per-user lift of $\Delta$, which will probably, but not necessarily, be positive.
We run these experiments thousands of times, each time computing the per-user lift. Finally, we average all the per-user lifts together to get $\hat{L}$. See the get_lift_via_simulations_continuous function in the notebook for an implementation.
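Here is a compact version of that loop, written as my own sketch rather than the notebook's implementation.

import numpy as np

def lift_via_simulation(var_D, var_X, n, n_experiments=20_000, seed=0):
    # Monte Carlo estimate of L_hat: average per-user lift from shipping
    # whichever bucket wins, following the five steps above.
    rng = np.random.default_rng(seed)
    lifts = np.zeros(n_experiments)
    for i in range(n_experiments):
        Delta = rng.normal(0, np.sqrt(var_D))                    # step 1: true difference
        xA = rng.normal(-Delta / 2, np.sqrt(var_X), size=n)      # step 2: bucket A
        xB = rng.normal(+Delta / 2, np.sqrt(var_X), size=n)      #         bucket B
        delta = xB.mean() - xA.mean()                            # step 3: observed difference
        lifts[i] = Delta if delta > 0 else 0.0                   # steps 4-5: ship or hold
    return lifts.mean()

print(lift_via_simulation(var_D=2, var_X=100, n=500))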

As seen in Figure 7, below, the results of the simulation closely match the closed-form solution.

Figure 7.


Second, let’s consider the case where the variable is binary, as in conversion rates. For reasonably large values of $ n $, we can use the normal approximation of the binomial distribution, with within-group variance $ \sigma_X^2 = p(1-p) $, where $ p $ is the baseline conversion rate. For this example, let’s set the baseline conversion rate $p = 0.1$, and let’s set $ \sigma_\Delta^2 = 0.01^2 $. The results of the simulation closely match the closed-form solution.

Figure 8.


Thinking about costs, and a preview of Part III

In this blog post, we saw how increasing the sample size improves the expected post-experiment per-user lift, $\hat{L}$. But to determine the optimal sample size, we need to think about costs.

The cost in dollars of an experiment can be described as $f + vn$, where $f$ is a fixed cost and $ v $ is the variable cost per participant. If you already know these costs, and if you already know the revenue increase $ u $ from each unit increase in lift, you can calculate the net revenue $R$ as

$$R = u\hat{L} - (f + vn)$$

and then find the sample size $ n $ that maximizes $ R $.
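If you did happen to know $f$, $v$, and $u$, the optimization itself is just a grid search over $n$. The cost and revenue numbers below are made up purely for illustration; the lift formula is the one derived earlier in this post.

import numpy as np

def lift_closed_form(var_D, var_X, n):
    # Expected post-experiment per-user lift, L_hat (formula derived above)
    return var_D / np.sqrt(2 * np.pi * (var_D + 2 * var_X / n))

def net_revenue(n, u, f, v, var_D, var_X):
    # R = u * L_hat - (f + v*n), with made-up cost and revenue parameters
    return u * lift_closed_form(var_D, var_X, n) - (f + v * n)

ns = np.arange(10, 100_000, 10)
R = net_revenue(ns, u=1e6, f=5_000, v=0.10, var_D=1.0, var_X=100.0)
print("Revenue-maximizing per-bucket sample size:", ns[np.argmax(R)])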

Unfortunately, these costs aren’t always readily available. The good news is that there is a really nice way to calculate the most important cost: the forfeited benefit that comes from prolonging your experiment. To read about that, and about how to optimize your sample size, please continue to Part III.

Appendix

To determine $\hat{L}$, we start with the probability-weighted lifts on the right panel of Figure 5. This is a bivariate normal distribution over $ \Delta $ and $ \delta $, multiplied by $ \Delta $:

$$f(\Delta, \delta) = \frac{\Delta}{2\pi \sigma_\Delta \sigma_\delta \sqrt{1-\rho^2}} \exp\left( -\frac{1}{2(1-\rho^2)} \left[ \frac{\Delta^2}{\sigma_\Delta^2} - \frac{2\rho\Delta\delta}{\sigma_\Delta\sigma_\delta} + \frac{\delta^2}{\sigma_\delta^2} \right] \right)$$

where the correlation coefficient $ \rho $ is defined as

$$\rho = \frac{\sigma_\Delta}{\sigma_\delta}$$

(this follows because $\mathrm{Cov}(\Delta, \delta) = \sigma_\Delta^2$), and $\sigma_\delta^2$ is the variance on $\delta$. By the variance addition rules, $\sigma_\delta^2$ is defined as

$$\sigma_\delta^2 = \sigma_\Delta^2 + \sigma_c^2$$

We next need to sum up the probability-weighted values in $f(\Delta, \delta)$. To obtain a closed form solution, we can use integration.

$$\hat{L} = \int_{0}^{\infty} \int_{-\infty}^{\infty} f(\Delta, \delta) \, d\Delta \, d\delta$$

The integration limits on $ \delta $ start at zero because the lift will always be zero if $ \delta < 0 $ (i.e. if the status quo bucket A wins the experiment).

Thanks to my 15-day free trial of Mathematica, I determined that this integral comes out to the surprisingly simple

$$\hat{L} = \frac{\rho \, \sigma_\Delta}{\sqrt{2\pi}}$$

The command I used was:

Integrate[(t / (2*\[Pi]*s1*s2*Sqrt[1 - p^2]))*Exp[-((t^2/s1^2 - \
(2*p*t*d)/(s1*s2) + d^2/s2^2)/(2*(1 - p^2)))], {d, 0, \[Infinity]}, \
{t, -\[Infinity], \[Infinity]}, Assumptions -> p > 0 && p < 1 && s1 > \
0 && s2 > 0]
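As a sanity check (mine, not part of the original derivation), the same double integral can also be evaluated numerically in Python and compared against $\frac{\rho \, \sigma_\Delta}{\sqrt{2\pi}}$.

import numpy as np
from scipy import integrate

s1, s2 = 1.0, 1.5          # sigma_Delta and sigma_delta, arbitrary test values
p = s1 / s2                # rho = sigma_Delta / sigma_delta

def integrand(t, d):       # t = Delta (inner variable), d = delta (outer variable)
    z = (t**2 / s1**2 - 2 * p * t * d / (s1 * s2) + d**2 / s2**2) / (2 * (1 - p**2))
    return t / (2 * np.pi * s1 * s2 * np.sqrt(1 - p**2)) * np.exp(-z)

# delta runs from 0 upward, Delta over the whole real line (wide finite bounds)
numeric, _ = integrate.dblquad(integrand, 0, 20 * s2, -20 * s1, 20 * s1)
closed_form = p * s1 / np.sqrt(2 * np.pi)
print(numeric, closed_form)   # the two values should agree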

If we then substitute in the previously defined formulas for $ \rho $ and $ \sigma_c^2 $, we can produce a formula that accepts more readily-available inputs:

$$\hat{L} = \frac{\sigma_\Delta^2}{\sqrt{2\pi \left( \sigma_\Delta^2 + \frac{2\sigma_X^2}{n} \right)}}$$

Continue to Part III.

   

Optimizing sample sizes in A/B testing, Part I: General summary

A special thanks to John McDonnell, who came up with the idea for this post. Thanks also to Marika Inhoff and Nelson Ray for comments on an earlier draft.

If you’re a data scientist, you’ve surely encountered the question, “How big should this A/B test be?”

The standard answer is to do a power analysis, typically aiming for 80% power at $\alpha$=5%. But if you think about it, this advice is pretty weird. Why is 80% power the best choice for your business? And doesn’t a 5% significance cutoff seem pretty arbitrary?

In most business decisions, you want to choose a policy that maximizes your benefits minus your costs. In experimentation, the benefit comes from learning information to drive future decisions, and the cost comes from the experiment itself. The optimal sample size will therefore depend on the unique circumstances of your business, not on arbitrary statistical significance thresholds.

In this three-part blog post, I’ll present a new way of determining optimal sample sizes that completely abandons the notion of statistical significance.

  • Part I: General Overview. Starts with a mostly non-technical overview and ends with a section called “Three lessons for practitioners”.
  • Part II: Expected lift. A more technical section that quantifies the benefits of experimentation as a function of sample size.
  • Part III: Aggregate time-discounted lift. A more technical section that quantifies the costs of experimentation as a function of sample size. It then combines costs and benefits into a closed-form expression that can be optimized. Ends with an FAQ.

Throughout Parts I-III, the focus will be on choosing a sample size at the beginning of the experiment and committing to it, not on dynamically updating the sample size as the experiment proceeds.

With that out of the way, let’s get started!

Benefits of large samples

The bigger your sample size, the more likely it is that you’ll ship the right bucket. Since there is a gain to shipping the right bucket and a loss to shipping the wrong bucket, the average benefit of the experiment is a probability-weighted average of these outcomes. We call this the expected post-experiment lift, $\hat{L}$, which increases with sample size. We’ll cover this in more detail in Part II.

Costs of large samples

For most businesses, increasing your sample size requires you to run your experiment longer. This brings us to the main per-unit cost of experimentation: the forfeited benefits that could come from shipping the winning bucket earlier. In a fast moving startup, there’s often good reason to accrue your wins as soon as possible. The advantage of shipping earlier can be quantified with a discount rate, which describes how much you value the near future over the distant future. If you have a high discount rate, it’s critical to ship as soon as possible. If you have a low discount rate, you can afford to wait longer. This is described in more detail in Part III.

Combining costs and benefits into an optimization function

You should run your experiment long enough that you’ll likely ship the winning bucket, but not so long that you waste time not having shipped your product. The optimal duration depends on the unique circumstances of your business. The overall benefit of running an experiment, as a function of duration and other parameters, is defined as the aggregate time-discounted expected post-experiment lift, or $\hat{L}_a$.

Figure 1 shows $\hat{L}_a$ as a function of experiment duration in days ($\tau$) for one particular set of business parameters. The gray curve shows the result of a closed form solution presented in Part III. The blue curve shows the results of simulated experiments. As you can see, the optimal duration for this experiment should be about 38 days. Simulations match the closed-form predictions.

Figure 1. Aggregate time-discounted expected post-experiment lift ($\hat{L}_a$) as a function of experiment duration in days ($\tau$), for a fairly typical set of business parameters.


Three lessons for practitioners

I played around with the formula for $\hat{L}_a$ and came across three lessons that should be of interest to practitioners.

1. You should run “underpowered” experiments if you have a very high discount rate

Take a look at Figure 2, which shows some recommendations for a fairly typical two-bucket conversion rate experiment with 1000 sessions per bucket per day. On the left panel we plot the optimal duration as a function of the annual discount rate. If you have a high discount rate, you care a lot more about the near future than the distant future. It is therefore critical that you ship any potentially winning version as soon as possible. In this scenario, the optimal duration is low (left panel). Power, the probability you will find a statistically significant result, is also low (right panel). In many of these cases, an experiment run for the optimal duration would traditionally be considered “underpowered”.

Figure 2.


2. You should run “underpowered” experiments if you have a small user base

Now let’s plot these curves as a function of $m$, our daily sessions per bucket. If you only have a small number of daily sessions to work with, you’ll need to run the experiment for longer (left panel). So far, that’s not surprising. But here’s where it gets interesting: Even though optimal duration increases as $m$ decreases, it doesn’t increase fast enough to maintain constant power (right panel). In fact, for low $m$ scenarios where you don’t have a lot of users, the optimal duration results in power that can drop below 50%, well into what would traditionally be considered “underpowered” territory. In these situations, waiting to get a large number of sessions causes too much time-discounting loss.

Figure 3.


3. That said, it’s far better to run your experiment too long than too short

Let’s take another look at $\hat{L}_a$ as a function of duration. As shown in Figure 4 below, the left shoulder is steeper than the right shoulder. This means that it’s really bad if your experiment is shorter than optimal, but it’s kind of ok if your experiment is longer than optimal.

Figure 4. Aggregate time-discounted expected post-experiment lift ($\hat{L}_a$) as a function of experiment duration in days ($\tau$), for a fairly typical set of business parameters.


Is this true in general? Yes. Below we plot $\hat{L}_a$ as a function of duration for various combinations of $m$ and the discount rate, $r$. For all of these parameter combinations, it’s better to run a bit longer than optimal than a bit shorter than optimal. The only exception is if you have an insanely high discount rate (not shown).

Figure 5.


Upcoming posts and Python notebook

You probably have a lot of questions about where this framework comes from and how it is justified. Part II and Part III dive more deeply into the math and visual intuition behind it. They also contain some example uses of the get_lift_via_closed_form and get_agg_lift_via_closed_form functions available in the accompanying Python Notebook.