This post was written by Garret O'Connell, with contributions from Carlos Bentes and Kenil Cheng. It was supported by the work of the Bolt Experimentation Team: Gabriela-Raluca Săsăran, Andrey Kuzmin, August Hovland, and Pavel Kiper.
TL;DR
We explored a two-stage CUPED variance reduction approach to cut the time switchback tests take to reach statistical power. We found major reductions in test duration, in the range of 25-50%. We also flag some factors that appear to modulate this effect and propose extensions that could reduce test durations further.
Background
It's been over a year since we at Bolt's Experimentation Team presented some of our early learnings from running switchback tests (see Tips and considerations for switchback test designs) to account for network interference effects in 2-sided marketplaces. Since then, we've heard mainly one request from teams using switchback tests: make them FASTER!
Switchback test designs generally take much longer than user- or session-randomised tests to reach similar power levels because they have smaller sample sizes: there are fewer timeslices than users. This can slow development for teams that depend on switchback tests because their products affect marketplace dynamics. That, in turn, can harm the business, as these are often high-impact products (e.g. pricing) that we want to keep tuned as frequently as possible.
Variance reduction methods help A/B tests reach statistical power faster. CUPED (Controlled-experiment Using Pre-Experiment Data) is a popular variance reduction method, yet we haven't seen documented cases of its application in switchback tests (see here for a beginner's guide). There is an example from DoorDash using a related ML approach called CUPAC here, but for companies that don't have ML infrastructure in their data pipelines, CUPED can be more practically useful.
Historical metric as a covariate
CUPED usually takes user metric values from a pre-experiment period and uses this treatment-unrelated information to form an expected baseline for the experiment period. Removing this treatment-unrelated variance increases our sensitivity to treatment effects and lets us detect them sooner.
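For reference, the standard CUPED adjustment is a simple linear correction. Here is a minimal Python sketch (function and variable names are ours, for illustration only):

```python
import numpy as np

def cuped_adjust(y, x):
    # theta is the OLS slope of the outcome y on the covariate x
    theta = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
    # subtract the part of y explained by x; the mean of y is unchanged
    return y - theta * (x - np.mean(x))
```

Because the covariate is unrelated to treatment assignment, this subtraction shrinks the variance of the metric without shifting the expected treatment effect.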
One challenge in translating this approach to switchback test designs is that there is no "user" with historical data from which to build baselines, as switchbacks randomise over timeslices, which never repeat. However, we can consider timeslices from the same weekly periods (e.g., Mondays at 9-10 a.m.) as related in terms of the metric values we would expect. This allows us to build histories for timeslices and use this information about weekly seasonality to explain away some of the treatment-unrelated variance.
Here, we used these repeating weekly activity patterns as a CUPED covariate: as is recommended for CUPED, the covariate is the pre-experiment values of the outcome metric itself. This is the first of the two CUPED stages we applied. Let's call this stage CUPED-metric, as it uses values from the same metric as a covariate.
Experiment vs. historical delta of demands as a covariate
CUPED can be used with multiple covariates, allowing us to remove treatment-unrelated variance further. We asked ourselves which other information could complement the usual CUPED-metric covariate.
A limitation of using only historical data is that it relies on past trends to explain activity during the experiment. If trends change between the past and the experiment period, CUPED's ability to reduce variance is weakened. To handle these changes, we can include a covariate that tracks shifts between the pre-experiment and experiment periods.
Before continuing, some of you might be concerned that using information from during the experiment could remove variance that is affected by (or belongs to) the treatment, risking an underestimated treatment-effect variance and biased results. However, the original CUPED paper (ref) explicitly suggests using within-experiment covariates as long as the covariate is unaffected by the treatment.
One example of such a covariate is the number of users with sessions (i.e., who opened the app) in a given timeslice, an indicator of demand during that time. This metric is unlikely to be affected by any feature exposed during the session: users only see such features after opening the app, at which point the session has already been counted towards the metric.
As a second CUPED covariate, we included the delta of this demand metric (the number of users with sessions in a period) between the pre-experiment and within-experiment periods. This allowed us to complement the purely historical information of the CUPED-metric covariate with an indicator of recent shifts for a given period. Let's call this stage CUPED-demand.
CUPED calculation steps
Below is an overview of how CUPED-metric and CUPED-demand can be implemented in a simple way. In this example, we use revenue as the outcome metric. The variable timeslice_id is an identifier for unique timeslices, whereas timeperiod identifies the "day of week x hour" slot in which the user started their first session within a timeslice.
The basic steps were as follows (a minimal code sketch follows the list):
1. Pull unit-level test data for the test and pre-test windows.
2. Join the test and pre-test tables and calculate the inputs to the CUPED parameters:
   - average values of the metric and demand over pre-test periods
   - average values of the metric and demand over test timeslices
   - delta of within-timeslice vs. pre-experiment demand
3. Calculate theta and covariate averages over periods.
4. Calculate CUPED for the metric using historical data of the metric (labelled as metric_cuped).
5. Repeat steps 2-4 using the CUPED-metric as the metric and the delta demand as the covariate to calculate CUPED-demand (labelled as demand_cuped).
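The following is a minimal pandas sketch of the two stages under an assumed table layout (test: one row per user x timeslice with columns timeslice_id, timeperiod, revenue, and demand, the timeslice-level count of users with sessions; pre: the same columns for the pre-experiment window). It is an illustration, not our production implementation:

```python
import numpy as np
import pandas as pd

def cuped(df, metric, covariate):
    # One CUPED stage: residualise `metric` on `covariate`
    x, y = df[covariate], df[metric]
    theta = np.cov(x, y)[0, 1] / x.var(ddof=1)
    return y - theta * (x - x.mean())

# Step 2: join pre-experiment baselines per weekly period onto the test data
pre_means = pre.groupby("timeperiod").agg(
    revenue_pre=("revenue", "mean"),
    demand_pre=("demand", "mean"),
).reset_index()
test = test.merge(pre_means, on="timeperiod", how="left")

# Steps 3-4: CUPED-metric, using the metric's own history as the covariate
test["metric_cuped"] = cuped(test, metric="revenue", covariate="revenue_pre")

# Step 5: CUPED-demand, re-running the stage with the demand delta as covariate
test["demand_delta"] = test["demand"] - test["demand_pre"]
test["demand_cuped"] = cuped(test, metric="metric_cuped", covariate="demand_delta")
```

As noted in the summary below, the same group-by and join logic translates directly into SQL.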
CUPED evaluation approach
To evaluate how much faster tests could reach the target power using CUPED metrics, we calculated the required sample size to reach 80% power for a Minimum Detectable Effect (MDE) of 2% at an alpha of 0.05 (5%). Since the required sample size increases proportionally with test duration in switchback tests, we used the required sample size as a proxy for CUPED's impact on test speed.
Switchback tests produce "clustered" data, where observations within a timeslice cluster can share characteristics. This needs to be corrected for when estimating variance in statistical tests (see our previous post for details). To do this, we applied a Variance Inflation Factor (VIF), a correction factor applied to required sample sizes or standard errors to account for the correlation between units within timeslices. Using the VIF is analogous to using Mixed Linear Models with timeslice as a random factor (see here for an example).
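For reference, a common form of the VIF (the "design effect" from clustered sampling) is 1 + (m - 1) x ICC, where m is the average cluster size and ICC is the intraclass correlation within timeslices. A minimal sketch of how it can enter a sample-size calculation, using a textbook two-sample z-test formula (our exact calculation may differ):

```python
import numpy as np
from scipy import stats

def required_n_per_arm(sigma, mde, icc, avg_cluster_size, alpha=0.05, power=0.80):
    # Standard two-sided z-test sample size for detecting an absolute effect `mde`
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_power = stats.norm.ppf(power)
    n_iid = 2 * (z_alpha + z_power) ** 2 * sigma**2 / mde**2
    # Inflate for within-timeslice correlation (the design effect / VIF)
    vif = 1 + (avg_cluster_size - 1) * icc
    return int(np.ceil(n_iid * vif))
```

Anything that lowers sigma (CUPED) or the ICC (de-correlating units within timeslices, discussed below) shrinks the required sample size and thus the test duration.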
A/A test data set
We evaluated the required sample sizes in one large city (~3.5M residents) using two weeks of user data each for the pre- and within-experiment windows. One-hour timeslices were randomised into treatment and control using an alternating schedule. Weekly periods were defined at one-hour granularity (day of week x hour). The outcome metric was the sum of revenue per user per timeslice, and the demand metric was the number of users with sessions per timeslice. Evaluations compared required sample sizes between the non-CUPED, CUPED-metric, and CUPED-demand metric versions.
CUPED evaluation results
Below are the results of evaluating each CUPED stage on the switchback A/A test user data. We don't show absolute numbers for confidentiality reasons; instead, we express the impact relative to the non-CUPED metric version.
CUPED evaluation results for city A (results relative to non-CUPED metric).
After applying both CUPED stages, we cut our test duration by over half. Most cities we investigated showed reductions in test duration in the 25-40% range. Like any CUPED approach, it works better for some metrics than others, but it can be applied equally to continuous, count, or rate metrics. We could think of ways to add more information about these weekly seasonalities that could explain variance over timeslices. For example, the above CUPED procedure could be applied separately for different geographical regions of a city.
We included results of the impact on the variance between and within timeslices to illustrate one particular point of how CUPED for switchbacks works. The results show that CUPED does nothing for the variance within a timeslice (between users) but reduces the variance between timeslices. This is because we used timeslice-level information in the CUPED approach, not user-level.
When we reduce the variance between timeslices, we also reduce the extent to which our data is clustered: if timeslices become more similar to each other, the relative similarity of users within the same timeslice decreases. This means we don't have to correct as much for clustering, as seen in how CUPED decreases the VIF. CUPED over periods thus works by "de-correlating" units within each timeslice (the same point is made here in the Appendix section).
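One way to see this de-correlation directly is to compare the ICC of the metric before and after CUPED. A sketch using the standard one-way ANOVA ICC estimator (column names follow the earlier sketch):

```python
import pandas as pd

def icc(df, metric, cluster="timeslice_id"):
    # One-way ANOVA estimator: share of total variance that sits between clusters
    grand_mean = df[metric].mean()
    groups = df.groupby(cluster)[metric]
    n_bar = groups.size().mean()  # average cluster (timeslice) size
    ms_between = (groups.size() * (groups.mean() - grand_mean) ** 2).sum() / (groups.ngroups - 1)
    ms_within = groups.apply(lambda g: ((g - g.mean()) ** 2).sum()).sum() / (len(df) - groups.ngroups)
    return (ms_between - ms_within) / (ms_between + (n_bar - 1) * ms_within)

# icc(test, "revenue") vs. icc(test, "demand_cuped"): the CUPED version should be
# lower, which in turn lowers the VIF = 1 + (n_bar - 1) * ICC.
```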
We've noted that for some cities, the length of the period or timeslice can be influential. For example, below are the results for city B for different timeslice lengths. Note: the baseline differs for each timeslice length; it is the non-CUPED metric version for that length.
CUPED evaluation results for city B (note: baselines are specific to each level of timeslice length).
We see that the gains for longer timeslices were around double those for shorter timeslices. Tests with longer timeslices can, therefore, enjoy a larger relative impact. Note that this does not mean such tests will run faster than those with shorter timeslices, as the baselines of the above results are specific to the timeslice length. Tests with longer timeslices have smaller sample sizes, so they will run longer. In the above results, the baseline test duration for 3-hour timeslices was twice as long as that for 1-hour timeslices.
Summary
Here, we presented a two-stage method of variance reduction using CUPED for switchback tests. This could provide a much-needed boost in development speed for many teams, as switchbacks tend to run impractically long. While more general methods exist (see here), the presented approach can be implemented purely in SQL, lowering the bar for implementation.
Further reading
Data Council talk by Laura Cosgrove recommending the use of demand metrics as a CUPED covariate
Research paper on using CUPAC for reducing variance in switchback test designs
Appendix
In addition to the above covariates, we also explored variations on the covariate set, e.g., using only the historical or only the within-experiment demand values as a covariate (rather than their delta), and applying the CUPED-demand stage alone without the CUPED-metric stage. However, these variants were only modestly impactful.
We didn't observe any change greater than 0.001% in the average metric lift between treatment and control when moving from the non-CUPED to the CUPED metric versions, indicating that no bias was introduced.
To further check for bias, we generated a p-value distribution from 500 runs on data sets of the CUPED-demand version that were fully resampled with replacement. As seen below, the p-values were uniformly distributed, indicating no bias in how the variance of the metrics is estimated. P-values were calculated at the user level and corrected for within-timeslice correlation between users by applying the VIF to the standard error of the treatment effect.
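A minimal sketch of this kind of A/A check (the resampling and assignment details here are our assumptions for illustration, and the VIF value is a placeholder):

```python
import numpy as np
import pandas as pd
from scipy import stats

def aa_pvalue(df, metric, vif, rng):
    # Bootstrap the data set, randomise treatment over timeslices,
    # then z-test the lift with a VIF-corrected standard error
    boot = df.iloc[rng.integers(0, len(df), size=len(df))]
    slices = boot["timeslice_id"].unique()
    treated = set(rng.choice(slices, size=len(slices) // 2, replace=False))
    is_t = boot["timeslice_id"].isin(treated)
    y1, y0 = boot.loc[is_t, metric], boot.loc[~is_t, metric]
    se = np.sqrt(y1.var(ddof=1) / len(y1) + y0.var(ddof=1) / len(y0)) * np.sqrt(vif)
    z = (y1.mean() - y0.mean()) / se
    return 2 * stats.norm.sf(abs(z))

rng = np.random.default_rng(42)
pvals = [aa_pvalue(test, "demand_cuped", vif=1.5, rng=rng) for _ in range(500)]
print(stats.kstest(pvals, "uniform"))  # uniform p-values imply unbiased variance estimates
```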
Join us!
Bolt is a place where you can grow professionally at lightning speed and create a real impact on a global scale.
Take a look at our careers page and browse through hundreds of open roles, each offering an exciting opportunity to contribute to making cities for people, not cars.
If you're ready to work in an exciting, dynamic, fast-paced industry and are not afraid of a challenge, we're waiting for you!