When to abort an experiment?

Aborting -or early stopping-- an experiment is sometimes necessary, but it must be approached cautiously. Aborting means stopping an experiment before it is completed. This article explains when it might be appropriate to stop or abort a Fixed-horizon or a Group Sequential test (both have their own rules and considerations for early stopping), the risks involved, and best practices for decision-making.

Aborting fixed-horizon experiments

In a fixed-horizon experiment, results should ideally be evaluated only at the pre-determined horizon (e.g., after collecting the required sample size). Aborting a fixed-horizon experiment means making a decisions (usually to stop) before the experiment has collected the required sample size. Aborting is not recommended because fixed-horizon tests require full sample size completion to ensure validity.

Aborting Group Sequential experiments

Group Sequential Test (GST) are designed for interim analyses, providing valid statistical checks that allow for early-stopping experiments when appropriate. The benefit of GST is to allow for early stopping but only this can only happen if the experiment crossed a boundary at one of the planned interim analysis. Aborting a Group Sequential Test, means making a decision (usually to stop) in-between analysis and before the experiment crossed either the efficiency or the futility boundary.

When can an experiment be aborted?

There are a few scenarios where aborting might be justified:

Negative results with a potential high cost

If the treatment causes significant harm to primary or guardrail metrics, such as decreased conversion rates or increased error rates, it may justify aborting the experiment. However, this decision should be seen as a precautionary measure rather than conclusive evidence, as:

The experiment may be underpowered, increasing the risk of false positives.
The observed effect may be due to a novelty effect, which could diminish over time as users adapt.

Example:
A treatment causes a 30% increase in conversion within the first day. Even if the evidence is not definitive, the potential harm to the business justifies stopping early.

Critical bugs or errors

If the treatment introduces a bug or technical failure that disrupts core functionality, aborting the experiment is necessary to prevent further damage. Bugs in the experiment tracking or in the variant implementation are the most common reasons for aborting experiments.

Example:
A bug in the checkout flow prevents users from completing transactions, making it imperative to abort the experiment immediately.

Experiment Health Checks alert

Health Checks are quality checks run in the background to ensure the experiment and data are reliable. Most health checks alerts would require to abort the experiment and investigate the reason for the alert.

Example: The health checks show an assignment conflicts was detected meaning that some visitors have been exposed to more than one variant. This might require the experiment to be stopped and the issue investigated.

Risks of aborting experiments

False positives: Stopping early when the experiment is underpowered increases the likelihood of incorrect conclusions. A negative impact on the primary metric can be a false positive
Novelty effects: Early negative signals may resolve over time as users adapt to changes. Stopping early will prevent users from capturing this effect and might lead to wrong decisions about the true impact of a change.
Limited evidence: Decisions are based on signals rather than hard evidence due to incomplete data.
Winner’s curse: Early stopping can exaggerate effect sizes, making subsequent validation less reliable.
Missing long-term impacts: Early termination may prevent capturing delayed outcomes or unintended side effects.

Best practices for stopping or aborting experiments

Use primary and guardrail metrics

Both primary and guardrail metrics should guide early stopping decisions. When chosen correctly, these metrics measure the impact of changes on key KPIs and ensure that critical business objectives are protected.

Pre-define stopping and abort criteria

Set clear rules before launching the experiment. Criteria might include:

Specific thresholds for primary metrics to validate success or detect failure.
Defined guardrail thresholds for key metrics like error rates, performance, or user behavior.

Those early stopping criteria can be recorded in ABsmartly on the last step of the experiment creation form.

Investigate and validate with follow-up analysis

Treat early stops as signals, not definitive conclusions. Early stopping should always be followed by an investigation to try and understand the cause. When needed follow-up experiment can be run once the issue has been mitigated.

Summary

Early stopping or aborting an experiment is a decision that requires careful consideration. In fixed-horizon experiments, early stopping is rare and should only occur for significant negative impacts or critical bugs. In GST experiments, planned stops at interim analyses are valid and expected, while unplanned experiment abortions should be reserved for emergencies. Both primary and guardrail metrics are essential for guiding early stopping decisions. To minimize risks, establish pre-defined criteria, validate results, and monitor metrics comprehensively.

When to abort an experiment?

Aborting fixed-horizon experiments​

Aborting Group Sequential experiments​

When can an experiment be aborted?​

Negative results with a potential high cost​

Critical bugs or errors​

Experiment Health Checks alert​

Risks of aborting experiments​

Best practices for stopping or aborting experiments​

Use primary and guardrail metrics​

Pre-define stopping and abort criteria​

Investigate and validate with follow-up analysis​

Summary​