Guide to interpreting metrics in experiment results
This guide outlines how to understand and interpret the metrics displayed in ABsmartly. It will cover key elements such as confidence intervals, significance, and statistical power. This guide is split into two sections: Fixed Horizon Metrics and Group Sequential Metrics.
Fixed Horizon Metrics
Understanding the metrics table
The metrics table provides the following key data points for each variant:
- Participants: The number of individuals assigned to each variant, expressed as a count and percentage of the total participants.
- Value: The total count or sum associated with the metric (e.g., number of "buyers") for each variant.
- Mean: The average performance of the metric for each variant, calculated as:
Mean = Value / Participants
- Confidence: The statistical confidence that the observed difference between variants is real, not due to random chance.
- Impact: The difference in metric performance between variants, shown visually and as a percentage with a confidence interval (CI). In this example the observed impact is +1.35% with a confidence interval going from +0.38% to +2.32%.
Interpreting confidence intervals and results
The confidence interval (CI) and the color-coded indicator help assess the significance of the results:
- Positive Impact (Green): The confidence interval is entirely above 0, indicating a statistically significant improvement for the variant.
- Negative Impact (Red): The confidence interval is entirely below 0, indicating a statistically significant negative effect.
- Insignificant (Grey): The confidence interval crosses 0, meaning the results are not statistically significant.
Evaluating Power
Power indicates the likelihood that the experiment would detect a true effect if one exists. The power shown in ABsmartly indicates how well powered the primary metric is compared to the target power.
- Power Achieved: The actual power achieved on the primary metric based on the data collected. In the example above it is 96.1%.
- Objective/Target Power: The pre-specified experiment power threshold. In the example above target was 80%.
Step-by-step interpretation of a metric
Example using the example above
- Review the Baseline (Variant A):
- Participants: 30,752 (49.92% of total)
- Value: 20,050 (observed count of "buyers")
- Mean: 65.20%
- Compare against the test variant (Variant B):
- Participants: 30,855 (50.08% of total) -> No Sample Ratio Mismatch ratio detected
- Value: 20,388 (observed count of "buyers")
- Mean: 66.08%
- Confidence Level: 97.77% -> Statistically significant (target was 90%).
- Power: 96.1% -> Sufficiently powered (target was 80%)
- Impact: +1.35% (Confidence Interval: +0.38% to +2.32%, color-coded green).
- Draw Conclusions:
- Variant B significantly outperforms Variant A in terms of "buyers"
- High confidence (97.77%) and a positive impact suggest adopting Variant B may be beneficial.
Group Sequential Testing Metrics
Understanding the Metrics Table
The metrics table provides the following key data points for each variant at the last interim analysis:
- Mean: The GST adjusted average performance of the metric for the variant.
- Observed Mean: The actual observed average performance of the metric during the experiment.
- Impact: The percentage change in the metric compared to the baseline. This is a GST adjusted value. In this example +1.74% with a confidence interval going from -1.88% to +5.49%.
- Z-Score: A statistical measure that represents how many standard deviations the result is from the null hypothesis (no effect). Positive Z-scores indicate an improvement, while negative Z-scores suggest a decline.
- P-Value: The probability that the observed result occurred by chance if the null hypothesis is true. Lower P-values (e.g., below 0.05) indicate statistical significance.
These data points provide a summary of the ongoing analysis for the selected variant at the last analysis, helping to evaluate its performance relative to the baseline.
Understanding the Group Sequential Graph
Group Sequential Testing, makes it easy to visually interpret results. The graph displays the evolution of statistical evidence over time, allowing decisions to be made at predefined checkpoints during the experiment. It includes the following elements:
- X-Axis (Time): Represents the progress of the experiment in time, with dates marking each past and future interim analyses.
- Y-Axis (Standard Deviations, Z-Scale): Represents the Z-Score, showing how far the observed result is from the null hypothesis.
- Z-Score Trajectory (Orange line): The path of the observed Z-Score over time. It starts at 0 and moves based on accumulating data.
- Efficiency Boundary (Green Region): The upper boundary. If the Z-Score trajectory crosses this boundary, the variant shows a statistically significant improvement, and the experiment can be stopped early for success.
- Futility Boundary (Pink Region): The lower boundary. If the Z-Score trajectory crosses this boundary, the variant is deemed unlikely to show meaningful improvement, and the experiment can be stopped early for futility.
- Fixed Horizon (Vertical Dotted Line): Represents the moment in time where the equivalent Fixed Horizon test would have been completed. All interim analyses before that dotted line are opportunities to make an early decision.
Interpreting the Boundaries
Efficiency Boundary (Green):
- Crossing this boundary means there is enough evidence to conclude the variant performs significantly better than the baseline.
- The experiment can be stopped early, and the variant can be considered successful.
Futility Boundary (Pink):
- Crossing this boundary indicates the variant is unlikely to show significant improvement.
- In case of a binding futility type (see experiment setup), the experiment is completed (no more interim analysis will happen) and can be stopped as further data collection is unlikely to change the conclusion.
- In case of a non-binding futility type, you can decide to keep the experiment running to the following interim analyses.