Results & Analysis
Understand experiment outcomes with statistical analysis and AI-powered insights.
How Results Are Computed
When an experiment is running, the platform automatically analyzes its data at regular intervals. It compares how users in each variation are behaving and determines whether the differences are meaningful or just due to random chance.
The platform supports two approaches to statistical analysis. You can choose either one when setting up your experiment:
| Approach | What It Tells You | When to Use It |
|---|---|---|
| Frequentist | Answers the question: "If there were truly no difference between variations, how unlikely is it that we would see results this extreme?" A high confidence level (e.g., 95%) means a difference this large would rarely show up by chance alone. | The default and most widely used approach. Best when you want a clear yes-or-no answer about whether a difference exists. |
| Bayesian | Answers the question: "Given the data we have observed, what is the probability that each variation is the best?" It also tells you the expected cost of choosing the wrong variation. | Useful when you want a more intuitive probability statement ("Variation B has a 92% chance of being better") or when you have prior knowledge about expected performance. |
Both approaches are valid and, given enough data, will generally lead to the same conclusions. The Bayesian method can be easier to communicate to stakeholders because its output, a probability of being best, is more intuitive than a confidence level.
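
To make the Bayesian outputs concrete, here is a minimal sketch of how a "probability of being better" and an "expected cost of choosing the wrong variation" could be estimated for a unique-conversion goal. It is not the platform's implementation: the function name `bayesian_summary`, the uniform Beta(1, 1) prior, and the Monte Carlo approach are all assumptions made for illustration.

```python
import random


def bayesian_summary(conv_a, n_a, conv_b, n_b, draws=20000, seed=0):
    """Estimate P(B beats A) and the expected loss of choosing B.

    Assumption: each conversion rate gets a uniform Beta(1, 1) prior, so the
    posterior is Beta(1 + conversions, 1 + non-conversions).
    """
    rng = random.Random(seed)
    wins_b = 0
    loss_b = 0.0
    for _ in range(draws):
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        if rate_b > rate_a:
            wins_b += 1
        # Conversion rate given up if we pick B in a draw where A is better.
        loss_b += max(rate_a - rate_b, 0.0)
    return wins_b / draws, loss_b / draws


# Hypothetical data: control converts 500 of 10,000 users, treatment 560 of 10,000.
prob_b_better, expected_loss_b = bayesian_summary(500, 10_000, 560, 10_000)
print(f"P(B better than A): {prob_b_better:.1%}")
print(f"Expected loss if B is chosen: {expected_loss_b:.5f}")
```

The expected loss is measured in the same units as the conversion rate, so a very small value means little is at risk even if the chosen variation turns out not to be the best.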
Key Metrics
For each variation in your experiment, the platform computes and displays the following metrics:
| Metric | What It Means |
|---|---|
| Impressions | The number of users who were assigned to this variation and exposed to the experience. |
| Conversions | The number of users in this variation who completed the goal action (for unique conversion goals) or the total number of goal events (for event count goals). |
| Conversion Rate | The percentage of users who converted, or the average number of events per user — depending on the goal type. |
| Lift Over Control | How much better (or worse) this variation is compared to the control, expressed as a relative percentage. For example, "+12% lift" means the variation's conversion rate is 12% higher than the control's in relative terms (such as 5.6% versus 5.0%), not 12 percentage points higher. |
| Confidence Level | How confident the analysis is that the observed difference is real and not due to random chance. Typically, 95% or higher is considered statistically significant. |
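
As an illustration of how the rate and lift columns relate, the following sketch computes a conversion rate and a relative lift from raw counts. The function names and the sample numbers are hypothetical, and the platform may handle edge cases differently.

```python
def conversion_rate(conversions, impressions):
    """Conversions per exposed user; 0.0 when nobody has been exposed yet."""
    return conversions / impressions if impressions else 0.0


def relative_lift(control_rate, variation_rate):
    """Relative change versus control; None when the control rate is zero."""
    if control_rate == 0:
        return None
    return (variation_rate - control_rate) / control_rate


control = conversion_rate(500, 10_000)    # 5.0%
variation = conversion_rate(560, 10_000)  # 5.6%
lift = relative_lift(control, variation)
print(f"{lift:+.0%}")  # +12%
```

Note that the absolute difference here is only 0.6 percentage points; the 12% figure is relative to the control's rate.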
What "Statistical Significance" Means in Practice
When results are "statistically significant," it means the platform has enough evidence to conclude that the difference between variations is real and not just noise in the data. Specifically, at a 95% confidence level, there is at most a 5% chance that you would see a difference this large if the variations were actually the same.
This matters because with small sample sizes, you will often see apparent differences that are actually just random variation. Statistical significance protects you from making decisions based on these false signals.
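
The effect of sample size can be seen with a standard two-proportion z-test, sketched below. This is one common frequentist test for conversion rates and is shown as an assumption; the source does not state which test the platform runs, and the function name is hypothetical.

```python
import math


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference between two conversion rates."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both variations are the same.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 1.0
    z = (rate_b - rate_a) / std_err
    # Two-sided tail probability of a standard normal distribution.
    return math.erfc(abs(z) / math.sqrt(2))


# Same observed rates (5.0% vs 5.6%), very different sample sizes.
p_small = two_proportion_p_value(50, 1_000, 56, 1_000)
p_large = two_proportion_p_value(5_000, 100_000, 5_600, 100_000)
print(f"1,000 users per variation:   p = {p_small:.3f}")
print(f"100,000 users per variation: p = {p_large:.2e}")
```

With 1,000 users per variation the same 12% relative lift is well within what random variation could produce, while with 100,000 users per variation it clears the 95% confidence threshold (p below 0.05).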
CUPED — Faster, More Precise Results
CUPED (Controlled Experiment Using Pre-Experiment Data) is a technique that uses data about how users behaved before the experiment to produce more precise results. The basic idea is straightforward: if you know a user's baseline behavior, you can better isolate the effect of the experiment from natural variation between users.
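In practice, CUPED typically reduces the required sample size by 30-50%, meaning your experiments reach conclusive results faster without sacrificing statistical rigor. This is especially valuable for teams that want to run experiments quickly or have limited traffic. The actual reduction depends on how strongly a user's pre-experiment behavior correlates with the metric during the experiment: the stronger the correlation, the larger the gain.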
CUPED works best when the metric you are measuring has high natural variation between users and when users have a stable pattern of behavior before the experiment. It is configured on a per-goal basis and uses a lookback window (typically 14 days) to gather the pre-experiment baseline.
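
The core CUPED adjustment can be sketched in a few lines. Each user's in-experiment value is shifted by how far their pre-experiment value sat from the average, scaled by a coefficient theta = cov(before, during) / var(before). The function name and the sample data below are hypothetical, and the sketch adjusts a single list of users; in practice theta is usually estimated on users pooled across all variations.

```python
from statistics import mean, variance


def cuped_adjust(during, before):
    """Return CUPED-adjusted values: y - theta * (x - mean(x)).

    `during` holds each user's metric in the experiment, `before` the same
    user's metric in the lookback window (same order, same length).
    """
    mean_x = mean(before)
    mean_y = mean(during)
    cov_xy = sum((x - mean_x) * (y - mean_y)
                 for x, y in zip(before, during)) / (len(before) - 1)
    var_x = variance(before)
    theta = cov_xy / var_x if var_x else 0.0
    return [y - theta * (x - mean_x) for x, y in zip(before, during)]


# Hypothetical per-user event counts: lookback window vs. experiment period.
before = [2, 9, 4, 7, 1, 8, 5, 3]
during = [3, 10, 4, 9, 2, 8, 6, 4]
adjusted = cuped_adjust(during, before)

print(f"variance before adjustment: {variance(during):.2f}")
print(f"variance after adjustment:  {variance(adjusted):.2f}")
```

The adjustment leaves the mean unchanged and shrinks the variance by a factor of roughly one minus the squared correlation between the two periods, which is why stable pre-experiment behavior matters.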
Analysis Reports
Each time the analysis runs, it produces a report that summarizes the current state of the experiment and provides an actionable recommendation:
| Recommendation | What It Means |
|---|---|
| Ship | The treatment variation is winning with sufficient statistical confidence. You can proceed with rolling it out to all users. |
| Keep Running | There is not yet enough data to make a confident decision. The experiment needs more time to collect additional observations. |
| Inconclusive | The experiment has run for a sufficient period but no meaningful difference has been detected. The variations appear to perform similarly. |
| Halt | The treatment variation is underperforming the control, or a guardrail metric has regressed. Consider stopping the experiment. |
These recommendations also drive the rollout policy controller — adaptive rollouts advance on "Ship," hold on "Keep Running" or "Inconclusive," and halt on "Halt."
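
The mapping from recommendation to rollout behavior described above can be expressed as a simple lookup. This is an illustrative sketch only: the enum and function names are hypothetical and do not reflect the platform's actual API.

```python
from enum import Enum


class Recommendation(Enum):
    SHIP = "Ship"
    KEEP_RUNNING = "Keep Running"
    INCONCLUSIVE = "Inconclusive"
    HALT = "Halt"


class RolloutAction(Enum):
    ADVANCE = "advance"
    HOLD = "hold"
    HALT = "halt"


# Adaptive rollouts advance on Ship, hold on Keep Running or Inconclusive,
# and halt on Halt.
ROLLOUT_POLICY = {
    Recommendation.SHIP: RolloutAction.ADVANCE,
    Recommendation.KEEP_RUNNING: RolloutAction.HOLD,
    Recommendation.INCONCLUSIVE: RolloutAction.HOLD,
    Recommendation.HALT: RolloutAction.HALT,
}


def next_rollout_action(recommendation):
    """Look up the rollout action for an analysis report's recommendation."""
    return ROLLOUT_POLICY[recommendation]


print(next_rollout_action(Recommendation.INCONCLUSIVE).value)  # hold
```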
AI Insights
When enabled, the platform generates AI-powered commentary on your experiment results. These insights are designed to surface patterns you might miss and provide actionable suggestions:
- Hypothesis Assessment — Evaluates your original hypothesis against the actual results. Did the data support your prediction, and if not, what might explain the difference?
- Cross-Goal Patterns — Identifies relationships between multiple goals. For example: "The treatment increases signup rate but reduces 7-day retention, suggesting lower-quality signups."
- Follow-Up Experiments — Suggests next steps based on the results. If a variation showed promise in one area, the AI might suggest testing it more broadly or refining it further.
- Sample Ratio Mismatch Debugging — If the observed traffic split differs significantly from the expected split (a warning sign that something is wrong with the experiment setup), the AI provides hypotheses about the cause, such as bot traffic, redirect issues, or caching problems.
When to Trust Results
Not all experiment results are equally trustworthy. Before acting on results, check for these conditions:
- Sufficient sample size: Has the experiment reached its target sample size? Results from small samples are unstable and may change as more data comes in.
- Adequate runtime: Has the experiment run long enough to capture a full business cycle? Experiments that only run on weekdays may miss weekend behavior patterns, and vice versa. A minimum of one to two weeks is recommended for most experiments.
- No sample ratio mismatch: Is the actual traffic split close to the expected split? If you configured a 50/50 split but observe 60/40, something may be wrong with the experiment setup. The platform will warn you if a significant mismatch is detected.
- Consistent results over time: Have the results stabilized, or are they fluctuating significantly between analysis runs? Stable results are more trustworthy than results that swing back and forth.
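
The sample ratio mismatch check in the list above is commonly done with a chi-square goodness-of-fit test. The sketch below covers the two-variation case; the function name and the 0.001 alert threshold are assumptions for illustration, not the platform's documented behavior.

```python
import math


def srm_p_value(observed_a, observed_b, expected_share_a=0.5):
    """P-value of a chi-square test that traffic matches the configured split.

    Two variations give one degree of freedom, for which the chi-square
    tail probability equals erfc(sqrt(chi2 / 2)).
    """
    total = observed_a + observed_b
    expected_a = total * expected_share_a
    expected_b = total - expected_a
    chi2 = ((observed_a - expected_a) ** 2 / expected_a
            + (observed_b - expected_b) ** 2 / expected_b)
    return math.erfc(math.sqrt(chi2 / 2))


SRM_THRESHOLD = 0.001  # assumed alert threshold; a small value limits false alarms

for a, b in [(5_050, 4_950), (6_000, 4_000)]:
    p = srm_p_value(a, b)
    status = "mismatch" if p < SRM_THRESHOLD else "ok"
    print(f"{a}/{b}: p = {p:.3g} ({status})")
```

A 5,050/4,950 split on a configured 50/50 experiment is ordinary random variation, while a 6,000/4,000 split is far too lopsided to be chance and points to a setup problem.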