Results & Analysis

Understand experiment outcomes with statistical analysis and AI-powered insights.

How Results Are Computed

When an experiment is running, the platform automatically analyzes its data at regular intervals. It compares how users in each variation are behaving and determines whether the differences are meaningful or just due to random chance.

The platform supports two approaches to statistical analysis. You can choose either one when setting up your experiment:

Approach	What It Tells You	When to Use It
Frequentist	Answers the question: "If there were truly no difference between variations, how unlikely is it that we would see results this extreme?" At a 95% confidence level, results this extreme would show up less than 5% of the time if the variations were actually identical.	The default and most widely used approach. Best when you want a clear yes-or-no answer about whether a difference exists.
Bayesian	Answers the question: "Given the data we have observed, what is the probability that each variation is the best?" It also tells you the expected cost of choosing the wrong variation.	Useful when you want a more intuitive probability statement ("Variation B has a 92% chance of being better") or when you have prior knowledge about expected performance.

Both approaches are valid and, on the same data, usually lead to the same decision. The Bayesian output can be easier to communicate to stakeholders because a probability of being best is more intuitive than a confidence level.
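To make the two questions concrete, here is a minimal sketch of each approach for a unique-conversion goal. It is not the platform's implementation: the function names, the pooled two-proportion z-test, and the uniform Beta(1, 1) prior are all assumptions chosen for illustration.

```python
import math
import random


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Frequentist view: two-sided p-value from a pooled two-proportion z-test."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # erfc(|z| / sqrt(2)) is the two-sided tail area of the standard normal.
    return math.erfc(abs(z) / math.sqrt(2))


def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=20000, seed=0):
    """Bayesian view: Monte Carlo estimate of P(rate_B > rate_A).

    Assumes an uninformative Beta(1, 1) prior on each conversion rate.
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws


# Hypothetical data: control converts 100 of 1000 users, treatment 150 of 1000.
p_value = two_proportion_p_value(100, 1000, 150, 1000)
p_best = prob_b_beats_a(100, 1000, 150, 1000)
```

With these made-up numbers the p-value is well below 0.05 and the probability that the treatment is better is close to 1, so both approaches point the same way.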

Key Metrics

For each variation in your experiment, the platform computes and displays the following metrics:

Metric	What It Means
Impressions	The number of users who were assigned to this variation and exposed to the experience.
Conversions	The number of users in this variation who completed the goal action (for unique conversion goals) or the total number of goal events (for event count goals).
Conversion Rate	The percentage of users who converted, or the average number of events per user — depending on the goal type.
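Lift Over Control	How much better (or worse) this variation is compared to the control, expressed as a relative percentage of the control's conversion rate (not a difference in percentage points). For example, "+12% lift" means the variation's conversion rate is 12% higher than the control's: if the control converts at 10%, the variation converts at 11.2%.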
Confidence Level	How confident the analysis is that the observed difference is real and not due to random chance. Typically, 95% or higher is considered statistically significant.
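As a quick illustration of how the first four metrics relate, the sketch below computes a conversion rate and a relative lift from raw counts. The function names and the counts are hypothetical.

```python
def conversion_rate(conversions, impressions):
    """Share of exposed users who converted (unique conversion goals)."""
    return conversions / impressions


def relative_lift(control_rate, variation_rate):
    """Lift over control as a fraction of the control's rate."""
    return (variation_rate - control_rate) / control_rate


control = conversion_rate(500, 5000)    # 10% of control users converted
variation = conversion_rate(560, 5000)  # 11.2% of variation users converted
lift = relative_lift(control, variation)
print(f"Lift over control: {lift:+.0%}")
```

Here the variation is 1.2 percentage points above the control, which is a relative lift of about 12%.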

What "Statistical Significance" Means in Practice

When results are "statistically significant," it means the platform has enough evidence to conclude that the difference between variations is real and not just noise in the data. Specifically, at a 95% confidence level, there is at most a 5% chance that you would see a difference this large or larger if the variations were actually the same.

This matters because with small sample sizes, you will often see apparent differences that are actually just random variation. Statistical significance protects you from making decisions based on these false signals.

A result can be statistically significant but practically insignificant. If a variation improves conversion rate by 0.01% with high confidence, the result is real but probably not worth the effort of shipping. Always consider the size of the lift alongside the confidence level.
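One way to look at effect size and confidence together is a confidence interval on the difference. The sketch below uses a simple Wald interval for the absolute difference in conversion rate; the function name and the traffic numbers are hypothetical, and the platform may compute its intervals differently.

```python
import math
from statistics import NormalDist


def lift_confidence_interval(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Wald interval for the absolute difference in conversion rate (B minus A)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    diff = p_b - p_a
    return diff - z * se, diff + z * se


# Hypothetical high-traffic test: 10.00% vs 10.05% on ten million users each.
low, high = lift_confidence_interval(1_000_000, 10_000_000, 1_005_000, 10_000_000)
print(f"Difference is between {low:.4%} and {high:.4%}")
```

The interval excludes zero, so the result is statistically significant, yet even its upper bound is below a tenth of a percentage point. Whether that is worth shipping is a product decision, not a statistical one.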

CUPED — Faster, More Precise Results

CUPED (Controlled Experiment Using Pre-Experiment Data) is a technique that uses data about how users behaved before the experiment to produce more precise results. The basic idea is straightforward: if you know a user's baseline behavior, you can better isolate the effect of the experiment from natural variation between users.

In practice, CUPED typically reduces the required sample size by 30-50%, meaning your experiments reach conclusive results faster without sacrificing statistical rigor. The actual reduction depends on how strongly a user's pre-experiment behavior correlates with the metric being measured. This is especially valuable for teams that want to run experiments quickly or have limited traffic.

CUPED works best when the metric you are measuring has high natural variation between users and when users have a stable pattern of behavior before the experiment. It is configured on a per-goal basis and uses a lookback window (typically 14 days) to gather the pre-experiment baseline.
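The core of the adjustment can be sketched in a few lines. This is a minimal illustration of the standard CUPED formula, not the platform's code: the function name and the sample data are made up, and in a real analysis the coefficient theta is usually estimated once on users pooled across all variations.

```python
from statistics import mean, variance


def cuped_adjust(metric, pre_metric):
    """Return CUPED-adjusted values: y - theta * (x - mean(x)).

    metric:     each user's value during the experiment (y)
    pre_metric: the same user's value in the lookback window (x)
    """
    x_bar = mean(pre_metric)
    y_bar = mean(metric)
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(pre_metric, metric))
    var = sum((x - x_bar) ** 2 for x in pre_metric)
    theta = cov / var  # regression slope of y on x
    return [y - theta * (x - x_bar) for x, y in zip(pre_metric, metric)]


# Hypothetical users whose in-experiment activity tracks their baseline closely.
pre = [1, 2, 3, 4, 5, 6]
during = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]
adjusted = cuped_adjust(during, pre)
print(variance(during), variance(adjusted))
```

The adjustment leaves the mean unchanged and shrinks the variance by a factor of 1 minus the squared correlation between the two series. A 30-50% reduction therefore corresponds to a correlation of roughly 0.55 to 0.7 between pre-experiment and in-experiment behavior.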

Analysis Reports

Each time the analysis runs, it produces a report that summarizes the current state of the experiment and provides an actionable recommendation:

Recommendation	What It Means
Ship	The treatment variation is winning with sufficient statistical confidence. You can proceed with rolling it out to all users.
Keep Running	There is not yet enough data to make a confident decision. The experiment needs more time to collect additional observations.
Inconclusive	The experiment has run for a sufficient period but no meaningful difference has been detected. The variations appear to perform similarly.
Halt	The treatment variation is underperforming the control, or a guardrail metric has regressed. Consider stopping the experiment.

These recommendations also drive the rollout policy controller — adaptive rollouts advance on "Ship," hold on "Keep Running" or "Inconclusive," and halt on "Halt."
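The mapping from recommendation to rollout behavior described above can be written as a small lookup table. The enum and function names here are hypothetical and only mirror the wording of this page; they are not the platform's API.

```python
from enum import Enum


class Recommendation(Enum):
    SHIP = "Ship"
    KEEP_RUNNING = "Keep Running"
    INCONCLUSIVE = "Inconclusive"
    HALT = "Halt"


class RolloutAction(Enum):
    ADVANCE = "advance"
    HOLD = "hold"
    HALT = "halt"


# Adaptive rollouts advance on Ship, hold while undecided, and halt on Halt.
ROLLOUT_POLICY = {
    Recommendation.SHIP: RolloutAction.ADVANCE,
    Recommendation.KEEP_RUNNING: RolloutAction.HOLD,
    Recommendation.INCONCLUSIVE: RolloutAction.HOLD,
    Recommendation.HALT: RolloutAction.HALT,
}


def next_rollout_action(recommendation):
    """Look up what an adaptive rollout does for a given report recommendation."""
    return ROLLOUT_POLICY[recommendation]
```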

AI Insights

When enabled, the platform generates AI-powered commentary on your experiment results. These insights are designed to surface patterns you might miss and provide actionable suggestions:

Hypothesis Assessment — Evaluates your original hypothesis against the actual results. Did the data support your prediction, and if not, what might explain the difference?
Cross-Goal Patterns — Identifies relationships between multiple goals. For example: "The treatment increases signup rate but reduces 7-day retention, suggesting lower-quality signups."
Follow-Up Experiments — Suggests next steps based on the results. If a variation showed promise in one area, the AI might suggest testing it more broadly or refining it further.
Sample Ratio Mismatch Debugging — If the observed traffic split differs significantly from the expected split (a warning sign that something is wrong with the experiment setup), the AI provides hypotheses about the cause, such as bot traffic, redirect issues, or caching problems.

When to Trust Results

Not all experiment results are equally trustworthy. Before acting on results, check for these conditions:

Sufficient sample size: Has the experiment reached its target sample size? Results from small samples are unstable and may change as more data comes in.
Adequate runtime: Has the experiment run long enough to capture a full business cycle? Experiments that only run on weekdays may miss weekend behavior patterns, and vice versa. A minimum of one to two weeks is recommended for most experiments.
No sample ratio mismatch: Is the actual traffic split close to the expected split? If you configured a 50/50 split but observe 60/40, something may be wrong with the experiment setup. The platform will warn you if a significant mismatch is detected.
Consistent results over time: Have the results stabilized, or are they fluctuating significantly between analysis runs? Stable results are more trustworthy than results that swing back and forth.
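The sample ratio mismatch check in the list above is commonly done with a chi-square goodness-of-fit test. The sketch below handles the two-variation case; the function name is hypothetical, and the strict 0.001 threshold is a common convention rather than a documented platform setting.

```python
import math


def srm_p_value(observed_a, observed_b, expected_share_a=0.5):
    """Chi-square goodness-of-fit p-value for a two-variation traffic split."""
    total = observed_a + observed_b
    expected_a = total * expected_share_a
    expected_b = total - expected_a
    chi2 = ((observed_a - expected_a) ** 2 / expected_a
            + (observed_b - expected_b) ** 2 / expected_b)
    # With one degree of freedom, P(chi-square > x) equals erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


SRM_THRESHOLD = 0.001  # assumed cutoff; flag a mismatch below this

ok_split = srm_p_value(5050, 4950)   # small wobble around 50/50
bad_split = srm_p_value(6000, 4000)  # the 60/40 example from the checklist
print(ok_split > SRM_THRESHOLD, bad_split < SRM_THRESHOLD)
```

With ten thousand users, a 5050/4950 split is well within normal variation, while a 60/40 split on a configured 50/50 is far too unlikely to be chance.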

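Resist the urge to check results too early or too frequently. "Peeking" at results before reaching your target sample size, and stopping as soon as they look significant, can lead to false conclusions: the data is noisy and volatile early in an experiment, and every extra look gives random noise another chance to cross the significance threshold. Trust the process and wait until the experiment has reached its target sample size and runtime before acting on the result.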
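The cost of peeking can be seen in a small simulation of A/A tests, where both variations are identical and any "significant" result is a false positive. This is an illustrative sketch under simplifying assumptions (a normally distributed metric with known variance, a plain z-test at each look, and hypothetical names and parameters); it does not model how the platform schedules or corrects its analyses.

```python
import math
import random


def false_positive_rates(experiments=1000, looks=10, users_per_look=50,
                         z_crit=1.96, seed=1):
    """Simulate A/A tests and compare two stopping rules.

    Returns (fixed_rate, peeking_rate):
      fixed_rate   - share of tests significant at the final look only
      peeking_rate - share of tests significant at any of the looks
    """
    rng = random.Random(seed)
    fixed_hits = 0
    peeking_hits = 0
    for _ in range(experiments):
        sum_a = sum_b = 0.0
        n = 0
        crossed_at_any_look = False
        significant = False
        for _ in range(looks):
            for _ in range(users_per_look):
                sum_a += rng.gauss(0.0, 1.0)
                sum_b += rng.gauss(0.0, 1.0)
            n += users_per_look
            # z-statistic for the difference in means with unit variance per user.
            z = (sum_b / n - sum_a / n) / math.sqrt(2.0 / n)
            significant = abs(z) > z_crit
            crossed_at_any_look = crossed_at_any_look or significant
        fixed_hits += significant
        peeking_hits += crossed_at_any_look
    return fixed_hits / experiments, peeking_hits / experiments


fixed_rate, peeking_rate = false_positive_rates()
print(f"final look only: {fixed_rate:.1%}, stop at first significant look: {peeking_rate:.1%}")
```

Evaluating only at the final look keeps the false positive rate near the nominal 5%, while stopping at the first significant look pushes it well above that, even though nothing differs between the variations.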