Understanding Variability in Medical Model Performance: The Role of Confidence Intervals
Evaluating medical models extends beyond simply reporting a single performance metric; it critically involves understanding and reporting the variability in that estimate. This article explores how confidence intervals can be effectively used to illustrate this variability.
The Challenge of Estimating Population Accuracy
Imagine a hospital with 50,000 patients who undergo chest X-rays. If we wanted to determine the accuracy of a chest X-ray model on every single patient in this hospital—the entire population—we would ideally run the model and obtain the ground truth for all of them. This would yield the model’s performance on the entire population. For example, if we were measuring accuracy, we might find it to be 0.78 across all 50,000 patients. This ideal, true accuracy for the entire group is known as the population accuracy (p).
In reality, however, testing a model on an entire patient population is often infeasible due to its sheer size and the resources required. Consequently, the true population accuracy (p) remains unknown. This leads to a crucial question: Can we still gain a reliable understanding of the model’s performance on the population by using a smaller, more manageable sample of patients?
Let’s consider a scenario where we sample 100 patients from the hospital. On this smaller dataset, the model might achieve an accuracy of 0.8. While this sample accuracy provides an estimate, it doesn’t tell us the range within which the true population accuracy (p) is likely to lie. This is where confidence intervals become invaluable.
Introducing Confidence Intervals
Confidence intervals allow us to quantify the uncertainty around our sample estimate and express a range for the population parameter. Using our sample of 100 patients, for instance, we might state that we are 95% confident that the population accuracy (p) falls within the interval 0.72 to 0.88.
- 0.72 is the lower bound of this interval.
- 0.88 is the upper bound of this interval.
While the full calculation of confidence intervals is beyond the scope of this discussion (a minimal computational sketch follows below), understanding their interpretation is crucial. When reporting the accuracy of a model based on a sample, it is best practice to include both the point estimate (the sample accuracy) and the associated confidence interval. For our example, reporting an accuracy of 0.8 with a 95% confidence interval of (0.72, 0.88) provides a much more complete picture of the model’s performance.
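Although the article treats the calculation as out of scope, a minimal sketch of one common approximation, the normal-approximation (Wald) interval for a proportion, shows how an interval like the one above can arise. The numbers mirror the running example; the constant 1.96 is the standard-normal quantile for 95% confidence.

```python
import math

def wald_ci(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation (Wald) confidence interval for a proportion."""
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

# Running example: sample accuracy 0.80 on 100 sampled patients.
lower, upper = wald_ci(p_hat=0.80, n=100)
print(f"95% CI: ({lower:.2f}, {upper:.2f})")  # -> 95% CI: (0.72, 0.88)
```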
Interpreting 95% Confidence
The concept of “95% confidence” is often misunderstood. It does not mean:
- There is a 95% probability that the true population accuracy (p) lies within this specific interval (0.72, 0.88).
- 95% of all possible sample accuracies would fall within this specific interval.
Instead, the interpretation of 95% confidence is more nuanced and relates to the process of repeated sampling:
- Imagine if we could repeatedly draw many different samples of 100 patients from the same hospital population.
- For each sample, we would calculate a new sample accuracy and its corresponding 95% confidence interval.
- If we were to plot these multiple sample accuracies (e.g., as circles) along with their respective confidence intervals (e.g., as bars), and also plot the true, but unobserved, population accuracy (p) as a dotted line, we would observe a pattern.
What the 95% confidence level signifies is that, in this hypothetical scenario of repeated sampling, approximately 95% of the confidence intervals generated would contain the true population accuracy (p). For example, if we took seven such samples, we would typically see six or all seven of the calculated intervals “capture” the true population accuracy line, while occasionally one would miss it.
Thus, the precise interpretation of 95% confidence is: In repeated sampling, this method produces intervals that include the true population accuracy in about 95% of samples.
In practice, we typically compute model performance on just one sample. For that single computed interval, saying we are “95% confident” that it contains the true population accuracy means it was produced by a procedure that succeeds about 95% of the time; we cannot definitively know whether this particular interval is one of the successes.
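This interpretation can be checked empirically. The sketch below assumes a true population accuracy of p = 0.78 (the value from the earlier thought experiment) and reuses the Wald interval from the sketch above; it repeatedly draws samples of 100 patients and counts how often the interval captures p.

```python
import math
import random

random.seed(0)
TRUE_P = 0.78    # assumed true population accuracy (unknown in practice)
N = 100          # patients per sample
Z = 1.96         # standard-normal quantile for 95% confidence
TRIALS = 10_000  # number of hypothetical repeated samples

covered = 0
for _ in range(TRIALS):
    # Simulate one sample: each patient's prediction is correct with prob TRUE_P.
    correct = sum(random.random() < TRUE_P for _ in range(N))
    p_hat = correct / N
    half_width = Z * math.sqrt(p_hat * (1 - p_hat) / N)
    covered += (p_hat - half_width <= TRUE_P <= p_hat + half_width)

print(f"Coverage: {covered / TRIALS:.3f}")  # close to 0.95 (Wald is approximate)
```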
The Impact of Sample Size on Confidence Interval Width
One significant factor influencing the width of a confidence interval—how close the lower and upper bounds are—is the sample size.
Consider again our initial sample of 100 patients. Now, imagine we drew another sample from the population, but this time with 500 patients, five times the size of our previous sample. We would expect a larger sample to provide a better, more precise estimate of the population accuracy.
While the sample accuracy might still be, for example, 0.8 for both the 100-patient and 500-patient samples, a critical difference emerges in their confidence intervals:
- The larger sample (500 patients) will yield tighter (narrower) confidence intervals.
- The smaller sample (100 patients) will result in wider confidence intervals.
A tighter confidence interval indicates that our estimate of the population accuracy is more precise. This demonstrates that a larger sample size generally leads to a better estimate of the true population accuracy because the range of likely values for ‘p’ becomes narrower.
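To make this concrete, here is a small sketch (same Wald approximation as before, used purely for illustration) comparing interval widths at the two sample sizes. Holding the sample accuracy at 0.8, quintupling the sample shrinks the width by a factor of √5 ≈ 2.24.

```python
import math

def wald_width(p_hat: float, n: int, z: float = 1.96) -> float:
    """Total width (upper minus lower bound) of the Wald interval."""
    return 2 * z * math.sqrt(p_hat * (1 - p_hat) / n)

for n in (100, 500):
    print(f"n = {n}: width = {wald_width(0.8, n):.3f}")
# n = 100: width = 0.157
# n = 500: width = 0.070
```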
Conclusion
Confidence intervals are an indispensable tool in evaluating medical models, particularly when it’s impractical to test on an entire population. They enable us to use results from a sample to express the range within which we can be reasonably confident the true population accuracy lies. By providing this vital measure of variability and uncertainty, confidence intervals contribute significantly to a more comprehensive and robust evaluation of medical models, moving beyond single-point estimates toward a real understanding of how a model can be expected to perform on patients.
Core Concepts
- Population Accuracy (p): The true, often unobservable, performance metric of a medical AI model if it were tested on every single patient in a hospital or target demographic.
- Sample Accuracy: The observed performance metric of a medical AI model when evaluated on a smaller, representative subset of the overall patient population.
- Confidence Interval (CI): A calculated range of values derived from sample data that is likely to contain the true, unknown population parameter with a specified level of confidence.
- Confidence Level: The long-run proportion, expressed as a percentage (e.g., 95%), of calculated confidence intervals that would include the true population parameter if the sampling process were repeated many times.
- Width of Confidence Interval: The range between the lower and upper bounds of a confidence interval, which indicates the precision of the estimate derived from the sample, with narrower intervals suggesting greater precision.
Concept Details and Examples
Population Accuracy (p)
- Detailed Explanation: Population accuracy represents the definitive performance of an AI model if it could be tested on every single individual within its intended deployment population. Because testing on an entire population is almost always impractical or impossible, population accuracy remains an unknown but crucial target parameter we aim to estimate. It serves as the “ground truth” that sample-based evaluations try to approximate.
- Examples:
- Example 1: For a chest X-ray model, the population accuracy would be its true accuracy if it were applied to every single patient who receives a chest X-ray across all hospitals globally, or within a specific hospital over its entire operational history.
- Example 2: If an AI model is designed to detect diabetic retinopathy, its population accuracy would be its performance if it were applied to the retinal scans of every person diagnosed with or at risk of diabetes worldwide.
- Common Pitfalls/Misconceptions: Believing that the sample accuracy is the population accuracy, rather than just an estimate. It’s easy to forget that the true population value is generally unobservable.
Sample Accuracy
- Detailed Explanation: Sample accuracy is the practical way we assess a medical AI model’s performance. By selecting a carefully chosen subset of the overall patient population, we run the model on this sample and calculate its performance metrics. This sample accuracy then serves as our best estimate for the unknown population accuracy, providing a tangible number that reflects the model’s current performance on available data.
- Examples:
- Example 1: Evaluating a chest X-ray model on 100 randomly selected patient scans from a hospital to estimate its overall performance, yielding a sample accuracy of 0.80.
- Example 2: Testing an AI-powered ECG analysis tool on 500 patient ECGs collected from a clinical trial, resulting in a sample F1-score of 0.85 for arrhythmia detection.
- Common Pitfalls/Misconceptions: Assuming that a high sample accuracy automatically means the model will perform similarly well on all future, unseen patients without considering the variability implied by the sample size or sampling method.
Confidence Interval (CI)
- Detailed Explanation: A Confidence Interval provides a range around a sample estimate (like sample accuracy) to quantify the uncertainty associated with that estimate. It allows us to express how confident we are that this range contains the true, unknown population parameter. CIs are vital for understanding the reliability and precision of a model’s reported performance in a real-world setting.
- Examples:
- Example 1: For a chest X-ray model with a sample accuracy of 0.80, a 95% CI of [0.72, 0.88] means we are 95% confident that the true population accuracy lies somewhere between 0.72 and 0.88.
- Example 2: A melanoma detection AI model tested on a sample has a sensitivity of 0.92 with a 90% CI of [0.89, 0.95]. This suggests that the true sensitivity in the population is likely within this tighter range.
- Common Pitfalls/Misconceptions:
- Misconception 1: “There is a 95% probability that the population accuracy (p) lies within this specific interval.” This is incorrect; once the interval is calculated, p either is or isn’t in it. The 95% refers to the method over repeated sampling.
- Misconception 2: “95% of future sample accuracies will fall within this interval.” This is also incorrect; the CI is about the population parameter, not other sample estimates.
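One practical note, going slightly beyond the lesson: the simple normal-approximation (Wald) interval sketched earlier can misbehave for small samples or accuracies near 0 or 1. The Wilson score interval is a commonly recommended alternative; a minimal sketch of its closed form follows (an illustrative addition, not a method prescribed by the article).

```python
import math

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score confidence interval for a binomial proportion."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    return center - half_width, center + half_width

# 80 correct predictions out of 100 sampled patients.
lower, upper = wilson_ci(80, 100)
print(f"({lower:.3f}, {upper:.3f})")  # -> (0.711, 0.867)
```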
Confidence Level
- Detailed Explanation: The confidence level (e.g., 95%) defines the long-run success rate of the method used to construct confidence intervals. It quantifies our trust in the process: if we were to repeatedly draw samples from the population and construct a CI for each, then 95% of those intervals would capture the true population parameter. It’s a statement about the reliability of the estimation procedure, not a probability about a single, already calculated interval.
- Examples:
- Example 1: If a research team calculates 100 different 95% confidence intervals for a new diagnostic model’s specificity, based on 100 different patient samples, approximately 95 of those intervals are expected to contain the model’s true population specificity.
- Example 2: When reporting a 90% confidence interval for a model’s F-score, it implies that if this procedure were replicated many times, 90% of the resulting intervals would enclose the actual F-score of the model on the entire population.
- Common Pitfalls/Misconceptions: Interpreting “95% confident” as a 95% chance that the specific interval derived from one sample contains the population parameter. Instead, it refers to the reliability of the method over many hypothetical repetitions.
Width of Confidence Interval
- Detailed Explanation: The width of a confidence interval (upper bound - lower bound) directly reflects the precision of our sample-based estimate of the population parameter. A narrower interval indicates a more precise estimate, suggesting less uncertainty about where the true population value lies. Conversely, a wider interval implies greater uncertainty or less precision. This width is primarily influenced by sample size and the variability within the sample.
- Examples:
- Example 1: A chest X-ray model’s accuracy on a small sample of 100 patients might yield a 95% CI of [0.70, 0.90] (width 0.20), while on a larger sample of 500 patients, it might be [0.77, 0.83] (width 0.06). The narrower interval from the larger sample shows a more precise estimate.
- Example 2: If an AI model for predicting disease progression has a mean absolute error (MAE) of 1.5 months with a 95% CI of [0.5, 2.5] (width 2.0 months), it’s less precise than if it had a CI of [1.3, 1.7] (width 0.4 months), even if the mean MAE is the same.
- Common Pitfalls/Misconceptions: Assuming that a wide interval means the model is bad. A wide interval simply means the estimate is less precise due to limited data or high variability, not necessarily that the model’s true performance is poor (though it could be anywhere within that wide range).
Application Scenario
Imagine a startup developing an AI model to detect early-stage glaucoma from fundus images. They have developed a promising prototype and need to evaluate its performance before seeking regulatory approval.
To apply the lesson’s concepts, they would first define the target population (e.g., all patients undergoing eye exams). Since testing on the entire population is infeasible, they would draw a representative sample of fundus images and calculate the model’s sample performance. Alongside these point estimates, they would compute confidence intervals for key metrics such as sensitivity and specificity, using the width of each interval to judge how precise the estimates are and to convey the range within which the true population performance is likely to lie.
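As a concrete sketch of that workflow (all counts below are hypothetical), note that sensitivity and specificity are each proportions over a subset of the sample, so each gets its own interval with its own effective sample size:

```python
import math

def wald_ci(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation 95% CI for a proportion."""
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

# Hypothetical evaluation of the glaucoma model on 400 fundus images.
tp, fn = 88, 12    # 100 images from eyes with confirmed glaucoma
tn, fp = 270, 30   # 300 images from eyes without glaucoma

sensitivity = tp / (tp + fn)  # fraction of diseased eyes correctly flagged
specificity = tn / (tn + fp)  # fraction of healthy eyes correctly cleared

print("sensitivity:", sensitivity, wald_ci(sensitivity, tp + fn))
print("specificity:", specificity, wald_ci(specificity, tn + fp))
```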
Quiz
- Multiple Choice: What is the primary reason we use confidence intervals when evaluating medical AI models, rather than just reporting sample accuracy?
  a) To make the results appear more statistically significant.
  b) To show the exact range where all future model predictions will fall.
  c) To quantify the uncertainty and variability in our estimate of the true population performance.
  d) To prove that the model’s accuracy is exactly what we observed in the sample.
- True/False: If a 95% confidence interval for a model’s accuracy is [0.80, 0.90], it means there is a 95% probability that the model’s true population accuracy is within this specific interval.
- Short Answer: You evaluated a medical AI model on a sample of 200 patients and found its accuracy to be 0.85 with a 95% confidence interval of [0.81, 0.89]. Later, you get access to a larger sample of 1000 patients and re-evaluate the model, finding an accuracy of 0.86. What would you expect to happen to the width of the confidence interval, and why?
- Multiple Choice: Which of the following is the most accurate interpretation of a 95% confidence level in the context of model evaluation?
  a) We are 95% sure that our specific sample accuracy is correct.
  b) If we repeat the sampling process many times, about 95% of the calculated confidence intervals will contain the true population parameter.
  c) There is a 95% chance that the next patient tested will fall within the calculated accuracy range.
  d) The model will perform correctly on 95% of patients.
Answers
- c) To quantify the uncertainty and variability in our estimate of the true population performance.
- Explanation: Confidence intervals provide a range that accounts for the fact that a sample estimate is unlikely to be exactly the true population value. They give a measure of precision and reliability of the estimate.
- False.
- Explanation: This is a common misconception. The 95% confidence refers to the reliability of the method of constructing the interval over repeated sampling. Once a specific interval is calculated, the true population accuracy is either within it or not; there’s no probability associated with that particular interval anymore. It means that if you were to repeat the sampling and interval calculation process many times, about 95% of those intervals would contain the true population accuracy.
- Explanation: You would expect the width of the confidence interval to decrease (become narrower). This is because a larger sample size provides more information about the population, leading to a more precise estimate of the true population accuracy and thus a tighter confidence interval.
- b) If we repeat the sampling process many times, about 95% of the calculated confidence intervals will contain the true population parameter.
- Explanation: This correctly defines the frequentist interpretation of a confidence level. It describes the long-run behavior of the interval estimation procedure, not a probability about a single interval or single patient performance.