
Checking your Model Performance

Evaluating Medical Diagnosis Models: Key Considerations for Data Splitting and Ground Truth

Developing effective medical diagnosis models requires not only robust training but also rigorous testing. This article explores the critical aspects of evaluating machine learning models in a medical context, focusing on the proper use of training, validation, and test sets, and the essential role of strong ground truth. We will delve into three common challenges encountered when building and evaluating these datasets in medicine and offer practical solutions.

Understanding Data Set Splits for Model Evaluation

When applying machine learning to a dataset, the standard practice is to initially divide it into a training set and a test set.

  • The training set is used for the development and selection of models.
  • The test set is reserved for the final reporting of model performance.

In practice, this training set is often further subdivided into a smaller training set and a validation set:

  • The training set is used to learn the model’s parameters.
  • The validation set (also known as a tuning set or dev set) is used for hyperparameter tuning and to provide an estimate of the model’s performance on unseen data, effectively mimicking the test set’s distribution.

Sometimes, the split into a training and validation set is performed multiple times using a method called cross-validation. This technique helps reduce variability in the estimate of model performance by training and validating on different subsets of the data.

While terminology varies across sources (e.g., the training set is sometimes called the development set, and the test set the holdout set), for clarity we will consistently use training set, validation set, and test set.
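To make the splitting procedure concrete, here is a minimal sketch in Python using scikit-learn on a toy pandas DataFrame. The column names, the roughly 70/15/15 split, and the choice of 5-fold cross-validation are illustrative assumptions, not requirements of the approach; the challenges below refine this basic recipe (splitting by patient and stratifying by class).

```python
import pandas as pd
from sklearn.model_selection import KFold, train_test_split

# Toy stand-in for a real dataset: one row per image with a binary label.
df = pd.DataFrame({
    "image_id": range(100),
    "label": [1 if i % 10 == 0 else 0 for i in range(100)],  # ~10% positive
})

# Hold out the test set first (about 15% of the data).
train_val_df, test_df = train_test_split(df, test_size=0.15, random_state=0)

# Carve a validation set out of the remainder (about 15% of the original data).
train_df, val_df = train_test_split(
    train_val_df, test_size=0.15 / 0.85, random_state=0
)

# Alternatively, replace the single train/validation split with K-fold
# cross-validation to reduce variance in the performance estimate.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, val_idx) in enumerate(kfold.split(train_val_df)):
    fold_train = train_val_df.iloc[train_idx]
    fold_val = train_val_df.iloc[val_idx]
    # ... train on fold_train, tune on fold_val ...
```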

Challenges in Building Medical Datasets for Model Testing

Building robust datasets for testing medical AI models presents unique challenges, particularly concerning data independence, sampling strategy, and ground truth determination.

Challenge 1: Ensuring Independent Test Sets (Patient Overlap)

A significant problem in medical imaging datasets is patient overlap, where images from the same patient inadvertently appear in different data splits (e.g., training and test sets).

The Problem: Consider a scenario where a patient has two X-rays taken, one in June and one in November, both times wearing a distinctive necklace. If the June X-ray is in the training set and the November X-ray is in the test set, a deep learning model might unintentionally memorize specific, rare, or unique aspects of the training data—such as the necklace—to predict the correct outcome (“normal”). While the model appears to perform well on the test set, it’s not due to genuine understanding of the underlying medical condition but rather to memorization of the patient’s unique characteristics. This leads to an overly optimistic test set performance, making the model appear more capable than it truly is.

Illustration of Patient Overlap in Traditional Splitting: When a dataset is split traditionally by randomly assigning images to sets, X-rays from the same patient can end up in different sets. For instance, X-ray 1 (Patient 20) might be in the training set, X-ray 2 (Patient 20) in validation, and X-ray 0 (Patient 20) in the test set. This demonstrates the problem of patient overlap.

The Solution: Splitting by Patient

To tackle patient overlap, ensure that all X-rays belonging to the same patient are contained within a single set (training, validation, or test).

  • For example, if X-rays 0, 1, and 2 all belong to Patient 20, they would all be placed in the training set. Similarly, all X-rays for Patient 32 might be in the validation set, and neither Patient 20 nor Patient 32 would appear in the test set.

By splitting data by patient, if the model memorizes a patient’s unique feature (like a necklace), it won’t gain an unfair advantage on the test set because that patient, with their unique features, will not be present in the test set. This ensures a more accurate and generalizable assessment of model performance.
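One way to implement the patient-level split is to group images by a patient identifier before assigning them to sets. The sketch below uses scikit-learn's GroupShuffleSplit on a toy DataFrame; the patient_id column name, the split sizes, and the example data are assumptions made for illustration.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy dataset: several X-rays per patient; patient_id is the grouping key.
df = pd.DataFrame({
    "image_id": range(12),
    "patient_id": [20, 20, 20, 32, 32, 5, 5, 7, 7, 7, 11, 11],
    "label": [0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0],
})

# Split off a test set while keeping every patient's images together.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_val_idx, test_idx = next(splitter.split(df, groups=df["patient_id"]))
train_val_df, test_df = df.iloc[train_val_idx], df.iloc[test_idx]

# Repeat on the remainder to carve out a validation set.
splitter_val = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, val_idx = next(
    splitter_val.split(train_val_df, groups=train_val_df["patient_id"])
)
train_df, val_df = train_val_df.iloc[train_idx], train_val_df.iloc[val_idx]
```

Because each patient identifier lands in exactly one split, a memorized patient-specific feature cannot carry over from training to testing.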

Challenge 2: Strategic Set Sampling

Randomly sampling a test set, especially when dealing with medical data that often has a small total dataset size and rare disease examples, can lead to an unrepresentative test set.

The Problem: If a test set is randomly sampled (e.g., 10% of the full dataset), there’s a risk that it might not include any examples of the minority class—the class for which there are few examples, such as a specific disease. For instance, if the disease “mass present” is rare, a random sample of hundreds of examples might not include any patients with a mass. This would make it impossible to evaluate the model’s performance on positive cases. Human comparison studies, where test sets are annotated by human readers, face a similar bottleneck in size, making this issue particularly relevant.

The Solution: Stratified Sampling for Minority Classes

One effective strategy for creating test sets is stratified sampling, which ensures a minimum percentage of examples from the minority class.

  • A common approach is to sample the test set such that at least X% of examples belong to the minority class. For medical conditions, where one class (e.g., “mass present”) is significantly less frequent than another (“mass not present”), a common choice for X is 50%.
  • Thus, for a test set of 100 examples, 50 examples would be “mass present” and 50 would be “mass not present.” This guarantees sufficient numbers to reliably estimate the model’s performance on both disease and non-disease cases.

Sampling Order (see the sketch after this list):

  1. Test set: Sampled first using the stratified strategy to ensure minority class representation.
  2. Validation set: Sampled next. To ensure the validation set reflects the distribution of the test set, the same stratified sampling strategy is typically applied (e.g., 50 examples of “mass present” and 50 of “mass not present” for a 100-example validation set).
  3. Training set: The remaining patients are included in the training set. Due to the artificial sampling of the test and validation sets to include a large fraction of minority class examples, the training set will naturally have a much smaller fraction of these examples. However, models can still be trained effectively in the presence of imbalanced data.
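A minimal sketch of this sampling order on a toy imbalanced dataset is shown below. The class counts (50 per class), the column names, and the sample_balanced helper are illustrative assumptions; in practice this sampling would be combined with the patient-level split described above.

```python
import pandas as pd

# Toy dataset in which the positive class ("mass present" = 1) is rare (~6%).
df = pd.DataFrame({
    "image_id": range(2000),
    "label": [1 if i < 120 else 0 for i in range(2000)],
})

def sample_balanced(pool: pd.DataFrame, n_per_class: int, seed: int) -> pd.DataFrame:
    """Draw n_per_class positive and n_per_class negative examples from `pool`."""
    pos = pool[pool["label"] == 1].sample(n=n_per_class, random_state=seed)
    neg = pool[pool["label"] == 0].sample(n=n_per_class, random_state=seed)
    return pd.concat([pos, neg])

# 1. Sample the test set first: 50 "mass present" and 50 "mass not present".
test_df = sample_balanced(df, n_per_class=50, seed=0)
remaining = df.drop(test_df.index)

# 2. Sample the validation set the same way so it mirrors the test set.
val_df = sample_balanced(remaining, n_per_class=50, seed=1)

# 3. Everything left over becomes the (heavily imbalanced) training set.
train_df = remaining.drop(val_df.index)

# Fractions of positives: 0.5 in test and validation, roughly 0.01 in training.
print(test_df["label"].mean(), val_df["label"].mean(), train_df["label"].mean())
```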

Challenge 3: Establishing Ground Truth

A critical question in model testing is how to determine the correct label for an example, known as the ground truth in machine learning or the reference standard in medicine.

The Problem: Interobserver Disagreement

In medical settings, differentiating between diseases can be complex, leading to interobserver disagreement—where different human experts provide different interpretations for the same case. For example, one expert might diagnose pneumonia from a chest X-ray, while another diagnoses a different condition. This disagreement poses a challenge for setting a definitive ground truth for algorithm evaluation.

Methods for Determining Ground Truth:

  1. Consensus Voting Method: This method leverages a group of human experts to collectively determine the ground truth.

    • Majority Vote: Multiple radiologists (e.g., three) independently review an X-ray for pneumonia. If two out of three agree on “yes,” then “yes” is assigned as the ground truth. Generally, the majority vote determines the answer (a brief code sketch appears at the end of this section).
    • Group Discussion: Alternatively, the experts can convene to discuss their interpretations until they reach a single, agreed-upon decision, which then serves as the ground truth.
  2. Using a More Definitive Test: This method relies on additional, more conclusive medical tests to provide definitive diagnostic information.

    • CT Scan for Chest X-rays: To confirm a suspected mass on a chest X-ray, a CT scan can be performed. The CT scan provides a 3D view of the abnormality, offering more information. If a mass is confirmed on the CT, that diagnosis is assigned as the ground truth for the corresponding chest X-ray.
    • Skin Lesion Biopsy for Dermatology: In dermatology studies, the ground truth for skin lesion images might be determined by a skin lesion biopsy. This procedure involves removing a tissue sample for laboratory testing. The lab results then establish the ground truth for the photographic image.

Limitations of Definitive Tests: The difficulty with the second method is that these additional, more definitive tests are not always available for every patient in an existing dataset. Not every patient receiving a chest X-ray also gets a CT scan, and not every suspicious skin lesion leads to a biopsy. Consequently, for many existing medical datasets, obtaining a reliable ground truth often necessitates using the first method—getting a consensus ground truth from the existing data through expert review. This is a common strategy in many medical AI studies.
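Because consensus labels of this kind are so common, it is worth seeing how simple the mechanics are. The sketch below computes a majority-vote ground truth from three hypothetical radiologist reads per image; the image identifiers, the labels, and the majority_vote helper are illustrative assumptions rather than part of any particular study's pipeline.

```python
from collections import Counter

# Each entry holds independent "yes"/"no" pneumonia reads from three radiologists.
readings = {
    "xray_001": ["yes", "yes", "no"],
    "xray_002": ["no", "no", "no"],
    "xray_003": ["yes", "no", "yes"],
}

def majority_vote(labels):
    """Return the most common label; an odd number of binary reads cannot tie."""
    return Counter(labels).most_common(1)[0][0]

ground_truth = {image: majority_vote(votes) for image, votes in readings.items()}
print(ground_truth)  # {'xray_001': 'yes', 'xray_002': 'no', 'xray_003': 'yes'}
```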

Conclusion

Effectively testing medical diagnosis models requires careful consideration of how data is prepared and how ground truth is established. By addressing challenges such as patient overlap, unrepresentative sampling, and interobserver disagreement, we can ensure that model performance evaluations are robust, reliable, and reflect real-world generalization capabilities. Key strategies include splitting datasets by patient, employing stratified sampling for minority classes, and utilizing methods like consensus voting or definitive diagnostic tests to establish accurate ground truth. Implementing these practices is fundamental to developing high-performance, trustworthy AI models for medical applications.

Core Concepts

  • Training, Validation, and Test Sets: Distinct subsets of a dataset used respectively for model development, hyperparameter tuning, and final performance reporting.
  • Ground Truth (Reference Standard): The correct or true label for an example, essential for evaluating the accuracy of AI models in medical diagnosis.
  • Patient Overlap: A critical pitfall where data from the same patient appears in different dataset splits (e.g., training and test sets), potentially leading to overly optimistic model performance estimates.
  • Set Sampling (Minority Class): A strategy to ensure sufficient representation of rare or positive disease cases within validation and test sets, especially important in imbalanced medical datasets.
  • Consensus Voting: A method for establishing ground truth by aggregating the majority opinion of multiple human experts to resolve interobserver disagreement.
  • Definitive Test: A method for establishing ground truth by using a more accurate or conclusive medical procedure or test than the one being evaluated.

Concept Details and Examples

Training, Validation, and Test Sets

Detailed Explanation: In machine learning, datasets are divided into training, validation, and test sets to ensure robust model evaluation and development. The training set is used for the model to learn patterns and parameters, the validation set is crucial for tuning hyperparameters and providing an early estimate of performance, and the test set offers an unbiased final evaluation of the model’s generalization ability on truly unseen data. This structured approach helps prevent overfitting and ensures a reliable assessment of real-world performance.

Examples:

  1. Chest X-ray Interpretation: A large collection of chest X-rays is split, with 70% used to train the deep learning model to identify pneumonia (training), 15% used to tune hyperparameters such as the learning rate and to compare candidate architectures (validation), and the remaining 15% reserved for a final, untouched assessment of real-world performance (test).
  2. Diabetic Retinopathy Detection: An ophthalmological dataset of retinal images is divided. Images used for training the model to detect microaneurysms are in the training set. Images used to decide the optimal model architecture and number of training epochs are in the validation set. Images never seen during development, but used for the final accuracy report published in a journal, are in the test set.

Pitfalls/Misconceptions: A common pitfall is ‘data leakage,’ where information from the validation or test set unintentionally influences the model during training, leading to inflated performance metrics. Misconception: The validation set is only for hyperparameter tuning; it also provides an early estimate of performance on unseen data before the final test set evaluation.

Ground Truth (Reference Standard)

Detailed Explanation: Ground truth refers to the indisputably correct label or diagnosis for a given medical example, serving as the benchmark against which an AI model’s predictions are compared. Establishing robust ground truth is crucial for accurate model evaluation, especially in medical AI where diagnostic precision is paramount. Without reliable ground truth, it’s impossible to objectively assess a model’s effectiveness and reliability.

Examples:

  1. Skin Lesion Classification: For a dataset of dermatoscopic images, the ground truth for whether a lesion is benign or malignant is definitively established by a skin biopsy, which is then analyzed by a pathologist under a microscope.
  2. Brain Tumor Segmentation: In MRI scans, the ground truth for tumor boundaries might be meticulously outlined by a neuro-radiologist, potentially with confirmation from surgical pathology reports, serving as the correct segmentation map for an AI model to learn from.

Pitfalls/Misconceptions: Pitfall: Assuming human expert labels are always 100% accurate without considering inter-observer variability or the need for consensus. Misconception: Ground truth is always objective and easy to obtain; in medicine, it can often be subjective, complex, and resource-intensive to acquire.

Patient Overlap

Detailed Explanation: Patient overlap occurs when different medical images or data points from the same individual are inadvertently distributed across different splits of a dataset (e.g., some in training, others in testing). This issue can lead to an artificially inflated assessment of a model’s performance, as the model might “memorize” patient-specific features rather than learning generalizable patterns, making it perform unrealistically well on familiar patients in the test set.

Examples:

  1. Multiple X-rays from the Same Patient: A patient has two chest X-rays taken, one in March and one in September. If the March X-ray is in the training set and the September X-ray (from the same patient) is in the test set, the model might learn unique patient features (like a specific body habitus or an old scar) that help it predict accurately in the test set, rather than truly generalizing to new patients.
  2. Follow-up Scans: A patient undergoing cancer treatment has CT scans before and after therapy. If the pre-treatment scan is used for training and the post-treatment scan for testing, the model might leverage the patient’s unique anatomy or persistent artifacts, leading to a misleadingly high performance on that specific patient.

Pitfalls/Misconceptions: Pitfall: Randomly splitting images without explicitly accounting for patient IDs, which is a common mistake when not specifically addressed. Misconception: Memorization only happens with rare features; models can memorize common patient traits too, still leading to over-optimistic results.
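A lightweight guard against this pitfall, sketched below, is to check after splitting that no patient identifier appears in more than one set. The find_patient_overlap helper and the example ID lists are hypothetical.

```python
def find_patient_overlap(split_a_ids, split_b_ids):
    """Return the set of patient IDs present in both splits (ideally empty)."""
    return set(split_a_ids) & set(split_b_ids)

# Hypothetical patient ID lists for two splits; patient 20 has leaked into the test set.
train_patients = [20, 32, 5, 7]
test_patients = [11, 20]

overlap = find_patient_overlap(train_patients, test_patients)
if overlap:
    print(f"Patient overlap detected: {sorted(overlap)}")
```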

Set Sampling (Minority Class)

Detailed Explanation: Set sampling, particularly for minority classes, is a technique used during dataset splitting to ensure that rare but critical examples (like positive disease cases) are sufficiently represented in the validation and test sets. This addresses the challenge of imbalanced medical datasets where positive cases are often scarce, preventing evaluation sets from lacking the very examples needed to assess a model’s performance on the target condition.

Examples:

  1. Rare Disease Detection: If a dataset of medical images contains only 1% positive cases for a rare disease, a purely random split might result in a test set with very few or no positive examples. Instead, set sampling ensures that 50% of the test set examples are positive cases, allowing for proper evaluation of the model’s ability to detect the rare disease.
  2. Cancer Screening Model: For a mammography dataset, breast cancer cases are rare. When creating a test set, instead of a purely random selection, a stratified sampling approach is used to guarantee that a pre-defined percentage (e.g., 40-50%) of the images in the test and validation sets contain confirmed cancer.

Pitfalls/Misconceptions: Pitfall: Creating an evaluation set that is too small or too skewed, leading to high variance in performance estimates. Misconception: If the training set is imbalanced, the test set must also reflect that imbalance for a realistic evaluation; while the overall distribution might be imbalanced, the test set should be balanced for robust evaluation of both positive and negative cases.

Consensus Voting

Detailed Explanation: Consensus voting is a method for establishing a reliable ground truth by aggregating the interpretations or diagnoses of multiple human experts. When faced with inter-observer disagreement (where experts provide different diagnoses for the same case), the majority decision among a panel of experts is typically adopted as the definitive label, thereby creating a more robust and less subjective ground truth for model training and evaluation.

Examples:

  1. Radiologist Agreement: Three radiologists independently review a chest X-ray for pneumonia. If Radiologist A says “yes,” Radiologist B says “yes,” and Radiologist C says “no,” the consensus (majority vote) would be “yes, pneumonia present,” which then serves as the ground truth.
  2. Pathologist Review: For slide images of tissue biopsies, two pathologists might initially disagree on the grade of a tumor. They then engage in a discussion, jointly reviewing the slide until they reach a mutually agreed-upon single decision on the tumor grade, which becomes the ground truth.

Pitfalls/Misconceptions: Pitfall: Not all expert opinions are equally valid; sometimes an outlier opinion might be correct, or a weak consensus might obscure true ambiguity. Misconception: Consensus voting always requires experts to discuss; it can simply be a majority vote without direct discussion, though direct discussion can lead to more robust agreement.

Definitive Test

Detailed Explanation: A definitive test refers to the use of a more accurate, conclusive, or gold-standard medical procedure to establish ground truth, especially when initial assessments are ambiguous or less reliable. This method provides objective additional information to confirm a diagnosis, offering a higher degree of certainty for labeling examples in a dataset than relying solely on the primary imaging or clinical finding.

Examples:

  1. Chest X-ray vs. CT Scan: To confirm the presence of a mass suspected on a chest X-ray, a CT scan is performed. If the CT scan definitively shows a mass, that finding establishes the ground truth for the original chest X-ray, even if the X-ray itself was equivocal.
  2. Skin Lesion Photo vs. Biopsy: For an AI model evaluating smartphone photos of skin lesions, the ground truth for malignancy is determined not just by visual inspection but by a skin biopsy, where a tissue sample is sent to a lab for microscopic pathological examination, which is the gold standard.

Pitfalls/Misconceptions: Pitfall: Definitive tests are often invasive, expensive, or not routinely performed on all patients, making their widespread availability for dataset labeling challenging. Misconception: If a definitive test exists, it must always be used for ground truth; practical constraints (cost, patient invasiveness, data availability) often necessitate using consensus voting or other methods instead.

Application Scenario

A research team is developing an AI model to detect early-stage glaucoma from retinal fundus images, a condition often missed in routine screenings. They have collected thousands of images but face challenges in ensuring their model generalizes well and can accurately identify the relatively rare positive cases.

To address this, they first ensure no patient overlap across their training, validation, and test sets by assigning all images from a single patient to only one split. Next, they employ set sampling to ensure that their validation and test sets contain a balanced proportion (e.g., 50%) of confirmed glaucoma cases, even though glaucoma is rare in the general population. Finally, for ground truth labeling, they use consensus voting among three experienced ophthalmologists to resolve diagnostic disagreements for each image, supplementing this where possible with definitive tests like visual field test results or OCT scans for ambiguous cases.

Quiz

  1. Multiple Choice: What is the primary reason to avoid patient overlap between training and test datasets in medical AI? A) To reduce computational training time. B) To ensure the model learns generalizable patterns rather than memorizing patient-specific features. C) To simplify the data annotation process. D) To prevent data corruption.

  2. Short Answer: Explain why “set sampling for minority classes” is particularly important when creating test sets for medical AI models, especially for rare diseases.

  3. True/False: If three radiologists vote on the presence of pneumonia in a chest X-ray (two say “yes,” one says “no”), and the consensus voting method is used, the ground truth for that X-ray will be “no pneumonia.” A) True B) False

  4. Short Answer: Describe one advantage and one disadvantage of using a “definitive test” (e.g., a biopsy) versus “consensus voting” among experts to establish ground truth in a medical AI dataset.


ANSWERS

  1. B) To ensure the model learns generalizable patterns rather than memorizing patient-specific features.

    • Explanation: Patient overlap leads to data leakage, where the model might “memorize” unique characteristics of a patient present in the training set, leading to an overly optimistic performance on the same patient in the test set, thus failing to assess true generalization to new, unseen patients.
  2. Explanation: In medical datasets, diseases are often rare (minority class). Purely random sampling might result in validation or test sets with very few or no positive disease cases. Set sampling ensures a sufficient number of minority class examples are present in these sets, allowing a reliable, robust evaluation of the model’s performance on the disease of interest rather than only on healthy cases; without it, the model’s ability to detect the disease could not be evaluated at all.

  3. B) False

    • Explanation: With consensus voting, the majority opinion dictates the ground truth. If two out of three experts say “yes,” the majority is “yes,” meaning the ground truth would be “pneumonia present.”
  4. Explanation:

    • Advantage of Definitive Test: Provides a highly objective and often more accurate ground truth, as it relies on a gold-standard diagnostic procedure that is less susceptible to subjective interpretation compared to human consensus.
    • Disadvantage of Definitive Test: Often invasive, expensive, time-consuming, or not routinely available for all patients in a dataset, making it impractical for large-scale data labeling compared to utilizing existing expert opinions or readily available clinical data.