Before an algorithm can be deployed to a clinical setting, it’s essential to test its performance.
This performance evaluation involves asking questions such as ‘is the algorithm capable of correctly recognising the presence of a tumour in a scan?’ These questions are ‘asked’ using mathematical tests that measure how ‘accurate’ a particular algorithm is – for example, by calculating how many times it correctly recognises a tumour and comparing this with how many times it misses the tumour.
The Concordance Index (or C-index) is one of a number of mathematical tests that can be used depending on the algorithm being assessed. It may be used to evaluate the performance of a predictive algorithm trained to recognise the presence of a tumour. It gives an algorithm multiple pairs of patients – one who has a tumour and one who doesn’t – and measures how many times the algorithm correctly predicts which of the pair has the tumour and which does not.
In practice, because all performance metrics have different strengths and weaknesses, it’s common for data scientists to test an algorithm using multiple performance metrics in addition to the C-index.
There is, however, a difference between testing the performance of an algorithm in a ‘lab’ setting as described above and testing it in real-life environments. An algorithm that seems to perform well according to the C-index when only being used to ‘diagnose’ patients retrospectively on very clean data might not perform so well in a hospital where IT equipment might be out of date or the population of patients might be different to those whose data was used to train and test the algorithm. This is why it is important to test an algorithm both mathematically in ‘the lab’ and practically ‘in the clinic’, to see whether the performance level stays the same.
Testing the performance of an AI algorithm is an essential component of the penultimate step in the algorithm development pipeline: evaluation. Robust evaluation must be completed before an algorithm can be safely deployed (or implemented) into a clinical setting. This typically involves multiple stages and types of evaluation, the first of which is statistical performance evaluation.
The list of available performance metrics is extremely long. Each metric has different strengths and weaknesses and is more or less suited to evaluating a particular type of algorithm, i.e., whether it is a classification, prediction, or regression algorithm.
Classification algorithms have binary outputs (e.g., 1 = disease present, 0 = disease absent) and are evaluated retrospectively with a range of performance metrics that compare the classification provided by the algorithm (e.g., 1) with the actual observed classification (e.g., 0).
The starting point for all these metrics is the confusion matrix:
|                                  | Disease present (actual) | Disease absent (actual) |
|----------------------------------|--------------------------|-------------------------|
| Algorithm output: 1 (disease)    | True positive (TP)       | False positive (FP)     |
| Algorithm output: 0 (no disease) | False negative (FN)      | True negative (TN)      |

A confusion matrix is used to assess the performance of a model in healthcare settings.
The four values in the confusion matrix underpin a range of performance metrics:
- Accuracy
This measures the overall performance of the algorithm. It is the ratio between the number of correctly classified patients (i.e., the total number of true positives and true negatives) and the total number of patients in the dataset. Performance is measured on a scale from 0 to 1, where 1 represents perfect performance (i.e., all patients are classified correctly) and 0 represents failed performance (i.e., no patients are classified correctly).
- Recall (also known as Sensitivity)
This measures the ability of the algorithm to recognise disease. It is the ratio between the number of patients correctly classified as having the disease (i.e., the number of true positives) and the total number of patients who actually have the disease (i.e., the total number of true positives and false negatives).
- Specificity
This is the counterpart of recall for the negative class: it measures the ability of the algorithm to recognise the absence of disease. It is the ratio between the number of patients correctly classified as not having the disease (i.e., the number of true negatives) and the total number of patients who actually do not have the disease (i.e., the total number of true negatives and false positives).
- Precision
This measures how reliable the algorithm’s class assignments are. It is the ratio between correctly classified patients (i.e., either true positives or true negatives) and all the patients assigned to one class (i.e., either true positives + false positives, or true negatives + false negatives). When the precision of the positive class is being evaluated, the output of the calculation is the positive predictive value (PPV), and when the precision of the negative class is being evaluated, the output is the negative predictive value (NPV).
- F1 Score
This is a measure of the balance between precision and recall, calculated as the harmonic mean of the two: F1 = 2 × (precision × recall) / (precision + recall).
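As a concrete illustration, the confusion-matrix counts and the metrics above can be sketched in a few lines of pure Python. The patient labels below are invented for illustration, not real data:

```python
def binary_metrics(actual, predicted):
    """Compare predicted labels (1 = disease present, 0 = absent) to actual labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # true positives
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # true negatives
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false positives
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # false negatives
    return {
        "accuracy": (tp + tn) / len(pairs),
        "recall": tp / (tp + fn),           # sensitivity
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),        # positive predictive value (PPV)
        "npv": tn / (tn + fn),              # negative predictive value
        "f1": 2 * tp / (2 * tp + fp + fn),  # harmonic mean of precision and recall
    }

# Ten hypothetical patients: four with the disease, six without
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
print(binary_metrics(actual, predicted))  # accuracy 0.8, recall 0.75, f1 0.75, ...
```

With one missed case (a false negative) and one false alarm (a false positive), the algorithm scores well on accuracy but its recall of 0.75 shows it misses a quarter of the patients who actually have the disease – exactly the kind of trade-off these metrics are designed to expose.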
Once these values are known, the ROC and AUROC can be derived:
- ROC (receiver operating characteristic curve)
This is a graph showing the performance of a classification model across different decision thresholds. It does this by plotting the True Positive Rate (i.e., the number of true positives divided by the sum of true positives and false negatives) against the False Positive Rate (i.e., the number of false positives divided by the sum of false positives and true negatives) at each threshold.
- AUROC (area under the receiver operating characteristic curve)
This is a statistic which summarises the ROC into a single number that describes the overall performance of the model (i.e., its ability to discriminate between classes) for multiple thresholds at the same time. A score of 1 is a perfect score and a score of 0.5 is equivalent to random guessing.
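A minimal sketch of both ideas, assuming each patient receives a predicted score between 0 and 1 (the scores below are invented for illustration): the ROC is traced by sweeping a threshold, while the AUROC can be computed directly in its equivalent pairwise form – the probability that a randomly chosen positive case scores higher than a randomly chosen negative case, counting ties as 0.5.

```python
def roc_points(actual, scores, thresholds):
    """Return one (false positive rate, true positive rate) point per threshold."""
    positives = sum(actual)
    negatives = len(actual) - positives
    points = []
    for t in thresholds:
        tp = sum(1 for a, s in zip(actual, scores) if a == 1 and s >= t)
        fp = sum(1 for a, s in zip(actual, scores) if a == 0 and s >= t)
        points.append((fp / negatives, tp / positives))
    return points

def auroc(actual, scores):
    """Area under the ROC curve via the pairwise (rank) formulation."""
    pos = [s for a, s in zip(actual, scores) if a == 1]
    neg = [s for a, s in zip(actual, scores) if a == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

actual = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.2]  # hypothetical model outputs
print(roc_points(actual, scores, [0.3, 0.5, 0.7]))
print(auroc(actual, scores))   # 0.75: better than random guessing, short of perfect
```

In practice, libraries such as scikit-learn provide optimised versions of these calculations, but the pairwise view makes clear why 0.5 corresponds to random guessing: a model with no discriminating ability ranks the positive case higher only half the time.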
Classification algorithms as described here deal only with binary variables, but prediction models may deal with both binary and continuous variables, i.e., values that are obtained by measuring rather than counting (such as ‘height’). In addition, prediction models sometimes have to deal with censored variables, i.e., variables whose values are not observed for all patients in the sample, for example because a patient leaves a study before the outcome of interest (such as ‘death’) can occur. In such instances, the performance metrics described above may not be appropriate, and so the C-index (or concordance index) is used instead.
- C-index (Concordance Index)
This estimates the probability of concordance (i.e., agreement) between predicted and observed results (i.e., predicted prognosis and actual prognosis). It does this by looking at pairs of patients with different outcomes and evaluating whether the model is capable of assigning the correct outcome to the correct patient in the pair.
For example, if an algorithm has been developed to predict the likelihood of hospital admission, and the predicted likelihood of admission was larger for the patient who was actually admitted than for the patient who was not, then the prediction would be considered ‘concordant’ (or in agreement) with the actual outcome and a ‘1’ would be added to the count of concordant pairs in the numerator.
If the model assigned the same predicted likelihood to both patients (i.e., a tied prediction), then a value of 0.5 would be added to the numerator instead, and if the pair was not concordant (i.e., the patient predicted to be at higher risk was not the one admitted), then 0 would be added to the numerator.
The denominator is the number of patients with the outcome of interest (i.e., the number of patients admitted to hospital) multiplied by the number of patients without the outcome of interest (i.e., the number of patients not admitted to hospital). An overall index value of 1 indicates perfect separation of patients with different outcomes, whilst a value of 0.5 indicates no ability to distinguish between patients with different outcomes (i.e., the model is no better than a guess).
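The counting described above can be sketched as follows for the binary admitted/not-admitted case. The risk scores are hypothetical, and real implementations (such as Harrell’s original formulation) also handle censored time-to-event data, which this sketch does not:

```python
def c_index(admitted, predicted_risk):
    """Concordance index for a binary outcome (1 = admitted, 0 = not admitted).

    Numerator: over every pair of one admitted and one non-admitted patient,
    add 1 if the admitted patient had the higher predicted risk (concordant),
    0.5 if the predictions were tied, and 0 otherwise.
    Denominator: admitted count x non-admitted count (the number of pairs).
    """
    cases = [r for a, r in zip(admitted, predicted_risk) if a == 1]
    controls = [r for a, r in zip(admitted, predicted_risk) if a == 0]
    numerator = sum(1.0 if case > control else 0.5 if case == control else 0.0
                    for case in cases for control in controls)
    return numerator / (len(cases) * len(controls))

admitted       = [1, 1, 0, 0]
predicted_risk = [0.8, 0.5, 0.5, 0.1]  # hypothetical model outputs
print(c_index(admitted, predicted_risk))  # 0.875
```

For a binary outcome, this pairwise count is mathematically identical to the AUROC described earlier; the C-index earns its keep when outcomes are censored or time-to-event.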
Regression (or multivariable) algorithms are also retrospectively evaluated by comparing predicted outcomes with actual observed outcomes and measuring the error rate. There are multiple metrics for doing this, including root mean squared error, mean absolute error, and mean squared error. All effectively measure how often an algorithm makes an error, meaning that the predicted outcome differs from the observed outcome (e.g., predicting disease that does not then develop), and the magnitude of that error (i.e., how different the predicted and actual outcomes are).
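These error metrics can be sketched in a few lines; the observed and predicted values below are invented for illustration:

```python
import math

def error_metrics(actual, predicted):
    """Compare continuous predictions with observed values."""
    errors = [p - a for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / len(errors)  # mean absolute error
    mse = sum(e * e for e in errors) / len(errors)   # mean squared error
    rmse = math.sqrt(mse)                            # root mean squared error
    return mae, mse, rmse

# Hypothetical observed vs predicted values (e.g., length of stay in days)
actual    = [10, 20, 30]
predicted = [12, 18, 33]
mae, mse, rmse = error_metrics(actual, predicted)
print(mae, mse, rmse)
```

Because the errors are squared before averaging, MSE and RMSE penalise large mistakes much more heavily than MAE does, which is one reason data scientists usually report several of these metrics side by side.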
In practice, because all performance metrics have different strengths and weaknesses, it is common for data scientists to test an algorithm using multiple performance metrics.
It is also important to note that in healthcare, it is not always whether or not an algorithm is highly ‘accurate’ that matters, but whether it results in the right outcome for patients and clinicians when used in practice, and whether these outcomes are fair. Evaluating performance of this nature requires the use of alternative performance metrics and clinical trials.