Researchers have found that while many artificial intelligence (AI) systems for diagnosing skin conditions have a significant disparity in accuracy between patients with lighter and darker skin tones, those disparities can be reduced by retraining the AI with a more diverse sample of medical images.
“Algorithms are only as good as the data on which they are based,” said James Zou, PhD, Stanford Medicine assistant professor of biomedical data science and a machine learning expert, in a press release. “A massive, open database cataloging dermatological images from people of colour could help doctors assess whether these algorithms function accurately on all skin colours.”
A paper published in Science Advances details how Dr. Zou and his colleagues developed such a database and used it to evaluate how accurately three existing dermatology diagnostic algorithms distinguish between benign and malignant lesions.
The new image database, the Diverse Dermatology Images (DDI) dataset, is intended to be a pathologically confirmed benchmark dataset with diverse skin tones. Images included in the DDI were retrospectively selected from a review of histopathologically proven lesions diagnosed in Stanford Clinics from 2010 to 2020. Each lesion in the DDI had its Fitzpatrick skin type (FST) determined through a chart review of the in-person visit and a consensus review by two board-certified dermatologists.
DDI includes a total of 208 images of FST I–II (159 benign and 49 malignant), 241 images of FST III–IV (167 benign and 74 malignant), and 207 images of FST V–VI (159 benign and 48 malignant).
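For readers who want to check the arithmetic, the counts above can be tallied in a few lines of Python. This is only an illustrative sketch: the dictionary layout and the `summarize` helper are assumptions made for this example and are not part of the DDI release.

```python
# Image counts per Fitzpatrick skin type (FST) group, as reported in the article.
# The data structure and function name here are illustrative assumptions.
ddi_counts = {
    "FST I-II": {"benign": 159, "malignant": 49},
    "FST III-IV": {"benign": 167, "malignant": 74},
    "FST V-VI": {"benign": 159, "malignant": 48},
}


def summarize(counts):
    """Return per-group totals and malignant fractions, plus the grand total."""
    summary = {}
    for group, c in counts.items():
        total = c["benign"] + c["malignant"]
        summary[group] = {
            "total": total,
            "malignant_fraction": c["malignant"] / total,
        }
    grand_total = sum(s["total"] for s in summary.values())
    return summary, grand_total


summary, grand_total = summarize(ddi_counts)
for group, s in summary.items():
    print(f"{group}: {s['total']} images, {s['malignant_fraction']:.1%} malignant")
print(f"All groups: {grand_total} images")
```

The per-group totals match the 208, 241 and 207 images reported, for 656 images overall.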
The three algorithms Dr. Zou and his colleagues evaluated were ModelDerm, DeepDerm and HAM10000. According to the paper, these were chosen based on their popularity, availability, and previous demonstrations of state-of-the-art performance.
Researchers found that all three algorithms performed well on the datasets on which they were originally trained and tested, identifying malignant lesions with good accuracy. However, all three were less accurate at identifying malignant lesions in the DDI data.
“People who are in the business of creating algorithms need to be aware of this problem and make sure they're testing their algorithm on all sorts of diverse skin tones,” said Roxana Daneshjou, MD, PhD, a practicing dermatologist at Stanford Medicine and lead author of the study, in the release. “It just emphasizes the importance of having diverse teams where both physicians and machine-learning experts from diverse backgrounds are involved.”
However, when the researchers used a subset of the DDI data to retrain or “fine-tune” the DeepDerm and HAM10000 AIs, the difference in diagnostic performance between light and dark skin images was reduced.
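One way to picture the kind of comparison described here is to compute a metric separately for each skin-tone group and look at the difference. The sketch below uses sensitivity (the share of truly malignant lesions that are flagged as malignant). It is an assumption that this is a suitable metric for illustration, since the article does not say which metrics the study used. The function name, the group labels and the records are all hypothetical toy values, not study data or results.

```python
from collections import defaultdict


def sensitivity_by_group(records):
    """Compute sensitivity per group.

    records: iterable of (group, is_malignant, predicted_malignant) tuples.
    Sensitivity = true positives / (true positives + false negatives).
    """
    tp = defaultdict(int)
    fn = defaultdict(int)
    for group, is_malignant, predicted_malignant in records:
        if not is_malignant:
            continue  # sensitivity only considers truly malignant lesions
        if predicted_malignant:
            tp[group] += 1
        else:
            fn[group] += 1
    return {g: tp[g] / (tp[g] + fn[g]) for g in set(tp) | set(fn)}


# Made-up toy records for illustration only; NOT data from the study.
toy_records = [
    ("light", True, True),
    ("light", True, True),
    ("light", True, False),
    ("light", False, False),
    ("dark", True, True),
    ("dark", True, False),
    ("dark", True, False),
    ("dark", False, True),
]

per_group = sensitivity_by_group(toy_records)
gap = per_group["light"] - per_group["dark"]
print(per_group, f"gap = {gap:.2f}")
```

Running the same comparison before and after fine-tuning would show whether the gap between groups has narrowed.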
Drs. Daneshjou and Zou have made the DDI available to scientists who want to use their data to fine-tune their algorithms or test for biases, according to the release. Dr. Zou also suggested that the database could be helpful for the general public.
“Often, people will spot something—a mole for instance—and want to look up previous cases on the internet,” he said. “This could be a valuable resource for patients who might not otherwise be able to find images of skin that looks like theirs.”