Available dermatology images and data insufficient to train AI to identify skin cancer
New research presented at the NCRI (National Cancer Research Institute) Festival suggests that the images and accompanying data currently available for training artificial intelligence (AI) to spot skin cancer are insufficient for the task. The available datasets also contain very few images of darker skin.
The open-access paper, published in The Lancet Digital Health (Nov. 9, 2021), evaluated all publicly available skin image datasets used for skin cancer diagnosis, examining their characteristics, data access requirements, and associated image metadata.
Investigators searched MEDLINE, Google, and Google Dataset Search and identified 21 open access datasets containing 106,950 skin lesion images, 17 open access atlases, eight regulated access datasets, and three regulated access atlases.
“We found that for the majority of datasets, lots of important information about the images and patients in these datasets wasn’t reported,” said the study’s lead author Dr. David Wen, in a press release. “There was limited information on who, how and why the images were taken. This has implications for the programs developed from these images, due to uncertainty around how they may perform in different groups of people, especially in those who aren’t well represented in datasets, such as those with darker skin.”
“This can potentially lead to the exclusion or even harm of these groups from AI technologies. Although skin cancer is rarer in people with darker skins, there is evidence that those who do develop it may have worse disease or be more likely to die of the disease. One factor contributing to this could be the result of skin cancer being diagnosed too late,” said Dr. Wen, of the Oxford University Hospitals NHS Foundation Trust.
According to the release, Dr. Wen and his colleagues hope to create quality standards for health data used in AI development. These standards will include guidance on who should be represented in datasets and which patient characteristics should be recorded.