A number of AI programs trained to detect diabetic eye damage struggle to perform consistently in the real world despite apparently excelling in clinical tests, say scientists in the US.
Academics led by the University of Washington School of Medicine tested seven algorithms from five companies: Eyenuk and Retina-AI Health in America, Airdoc in China, Retmaker of Portugal, and OphtAI in France. All of the models have gone through clinical studies, and are used – or can be used – to diagnose diabetic retinopathy, a complication of diabetes that damages blood vessels in the eye, leading to impaired vision or blindness.
The research team said it found at least some of the software packages wanting during its own testing, and this month published its findings in the Diabetes Care journal.
“It’s alarming that some of these algorithms are not performing consistently since they are being used somewhere in the world,” said lead researcher Aaron Lee, assistant professor of ophthalmology at the university.
The team tested the code by showing it a dataset of 311,604 photos from 23,724 patients at hospitals in Seattle and Atlanta from 2006 to 2018, and found some of the software's diagnoses of these patients were sub-par. When the algorithms' decisions were compared to those of real physicians, the team said three performed reasonably well, only one was as good as a human expert, and the rest were worse.
The AI models tended to over-predict the presence of the disease, Lee told IAIDL. Although it's better to be safe than sorry, it meant the systems would more often than not flag up patients for examination by professional eye doctors. Instead of reducing the workload for ophthalmologists by filtering out those without the disease, the software would increase it.
“The study design prevents us from disclosing which company supplied which algorithm unfortunately,” Lee added. “It is my understanding that all of these algorithms are in clinical use somewhere in the world however.”
The programs did better with imagery from Atlanta, we’re told, a sign that performance depends heavily on the quality of the data. “We believe one of the reasons for the discrepancy in performance was that Atlanta has a more stringent protocol for image quality at the time of screening,” Lee told us. “This suggests that AI models may be more sensitive to image quality issues than human beings.”
The academics suggested medical algorithms should be evaluated on larger real-world datasets before being validated for public use. “AI algorithms are not all created equal and they can, but not always, recapitulate biases in datasets,” Lee warned.
So, are they safe for use?
Airdoc declined to comment on the study, and Retmaker did not respond to El Reg‘s questions.
Stephen Odaibo, CEO and founder of Retina-AI Health, told us in a statement he thought the researchers’ conclusions were not supported by the study.
“First, the study was a retrospective study based on heterogeneous unstructured data from the [veteran patients],” he said. “The data included pictures that were not of the retina, for example, pictures of people’s faces, eyelids, or even their driver’s licenses. The algorithms were made to first sort through to identify which of the images were of retinas, and then which were of right eyes or left eyes; after which it was then to determine the disease stage.”
“Furthermore, it was not known from what types of cameras the images had been taken over the years. This is a completely different scenario from the use case for which these AI algorithms were developed and subsequently clinically validated in prospective clinical trials for FDA approval,” Odaibo continued, referring to America’s medical watchdog, the Food and Drug Administration.
“The indications of use are in primary care settings and with a trained camera operator who selects two specific images per eye from a known camera device on which the algorithm has been specifically validated prospectively. The above discrepancy is the source of the big logical gap between the study and its claims. To make an evidence-based recommendation to the FDA one would need to design a study that reflects the indications of use and intended use of the medical device.”
Frank Cheng, president and chief customer officer of the other US-based company in the study, Eyenuk, agreed that the experiments carried out by the academics didn't quite mirror how a system would be tested after FDA approval: "Our view is that systems that have FDA clearance are already going through more rigorous prospective clinical trial validation than the University of Washington study, additional testing is not necessary, so long as photographers and imaging protocol training takes place … In real world clinical use, FDA-cleared systems such as Eyenuk's are integrated with the camera, and photographers are trained on the imaging protocol to be used."
Cheng said he believed “Eyenuk’s EyeArt AI system is very much ready for prime time and is available for clinical use,” and said he thought the “study analysis was well conducted in general.”
OphtAI's CTO Bruno Lay told IAIDL that the research group's conclusions were fair. Lay claimed OphtAI's algorithms were ranked as the best and second-best of the seven tested, and that the technology from three of the five companies trialed probably isn't yet good enough to be used in the real world.
“The experiments were very challenging,” he said. “We had no idea of the quality of the images used in the test. We were able to process the whole dataset in just three days, and our system is already available for use in hospitals in France.”
Diabetic retinopathy is a widely studied area in medical AI research. Several Alphabet subsidiaries, including Google, Verily, and DeepMind have demonstrated how machine-learning software can automatically analyze retinal scans. ®