Recent advances in artificial intelligence have highlighted the potential of large multimodal models (LMMs) for a range of applications, including medical diagnostics. However, a study titled "Worse than Random? An Embarrassingly Simple Probing Evaluation of Large Multimodal Models in Medical VQA" suggests that these models may not be as reliable as previously thought.
The study, conducted by researchers from several institutions and published on arXiv in May 2024, evaluates the diagnostic accuracy of LMMs on medical visual question answering (Med-VQA). The aim is to determine how reliable these models are and how applicable they would be in real-world settings. To this end, the authors introduce a new dataset, ProbMed, to probe the models' diagnostic capabilities.
Large multimodal models such as GPT-4V and Gemini Pro have been praised for their performance on general benchmarks. However, their use in specialized areas such as medical diagnostics requires rigorous validation. The study assesses whether these models can reliably answer medical questions grounded in visual data, a crucial prerequisite for their use in healthcare.
To evaluate the LMMs, the researchers developed a new dataset called ProbMed. It contains medical questions designed to test the models' diagnostic reasoning together with adversarial counterparts of those questions. The evaluation probes whether a model can interpret medical images, answer questions about diagnostic procedures, and handle tricky adversarial questions intended to expose how deep its understanding really goes.
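As a rough illustration of this kind of probing setup, here is a minimal sketch of how paired scoring could work, assuming binary yes/no questions where each real question is matched with an adversarial one about an attribute that is not actually in the image. The field names and the all-or-nothing pairing rule are illustrative assumptions, not the paper's exact specification.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class ProbePair:
    """A ground-truth question about an image paired with an adversarial
    counterpart asking about an attribute that is not actually present."""
    image_id: str
    question: str              # e.g. "Is there a nodule in the left lung?"
    answer: str                # ground-truth "yes" or "no"
    adversarial_question: str  # same form, but about a fabricated finding
    adversarial_answer: str    # "no" by construction for a fabricated finding

def paired_accuracy(pairs: List[ProbePair],
                    model: Callable[[str, str], str]) -> float:
    """Credit a pair only if the model answers BOTH the real and the
    adversarial question correctly (a strict, probing-style metric)."""
    if not pairs:
        return 0.0
    correct = 0
    for p in pairs:
        real_ok = model(p.image_id, p.question).strip().lower() == p.answer
        adv_ok = (model(p.image_id, p.adversarial_question).strip().lower()
                  == p.adversarial_answer)
        correct += int(real_ok and adv_ok)
    return correct / len(pairs)
```

The point of scoring pairs rather than individual questions is that a model cannot earn credit simply by agreeing with whatever the question implies; it must also reject the fabricated finding.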
The results were surprising and somewhat disconcerting. Advanced models such as GPT-4V and Gemini Pro fell well short of expectations, in many cases answering diagnostic questions less accurately than random guessing. They frequently failed on questions about diagnostic procedures, highlighting the gap between their perceived and actual abilities, and they struggled with adversarial questions, often giving incorrect or nonsensical answers. This points to a superficial grasp of medical content rather than genuine diagnostic reasoning.
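To make the "worse than random" comparison concrete, the sketch below estimates the chance-level reference points a model that ignores the image entirely would achieve, assuming yes/no questions and the paired scoring sketched above; the exact baselines in the paper may be computed differently.

```python
import random

# Chance-level reference points for yes/no Med-VQA questions: a guesser that
# never looks at the image is right about half the time on a single question,
# and must get two independent answers right to score on a question pair.
trials = 100_000
single_hits = sum(random.random() < 0.5 for _ in range(trials))
paired_hits = sum((random.random() < 0.5) and (random.random() < 0.5)
                  for _ in range(trials))

print(f"random single-question accuracy ~ {single_hits / trials:.2f}")  # ~0.50
print(f"random paired accuracy          ~ {paired_hits / trials:.2f}")  # ~0.25
```

A model scoring below these reference points is, in effect, being led astray by the wording of the questions rather than merely failing to use the image.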
Implications for AI in healthcare
These findings have far-reaching implications for the use of AI in medical diagnostics. Although LMMs promise to improve healthcare through automation and decision support, in their current state of development they are not reliable enough for critical applications. The study highlights the need for more robust evaluation frameworks and better training methods to ensure that these models can be trusted in medical practice.
The study suggests several ways to improve the reliability of LMMs in medical diagnostics. These include more diverse and representative medical training data to improve model understanding, and specialized evaluation metrics that better reflect the complexity of medical diagnostic tasks. In addition, interdisciplinary collaboration that involves clinical experts in the development and evaluation process is essential to ensure that model outputs are clinically relevant.
The "Worse than Random?" study is an important reminder that while AI technology is advancing rapidly, its application in sensitive areas such as healthcare requires careful and thorough validation. Despite its impressive capabilities in other domains, the current generation of large multimodal models is inadequate when it comes to answering visual questions. This research calls for a renewed focus on developing AI that can truly understand and support medical diagnoses to ensure the safety and accuracy of healthcare applications.