At last week's Google I/O developer conference, Google announced that PaLM 2 (its latest-generation large language model) will have multi-modal capabilities. This means PaLM 2 can interpret images and videos in addition to interpreting and generating text. OpenAI had previously announced that GPT-4 would have these capabilities too. In other words, the new generation of large language models will offer multi-modal capabilities as standard. Why is this significant to the healthcare domain in which I operate?
Medical practice is, by default, multi-modal. A clinician must interpret patient and laboratory records, take an oral history, undertake a visual examination, and interpret waveform and radiological investigations. Collectively, these inform the clinician's diagnosis and management of the patient. The previous generation of AI models could only contribute to a narrow set of medical tasks, say electronic record analysis or medical image interpretation, but not both in combination. This was mainly due to how the machine learning models were trained (supervised learning on annotated/labelled data) and the intrinsic limitation of the algorithms (even advanced ones) in performing accurately on multi-modal datasets. While regulatory authorities and vendors had a relatively easy task certifying such an application for its task boundaries and safety, these applications fulfilled only a narrow set of customers' (health services, medical doctors, etc.) requirements. Considering the need to integrate these applications into existing information systems, the economies of scale and return on investment were minimal, if not non-existent.
The availability of multi-modal (and potentially multi-outcome) functionality may considerably change the AI in healthcare landscape. The ability of a single AI application not only to analyse a radiological investigation but to link it back to the patient's history, derived from analysis of the electronic health record and pathology investigations, will be revolutionary. This will remove the need for stakeholders/purchasers to source multiple AI applications and make it easier for health services to set up a governance mechanism to monitor the deployment and delivery of AI-enabled services. At a clinical level, multi-modal AI gives physicians a more comprehensive view when making a diagnosis.
Such applications are not far from entering the commercial space. In the research domain, last year I was fascinated by this study from South Korea, where the authors demonstrated a multi-modal algorithm adopting a BERT-based architecture to maximise generalisation performance across both vision-language understanding tasks (diagnosis classification, medical image-report retrieval, medical visual question answering) and a vision-language generation task (radiology report generation). As you may know, BERT is a masked language model based on the Transformer architecture. Since this study, I have seen a wave of studies showcasing the efficacy of multi-modal AI, such as this and this.
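For readers unfamiliar with the term, BERT's masked-language-model pretraining objective is easy to illustrate: a fraction of the input tokens are hidden at random, and the model must recover them from the surrounding context. The toy sketch below (plain Python, not the Korean study's actual pipeline; the sentence, mask probability, and function name are purely illustrative) shows only the masking step, not the neural network that learns to fill the gaps.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """BERT-style pretraining input: randomly replace ~15% of tokens
    with a [MASK] placeholder; the model learns to predict the originals."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append("[MASK]")
            targets[i] = tok  # position -> original token the model must recover
        else:
            masked.append(tok)
    return masked, targets

tokens = "the chest x-ray shows consolidation in the right lower lobe".split()
masked, targets = mask_tokens(tokens)
print(masked)   # sentence with some tokens replaced by [MASK]
print(targets)  # the hidden tokens the model is trained to predict
```

In a multi-modal variant, the same idea extends beyond text: image features and report tokens are fed in together, so the model learns to recover masked words (or image regions) from both modalities at once.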
Back to Google's announcement last week: as part of the customised offerings of PaLM 2 for various domains, Med-PaLM 2, developed to generate medical analysis, was also demonstrated. As per this blog, Med-PaLM 2 will interpret and generate text (answering questions and summarising insights) and have multi-modal functionality to analyse and interpret medical imaging modalities. Considering that GPT-4 can analyse images and offers API access to external developers, it is not hard to foresee multi-modal medical AI applications reaching the market. Of course, as I see it, multi-modal AI will not be restricted to LLM architectures; there are other ways to develop such applications. Also, it is not enough to have multi-modal functionality; you also need multi-outcome features.
I write this article not only to signal to healthcare stakeholders (policymakers, funders, health services, etc.) about the future of medical AI software but also to forewarn narrow-use-case medical AI developers to pivot their development strategy to multi-modal AI functionality or be swept away as the floodgates of multi-modal AI are unleashed.
Health System Academic