Generative AI and its associated models, including large language models (LLMs), have great promise to revolutionize healthcare and other industries, but they come with substantial risks as well. There are a variety of paradigms for classifying the risks of AI in general, including the National Institute of Standards and Technology’s (NIST’s) Artificial Intelligence Risk Management Framework (AI RMF) and the aligned Blueprint for Trustworthy AI from the Coalition for Health AI (CHAI).
NIST’s framework for evaluating AI trustworthiness asks if a model is “valid and reliable, safe, secure and resilient, accountable and transparent, explainable and interpretable, privacy-enhanced, and fair with harmful bias managed”. Balancing these characteristics is critical to ensuring that AI models are trustworthy and to avoiding untoward outcomes.
Validity and reliability are crucial for trustworthy generative AI in healthcare since failures can be catastrophic. Generative AI is prone to “hallucinations” (generated content that is factually incorrect). These events occur because generative AI is designed to predict the most likely completion of given content without truly understanding the content. Notably, some authors recommend alternative terms to “hallucinations”, such as “confabulations” or “AI misinformation”, to reduce perceptions of model sentience and to avoid stigmatization of people who have hallucinations (NCBI). Regardless of what they are called, these errors can take many forms. Examples in healthcare, specific to large language models (LLMs), include:
These are all actual examples of hallucinations and illustrate one of the risks in reliance on LLMs. Non-existent references are especially pernicious as they may make model output appear more authoritative and reliable, and many people may not understand the need to vet them carefully.
A cautionary tale from another industry: a law firm was recently fined after one of its lawyers submitted a court brief that contained references to non-existent cases. These references had been suggested by an LLM in a classic example of a hallucination (Business Insider).
A number of strategies have been described to reduce the risk of hallucinations (e.g., asking an LLM itself to review output for inaccuracies), but whatever strategy is adopted, hallucinations remain a risk to be aware of and to monitor closely (The New England Journal of Medicine).
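The self-review strategy mentioned above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a validated safeguard: `ask_model` is a hypothetical stand-in for whatever client sends a prompt to an LLM and returns its text reply, and the prompt wording and the "OK" convention are likewise assumptions made for this example.

```python
from typing import Callable, List

# Hypothetical review prompt; the wording is an assumption for illustration.
REVIEW_PROMPT = (
    "Review the following draft for factual inaccuracies or citations you "
    "cannot verify. Reply with 'OK' if you find none; otherwise list each "
    "problem on its own line.\n\nDraft:\n{draft}"
)


def self_review(draft: str, ask_model: Callable[[str], str]) -> List[str]:
    """Ask a model to critique a draft and return the issues it lists.

    `ask_model` is a hypothetical callable (prompt in, reply text out);
    it is not part of any real library.
    """
    reply = ask_model(REVIEW_PROMPT.format(draft=draft)).strip()
    if reply.upper() == "OK":
        return []
    # Treat every non-empty line of the reply as one flagged issue.
    return [line.strip() for line in reply.splitlines() if line.strip()]


if __name__ == "__main__":
    # Stub model used only to show the flow; a real deployment would call an LLM.
    def stub_model(prompt: str) -> str:
        return "Reference 2 could not be verified"

    for issue in self_review("Example draft text.", stub_model):
        print("Needs human review:", issue)
```

An empty list here means only that the reviewing model raised no objection, not that the draft is correct, so a check like this supplements human vetting of model output rather than replacing it.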
In addition to hallucinations, the specifics of how a model is implemented may introduce other risks. For example, an LLM given the History of Present Illness (HPI) from a series of 35-40 Emergency Department charts generated a differential diagnosis that included the correct diagnosis only half the time (Medium) (Fast Company, 2019). The LLM missed several life-threatening conditions such as ectopic pregnancy, aortic rupture, and brain tumor. The author of that report expressed concern that the LLM, if used to generate a differential diagnosis, would reinforce the physician’s cognitive errors: for example, if the HPI didn’t include questions probing a particular diagnosis, the model would be less likely to suggest that diagnosis in the differential. Using such a model to develop a differential diagnosis from an HPI to assist a provider could encourage or even cause premature closure errors (Annals of Internal Medicine).
This is just one specific example. The underlying concept is that whenever a generative AI model, such as an LLM, is implemented, careful consideration must be given to how it could increase or decrease safety. Some potential use cases are much higher risk than others.
Bias is another significant risk with LLMs and can be overt or subtle. Underlying training data is often scraped from the internet and as such can contain biased content. Even with aggressive attempts to label that training data and/or filter out biased responses, biased output can come through, sometimes egregiously (Business Insider). More subtly, models may be likely to associate certain characteristics with a given race, ethnicity, gender, religion, or other demographic group represented in the training data set (arXiv). Importantly for healthcare, many LLMs have been demonstrated to perpetuate race-based medicine (medRxiv).
There are other potential risks and pitfalls of generative AI, and related models, including:
Discussing these risks in detail is beyond the scope of this article.
As described elsewhere in this series, LLMs and generative AI have immense promise in healthcare as in other industries. However, there are significant risks and pitfalls to be aware of, especially in the high-stakes arena of healthcare. When implementing generative AI models such as LLMs, it is critical to carefully consider these risks and how to mitigate them.
In the next section of this series, section five, we will discuss the regulatory factors that need to be top of mind for legal and ethical matters relating to LLMs. To view previous sections in this series, please see the links below.
Quanta Magazine. (2022, December 8). What Causes Alzheimer’s? Scientists Are Rethinking the Answer. Quanta Magazine. https://www.quantamagazine.org/what-causes-alzheimers-scientists-are-rethinking-the-answer-20221208/
Fast Company. (2019, June 6). ChatGPT, the AI language model, struggled in medical diagnosis, report says. Fast Company. https://www.fastcompany.com/90863983/chatgpt-medical-diagnosis-emergency-room
The views and opinions expressed in this content or by commenters are those of the author and do not necessarily reflect the official policy or position of HIMSS or its affiliates.