Section 4 - Critical Elements to Manage with Generative AI, specifically LLMs

Generative AI, including large language models (LLMs), has huge promise to revolutionize healthcare and other industries but comes with substantial risks as well. A variety of paradigms exist for classifying the risks of AI, including the National Institute of Standards and Technology's (NIST's) Artificial Intelligence Risk Management Framework (AI RMF) and the aligned Blueprint for Trustworthy AI from the Coalition for Health AI.

NIST's framework for trustworthiness asks whether a model is Valid and Reliable; Safe; Secure and Resilient; Explainable and Interpretable; Privacy Enhanced; Fair with Harmful Bias Managed; and Accountable and Transparent. Privacy and security have particular importance in healthcare, so the next article in this series is dedicated to security and privacy considerations in the context of LLMs in healthcare.

Valid and Reliable

Validity and reliability are crucial for generative AI in healthcare, since failures in healthcare can be catastrophic. Generative AI is prone to “hallucinations” (generated content that is factually incorrect). These errors occur because generative AI models, including LLMs, are designed to predict the most likely completion of a given input but do not truly understand the content. Notably, some authors recommend alternative terms to “hallucinations,” such as “confabulations” or “AI misinformation,” to reduce perceptions of model sentience and to avoid stigmatizing people who experience hallucinations.

Regardless of what they are called, these errors can take many forms. Examples in healthcare, specific to LLMs (the text-generating form of generative AI), include:

  • An LLM asked to summarize the mechanism of action for a disease that gives partially incorrect information supported by non-existent references. For example, in the 1990s, Alzheimer’s disease was felt to be related to the “amyloid cascade hypothesis,” in which amyloid formed sticky plaques between neurons, resulting in neuronal death. Years of research on drugs aimed at preventing these plaques ensued, with almost none of the trials showing meaningful improvement. To this day, it is not clear whether amyloid is the cause of the disease or a result of it (Quanta Magazine, 2022). 
  • A chart summarization LLM that makes up specific events in a patient’s history. 
  • An LLM generating an encounter summary from a transcript that fabricates information (e.g., the BMI of a patient with an eating disorder). 
  • An LLM generating insurance denial appeal letters that cite non-existent references. 
  • An LLM generating scientific abstracts that include nonexistent references nearly a third of the time. 
  • An LLM chatbot outlandishly claiming to have a Master’s degree in public health and volunteer experience with diabetes nonprofits when answering questions about diabetes medications. 

These are all actual examples of hallucinations and illustrate one of the risks of relying on LLMs. Non-existent references are especially pernicious: they can make model output appear more authoritative and reliable, and many people may not understand the need to vet them carefully.

A cautionary tale from another industry: a law firm was recently fined after one of its lawyers submitted a court brief that cited non-existent cases; these had been suggested by an LLM in a classic example of a hallucination.

A number of strategies have been described to reduce the risk of hallucinations (e.g., asking an LLM itself to review output for inaccuracies), but whatever strategy is adopted, hallucinations are clearly a risk to be aware of and to monitor closely.
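To make the self-review and reference-vetting strategies above concrete, here is a minimal sketch in Python. The `self_review` function and its prompt wording are hypothetical (any chat-completion API could stand in for the `review_model` callable), and the citation pattern is a toy: real references take many forms, and extracted citations must still be verified by a human against an authoritative database such as PubMed.

```python
import re
from typing import Callable

def self_review(draft: str, review_model: Callable[[str], str]) -> str:
    """Send a draft back through a second model call (stubbed here as any
    callable) that is asked to flag possibly inaccurate or fabricated claims."""
    prompt = (
        "Review the following text and list any statements or references "
        "that may be inaccurate or fabricated:\n\n" + draft
    )
    return review_model(prompt)

# A complementary, non-LLM safeguard: pull out citation-like strings so a
# human reviewer can check each one rather than trusting it on sight.
CITATION_PATTERN = re.compile(r"\(([A-Z][A-Za-z]+(?: et al\.)?),? (\d{4})\)")

def extract_citations(text: str) -> list[tuple[str, str]]:
    """Return (author, year) pairs for anything that looks like a citation."""
    return CITATION_PATTERN.findall(text)

draft = "Amyloid plaques cause neuronal death (Smith et al., 1998)."
print(extract_citations(draft))  # → [('Smith et al.', '1998')]
```

The point of the sketch is the workflow, not the code: model output is treated as a draft that passes through both an automated critique step and a human verification step before anyone acts on it.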


In addition to hallucinations, the specifics of how a model is implemented may introduce other risks. For example, an LLM given the history of present illness (HPI) from a series of 35-40 Emergency Department charts generated a differential diagnosis that included the correct diagnosis only about half the time (Fast Company, 2019). It missed several life-threatening conditions, such as ectopic pregnancy, aortic rupture, and brain tumor. The author of that report also expressed concern that an LLM used to generate a differential diagnosis would reinforce the physician’s cognitive errors: if the HPI did not include questions probing a particular diagnosis, the model would be less likely to suggest that diagnosis in the differential. In other words, using a model to develop a differential diagnosis from an HPI could encourage or even cause premature closure errors.
This is just one example; the underlying concept is that whenever a generative AI model such as an LLM is implemented, close consideration must be given to how it could increase or decrease safety. Some potential use cases are much higher risk than others.

Fair with Harmful Bias Managed  

Bias is another significant risk with LLMs and can be overt or subtle. Training data is often scraped from the internet and as such can contain biased content. Even with aggressive attempts to label training data and/or filter out biased responses, biases can come through, sometimes egregiously. More subtly, models may associate certain characteristics with a given race, gender, religion, etc. Importantly for healthcare, many LLMs have also been shown to perpetuate race-based medicine.

Other Risks  

There are other potential risks and pitfalls of generative AI, and related models, including:  

  • Medicolegal uncertainty  
  • A regulatory landscape that is likely to continue to evolve rapidly in coming years  
  • Monetary and environmental cost to train and run generative AI models  
  • The psychological burden placed on low-paid workers labeling training data  
  • Intellectual property concerns over use of training data and the risk of plagiarism if generative AI content is included in a work  
  • Patients using LLMs directly to get medical advice  
  • Inexperienced providers utilizing generative AI suggestions without further validation  

Discussing these in detail is beyond the scope of this article.  


As described elsewhere in this series, LLMs, and generative AI as a whole, have immense promise in healthcare as in other industries. However, there are significant risks and pitfalls to be aware of, especially in the high-stakes arena of healthcare. When implementing generative AI models such as LLMs, it is critical to carefully consider these risks and how to mitigate them.

In the next section of this series, section five, we will discuss the security and privacy considerations that need to be top of mind in relation to LLMs. To view previous sections in this series, please see the links below.


Quanta Magazine. (2022, December 8). What Causes Alzheimer’s? Scientists Are Rethinking the Answer. Quanta Magazine.  
Fast Company. (2019, June 6). ChatGPT, the AI language model, struggled in medical diagnosis, report says. Fast Company.  

Other Sections

The views and opinions expressed in this content or by commenters are those of the author and do not necessarily reflect the official policy or position of HIMSS or its affiliates.