In a recent study published in the journal Nature Medicine, an international team of scientists identified the best-performing large language models and adaptation methods for summarizing large amounts of electronic health record data, and compared the performance of these models with that of medical experts.
Study: Adapted large language models can outperform medical experts in clinical text summarization. Image credit: takasu / Shutterstock
Background
A tedious but essential aspect of medical practice is documenting a patient's medical record, including progress reports, diagnostic tests, and treatment history across specialties. Clinicians often spend a significant portion of their time compiling large amounts of textual data, and even the most experienced physicians can introduce errors in this process, which can lead to serious medical and diagnostic problems.
The shift from paper records to electronic health records appears to have only increased the amount of clinical documentation work, with reports indicating that clinicians spend about two hours documenting each patient interaction. Nurses spend nearly 60% of their time on clinical documentation, and the time demands associated with this process often result in significant stress, burnout, and reduced clinician job satisfaction, ultimately leading to worse patient outcomes.
Large language models offer a promising option for clinical data summarization, but while these models have been evaluated on common natural language processing tasks, their efficiency and accuracy in clinical data summarization have not been extensively assessed.
About the study
In this study, the researchers evaluated eight large language models across four clinical summarization tasks: patient questions, radiology reports, doctor-patient interactions, and progress notes.
They first used quantitative natural language processing metrics to determine which models and adaptation methods performed best across the four summarization tasks. Ten physicians then conducted a clinical reader study, comparing the best large language model summaries with summaries written by medical experts along parameters such as conciseness, accuracy, and completeness.
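The article does not name the specific metrics used; as a purely illustrative sketch, assuming ROUGE-L (a common overlap-based summarization metric) and the open-source rouge-score package, an automated comparison of a model summary against an expert reference might look like this:

```python
# Illustrative only: scores a model summary against an expert reference
# with ROUGE-L, one common summarization metric; the study's exact
# metric suite is not named in this article.
from rouge_score import rouge_scorer

reference = "45-year-old with exertional chest pain; rule out myocardial infarction."
candidate = "Middle-aged patient with chest pain on exertion, possible MI to rule out."

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
score = scorer.score(reference, candidate)["rougeL"]
print(f"ROUGE-L F1: {score.fmeasure:.2f}")  # overlap-based similarity in [0, 1]
```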
Finally, the researchers assessed safety aspects, identifying challenges such as fabricated information and the potential for medical harm in clinical data summaries produced by either medical experts or large language models.
Two broad language generation approaches, autoregressive and seq2seq models, were represented among the eight large language models evaluated. Training seq2seq models requires paired datasets because they use an encoder-decoder architecture that maps inputs to outputs. These models perform efficiently on tasks such as summarization and machine translation.
Autoregressive models, on the other hand, do not require paired datasets and are suited to tasks such as dialogue, question answering, and text generation. The study used open-source seq2seq and autoregressive large language models, as well as several proprietary autoregressive models, and evaluated two methods for adapting these general-purpose pre-trained large language models to domain-specific tasks.
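To make the architectural contrast concrete, here is a minimal sketch using the Hugging Face transformers library; the model names are placeholders for the two families, not necessarily the checkpoints evaluated in the study:

```python
# Placeholder models chosen to illustrate the two architectures;
# they are assumptions, not the study's evaluated checkpoints.
from transformers import pipeline

note = "Pt c/o intermittent chest pain x3 days, worse on exertion, denies SOB."

# seq2seq (encoder-decoder): maps an input document directly to a summary
seq2seq = pipeline("summarization", model="google/flan-t5-base")
print(seq2seq(note, max_length=40)[0]["summary_text"])

# autoregressive (decoder-only): continues a prompt token by token
causal = pipeline("text-generation", model="meta-llama/Llama-2-7b-chat-hf")
prompt = f"Summarize the clinical note in one sentence:\n{note}\nSummary:"
print(causal(prompt, max_new_tokens=40)[0]["generated_text"])
```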
The four summarization tasks used to evaluate the large language models were: condensing the detailed findings of a radiology report into a short impression; summarizing patients' questions into brief queries; generating a list of medical problems and diagnoses from progress notes; and summarizing doctor-patient dialogue into an assessment-and-plan paragraph.
Results
The results showed that 45% of the summaries from the best-adapted large language models were equivalent to summaries written by medical experts, and 36% were better. Furthermore, in the clinical reader study, large language model summaries scored higher than the medical experts' summaries on all three parameters: conciseness, accuracy, and completeness.
Additionally, the scientists found that “prompt engineering,” the process of adjusting or rephrasing input prompts, can significantly improve model performance. This was particularly evident on the conciseness parameter, where prompts instructing the model to summarize a patient's question into a query of a certain number of words helped condense the information meaningfully.
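As a hypothetical illustration of such a word-limit instruction (the study's actual prompt wording is not reproduced in this article), a prompt template might look like this:

```python
# Hypothetical prompt template; the study's actual prompts are not
# reproduced here.
def build_prompt(patient_question: str, max_words: int = 15) -> str:
    return (
        f"Summarize the patient's question below into a single query "
        f"of at most {max_words} words.\n\n"
        f"Question: {patient_question}\n"
        f"Summary:"
    )

print(build_prompt(
    "I've had headaches every morning for two weeks and ibuprofen "
    "doesn't help. Should I be worried?"
))
```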
Radiology reports were the one area where large language model summaries were less concise than those provided by medical experts. Because the prompt for summarizing radiology reports did not specify a particular length limit, the scientists speculated that the ambiguity of the input prompt might be the cause. However, they also believe that the accuracy of this process could be significantly improved by incorporating checks not only from human operators but also from other large language models and ensembles of models.
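As a minimal, hypothetical sketch of such a model-based check (the study does not prescribe a specific implementation), a second model could be asked to flag draft summaries that introduce claims not supported by the source report:

```python
# Hypothetical verifier loop; `generate` stands in for any LLM call
# and is an assumption, not an API described in the study.
from typing import Callable

def summary_is_supported(generate: Callable[[str], str],
                         report: str, summary: str) -> bool:
    verdict = generate(
        "Does the summary contain any claim not supported by the report? "
        f"Answer yes or no.\n\nReport: {report}\n\nSummary: {summary}"
    )
    return verdict.strip().lower().startswith("no")

def filter_drafts(generate: Callable[[str], str],
                  report: str, drafts: list[str]) -> list[str]:
    # Keep only drafts that pass the consistency check.
    return [d for d in drafts if summary_is_supported(generate, report, d)]
```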
Conclusions
Overall, the study found that using large language models to summarize data from patient health records performed as well as, or better than, summarization by medical experts. Most of these large language models scored higher than the human experts on natural language processing metrics, summarizing the data concisely, accurately, and completely. With further modifications and improvements, this process could be implemented to save clinicians valuable time and improve patient care.
Journal reference:
- Van Veen, D., Van Uden, C., Blankemeier, L., Delbrouck, J.-B., Aali, A., Bluethgen, C., Pareek, A., Polacin, M., Reis, E. P., Seehofnerová, A., Rohatgi, N., Hosamani, P., Collins, W., Ahuja, N., Langlotz, C. P., Hom, J., Gatidis, S., Pauly, J., & Chaudhari, A. S. (2024). Adapted large language models can outperform medical experts in clinical text summarization. Nature Medicine. DOI: 10.1038/s41591-024-02855-5, https://www.nature.com/articles/s41591-024-02855-5