Specific learning objectives
The pre-clerkship phase SLOs identified by Sage Poe, Claude-Instant, and ChatGPT are listed in the electronic supplementary materials 1–3, respectively. In general, a broad homology in SLOs generated by the three AI platforms was observed. All AI platforms identified appropriate action verbs as per Bloom’s taxonomy to state the SLO; action verbs such as describe, explain, recognize, discuss, identify, recommend, and interpret were used to state the learning outcome. The specific, measurable, achievable, relevant, time-bound (SMART) SLOs generated by each AI platform varied slightly. All key domains of antihypertensive pharmacology to be achieved during the pre-clerkship (pre-clinical) years were relevant for graduating doctors. The SLOs addressed the classes of antihypertensive drugs recommended by current JNC treatment guidelines, along with their mechanisms of action, pharmacokinetics, adverse effects, indications/contraindications, dosage adjustments, monitoring of therapy, and principles of monotherapy and combination therapy.
The SLOs to be achieved by undergraduate medical students at the time of graduation identified by Sage Poe, Claude-Instant, and ChatGPT are listed in electronic supplementary materials 4–6, respectively. The identified SLOs emphasize the application of pharmacology knowledge within a clinical context, focusing on competencies needed to function independently in early residency stages. These SLOs go beyond knowledge recall and mechanisms of action to encompass competencies related to clinical problem-solving, rational prescribing, and holistic patient management. The SLOs generated require higher cognitive ability of the learner: action verbs such as demonstrate, apply, evaluate, analyze, develop, justify, recommend, interpret, manage, adjust, educate, refer, design, and initiate and titrate were frequently used.
A-type MCQs
The MCQs for the pre-clerkship phase identified by Sage Poe, Claude-Instant, and ChatGPT are listed in the electronic supplementary materials 7–9, respectively, and those identified with the search query based on the clinical vignette are listed in electronic supplementary materials 10–12.
All MCQs generated by the AIs in each of the four domains specified [mechanism of action (MOA), pharmacokinetics, adverse drug reactions (ADRs), and indications for antihypertensive drugs] are quality test items with potential content validity. The test items on MOA generated by Sage Poe included themes such as the renin-angiotensin-aldosterone system (RAAS), beta-adrenergic blockers (BB), calcium channel blockers (CCB), potassium channel openers, and centrally acting antihypertensives; those on pharmacokinetics included high oral bioavailability/metabolism in liver [angiotensin receptor blocker (ARB)-losartan], long half-life and renal elimination [angiotensin converting enzyme inhibitor (ACEI)-lisinopril], metabolism by both liver and kidney [beta-blocker (BB)-metoprolol], rapid onset and short duration of action (direct vasodilator-hydralazine), and long-acting transdermal drug delivery (centrally acting-clonidine). Regarding the ADR theme, dry cough, angioedema, and hyperkalemia by ACEIs in susceptible patients, reflex tachycardia by CCB/amlodipine, and orthostatic hypotension by CCB/verapamil were addressed. Clinical indications included the drug of choice for hypertensive patients with concomitant comorbidity such as diabetes (ACEI-lisinopril), heart failure and low ejection fraction (BB-carvedilol), hypertensive urgency/emergency (alpha cum beta receptor blocker-labetalol), stroke in patients with a history of recurrent stroke or transient ischemic attack (ARB-losartan), and preeclampsia (methyldopa).
Nearly identical themes under each domain were identified by the Claude-Instant AI platform, with a few notable exceptions: hydrochlorothiazide (instead of clonidine) in the MOA and pharmacokinetics domains; under the ADR domain, ankle edema due to amlodipine, and sexual dysfunction and fatigue in males due to alpha-1 receptor blockers; under clinical indications, the best initial monotherapy for clinical scenarios such as a 55-year-old male with stage 2 hypertension; a 75-year-old man with stage 1 hypertension; a 35-year-old man with stage 1 hypertension working on night shifts; and a 40-year-old man with stage 1 hypertension and hyperlipidemia.
As with Claude-Instant AI, ChatGPT-generated test items on MOA were mostly similar. However, under the pharmacokinetic domain, immediate- and extended-release metoprolol, the effect of food to enhance the oral bioavailability of ramipril, and the highest oral bioavailability of amlodipine compared to other commonly used antihypertensives were the themes identified. Whereas the other ADR themes remained similar, constipation due to verapamil was a new theme addressed. Notably, in this test item, amlodipine was an option that increased the difficulty of this test item because amlodipine therapy is also associated with constipation, albeit to a lesser extent, compared to verapamil. In the clinical indication domain, the case description asking “most commonly used in the treatment of hypertension and heart failure” is controversial because the options listed included losartan, ramipril, and hydrochlorothiazide but the suggested correct answer was ramipril. This is a good example to stress the importance of vetting the AI-generated MCQ by experts for content validity and to assure robust psychometrics. The MCQ on the most used drug in the treatment of “hypertension and diabetic nephropathy” is more explicit as opposed to “hypertension and diabetes” by Claude-Instant because the therapeutic concept of reducing or delaying nephropathy must be distinguished from prevention of nephropathy, although either an ACEI or ARB is the drug of choice for both indications.
It is important to align student assessment to the curriculum; in the PBL curriculum, MCQs with a clinical vignette are preferred. Modifying the query to request MCQs with a clinical vignette on the domains specified previously gave appropriate output from all three AI platforms evaluated (Sage Poe, Claude-Instant, and ChatGPT). The scenarios generated had good clinical fidelity and educational fit from the perspective of the pre-clerkship student.
The errors observed with AI outputs on the A-type MCQs are summarized in Table 2. No significant pattern was observed except that Claude-Instant generated test items in a stereotyped format, such as the same choices for all test items related to pharmacokinetics and indications, and all the test items in the ADR domain were linked to the mechanisms of action of drugs. This illustrates the importance of review of AI-generated test items by content experts for content validity, to ensure alignment with evidence-based medicine and up-to-date treatment guidelines.
The test items generated by ChatGPT had the advantage of explanations supplied, rendering these more useful for learners to support self-study. The following examples illustrate this assertion: “A patient with hypertension is started on a medication that works by blocking beta-1 receptors in the heart (metoprolol)”. Metoprolol is a beta blocker that works by blocking beta-1 receptors in the heart, which reduces heart rate and cardiac output, resulting in a decrease in blood pressure. However, this explanation is incomplete because there is no mention of other, less important mechanisms, such as the effect of beta receptor blockers on renin release. Also, these MCQs were mostly of the recall type: “Which of the following medications is known to have a significant first-pass effect?” The explanation reads: propranolol is known to have a significant first-pass effect, meaning that a large portion of the drug is metabolized by the liver before it reaches systemic circulation. Losartan, amlodipine, ramipril, and hydrochlorothiazide do not have a significant first-pass effect. However, it is also important to extend the explanation further by stating that the first-pass effect of propranolol does not lead to total loss of pharmacological activity because the metabolite hydroxypropranolol also has potent beta-blocking activity. Another MCQ test item had a construction defect: “A patient with hypertension is started on a medication that can cause photosensitivity. Which of the following medications is most likely responsible?” Options included: losartan, amlodipine, ramipril, hydrochlorothiazide, hydrochlorothiazide/triamterene. The explanation states that hydrochlorothiazide (HCTZ) can cause photosensitivity, which is increased sensitivity to sunlight that can cause skin rashes and sunburns. Losartan, amlodipine, ramipril, and HCTZ/triamterene are less likely to cause photosensitivity. 
However, it is irrational to claim that HCTZ/triamterene, one of the wrong options, is less likely to cause photosensitivity, because this combination itself contains HCTZ. The difficulty index of such test items is likely to be unacceptable in high-stakes tests for making equitable psychometric decisions.
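The overlapping-option defect described above lends itself to a simple automated pre-screen before expert review. The following is a minimal sketch, not a validated tool: the function names, the convention that a combination product is written with a "/" separator, and the list of catch-all phrases are all assumptions made for illustration. It flags options that share a component drug, and also catch-all options, which are often considered a construction flaw.

```python
from __future__ import annotations

import re

# Assumed list of catch-all phrases; extend as needed for local item-writing rules.
CATCH_ALL = {"none of the above", "all of the above"}


def normalize(option: str) -> str:
    """Lower-case an option and collapse internal whitespace."""
    return re.sub(r"\s+", " ", option.strip().lower())


def components(option: str) -> set[str]:
    """Split a combination product written as 'A/B' into its component names."""
    return {part.strip() for part in normalize(option).split("/") if part.strip()}


def find_option_flaws(options: list[str]) -> list[str]:
    """Return human-readable descriptions of suspected construction flaws."""
    flaws = []
    for opt in options:
        if normalize(opt) in CATCH_ALL:
            flaws.append(f"catch-all option: {opt}")
    for i, a in enumerate(options):
        for b in options[i + 1:]:
            shared = components(a) & components(b)
            if shared:
                flaws.append(
                    f"overlapping options: {a} and {b} share {', '.join(sorted(shared))}"
                )
    return flaws


# Options of the photosensitivity item quoted in the text.
photosensitivity_options = [
    "losartan",
    "amlodipine",
    "ramipril",
    "hydrochlorothiazide",
    "hydrochlorothiazide/triamterene",
]
flaws = find_option_flaws(photosensitivity_options)
for flaw in flaws:
    print(flaw)
```

Such a screen can only detect surface-level overlap; it cannot judge pharmacological correctness, so it complements rather than replaces vetting by content experts.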
The responses from Sage Poe, Claude-Instant, and ChatGPT on guidance for standard-setting methods are listed in the electronic supplementary materials 13–15. Sage Poe offered this information with the caveat that “the results of this simulation were purely hypothetical and do not reflect the actual performance of any real students”. The simulation estimates of examinees correctly answering test items ranged from 60 to 90% for the mechanism of action, 65–85% for pharmacokinetics, 60–80% for adverse reactions, and 75–90% for clinical indications of antihypertensives.
In contrast, Claude-Instant admitted that “Performing a full simulation with 100 medical students is beyond the scope of what I can do as an AI assistant. I can generate sample MCQ questions and answers, but I do not have access to a population of medical students to administer the questions. A true simulation that administers questions to real students and analyzes the results would require access to a database of medical students, a platform to deliver the questions, and code to analyze the responses. As an AI assistant, I lack these capabilities. While I cannot perform a full simulation, here are some considerations for administering MCQ questions to medical students: provide details on the purpose and instructions for the test to ensure students understand what is expected out of them; consider including some easier questions in addition to more challenging ones to boost students’ confidence and motivation; avoid ambiguous wording and ensure each question has only one unambiguously correct answer option; when analyzing results, look for questions factors that may have impacted performance, like unclear wording or answer options that are too similar. Focusing on weak questions can help improve future tests; compare performance across different student groups to identify areas where more teaching may be needed; provide detailed answer explanations and review sessions after the test to help students learn from any errors and reinforce correct concepts”. However, ChatGPT offered the simulation result estimates for the 20 MCQs on antihypertensive drugs: mechanism of action ranged from 67 to 92%, pharmacokinetics 63–86%, adverse effects 65–82%, and clinical indications 64–89%. Furthermore, it also stated that “Overall, the performance of the students was quite good, with most questions having a response rate of over 70%. 
However, there were some questions where the projected correct response rate was rather low, such as Question #5 (Mechanism of Action of Hydrochlorothiazide; see Electronic Supplementary Material 12) and Question 18 (Indications for Verapamil; see Electronic Supplementary Material 10). This may suggest areas where students need more focused education or review.”
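Once real examinee responses are available, the projected percentages above can be compared with the observed item difficulty index, that is, the proportion of examinees answering an item correctly. The sketch below is illustrative only: the response counts are hypothetical and are not taken from the supplementary materials, and the cut-offs of 0.30 and 0.90 are assumed values that each institution would replace with its own standard.

```python
def difficulty_index(n_correct: int, n_examinees: int) -> float:
    """Proportion of examinees who answered the item correctly (0.0 to 1.0)."""
    if n_examinees <= 0:
        raise ValueError("n_examinees must be positive")
    if not 0 <= n_correct <= n_examinees:
        raise ValueError("n_correct must lie between 0 and n_examinees")
    return n_correct / n_examinees


def classify_item(p: float, low: float = 0.30, high: float = 0.90) -> str:
    """Label an item using assumed cut-offs (low and high are not from the study)."""
    if p < low:
        return "too difficult"
    if p > high:
        return "too easy"
    return "acceptable"


# Hypothetical counts of correct responses out of 100 examinees.
correct_counts = {"item_A": 92, "item_B": 64, "item_C": 25}

labels = {
    item: classify_item(difficulty_index(count, 100))
    for item, count in correct_counts.items()
}
for item, label in labels.items():
    print(item, label)
```

Items labelled outside the acceptable band would then be candidates for review of wording, distractors, or teaching coverage.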
Integrated case cluster MCQs
We asked AI assistants to generate 20 integrated case cluster MCQs, with 2 test items in each cluster and five options per item, for undergraduate medical students in the pre-clerkship phase, integrating pharmacology and physiology related to systemic hypertension with a case vignette. The responses by Sage Poe, Claude-Instant, and ChatGPT are listed in the electronic supplementary materials 16–18. In all instances, the test items generated had focused case descriptions in the form of a clinical vignette, and horizontal integration across the pathophysiology of hypertension and pharmacology of antihypertensive drugs. These test items mostly targeted the ‘knows (knowledge)’ or ‘knows how (competence)’ level on Miller’s pyramid and are suitable for assessing the clinical competence of pre-clerkship medical students, especially in an integrated PBL curriculum. Both the AI assistants generated excellent clinical vignettes and themes; however, most of the cluster MCQs by ChatGPT had “None of the above” as an option, which is often considered a test item construction flaw. Notwithstanding these limitations, case cluster integrated test items are valuable for learners to integrate their knowledge of different basic medical sciences and their application to clinical sciences. This integrated approach can be used for both instructional and student assessment purposes to make the course more meaningful. Indeed, one of the basic tenets of PBL is curriculum integration.
In the next query, we asked AI assistants to write integrated case cluster MCQs with 2 test items in each cluster with 5 options for undergraduate medical students at the time of graduation integrating pharmacology and physiology of hypertension with a case vignette. The responses are listed in the electronic supplementary materials 19–21. Claude-Instant provided only partial responses to both queries. Overall, the information provided in case vignettes requires the learner to interpret the clinical information and make therapeutic decisions. Such test items are appropriate for assessing the minimum competency of medical school graduates to start internship or residency programs; these MCQs are generally comparable to test items used in USMLE Step 2. The evaluation of the integrated case cluster test items provided by all the AI platforms is summarized in Table 3. In general, the AI outputs had at least one construction defect, and did not include all the core sub-topics to a sufficient degree.
When asked to simulate the administration of the above-generated test items to 100 undergraduate medical students at the time of graduation and to provide the percentage of students likely to answer each item correctly, the AI assistant responded:
“As an AI language model, I cannot simulate the responses of 100 undergraduate medical students likely to obtain the correct answer for each question. However, I can provide some general information about the difficulty level of each question based on the complexity of the concepts being tested and the level of integration between pharmacology and physiology”.
The AI assistant went on to clarify that questions that require knowledge of basic pharmacology principles, such as the mechanism of action of specific drugs, are likely to be easier for students to answer correctly. Test items that require an understanding of the physiological mechanisms underlying hypertension and their correlation with symptoms are likely to be more challenging for students. The AI assistant sorted these test items into two categories accordingly. Overall, the difficulty level of the test item is based on the level of integration between pharmacology and pathophysiology. Test items that require an understanding of both pharmacological and physiological mechanisms are likely to be more challenging, because students need a strong foundation in both pharmacology and physiology concepts to answer integrated case-cluster MCQs correctly.
Short answer questions
The SAQs appropriate to the pre-clerkship phase generated by Sage Poe, Claude-Instant, and ChatGPT in response to the search query are listed in the electronic supplementary materials 22–24 for difficult questions and 25–27 for moderately difficult questions.
It is apparent from these case vignette descriptions that the short answer question format varied. Accordingly, the scope for asking individual questions for each scenario is open-ended. In all instances, model answers are supplied which are helpful for the course instructor to plan classroom lessons, identify appropriate instructional methods, and establish rubrics for grading the answer scripts, and as a study guide for students.
We then wanted to see to what extent AI can differentiate the difficulty of the SAQ by replacing the search term “difficult” with “moderately difficult” in the above search prompt: the changes in the revised case scenarios are substantial. Perhaps the context of learning and practice (and the level of the student in the MD/medical program) may determine the difficulty level of SAQ generated. It is worth noting that on changing the search from cardiology to internal medicine rotation in Sage Poe the case description also changed. Thus, it is essential to select an appropriate AI assistant, perhaps by trial and error, to generate quality SAQs. Most of the individual questions tested stand-alone knowledge and did not require students to demonstrate integration.
The responses of Sage Poe, Claude-Instant, and ChatGPT for the search query to generate SAQs at the time of graduation are listed in the electronic supplementary materials 28–30. It is interesting to note how AI assistants considered the stage of the learner while generating the SAQ. The response by Sage Poe is illustrative for comparison. “You are a newly graduated medical student who is working in a hospital” versus “You are a medical student in your pre-clerkship.”
Some questions were retained, deleted, or modified to align with competency appropriate to the context (Electronic Supplementary Materials 28–30). Overall, the test items at both levels from all AI platforms were technically accurate and thorough addressing the topics related to different disciplines (Table 3). The differences in learning objective transition are summarized in Table 4. A comparison of learning objectives revealed that almost all objectives remained the same except for a few (Table 5).
A similar trend was apparent with test items generated by other AI assistants, such as ChatGPT. The contrasting differences in questions are illustrated by the vertical integration of basic sciences and clinical sciences (Table 6).
Taken together, these in-depth qualitative comparisons suggest that AI assistants such as Sage Poe and ChatGPT consider the learner’s stage of training in designing test items, learning outcomes, and answers expected from the examinee. It is critical to state the search query explicitly to generate quality output by AI assistants.
OSPEs
The OSPE test items generated by Claude-Instant and ChatGPT appropriate to the pre-clerkship phase (without mentioning “appropriate instructions for the patients”) are listed in the electronic supplementary materials 31 and 32, and those with patient instructions in the electronic supplementary materials 33 and 34. For reasons unknown, Sage Poe did not provide any response to this search query.
The five OSPE items generated were suitable to assess the prescription writing competency of pre-clerkship medical students. The clinical scenarios identified by the AI platforms were comparable; the scenarios generated by Claude-Instant include hypertension and impaired glucose tolerance in a 65-year-old male, hypertension with chronic kidney disease (CKD) in a 55-year-old woman, resistant hypertension with obstructive sleep apnea in a 45-year-old man, and gestational hypertension at 32 weeks in a 35-year-old woman. Incorporating appropriate instructions facilitates the learner’s ability to educate patients and maximize safe and effective therapy. The OSPE item required students to write a prescription with guidance to start conservatively and to choose an appropriate antihypertensive drug class (drug) based on the patient’s profile, specifying the drug name, dose, dosing frequency, drug quantity to be dispensed, patient name, date, refill, and caution as appropriate, in addition to the prescriber’s name, signature, and license number. In contrast, the clinical scenarios identified by ChatGPT included patients with hypertension and CKD, hypertension and bronchial asthma, gestational diabetes, hypertension and heart failure, and hypertension and gout. Guidance was also provided on dosage titration, warnings to be aware of, safety monitoring, and the frequency of follow-up and dose adjustment. These test items are designed to assess learners’ knowledge of the pharmacology and therapeutics (P & T) of antihypertensives, as well as their ability to provide appropriate instructions to patients. These clinical scenarios for writing prescriptions assess students’ ability to choose an appropriate drug class, write prescriptions with proper labeling and dosing, reflect drug safety profiles and risk factors, and make modifications to meet the requirements of special populations. The prescription is required to state the drug name, dose, dosing frequency, patient name, date, refills, and cautions or instructions as needed. 
A conservative starting dose, a once- or twice-daily dosing frequency based on the drug, and instructions to titrate the dose slowly if required are also expected.
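The prescription elements listed above can be turned into a simple completeness checklist for grading answer scripts. The following is a minimal sketch under stated assumptions: the field names, the equal weighting of fields, and the sample script are hypothetical and are not drawn from the AI outputs; cautions are treated as optional because they are required only "as appropriate". It checks only whether each element is present, not whether the choice of drug or dose is clinically appropriate.

```python
# Required prescription elements, taken from the list in the text.
# The field names themselves are illustrative assumptions.
REQUIRED_FIELDS = (
    "patient_name",
    "date",
    "drug_name",
    "dose",
    "dosing_frequency",
    "quantity",
    "refills",
    "prescriber_name",
    "prescriber_signature",
    "license_number",
)


def missing_fields(prescription: dict) -> list:
    """Return the required fields that are absent or blank in a script."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = prescription.get(field)
        if value is None or not str(value).strip():
            missing.append(field)
    return missing


def completeness_score(prescription: dict) -> float:
    """Fraction of required fields present (each field weighted equally)."""
    total = len(REQUIRED_FIELDS)
    return (total - len(missing_fields(prescription))) / total


# Hypothetical student script with placeholder values and two omissions.
student_script = {
    "patient_name": "Patient A",
    "date": "2024-01-15",
    "drug_name": "Drug X",
    "dose": "dose as per formulary",
    "dosing_frequency": "once daily",
    "quantity": 30,
    "refills": 0,
    "prescriber_name": "Student B",
    "prescriber_signature": "",
}

omitted = missing_fields(student_script)
score = completeness_score(student_script)
print(omitted, score)
```

A rubric of this kind covers only the formal elements of the prescription; the appropriateness of the drug class, dose, and patient instructions still requires judgment by the examiner.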
The responses from Claude-Instant and ChatGPT for the search query related to generating OSPE test items at the time of graduation are listed in electronic supplementary materials 35 and 36. In contrast to the pre-clerkship phase, OSPEs generated for graduating doctors’ competence assessed more advanced drug therapy comprehension. For example, writing a prescription for:
(1) A 65-year-old male with resistant hypertension and CKD stage 3, to optimize the antihypertensive regimen: the answer was required to include starting an ACEI and a diuretic, titrating the dosage over two weeks, considering adding spironolactone or substituting the ACEI with an ARB, and closely monitoring serum electrolytes and kidney function.
(2) A 55-year-old woman with hypertension and paroxysmal arrhythmia: the answer was required to include switching the ACEI to an ARB due to cough, adding a CCB or beta blocker for rate control, adjusting the dosage slowly, and monitoring for side effects.
(3) A 45-year-old man with masked hypertension and obstructive sleep apnea: the answer was required to include adding a centrally acting antihypertensive at bedtime, increasing the dosage as needed based on home blood pressure monitoring, and referring the patient for CPAP if not already using it.
(4) A 75-year-old woman with isolated systolic hypertension and autonomic dysfunction: the answer was required to include stopping the diuretic and switching to an alpha blocker, with upward dosage adjustment and combination with other antihypertensives as needed based on postural blood pressure changes and symptoms.
(5) A 35-year-old pregnant woman with preeclampsia at 29 weeks: the answer was required to include doubling the methyldopa dose, considering adding labetalol or nifedipine based on severity, educating the patient on signs of worsening, and advising immediate follow-up for any concerning symptoms.
These case scenarios are designed to assess the ability of the learner to comprehend the complexity of antihypertensive regimens, make evidence-based regimen adjustments, prescribe multidrug combinations based on therapeutic response and tolerability, monitor complex patients for complications, and educate patients about warning signs and follow-up.
A similar output was provided by ChatGPT, with clinical scenarios such as prescribing for patients with hypertension and myocardial infarction; hypertension and chronic obstructive pulmonary disease (COPD); hypertension and a history of angina; hypertension and a history of stroke; and hypertension and advanced renal failure. In these cases, wherever appropriate, pharmacotherapeutic issues were stressed, such as taking ramipril after food to reduce side effects such as giddiness; selecting the most appropriate beta-blocker, such as nebivolol, in patients with COPD comorbidity; the importance of taking amlodipine at the same time every day with or without food; the preference for telmisartan among other ARBs in stroke; and choosing furosemide in patients with hypertension and edema and taking the medication with food to reduce the risk of gastrointestinal adverse effects.
The AI outputs on OSPE test items were observed to be technically accurate and thorough in addressing core sub-topics suitable for the learner’s level, and did not have any construction defects (Table 3). Both AIs provided model answers with explanatory notes. This facilitates the use of such OSPEs for self-assessment by learners for formative assessment purposes. The detailed instructions are helpful in creating optimized, evidence-based therapy regimens and in providing appropriate instructions to patients with complex medical histories. One can rely on multiple AI sources to identify and shortlist the required case scenarios and OSPE items, and to seek guidance on expected model answers with explanations. From a teaching/learning perspective, model answer guidance at the level of antihypertensive drug classes is more appropriate than guidance on a specific drug of a given class. We believe that these scenarios can be refined further by providing a focused case history along with relevant clinical and laboratory data to enhance clinical fidelity and bring a closer fit to the competency framework.