Large language models as educational collaborators: developing non-conventional teaching aids in pharmacology & therapeutics | BMC Medical Education

Key findings

This comparative evaluation of two LLMs, DeepSeek and ChatGPT, demonstrated that both can generate high-quality educational content across various formats, including SLOs, reading materials, CBL cases, and assessment items (MCQs, OSPEs, and SAQs), tailored to the learner’s educational stage. However, DeepSeek outperformed ChatGPT in aligning its outputs with the specified rubrics, particularly in terms of cognitive progression across Bloom’s taxonomy, structural precision, and contextual specificity. It demonstrated superior integration of assessment readiness, clinical applicability, and inclusivity in its outputs. While ChatGPT’s responses were conceptually sound, linguistically accessible, and pedagogically relevant, they occasionally lacked the depth, assessment scaffolding, and time-bound specificity seen in DeepSeek’s materials. Strikingly, DeepSeek’s output reflected tighter alignment with real-world clinical and educational standards, particularly those set by the WPATH and Endocrine Society guidelines, making it a more rubric-compliant and implementation-ready tool across all educational phases.

Comparison with existing literature

The present study highlights the capacity of two LLMs to generate instructional content aligned with sound pedagogical principles, specifically across various teaching tools, including assessment items, for a topic not traditionally covered in core pharmacology and therapeutics curricula. The topic of GAHT, while clinically relevant and culturally sensitive, remains largely absent from standard pharmacology textbooks, underscoring the need for alternative content generation approaches. One of the enduring global challenges in medical education, especially in the context of curriculum reform, is faculty resistance to change and the inertia of traditional educational paradigms [19]. Faculty, particularly those trained within historically rigid curricula, often encounter difficulties in creating content for novel or interdisciplinary topics, a challenge compounded in resource-limited academic environments typical of many developing countries [20]. Despite these systemic barriers, LLMs have seen increasing adoption among medical educators. For instance, a recent report indicated that over 60% of faculty members at a U.S. medical school have incorporated ChatGPT into their instructional design or teaching workflow [21]. This growing utilization reflects both the accessibility of such tools and their potential to enhance curricular responsiveness to emerging topics. The findings of this study substantiate the utility of LLMs as pedagogical adjuncts by demonstrating that both models maintained cognitive alignment with Bloom’s taxonomy throughout the three-tiered educational phases. In particular, DeepSeek exhibited a sophisticated understanding of cognitive hierarchies, systematically transitioning from fundamental pharmacological concepts in the preclerkship phase to more complex educational tasks such as policy critique and research methodology in the master’s program.
This progression supports the assertion that LLMs, when guided by structured prompts, can produce outputs that respect learners’ evolving developmental and educational needs.

While prior systematic reviews examining LLMs in medical education have predominantly focused on ChatGPT’s application in generating or validating multiple-choice questions [5, 22], relatively few studies have explored broader instructional design applications across an integrated curriculum. For example, one study explored the use of LLMs for creating anatomy MCQs and found promise, albeit tempered by technical limitations [22], while another study utilized ChatGPT as a virtual tutor for anatomical education [23]. Both concluded that while LLMs show promise, their use is tempered by risks such as factual inaccuracies, underlining the need for careful oversight. Unlike anatomy, where educational content remains relatively stable, pharmacology and therapeutics are dynamic disciplines that require frequent curricular updates to stay aligned with evolving treatment guidelines and public health needs. In our earlier investigation centered on antihypertensive pharmacotherapy, a topic central to core pharmacology, we identified substantial limitations in LLM outputs, including construction errors, ambiguous answer options, and misalignment with learners’ educational levels [5]. The contrast between those findings and the current study highlights the critical role of structured prompting. In this study, we employed rigorously developed prompts informed by best practices in instructional design, including the articulation of clear criteria and contextual boundaries. This methodological enhancement yielded LLM responses that were markedly more appropriate, consistent, and pedagogically sound. Additionally, the progressive refinement of LLMs over time may have also contributed to the observed outcomes.

To date, only one published study has directly compared the outputs of LLMs to conventional learning materials such as textbook summaries. That study found that while textbook summaries were rated as more comprehensive, ChatGPT’s responses were favored for their clarity, coherence, and ease of understanding [24]. This observation is in alignment with our findings, particularly with respect to the readability and user-friendliness of the reading materials generated by both LLMs. DeepSeek further distinguished itself by embedding visual elements and structuring content in a modular format conducive to learner engagement. The absence of conventional textbook content specific to GAHT means that LLM-generated materials fill a critical educational gap. The reading resources developed through this study could serve as a foundational template for faculty tasked with teaching this emerging topic, offering both content and structure that can be readily customized or expanded upon to meet institutional and learner-specific needs.

A particularly salient feature of this study was the incorporation of inclusive language and culturally responsive content by both models. The outputs not only addressed gender diversity with sensitivity but also embedded key concepts such as patient autonomy, social determinants of health, and shared decision-making within their instructional frameworks. DeepSeek demonstrated exceptional capacity to center non-binary identities and produced contextually appropriate scenarios that reflect contemporary clinical realities. ChatGPT, too, provided ethically sound and patient-centered content, particularly in areas related to communication and informed consent. However, the consistency and comprehensiveness of its inclusive efforts varied across instructional tools. These observations affirm the potential of LLMs to produce culturally competent educational materials aligned with current standards of care, including those set forth by the WPATH and the Endocrine Society [17, 18]. Given the central role of inclusivity in medical education, and its influence on learner values and professionalism as part of the “hidden curriculum”, this ability to generate respectful, representative content is not only desirable but essential [25].

Taken together, the insights gained from this study offer practical guidance for integrating LLMs into medical curriculum development. It is imperative that educators exercise careful judgment in selecting the appropriate model, designing effective prompts, and validating generated content before implementation. While both DeepSeek and ChatGPT demonstrated considerable potential, their outputs were optimized only when informed by educational frameworks and validated by expert faculty. The findings reinforce the notion that LLMs should not be viewed as replacements for faculty expertise, but rather as collaborative tools that, when coupled with pedagogical oversight, can greatly enhance instructional quality and innovation. These results advocate for the strategic adoption of LLMs in curriculum design, especially for emerging topics that fall outside the scope of traditional teaching resources and point toward a transformative future in medical education driven by thoughtful integration of artificial intelligence.

Strengths, weaknesses and way forward

A key strength of this study lies in its systematic rubric-based evaluation of diverse educational deliverables, ranging from SLOs to OSPEs, generated by two advanced LLMs, which allowed for a phase-specific and criterion-sensitive assessment of a recently emerging topic not covered in standard textbooks. The inclusion of outputs from preclerkship, clerkship, and master’s levels enhanced the generalizability of findings across the continuum of medical education. Additionally, the use of a well-defined, multidimensional rubric ensured rigorous and transparent comparisons. However, the study is limited by its reliance on a single prompt per task and LLM, which may not capture the full variability or potential adaptability of each model. Furthermore, the rubric application, though structured, may carry inherent subjectivity, especially in rating nuanced aspects such as construction quality or cultural sensitivity. Another limitation is that the study did not evaluate the actual impact of these outputs on learner performance or engagement, leaving a gap in understanding their real-world educational effectiveness. Moving forward, researchers should explore longitudinal implementation studies incorporating LLM-generated materials in curricula, assessing learning outcomes, student satisfaction, and faculty feedback. Medical educationists are encouraged to collaborate with AI developers to refine prompt engineering and rubric alignment, ensuring that future AI tools not only generate content but do so in a pedagogically sound, contextually relevant, and culturally sensitive manner.