A Systematic Review of The Limitations and Associated Opportunities of ChatGPT
To cite this article: Ngo Cong-Lem, Ali Soyoof & Diki Tsering (08 May 2024): A Systematic
Review of the Limitations and Associated Opportunities of ChatGPT, International Journal of
Human–Computer Interaction, DOI: 10.1080/10447318.2024.2344142
ABSTRACT
This systematic review explores the limitations and opportunities associated with ChatGPT's application across various fields. Following a rigorous screening process of 485 studies identified through searches in Scopus, Web of Science, ERIC, and IEEE Xplore databases, 33 high-quality empirical studies were selected for analysis. The review identifies five key limitations: accuracy and reliability concerns, limitations in critical thinking and problem-solving, multifaceted impacts on learning and development, technical constraints related to input and output, and ethical, legal, and privacy concerns. However, the review also highlights five exciting opportunities: educational support and skill development, workflow enhancement, information retrieval, natural language interaction and assistance, and content creation and ideation. While this review provides valuable insights, it also highlights some gaps. Limited transparency in the studies regarding specific ChatGPT versions used hinders generalizability. Additionally, the extent to which these findings can be transferred to more advanced models like ChatGPT-4 remains unclear. By acknowledging both limitations and opportunities, this review offers a foundation for researchers, developers, and practitioners to consider when exploring the potential and responsible application of ChatGPT and similar evolving AI tools.

KEYWORDS
ChatGPT; large language model; limitation; opportunity; review
CONTACT Ngo Cong-Lem [email protected] BehaviourWorks Australia, Monash University, 8 Scenic Boulevard, Clayton, Victoria 3168, Australia;
Faculty of Foreign Languages, Dalat University, Dalat, Vietnam
© 2024 The Author(s). Published with license by Taylor & Francis Group, LLC.
This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivatives License (http://creativecommons.org/licenses/by-nc-nd/4.0/),
which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited, and is not altered, transformed, or built upon in any way.
The terms on which this article has been published allow the posting of the Accepted Manuscript in a repository by the author(s) or with their consent.
foster a deeper understanding of the nuances involved in leveraging state-of-the-art language models. As we navigate the intricate landscape of artificial intelligence, a nuanced understanding of ChatGPT's strengths and weaknesses is paramount for harnessing its full potential and driving innovation in natural language processing.

2. A brief overview of research background on ChatGPT

Research has convincingly demonstrated that ChatGPT offers significant advantages and can contribute to various fields of study. Notably, it assists experts across different disciplines in composing reports for various experiments. For instance, Aydın and Karaarslan (2023) found that ChatGPT can be a valuable tool for paraphrasing and academic writing in the healthcare field. Similarly, a study by Kumar (2023) demonstrated that within biomedical sciences, ChatGPT can be employed to produce well-organized and grammatically sound English academic writing. Beyond these specific examples, ChatGPT offers broader advantages for users. It can address complex inquiries by providing comprehensive insights, ranging from general overviews to detailed analyses of intricate phenomena (Tan et al., 2023). Overall, the current body of research suggests that ChatGPT can be a valuable digital resource across diverse fields of study.

The literature highlights education as one of the primary domains where ChatGPT can make significant contributions (e.g., Bitzenbauer, 2023; Poole, 2022; Rudolph et al., 2023; Su & Yang, 2023). This potential, however, necessitates responsible use. A study by Bitzenbauer (2023) suggests that ChatGPT can enhance critical thinking skills among secondary school students in Germany. In another study, Poole (2022) reported that ChatGPT benefits language teachers by assisting them in designing exercises and lesson plans. Additionally, ChatGPT can empower teachers to create personalized learning experiences and exercises tailored to individual student needs (Su & Yang, 2023). Furthermore, ChatGPT has the potential to revolutionize higher education, particularly in assessment, learning, and teaching methodologies (Rudolph et al., 2023).

While the integration of ChatGPT into the education sector appears promising, Su and Yang (2023) advocate for careful consideration of several factors to maximize its effectiveness. These factors include determining the expected outcome, defining the appropriate level of automation, considering both the ethical and unethical aspects of use, and measuring the efficacy of ChatGPT in achieving the desired learning objectives.

The recommendations outlined by Su and Yang (2023) for the field of education can be broadly applied to various fields of study where experts leverage ChatGPT for different purposes. In other words, it is crucial for experts in different fields to first determine their desired outcomes and then carefully consider the level of automation, ethical implications, and overall effectiveness of ChatGPT within their specific contexts. As an example, General Practitioners (GPs) writing reports to patients or colleagues could benefit from evaluating the following criteria: (1) identifying the clear purpose of the report, (2) considering the limitations and advantages of using ChatGPT for report writing (including the level of automation and ethical considerations), and (3) determining the appropriate level of human intervention to ensure accuracy, professionalism, and adherence to ethical guidelines. By following this approach, GPs can optimize the use of ChatGPT for report writing while maintaining control and responsibility for the final content.

Despite the promising applications, there is still a limited comprehensive understanding of ChatGPT's limitations and opportunities across fields, based on a synthesis of empirical findings. This systematic review aims to address this gap by critically examining existing empirical studies on ChatGPT.

3. Method

The current systematic review followed the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines for conducting and reporting systematic reviews, ensuring a transparent and methodologically rigorous approach throughout the review process (Page et al., 2021).

3.1. Search strategies

A comprehensive search strategy was implemented to locate primary studies that report empirical evidence on the limitations of ChatGPT. The search was conducted across various databases including Web of Science, Scopus, IEEE Xplore, and ERIC. Keywords utilized in our search comprised "ChatGPT" AND "limitations," along with synonyms such as "weakness," "drawback," "challenge," and "pitfall" (for detailed search strings, refer to Appendix A).
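The full Boolean strings submitted to each database are reproduced in Appendix A. As an illustrative sketch only (not the authors' actual tooling), the following Python snippet shows how a Scopus-style query of this form can be assembled from such a synonym list; the term list and the helper function name are assumptions introduced here for illustration, while the query structure mirrors the Appendix A Scopus string.

# Illustrative sketch: assembling a Scopus-style Boolean query from a synonym list.
# The exact strings actually submitted to each database are reproduced in Appendix A.

SYNONYMS = [
    "limitation*", "weakness*", "drawback*", "challenge*", "pitfall*",
    "problem*", "issue*", "concern*", "risk*", "disadvantage*",
    "flaw*", "shortcoming*", "downside*", "bias*", "error*", "ethic*",
]

def build_scopus_query(terms):
    """Combine the synonym terms with OR and pair them with "chatgpt" using AND."""
    or_block = " OR ".join(terms)
    return f'TITLE-ABS-KEY (chatgpt AND ({or_block})) AND (LIMIT-TO (DOCTYPE, "ar"))'

if __name__ == "__main__":
    print(build_scopus_query(SYNONYMS))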
For establishing inclusion and exclusion criteria, studies were considered eligible if they were published as journal articles, provided empirical findings assessing ChatGPT's performance, and included a discussion on its limitations (details in Table 1). Publications characterized as conceptual or lacking an empirically supported examination of ChatGPT's performance and limitations were excluded. No date constraints were imposed on the search process. Results were managed using Covidence, a web-based systematic review management tool facilitating deduplication and screening.
The initial screening involved evaluating titles and abstracts collaboratively by all three authors to identify potentially relevant studies. Subsequently, the screening process progressed to a full-text evaluation of studies identified during the initial screening phase. This thorough examination was conducted in duplicate by the authors to ensure rigor and comprehensiveness. Any discrepancies encountered during the initial and full-text screening were resolved through discussion and consensus among the authors or, when necessary, by consulting an additional member of the research team. This systematic approach aimed to enhance the reliability and precision of the study selection process in the systematic review.

3.2. Data analysis

A thematic analysis, following the guidelines outlined by Braun and Clarke (2006), was utilized to uncover and categorize the reported limitations of ChatGPT across the diverse studies included. This methodical approach involved thoroughly familiarizing ourselves with each study to gain a deep understanding of ChatGPT's limitations. Subsequently, initial codes were systematically generated to organize key concepts into a structured data extraction table. Importantly, data extraction was conducted in duplicate by two of the authors to ensure precision and reliability in capturing the nuances of ChatGPT's limitations.

As we explored relationships between these codes, initial themes began to emerge, offering a holistic view of recurring patterns that represented more abstract categories of ChatGPT's limitations. The subsequent review and refinement of these themes aimed to ensure clarity and precision in encapsulating the multifaceted challenges identified in the included literature. This qualitative approach, rooted in thematic analysis and bolstered by the dual extraction performed by two authors, provided a rigorous and structured framework for synthesizing the diverse findings across the included studies, contributing to a comprehensive understanding of ChatGPT's limitations.

Data visualizations throughout this review were created using the Matplotlib library in Python (Hunter, 2007).
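The following minimal Python sketch illustrates how a bar chart of this kind could be produced with Matplotlib; the theme labels and percentages are those reported in Section 4.2, while the variable names, output file, and styling choices are illustrative assumptions rather than the authors' actual plotting script.

# Minimal, illustrative Matplotlib sketch of the kind of bar chart used for the
# figures in this review. The five limitation themes and their shares of coded
# instances are those reported in Section 4.2; styling choices are assumptions.
import matplotlib.pyplot as plt

themes = [
    "Accuracy and reliability",
    "Critical thinking and problem-solving",
    "Ethical, legal and privacy concerns",
    "Impact on learning and development",
    "Input and output constraints",
]
share = [47.06, 22.06, 13.24, 11.76, 10.29]  # % of coded limitation instances

fig, ax = plt.subplots(figsize=(8, 4))
ax.barh(themes, share)
ax.set_xlabel("Share of coded limitation instances (%)")
ax.invert_yaxis()  # largest theme at the top
fig.tight_layout()
fig.savefig("limitation_themes.png", dpi=300)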
4. Findings

4.1. Overview of the included studies

Figure 1 presents the PRISMA flowchart of searching for and screening studies for eligibility in this review. A review of 33 studies identified a diverse range of fields where ChatGPT limitations were investigated (see Figure 2). The most prevalent field of study was health (48.48%), highlighting the growing interest in understanding potential limitations of large language models in critical healthcare applications. This focus on health suggests a cautious approach to ensure responsible use of ChatGPT in this domain. Education (15.15%) emerged as the second most frequent field, indicating concern about potential shortcomings in educational settings. Engineering (9.09%) and other fields like psychology, chemistry, and physics (all around 3%) were also represented, showcasing a broader exploration of limitations across various disciplines. This distribution underscores the widespread interest in evaluating ChatGPT's limitations across diverse application areas, with a particular emphasis on ensuring its safe and effective use in healthcare and education.

In terms of the version of ChatGPT employed in the studies reviewed, 23 studies (69.7%) did not report the specific version of ChatGPT used. The remaining studies provided version information, with ChatGPT-3 (15.15%) appearing most frequently, followed by version 3.5 (12.12%) and a single study exploring a combination of 3.5 and 4 (3.03%). To investigate the potential influence of version differences on the limitations identified in this review, we conducted a sub-analysis of studies examining versions 3 and 3.5. This analysis revealed no significant discrepancies from the overall findings on limitations and opportunities discussed below. Due to the limited presence of research on ChatGPT-4 (only one study identified), a similar sub-analysis for this version was not feasible. The limitations associated with the scarcity of research on ChatGPT-4 and the generalizability of the review's conclusions on ChatGPT limitations will be addressed later in our discussion of the limitations of the review.
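As a quick consistency check (an illustrative sketch, not part of the original analysis), the percentages reported above can be converted back into whole-number study counts under the assumption that each percentage is a share of the 33 included studies; the labels and rounding approach below are introduced here for illustration only.

# Illustrative check: converting the field and version percentages reported above
# into study counts, assuming each percentage is a share of the 33 included studies.
n_studies = 33
reported = {
    "Health (48.48%)": 48.48,
    "Education (15.15%)": 15.15,
    "Engineering (9.09%)": 9.09,
    "Version not reported (69.7%)": 69.7,
    "ChatGPT-3 (15.15%)": 15.15,
    "ChatGPT-3.5 (12.12%)": 12.12,
    "ChatGPT-3.5 and 4 combined (3.03%)": 3.03,
}
for label, pct in reported.items():
    print(f"{label}: about {round(pct / 100 * n_studies)} of {n_studies} studies")
# e.g., 48.48% of 33 corresponds to 16 studies; 69.7% of 33 corresponds to 23 studies.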
4.2. RQ1: What limitations of ChatGPT are documented in prior empirical literature?

Our analysis of 33 studies identified limitations associated with ChatGPT. The most prevalent limitation concerned accuracy and reliability, with these issues found in 47.06% of the total instances identified across all studies. This highlights ChatGPT's potential to generate misleading or incorrect information. Limitations in critical and problem-solving thinking were present in 22.06% of instances, suggesting shortcomings in handling complex scenarios that require independent analysis. Ethical considerations, including potential biases and discriminatory outputs due to training data, were observed in 13.24% of instances, raising concerns about ethical, legal, and privacy issues. Furthermore, limitations in understanding context and suitability for in-depth exploration of specialized topics were found in 11.76% of ChatGPT interactions, potentially leading to adverse effects on users' learning and development. Finally, 10.29% of instances suggested limitations in handling diverse inputs and outputs, potentially hindering the usability of ChatGPT for complex tasks. These findings underscore the need for continued development and responsible use of large language models like ChatGPT (Table 2 and Figure 3).
4.2.1. Accuracy and reliability concerns

ChatGPT faces a significant limitation concerning its accuracy and reliability, particularly evident in evaluations within the health science domain (Ali, 2023; Ariyaratne et al., 2023; Clark, 2023). Ali's (2023) examination revealed significant factual inaccuracies, with ChatGPT providing unreliable and poorly informed responses, especially on contentious health issues. Wagner and Ertl-Wagner's (2023) findings further underscored this concern, indicating that up to 33% of ChatGPT's responses to radiology questions were inaccurate, highlighting a substantial deficiency in accuracy within the medical domain.

Au Yeung et al. (2023) tasked ChatGPT with predicting medical diagnoses based on clinical histories. While the AI provided overall high-quality responses in terms of relevance (83%), it missed crucial diagnoses in 60% of its outputs. This deficiency poses a significant risk, particularly in healthcare contexts, where ChatGPT is likely to generate misleading outputs, potentially perpetuating harmful health beliefs or reinforcing biases.
Fergus et al.'s (2023) study in the pharmaceutical program domain of chemistry found inconsistencies in ChatGPT's responses to test questions. Each answer contained a different error attributed to technical randomness. Similarly, Hoch et al.'s (2023) medical quiz study revealed significant domain-specific variations in ChatGPT's performance. It achieved a 72% accuracy rate on questions in allergology (a field studying hypersensitive reactions of the immune system), while the rest of the responses were inaccurate (Hoch et al., 2023).

Beyond health science, accuracy concerns persist. Clark's (2023) evaluation in a chemistry test resulted in a concerning accuracy rate: only 44% of responses were correct, falling below the average score of participants. This inaccuracy extended to medical assessments, with ChatGPT falling short of the required score in the GP test. Reliability issues were noted by Clark (2023), Duong and Solomon (2023), and Seth et al. (2023), highlighting inconsistencies in ChatGPT's answers to identical questions and criticizing its suitability as a source of sample answers for examinations. Lai (2023) explored the AI chatbot's potential use in addressing inquiries of library service users and found that it performed poorly on advanced research questions, complex inquiries, and queries involving locally specific information.

Seth et al. (2023) further exposed a troubling aspect of ChatGPT's behavior—the generation of fake references, labeled as hallucination of references. Similar findings on hallucinations of the chatbot were reported in the study by McIntosh et al. (2024). Wagner and Ertl-Wagner (2023) discovered that 63.8% of ChatGPT's references in response to radiology questions were fabricated, accentuating broader reliability concerns. Hoch et al.'s (2023) extensive study involving a medical board certification test revealed domain-specific performance challenges, with ChatGPT's accuracy varying significantly by domain.

In summary, the themes of accuracy and reliability emerge as prominent limitations in ChatGPT. These limitations encompass technical inaccuracies, inconsistencies in responses, and domain-specific performance challenges.

4.2.2. Limitations in critical thinking and problem-solving

A second limitation concerns ChatGPT's capability for accomplishing critical thinking, problem-solving, and mathematical tasks (Cascella et al., 2023; Clark, 2023; Giannos & Delardas, 2023). Cascella et al.'s (2023) evaluation, involving the composition of a medical note, highlighted deficiencies in addressing causal relations among health conditions, indicating inadequacy in complex reasoning. Clark (2023) emphasized the model's proficiency in addressing general questions over problem-solving or skill-specific queries, while Duong and Solomon's (2023) study revealed ChatGPT's preference for memory-based questions rather than critical thinking tasks. Sanmarchi et al. (2023) assessed ChatGPT's ability to design studies and suggest plastic surgery options, revealing limitations in constructing conceptual frameworks and narrative structures. Seth et al.'s (2023) examination of ChatGPT's responses to plastic surgery questions highlighted inadequacies in addressing specialized topics, particularly critical thinking skills for complex issues like thumb arthritis.

In educational contexts, Giannos and Delardas (2023) reported ChatGPT's subpar performance on critical thinking and mathematical questions, with more incorrect than correct responses. Parsons and Curry's (2024) evaluation echoed these concerns. They assessed ChatGPT's capability in completing a graduate instructional design assignment for a 12th-grade media literacy course. The chatbot primarily provided superficial information and demonstrated a limited capacity to customize its responses or justify them with details. Rahman and Watanobe's (2023) scrutiny of ChatGPT's mathematical capabilities found dissatisfaction in generating code and correcting errors, exposing weaknesses in basic mathematical tasks. Kortemeyer (2023) further found that ChatGPT narrowly passed an introductory physics course and exhibited "many of the preconceptions and errors of a beginning learner" (p. 1).

The problem-solving capability of ChatGPT for coding practices was also questioned in the study by Shoufan (2023), where the chatbot showed inconsistent responses and struggled to complete the given code, even the code it had generated itself. Collectively, these findings underscore ChatGPT's limited capacity for critical thinking and problem-solving across diverse domains.

4.2.3. Multifaceted impact on learning and development

This section explores the multifaceted impact ChatGPT's responses might have on users' learning and development. Concerns include learners' potential overreliance on the tool leading to declines in critical thinking skills. Additionally, the risk of bias and incomplete information in ChatGPT's responses is another consideration. Finally, the potential psychological effects on vulnerable individuals seeking interaction and decision-making support from the AI tool warrant consideration.

In the realm of education, Alafnan et al. (2023) highlighted the positive impact of ChatGPT in providing reliable input to answer test questions. However, they cautioned against overreliance and irresponsible use of the AI tool, emphasizing the potential consequences of "human unintelligence and unlearning" if not used judiciously (p. 60). Clark (2023) echoed concerns about overreliance on ChatGPT, suggesting that excessive dependence could result in passivity and a decline in critical thinking skills among learners. Notably, the challenge of detecting logical fallacies in ChatGPT is a particular concern. The model's ability to provide seemingly logical explanations, even when flawed, may mislead users who lack specific expertise in the subject matter.

Giannos and Delardas (2023) assessed ChatGPT's capability for education and test preparation, concluding that while the AI chatbot is adept at providing tutoring support for general problem solving and reading comprehension, its limitations in scientific and mathematical knowledge and skills render it an unreliable independent tool for supporting students. They also underscored the potential for misuse, highlighting concerns about cheating and gaining unfair advantages during standardized admission tests.
Ibrahim et al. (2023) raised the issue of potential bias in ChatGPT's responses, asserting that the model might be influenced by the dataset used for training, aligning more closely with the political and philosophical values of Western and more developed countries. Sallam et al. (2023) echoed these concerns, particularly in medical education, where biased, outdated, and incomplete content in ChatGPT's responses could pose risks to learners. They noted potential adverse consequences, including discouraging critical thinking and communication skills among medical students.

Additionally, Stojanov (2023) warned of the psychological impact on vulnerable individuals, such as those grieving or extremely shy, who may turn to ChatGPT for solace and interaction. Stojanov also highlighted the risk of individuals relying on the AI tool for crucial life decisions, potentially weakening their personal agency and responsibility. These varied concerns collectively emphasize the need for a cautious and informed approach to the integration of ChatGPT in educational settings.

4.2.4. Technical constraints related to input and output

The effectiveness of ChatGPT is further contingent upon technical constraints of its input and output. This limitation of ChatGPT-3 and ChatGPT-3.5 poses challenges, particularly in disciplines like mathematics and chemistry, where communication often involves signs and symbols. Fergus et al. (2023) conducted examinations in the field of chemistry, revealing instances where ChatGPT struggled, particularly in tasks requiring the drawing of structures between reactants and products.

Furthermore, the efficacy of ChatGPT is influenced by the type of questions posed to it. Notably, the chatbot exhibited a significantly higher performance when responding to single-choice questions compared to multiple-choice questions (Hoch et al., 2023). In an extensive study encompassing 2,576 questions, Hoch et al. (2023) observed a 63% accuracy rate for single-choice questions, in contrast to a 34% accuracy rate for multiple-choice questions.

The phrasing of prompts for ChatGPT responses is also a pivotal factor affecting the chatbot's performance. Sallam et al. (2023) acknowledged that the formulation of prompts, coupled with the word limit imposed on ChatGPT's output, could influence the amount of information generated, subsequently impacting the clarity and effectiveness of the responses. Similarly, Stojanov (2023) reported that ChatGPT's inherent word limit in its output may result in responses containing incomplete information, posing challenges to comprehension.

4.2.5. Ethical, legal and privacy concerns

Previous studies have addressed academic integrity, legal, privacy, and ethical concerns associated with the use of ChatGPT (Au Yeung et al., 2023; Alafnan et al., 2023; Ibrahim et al., 2023; Sallam et al., 2023; Sanmarchi et al., 2023). Academic integrity emerges as a prominent concern, particularly in light of the challenges posed by most plagiarism detection software in identifying content generated by ChatGPT. Fergus et al. (2023) conducted an examination using Turnitin to assess plagiarism in ChatGPT's output, concluding that the Turnitin report failed to raise any alerts necessitating further investigation into academic integrity (p. 1674). This inability to detect generated content raises concerns about the potential misuse of ChatGPT and its impact on academic honesty.

Furthermore, educators face challenges in distinguishing between students' original work and content generated by ChatGPT, making assessment of individual abilities more complex. Alafnan et al. (2023) argued that the high accuracy and reliability of ChatGPT's responses may impede instructors' ability to differentiate between independently working students and those heavily reliant on automation. This, in turn, can compromise the evaluation of learning outcomes, causing a significant challenge in assessing students' performance. The implications of ChatGPT for academic integrity, underscored by these studies, highlight the need for careful consideration and regulation in its educational use.

Other legal and ethical issues, including privacy and copyright infringements, were also raised in the literature. The answers generated by ChatGPT raise privacy concerns that may lead to further legal ramifications (Au Yeung et al., 2023; Ibrahim et al., 2023; Sallam et al., 2023; Sanmarchi et al., 2023). Notably, the potential biases in ChatGPT's responses, possibly leaning towards specific political parties or perspectives, raise red flags regarding the validity of its content (Au Yeung et al., 2023). Sallam et al. (2023) specifically assessed responses to health and public education prompts, revealing concerns about plagiarism, copyright issues, academic dishonesty, and the absence of personal and emotional interactions, which are essential for communication skills in healthcare education.

4.3. RQ2: In light of the limitations identified, what opportunities exist for enhancing the utilization of ChatGPT?

Table 3 presents a list of opportunities for ChatGPT identified in this review, offering actionable insights for capitalizing on its strengths and capabilities.

Table 3. Opportunities of ChatGPT identified in the included studies (n = 44 instances of opportunities).

The opportunities of ChatGPT                  Frequency   Percentage (%)
Human-like interaction and assistance         6           13.64
Education support and skill development       16          36.36
Task automation and workflow enhancement      11          25
Content creation and ideation                 4           9.09
Information retrieval and application         7           15.91
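As a simple arithmetic check, the percentages in Table 3 follow directly from dividing each frequency by the 44 coded opportunity instances; the short Python snippet below is an illustrative sketch of that calculation (variable names are assumptions, not part of the original analysis).

# Illustrative check of the percentages in Table 3: each frequency divided by the
# 44 coded opportunity instances, rounded to two decimals.
frequencies = {
    "Human-like interaction and assistance": 6,
    "Education support and skill development": 16,
    "Task automation and workflow enhancement": 11,
    "Content creation and ideation": 4,
    "Information retrieval and application": 7,
}
total = sum(frequencies.values())  # 44
for theme, count in frequencies.items():
    print(f"{theme}: {count}/{total} = {100 * count / total:.2f}%")
# Prints 13.64, 36.36, 25.00, 9.09 and 15.91, matching Table 3.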
4.3.1. Educational support and skill development

ChatGPT's impact on education is multifaceted. It provides educational content, aids in learning processes, and contributes to essential skills development. Scholars have discussed various ways ChatGPT can support this domain, including creating course materials, designing lesson plans and assessments, providing feedback, explaining complex knowledge, and personalizing the learning experience (Clark, 2023; Rahman & Watanobe, 2023). Day (2023) suggests using ChatGPT to develop writing course materials. Drawing on Vygotsky's sociocultural theory, Stojanov (2023) discusses how ChatGPT could serve as a knowledgeable learning peer, aiding knowledge exploration. Similarly, Rahman et al. (2023) discuss benefits for learners, educators, and researchers. Learners can employ ChatGPT as a learning assistant for exploring complex concepts, problem-solving, and receiving personalized guidance. Educators can leverage ChatGPT for lesson planning, generating customized resources and activities, answering student questions, and assisting with assessment. Researchers can improve their work by using ChatGPT to check and improve writing, request literature summaries, or suggest research ideas.
4.3.2. ChatGPT as a workflow enhancer

Beyond education, ChatGPT's ability to automate tasks and enhance professional workflows optimizes operational efficiency and resource utilization. In the construction industry, Prieto et al. (2023) tested ChatGPT's application in creating a coherent and logical construction project schedule. Participants found it satisfactory and indicated its potential for automating preliminary and time-consuming tasks. Similarly, Sanmarchi et al. (2023) suggest ChatGPT as a valuable tool for designing research studies and following international guidelines, for both experienced and less experienced researchers.

4.3.3. Information retrieval powerhouse

ChatGPT's prowess in retrieving and applying information across various domains empowers users with informed decision-making and problem-solving. Alafnan et al. (2023) discovered that ChatGPT has the potential to function as a valuable platform for students seeking information on diverse topics. They asserted that ChatGPT's capabilities could potentially replace traditional search engines by offering students accurate and reliable information. Duong and Solomon (2023) compared ChatGPT's ability to respond to genetics questions against human performance, revealing that the chatbot approached human-level proficiency. Stojanov (2023) discusses how ChatGPT played a crucial role in providing valuable content, aiding in the ongoing pursuit of learning and exploration of new knowledge.

4.3.4. Natural language interaction and assistance

ChatGPT's ability to engage users in natural conversations and provide human-like assistance positions it as a valuable virtual companion. Lahat et al. (2023) explored using ChatGPT to answer 110 real-life medical questions from patients, finding it relatively useful and satisfactory, albeit with moderate effectiveness. Other scholars interacted with the chatbot for tasks such as creating a construction project (Prieto et al., 2023) or discussing a plastic surgery topic (Seth et al., 2023). Prieto et al. (2023) highlighted that the conversation-based chatbot is advantageous compared to other single-prompted AI tools as it allows users to modify project aspects as needed.

4.3.5. Content creation and ideation

Finally, ChatGPT facilitates creative content generation, text transformation, and ideation processes, making it a versatile tool for content creators and innovators. Ariyaratne et al. (2023) discussed using ChatGPT for research, suggesting that "the format of articles generated by ChatGPT can be used as a draft template to write an expanded version of the article" (p. 4). Similarly, ChatGPT can be used to enhance research processes by assisting researchers in generating hypotheses, exploring literature, and translating research findings into a more understandable language (Cascella et al., 2023). In education, ChatGPT can be used to create course materials, such as for writing courses (Day, 2023).

Regarding ideation capability, Clark (2023) demonstrated that ChatGPT could be used to support problem conceptualization in chemistry education. A similar conclusion is reached in engineering education, as Nikolic et al. (2023) indicated that ChatGPT can support students by aiding in the generation of project ideas, providing information, assisting with project structure, delivering summaries, and offering feedback on ethical considerations and workplace health and safety risks associated with their projects. The text-transforming function is another advantageous feature of this generative AI tool. Prieto et al. (2023) indicated that ChatGPT is useful for transforming research writing into more readily understandable language (Figure 4).

5. Discussion

This systematic review identified five key limitations associated with ChatGPT's application across diverse fields. Accuracy and reliability emerged as a primary concern, particularly in critical domains like healthcare (Fergus et al., 2023). Additionally, limitations were found in ChatGPT's ability to perform complex cognitive tasks such as critical thinking and problem-solving (Clark, 2023). Studies identified potential negative effects on learners' development due to overreliance on the tool, potentially hindering the development of critical thinking skills (Alafnan et al., 2023; Sallam et al., 2023). Technical constraints related to ChatGPT's input and output, such as difficulties with non-text formats and sensitivity to question type and prompt phrasing, were also evident. Finally, ethical considerations surrounding academic integrity, privacy, and copyright infringement emerged as limitations requiring careful attention when deploying ChatGPT in educational and professional settings (Ibrahim et al., 2023; Puthenpura et al., 2023).

The analysis of included studies also revealed five key themes highlighting potential opportunities presented by ChatGPT. One area of potential lies in information retrieval, where research suggests it can be a valuable tool for finding information across various subjects. Another promising area is natural interaction and support, with ChatGPT's ability to hold natural conversations making it a potential candidate as a virtual companion or assistant in fields like medicine (Lahat et al., 2023) and creative endeavors (Seth et al., 2023). Studies also indicate that ChatGPT may automate tasks and improve workflow efficiency (Prieto et al., 2023; Sanmarchi et al., 2023). Within the educational domain, research explores its potential for personalized learning experiences, creating course materials, and supporting skill development.
The generalizability of the identified limitations to more advanced iterations of ChatGPT, such as version 4, remains unclear. While the studies included explored limitations of earlier versions (3 and 3.5), it is uncertain if these limitations persist or change in the latest iteration. Additionally, a significant portion (69.7%) of the reviewed studies did not report the specific ChatGPT version used. This lack of transparency hinders our ability to definitively assess how limitations might vary across different versions.

However, while ChatGPT-4 reportedly leverages larger datasets, potentially leading to enhanced performance, and incorporates plugin functionalities, previous scholars have indicated that many limitations identified in ChatGPT-3.5 are still applicable to it. While advancements have been made, OpenAI (2023) acknowledges that ChatGPT-4 still exhibits limitations from earlier versions, including hallucinations, unreliability, and a limited context window, and lacks the ability to learn from experience. Supporting this, Suchman et al. (2023) found no demonstrable advantage for ChatGPT-4 in a medical test, even showing a performance deficit compared to the free version (ChatGPT-3.5) on gastroenterology self-assessment tests.

Regarding future research directions, it is crucial for researchers to explicitly report the specific version of ChatGPT used in their studies to enhance the generalizability and reliability of research findings in the future. This facilitates comparisons across studies and allows for a more nuanced understanding of how limitations evolve across ChatGPT versions. Next, developing best practices for educators, assessment methods that leverage ChatGPT's strengths, and research on its impact on learning outcomes are essential next steps. Finally, integration with specific domains presents a promising avenue for future research. Investigating the potential of integrating ChatGPT with specialized tools in various contexts, along with domain-specific training methods and the associated ethical considerations, is recommended.

6. Conclusion

This systematic review examined limitations and opportunities associated with ChatGPT's application across various fields. By analyzing 33 carefully screened empirical studies, it offers a comprehensive picture of ChatGPT's capabilities. The review identified five key limitations: accuracy concerns in critical domains like healthcare, limitations in complex cognitive tasks, potential negative impacts on learners' development due to overreliance, technical constraints related to input and output, and ethical considerations surrounding privacy, copyright, and academic integrity.

However, the review also highlights five opportunities. ChatGPT has the potential to be a valuable tool for users seeking information across various domains. Its ability to engage in natural conversations positions it as a potential virtual companion or assistant. The review also found promise in its ability to automate tasks and enhance workflows, leading to improved efficiency. Within education, ChatGPT presents opportunities for personalized learning experiences, course material creation, and student support. Finally, the review suggests promise in the ability of ChatGPT to generate creative text formats and support ideation processes, highlighting its potential as a tool for content creation, research, and brainstorming. By acknowledging both limitations and opportunities, this review offers valuable insights for researchers, developers, and users to consider when exploring the potential and responsible application of ChatGPT.

Disclosure statement

No potential conflict of interest was reported by the author(s).

ORCID

Ngo Cong-Lem http://orcid.org/0000-0002-5257-8264
Ali Soyoof http://orcid.org/0000-0002-8037-5632
Diki Tsering http://orcid.org/0000-0003-0157-4009

References

Alafnan, M. A., Dishari, S., Jovic, M., & Lomidze, K. (2023). ChatGPT as an educational tool: Opportunities, challenges, and recommendations for communication, business writing, and composition courses. Journal of Artificial Intelligence and Technology, 3(2), 60–68. https://doi.org/10.37965/jait.2023.0184
Ali, M. J. (2023). ChatGPT and lacrimal drainage disorders: Performance and scope of improvement. Ophthalmic Plastic and Reconstructive Surgery, 39(5), 515–514. https://doi.org/10.1097/IOP.0000000000002418
Amin, M. M., Cambria, E., & Schuller, B. W. (2023). Will affective computing emerge from foundation models and general artificial intelligence? A first evaluation of ChatGPT. IEEE Intelligent Systems, 38(2), 15–23. https://doi.org/10.1109/MIS.2023.3254179
Ariyaratne, S., Iyengar, K. P., Nischal, N., Chitti Babu, N., & Botchu, R. (2023). A comparison of ChatGPT-generated articles with human-written articles. Skeletal Radiology, 52(9), 1755–1758. https://doi.org/10.1007/s00256-023-04340-5
Au Yeung, J., Kraljevic, Z., Luintel, A., Balston, A., Idowu, E., Dobson, R. J., & Teo, J. T. (2023). AI chatbots not yet ready for clinical use. Frontiers in Digital Health, 5, 1161098. https://doi.org/10.3389/fdgth.2023.1161098
Aydın, Ö., & Karaarslan, E. (2023). Is ChatGPT leading generative AI? What is beyond expectations? Academic Platform Journal of Engineering and Smart Systems, 11(3), 118–134. https://doi.org/10.21541/apjess.1293702
Bitzenbauer, P. (2023). ChatGPT in physics education: A pilot study on easy-to-implement activities. Contemporary Educational Technology, 15(3), ep430. https://doi.org/10.30935/cedtech/13176
Braun, V., & Clarke, V. (2006). Using thematic analysis in psychology. Qualitative Research in Psychology, 3(2), 77–101. https://doi.org/10.1191/1478088706qp063oa
Bubeck, S., Chandrasekaran, V., Eldan, R., Gehrke, J., Horvitz, E., Kamar, E., Lee, P., Lee, Y. T., Li, Y., Lundberg, S., Nori, H., Palangi, H., Ribeiro, M. T., & Zhang, Y. (2023). Sparks of Artificial General Intelligence: Early experiments with GPT-4. https://doi.org/10.48550/ARXIV.2303.12712
Cadamuro, J., Cabitza, F., Debeljak, Z., De Bruyne, S., Frans, G., Perez, S. M., Ozdemir, H., Tolios, A., Carobene, A., & Padoan, A. (2023). Potentials and pitfalls of ChatGPT and natural-language artificial intelligence models for the understanding of laboratory medicine test results. An assessment by the European Federation of Clinical Chemistry and Laboratory Medicine (EFLM) Working Group. Clinical Chemistry and Laboratory Medicine, 61(7), 1158–1166. https://doi.org/10.1515/cclm-2023-0355
Cascella, M., Montomoli, J., Bellini, V., & Bignami, E. (2023). Evaluating the feasibility of ChatGPT in healthcare: An analysis of multiple clinical and research scenarios. Journal of Medical Systems, 47(1), 33. https://doi.org/10.1007/s10916-023-01925-4
Clark, T. M. (2023). Investigating the use of an artificial intelligence chatbot with general chemistry exam questions. Journal of Chemical Education, 100(5), 1905–1916. https://doi.org/10.1021/acs.jchemed.3c00027
Day, T. (2023). A preliminary investigation of fake peer-reviewed citations and references generated by ChatGPT. The Professional Geographer, 75(6), 1024–1027. https://doi.org/10.1080/00330124.2023.2190373
Duong, D., & Solomon, B. D. (2023). Analysis of large-language model versus human performance for genetics questions. European Journal of Human Genetics, 32(4), 466–468. https://doi.org/10.1038/s41431-023-01396-8
Fergus, S., Botha, M., & Ostovar, M. (2023). Evaluating academic answers generated using ChatGPT. Journal of Chemical Education, 100(4), 1672–1675. https://doi.org/10.1021/acs.jchemed.3c00087
Giannos, P., & Delardas, O. (2023). Performance of ChatGPT on UK standardized admission tests: Insights from the BMAT, TMUA, LNAT, and TSA examinations. JMIR Medical Education, 9, e47737. https://doi.org/10.2196/47737
Gregorcic, B., & Pendrill, A. M. (2023). ChatGPT and the frustrated Socrates. Physics Education, 58(3), 035021. https://doi.org/10.1088/1361-6552/acc299
Hassani, H., & Silva, E. S. (2023). The role of ChatGPT in data science: How AI-assisted conversational interfaces are revolutionizing the field. Big Data and Cognitive Computing, 7(2), 62. https://doi.org/10.3390/bdcc7020062
Hoch, C. C., Wollenberg, B., Lüers, J. C., Knoedler, S., Knoedler, L., Frank, K., Cotofana, S., & Alfertshofer, M. (2023). ChatGPT's quiz skills in different otolaryngology subspecialties: An analysis of 2576 single-choice and multiple-choice board certification preparation questions. European Archives of Oto-Rhino-Laryngology, 280(9), 4271–4278. https://doi.org/10.1007/s00405-023-08051-4
Hunter, J. D. (2007). Matplotlib: A 2D graphics environment. Computing in Science & Engineering, 9(3), 90–95. https://doi.org/10.1109/MCSE.2007.55
Ibrahim, H., Asim, R., Zaffar, F., Rahwan, T., & Zaki, Y. (2023). Rethinking homework in the age of artificial intelligence. IEEE Intelligent Systems, 38(2), 24–27. https://doi.org/10.1109/MIS.2023.3255599
Kortemeyer, G. (2023). Could an artificial-intelligence agent pass an introductory physics course? Physical Review Physics Education Research, 19(1), 1–18. https://doi.org/10.1103/PhysRevPhysEducRes.19.010132
Kumar, H. A. (2023). Analysis of ChatGPT tool to assess the potential of its utility for academic writing in biomedical domain. Biology, Engineering, Medicine and Science Reports, 9(1), 24–30. https://doi.org/10.5530/bems.9.1.5
Lahat, A., Shachar, E., Avidan, B., Glicksberg, B., & Klang, E. (2023). Evaluating the utility of a large language model in answering common patients' gastrointestinal health-related questions: Are we there yet? Diagnostics, 13(11), 1950. https://doi.org/10.3390/diagnostics13111950
Lai, K. (2023). How well does ChatGPT handle reference inquiries? An analysis based on question types and question complexities. College & Research Libraries, 84(6), 974–995. https://doi.org/10.5860/crl.84.6.974
Lo, C. K. (2023). What is the impact of ChatGPT on education? A rapid review of the literature. Education Sciences, 13(4), 410. https://doi.org/10.3390/educsci13040410
McIntosh, T. R., Liu, T., Susnjak, T., Watters, P., Ng, A., & Halgamuge, M. N. (2024). A culturally sensitive test to evaluate nuanced GPT hallucination. IEEE Transactions on Artificial Intelligence, 1–13. https://doi.org/10.1109/TAI.2023.3332837
Nikolic, S., Daniel, S., Haque, R., Belkina, M., Hassan, G. M., Grundy, S., Lyden, S., Neal, P., & Sandison, C. (2023). ChatGPT versus engineering education assessment: A multidisciplinary and multi-institutional benchmarking and analysis of this generative artificial intelligence tool to investigate assessment integrity. European Journal of Engineering Education, 48(4), 559–614. https://doi.org/10.1080/03043797.2023.2213169
OpenAI. (2023). GPT-4 technical report. https://cdn.openai.com/papers/gpt-4.pdf
Page, M. J., McKenzie, J. E., Bossuyt, P. M., Boutron, I., Hoffmann, T. C., Mulrow, C. D., Shamseer, L., Tetzlaff, J. M., Akl, E. A., Brennan, S. E., Chou, R., Glanville, J., Grimshaw, J. M., Hróbjartsson, A., Lalu, M. M., Li, T., Loder, E. W., Mayo-Wilson, E., McDonald, S., … Moher, D. (2021). The PRISMA 2020 statement: An updated guideline for reporting systematic reviews. International Journal of Surgery, 88(2021), 105906. https://doi.org/10.1016/j.ijsu.2021.105906
Parsons, B., & Curry, J. H. (2024). Can ChatGPT pass graduate-level instructional design assignments? Potential implications of artificial intelligence in education and a call to action. TechTrends, 68(1), 67–78. https://doi.org/10.1007/s11528-023-00912-3
Poole, F. (2022). Using ChatGPT to design language material and exercises. https://fltmag.com/chatgpt-design-material-exercises/
Prieto, S. A., Mengiste, E. T., & García de Soto, B. (2023). Investigating the use of ChatGPT for the scheduling of construction projects. Buildings, 13(4), 857. https://doi.org/10.3390/buildings13040857
Puthenpura, V., Nadkarni, S., DiLuna, M., Hieftje, K., & Marks, A. (2023). Personality changes and staring spells in a 12-year-old child: A case report incorporating ChatGPT, a natural language processing tool driven by artificial intelligence (AI). Cureus, 15(3), e36408. https://doi.org/10.7759/cureus.36408
Rahman, M. M., & Watanobe, Y. (2023). ChatGPT for education and research: Opportunities, threats, and strategies. Applied Sciences, 13(9), 5783. https://doi.org/10.3390/app13095783
Rahman, M., Terano, H. J. R., Rahman, N., Salamzadeh, A., & Rahaman, S. (2023). ChatGPT and academic research: A review and recommendations based on practical examples. Journal of Education, Management and Development Studies, 3(1), 1–12. https://doi.org/10.52631/jemds.v3i1.175
Ray, P. P. (2023). ChatGPT: A comprehensive review on background, applications, key challenges, bias, ethics, limitations and future scope. Internet of Things and Cyber-Physical Systems, 3, 121–154. https://doi.org/10.1016/j.iotcps.2023.04.003
Rozado, D. (2023). The political biases of ChatGPT. Social Sciences, 12(3), 148. https://doi.org/10.3390/socsci12030148
Rudolph, J., Tan, S., & Tan, S. (2023). ChatGPT: Bullshit spewer or the end of traditional assessments in higher education? Journal of Applied Learning & Teaching, 6(1), 342–363. https://doi.org/10.37074/jalt.2023.6.1.9
Sallam, M., Salim, N., Barakat, M., & Al-Tammemi, A. (2023). ChatGPT applications in medical, dental, pharmacy, and public health education: A descriptive study highlighting the advantages and limitations. Narra J, 3(1), e103. https://doi.org/10.52225/narra.v3i1.103
Sanmarchi, F., Bucci, A., Nuzzolese, A. G., Carullo, G., Toscano, F., Nante, N., & Golinelli, D. (2023). A step-by-step researcher's guide to the use of an AI-based transformer in epidemiology: An exploratory analysis of ChatGPT using the STROBE checklist for observational studies. Journal of Public Health, 1–36. https://doi.org/10.1007/s10389-023-01936-y
Segal, S., & Khanna, A. K. (2023). Anesthetic management of a patient with juvenile hyaline fibromatosis: A case report written with the assistance of the large language model ChatGPT. Cureus, 15(3), e35946. https://doi.org/10.7759/cureus.35946
Seth, I., Sinkjær Kenney, P., Bulloch, G., Hunter-Smith, D. J., Bo Thomsen, J., & Rozen, W. M. (2023). Artificial or augmented authorship? A conversation with a chatbot on base of thumb arthritis. Plastic and Reconstructive Surgery. Global Open, 11(5), e4999. https://doi.org/10.1097/GOX.0000000000004999
Shoufan, A. (2023). Can students without prior knowledge use ChatGPT to answer test questions? An empirical study. ACM Transactions on Computing Education, 23(4), 1–29. https://doi.org/10.1145/3628162
Singh, H., & Singh, A. (2023). ChatGPT: Systematic review, applications, and agenda for multidisciplinary research. Journal of Chinese Economic and Business Studies, 21(2), 193–212. https://doi.org/10.1080/14765284.2023.2210482
Sok, S., & Heng, K. (2024). Opportunities, challenges, and strategies for using ChatGPT in higher education: A literature review. Journal of Digital Educational Technology, 4(1), ep2401. https://doi.org/10.30935/jdet/14027
Stojanov, A. (2023). Learning with ChatGPT 3.5 as a more knowledgeable other: An autoethnographic study. International Journal of Educational Technology in Higher Education, 20(1), 35. https://doi.org/10.1186/s41239-023-00404-7
Su, J., & Yang, W. (2023). Unlocking the power of ChatGPT: A framework for applying generative AI in education. ECNU Review of Education, 6(3), 355–366. https://doi.org/10.1177/20965311231168423
Suchman, K., Garg, S., & Trindade, A. J. (2023). Chat generative pretrained transformer fails the multiple-choice American College of Gastroenterology Self-Assessment Test. The American Journal of Gastroenterology, 118(12), 2280–2282. https://doi.org/10.14309/ajg.0000000000002320
Tan, T. F., Thirunavukarasu, A. J., Campbell, J. P., Keane, P. A., Pasquale, L. R., Abramoff, M. D., Kalpathy-Cramer, J., Lum, F., Kim, J. E., Baxter, S. L., & Ting, D. S. W. (2023). Generative artificial intelligence through ChatGPT and other large language models in ophthalmology: Clinical applications and challenges. Ophthalmology Science, 3(4), 100394. https://doi.org/10.1016/j.xops.2023.100394
Thirunavukarasu, A. J., Hassan, R., Mahmood, S., Sanghera, R., Barzangi, K., El Mukashfi, M., & Shah, S. (2023). Trialling a large language model (ChatGPT) in general practice with the Applied Knowledge Test: Observational study demonstrating opportunities and limitations in primary care. JMIR Medical Education, 9, e46599. https://doi.org/10.2196/46599
Wagner, M. W., & Ertl-Wagner, B. B. (2023). Accuracy of information and references using ChatGPT-3 for retrieval of clinical radiological information. Canadian Association of Radiologists Journal, 75(1), 69–73. https://doi.org/10.1177/08465371231171125

About the authors

Ngo Cong-Lem is a Research Fellow at BehaviourWorks Australia, Monash University and a lecturer at the Faculty of Foreign Languages, Dalat University. His research interests involve educational technologies, evidence synthesis and translation, second language studies, and continuing professional learning.

Ali Soyoof is a research fellow at University of Macau. His field of interest is Computer Assisted Language Learning (CALL), digital games, second language vocabulary learning, and Informal Digital Language Learning of English (IDLE).

Diki Tsering is a research officer at Monash Sustainable Development Institute. Her main interest lies in applying systematic review principles to deliver high-quality evidence reviews that translate research knowledge into practice and make positive contributions to society.

Appendix A

Databases and Search Strings

Database: Scopus
Date of Search: 26 June 2023
Yield: 169
Search string: TITLE-ABS-KEY (chatgpt AND (limitation* OR challenge* OR drawback* OR problem* OR challenge* OR issue* OR concern* OR risk* OR disadvantage* OR flaw* OR weakness* OR shortcoming* OR pitfall* OR downside* OR bias* OR error* OR ethic*)) AND (LIMIT-TO (DOCTYPE, "ar"))

Database: Web of Science
Date of Search: 26 June 2023
Yield: 150
Search string: TS=(ChatGPT AND (limitation* OR challenge* OR drawback* OR problem* OR challenge* OR issue* OR concern* OR risk* OR disadvantage* OR flaw* OR weakness* OR shortcoming* OR pitfall* OR downside* OR bias* OR error* OR ethic*))

Database: ERIC
Search Date: 09 March 2024
Yield: 108
Search string: ChatGPT AND (limitation* OR challenge* OR drawback* OR problem* OR issue* OR concern* OR risk* OR disadvantage* OR flaw* OR weakness* OR shortcoming* OR pitfall* OR downside* OR bias* OR error* OR ethic*)

Database: IEEE Xplore
Search Date: 09 March 2024
Yield: 58 (Filters applied: 'Journals' and 'Early Access Articles')
Search string: ("ChatGPT" AND (limitation* OR challenge* OR drawback* OR problem* OR issue* OR concern* OR bias* OR risk* OR disadvantage* OR flaw OR flaws OR weakness OR weaknesses OR shortcoming OR shortcomings OR pitfall OR pitfalls OR downside OR downsides OR error OR errors OR ethic OR ethics))

Appendix B

A Summary of the Included Studies
Study, Country, ChatGPT version Aims Method Limitations Opportunities
Amin et al. (2023) To evaluate ChatGPT’s ChatGPT was asked to predict � lower performance � decent performance
NA capacity to perform text personalities, sentiment compared to a specialized compared to other text
Germany classification analysis and suicide language model classification models (i.e.,
tendency based on (RoBERTa-base) Word2Vec and bag-of-
prompts crafted from three words baseline)
relevant datasets on these � robustness against noisy
topics data
� no training by users
needed for ChatGPT
Ariyaratne et al. (2023) To compare the article ChatGPT was asked to write � 4 out of 5 articles written N/A
ChatGPT-3 writing of ChatGPT and about a radiology topic, by ChatGPT being
UK humans which was then assessed factually inaccurate
on a 5 point scale from � providing risky medical
being bad and inaccurate suggestions
to being excellent and � fictitious references
accurate by radiologists
Au Yeung et al. (2023) To test ChatGPT’s capacity to ChatGPT was given clinical � One or more critical N/A
NA predict medical diagnoses vignettes and asked to diagnoses were missing
UK predict diagnoses. in 60% of responses of
ChatGPT
� general prediction of
diseases only
� potential bias in clinical
diagnosis against Black
people
� “takes the truth of
prompts at face-value”,
which influences the
accuracy of its response
Cadamuro et al. (2023) To test ChatGPT’s capability ChatGPT was asked to � superficial interpretations, � able to recognise all
NA to interpret laboratory test interpret 10 simulated most of which lack laboratory tests
Austria, Italy, Croatia results laboratory reports, drafted coherence
as optimized prompts. Its � more suitable for test-by-
output was evaluated by test interpretation
experts in terms of
relevance, accuracy,
helpfulness and safety
Cascella et al. (2023); Italy; ChatGPT version: NA
Aims: To test ChatGPT's use in a healthcare context.
Method: ChatGPT was provided with some input and then asked to: (1) compose a medical note for a patient admitted to an emergency department; (2) write a research conclusion based on some information about the research method and findings; and (3) write an abstract based on CSV (comma-separated values) formatted data.
Limitations:
- lacking capability in interpreting or explaining causal relations among components/conditions
- no performance of statistical analysis
- often not aware of its limitations unless requested
Opportunities:
- aiding in the research process by generating hypotheses, exploring literature, and extracting important information
- communicating research findings in a clear and understandable manner

Clark (2023); USA; ChatGPT version: NA
Aims: To examine the capability of ChatGPT in answering chemistry test questions.
Method: ChatGPT was used to answer closed (multiple-choice) and open-ended questions for a chemistry test.
Limitations:
- inadequate in problem solving or answering questions requiring specific skills
- only able to achieve a 44% score on the chemistry test, i.e., well below the class's average score of 69%
- providing seemingly logical but flawed explanations
- not well equipped for generating sample responses for exam purposes
Opportunities:
- potential use to create assignments for students to analyze and improve its responses

Day (2023); Canada; ChatGPT version: NA
Aims: To investigate the accuracy of references generated by ChatGPT.
Method: ChatGPT was asked to answer questions on various topics commonly of interest to geographers.
Limitations:
- references generated through a predictive process rather than from facts
- subject matter knowledge is required to detect incorrect information, a skill students need to develop
Opportunities:
- a supporting tool for teaching writing

Duong and Solomon (2023); USA; ChatGPT version: NA
Aims: To assess ChatGPT's performance in answering questions related to the biomedical field.
Method: The responses from ChatGPT to 85 multiple-choice questions on human genetics were contrasted with human responses.
Limitations:
- not particularly adept at answering critical thinking and calculation-based questions, but more suitable for memorisation-based questions
- inconsistency in answers and explanations, where it might select the wrong answer but then provide a correct explanation
- not suitable for clinical or high-stakes uses
Opportunities:
- rapid and accurate responses to genetics questions
- potential use to support healthcare professionals in treatment and diagnosis, and patients in having accessible medical information

Fergus et al. (2023); UK; ChatGPT version: NA
Aims: To evaluate ChatGPT's responses to year-end exam assessments.
Method: ChatGPT was asked to answer exam questions from two modules of a pharmaceutical program.
Limitations:
- failing to pass the year-end exams, with total grades on modules 1 and 2 of 34.1% and 18.3%, respectively
- unable to respond to non-text questions
- ChatGPT-generated texts not being detected by Turnitin
Opportunities:
- able to provide well-articulated answers to text-based questions
- a catalyst for discussion on academic integrity and assessment design

Giannos and Delardas (2023); UK; ChatGPT-3.5
Aims: To test ChatGPT's performance on several university admission tests.
Method: ChatGPT was asked to respond to 509 multiple-choice questions on various topics. Its responses were evaluated against various skills such as critical thinking, logical thinking, math, problem solving, and reading comprehension.
Limitations:
- limited scientific and mathematical skills
- poor performance on critical thinking and reasoning skills
- providing more incorrect than correct responses
Opportunities:
- well-written responses
- a catalyst for redesigning educational assessment

Gregorcic and Pendrill (2023); Sweden; ChatGPT version: NA
Aims: To test the effectiveness of ChatGPT in answering basic physics questions.
Method: ChatGPT was asked to answer a physics question: "A teddy bear is thrown into the air. What is its acceleration in the highest point?"
Limitations:
- inaccurate responses with contradictions
- not yet adequate to serve as a cheating tool for physics students or as a physics tutor
Opportunities:
- potential use for generating lesson materials

Hoch et al. (2023); Germany; ChatGPT version: NA
Aims: To test ChatGPT's performance on a board certification exam.
Method: ChatGPT was asked to answer 2576 single-choice and multiple-choice board certification preparation questions.
Limitations:
- limited performance depending on the test/question format and the specific domain of knowledge; more accurate in allergology (72% correct responses) compared to otolaryngology (i.e., 71% of answers being incorrect)
- better performance in answering open-ended questions than multiple-choice questions
Opportunities:
- a supplementary tool for otolaryngology board certification preparation

Ibrahim et al. (2023); United Arab Emirates; ChatGPT version: NA
Aims: To evaluate the potential risk of plagiarism with ChatGPT.
Method: ChatGPT was asked to answer questions from two introductory and two advanced tertiary-level courses.
Limitations:
- failing to reach the passing grade on the advanced course questions
Opportunities:
- excellent grades on questions from the introductory courses

Kortemeyer (2023); Switzerland; ChatGPT version: NA
Aims: To assess whether ChatGPT could successfully complete introductory physics courses.
Method: ChatGPT's ability to handle calculus-based physics content was assessed by administering representative assessments from a real course. The model's responses were then graded using the same criteria applied to student work.
Limitations:
- demonstrating beginner-like errors
- presenting facts and fiction with similar confidence
- probabilistic nature leading to inconsistent results
- core issues remaining in newer versions
Opportunities:
- the necessity to develop epistemologies for when ChatGPT assumes the role of a subject matter expert

Lahat et al. (2023); United Arab Emirates; ChatGPT version: NA
Aims: To test ChatGPT's performance in answering patients' real-life questions.
Method: ChatGPT was asked to answer 110 real-life questions from patients.
Limitations:
- only moderately accurate and reliable
- quality of responses depending on the question input
Opportunities:
- a useful source of reference information

Lai (2023); Canada; ChatGPT-3.5
Aims: To evaluate ChatGPT's proficiency in managing various question types and difficulty levels related to references in library services.
Method: ChatGPT was assigned to answer questions about references received by McGill University's library services. Its responses were subsequently assessed using rubrics that considered completeness, accuracy, and the provision of additional references if the user's inquiry was not fully addressed.
Limitations:
- struggling with factual accuracy in its responses
- difficulty handling advanced questions
- failing to answer questions requiring nuance, additional resources, and referrals
Opportunities:
- leveraging ChatGPT as a tool for crafting neutral-tone letters and professional responses

McIntosh et al. (2024); Australia & New Zealand; ChatGPT-3.5 and 4.0
Aims: To assess how different GPT models respond to a Culturally Sensitive Test aimed at detecting hallucinations across diverse cultural and linguistic contexts.
Method: Different versions of ChatGPT underwent a Culturally Sensitive Test comprising 70 questions spanning real-world contexts. Model responses were scored as 0 for hallucinated and 1 for non-hallucinated answers.
Limitations:
- hallucinations
- ethical concerns
- inconsistent performance
Opportunities: N/A

Nikolic et al. (2023); Australia; ChatGPT version: NA
Aims: To examine ChatGPT's responses to assessment prompts.
Method: ChatGPT was asked to respond to engineering assessment prompts from 10 subjects across 7 Australian universities.
Limitations:
- responses constrained by a word limit, often generic, lacking specific details, with fabricated answers and inaccurate calculations
- requiring pre-training with background information, which can be time consuming
Opportunities:
- passable responses from ChatGPT with minimal changes to authentic assessment input

Parsons and Curry (2024); USA; ChatGPT-3
Aims: To explore ChatGPT's capability in fulfilling graduate-level instructional design assignments.
Method: This research employed a needs, task, and learner analysis to evaluate ChatGPT's capacity to generate instructional materials for a 12th-grade media literacy module. Expert evaluation and grading rubrics were then used to benchmark the quality of the bot's outputs.
Limitations:
- struggling to adapt its responses to a specific context
- only including knowledge prior to September 2021
- providing generic and superficial responses when asked to contextualise its responses
- responses depending on the complexity and format of questions
Opportunities:
- input for teacher and curriculum specialist training in integrating AI capabilities

Prieto et al. (2023); USA, United Arab Emirates; ChatGPT-3.5
Aims: To test ChatGPT's performance in creating a construction project schedule.
Method: ChatGPT was asked to generate a construction schedule for a simple project.
Limitations:
- generic responses and fabricated (incorrect) answers
- quality of responses largely depending on the input/prompt
Opportunities:
- able to generate a coherent schedule that fulfils the task requirements
- potential for automating preliminary and time-consuming tasks

Puthenpura et al. (2023); USA; ChatGPT version: NA
Aims: To explore the benefit of ChatGPT in assisting with writing up a case report.
Method: Carefully drafted prompts about a case (i.e., based on the case presentation, diagnostic test results, and treatment results) were inputted into ChatGPT for a response.
Limitations:
- ChatGPT's responses containing incomplete information, which is difficult to interpret without subject matter knowledge
- reference hallucination
- plagiarism concerns
Opportunities:
- an assisting tool for streamlining the writing process

Rahman and Watanobe (2023); Bangladesh, Japan; ChatGPT version: NA
Aims: To evaluate ChatGPT's performance in assisting students in learning coding skills.
Method: ChatGPT was asked to generate code based on clear or partially clear information, as well as to correct errors in code.
Limitations:
- poor mathematical skills: failing calculations (counting numbers) that elementary children can do
- code generated by ChatGPT may include errors that require a human check
- concerns about students' potential overreliance on ChatGPT
Opportunities:
- nearly precise responses to technical queries across a diverse array of subjects

Rozado (2023); New Zealand; ChatGPT version: NA
Aims: To evaluate potential political bias in ChatGPT's responses.
Method: ChatGPT was asked to answer 15 different political orientation tests.
Limitations:
- ChatGPT consistently demonstrating bias toward left-wing political viewpoints (14/15 tests)
Opportunities: N/A

Sallam et al. (2023); Jordan; ChatGPT version: NA
Aims: To examine the pros and cons of ChatGPT in health and public health education.
Method: ChatGPT was asked to respond to prompts about medical, dental, and pharmacy topics, each with 5 prompts; its responses were then assessed by experts in terms of conciseness, accuracy, and clarity.
Limitations:
- inaccurate and biased content
- privacy concerns
- students' potential deterioration in critical thinking due to overreliance on ChatGPT
Opportunities:
- enhancing personalized learning, clinical reasoning skills, and understanding of intricate medical concepts

Sanmarchi et al. (2023); Italy; ChatGPT-3
Aims: To examine the potential of ChatGPT in designing and conducting an epidemiological study.
Method: ChatGPT was asked to suggest study questions and a study design based on an existing paper; its responses were then evaluated for coherence and relevance by 3 senior researchers.
Limitations:
- ethical and legal consequences due to inaccurate data
- reproducibility issues with ChatGPT due to its inconsistent responses
- inadequate for designing the concept and structure of the study/paper
Opportunities:
- a valuable support for researchers in setting up an epidemiological study, with its responses being most effective for methods, data analysis, and offering recommendations

Segal and Khanna (2023); USA; ChatGPT version: NA
Aims: To investigate both the capabilities and constraints of ChatGPT in aiding the composition of a case report.
Method: ChatGPT was asked to compose a text about a case based on crafted prompts containing relevant medical information.
Limitations:
- providing an erroneous description of the genetics and overestimating the condition of the disease
- hallucinating references
Opportunities:
- can be used to generate a rough draft for research and writing purposes

Seth et al. (2023); Australia, Denmark; ChatGPT-3
Aims: To test the value of ChatGPT's input in the medical field (i.e., thumb arthritis), particularly for research writing.
Method: ChatGPT was asked to answer 5 questions about plastic surgery regarding thumb arthritis.
Limitations:
- superficial information; not creative and cannot be used to generate plastic surgery solutions
- hallucinating references
- able to provide accurate and relevant information (albeit superficial)
Opportunities: N/A

Shoufan (2023); United Arab Emirates; ChatGPT version: NA
Aims: To assess ChatGPT's effectiveness in aiding students with no prior knowledge in answering assessment questions.
Method: Computer engineering students (experiment group: n = 41–56) used ChatGPT to answer previous test questions before learning about the related topics. Their scores were compared with those of previous-term students (control group: n = 24–61) who answered the same questions in a quiz or exam setting.
Limitations:
- struggling with tasks involving code completion, image analysis, and consistency
- performance varying depending on the type and format of questions
- providing potentially misleading and incomplete responses
Opportunities:
- awareness of ChatGPT's limitations shapes educational practices, prompting adjustments to assessment tasks

Stojanov (2023); New Zealand; ChatGPT-3.5
Aims: To report on experiences of using, and limitations related to, ChatGPT.
Method: An ethnographic study in which the author learnt how to use ChatGPT, had conversations with it, and reflected on his experiences.
Limitations:
- responses with superficial and potentially contradictory information
Opportunities:
- providing good general knowledge of technical topics in a prompt and efficient manner
- serving as a learning aid or "a more knowledgeable other" (p. 1)

Thirunavukarasu et al. (2023); UK; ChatGPT version: NA
Aims: To evaluate the strengths and weaknesses of ChatGPT in a general practitioner setting.
Method: ChatGPT was asked to answer questions from the Applied Knowledge Test (AKT) (i.e., a medical test).
Limitations:
- unable to achieve sufficient scores to pass the test (60.17% vs. the 70.42% required to pass)
- performance quality inconsistent with the difficulty levels of the test questions
Opportunities:
- achieving a level of proficiency comparable to that of a human expert
- could be used to automate tasks or serve as an assistant in clinical settings

Wagner and Ertl-Wagner (2023); Canada; ChatGPT-3
Aims: To test ChatGPT's accuracy and reliability in answering radiologist questions.
Method: ChatGPT was asked to answer 88 questions; the responses were evaluated by radiologists, including for the authenticity of the answers.
Limitations:
- providing inaccurate responses (or responses with errors): 33%
- hallucinating references: 63.8% self-created references
Opportunities: N/A

Note. NA = Not applicable (i.e., the study did not provide sufficient information to determine the version of ChatGPT used).