In an increasingly digitized society, patients frequently turn to the internet to access information about cancer [-]. However, despite being one of the most favored informational modalities, websites often lack content accuracy and adequate readability [].
Recently, artificial intelligence (AI)–powered chatbots such as ChatGPT have signaled a potential paradigm shift in how patients with cancer can access a vast amount of medical information [,,]. The rise of these publicly accessible AI platforms has accelerated markedly since OpenAI released version 3.5 of ChatGPT (GPT-3.5) on November 30, 2022 [-], with the platform surpassing 1 billion monthly visits in March 2023 [].
ChatGPT, a large language model (LLM) [,-], uses natural language processing to offer varied responses to the same query considering the context of the conversation and individual user preferences []. Through text-to-text communication, ChatGPT can engage with humans [] and aims to deliver responses resembling human interactions [,,]. This model has undergone extensive training on a diverse corpus of text data encompassing a broad spectrum of sources, including books, scholarly articles, and web pages, enabling it to effectively comprehend and respond to natural language queries across a broad range of topics [,]. Moreover, the model’s performance is enhanced through reinforcement learning from human feedback, which enables it to produce more coherent and contextually relevant responses []. Additionally, ChatGPT can compose emails, essays, and medical reports, as well as solve problems and provide clarification [,,,].
On March 14, 2023, OpenAI announced the release of GPT-4, which became available through a subscription-based model [,,]. This new version demonstrated outstanding performance across numerous academic and professional benchmarks, providing more refined and varied responses than GPT-3.5 [].
In this context, ChatGPT has emerged as an alternative to traditional search engines, such as Google, because of its capacity to sift through vast quantities of data and provide easily comprehensible responses [,]. Consequently, ChatGPT is a potentially reliable source of medical information for both the public and patients with cancer, capable of offering insights regarding radiotherapy [,]. This is particularly significant given the general public’s limited knowledge of this treatment [,] and concerns regarding its possible side effects [].
Radiotherapy is a well-established treatment that delivers targeted ionizing radiation with the aim of destroying cancer cells while minimizing damage to healthy tissues. Approximately half of all patients diagnosed with cancer undergo radiotherapy as part of their care. Advances in radiotherapy have increased its complexity, requiring greater preparation and support for patients, who may face physical and psychological challenges []. Approximately 80% of patients have limited knowledge of radiotherapy and what to expect from treatment, and many hold significant misconceptions, commonly including concerns about radiation burns or the possibility of becoming radioactive as a result of the treatment [,]. Such misunderstandings, coupled with the unfamiliarity of radiotherapy for most patients and the inherent invisibility of the treatment, further complicate their ability to fully comprehend the process [,]. Therefore, providing clear and accessible information is essential for reducing patients’ fear of treatment []. Previous studies have explored radiotherapy educational resources, such as videos, and tested group education in radiotherapy settings; however, these studies did not specifically address individual patient education and support needs at key time points []. Alternatively, written documentation has proven effective for patients who may feel overwhelmed by excessive verbal information, as it allows them to process the material at their own pace and share it with family and friends []. ChatGPT may therefore offer a convenient and accessible way for patients to obtain written information and support [].
Patient education is particularly crucial for patients with cancer because of the complexity of their treatment pathways []. Providing them with comprehensive information about radiotherapy at appropriate stages may enhance adherence to the treatment plan, because inadequate information can lead to increased uncertainty, unnecessary anxiety, and distress among patients and their families [,,]. Additionally, poorly informed patients are more likely to be dissatisfied with their care, have difficulty coping [], and have many follow-up questions regarding the treatment process. Moreover, patients with cancer often feel uncomfortable discussing their body image and sexual health with their clinicians; communication with ChatGPT may lower these barriers [].
However, given that ChatGPT was not explicitly trained for oncology-related inquiries, the quality of the information it provides remains unverified [,,]. Evaluating the quality of responses is crucial, as misinformation can foster a false sense of knowledge and security, lead to noncompliance, and result in delays in receiving appropriate treatment [,,]. Several limitations of ChatGPT have already been identified. Its responses have been observed to exceed the recommended reading level [], as health-related materials intended for patient consumption are typically recommended to be written at a fifth- to sixth-grade reading level [,]. Furthermore, the training data for GPT-3.5 are outdated, limited to information available up to September 2021, with no access to newer knowledge beyond that date [,,]. To address this constraint, GPT-4 introduced a feature that allows the use of external plug-ins []; however, this newer version is available exclusively through a paid subscription [,,]. Additionally, ChatGPT tends to provide unreliable or inaccurate information, potentially generating incorrect or misleading responses [,]. This issue often arises from the models’ dependence on their training data, which may not always be up-to-date or fully comprehensive [].
To date, limited research has been conducted on the application of language models in the medical domain, and the effectiveness of ChatGPT in patient education remains indeterminate []. Although the literature addressing ChatGPT’s capabilities has proliferated in recent months, there remains a lack of data regarding the quality and reliability of the responses it provides [,]. This gap underscores the necessity for more comprehensive studies to evaluate the performance of language models, including ChatGPT, in the medical context. Ensuring that these models are equipped with the most current and comprehensive data is essential for their effective application in radiotherapy health care.
This study aimed to evaluate the quality and reliability of ChatGPT responses to common patient queries regarding radiotherapy to ascertain its potential as a reliable source of patient information. Additionally, it aimed to compare the performance of GPT-3.5 with GPT-4 in generating responses to the same radiotherapy queries.
To determine the most common patient queries regarding radiotherapy, an assessment was conducted using articles that addressed the most relevant patient concerns. These served as the foundation for the development of 128 questions, 90 of which were derived from the studies by Halkett et al [,,] and Zeguers et al [], whereas the remaining 38 were sourced from the National Cancer Institute []. The questions were then organized into a table to facilitate the identification of duplicates and the selection of the most pertinent ones. Thirty-six questions were identified as duplicates, and 43 were deemed specific to certain pathologies or specialized treatments, leaving a total of 49 questions. Four authors (AG, CM, MC-R, and MC) excluded 9 additional questions upon agreement, resulting in a final set of 40 queries to be input into ChatGPT. This exclusion aimed to ensure that the responses could apply to all patients receiving radiotherapy, thereby reflecting their primary concerns and doubts. The questions were intentionally phrased in the first person to mirror the way patients might typically frame their queries when interacting with ChatGPT [] and were structured to address the informational needs of patients at various stages of radiotherapy []. The final set of questions was categorized into three dimensions: general information (n=14), planning and treatment (n=16), and side effects (n=10) (Textbox 1). These dimensions were selected based on previous studies [,,,], which assessed the most critical information needs of patients receiving radiotherapy, and were further chosen to evaluate the strengths and limitations of responses across various radiotherapy topics.
Textbox 1. Common patient queries regarding radiotherapy by dimension inserted in ChatGPT.

General information
1. Why is radiotherapy recommended?
2. What does radiotherapy involve?
3. When should radiotherapy and chemotherapy be combined?
4. What’s the cost of radiotherapy treatment?
5. Who will be providing my radiotherapy treatment?
6. How does the radiotherapy treatment machine work?
7. What impact will radiotherapy treatment have on my life?
8. What impact will radiotherapy treatment have on my health in the future?
9. During the period of radiotherapy, will I have to follow a particular diet?
10. Will radiotherapy make me radioactive?
11. What does radiotherapy do to healthy cells?
12. How long does radiotherapy take to work?
13. Can I be cured of my disease through radiotherapy treatments?
14. What will happen after the radiotherapy treatment is finished?
Planning and treatment
1. Can I maintain my daily routine and activities during radiotherapy?
2. Can I keep working while undergoing radiotherapy treatments?
3. Are complementary medicines recommended while undergoing radiotherapy treatments?
4. What’s the planning appointment in radiotherapy and what does it involve?
5. Why is computed tomography (CT) planning necessary in radiotherapy?
6. Why are tattoos useful in radiotherapy CT planning?
7. What happens on the first day of radiotherapy treatment?
8. Will the radiotherapy treatment schedule be adjusted to my availability?
9. What am I expected to do during the radiotherapy treatment?
10. Does the radiotherapy machine make noise?
11. How close is the radiotherapy treatment machine going to get?
12. What happens during radiotherapy treatment?
13. Is there a possibility of experiencing pain due to the radiotherapy treatment?
14. How long does a radiotherapy session last?
15. What should I wear for radiotherapy treatment?
16. Will there be follow-up after the end of radiotherapy treatments?
Side effects
1. What are the side effects of radiotherapy?
2. What skin care should I have during and after radiotherapy?
3. Am I going to feel tired after the radiotherapy treatments?
4. What hygiene care should be taken after radiotherapy treatments?
5. Which steps should be taken to reduce radiotherapy side effects?
6. Will the radiotherapy treatment be interrupted if I experience adverse side effects?
7. Who can I go to if the radiotherapy side effects become too burdensome?
8. Will radiotherapy affect my fertility?
9. Will radiotherapy cause hair loss?
10. Will radiotherapy cause permanent damage?
Data Collection

Responses were collected from ChatGPT between April 6 and April 9, 2024. Each question was queried on both versions of ChatGPT in English. Each query was entered separately using the “New Chat” function, acknowledging that ChatGPT considers the context of the conversation, which can influence responses. Therefore, the memory retention option was disabled when the questions were introduced into ChatGPT to ensure independence of the responses. The queries were then regenerated in each version of ChatGPT, and both responses were documented to analyze consistency.
Various methods were then used, as described in later sections, to assess the quality and reliability of the response content, response consistency, response readability, and similarity between responses from GPT-3.5 and GPT-4.
Outcomes

Quality and Reliability

To evaluate the quality and reliability of the information provided by ChatGPT, we used a 5-point Likert scale, known as the General Quality Score (GQS), which has been used in previous studies [,]. The assessment criteria included accuracy, lay-language use, information flow, usefulness, and empathy. The 5-point Likert scale was defined as follows: (1) inaccurate information, poorly organized text, missing important details, and not helpful for patients; (2) limited accuracy, some relevant information is present, but still not easily understandable for patients; (3) adequately accurate information and some important details are explained in plain language; (4) accurate information, well-organized text, and most relevant details are presented in a patient-friendly manner; and (5) extremely accurate information, well-structured text, and all relevant details are presented in a compassionate and patient-friendly manner [].
The median GQS was calculated from the ratings provided by 16 independent radiotherapy experts with substantial experience in managing oncology patients undergoing radiotherapy. The experts were randomly assigned to evaluate either GPT-3.5 or GPT-4, ensuring that each expert evaluated the responses from only one of ChatGPT’s versions to reduce potential bias during the evaluation process, thereby decreasing the likelihood of altered assessments and enhancing their credibility []. All experts received detailed instructions on the evaluation guidelines to promote a uniform understanding of the assessment process. Furthermore, the responses from ChatGPT were provided to the experts in paper format, and their evaluation was conducted in real time without internet access and without knowledge of which version the responses corresponded to, thereby ensuring blinding. Moreover, the authors (CM and MC-R) who analyzed the results were unaware of the identity of the radiotherapy experts.
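The article does not describe how the expert ratings were tabulated; purely as an illustration of the aggregation step, the following is a minimal Python (pandas) sketch, with hypothetical column names and ratings, of how a median GQS per question could be computed for the experts assigned to one ChatGPT version.

```python
import pandas as pd

# Illustrative long-format ratings table: one row per expert rating of one response.
# Column names and values are hypothetical; the study's actual data layout is not described.
ratings = pd.DataFrame({
    "question": ["GI-Q1", "GI-Q1", "GI-Q1", "GI-Q2", "GI-Q2", "GI-Q2"],
    "expert": ["E1", "E2", "E3", "E1", "E2", "E3"],
    "gqs": [4, 5, 4, 3, 4, 3],  # 5-point Likert General Quality Score
})

# Median GQS per question across the experts assigned to one ChatGPT version
median_gqs = ratings.groupby("question")["gqs"].median()
print(median_gqs)
```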
Consistency and Similarity

The consistency and similarity of the responses were evaluated using the cosine similarity score. This method involves transforming the text information provided by ChatGPT into vectors, then calculating the cosine of the angle between the two vectors, indicating how similar the responses are to each other. Scores were calculated using an online tool. The cosine similarity score ranges from 0 to 1, where a score of 0 indicates complete dissimilarity between the texts, and a score of 1 indicates complete similarity [,].
To assess the similarity between the responses generated by GPT-3.5 and GPT-4, the initial responses to the same question provided by both versions were inserted into the web-based tool to determine the cosine similarity score between them.
The consistency of the responses generated by ChatGPT was assessed by regenerating each question within each version and calculating the cosine similarity score between the original and regenerated responses to the same question. By regenerating the same question, we aimed to assess whether ChatGPT provides consistent information or whether its responses vary widely.
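The scores in this study were obtained with a web-based calculator whose exact vectorization is not specified; the sketch below, a minimal Python (scikit-learn) example using a simple bag-of-words vectorization, illustrates the underlying cosine computation and may weight terms differently from the tool that was used.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def cosine_score(text_a: str, text_b: str) -> float:
    """Cosine similarity between two texts using bag-of-words vectors (0 = dissimilar, 1 = identical)."""
    vectors = CountVectorizer().fit_transform([text_a, text_b])
    return float(cosine_similarity(vectors[0], vectors[1])[0, 0])

# Consistency: compare the original and regenerated answers to the same query;
# similarity: compare the GPT-3.5 answer with the corresponding GPT-4 answer.
print(cosine_score(
    "Radiotherapy uses targeted ionizing radiation to destroy cancer cells.",
    "Radiotherapy destroys cancer cells using precisely targeted ionizing radiation.",
))
```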
Readability

To evaluate readability, responses from both versions were assessed using a web-based Flesch Reading Ease Score (FRES) calculator. This calculator determined the responses’ readability using two indices: the FRES and the Flesch-Kincaid Grade Level (FKGL). These readability tests use mathematical formulas that consider factors such as sentence length and word count. The FRES is a numerical score ranging from 0 to 100, with higher numbers indicating better readability, meaning the content is easier to read and understand [,,] and corresponds to a lower grade level [,,]. The FKGL score indicates the average number of years of education needed to comprehend a text, with lower scores suggesting better readability [,,] and correlating to the equivalent school level [,].
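For reference, the standard formulas underlying these two indices are shown below in a minimal Python sketch; the study itself used a web-based calculator, and the vowel-group syllable counter here is a rough heuristic, so scores may deviate slightly from dedicated tools.

```python
import re

def count_syllables(word: str) -> int:
    # Rough heuristic: count groups of consecutive vowels (at least 1 per word).
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_scores(text: str) -> tuple[float, float]:
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / sentences
    syllables_per_word = syllables / len(words)
    # Standard Flesch Reading Ease and Flesch-Kincaid Grade Level formulas
    fres = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
    fkgl = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
    return fres, fkgl

print(flesch_scores(
    "Radiotherapy uses high-energy radiation to destroy cancer cells. "
    "It aims to spare the surrounding healthy tissue."
))
```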
Statistical Analysis

The data were analyzed using SPSS statistical software (version 29.0; IBM Corp). Results were considered statistically significant at the 5% level (P<.05). Exploratory data analysis was carried out using frequency analysis (n, %) for categorical variables and the median and interquartile range (IQR = Q3 − Q1) for continuous variables. The Shapiro-Wilk test was used to test the normality of the data. The Mann-Whitney test (since the normality assumption was not verified) and effect size were used to compare the evaluations between the 2 versions of ChatGPT. To analyze the question evaluations, scores were calculated for each question, considering the 8 experts assigned to each version. Krippendorff α and Fleiss κ were used to assess the agreement between experts. For this analysis, the experts’ assessments were considered for all questions in each dimension in each version of ChatGPT.
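The analysis was performed in SPSS; as an illustration of the comparison step only, the following Python (SciPy) sketch uses hypothetical ratings for one question and reports the rank-biserial correlation as the effect size, which is an assumption because the article does not name the effect size measure used.

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Hypothetical GQS ratings (1-5) for one question, 8 experts per ChatGPT version.
gpt35 = np.array([3, 3, 4, 4, 4, 5, 3, 4])
gpt4 = np.array([4, 5, 5, 5, 4, 5, 5, 4])

u_stat, p_value = mannwhitneyu(gpt35, gpt4, alternative="two-sided")

# Rank-biserial correlation as a nonparametric effect size
# (assumed here; the article does not state which measure was computed).
n1, n2 = len(gpt35), len(gpt4)
effect_size = 1 - 2 * u_stat / (n1 * n2)

print(f"U={u_stat:.1f}, P={p_value:.3f}, effect size={effect_size:.2f}")
```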
Ethical Considerations

This study did not qualify as human subjects research due to the lack of patient involvement and identifying data for the health professionals involved; therefore, it was deemed exempt from institutional review board approval. Additionally, the use of ChatGPT, a public platform accessible to all, meant no permission was required to use the information generated in this study.
GPT-3.5 received primarily midrange scores, with most evaluations at levels 3 (n=72) and 4 (n=90), indicating generally accurate and comprehensible responses. Notably, many responses received the highest rating of 5 (n=110), reflecting extremely accurate and well-structured information. However, some responses also received low scores of 1 (n=13) and 2 (n=35), indicating inaccurate or limited information, respectively.
Conversely, GPT-4 most frequently received the highest score of 5 (n=173), indicating a superior ability to provide accurate and well-structured information. A substantial number of responses were assigned a score of 4 (n=97), while a smaller proportion received a score of 3 (n=38), demonstrating that it consistently provided responses that were accurate, well organized, and accessible to patients. Remarkably, GPT-4 received fewer low scores of 1 (n=4) and 2 (n=8) than GPT-3.5. The score breakdown by question dimension is shown in .
Considering the general information dimension, statistically significant differences were detected between the 2 versions of ChatGPT for questions 3 (P=.03, effect size=0.6) and 10 (P=.04, effect size=0.5). Regarding planning and treatment, statistically significant differences were detected for questions 5 (P=.046, effect size=0.5), 7 (P=.002, effect size=0.8), 9 (P=.003, effect size=0.7), and 11 (P=.02, effect size=0.6). Finally, regarding side effects, a statistically significant difference was detected for question 9 (P=.04, effect size=0.5). In all these cases, GPT-4 received higher ratings (Table 1). The high effect size values indicate little overlap in the response distributions between the 2 versions of ChatGPT. For the remaining questions, the effect size was low, indicating overlapping response distributions, which explains why no statistically significant differences were detected. Nevertheless, even where the differences were not significant, GPT-4 tended to receive higher evaluation scores.
Table 1. Comparison of responses to questions about general information, planning and treatment, and side effects between the 2 versions of ChatGPT, with Mann-Whitney test and effect size results.

| Dimension and questions | Number | Mean rank, GPT-3.5 | Mean rank, GPT-4 | P value | Effect size |
| --- | --- | --- | --- | --- | --- |
| General information | | | | | |
| Q1: Why is radiotherapy recommended? | 8 | 7.56 | 9.44 | .40 | 0.2 |
| Q2: What does radiotherapy involve? | 8 | 7.13 | 9.88 | .23 | 0.3 |
| Q3: When should radiotherapy and chemotherapy be combined? | 8 | 6.00 | 11.00 | .03 | 0.6 |
| Q4: What’s the cost of radiotherapy treatment? | 8 | 9.19 | 7.81 | .53 | 0.2 |
| Q5: Who will be providing my radiotherapy treatment? | 8 | 8.13 | 8.88 | .72 | 0.1 |
| Q6: How does the radiotherapy treatment machine work? | 8 | 8.25 | 8.75 | .82 | 0.1 |
| Q7: What impact will radiotherapy treatment have on my life? | 8 | 7.13 | 9.88 | .20 | 0.3 |
| Q8: What impact will radiotherapy treatment have on my health in the future? | 8 | 7.13 | 9.88 | .20 | 0.3 |
| Q9: During the period of radiotherapy, will I have to follow a particular diet? | 8 | 7.75 | 9.25 | .44 | 0.2 |
| Q10: Will radiotherapy make me radioactive? | 8 | 6.25 | 10.75 | .04 | 0.5 |
| Q11: What does radiotherapy do to healthy cells? | 8 | 8.75 | 8.25 | .82 | 0.1 |
| Q12: How long does radiotherapy take to work? | 8 | 7.13 | 9.88 | .21 | 0.3 |
| Q13: Can I be cured of my disease through radiotherapy treatments? | 8 | 9.69 | 7.31 | .30 | 0.3 |
| Q14: What will happen after the radiotherapy treatment is finished? | 8 | 7.25 | 9.75 | .22 | 0.3 |
| Planning and treatment | | | | | |
| Q1: Can I maintain my daily routine and activities during radiotherapy? | 8 | 7.44 | 9.56 | .24 | 0.3 |
| Q2: Can I keep working while undergoing radiotherapy treatments? | 8 | 7.25 | 9.75 | .22 | 0.3 |
| Q3: Are complementary medicines recommended while undergoing radiotherapy treatments? | 8 | 9.50 | 7.50 | .26 | 0.3 |
| Q4: What’s the planning appointment in radiotherapy and what does it involve? | 8 | 7.00 | 10.00 | .18 | 0.3 |
| Q5: Why is computed tomography (CT) planning necessary in radiotherapy? | 8 | 6.25 | 10.75 | .046 | 0.5 |
| Q6: Why are tattoos useful in radiotherapy CT planning? | 8 | 7.13 | 9.88 | .23 | 0.3 |
| Q7: What happens on the first day of radiotherapy treatment? | 8 | 5.13 | 11.88 | .002 | 0.8 |
| Q8: Will the radiotherapy treatment schedule be adjusted to my availability? | 8 | 7.88 | 9.13 | .44 | 0.2 |
| Q9: What am I expected to do during the radiotherapy treatment? | 8 | 5.19 | 11.81 | .003 | 0.7 |
| Q10: Does the radiotherapy machine make noise? | 8 | 8.00 | 9.00 | .54 | 0.2 |
| Q11: How close is the radiotherapy treatment machine going to get? | 8 | 5.75 | 11.25 | .02 | 0.6 |
| Q12: What happens during radiotherapy treatment? | 8 | 6.88 | 10.13 | .15 | 0.4 |
| Q13: Is there a possibility of experiencing pain due to the radiotherapy treatment? | 8 | 7.56 | 9.44 | .41 | 0.2 |
| Q14: How long does a radiotherapy session last? | 8 | 7.44 | 9.56 | .34 | 0.2 |
| Q15: What should I wear for radiotherapy treatment? | 8 | 7.00 | 10.00 | .06 | 0.5 |
| Q16: Will there be follow-up after the end of radiotherapy treatments? | 8 | 7.38 | 9.63 | .27 | 0.3 |
| Side effects | | | | | |
| Q1: What are the side effects of radiotherapy? | 8 | 6.75 | 10.25 | .12 | 0.4 |
| Q2: What skin care should I have during and after radiotherapy? | 8 | 9.31 | 7.69 | .48 | 0.2 |
| Q3: Am I going to feel tired after the radiotherapy treatments? | 8 | 7.50 | 9.50 | .26 | 0.3 |
| Q4: What hygiene care should be taken after radiotherapy treatments? | 8 | 6.75 | 10.25 | .10 | 0.4 |
| Q5: Which steps should be taken to reduce radiotherapy side effects? | 8 | 7.94 | 9.06 | .63 | 0.1 |
| Q6: Will the radiotherapy treatment be interrupted if I experience adverse side effects? | 8 | 8.38 | 8.63 | .91 | 0.0 |
| Q7: Who can I go to if the radiotherapy side effects become too burdensome? | 8 | 6.81 | 10.19 | .14 | 0.4 |
| Q8: Will radiotherapy affect my fertility? | 8 | 7.63 | 9.38 | .41 | 0.2 |
| Q9: Will radiotherapy cause hair loss? | 8 | 6.38 | 10.63 | .04 | 0.5 |
| Q10: Will radiotherapy cause permanent damage? | 8 | 6.75 | 10.25 | .12 | 0.4 |

Based on the analysis of Krippendorff α and Fleiss κ coefficients across the 3 dimensions (general information; planning and treatment; and side effects), the results indicated a low level of agreement in the classification of questions for both GPT-3.5 and GPT-4. This trend of weak agreement was consistent across the overall set of queries in .
Consistency and Similarity

Regarding similarity and consistency, a cosine similarity score ranging from 0 to 1 was calculated, as previously described. Concerning similarity, the median (IQR) cosine similarity between GPT-3.5 and GPT-4 responses was 0.81 (IQR 0.05), indicating a reasonably good similarity between the 2 versions of ChatGPT. Notably, question 11 in the planning and treatment dimension exhibited the lowest similarity, with a value of 0.68. With respect to consistency, the median (IQR) cosine similarity for GPT-3.5 and GPT-4 responses was 0.85 (IQR 0.04) and 0.83 (IQR 0.04), respectively. In both versions, consistency was demonstrated to be good or very good, with values ranging between 0.74 and 0.92.
Readability

The word count, sentence count, FRES, and FKGL score for both versions are summarized in Table 2. A notable disparity was observed in the median (IQR) word count between GPT-3.5 and GPT-4 (299.00, IQR 176.5 versus 344.50, IQR 74.75). Additionally, the sentence count was higher for GPT-4 than for GPT-3.5 (20.00, IQR 10.5 versus 18.00, IQR 17).
Table 2. Word count, sentence count, Flesch Reading Ease Score (FRES), and Flesch-Kincaid Grade Level (FKGL) score of responses from GPT-3.5 and GPT-4.

| Dimension and questions | GPT-3.5: word count | GPT-3.5: sentence count | GPT-3.5: FRES | GPT-3.5: FKGL | GPT-4: word count | GPT-4: sentence count | GPT-4: FRES | GPT-4: FKGL |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| General information | | | | | | | | |
| Q1 | 332 | 22 | 35.31 | 12.08 | 378 | 25 | 32.36 | 12.50 |
| Q2 | 414 | 27 | 35.97 | 12.05 | 453 | 28 | 35.97 | 12.26 |
| Q3 | 304 | 18 | 24.11 | 14.09 | 340 | 17 | 25.80 | 14.63 |
| Q4 | 188 | 7 | 13.53 | 18 | 268 | 15 | 25.81 | 14.10 |
| Q5 | 246 | 18 | 28.58 | 12.67 | 305 | 21 | 21.78 | 13.83 |
| Q6 | 378 | 27 | 35.96 | 11.72 | 431 | 27 | 36.55 | 12.13 |
| Q7 | 422 | 27 | 35 | 12.26 | 358 | 27 | 41.90 | 10.71 |
| Q8 | 389 | 22 | 32.74 | 13.09 | 351 | 16 | 30.79 | 14.41 |
| Q9 | 332 | 25 | 41.23 | 10.81 | 311 | 25 | 57.92 | 8.27 |
| Q10 | 84 | 5 | 17.56 | 14.98 | 223 | 16 | 21.59 | 13.71 |
| Q11 | 304 | 26 | 36.90 | 11.02 | 352 | 21 | 37.45 | 12.2 |
| Q12 | 178 | 7 | 33.21 | 14.94 | 298 | 15 | 47.28 | 11.60 |
| Q13 | 177 | 8 | 27.61 | 14.91 | 231 | 9 | 23.30 | 16.39 |
| Q14 | 348 | 22 | 35.92 | 12.18 | 410 | 13 | 19.24 | 18 |
| Planning and treatment | | | | | | | | |
| Q1 | 374 | 28 | 49.19 | 9.72 | 402 | 30 | 52.44 | 9.27 |
| Q2 | 229 | 8 | 21.88 | 17.32 | 369 | 22 | 52.94 | 10.04 |
| Q3 | 165 | 8 | 15.25 | 16.99 | 359 | 20 | 16.19 | 10.93 |
| Q4 | 361 | 21 | 26.51 | 13.83 | 433 | 25 | 39.01 | 12.12 |
| Q5 | 316 | 16 | 16.79 | 15.82 | 378 | 24 | 26.80 | 13.43 |
| Q6 | 214 | 12 | 15.19 | 15.57 | 332 | 21 | 30.51 | 12.93 |
| Q7 | 358 | 22 | 34.35 | 12.51 | 472 | 27 | 48.93 | 10.78 |
| Q8 | 140 | 5 | 20.70 | 17.33 | 232 | 15 | 33.24 | 12.47 |
| Q9 | 361 | 27 | 40.94 | 10.87 | 416 | 31 | 43.54 | 10.52 |
| Q10 | 76 | 4 | 30.59 | 13.71 | 106 | 5 | 31.28 | 14.16 |
| Q11 | 164 | 6 | 0 | 18 | 314 | 16 | 33.07 | 13.52 |
| Q12 | 335 | 24 | 37.10 | 11.55 | 388 | 28 | 39.71 | 11.16 |
| Q13 | 247 | 16 | 37.38 | 11.88 | 315 | 19 | 37.73 | 12.12 |
| Q14 | 144 | 5 | 0 | 18 | 264 | 13 | 28.24 | 14.37 |
| Q15 | 327 | 24 | 52.01 | 9.39 | 337 | 22 | 62.50 | 8.35 |
| Q16 | 183 | 7 | 19.42 | 17.05 | 339 | 20 | 37.90 | 12.18 |
| Side effects | | | | | | | | |
| Q1 | 340 | 26 | 48 | 9.81 | 324 | 11 | 26.28 | 16.91 |
| Q2 | 300 | 26 | 57.23 | 8.14 | 354 | 33 | 52.80 | 8.56 |
| Q3 | 150 | 7 | 36.19 | 13.54 | 270 | 9 | 32.57 | 16.17 |
| Q4 | 411 | 23 | 43.17 | 11.68 | 361 | 28 | 48.22 | 9.74 |
| Q5 | 397 | 21 | 17.81 | 15.47 | 418 | 17 | 13.49 | 17.49 |
| Q6 | 137 | 5 | 32.05 | 15.60 | 349 | 20 | 37.38 | 12.38 |
| Q7 | 298 | 21 | 45.09 | 10.50 | 371 | 27 | 38.28 | 11.33 |
| Q8 | 164 | 8 | 23.53 | 15.07 | 277 | 10 | 17.15 | 17.75 |
| Q9 | 108 | 5 | 43.91 | 12.50 | 214 | 15 | 57.94 | 8.72 |
| Q10 | 212 | 10 | 29.29 | 14.44 | 330 | 12 | 26.13 | 16.45 |

Please refer to Table 1 for the full questions.
FRES: Flesch Reading Ease Score.
FKGL: Flesch-Kincaid Grade Level.
The median (IQR) FRES for GPT-3.5 and GPT-4 responses was 32.98 (15.59) and 34.61 (16.07), respectively, indicating that the responses generated by both versions were college-level and difficult to read. The median (IQR) FKGL for GPT-3.5 and GPT-4 responses was 13.32 (3.79) and 12.32 (3.32), respectively, suggesting that at least 13 years of education (college level) are required to understand the responses generated by GPT-3.5, whereas the responses from GPT-4 require at least 12 years of education (college level) for comprehension.
The power and utility of AI platforms such as ChatGPT in health care are rapidly evolving, and these tools have the potential to significantly improve patient education [,]. This study sought to assess the quality and reliability of ChatGPT responses to common patient queries regarding radiotherapy with the aim of determining its potential as a reliable source of information for patients. We also aimed to compare the performance of GPT-3.5 and GPT-4 in generating responses to the same radiotherapy-related queries.
Although most responses were correct or close to correct, comparing the accuracy of responses between GPT-4 and GPT-3.5 across the 3 dimensions made it evident that GPT-4 consistently offered better elucidation of specific concepts relevant to radiotherapy treatment. In question 10 of the general information dimension, GPT-4 specifically delineated that patients are not radioactive and may safely interact with others posttreatment (“You can safely be around others, including children and pregnant women, without any risk of exposing them to radiation”). This aspect was not as clearly articulated by GPT-3.5, which failed to mention that patients may come into contact with others after treatment. Additionally, within the side effects dimension, in questions 2 and 3, GPT-4 emphasized that creams used throughout radiotherapy treatment should only be those recommended by the health care provider (“Apply a fragrance-free moisturizer recommended by your healthcare provider”) and specified strategies to mitigate fatigue, a treatment-related side effect; this advice was not as detailed in the responses from GPT-3.5. Within the planning and treatment dimension, GPT-3.5 demonstrated a propensity to diverge from directly addressing the queried issue in certain responses, in contrast to GPT-4. In question 7, its response did not describe the first day of treatment but rather outlined the entire course of the patient’s radiotherapy. Its response to question 11 failed to specify the distance between the equipment and the patient, a detail that was thoroughly addressed by GPT-4. In response to question 12, GPT-3.5 did not describe what occurs during treatment, instead reiterating the patient’s overall course. This indicates that GPT-3.5 exhibits reduced accuracy when responding to queries related to planning and treatment, as Valentini et al demonstrated [].
However, in GPT-4’s response to the 13th planning and treatment question, specific information was inaccurately presented as it erroneously stated that radiotherapy induces direct pain (“Direct Pain from Treatment Site: Radiotherapy can cause localized pain at the site of treatment”). This error may have occurred because not all web-based sources are reliable, and because the model is trained on a diverse array of internet texts, it may incorporate biased or outdated information. Consequently, misinformation regarding cancer continues to pose a significant concern in online communication, which could result in responses or recommendations that do not consider the most current, evidence-based medical practices [].
Moreover, there were a few occasions in both versions in which a lack of information was demonstrated. For instance, in question 7 of the side effects dimension, neither version mentioned that radiation therapists, who are team members that assist the patient daily throughout their treatment [], could serve as advisers for patients experiencing severe side effects.
In summary, both GPT-3.5 and GPT-4 demonstrated the ability to address concepts related to radiotherapy. However, GPT-4 provided more targeted and detailed responses, thereby exhibiting superior performance compared to GPT-3.5, as corroborated by several studies [,,,,]. The reduced number of scores of 1 and 2 assigned by radiotherapy experts to GPT-4 responses indicated a substantial improvement in response quality and reliability.
Comparison With Prior Work

In most responses, ChatGPT used a typical structure characterized by a succinct introductory paragraph, followed by 5 or 6 bullet points delineating the responses, culminating in a short concluding paragraph. Additionally, in a fair number of responses generated by GPT-3.5 (n=25) and GPT-4 (n=28), a statement was included advising that the information provided should always be discussed with health care providers, consistent with prior studies [,,]. This recommendation is significant because the use of ChatGPT in health care must be carefully monitored and should not be viewed as a substitute for human judgment. Its performance, safety, and associated risks require thorough evaluation by experts before integration into mainstream practice []. Moreover, it is essential that the model be trained on a substantial dataset validated by experts. This rigorous validation process could enhance the reliability and trustworthiness of ChatGPT responses, ultimately benefiting patient care [].
The cosine similarity score indicated a reasonably substantial similarity and consistency, and while subtle changes in sentence structure were noted, most answers remained consistent, implying accuracy [].
A key feature influencing consistency is the temperature parameter, a value ranging from 0 to 2 that adjusts the randomness of each subsequent word in the chat output. A value of 0 results in minimal variability, whereas higher values introduce greater randomness and creativity into the responses. Creativity is a powerful tool in communication, as it simplifies complex concepts, fosters critical thinking, and enhances the accessibility of intricate information, making it especially valuable for developing patient education materials. However, using ChatGPT with high creativity settings in clinical contexts may present challenges. Lowering the creativity level keeps the summarized information more faithful to the training data, thereby prioritizing accuracy and reliability over creative expression. Although this feature is not currently available for modification in ChatGPT’s web interface, it may be included in future iterations of the tool [].
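Although temperature cannot be changed in the ChatGPT web interface, it is exposed through the OpenAI API; the sketch below (Python, openai package; the model name and prompt are illustrative) shows how a lower temperature could be requested when generating patient-facing text.

```python
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

# Lower temperature reduces randomness in the sampled output, favoring more
# reproducible, literal answers over creative phrasing.
response = client.chat.completions.create(
    model="gpt-4",  # model name is illustrative
    temperature=0.2,  # accepted range 0-2; lower values give more deterministic output
    messages=[{"role": "user", "content": "Will radiotherapy make me radioactive?"}],
)
print(response.choices[0].message.content)
```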
Therefore, ensuring high reliability in ChatGPT’s outputs is essential for users to trust its data-driven conclusions. Although advances in ChatGPT’s performance can be attributed to key developments in its underlying technology, it is crucial that patients approach the information provided by AI tools such as ChatGPT with caution []. This is especially important given that ChatGPT does not disclose the bibliography used to generate responses [,,]. This issue was observed during our study, as the bibliography was not disclosed in either version, indicating ChatGPT’s inability to inform users of the contentious nature of certain information [,,,]. This lack of transparency is particularly significant given the ethical concerns that arise regarding its application in patient care. Its implementation may lead to unintended or undesirable issues such as bias, a lack of transparency, challenges related to interpretability, and the generation of inaccurate content, all of which can have serious negative consequences for patient care.