Prediction of tumor board procedural recommendations using large language models

Appendix A: Model alignment strategies

A.1: In-context learning

Large language models can be prompted to set the scenario for a conversation and thereby achieve a desired model behavior. Here, the model itself is not altered by the alignment process; instead, it is given contextual information about what response is expected and how that response should be formatted. Since the model effectively learns the desired behavior at run-time, this approach is often referred to as in-context learning (ICL). In-context learning exploits the fact that a large language model, commonly based on the generative pre-trained transformer (GPT) architecture, is essentially a next-word (or, more precisely, next-token) predictor and cannot distinguish between tokens it generated itself and tokens that were supplied in the prompt. Models based on the GPT architecture have a limited context length and therefore cannot store the complete history of a conversation; in a conversation that exceeds this length, they must rely on cues from the recent conversation history to infer what is expected next.

While a diverse set of ICL strategies has already been evaluated, one scheme found to be particularly successful is the method of Untuned LLMs with Restyled In-context Alignment (URIAL, Lin et al [17]). In the URIAL method, the model receives several pairs of USER prompts and MODEL responses as desired (stylistic) examples, together with a system prompt, in its context buffer, preceding the actual new prompt. During training, models have learnt to adhere to the system prompt for overall alignment with the desired application. The system prompt commonly includes requirements to be helpful, honest and polite [17]; in our context, we additionally used the system prompt to define the setting of a tumor board and the task of predicting the recommended procedure for the patient (the actual prompt is provided in Appendix C). The examples are used by the model to adapt to the expected style of the response [17]; in this way, the model can draw on past responses to learn how the next response is expected to look. In our work, we used URIAL prompting with 1, 2, 3, 5, and 10 example pairs given into the buffer of the model, as sketched below. This method works universally with all models that allow direct API access (including passing the conversation roles USER/MODEL), such as OpenAI's suite of GPT models that also underlie the ChatGPT product, or other commercially available LLMs.
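To make the message structure concrete, the following minimal Python sketch assembles a URIAL-style context as a message list of the kind accepted by common chat-completion APIs. The function and variable names (build_urial_context, examples, system_prompt) are our own illustrative choices, not identifiers from the URIAL paper or any vendor API:

    def build_urial_context(system_prompt, examples, new_query):
        """Assemble a URIAL-style chat context: system prompt, k in-context
        example pairs (USER query -> MODEL response), then the new query."""
        messages = [{"role": "system", "content": system_prompt}]
        for query, expected_response in examples:
            messages.append({"role": "user", "content": query})
            # The example response is injected as if the model had produced
            # it itself; a GPT-style model cannot tell the difference.
            messages.append({"role": "assistant", "content": expected_response})
        messages.append({"role": "user", "content": new_query})
        return messages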

In the experiment, we built the context (which can be understood as the chat history) of the model by randomly sampling the respective number of queries and expected predictions from the training data, followed by one query from the validation set. We repeated this procedure, with freshly randomized prompts from the training set, for each entry of the validation set. As in all examples, the model's response was expected to begin with the word "Procedere:" (Latin for procedure); if the model did not meet this requirement, its response was rejected and the query repeated.
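A minimal sketch of this sampling and rejection-and-retry logic, reusing build_urial_context from above and assuming a generic complete(messages) callable that wraps whichever chat API is used (the function name and retry limit are illustrative assumptions, not taken from our actual pipeline):

    import random

    def predict_with_retry(complete, system_prompt, train_pairs, query,
                           k, max_tries=5):
        """Sample k in-context examples from the training data and query
        the model, rejecting any response that does not start with the
        required prefix 'Procedere:'."""
        examples = random.sample(train_pairs, k)
        messages = build_urial_context(system_prompt, examples, query)
        for _ in range(max_tries):
            response = complete(messages)
            if response.strip().startswith("Procedere:"):
                return response
        return None  # give up after max_tries rejected responses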

A.2: Supervised fine-tuning

Another way of aligning the model output is supervised fine-tuning (SFT). Here, the actual parameters of the model are modified to produce output that is more tailored to the expectations. The major obstacle in fine-tuning is commonly an insufficient dataset size. Large language models contain billions of parameters, each of which must be set sensibly during training, which is why large data corpora are used for pre-training. For the adaptation to a specific task, however, typically only a tiny fraction of this original data wealth is available. This directly creates a significant risk of over-adaptation to the training material used for the alignment: the model will then mostly be able to reproduce the training data and no longer generalize to unseen data (a phenomenon commonly described as overfitting).

This can be counteracted by methods that adapt only a fraction of the parameters (parameter-efficient fine-tuning, PEFT). One methodology that has proven very successful for PEFT of large models is low-rank adaptation (LoRA, Hu et al [12]). Model parameters mostly reside in a large number of modestly sized matrices, and giving the training process the freedom to change each and every parameter would directly invite overfitting. The trick in LoRA is that, instead of allowing direct changes to a parameter matrix, we introduce an additive update that is factorized into two much smaller matrices whose product has the same shape as the original matrix. These newly introduced matrices contain only a fraction of the parameters, and the resulting product matrix is consequently of low rank. By restricting the number of effectively trainable parameters in the model, the risk of overfitting is drastically reduced. The resource requirements can be lowered further by storing the frozen base-model parameters not at full computational precision but in quantized form (QLoRA, Dettmers et al [4]), allowing for fine-tuning with further reduced demands on memory and data quantity.
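The core idea can be written in a few lines. The following PyTorch sketch wraps a frozen linear layer with a trainable low-rank update scaled by alpha/r; the class and variable names are illustrative and not taken from the LoRA reference implementation:

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        """Linear layer with frozen base weights plus a trainable
        low-rank correction (alpha / r) * B @ A, as proposed in LoRA."""

        def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 16):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad_(False)  # freeze the original parameters
            d_out, d_in = base.weight.shape
            # A starts small and random, B starts at zero, so training
            # begins from the unmodified base model.
            self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)
            self.B = nn.Parameter(torch.zeros(d_out, r))
            self.scale = alpha / r

        def forward(self, x):
            # Base projection plus the low-rank update B @ A.
            return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)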

We performed QLoRA-based PEFT for the LLAMA, Gemma and Mistral models, since their complete model weights are openly available, which is a precondition for SFT. In all experiments, we chose r=16 and alpha=16 as hyperparameters, resulting in a total of 42 M trainable parameters for the Mistral and LLAMA models and 50 M for the Gemma model, which amounts to approximately 0.5% of the original model parameter count. We trained all models using a batch size of 8 for 100 steps.
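For reference, a configuration along the following lines reproduces the stated hyperparameters with the Hugging Face transformers/peft/bitsandbytes stack; the model name, target modules, and output path are illustrative assumptions, not our exact setup:

    import torch
    from transformers import (AutoModelForCausalLM, BitsAndBytesConfig,
                              TrainingArguments)
    from peft import LoraConfig, get_peft_model

    # Load the base model with 4-bit quantized weights (the QLoRA setting).
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    model = AutoModelForCausalLM.from_pretrained(
        "mistralai/Mistral-7B-v0.1",  # illustrative open-weight model
        quantization_config=bnb_config,
    )

    # Attach low-rank adapters with the hyperparameters used here.
    lora_config = LoraConfig(
        r=16,
        lora_alpha=16,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()  # on the order of tens of millions

    training_args = TrainingArguments(
        output_dir="qlora-tumorboard",  # illustrative path
        per_device_train_batch_size=8,
        max_steps=100,
    )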

Appendix B: Supplementary figures

See Figs. 4 and 5.

Fig. 4

Distribution of TNM classification and ECOG status of our patient cohort

Fig. 5

Web-based interface (based on Django/Bootstrap) used in the blinded manual evaluation of model responses

Appendix C: System prompt for in-context learning

German

Du nimmst Teil an einem wissenschaftlichen Experiment zur Vorhersage von Empfehlung aus einem Tumor Board. Du bekommst nun count Beispiele, wie diese formatiert sind. Nach den Beispielen bitte ich Dich, eine ebenso formatierte Empfehlung für ein neues Beispiel zu geben.

English (translated)

You are taking part in a scientific experiment to predict recommendations from a tumor board. You will now receive count examples of how these are formatted. After the examples, I ask you to give an equally formatted recommendation for a new example.
