
Now, here, you see, it takes all the running you can do, to keep in the same place. If you want to get somewhere else, you must run at least twice as fast as that!
(Lewis Carroll: Through the Looking-Glass)

The world of generative AI often feels like Carroll's looking-glass world, and not just because AI-generated images sometimes still look as if we were seeing them in a distorting mirror. What was best practice yesterday may be obsolete tomorrow.

A lot has happened since the breakthrough of large language models (LLMs) with ChatGPT. What has remained is our desire to supplement these language models with additional knowledge. Gemini and co. should know exactly what is in an IT service provider's service catalogue or adapt their writing style to that of a minister. There is no longer a one-size-fits-all solution, but there are numerous options. We take you on a sprint through the various ways of optimising LLMs.

Small effort, big impact: system prompts for LLMs

System prompts are the texts that precede every message to an LLM. They state, for example, the style in which the model should reply (formal or humorous) or what the general task is.

Optimising this prompt, known as (system) prompt engineering, is not very complex, especially if only a small amount of information needs to be provided. Perhaps the chatbot needs to know the name of the customer company and the sector in which it operates. Another common use case is for Gemini or another language model to respond in a specific format, whether as JSON for further processing or in the form of a polite email. All of this information belongs in the system prompt.
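
A minimal sketch of a system prompt with the Vertex AI Python SDK (google-cloud-aiplatform); the project ID, region and prompt wording are placeholders, not part of the original article:

```python
import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(project="your-project-id", location="europe-west3")

# The system instruction precedes every message and fixes role, tone and format.
model = GenerativeModel(
    "gemini-1.5-pro",
    system_instruction=(
        "You are the support chatbot of an IT service provider. "
        "Answer politely and concisely, and reply as valid JSON "
        'with the keys "answer" and "confidence".'
    ),
)

response = model.generate_content("Which operating systems do you offer for servers?")
print(response.text)
```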

Examples that show the desired output format (few-shot prompting) are very effective. However, because they lengthen the prompt, they quickly drive up the cost per enquiry.
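
Few-shot prompting can be sketched by simply prepending worked examples to the question; the example answers below are invented placeholders, and the model object is reused from the sketch above:

```python
# Two worked examples show the model the expected JSON format
# before the real question is asked. Longer prompt, higher cost.
few_shot_prompt = """\
Question: Which databases do you offer?
Answer: {"answer": "PostgreSQL and MySQL", "confidence": "high"}

Question: Do you provide backups?
Answer: {"answer": "Yes, daily snapshots", "confidence": "high"}

Question: Which operating systems are available for servers?
Answer:"""

response = model.generate_content(few_shot_prompt)
print(response.text)
```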

The nice thing about system prompts is that they can be customised and tried out with little effort. However, as soon as it becomes apparent that the information to be supplied is too extensive or the model no longer fulfils all our instructions, it is time to think about other methods.

Source references made easy: Retrieval Augmented Generation (RAG)

Retrieval augmented generation means that information relevant to the user's question is first identified and then passed to the language model together with the question. For example, a user asks ‘Which operating systems are available for a server provision?’ and the language model formulates a search term for the database from it, in this case ‘operating systems for servers’. The matching chapter from the service catalogue is returned from the database and handed to the language model together with the original question, et voilà: ‘The servers are offered with the following operating systems: [...]’
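
A deliberately simplified RAG sketch: an in-memory list of catalogue snippets stands in for a real vector database such as Vertex AI Vector Search, and the snippet texts are invented placeholders:

```python
import numpy as np
import vertexai
from vertexai.generative_models import GenerativeModel
from vertexai.language_models import TextEmbeddingModel

vertexai.init(project="your-project-id", location="europe-west3")
embedder = TextEmbeddingModel.from_pretrained("text-embedding-004")
llm = GenerativeModel("gemini-1.5-pro")

chunks = [
    "Server provisioning: available operating systems are Ubuntu LTS, ...",
    "Backup service: daily snapshots with 30-day retention, ...",
]
chunk_vectors = [np.array(e.values) for e in embedder.get_embeddings(chunks)]

def answer(question: str) -> str:
    # 1. Embed the question and find the most similar catalogue chunk.
    q = np.array(embedder.get_embeddings([question])[0].values)
    scores = [q @ v / (np.linalg.norm(q) * np.linalg.norm(v)) for v in chunk_vectors]
    context = chunks[int(np.argmax(scores))]
    # 2. Pass the retrieved chunk to the model together with the question.
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer using only the context."
    return llm.generate_content(prompt).text

print(answer("Which operating systems are available for a server provision?"))
```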

Which type of data source is connected is almost more a question of imagination than of technical limitations. The classic choice is a vector database, such as Vertex AI Vector Search, in which short sections of text are organised by semantic similarity and those closest to the question are returned. In Google Cloud, Google Search, SQL tables from BigQuery and third-party databases can also be connected.

The model then receives the information from the sources and formulates the answer. Gemini can even indicate with footnotes which source it is referring to.

Lend your voice to your language model: supervised fine-tuning

Language models are trained in layers. During supervised fine-tuning, another layer is added: the model is given further example questions and answers and is then retrained on them. Through these examples, we shape what the model knows and how it responds. The training data must be representative of the type of task the model will later be given.

Google Cloud allows fine-tuning with as few as twenty examples, although you should aim for 100 to 500 examples in the training data set, depending on the use case.
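
A hedged sketch of how such a tuning job can be started with the Vertex AI Python SDK; the JSONL format shown follows the Gemini tuning documentation at the time of writing, and the bucket path and example texts are placeholders:

```python
import json
import vertexai
from vertexai.tuning import sft

vertexai.init(project="your-project-id", location="europe-west3")

# One training example per line: a user turn and the answer we want the model to give.
example = {
    "contents": [
        {"role": "user", "parts": [{"text": "Which operating systems do you offer?"}]},
        {"role": "model", "parts": [{"text": "We offer Ubuntu LTS and ..."}]},
    ]
}
with open("train.jsonl", "w") as f:
    f.write(json.dumps(example) + "\n")  # aim for 100 to 500 such lines

# After uploading train.jsonl to Cloud Storage, start the tuning job.
tuning_job = sft.train(
    source_model="gemini-1.5-pro-002",
    train_dataset="gs://your-bucket/train.jsonl",
)
```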

Supervised fine-tuning is therefore ideal if examples in system prompts are no longer sufficient, the data basis is consistent and you have some time for training and for preparing training data. However, if the requirements change, the model must be trained anew.

The difference between good and better: reinforcement learning from human feedback

Anyone who has ever dealt with the training of language models may have heard of the human in the loop: real people provide feedback during the training process on which model responses are best.

Google Cloud offers the opportunity to use this strategy when training models further. Instead of simple question-answer pairs, as in supervised fine-tuning, the training dataset contains two answers per question. Both are acceptable, and human raters choose the one that fits better. In this way, the model learns to adapt to the desired style. This is particularly helpful when it is difficult to put into words exactly what the perfect answer looks like and only subtle differences separate an acceptable answer from a fantastic one. A USP of Google Cloud.
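
A sketch of what a single preference example might look like; the field names follow the preference-dataset format documented for Vertex AI's RLHF pipeline at the time of writing, and the texts themselves are invented:

```python
import json

# Two acceptable answers to the same input; a human rater picked the better one.
preference_example = {
    "input_text": "Summarise the incident report for the customer.",
    "candidate_0": "Server was down two hours, fixed now.",
    "candidate_1": "The server was unavailable for two hours; the root cause "
                   "has been resolved and monitoring has been extended.",
    "choice": 1,  # the rater preferred the second, more polished answer
}
with open("preferences.jsonl", "w") as f:
    f.write(json.dumps(preference_example) + "\n")
```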

The model's nimble little sister: distillation of LLMs

In distillation, we take a large, comprehensive model, such as Gemini 1.5 Pro, and give it the task of training a smaller model for certain aspects. Because the large model focusses its training of the smaller model on the specified tasks, such as responding particularly formally, the small model becomes particularly good at these tasks and, thanks to its smaller size, is also faster than its large sister model. Once training is complete, only the faster, specialised model is used.

If you have ever been blown away by how insanely fast Google's Gemini 1.5 Flash is, you may already have guessed it: Gemini 1.5 Flash is a distillation of Google's current flagship model, Gemini 1.5 Pro.

Because a new model is trained, the result is significantly more malleable than a supervised fine-tuned model, but also more expensive. Any model can be distilled; Vertex AI on Google Cloud offers a distillation pipeline for this purpose.
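
The core idea can be sketched in a few lines: the large teacher model answers a pool of task-specific prompts, and its answers become the training data for the smaller student model. The prompts and file path below are placeholders, and the resulting file could then be used with a tuning call like the one shown above:

```python
import json
import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(project="your-project-id", location="europe-west3")
teacher = GenerativeModel("gemini-1.5-pro")  # the large, comprehensive model

prompts = [
    "Reply particularly formally: which operating systems do you offer?",
    "Reply particularly formally: how do I request a new server?",
]

# The teacher's answers become the student's training examples.
with open("distillation_train.jsonl", "w") as f:
    for p in prompts:
        answer = teacher.generate_content(p).text
        record = {"contents": [
            {"role": "user", "parts": [{"text": p}]},
            {"role": "model", "parts": [{"text": answer}]},
        ]}
        f.write(json.dumps(record) + "\n")
```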

Due to the increased speed and high level of personalisation, the distilled model is often the right choice for chatbots at a support desk, for example. As with fine-tuning, however, no information can be changed retrospectively without starting the process again.

The best of all worlds: A proposal for the pharmaceutical industry

Anyone who has read this far may be wondering whether every form of training is ruled out as soon as source references are needed or information has to be updated. This is by no means the case.

For a concrete example, let's look at the pharmaceutical industry. There, optimised language models are not yet used in market access, although the potential is huge.

In Germany, under the Act on the Reform of the Market for Medicinal Products (AMNOG), every medicinal product has to undergo a benefit assessment after approval. For this purpose, a benefit dossier of over a thousand pages is submitted to the Federal Joint Committee, referring to various studies from the marketing authorisation and beyond in order to demonstrate the benefit of the new drug.

In this case, it must be clear which information comes from which study, and each medicinal product has its own studies. In addition, studies are well and uniformly structured, which makes them very suitable for being split into sections for a vector database, from which a RAG system then retrieves them.

At the same time, a very specific style must be adhered to in the AMNOG procedure: information must appear partly in tables, partly in continuous text, but always in the correct section. As this is somewhat too complex for a system prompt, it makes sense to additionally train a model.

Until now, entire departments have spent all their time collating this information, yet it could be so simple: a Gemini fine-tuned with hundreds of benefit dossiers from successful AMNOG procedures knows the formal requirements exactly and pulls the necessary studies from Vertex AI Vector Search to write the benefit dossier. This would save market access staff countless hours of unpopular work, allowing them to focus on proofreading and preparing for negotiations.
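
How the two pieces fit together can be sketched as follows; the tuned-model endpoint name and the retrieve() helper are hypothetical placeholders standing in for a real fine-tuned Gemini and a Vertex AI Vector Search lookup:

```python
from vertexai.generative_models import GenerativeModel

# Endpoint of a Gemini fine-tuned on successful AMNOG dossiers (placeholder name).
tuned_model = GenerativeModel(
    "projects/your-project/locations/europe-west3/endpoints/1234567890"
)

def retrieve(query: str) -> str:
    """Placeholder for a Vertex AI Vector Search lookup over study sections."""
    return "Study XY-123: primary endpoint met, hazard ratio 0.71 ..."

section_request = "Draft the dossier section on mortality benefit."
studies = retrieve(section_request)
draft = tuned_model.generate_content(
    f"Studies:\n{studies}\n\nTask: {section_request}"
)
print(draft.text)
```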

The wonderful world of LLMs: shaping the future with adesso and Google Cloud

Alice: ‘Would you tell me, please, which way I ought to go from here?’
‘That depends a good deal on where you want to get to,’ said the Cat.
(Lewis Carroll: Alice's Adventures in Wonderland)

The world of generative AI is changing, but one thing remains constant: the search for the perfect answer. Whether we use prompt engineering, enrich LLMs with data from vector databases, tailor them to our specific needs through fine-tuning and reinforcement learning, or train a distilled model: there are many ways to optimise LLMs. Google Cloud offers a wide range of tools and resources to simplify and accelerate these processes.

The journey through the data wonderland has only just begun. New developments and advances in the models themselves, as well as in the processing of large amounts of data, promise even more unimagined possibilities.

Are you ready to help shape the digital future? adesso, your Premier Partner for Google Cloud, will accompany you on this journey and support you in the implementation of generative AI solutions that are optimally tailored to your needs. Use our expertise and the possibilities of Google Cloud to transform your data into knowledge and innovation!

Author Ellen Tötsch

Ellen Tötsch works as a Data Scientist and Machine Learning Engineer at adesso SE. She has been passionate about building innovative solutions based on large language models for years, preferably in Google Cloud.
