In a recent case study, Plum Defense raised the output quality of an LLM application from 66% to 82%. By evaluating outputs, augmenting training data, and fine-tuning the model, Plum Defense improved the application starting from an initial dataset of only 15 examples. The key insight behind Plum Defense’s effectiveness is identifying the right metrics for the business case and leveraging them to drive a continuous fine-tuning process.
In this case study, we apply Plum Defense’s system to an LLM designed to answer questions about a company’s products using information from its public-facing website. The website’s text is low in information density and hard to follow. The LLM’s answers should retain the product details while omitting the extraneous language, but as we’ll see, it underperforms in specific ways.
Here’s an example of content from the website:
Output of the LLM application on the above example text:
Note that the information has been over-summarized: important details have been dropped, and yet the sentence is still too long.
Plum Defense measures exactly how much an output diverges from the intentions conveyed in the system prompt. Plum’s evaluation system automatically extracts quantitative metrics from the system prompt’s qualitative descriptions. Each metric is phrased as a yes-or-no question, which the evaluator then scores on a percentage scale:
For an output to be considered acceptable, all of the above metrics need to be evaluated positively. If, for example, the output simplifies redundant language but omits important terminology, it is unacceptable. Boosting the weakest metric is therefore the most impactful improvement we can make to this LLM application, as the sketch below illustrates.
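To make the gating logic concrete, here is a minimal sketch; the metric names and pass rates are illustrative placeholders, not Plum Defense’s internal representation:

```python
# Illustrative metric names and pass rates; the real metrics are the
# yes-or-no questions extracted from the system prompt above.
metric_scores = {
    "preserves product details": 0.91,
    "removes redundant language": 0.88,
    "keeps important terminology": 0.66,  # the weakest metric
}

# An output is acceptable only if every question scores a "yes", so
# overall quality is bounded by the weakest metric.
weakest = min(metric_scores, key=metric_scores.get)
print(f"Improve '{weakest}' first (currently {metric_scores[weakest]:.0%})")
```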
The 15 webpages and their corresponding outputs from the LLM are fed into the evaluator to produce the following scores:
The evaluations make it clear that the LLM underperforms the most on this metric:
Most LLM application developers don’t evaluate their outputs at all. Those who do tend to rely on high-level metrics from the research literature, such as “helpfulness” and “faithfulness.” Plum Defense lets developers go one level deeper by creating quantitative metrics tailored to their specific use case.
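As an illustration of what scoring one such tailored metric could look like, here is a sketch of an LLM judge answering a yes-or-no question via the OpenAI API; the question wording, judging prompt, and choice of judge model are our assumptions, not Plum Defense’s actual evaluator:

```python
from openai import OpenAI

client = OpenAI()

# A use-case-specific yes-or-no question derived from the system prompt
# (hypothetical wording).
QUESTION = "Does the answer keep all product-specific terminology from the source text?"

def judge(source: str, answer: str) -> bool:
    """Ask the LLM judge the yes-or-no question about one output."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer strictly 'yes' or 'no'."},
            {"role": "user",
             "content": f"Source:\n{source}\n\nAnswer:\n{answer}\n\n{QUESTION}"},
        ],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().lower().startswith("yes")

def score(pairs: list[tuple[str, str]]) -> float:
    """The metric's percentage score is the pass rate across the eval set."""
    return sum(judge(source, answer) for source, answer in pairs) / len(pairs)
```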
The evaluation results are fed into the Plum Data Augmenter, which uses the initial 15 webpages to generate enough synthetic data to fine-tune the LLM. Plum Defense’s augmentation algorithm focuses on generating training data that specifically boosts the underperforming metric.
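A heavily simplified version of that augmentation step might look like the sketch below; the paraphrasing prompt, the placeholder system prompt, and the variants-per-seed count are all assumptions, and Plum’s actual algorithm is certainly more sophisticated:

```python
import json
from openai import OpenAI

client = OpenAI()

# The 15 seed (webpage text, verified answer) pairs; placeholders here.
seed_pairs = [("<webpage text>", "<verified answer>")]
SYSTEM_PROMPT = "Answer product questions; keep details, drop filler."  # placeholder

def paraphrase(source: str) -> str:
    """Produce a surface-level variant of a webpage while keeping the
    product details intact (hypothetical prompt)."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content":
            "Rewrite this webpage text with different phrasing but "
            "identical product details:\n\n" + source}],
        temperature=1.0,  # encourage diverse variants
    )
    return resp.choices[0].message.content

# Write the synthetic pairs in OpenAI's chat fine-tuning JSONL format,
# reusing the verified answer as the target for each variant.
with open("augmented_train.jsonl", "w") as f:
    for source, answer in seed_pairs:
        for _ in range(8):  # a handful of variants per seed example
            row = {"messages": [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": paraphrase(source)},
                {"role": "assistant", "content": answer},
            ]}
            f.write(json.dumps(row) + "\n")
```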
In this case study, we’re using GPT-4o, so we call the OpenAI fine-tuning API with the augmented training dataset. Plum Defense’s fine-tuning system is model-agnostic and also works with LLaMA, Mistral, and other open-weight models.
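With the augmented data in OpenAI’s chat JSONL format, the fine-tuning call itself is short. This sketch uses the openai Python SDK and the file produced above; the snapshot name is a GPT-4o version that supports fine-tuning:

```python
from openai import OpenAI

client = OpenAI()

# Upload the augmented training data, then start a fine-tuning job on GPT-4o.
training_file = client.files.create(
    file=open("augmented_train.jsonl", "rb"),
    purpose="fine-tune",
)
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-2024-08-06",
)
print(job.id)  # poll client.fine_tuning.jobs.retrieve(job.id) for status
```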
Once the model is fine-tuned with the augmented dataset, here are the results of rerunning the Plum Evaluator on the held-out test dataset:
Question 5’s score increased from 66% to 82%, starting from only 15 training examples augmented with synthetic data.
Because overall output quality is gated by the weakest metric, this is a significant improvement to the LLM application as a whole. We can now rerun the process to improve the next-weakest metric.
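Sketched as a loop, with stand-ins for the steps above (the 80% threshold and the helper signatures are our assumptions, not part of the case study):

```python
# Stand-ins for the evaluate -> augment -> fine-tune steps sketched earlier.
def evaluate(model: str) -> dict[str, float]: ...       # metric -> pass rate
def augment_for(metric: str) -> str: ...                # returns a JSONL path
def fine_tune(model: str, dataset: str) -> str: ...     # returns new model id

THRESHOLD = 0.80  # assumed acceptance bar

model = "gpt-4o-2024-08-06"
while True:
    scores = evaluate(model)
    weakest, value = min(scores.items(), key=lambda kv: kv[1])
    if value >= THRESHOLD:
        break  # every metric clears the bar
    model = fine_tune(model, augment_for(weakest))
```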