Plum Defense Analyzes and Dramatically Boosts LLM Quality with Limited Training Data: A Case Study

Overview

In a recent case study, Plum Defense lifted an LLM application’s weakest quality metric from 66% to 82%. By evaluating outputs, augmenting training data, and fine-tuning the model, Plum Defense improved the LLM application starting from an initial dataset of only 15 examples. The key to Plum Defense’s effectiveness is identifying the right metrics for the business case and leveraging them to drive a continuous fine-tuning process.

In this case study, we apply Plum Defense’s system to an LLM designed to answer product questions about a company using information from its public-facing website. The website’s text is low in information density and hard to understand. The LLM’s answers should preserve the product details while omitting the extraneous language, but, as we’ll see, it underperforms in specific ways.

Here’s an example of content from the website:

With [redacted], clients gain the ability to scale with a highly automated exception-based process, including comprehensive reconciliation and extensive quality control. Significant experience supporting custom and complex client requirements, leveraging experience and knowledge throughout the [redacted] organization.

Output of the LLM application on the above example text:

[Redacted] features automated processes, comprehensive reconciliation, customer support, and scalability.

Note that the information has been summarized too much: the details have been left out, and the separate points have been merged into a single overlong sentence, which is not ideal.

Evaluation

Plum Defense measures exactly how much the output diverges from the intentions conveyed in the system prompt. Plum’s evaluation system automatically extracts quantitative metrics from the qualitative descriptions in the system prompt. Each metric is phrased as a yes-or-no question, which the evaluator scores across the dataset on a percentage scale:

  1. Is there a simplification of redundant language into concise phrases?
  2. Are key elements and important terms retained without omission?
  3. Is each point or paragraph shortened individually instead of summarized or merged?
  4. Is bullet point formatting used as specified?
  5. Are long sentences split up into shorter ones as instructed?

For an output to be considered acceptable, all of the above metrics need to be evaluated positively. If, for example, the output simplifies redundant language but omits important terminology, then the output is unacceptable. Thus, boosting the weakest metric is the most important improvement we can make to this LLM application.
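
To make this concrete, here is a minimal sketch of how such a yes-or-no evaluator could work, assuming an LLM-as-judge approach. The judge prompt, model choice, and function names are illustrative assumptions, not Plum Defense’s actual implementation:

    # Minimal sketch of a yes/no metric evaluator (illustrative; not Plum Defense's code)
    from openai import OpenAI

    client = OpenAI()

    METRICS = [
        "Is there a simplification of redundant language into concise phrases?",
        "Are key elements and important terms retained without omission?",
        "Is each point or paragraph shortened individually instead of summarized or merged?",
        "Is bullet point formatting used as specified?",
        "Are long sentences split up into shorter ones as instructed?",
    ]

    def judge(source: str, output: str, question: str) -> bool:
        """Ask an LLM judge one yes/no question about a single output."""
        resp = client.chat.completions.create(
            model="gpt-4o",  # assumed judge model
            messages=[{
                "role": "user",
                "content": (f"Source text:\n{source}\n\nRewritten output:\n{output}\n\n"
                            f"{question} Answer strictly YES or NO."),
            }],
        )
        return resp.choices[0].message.content.strip().upper().startswith("YES")

    def score_dataset(examples: list[tuple[str, str]]) -> dict[str, float]:
        """Score each metric as the percentage of (source, output) pairs that pass it."""
        return {
            q: 100 * sum(judge(src, out, q) for src, out in examples) / len(examples)
            for q in METRICS
        }

The weakest metric is then simply the question with the lowest pass rate: min(scores, key=scores.get).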

The 15 webpages and their corresponding outputs from the LLM are fed into the evaluator to produce the following scores:

Graph showing quality score per metric

The evaluations make it clear that the LLM underperforms the most on this metric:

Question 5: Are long sentences split up into shorter ones as instructed?

Most LLM application developers do not evaluate their outputs at all. Those who do tend to rely on high-level metrics from the research literature, such as “helpfulness” and “faithfulness.” Plum Defense lets LLM application developers dive one level deeper by creating quantitative metrics tailored to their specific use case.

Improvement

These evaluation results are fed into the Plum Data Augmenter, which uses the initial 15 webpages to generate enough synthetic data to fine-tune the LLM. Plum Defense’s augmentation algorithm focuses on generating training data that specifically boosts the underperforming metric.
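
As a rough illustration, one way to implement metric-targeted augmentation is to few-shot an LLM on each seed pair and ask for new examples that exercise the weak behavior, here sentence splitting. The prompt, model, and helper below are assumptions for the sketch, not Plum Defense’s actual algorithm:

    # Illustrative metric-targeted augmentation (assumed approach; not Plum's algorithm)
    import json
    from openai import OpenAI

    client = OpenAI()

    AUGMENT_PROMPT = (
        "Here is a webpage excerpt and an ideal rewrite that splits long sentences "
        "into short ones while keeping every product detail.\n\n"
        "EXCERPT:\n{source}\n\nIDEAL REWRITE:\n{target}\n\n"
        "Write a new, different excerpt in the same style, followed by its ideal "
        "rewrite. Return JSON with keys 'source' and 'target'."
    )

    def augment(seed_pairs: list[dict], n_per_seed: int = 20) -> list[dict]:
        """Expand a small seed set into synthetic {source, target} training pairs."""
        synthetic = []
        for pair in seed_pairs:
            for _ in range(n_per_seed):
                resp = client.chat.completions.create(
                    model="gpt-4o",
                    messages=[{"role": "user",
                               "content": AUGMENT_PROMPT.format(**pair)}],
                    response_format={"type": "json_object"},
                )
                synthetic.append(json.loads(resp.choices[0].message.content))
        return synthetic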

In this case study, we’re using GPT-4o, so we make a call to the OpenAI fine-tuning API with the augmented training dataset. Plum Defense’s fine-tuning system is model-agnostic and also works with open-weight models such as LLaMA and Mistral.
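
For reference, that call looks roughly like the following with the OpenAI Python SDK; the file name and model snapshot are placeholders:

    # Upload the augmented dataset and start a fine-tuning job.
    # Each JSONL line is one chat-formatted example:
    # {"messages": [{"role": "user", ...}, {"role": "assistant", ...}]}
    from openai import OpenAI

    client = OpenAI()

    training_file = client.files.create(
        file=open("augmented_train.jsonl", "rb"),  # placeholder file name
        purpose="fine-tune",
    )

    job = client.fine_tuning.jobs.create(
        training_file=training_file.id,
        model="gpt-4o-2024-08-06",  # placeholder: any fine-tunable snapshot
    )
    print(job.id, job.status)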

Once the model is fine-tuned with the augmented dataset, here are the results of rerunning the Plum Evaluator on the held-out test dataset:

Graph showing quality score per metric with improved metrics

Question 5’s score increased from 66% to 82%, starting from only 15 training examples augmented with a synthetic dataset.

Because overall output quality is gated by the score of the weakest metric, this is a significant improvement to the LLM application as a whole. We can now rerun this process to improve the next-weakest metric.
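
Using the illustrative score_dataset and augment helpers sketched above, that continuous loop might look like this; run_app and fine_tune are hypothetical wrappers around the LLM application and the fine-tuning job shown earlier, and the stopping threshold is an assumed choice:

    # Hypothetical outer loop; run_app and fine_tune wrap the pieces sketched above.
    ACCEPT_THRESHOLD = 95.0  # assumed stopping criterion: every metric passes 95% of the time

    def improve(model: str, seed_pairs: list[dict], test_pages: list[str]) -> str:
        while True:
            scores = score_dataset([(p, run_app(model, p)) for p in test_pages])
            weakest = min(scores, key=scores.get)  # next-weakest metric to target
            if scores[weakest] >= ACCEPT_THRESHOLD:
                return model
            model = fine_tune(augment(seed_pairs), base=model)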