The LLM Prompt Evaluator Tool is designed to help assess and optimise the effectiveness of prompts used with large language models (LLMs) by evaluating a set of candidate prompts for a specific task. The tool will leverage another LLM as a "judge" to rank the responses generated by each candidate prompt based on their quality and relevance to the task. This approach allows for a data-driven, systematic evaluation of which prompts produce the best results in a given context, eliminating the guesswork often involved in prompt engineering. The tool should support a local mode using Ollama as well as API-backed models for non-local mode.
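
To make the two modes concrete, here is a minimal sketch of a pluggable generation backend, assuming the ollama Python client for local mode and an OpenAI-compatible chat-completions API for non-local mode; the Backend protocol, class names, and default model names are illustrative rather than part of the tool.

```python
from typing import Protocol


class Backend(Protocol):
    """Anything that can turn a (system prompt, user input) pair into a response."""

    def generate(self, system_prompt: str, user_input: str) -> str: ...


class OllamaBackend:
    """Local mode: requires the `ollama` package and a locally running Ollama server."""

    def __init__(self, model: str = "llama3"):
        self.model = model

    def generate(self, system_prompt: str, user_input: str) -> str:
        import ollama  # imported lazily so API-only users do not need it installed

        response = ollama.chat(
            model=self.model,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_input},
            ],
        )
        return response["message"]["content"]


class OpenAIBackend:
    """Non-local mode: an OpenAI-compatible chat-completions API."""

    def __init__(self, model: str = "gpt-4o-mini"):
        self.model = model

    def generate(self, system_prompt: str, user_input: str) -> str:
        from openai import OpenAI  # expects OPENAI_API_KEY in the environment

        client = OpenAI()
        completion = client.chat.completions.create(
            model=self.model,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_input},
            ],
        )
        return completion.choices[0].message.content
```

Keeping generation behind a single interface would let the judge model and the candidate models be swapped between local and hosted backends independently.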

The tool will iteratively perform the following task with a new piece of context from a context bank:

[Figure: the per-iteration prompt evaluation task]
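
A rough sketch of this per-iteration task is given below. It assumes a generic generate(system_prompt, user_input) callable for both the candidate backend and the judge, plus an assumed judge output format (a comma-separated ranking of response indices); these names and formats are placeholders, not a settled design.

```python
from typing import Callable

# (system_prompt, user_input) -> model response
GenerateFn = Callable[[str, str], str]


def evaluate_context(
    context: str,
    candidate_prompts: list[str],
    generate: GenerateFn,
    judge: GenerateFn,
) -> list[int]:
    """Return candidate-prompt indices ordered best-to-worst for one piece of context."""
    # 1. Generate a response with each candidate system prompt.
    responses = [generate(prompt, context) for prompt in candidate_prompts]

    # 2. Present the numbered responses to the judge LLM and ask for a ranking.
    numbered = "\n\n".join(
        f"Response {i}:\n{resp}" for i, resp in enumerate(responses)
    )
    judge_system = (
        "You are an impartial judge. Rank the numbered responses below from "
        "best to worst for the given task context. Reply with a comma-separated "
        "list of response numbers only, e.g. 2,0,1."
    )
    judge_input = f"Task context:\n{context}\n\n{numbered}"
    ranking_text = judge(judge_system, judge_input)

    # 3. Parse the ranking, ignoring anything that is not a response number.
    return [int(tok) for tok in ranking_text.split(",") if tok.strip().isdigit()]
```

Repeating this for every piece of context in the bank and aggregating the per-context rankings (for example by mean rank) would then give an overall ordering of the candidate prompts.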

The primary value of this tool lies in its ability to automate and optimise the process of prompt selection for LLM-based tasks. Rather than relying on manual trial and error or subjective assessments, the tool will offer a structured, objective means of evaluating and ranking prompts.

When working on the Police project we relied on this trial-and-error approach, lacking an automated way to perform a grid search over a selection of candidate prompts. This tool will significantly streamline prompt engineering. With an LLM "judge" ranking the candidate prompts, users will be able to directly measure and compare the efficacy of different prompt strategies, improving overall LLM performance and saving valuable time in the development cycle.

The primary output of this project will be an open-source tool for evaluating system prompts empirically. The tool will utilise MLflow for experiment tracking, with DagsHub hosting the dashboard.
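
As a hedged sketch of the tracking side, assuming the usual DagsHub MLflow tracking-URI pattern and an already-computed results mapping from candidate prompt to mean rank (both placeholders), each prompt could be logged as its own MLflow run:

```python
import mlflow

# DagsHub exposes an MLflow tracking server per repository; the URI below is a
# placeholder, and credentials are typically supplied via the
# MLFLOW_TRACKING_USERNAME / MLFLOW_TRACKING_PASSWORD environment variables.
mlflow.set_tracking_uri("https://dagshub.com/<user>/<repo>.mlflow")
mlflow.set_experiment("prompt-evaluation")

# Assumed input: mean rank of each candidate system prompt across the context
# bank (lower is better). The prompts and numbers here are illustrative.
results = {
    "You are a concise assistant. Answer in one sentence.": 1.4,
    "You are a detailed assistant. Explain your reasoning.": 2.1,
}

for system_prompt, mean_rank in results.items():
    with mlflow.start_run():
        mlflow.log_param("system_prompt", system_prompt)
        mlflow.log_metric("mean_rank", mean_rank)
```

Each candidate prompt would then appear as a separate run on the DagsHub-hosted MLflow dashboard, making prompt comparisons directly browsable.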