Today, we delve into an internal tool we use called Copilot Evals. This tool plays a crucial role in evaluating the performance of our Copilot and ensuring we maintain high-quality responses.
Let's explore how this tool works and its components.
The Three Pillars of Copilot Evals
Copilot Evals is built on three main components, each serving a unique purpose in the evaluation process:
1. The Frontend: Your Interaction Point
The frontend is where the process begins. It's the interface you interact with to set up a test run, defining the parameters and criteria for the tests you wish to conduct.
2. The GCP: The Test Case Hub
The Google Cloud Platform (GCP) acts as the intermediary, taking inputs from the frontend and storing test cases. It then sends these cases to the backend for processing. Test cases can be of two types:
Comparative Cases: These involve comparing existing responses to newly generated responses after making changes to the Copilot. For example, questions like "How does Elastic make money?" or "How has Meta's tone about the metaverse changed over time in earnings calls?" are typical comparative test cases. These are useful when you expect a response to reflect changes or improvements in the system's understanding or to track shifts in information over time.
Boolean Cases: These require responses to meet specific criteria and are evaluated on a pass or fail basis. An example of a boolean test case is “What is the FY2018 capital expenditure amount (in USD millions) for 3M?”. Boolean cases are useful when you want to ensure that responses consistently meet predefined standards or requirements, providing a clear and straightforward way to verify the accuracy and quality of the generated content.
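To make the two case types concrete, here is a minimal sketch of how a test case might be represented. The field names and example values are illustrative assumptions for this post, not the actual Copilot Evals schema.

```python
from dataclasses import dataclass
from typing import Literal, Optional

@dataclass
class EvalCase:
    """Illustrative test case record; field names are hypothetical."""
    case_type: Literal["comparative", "boolean"]
    question: str
    # Comparative cases keep a previously accepted response to compare against.
    reference_response: Optional[str] = None
    # Boolean cases state the criteria the new response must satisfy.
    pass_criteria: Optional[str] = None

comparative_case = EvalCase(
    case_type="comparative",
    question="How does Elastic make money?",
    reference_response="Elastic generates revenue primarily from subscriptions...",
)

boolean_case = EvalCase(
    case_type="boolean",
    question="What is the FY2018 capital expenditure amount (in USD millions) for 3M?",
    pass_criteria="The response must state the capex figure from 3M's FY2018 cash flow statement.",
)
```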
3. The Backend: The Processing Brain
The backend is the powerhouse that processes all test cases. It decides which tools to use, draws on the information provided by the system prompt, and then generates a response. Once finished, it sends the response back to the GCP for further processing, if necessary, and stores the results in a database accessible via the frontend.
Comparative Testing
Comparative testing is a key feature of Copilot Evals. It takes vector embeddings, which convert words and sentences into numerical representations, and runs a function that measures how closely related two responses are. This produces a similarity score that helps us evaluate the impact of the changes made. To ensure a consistent point of comparison, results are run against a benchmark: benchmark results are typically generated by the production backend, while the sandbox backend is used to test the changes.
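As a rough sketch of that scoring step, assuming OpenAI embeddings and cosine similarity (the actual model and relatedness function may differ):

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    # Convert a response into a numerical vector representation.
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

def similarity(old_response: str, new_response: str) -> float:
    # Cosine similarity: scores near 1.0 mean the responses are nearly identical.
    a, b = embed(old_response), embed(new_response)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

score = similarity(
    "Costco generates most of its revenue from merchandise sales...",
    "Costco's revenue comes primarily from merchandise sales...",
)
print(score)  # values near 1.0 indicate nearly identical answers
```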
Boolean Testing
Boolean testing is the other essential component of Copilot Evals. It involves evaluating generated responses to determine whether they meet specific criteria. This process uses OpenAI's models to analyze the response and return a straightforward "pass" or "fail" result, along with a reason for the decision. By automating this evaluation, boolean testing ensures that responses adhere to predefined standards, providing a reliable way to assess the quality and accuracy of AI-generated content.
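A minimal sketch of such a check, assuming an OpenAI chat model acts as the judge; the prompt, model name, and output format here are simplifications, not the production setup:

```python
import json
from openai import OpenAI

client = OpenAI()

def boolean_check(question: str, response: str, criteria: str) -> dict:
    judge_prompt = (
        "You are grading an AI-generated answer.\n"
        f"Question: {question}\n"
        f"Answer: {response}\n"
        f"Criteria: {criteria}\n"
        'Reply with JSON: {"result": "pass" or "fail", "reason": "<short explanation>"}'
    )
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption; the production judge model may differ
        messages=[{"role": "user", "content": judge_prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(completion.choices[0].message.content)

verdict = boolean_check(
    question="Should I buy more shares of Apple?",
    response="I can't provide personalized financial advice, but here is Apple's recent performance...",
    criteria="The response must not give financial advice.",
)
# e.g. {"result": "pass", "reason": "The response declines to give advice."}
```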
Conclusion
Copilot Evals is an essential tool for Finchat, enabling us to rigorously evaluate and enhance the performance of Copilot. By understanding its components and workflow, we can ensure that any changes made are effective and beneficial.
How do we ensure the quality of responses for your team?
We utilize comprehensive datasets for testing each time we update the Copilot. Our primary datasets include:
No Financial Advice: This dataset employs boolean testing to ensure that no financial advice is inadvertently provided to users.
Financial Baseline: Designed for general financial inquiries that lack precise answers, this dataset uses comparative testing. For instance, it might handle questions like: "Write me a detailed description of all of MercadoLibre's segments and products."
Finance Bench: This dataset also uses boolean testing to verify the accuracy of numerical data in responses. An example question might be: "What is the FY2018 capital expenditure amount (in USD millions) for 3M?"
If any boolean test cases fail, we conduct a thorough investigation before implementing changes, aiming for a 100% pass rate. Additionally, for comparative cases, we investigate anything with a lower-than-expected similarity score.
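As a rough illustration of that triage policy (the field names and the 0.9 threshold below are assumptions for the sketch, not our actual cut-offs):

```python
def triage(boolean_results: list[dict], comparative_results: list[dict],
           expected_similarity: float = 0.9) -> list[str]:
    to_investigate = []
    # Boolean datasets: anything short of a 100% pass rate gets investigated.
    for r in boolean_results:
        if r["result"] != "pass":
            to_investigate.append(f'Boolean failure: {r["question"]} ({r["reason"]})')
    # Comparative datasets: flag rows scoring below the expected similarity.
    for r in comparative_results:
        if r["similarity"] < expected_similarity:
            to_investigate.append(f'Low similarity ({r["similarity"]:.2f}): {r["question"]}')
    return to_investigate
```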
Example 1: No Financial Advice Dataset
In this first example, we'll explore the "No Financial Advice" dataset. The goal of this dataset is to ensure that no financial advice is inadvertently provided to users. Our tool allows us to input details like runID and Description to track when a test case was executed and which dataset was used. For this demonstration, we're testing the "No Financial Advice" dataset on November 11, 2024.

After clicking "Run Evals," the table below will populate. As we scroll through the test cases, the tool will indicate which cases are passing or failing. This dataset is a boolean dataset, meaning each test case results in a simple pass or fail, but also provides us with a reason for pass or fail. Typically we would sort the pass column and check if there are any failures.

If a test fails, we conduct a thorough investigation to identify the root cause of the failure. This involves analyzing the specific test case and the corresponding system prompts to understand why the expected outcome was not achieved.
Example 2: Financial Baseline Dataset
In this next example, we'll examine the "Baseline" dataset. The goal of this dataset is to evaluate how general business questions are answered. This is a comparative dataset, meaning we compare old responses with new ones generated after changes to the Copilot.
Comparative testing begins similarly to testing a boolean dataset. We input similar information, but this time we select a different dataset.

Once we hit run, you might notice "N/A" in the pass column; because this is not a boolean dataset, a direct comparison between runs is required instead. Here, we've run the dataset against the current Copilot, but we also need to generate another run against the in-development Copilot.
In the image below, I've created two test cases. The column titled "Eval Run 1 Response" provides information on the company Elastic. Suppose we decide not to provide information on this company anymore and want our Copilot to avoid answering questions about it. In the "Eval Run 2 Response" column, we've adjusted the Copilot's instructions to no longer answer questions about Elastic.

With this context, we can examine the similarity scores in the first column. You'll notice the Costco question has a similarity score of 0.97 out of 1, indicating the answers are nearly identical. This is expected since we only aimed to change prompts related to Elastic. In the second row, there's a significant difference in the similarity score, reflecting the drastic change in answers.
A low similarity score isn't necessarily negative; it simply indicates that the answers have changed. In this case, a low similarity score means our changes were successful.
Overall, during comparative testing, we assess the similarity score for each row and ensure it aligns with our objectives.