Name: How to Effectively Evaluate AI Agents
Uploaded: 2025-09-10T15:30:57.361Z
Duration: 1 h 1 min 48 s
Description: How to Effectively Evaluate AI Agents

Transcript for "How to Effectively Evaluate AI Agents": Hi, everyone. We're kicking off our webinar. Just waiting a couple of minutes for people to join. In the meantime, I just wanted to make a brief introduction to our "How to Effectively Evaluate AI Agents" webinar. Just so you know, the webinar is going to be recorded, and we'll be sharing it with you afterwards. Also, during the webinar, feel free to ask questions in the chat, and we're gonna do our best to answer all of them as we go. And without further ado, I would like to introduce Julia and Connor, our lovely speakers, and pass it on to them so that we can get started. Julia and Connor, welcome on stage.

Perfect. Thanks, Barbara. Hi, everyone. Welcome to How to Effectively Evaluate AI Agents. I'm Julia Tran, joined by my colleague, Connor Jensen. Today, we'll take you from the hype of demos to the reality of evaluating agents so they work reliably, safely, and at scale in your enterprise. Connor and I will share what we've learned from working with organizations deploying agents and from ongoing research into evaluation practices.

So, just a quick level set to make sure everybody understands Dataiku as a platform, so that you get the context of where we're coming from and how we interact with our customers. The way we typically describe Dataiku is it's a single platform for the development and the deployment of AI, the full spectrum of data and AI tools, across four main pillars. One, it's for everybody. It's for people who have coding backgrounds and are data scientists or AI engineers, but it's also for finance analysts, people in the business, marketing teams, people who don't have a coding background. It's got a very varied interface so you can work with data in different ways depending on your skill set. It allows you to connect into all of your data.
Dataiku does not have a dedicated data repository within it. We're specifically designed to integrate into your existing data landscape, which means you can work flexibly across a hybrid environment: in the cloud, on prem, multi-cloud setups, etcetera. You can access that data, work with it, and leverage the investments you've made in your data architecture and the compute power of those platforms, without it being a prerequisite that all the data has been migrated or moved. Across the technology landscape, we very much future-proof you. I myself first used Dataiku in an on-prem Hadoop implementation and have since used it in Azure and in all sorts of other places. We really are very tech agnostic, aggressively agnostic, we sometimes say. We understand that enterprises have a huge variety of data platforms already in place, different LLM providers, etcetera, and we wanna make sure that you're able to orchestrate and bring all those things together in one place. And finally, all of this needs to be governed. Right? So how do you know who's using what project? How's the output? Who's using what LLM? What are the prompts? What are the responses? How are you evaluating it? We cover all the steps in that life cycle, from build to deploy, monitoring, redeployment, etcetera, to make sure that you have a very clear view as an organization of what you've built and what you've deployed.

Next one. So how do we do that? We think about it in terms of, again, sitting across your data and AI infrastructure, both the dedicated pieces and your enterprise systems, things like Salesforce or Workday, and integrating those into a single place where you can develop analytics.
So do your ETL, do your data prep, get things going, and also see insights from the data automatically generated, so it's pushing them to you. Also, in your modeling, doing data science and machine learning, you can do model development, again with AutoML and visual tools, or you can work in notebooks or a code environment if that's what you're typically doing. But that's also on the operations side. Right? What does it look like to deploy a model? What does it look like to maintain that model? How do you know how you're doing? Not just at speed, but scaling that effectively across tens, hundreds, thousands of models, and doing it in a way where you're able to maintain those, redeploy them as appropriate, and know how they're performing. And finally, on the agent side, where, obviously, we're gonna spend more of the time today. This is one of the newest aspects of the platform as we've been working into the generative AI space and agents. How do you build those agents, and how do you control those agents, which is trickier than even on the MLOps side of the ML stuff? That's a big piece of what we're gonna talk about today: that orchestration and control, and making sure that the agents are doing what you want them to do.

Thanks, Connor. For today's discussion, we'll walk through why agents matter, the challenges of deploying them, five key categories for evaluation, how to go beyond benchmarks, and then finish with Q&A. So businesses today are adopting agents because they promise speed, automation, and intelligence across workflows. Let's start with what agents are. The best way to think about them is as a spectrum of maturity, from LLM chatbots to highly advanced autonomous systems. The biggest difference between the GenAI of 2024 and the agents of 2025 is that agents can take actions to support complex workflows, not just answer a question.
So that means they can plan tasks across multiple steps, call external tools and APIs, adjust their strategies in response to results, and deliver outcomes that map directly to business processes. At Dataiku, we've seen our customers across every industry and every use case using agents as sales assistants, maintenance schedulers, financial analysts, and even R&D assistants, but the challenge is the same across all of them: determining whether or not they're enterprise ready. Agents represent a really unique opportunity to transform businesses, but moving from a successful demo to reliable enterprise use is where most organizations stumble. If you've been experimenting with GenAI agents for a while, you already know that success once doesn't guarantee success the next time. Let's explore why that is. At first glance, agents might just seem like another piece of software. Both applications and agents serve end users within workflows, both must meet enterprise standards for uptime, scalability, and compliance, and both require thoughtful design, testing, and governance. But where they diverge is in how they work, and they differ in two important ways. Unlike deterministic software, agents generate plans on the fly, adapt mid-task, and may take multiple valid paths to success. And outputs can vary between runs, even with the same inputs, due to probabilistic reasoning. As a result, the binary pass/fail tests and code reviews that work for traditional software applications don't work exactly the same way with agents, because agents can fail in more than one way. Instead, you need to evaluate how well agents reason, adapt, and deliver consistent value against real-world constraints. In the worst cases, skipping agent-specific evaluation means risking wrong outputs, poor UX, PII leaks, bias, latency, cost overruns, and more. These aren't just corner cases. They're systemic, enterprise-wide risks that can really cause a lot of problems across the business.
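As a rough illustration of the point that outputs vary between runs (this was not shown in the webinar), one way to replace a single pass/fail test is to send the same prompt several times and report a pass rate. The names `pass_rate` and `make_stub_agent` are hypothetical, and the stub agent simply cycles through canned replies in place of a real agent call.

```python
from typing import Callable


def pass_rate(agent: Callable[[str], str],
              prompt: str,
              check: Callable[[str], bool],
              runs: int = 20) -> float:
    """Send the same prompt `runs` times and return the fraction of passing outputs."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    passed = sum(1 for _ in range(runs) if check(agent(prompt)))
    return passed / runs


def make_stub_agent(replies: list[str]) -> Callable[[str], str]:
    """Hypothetical stand-in for a real agent: cycles through canned replies."""
    state = {"i": 0}

    def agent(prompt: str) -> str:
        reply = replies[state["i"] % len(replies)]
        state["i"] += 1
        return reply

    return agent


if __name__ == "__main__":
    # Nine acceptable replies and one unhelpful one.
    agent = make_stub_agent(["reset link sent"] * 9 + ["sorry, please repeat that"])
    rate = pass_rate(agent, "reset my password", lambda out: "reset link" in out, runs=10)
    print(f"pass rate: {rate:.0%}")
```

In practice the `check` function is the hard part: it might be an exact match, a rule, or a judge model, depending on the task.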
Gotta actually unmute myself. Alright. So we've talked about some of these different challenges. One of the key differences we talked about between software and agents is, how do you know an agent is doing a good job? And before we talk about some of the specifics, the way that I really wanna paint this, especially when you're used to thinking about software deployment, is that with software deployment and regression testing, the question really is: does the code function, or does the code not function? If it functions and it does what it's supposed to do, then you're good. If it doesn't function, then you've got a bug and you've gotta fix it. Unfortunately, in the agent space, it is not that simple, because an agent can still be functioning. But if it is still functioning and is no longer performing and doing the right things, that's maybe even worse than it not working in the first place. Right? You would rather an agent not work than have it making the wrong decisions, using the wrong data, or saying the wrong thing to your customer or your employee. And so this idea of performance evaluation of agents, and making sure that we know they're good, is really, really critical, because we don't want agents that are out there functioning but functioning wrongly. So as we think about the five key areas to evaluate here, the first one is task success and output quality. Right? So did it work? That's what we're talking about. But, also, did it give us the actual output that we wanted? Second one, user experience and business value. How do we make sure that the experience, whether for customers or internally, is the way that we want it to be, and that we're actually showing the ROI for the investment? Third piece is the reasoning and tool use.
Agents can select from tools and can be autonomous in the way that they decide to work through a process. Are they doing that in the most effective way? Does it use the same tool to answer the same question every time? Does it use different tools? Sometimes it might take more steps to get to where it needs to go. Right? So how do we understand the way that it's reasoning and evaluate that? Next is trust, safety, and compliance. How well can we trust it? I already see a question in the Q&A about hallucination. How do we know that we can trust the response that's coming out of it? How do I know that I'm protecting my data, my customers' data, PII, all that type of stuff? And am I in compliance, especially with things like the EU AI Act, as that evolves and as other countries, and states here in the US, start to put more regulations in place? How do we know that we're in compliance with all of those laws? And finally, what's the performance at scale? This is one of the most challenging things, especially because things don't always scale linearly. Right? You do a demo of something across a thousand or a million instances to see how it does. But scaling from a thousand to a hundred thousand to a million to a hundred million, sometimes it is not 10 or 100 times the resources. It's a thousand times the resources, or things start to break down. So how do you make sure that you're able to do this not just on the test cases and on the build side, but that you still have that performance, that you're not running into issues with latency or other issues there? So these are the five things that we're gonna talk through here. I'll give an example of each of these and then talk through some of the ways that we help to mitigate that. So the first one, on task success and output quality.
So let's think about an example of when an agent does the wrong thing and we have inconsistent or unreliable results. A simple example, relatively speaking, is we've built an agent that's able to reset a user's account password. Right? Rather than it being a very specific rules-based flow, click this, you get an email, there could be different ways. Maybe there's two-factor authentication in it. Maybe they're using an external something. Whatever. There are different ways that this can be solved depending on the actual instance of that customer, and so we wanna make sure that the agent can do that. And so we test it. It works nine out of 10 times. But sometimes, it does extra things along the way that it really didn't need to do. It asks the user for something. It makes them do the same thing twice, jump through hoops. I'm sure we've all had that in the real world trying to do something like resetting a password. I had one just yesterday. It took me, like, eight tries to figure out how to get in. I doubt they had an agent behind the scenes on that one, but this happens already. The whole point of an agent is that we're getting around those things. We're able to do it contextually. So we don't want an agent to create confusion, to take longer to resolve the customer's issue, etcetera. So how do we mitigate that? The key piece of why we care is that we need to know that we're getting the right output and understand that it's functioning. So what do we need to evaluate? We need to ensure that outputs are consistently accurate, they're reliable, and they meet what we're expecting from a business perspective. That's what we really want to be able to evaluate here. And how do we do that? There are different things that we need to be able to track along the way. The first one is really just, did it provide an answer? Right?
Like, did the agent make its way all the way through? This is that classic software testing: did the code function or did it not? If I didn't get an answer, then I know the agent's not working. But secondly, I also need a way of evaluating output correctness, both from a testing perspective and in the real world when things are out there. How am I testing it, making sure that this agent is continually giving the right results? Am I just checking random sample answers? Am I checking every answer? There are different ways to do that, and there are different reasons to use things that are maybe a little more heavy-handed. Something like LLM as a judge is gonna check every response, versus doing random sampling and having a human look at it, however you wanna do it. But you have to be checking that output correctness to make sure that not only was the task or the goal completed, but it was completed correctly. The other piece is instruction adherence. Did the agent work within the bounds that you've given it? Think about an internal agent that has access to HR data to be able to answer certain HR-specific questions, but it's constrained: hey, we only want you to use this data for these particular instances. Or, you can use this data, but you can't actually give personnel-specific information out, or something like that. Is it following those instructions, or does it go rogue and share salary information from one person to another? Right? And finally, and this is a theme we'll probably come back to a bunch of times today, is the idea of the expert in the loop. As we build agents, we are really thinking about building AI products that can act autonomously, use complicated reasoning, use tools to solve problems.
This is not something that we can do without having the right people in the loop. One of the best ways that I've heard this phrased, and probably many of you have heard this, is thinking about an agent as an intern or a junior-grade employee, somebody that's new and entry level. You can set them tasks. You can give them access to data or tools or whatever. But as a manager or a senior person, you are that expert in the loop, checking their work, making sure things are happening. This is a really important thing to do. And, again, this can happen both from a testing and a build perspective, but it really needs to happen in some way where you're evaluating stuff to say, hey, I wanna put a certain number of decisions past a human, past an expert who would normally have been doing this task or is the person who does the harder ones, to say, am I getting the right results here or not? And this is how you, obviously, can then continue to refine and improve that agent, but knowing how it's doing is the biggest challenge there. So that's our first one.

The second one. Again, something's going on here: poor user satisfaction. So we've got an internal HR assistant agent that helps you find benefits information. Maybe it gets the right answer, but it asks question after question after question of that employee before it gets that right response back. That's not really helpful. The whole idea of having that agent is that I don't have to go in and figure out the if-thens to get to the right document. Also, back to the previous thing I was talking about, does it have access to the HR system, to your Workday or whatever your ERP is, to be able to say, okay, I've typed in a question here. What country do I live in? Right? What state do I live in? What part of the company am I in? What program am I under, etcetera?
If it has to ask me that, you know, I go in and I say, hey, I wanna know if I have coverage for this specific medical thing, and it says, okay, do you live in the Americas or Europe or Asia? Then it says, which country do you live in? Right? That's data that should be there. I shouldn't have to go back and forth with the agent on that. And so this is where you really wanna make sure that you understand how the agent is working through the process, and whether it's doing it in a way that's gonna be better or worse than somebody just emailing HR. So let's talk about how we evaluate that. What do we need to do? We have to ensure an agent is delivering a smooth, intuitive user journey that drives adoption and ROI. This is a really interesting challenge to me, and something that hits home for me, because we ask a lot of data scientists and AI engineers, the people who are building and creating these things. They have degrees in math or computer science. They have experience in whatever domain. They're good at working with the data. But building a UI and testing user experience is often not in the background and domain experience of a data scientist or an AI engineer who is building these agents. And so we need to be really thoughtful about starting from: what is the UI, what is that user journey, and how are we making sure that the agent is delivering on it? And this extends into some of the other modeling things that we do from a data science perspective. So how do we measure that? Again, we want to evaluate this to make sure we know that it's working. First one: ask for user satisfaction. Right? Whether you're doing just a customer sat of, did this answer your question, yes or no? One through five? Or do an NPS score, the zero to 10, how likely are you to recommend my internal HR chatbot?
However you wanna do it, track that user satisfaction. Right? Find where you're getting bad results so that you can dig in and understand that. The second thing is checking for clarity and tone. Right? Again, this is using an LLM to look at all the responses: is the agent providing the response in the way that you want it to? How did you train that agent? More specifically, what did you give it as examples? Back to our HR example, I've got, I would assume, an HR email box or some sort of contact system that should have lots of history of HR people interacting with somebody to answer these questions. Then I use that to help train the agent, so that when the agent is interacting with people, it sounds like somebody from HR. Whether that's a good or a bad thing, I don't know, but whatever. We wanna be consistent. Right? Or we have a target, and we wanna know what that is. This is way more important when we're thinking about customer service agents that are working outside and interacting with our customers. We wanna make sure that person's experience interacting with the agent is analogous to the interaction that they would have with one of our employees. Third is engagement and adoption. This can be a real challenge, and has been a real challenge in the data science space: when you roll out tools, especially tools that are outside of somebody's existing workflow, do they choose to use them or not? In our HR example, okay, there's an HR chatbot. If I've gone to it a couple of times and it couldn't answer the question, or it took me twenty minutes of back and forth with an agent, I'm just gonna email HR from now on. Right? So I actually then look at, did this shift the volume of the contacts and actually satisfy those?
Or, when we're thinking about that customer service agent again, if that's the first line of what a customer interacts with, how often does it actually solve the question? Right? Does it get it right 50% of the time, 80% of the time, or 5% of the time? Right? Really see how that's going. Next piece here is the conversation quality. This is related to that clarity and tone, but, again, it's really understanding the dialogue, the back and forth. And this is where you start to marry that against the previous one, looking at what tools it used and how it answered. Did it answer the question that was clearly asked the first time, or did the person have to basically restate their prompt three, four, five times to help the agent actually understand it? So, again, it's not just clarity and tone. It's also, is the agent reacting to the data that it's receiving the right way? Is it answering things succinctly? Is it straightforward? All that type of stuff. And then, again, our recurring theme of the expert in the loop. If I have an HR chatbot like this, are HR people doing that review? How are we looking at the results on this? I see one of the comments in the chat is around these being qualitative versus quantitative. There is a mixture of both. Right? Things like user satisfaction are absolutely quantitative. Things like engagement and adoption, looking at the percentage of how much people are using these, that is very much quantitative. On clarity and tone and conversation quality, some of this comes into just our general LLM evaluation frameworks: is this the correct answer, is this the tone that we're getting, and stuff.
So a couple of the latter ones are definitely a little bit squishier metrics, but there are ways to make those quantitative. So it's a great comment.

Alright. Let's go to the third one here. So next, once again, what's happening? Okay. We have flawed reasoning and/or tool use. In this case, we have a financial analysis agent, and it's asked to prepare a revenue forecast. It's using a visualization tool and a reporting API, and it's automatically pushing that output into a dashboard or spreadsheet or something like that. So it calls the API, but in this case, it called it three times with the same or overlapping queries, and then it doesn't aggregate those results back together appropriately. So it makes a pretty report, but that report contains duplicate numbers. So here, the tool is functioning, back to where I started. I asked it to prepare this revenue forecast. I get a dashboard back. I get a visualization of it. It's got the data in there. Now I should be able to trust that. Right? So it's functioning. But in this case, it didn't properly aggregate the results that it got from calling the API. And the answer that it's now giving me is wrong, which is, again, to me, arguably worse than it not providing anything at all. Right? If it works, then I can trust it. If it's not gonna work, then I can go pull the numbers and do this myself. That's a better scenario than having to, as a human, either figure out whether these answers are correct or go redo the work, maybe multiple times. So how do we set this up to make sure that we understand the reasoning and that the tool use is right?
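To make the duplicate-numbers failure concrete, here is a small sketch (not from the webinar) of an aggregation step that counts each row once even when the agent called the reporting API several times with overlapping queries. The row shape (`region`, `month`, `revenue`) and the function name `aggregate_revenue` are assumptions made for illustration.

```python
from typing import Iterable


def naive_total(batches: Iterable[list[dict]]) -> float:
    """What the faulty agent effectively did: sum every row from every call."""
    return sum(row["revenue"] for batch in batches for row in batch)


def aggregate_revenue(batches: Iterable[list[dict]]) -> float:
    """Sum revenue across API responses, counting each (region, month) once."""
    seen: dict[tuple[str, str], float] = {}
    for batch in batches:
        for row in batch:
            key = (row["region"], row["month"])
            # Overlapping calls should agree; disagreement is worth surfacing.
            if key in seen and seen[key] != row["revenue"]:
                raise ValueError(f"conflicting revenue values for {key}")
            seen[key] = row["revenue"]
    return sum(seen.values())


if __name__ == "__main__":
    a = {"region": "EMEA", "month": "2025-01", "revenue": 100.0}
    b = {"region": "EMEA", "month": "2025-02", "revenue": 150.0}
    c = {"region": "AMER", "month": "2025-01", "revenue": 200.0}
    calls = [[a, b], [b, c], [a]]  # three overlapping API responses
    print("naive total:", naive_total(calls))
    print("deduplicated total:", aggregate_revenue(calls))
```

Comparing the two totals is also a cheap automated check: if they differ, the agent's calls overlapped and the report deserves a closer look.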
So we need to ensure that the steps are efficient, they're transparent, and they're verifiable from what was asked through to the outcome. The first one is understanding the plan and intent accuracy. This is where you really start to open the hood of the agent and look at the stack trace, to understand what steps it took. You can see, here was the prompt that came to that agent. Then you can pull up the agent reasoning that says, this prompt is asking for this, and now I'm gonna send it to that tool, and then this, whatever. You can see what that plan looks like. This is where you can start to aggregate questions that are similar. There is a big aspect of expert in the loop here, as always, but this can be quantified to say, okay, I have 500 queries that were all about the same topic. And of those 500 queries or prompts about that same topic, I saw 80% of them take this path: this tool, this tool, then response, and that was accurate. 10% of them took a couple of other stops and then went back through the same thing. And then the rest, maybe, are somehow just giving me wrong answers. So you have to be thinking about checking these as they're going. You can do it on the fly. You can set these metrics up to track, and you can even have, if you really want, an LLM as a judge in the middle to do this, but that is a part of the scaling side. The tool call accuracy is part of it. Right? How do I know which tool it pulled in? If the plan said, call this tool, this tool, and this tool, are those the tools that it actually called? That's one piece. The second piece is, it followed the plan, but the plan was wrong. I need to understand those different things. Then the trajectory efficiency. This is really, is it doing it in the simplest path possible? Right? Can this be satisfied?
I'm asking questions about revenue from last quarter. There's a text-to-SQL thing that's really just gonna pull it up from our data warehouse. There are checks that we can absolutely put in place, maybe even before the response goes out, or maybe after the fact in aggregate, to say there's a certain pattern: a question about historic revenue should always be answered via this tool that's looking at actual retrospective data. If something is asking about a revenue forecast, like our example in the previous slide, that should then be pulling up my forecast model, whatever my ML or forecast model out there is. You can even just do this by keyword. Right? Like, for every prompt that asks for a forecast, did it use that tool right away? Did it have to go back and forth, etcetera? Next piece here is error handling. This is a big one, especially in production. How are you doing the error handling? If the agent gets a bad result, right, or gets a null result, say it's calling another LLM to pull data out of the RAG, and it comes back and just says not enough data, or it's inconclusive, how is it responding to that? If it pulls those things together, do you have a way that it's checking things, like where we talked about the forecast stuff? Do you have a checkpoint at the back to say, hey, does this match some specific dataset, or something along those lines? How is it handling the error? Does it just spit out the answer? Does it come back and say, I'm not getting a good one, I'm gonna rerun, I'm gonna go back through the process again? Does that happen one time, two times, three times? Can you have an agent get stuck in an infinite loop where it hits an error and it's just gonna keep going back through?
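One hedged way to picture the retry question (a sketch, not something demonstrated in the webinar) is a wrapper that caps both the number of attempts and the tokens spent, then escalates instead of looping forever. `ToolResult`, `call_with_budget`, and the default limits are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ToolResult:
    value: Optional[str]  # None models a null or inconclusive result
    tokens_used: int


def call_with_budget(tool: Callable[[str], ToolResult],
                     query: str,
                     max_attempts: int = 3,
                     token_budget: int = 2000) -> dict:
    """Retry a tool call until it succeeds, attempts run out, or tokens run out."""
    attempts = 0
    tokens = 0
    while attempts < max_attempts and tokens < token_budget:
        attempts += 1
        result = tool(query)
        tokens += result.tokens_used
        if result.value is not None:
            return {"status": "ok", "value": result.value,
                    "attempts": attempts, "tokens": tokens}
    # Hand off to a person or a fallback path rather than rerunning indefinitely.
    return {"status": "escalate", "value": None,
            "attempts": attempts, "tokens": tokens}


if __name__ == "__main__":
    def always_inconclusive(query: str) -> ToolResult:
        return ToolResult(value=None, tokens_used=500)

    print(call_with_budget(always_inconclusive, "revenue forecast for next quarter"))
```

Logging the returned `attempts` and `tokens` for every call gives the tracking metric the speaker asks for: how often the agent reruns and what that costs.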
You have to be proactively thinking about the ways we want to handle errors, especially for certain types of queries, but then also how we're gonna track these and make sure that things don't run amok or use a whole ton of tokens because the agent keeps running the same query over and over again. And finally, expert in the loop. I'm a big fan of agents having at least some level of almost A/B testing, where we're throwing some of what would be handled by the agent to one of our experts and using that as part of our validation set as well. But either way, whether we're looking at aggregate results afterwards or testing in process, making sure that we've got an expert as part of that process who can look at what's going on and evaluate those results is incredibly important no matter how you do this.

Alright. Let's go on to our fourth one. So what about when it goes wrong around trust, safety, and compliance? We have a customer support agent, and the customer is saying, can you confirm my last order? Here, instead of just giving the status of the order, it displays an address and credit card details, something like that. So, sure, maybe it also got the answer and said, here's your order, mister customer, etcetera. But we've now exposed PII. Maybe in this case we're exposing the PII to the customer themselves, so it could be worse, but there are plenty of instances where we could be displaying PII externally, or if we've got this set up as an API. This creates reputational risk. If you're in a regulated industry, this is a compliance violation. If you're under GDPR or things like that, there are additional consequences. So there are a lot of risks here if we get the trust of the response or the safety and compliance wrong. How do we check for that?
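For the order-status example above, a very simple output guardrail might look like the following sketch. It is an illustration only, not a complete PII detector and not something shown in the webinar: the two regular expressions and the `redact_pii` name are assumptions, and real deployments typically need far broader detection than card-like digit runs and email addresses.

```python
import re

# 13 to 16 digits, optionally separated by single spaces or hyphens.
CARD_PATTERN = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")
EMAIL_PATTERN = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")


def redact_pii(text: str) -> tuple[str, list[str]]:
    """Mask card-like numbers and emails; return the text and what was found."""
    findings: list[str] = []
    if CARD_PATTERN.search(text):
        findings.append("card_number")
        text = CARD_PATTERN.sub("[REDACTED CARD]", text)
    if EMAIL_PATTERN.search(text):
        findings.append("email")
        text = EMAIL_PATTERN.sub("[REDACTED EMAIL]", text)
    return text, findings


if __name__ == "__main__":
    draft = ("Your order 10482 shipped. Card 4111 1111 1111 1111 on file, "
             "contact jane@example.com.")
    safe, found = redact_pii(draft)
    print(safe)
    print(found)
```

Returning the list of findings alongside the redacted text means every hit can be logged, which supports the auditability the speaker goes on to describe.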
So we have to ensure that agents are behaving within safe, ethical, and regulatory boundaries, and that we can audit all of this. That is a huge component of it. Again, a question that's come up a couple of times is, how much of this can we do in testing, and how much can actually happen in production when we have stuff that's live? You have to set up the appropriate guardrails, filters, and auditability to be able to look at both results in aggregate and individual results, and check for things like that. So, putting on a toxicity filter. We've all seen examples of LLMs going rogue, sometimes because people deliberately provoke them, but often enough where they just sort of do it themselves. We need to have that filter to make sure that the LLM is responding appropriately. And this should be something that's always on. Every single result can be going through a toxicity filter. Hallucination and grounding. There was a question about this earlier. Hallucination can be very tricky. If we go back to the analogy of thinking of these as an intern or an entry-level employee, sometimes the answer just doesn't quite come out right. It's not grounded in reality, and that's part of having the expert in the loop there. But for evaluating those things, sometimes it's just as simple as having LLM as a judge in the process. You ask a question, it gets a response. Before the response comes back to you, you run it through another LLM that asks, does this answer actually sound right? And sometimes you can catch things there. Hallucination is still a really tricky one. This requires a lot of the build upfront. Right? How are we testing before we build it? But we do need to make sure that we're checking on the back end. Policy adherence. Is this following our company policies, our data use policies, etcetera? How do I check those? Again, you can put in something that is a specific check to say, hey.
This is a guardrail. We have a policy around the use of a specific dataset, and I want to check, did this agent use data from a dataset that policy restricted for this use? Again, these are the sort of custom things that you're building. These aren't necessarily just an API that you can call; you're building these guardrails. Prompt injection resistance, though, is one where there are a lot of tools out there making it easier for us to keep customers or users from jailbreaking the guardrails that we put around an agent. So how do you monitor for prompt injection? How do you put prompt injection detection in place? What is the handling for that? Do you just return a null result or an error result, or do you shut down the interaction with that user entirely? Fairness and bias checks. This is especially important when we're thinking about what we're getting back, whether it's a subpopulation analysis or some other sort of responsible AI thinking, and putting those checks in. Am I aggregating results and making sure that my results are consistent across responses to people of different genders, people of different backgrounds, people of different levels of seniority within my company, etcetera? Those should be checked. Often that's done more in aggregate and in batch than on the fly, but where you know things have gone wrong before, you can put specific checks in place. The auditability side: how do I look at what has happened? How do I look at every choice an agent has made? How do I look at every choice a person made with that agent? How am I able to check those? A question in the chat speaks directly to this, the idea of white box, or open, versus black box systems.
Personally, and I would say Dataiku philosophically, I'm always a fan of the open box, the white box, because if you can't look under the hood to understand what prompts were sent and how an agent acted, you're really unable to do the expert-in-the-loop review we talked about. This is very much that auditability. Can I look at every query? Can I look at every response? Can I look at every decision made afterwards? This really does need to be open. It's one of the concerns I especially have with the off-the-shelf agents that companies are coming out with at this point in time: if you can't evaluate for yourself how that agent is doing, and you're basically offloading that validation completely to the provider of the agent, that makes me really nervous. An example of something like that going the way you don't want it to: there was a startup a few years ago, pre-agent, but very heavily in the ML space. Some of you will probably know this example when I talk about it. It was an HR evaluation tool used in the recruiting process to help screen candidates, and it was a fairly closed system. It was just an API. You put data in, you got a response back: interview this person, don't interview this person, etcetera. They had a change in their data from something upstream of them that turned their results very discriminatory, and there was no way to understand that that was happening. There was no way for those users to go in and ask, what is the reasoning here? What is the response rate? What is the variable or the tool that's giving me this result? You need to be able to go in and look at that, and this is going to be hard. Again, this is a scale challenge, which we'll talk about in the next one. Am I having an agent being called 100 times a day?
Evaluating that, as long as it's all in there, I can have an expert check all 100. I can do automated checking, etcetera. Start talking about 10,000 or a million queries a day, and how you do that auditability becomes a bigger challenge. Again, we talked about random sampling and other ways you can solve for auditability. But if the system is a closed system, if it's a black box, then you can't audit anything at all. So that's a prerequisite for me: these systems should be as open as possible. Alright. Finally, number five. Let's talk about agents failing to scale. So we've built a supply chain planning agent. It is supposed to generate a daily production schedule for 50 factories. We build it, we test it, and we pilot it on five plants. And it does plan those. It meets the demand forecast that we have coming in, and it develops a production schedule that will actually meet it. But then we roll it out, and we go from five to 50. We've 10x'd the queries coming into it, and suddenly we have a massive latency issue. It's hitting the same data multiple times, so maybe we're asking too much of the data warehouse underneath, or it has to call the demand forecast model, and if I call the demand forecast model 10,000 times in the morning, that chokes somewhere along the way. This can then result in API rate limits that cause incomplete schedules. So now suddenly I'm underproducing a certain product somewhere and overproducing somewhere else, because I didn't get the forecast information, or whatever other data it needs to look at, the way that I needed to.
So I think it's pretty evident how this stuff can go wrong when scaling happens. It can be a challenge in many different parts of the process. It's not necessarily the agent itself that's going to fail to scale. It could be the tools that it's accessing underneath, or something like that. So how do we make sure that we are able to scale effectively? Ensuring that we're meeting enterprise-grade requirements under real-world workloads is easy to say and hard to do. This tends to be something where you have to be really, really aggressive in testing before you start to roll agents out. So, measuring something like latency and response time. Yes, of course you do that as you're piloting, testing, and building a tool. But before you go live from those five to those 50, run it in parallel with an existing process, or just run it for one month and see how it goes before you start those rollouts, and do it at scale. And no, you shouldn't be trusting the LLM for the demand forecast. LLMs can access demand forecasts, but the LLM shouldn't be providing the forecast itself. Another piece we want to look at is token and cost efficiency. This is another place where you can really run into challenges. We had a customer in the airline industry who built a tool for gate agents that helped improve loading times, and they tested it out. It actually showed really good results in what it could do for their interactions with their customers. But when they looked at scaling that to thousands of gate agents at any given point in time, the cost of doing that just wasn't there. It did show positive results, and it was doing what they wanted it to do.
But once they tested the scaling on it, how much they would need to spend from a token perspective meant it wasn't a cost-effective way to do those things. So, again, test it during your test phase, but also run it at scale for a small amount of time. Use things like cost guardrails on top of it that say, okay, run this query. We've run it over a thousand records, or five plants, or whatever it is. Now we want to run it over the whole dataset, so put something in there that says, if this is going to cost x tokens more than we expected, just cancel it. Don't even run it at all. So, again, you really need to be looking for these things in the development phase, but very much in the test phase before you start rolling these out. Then the tool and API overhead. I talked about that in terms of accessing a demand forecast. How much am I calling a specific tool? How much am I calling a specific API? Were those APIs developed to handle that? If I have an API that was built for a system used by humans who are manually inputting data and calling the API, maybe it's being called just two or three hundred times a day right now, and that's the scale it's been built for from a latency and performance perspective. An agent can suddenly 10x that, literally within the first hour. So are the tools that we're giving it built for the scale that we may actually require of them? Uptime and scalability. This is the classic IT challenge, especially for things that are customer facing. How do we guarantee uptime? Do we have redundancy if an agent fails? Or if those tools fail, how are we going to handle that?
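The cost guardrail described above, projecting the full run from a pilot and cancelling when it would exceed what we expected, can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a definitive implementation: it assumes token usage grows linearly with the number of records, which real workloads may not honor, and `PilotRun`, `projected_tokens`, and `should_run` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class PilotRun:
    """Observed usage from a small pilot (for example, a thousand records)."""
    records: int
    tokens_used: int


def projected_tokens(pilot: PilotRun, full_records: int) -> float:
    """Linearly extrapolate pilot token usage to the full dataset (assumption)."""
    tokens_per_record = pilot.tokens_used / pilot.records
    return tokens_per_record * full_records


def should_run(pilot: PilotRun, full_records: int, token_budget: float,
               tolerance: float = 0.0) -> bool:
    """Allow the full run only if the projection stays within budget.

    `tolerance` is the fraction over budget we are still willing to accept.
    """
    return projected_tokens(pilot, full_records) <= token_budget * (1 + tolerance)
```

For example, a pilot of 1,000 records that used 250,000 tokens projects to 12.5 million tokens over 50,000 records, so `should_run` would refuse a 10-million-token budget and allow a 15-million-token one.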
And are those underlying tools and those agents built to be horizontally or vertically scalable, so that we have failover, horizontal deployment, and things like that to give us high availability? We haven't really talked in seriousness about high availability of an agent yet, but it's definitely something that will be there as we start to build more and more business-critical agents. The last piece here on scale is the error rate. As we do more and more, there's more impetus to make sure we look at that error rate and that it is either declining with scale or staying the same. If we've looked at some of these earlier evaluation pieces and asked, did it fail 1% of the time? Did it fail 0.1% of the time? Did it fail 0.01% of the time? If we now scale that out, does it scale linearly and stay at that 0.1%? Or does it now suddenly start failing 1% of the time? Or is it failing at particular points in time? Or, again, it could be functioning but giving me wrong results. However we're looking at that, we have to be really cognizant of what that error rate looks like. How are we doing error handling? Does it just send the request to a human to provide the response, etcetera? And is that set up so that we're not going to overwhelm the system when something that we thought would have a 1% error rate suddenly has a 20% error rate, and we don't have the people in place to deal with that? Finally, expert in the loop. This is an interesting one because the expert here is often as much on the IT side of the house as on the business side. For most of the things we've talked about so far, I'd say business experts are the ones thinking them through.
A lot of these things, though, are really about who on the AI team, the engineering team, the DevOps team, or whoever you have that's responsible for this, is handling things and making sure the agents are performing effectively across all of these different metrics. Alright. So to tie this up, and we'll switch to some Q&A here in a second: what does enterprise readiness look like beyond just setting benchmarks and looking at some of these things? Here are our key takeaways. We talked about five different areas that you need to be able to evaluate: task success and output quality, user experience and business value, reasoning and tool use, trust, safety, and compliance, and performance and scale. These aren't five options. These are the things that you need to be able to do across all of your agents. So how do you do that as an enterprise, at scale, across multiple agents, and hit all of these different areas? There is a sort of hybrid approach. There are some IT functions in here. There are some business functions in here. There are some things that have to happen beforehand, some things that need to happen after you've deployed, and things that need to happen in production. So when you think about it from a phase perspective, you have the testing. Did we do all the realistic things while we were building it? Deterministic test suites to test for some of the things we've been talking about, with a human looking at the results. Repeatable scoring that's tied to some sort of promotion gate or catalog or something like that, to repeatedly ask, does this meet the threshold that we need it to meet?
Whether it's an error rate that's low enough or an accuracy rate that's high enough, we're testing that very specifically and repeatedly throughout the build phase to make sure that we're not progressing until we've hit what we want those to look like. Before we go and launch it, we should be looking at deterministic and open-ended scenarios. The deterministic ones, our testing stuff, are really: I have examples of this customer interaction, I'm testing to see how the agent has done, and I'm pushing it forward. But I also need to be able to ask some more complicated questions of it before we launch. That's where I don't necessarily have what the right answer is, so I have to have people who are evaluating those. Doing regression tests on every change. Again, the regression test isn't just, does this code function, does this agent function and give me a response; I'm evaluating those regression tests against the different areas that we've talked about. And then pilot comparisons: A/B testing against baselines, against humans, against old models. This can be an interesting way of looking at, say, a vendor solution versus something internal, by actually running those comparisons. And finally, there's still evaluation that has to happen in production on the fly. Some of it can be automated, the metrics we talked about earlier that are very quantitative. Some of these are qualitative, and we need expert judgment for nuanced behavior, as it says here. We have to mix both of those things. The things that we can quantify, we're looking at quantitatively, ideally at run time or shortly thereafter depending on their sensitivity. And then we have to have continuous monitoring to catch where it starts to go off the rails.
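As a rough illustration of what continuous monitoring of a quantitative metric could look like, here is a minimal Python sketch of a rolling error-rate check. Everything in it is an assumption made for illustration: the class name `RollingErrorMonitor`, the window size, the threshold, and the idea that each agent call can be labeled as failed or not. It is not a description of how any particular platform does this.

```python
from collections import deque


class RollingErrorMonitor:
    """Track the error rate over the most recent calls and flag when it drifts."""

    def __init__(self, window: int = 100, threshold: float = 0.05,
                 min_samples: int = 20) -> None:
        self.outcomes = deque(maxlen=window)  # True means the call failed
        self.threshold = threshold
        self.min_samples = min_samples

    def error_rate(self) -> float:
        """Fraction of failed calls in the current window (0.0 if empty)."""
        if not self.outcomes:
            return 0.0
        return sum(self.outcomes) / len(self.outcomes)

    def alerting(self) -> bool:
        """True once enough calls are seen and the error rate exceeds the threshold."""
        return (len(self.outcomes) >= self.min_samples
                and self.error_rate() > self.threshold)

    def record(self, failed: bool) -> bool:
        """Record one call outcome and return the current alert state."""
        self.outcomes.append(failed)
        return self.alerting()
```

Because the window only holds the most recent calls, the alert clears on its own once the agent recovers, and `min_samples` keeps a single early failure from triggering it.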
It can't be that we're looking at this at the end of every month or every quarter or something like that. This really has to be continuous monitoring so that when something starts to go awry, we're able to react as quickly as possible. And it's come up repeatedly: people. Experts in the loop are crucial for you to be able to do this successfully. So let's go to the next slide. Here's how we think about this. Your subject matter experts are critical to all of those things we talked about on that previous screen. How do I develop this agent and validate that it's doing what it should before it sees the light of day? It has to be validated by the people who know how to do what we want this agent to do. They can look at the risk. They can look at benchmarking. Does a 1% error rate matter? Is 5% acceptable? Is 0.5% too high? All those things. This is also really critical for building trust, especially in terms of the adoption we talked about earlier. If your experts, your SMEs, your really trusted people in the business don't believe in the agents, it's going to be very hard to get people to adopt and use them. Conversely, when the experts are involved, when they've helped to build it and they're championing it on your behalf in the business with their peers and their subordinates, that's what will really help you drive adoption very quickly. Awesome. Alright. So before we turn to the Q&A, a couple of things. If you don't know Dataiku at all, you're welcome to scan the top QR code there. There's a quick three-minute video demo that shows how Dataiku does analytics, models, and AI agents on the platform. We also have another webinar coming up soon on designing scalable governance programs in a way that supports and enhances your business and innovation rather than restricting it.
So we'll leave that up for a second, and then we will switch to some Q&A. Awesome. So, let's go through some of the questions that weren't already answered during the presentation. One question was, how are MCP and context engineering reducing the need for complex evals? MCP and context engineering reduce evaluation complexity by constraining how agents operate. MCP standardizes tool use with predictable, auditable interfaces, which cuts down on unsafe or inconsistent behaviors. Good context engineering grounds agents in structured prompts and trusted data, which helps to reduce hallucinations and variability. Together they can shrink the risk surface, so evaluations focus less on catching errors and more on measuring business value and user experience. Someone else asked, is there any way to assess intent accuracy, like LLM as a judge? You can. You can use something like LLM as a judge, and you can also do this with some of that actual human expert testing. So some of it you can do with an LLM. I'm always cautious about relying too much on LLM as a judge, actually more from a cost perspective than from an accuracy perspective. But honestly, this is one of the things that I think will be really interesting and tricky over time as we think about evaluating how an agent does. With simple metrics, we need to move to the next layer, beyond just, did the agent provide the answer that we expected in a reasonable fashion, or the other things we talked about. Did the resulting decision or action, especially if it's an agent designed for a person to interact with, actually turn out to be what we think the right next step is? Did they do what the agent suggested? Did they provide that answer to the customer? Did they run the forecast that the agent gave them, etcetera?
How are we actually tracking the end business results? So that context, from objective tracking, will shift over time. Yes, we can use experts in the loop, sometimes ahead of the fact but usually after the fact: did this question get answered in the way the person asking it needed? We can use LLM as a judge to do that to some extent. But over time we'll really need to shift from just, did the agent give the person the result we would expect based on the context, to, did the actual action and business result come out of it? That is, the customer query should have resulted in this happening; even if the agent gave the right answer, did the person do the right thing with it? Someone else asked, is it safe to use LLMs as evaluators for vibe and tone, and automate that over domain experts? Using LLM as a judge, you can definitely automate baseline scoring for tone and vibe, and route edge cases and high-risk outputs to subject matter experts, or experts in the loop, for validation. You'll want to continuously compare LLM judgments against SME judgments to calibrate and tune the evaluator. So it's definitely safe to use LLMs as a judge, but not to fully replace domain experts. And I think that ties into some of the other Q&A we have here around experts in the loop. Someone asked, how have you balanced the expert-in-the-loop requirements so SMEs aren't spending all of their time babysitting agents? There's not a simple answer to that one. Great question. I definitely get where you're coming from. I think part of what you need to look at is the criticality of the agent. Not all agents are created equal, and not all agents have the same potential to do good or bad things.
So it's about really being thoughtful and cognizant. Okay, I'm rolling out three agents in my sales operations team. This one will be working very autonomously and providing output straight to a customer, versus this one that's going to provide something to my salesperson, or this one that's going to provide a dashboard or something like that. Those don't all have an equal level of criticality, or of danger in going wrong. So be judicious about what the right tempo and timeline looks like for SMEs there. The second piece that I'd say, and this challenge goes beyond agents, but I think it's important, is that we also need to be thinking about that expert pool and not having the same experts every time. Am I asking the same person to babysit these three or four agents? Am I asking a team of people to share that responsibility, and really managing it? I'm curious to see how this evolves in the business world: how will you evaluate your employees' or your experts' use of agents? That tells us how an employee is doing, but it also tells us how an agent is doing. If I rolled out an agent that has an employee or a team spending all of their time babysitting the agent, did it speed anything up? Are they able to do their job? Or is it taking some of the lower-value work off their plate and allowing them to do something different? That's something that I think is important for us: as we encourage employees to use AI effectively, we should be tracking their usage of it and how they interact with agents, etcetera.
But that is then also a part of the performance evaluation of the agents themselves, to make sure that agents aren't just being a drain on our people. Okay. And then let's get to a last question here as we're in our last minute. Another question on humans in the loop. Derek asked: experts in the loop is key, but I often see a reluctance from experts, even an outright lack of cooperation, due to concerns over training their replacement. Any thoughts on the human side of this in getting buy-in? I mean, not to be overly tongue in cheek, but it's a valid concern for people. We are seeing examples of that. A big part of it comes down to how your organization is handling rolling out things like this. I've seen this a couple of times in the startup space recently: oh, we bought an agent that does our business development representative work, outbound sales type stuff, and we fired the entire team. If you did that, and then you go to a salesperson and say, hey, we're looking at an AI agent to do these things, well, if I'm that salesperson, there's no way that I want to help with it. So the answer to that concern is that the proof is in your actions as you do these things. The skepticism is completely valid and understood, obviously. And so it's through how you roll these things out and how you're communicating that this is really going to get solved. Obviously, on the first one or the first couple, there can be a dance in finding the right people, because if they are really resistant, then you don't want them to be part of the project. So finding the right experts to help out is important, especially early on.
But the only way you truly solve that is through how you as an organization roll these things out. Are you using these as ways to make departments more efficient and able to do more advanced work, or are you using them to replace people? Everybody will see how you're acting there. Thank you, Connor. So feel free to contact us via email or reach out to your account manager. In the next couple of months, we'll be releasing a lot of fun and exciting product launches that will help with evaluation in Dataiku as well as managing the agentic life cycle. So stay tuned. We'll have more webinars as well on how to do evaluations in Dataiku using Dataiku tools, as well as more resources and content coming out on integrating humans and experts in the agentic life cycle. So please keep an eye out for those, and I hope that everyone has a great day. Yep. Thanks, everybody, for the great interactions. Appreciate all the questions and chat. Have a good day.