Eval API converge #35312
Conversation
Next Steps to Merge: ✅ All automated merging requirements have been met! To get your PR merged, see aka.ms/azsdk/specreview/merge.
PR validation pipeline restarted successfully. If there is an ApiView generated, it will be updated in this comment.
API Change Check: APIView identified API-level changes in this PR and created the following API reviews.
```typespec
disableEvaluationRunPersistence?: boolean;

@doc("Extra storage options for evaluation runs, the options includes Kusto, Blob storage, App Insights, etc.")
extraEvaluationRunStorageOptions?: Array<unknown>;
```
Is this WIP to define storage options instead of unknown?
Yes, we only have a spec for the list of storage options, but we haven't aligned on whether we will only allow an AI Foundry connection.
Well, App Insights is currently a custom configuration for agent evaluations, so at least we would need to support that? In addition to Kusto, ADLS, etc.
```typespec
continuousEvaluationConfig?: ContinuousEvaluationConfig;

@doc("Allow the user to opt-out of evaluation runs being persisted in the AI Foundry. The default value is false. If it's false, the evaluation runs will not be persisted.")
disableEvaluationRunPersistence?: boolean;
```
What does it mean here to be persisted? If it's set by the user, do we persist the runs to the current cache or somewhere else?
If it's "true", the data will be stored as an asset, which will be available until the user deletes it; otherwise, it'll be in a cache for N hours.
So to clarify, if I want to persist results from continuous monitoring (and so presumably the volume is such that I don't want to use assets) then I would want disableEvaluationRunPersistence = false, but extraEvaluationRunStorageOptions non-empty?
If that is the case, we may want to rename it... since disableEvaluationRunPersistence makes me think nothing will be saved anywhere?
I agree, we have multiple storage tiers, from hottest/managed to coolest/customer-owned: 1) cache, 2) asset store, 3) App Insights/ADLS/Kusto, etc.
If enabling persistence allows users to get the results past the cache's 24h expiration mark, then can we name it something like storePermanentApiGet or something of the sort, to signify why someone would want to use assets (and most probably get billed for those)?
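To make the combinations being discussed concrete, here is a rough sketch as plain Python dicts. Only the two property names come from the TypeSpec above; the option shapes, the connection field, and the cache window are assumptions for illustration, not the actual wire format.

```python
# Illustrative configs for the three storage tiers discussed above.
# NOTE: hypothetical shapes; the exact semantics of the flag are still under discussion.

# 1) Cache only: results expire after the cache window (roughly 24h per this thread).
cache_only = {
    "disableEvaluationRunPersistence": True,   # opt out of asset-store persistence
    "extraEvaluationRunStorageOptions": [],    # no customer-owned sinks
}

# 2) Asset store: results are kept until the user deletes them (and may be billed).
asset_store = {
    "disableEvaluationRunPersistence": False,
    "extraEvaluationRunStorageOptions": [],
}

# 3) High-volume continuous monitoring: skip assets and fan results out to
#    customer-owned storage (App Insights / ADLS / Kusto); option shape is TBD.
customer_owned = {
    "disableEvaluationRunPersistence": True,
    "extraEvaluationRunStorageOptions": [
        {"kind": "appInsights", "connectionName": "<foundry-connection>"},  # hypothetical
    ],
}
```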
| @doc("Percentage of sampling per hour (0-100).") | ||
| samplingPercent: float32; | ||
|
|
||
| @doc("Maximum token usage per hour (0 to 1000).") |
Are these ranges enforced anywhere within this spec?
Good point, let me remove "0 to 1000"
```typespec
@doc("Evaluation target in simple JSON format.")
@added(Versions.v2025_05_15_preview)
model EvaluationTargetSimpleJson extends EvaluationTarget {
```
What is this meant to represent? I suppose it's not for agents, but for something else?
It's the simple JSON format we support for evaluation: https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/evaluation/azure-ai-evaluation/samples/data/evaluate_test_data.jsonl. I think it's helpful to keep this flat, simple version to help customers test our product. What do you think?
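For reference, a row of that flat format looks roughly like the following; the field names follow the linked sample, and the values are made up.

```python
import json

# One record of the flat "simple JSON" (JSONL) evaluation input; a JSONL file
# is just one such object per line. Values are invented for illustration.
record = {
    "query": "What is the capital of France?",
    "context": "France is a country in Western Europe. Its capital is Paris.",
    "response": "The capital of France is Paris.",
    "ground_truth": "Paris",
}

print(json.dumps(record))
```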
```typespec
response: string;

@doc("The context for the LLM query.")
context?: string;
```
How is context different from query? Because in the agent world, query has all the context (i.e., the conversation, previous runs, etc.).
It's the "compatible mode" for non-agent (a.k.a. inferencing) as well as custom evaluators, while the AI developer wants to run agentic data, we recommend them to go for another target EvaluationTargetFoundryAgentData
I think the big thing to recognize here is that we are essentially delegating all input schema requirements to individual evaluators. An evaluator will define what "context" or "query" etc should mean, how those should be formatted, whether any additional fields are accepted or required, and so on. So the eval API makes no judgement or validation, that is left up to individual evaluators.
Agree with what Steve & Sandy suggested; we do need a conversation input type. Let me add it shortly.
Maybe I don't follow what you mean exactly; let me try to explain my point of view a bit more. I think there are two high-level approaches:
- a strongly-typed schema. We have certain fields, and we have opinions on what those fields should look like. At the API level, if someone submits a request, we can validate those fields. All evaluators know what to expect from a given field.
- weak or no typing. Essentially all fields are just strings. We won't do much validation on fields at the API layer; evaluators will have to validate fields themselves.
At this point, my feeling is we're most likely looking at option #2, if we need to preserve backward compatibility (which we probably do because some evaluators are GA). For example, for some legacy evaluators, query and response should be simple strings. For a newer evaluator, query and response should be serialized json messages matching a certain format. For another evaluator (ISA), context should be a conversation history, query should be a simple string prompt, and response should be a simple string. For groundedness, context should be some source material. Code vulnerability I can't even remember, but it doesn't match any of the above. And so on. So there's already no consistency between existing evaluators, and if we're going to add 3P evaluators then the situation will get even worse.
If we take as given that we won't have strong typing, then trying to add enough fields to cover all the use cases also seems like an endless task. Right now, if we combine all the fields that any evaluator might want to use (from existing AIFO and Nunatak), it would include:
query
response
conversation
prompt
tool_calls
tool_definitions
invoked_skills
context
ground_truth
additional_metadata <-- this is a catch-all that is used to provide other fields like agent_purpose, agent_name, schema, query_results, chosen_skills, etc.
So -- we could try to make an effort to rationalize different evaluators so that they all map to a common schema (e.g., query and conversation are sometimes interchangeable, and we could go with one and make evaluators migrate). That could involve some breaking changes, but those might be palatable with API and evaluator versioning. Or we could just let the API accept any json, and it's up to the evaluators to deal with it. I don't have a strong opinion, other than that we should make a choice and plan for it.
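A rough sketch of the two approaches, using plain Python dicts with hypothetical field names drawn from the list above (not a committed schema):

```python
# Option 1: strongly-typed schema. The API knows the fields up front and can
# validate a request before any evaluator runs.
typed_item = {
    "query": "How do I reset my password?",
    "response": "Go to Settings > Security and choose 'Reset password'.",
    "context": "Support KB article ...",
    "ground_truth": "Settings > Security > Reset password",
}

def validate_typed(item: dict) -> None:
    """API-level validation is only possible when the schema is fixed."""
    missing = {"query", "response"} - item.keys()
    if missing:
        raise ValueError(f"missing required fields: {sorted(missing)}")

# Option 2: weak/no typing. The API passes the payload through untouched and
# each evaluator interprets or rejects whatever fields it needs.
freeform_item = {
    "conversation": [{"role": "user", "content": "..."}],
    "tool_calls": [],
    "additional_metadata": {"agent_purpose": "support bot"},  # catch-all bucket
}

validate_typed(typed_item)       # passes
# validate_typed(freeform_item)  # would raise: no 'query'/'response' fields
```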
I'm just imagining the matrix in help docs... For the simpleJson type, inputs must look like this and these evaluators are available; for the foundryAgentData type, inputs must look like this and this different set is available; for the customJson type, inputs can look like anything and this other set is available. I feel like this might add more confusion?
Very good question. This is the data input for a single evaluation (which can include multiple evaluators); we expect users to group evaluators that take the same input type together, but we can't prevent people from adding the wrong types, in which case we will return "NOT applicable".
For the input-type and evaluator matching issue:
- The custom input type is only for custom evaluators; if the inputs don't match, we will tell the user.
- Most evaluators will support both simple JSON and foundryAgentData, but there will be exceptions, e.g. we don't expect F1 score to accept foundryAgentData, nor agent evaluators to accept simple JSON. These should be documented properly, and the reason for the error should be included in the evaluation result.
Are there any other good recommendations?
By the way, I suggest separating the input-type and evaluator matching issue from this PR; we can discuss documentation and error handling at the product level. Previously, we were using datasets with no schema, so the matching issue also applied to datasets; it is not new.
From what you described above, you're imagining that evaluators will throw an exception if the input type does not match. And otherwise, it's kind of free-form. Some evaluators will accept both simpleJson and foundryAgentData, some won't, etc. and it's all up to documentation and useful error responses. In that case, I don't see the benefit to the three classes of input. I would just have everything be customJson in that case.
I guess to my mind it's a binary choice -- either we have a strict schema (and break current flexible usage with a bump in api-version), or we have a free schema (and support everything, with the obvious drawbacks). It's fine if that's in another PR, but I would just want to know the direction we're going so we can implement the backend service accordingly (and I would take these explicit fields out of this PR so we don't accidentally commit ourselves to that approach).
I think we should still keep a bias for action: if simpleJson and agent data can help a lot of users, e.g. 60%+ of users (a bit more background: we are expecting more users to use agent data in the near future), then let's add them to help those users, and also leave the free-form type to support custom evaluators?
| "maxHourlyTokenLimit": { | ||
| "type": "integer", | ||
| "format": "int32", | ||
| "description": "Maximum token usage per hour (0 to 1000)." |
1000 tokens per hour would be very little. =)
But more importantly, "token limit" doesn't indicate if it's input tokens or output tokens or a sum of both or a weighted sum of both or what.
Good point, updated as we discussed in Yan's config spec.
I thought where we landed was that Yan was going to talk to customers? Because maxHourlyTokenLimit is confusing, since we have both input tokens and output tokens which are measured and priced separately. There is a tradeoff between fine-grained control and simplicity.
Agreed, the current billing structure is tied directly to the number of input and output tokens produced by evaluation requests. We already know the mapping and can infer the costs at runtime, so shouldn't we be able to allow customers to set a cap on their total spend, e.g. "My limit is $1,000 for every day/week/month," such that when the limit is met, we let them know? That way, it's easier for our customers to plan their COGS.
The problem is I don't think we have access to the billing APIs directly (or such access might be a pain), so we won't know about any discounts customers might have applied. We could have a unit-free number that caps it (e.g., "Evaluation Units"), since we know the ratio between the cost of input and output tokens. But we couldn't express it in dollars.
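A minimal sketch of the unit-free cap idea; the weights and numbers below are placeholders, not real pricing:

```python
# Hypothetical "Evaluation Units": a unit-free number derived from token counts
# via the known input/output cost ratio, so no access to billing APIs is needed.
INPUT_WEIGHT = 1.0   # placeholder weight per input token
OUTPUT_WEIGHT = 4.0  # placeholder: assume output tokens weigh ~4x input tokens

def evaluation_units(input_tokens: int, output_tokens: int) -> float:
    return input_tokens * INPUT_WEIGHT + output_tokens * OUTPUT_WEIGHT

def over_cap(used_units: float, cap_units: float) -> bool:
    """True once the configured per-period cap is reached (notify the customer)."""
    return used_units >= cap_units

used = evaluation_units(input_tokens=120_000, output_tokens=30_000)
print(used, over_cap(used, cap_units=500_000))
```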
| "samplingPercent": { | ||
| "type": "number", | ||
| "format": "float", | ||
| "description": "Percentage of sampling per hour (0-100)." |
Should clarify the sampling grain -- requests, evaluations?
Good point... from the implementation, we know it's per request, but we need a better description.
We can call out the difference between the two, as it's easy to assume that request = evaluation.
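A minimal sketch of what per-request sampling means here, assuming samplingPercent is applied independently to each incoming request (one sampled request can then fan out to several evaluators, so requests and evaluations are not one-to-one):

```python
import random

def should_sample(sampling_percent: float) -> bool:
    """Decide per request whether it enters continuous evaluation."""
    return random.uniform(0.0, 100.0) < sampling_percent

# At samplingPercent = 10, roughly 10% of requests are picked; each picked
# request may still trigger multiple evaluator runs.
picked = sum(should_sample(10.0) for _ in range(1_000))
print(f"sampled ~{picked} of 1000 requests")
```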
| "required": [ | ||
| "id", | ||
| "name", | ||
| "continuousEvaluationConfig" |
What if this config is not for continuous evaluation?
It seems to be an issue with the generated JSON; in the TypeSpec, it's optional...
| @doc("Evaluation target in simple JSON format.") | ||
| @added(Versions.v2025_05_15_preview) | ||
| model EvaluationTargetSimpleJson extends EvaluationTarget { | ||
| @doc("The type of evaluation target.") |
If I create a custom evaluator which takes in question, customer_info, etc., how would I pass data for such an evaluator?
Good question, it seems like we should allow both a structure of query/response/context/groundTruth and also be open to arbitrary fields here... pending a TypeSpec question.
@YusakuNo1 any update here? This will affect how we write some of the backend service.
A new input type InputCustomJson is added; the user can put arbitrary data in it.
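So the earlier custom-evaluator case could be passed along these lines; the payload shape and the discriminator value are illustrative, only the type name InputCustomJson comes from the spec discussion:

```python
# Hypothetical InputCustomJson payload for a custom evaluator that expects
# 'question' and 'customer_info'; the fields are passed through as-is and the
# custom evaluator is responsible for interpreting them.
custom_input = {
    "type": "customJson",  # illustrative discriminator value
    "data": [
        {
            "question": "Can I upgrade my plan mid-cycle?",
            "customer_info": {"tier": "pro", "region": "EU"},
        },
    ],
}
```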
| @doc("Data for evaluation.") | ||
| data: InputData; | ||
| @doc("Deprecated: data for evaluation.") | ||
| data?: InputData; |
How will the case be handled where the customer provides:
- data and a model
In this case the evaluation API does inference and then evaluation (current behaviour).
This use case will be covered by modelPrompt from Prachi's PR: https://github.com/Azure/azure-rest-api-specs/pull/35021/files . As we don't know all the future requirements, Prachi's PR is a good example of how to add new targets in the future. By the way, Nagendra's Red Teaming is another new target (the PR is not finalized yet): https://github.com/Azure/azure-rest-api-specs/pull/35426/files
| "maxHourlyTokenLimit" | ||
| ] | ||
| }, | ||
| "CosmosDBIndex": { |
TypeSpec is missing these definitions. Can you please add them there?
This is an issue outside the current change; let me talk to Darren and address it in a separate PR.
Making a PR for output modeling: https://github.com/Azure/azure-rest-api-specs/pull/35627/files (old: https://github.com/Azure/azure-rest-api-specs/pull/35475/files)
| @doc("Abstract base model for defining evaluation targets.") | ||
| @discriminator("type") | ||
| @added(Versions.v2025_05_15_preview) | ||
| model EvaluationTarget { |
I can't remember where any comments on the issue are -- but I know in the past we talked about including some sort of way for users to include correlation IDs with a request so that they can map evaluation results back to their source data. For agent evals, this is done automatically with runId and threadId; but for generic createEval, when users may not be using agents, they will want to supply sessionId or inferenceId or experimentId or correlationId or... in other words, arbitrary IDs that have some significance in the source application that can be used for correlation and consumption of downstream results. Is that anywhere in here?
Yep, correlationId is added as an optional parameter; we will figure out how to run indexing for it on the backend.
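For example, an item could carry a caller-supplied ID like this (hypothetical shape; only correlationId itself is the field being added):

```python
# A caller-supplied correlation ID lets downstream results be joined back to
# records in the source application (sessions, inferences, experiments, ...).
eval_item = {
    "correlationId": "session-42/inference-0017",  # any ID meaningful to the caller
    "query": "...",
    "response": "...",
}
```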
Force-pushed from 4cc6c83 to bfb404c.
| @doc("Simple JSON as source for evaluation.") | ||
| @added(Versions.v2025_05_15_preview) | ||
| @removed(Versions.v1) | ||
| model InputFoundryAgentData extends InputData { |
InputFoundryAgentData demonstrates how it'll be used, but since the TypeSpec in Azure.AI.Agents is not finalized, this InputData type won't be merged.
Hi, @YusakuNo1. Your PR has had no update for 14 days and is marked as a stale PR. If there is no further update for over 14 days, the bot will close the PR. If you want to refresh the PR, please remove …
Hi, @YusakuNo1. The PR will be closed since it has had no update for 28 days. If you still need the PR review to proceed, please reopen it and @-mention the PR assignee.