Eval API converge #35312
Closed
Changes from 1 commit
```tsp
@@ -46,6 +46,32 @@ model InputDataset extends InputData {
  id: string;
}

@doc("Simple JSON as source for evaluation.")
@added(Versions.v2025_05_15_preview)
@removed(Versions.v1)
model InputSimpleJson extends InputData {
  type: "simpleJson";

  @doc("The LLM query.")
  query: string;

  @doc("The LLM response.")
  response: string;

  @doc("The context for the LLM query.")
  context?: string;

  @doc("The ground truth for the LLM response.")
  groundTruth?: string;
}

@doc("Foundry agent data as source for evaluation.")
@added(Versions.v2025_05_15_preview)
@removed(Versions.v1)
model InputFoundryAgentData extends InputData {
  type: "foundryAgentData";
}

@doc("Evaluation Definition")
@resource("runs")
@added(Versions.v2025_05_15_preview)

@@ -60,6 +86,12 @@ model Evaluation {
  @doc("Data for evaluation.")
  data: InputData;

  @doc("Correlation Id for the evaluation. It is used to correlate the evaluation with other resources, e.g. to link a Databricks job Id and run Id with an AI Foundry evaluation.")
  correlationId?: string;

  @doc("Configuration for the evaluation.")
  config?: EvaluationConfig;

  @doc("Display name for the evaluation. It helps to find the evaluation easily in AI Foundry. It does not need to be unique.")
  displayName?: string;

@@ -80,6 +112,37 @@ model Evaluation {
  evaluators: Record<EvaluatorConfiguration>;
}

@doc("The redaction configuration allows the user to control what is redacted.")
@added(Versions.v2025_05_15_preview)
@removed(Versions.v1)
model EvaluationRedactionConfiguration {
  @doc("Redact score properties. If not specified, the default is to redact in production.")
  redactScoreProperties?: boolean;
}

@doc("Evaluation Configuration Definition")
@resource("configs")
@added(Versions.v2025_05_15_preview)
@removed(Versions.v1)
model EvaluationConfiguration {
  @doc("Identifier of the evaluation configuration.")
  @key("id")
  @visibility(Lifecycle.Create, Lifecycle.Update, Lifecycle.Read)
  id: string;

  @doc("Name of the evaluation configuration.")
  name: string;

  @doc("Allow the user to opt out of evaluation runs being persisted in AI Foundry. The default value is false. If it's true, the evaluation runs will not be persisted.")
  disableRunPersistence?: boolean;

  @doc("Redaction configuration for the evaluation.")
  redactionConfig?: EvaluationRedactionConfiguration;

  @doc("Extra storage options for evaluation runs; options include Kusto, Blob storage, App Insights, etc.")
  extraResultStorageOptions?: Array<unknown>;
}

@doc("Definition for sampling strategy.")
@added(Versions.v2025_05_15_preview)
@removed(Versions.v1)
```
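For context, each of these input models extends a common base discriminated by the literal `type` field. The base model is outside this hunk; presumably it looks something like the following sketch (an assumption, not part of the diff):

```tsp
// Presumed shape of the base model that InputDataset, InputSimpleJson,
// and InputFoundryAgentData extend; the discriminator follows from the
// literal "type" field on each subtype.
@doc("Abstract base for evaluation input data.")
@discriminator("type")
model InputData {
  type: string;
}
```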
How is `context` different from `query`? Because in the agent world, `query` has all the context (aka conversation, previous runs, etc.).
It's the "compatible mode" for non-agent (a.k.a. inferencing) scenarios as well as custom evaluators. When an AI developer wants to run agentic data, we recommend they go for another target, `EvaluationTargetFoundryAgentData`.
I think the big thing to recognize here is that we are essentially delegating all input schema requirements to individual evaluators. An evaluator will define what "context" or "query" etc. should mean, how those should be formatted, whether any additional fields are accepted or required, and so on. So the eval API makes no judgment and does no validation; that is left up to individual evaluators.
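To make that delegation concrete, a fully free-form input under this approach could be specced roughly as below. This is a hypothetical sketch: the PR defines no `InputCustomJson` model, and the catch-all property bag is an assumption.

```tsp
@doc("Free-form JSON as source for evaluation; schema requirements are delegated to individual evaluators.")
model InputCustomJson extends InputData {
  type: "customJson";

  // Hypothetical catch-all: arbitrary evaluator-specific fields
  // pass through the API without validation.
  ...Record<unknown>;
}
```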
Agreed with what Steve & Sandy suggested; we do need a conversation input type. Let me add it shortly.
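For illustration only, such a conversation input type might look something like the sketch below; the actual model was not part of this commit, and both the model name and the message shape are assumptions.

```tsp
@doc("A single message in a conversation.")
model ConversationMessage {
  @doc("Role of the message author, for example user or assistant.")
  role: string;

  @doc("Content of the message.")
  content: string;
}

@doc("Conversation history as source for evaluation.")
model InputConversation extends InputData {
  type: "conversation";

  @doc("Ordered messages that make up the conversation.")
  messages: Array<ConversationMessage>;
}
```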
Maybe I don't follow what you mean exactly; let me try to explain my point of view a bit more. I think there are two high-level approaches:

1. A strongly-typed schema. We have certain fields, and we have opinions on what those fields should look like. At the API level, if someone submits a request, we can validate those fields. All evaluators know what to expect from a given field.
2. Weak or no typing. Essentially all fields are just strings. We won't do much validation on fields at the API layer; evaluators will have to validate fields themselves.
At this point, my feeling is we're most likely looking at option #2, if we need to preserve backward compatibility (which we probably do, because some evaluators are GA). For example, for some legacy evaluators, query and response should be simple strings. For a newer evaluator, query and response should be serialized JSON messages matching a certain format. For another evaluator (ISA), context should be a conversation history, query should be a simple string prompt, and response should be a simple string. For groundedness, context should be some source material. Code vulnerability I can't even remember, but it doesn't match any of the above. And so on. So there's already no consistency between existing evaluators, and if we're going to add 3P evaluators the situation will get even worse.
If we take as given that we won't have strong typing, then trying to add enough fields to cover all the use cases also seems like an endless task. Right now, if we combine all the fields that any evaluator might want to use (from existing AIFO and Nunatak), it would include:
- query
- response
- conversation
- prompt
- tool_calls
- tool_definitions
- invoked_skills
- context
- ground_truth
- additional_metadata (a catch-all used to provide other fields like agent_purpose, agent_name, schema, query_results, chosen_skills, etc.)
So: we could try to make an effort to rationalize the different evaluators so that they all map to a common schema (e.g., query and conversation are sometimes interchangeable, so we could go with one and make evaluators migrate). That could involve some breaking changes, but those might be palatable with API and evaluator versioning. Or we could just let the API accept any JSON and leave it up to the evaluators to deal with it. I don't have a strong opinion, other than that we should make a choice and plan for it.
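To give option #1 a concrete shape, a single strongly-typed model covering the union of those fields might look roughly like this (illustrative only: the property names are camelCased from the list above, and every type shown is an assumption):

```tsp
@doc("Illustrative union of input fields used across existing evaluators.")
model InputAllEvaluatorFields {
  query?: string;
  response?: string;
  conversation?: string;
  prompt?: string;
  toolCalls?: Array<unknown>;
  toolDefinitions?: Array<unknown>;
  invokedSkills?: Array<string>;
  context?: string;
  groundTruth?: string;

  @doc("Catch-all for evaluator-specific fields such as agent_purpose, agent_name, schema, query_results, chosen_skills.")
  additionalMetadata?: Record<unknown>;
}
```

Even this sketch glosses over the deeper problem described above: the same field (e.g. query) means different things to different evaluators, so a common schema would require rationalizing those meanings, not just listing the fields.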
I'm just imagining the matrix in help docs... for the simpleJson type, inputs must look like this and these evaluators are available; for the foundryAgentData type, inputs must look like that and a different set is available; for the customJson type, inputs can look like anything and yet another set is available. I feel like this might add more confusion?
Very good question. This is the data input for a single evaluation (which includes multiple evaluators); we expect users to group evaluators that share the same input type together. We can't prevent people from adding the wrong types, and we will return "not applicable" in those cases.

As for the input-type and evaluator matching issue: are there any other good recommendations?
By the way, I suggest separating the input-type and evaluator matching issue from this PR; we can discuss it at the product level for documentation and error handling. Previously we were using datasets with no schema, and the matching issue applied to datasets as well, so this is not new.
From what you described above, you're imagining that evaluators will throw an exception if the input type does not match, and otherwise it's kind of free-form: some evaluators will accept both simpleJson and foundryAgentData, some won't, etc., and it's all up to documentation and useful error responses. In that case, I don't see the benefit of the three classes of input; I would just have everything be customJson.

I guess to my mind it's a binary choice: either we have a strict schema (and break current flexible usage with a bump in api-version), or we have a free schema (and support everything, with the obvious drawbacks). It's fine if that's decided in another PR, but I would want to know the direction we're going so we can implement the backend service accordingly (and I would take these explicit fields out of this PR so we don't accidentally commit ourselves to that approach).
I think we should still keep a bias for action: if simpleJson and agent data can help a lot of users, e.g. 60%+ (for a bit more background, we are expecting more users to use agent data in the near future), then let's add them to help those users, and also leave the free-form option to support custom evaluators.