Eval API converge #35312
Conversation
Next Steps to Merge: ✅ All automated merging requirements have been met! To get your PR merged, see aka.ms/azsdk/specreview/merge.
PR validation pipeline restarted successfully. If there is an ApiView generated, it will be updated in this comment.
API Change Check: APIView identified API-level changes in this PR and created the following API reviews.
```typespec
disableEvaluationRunPersistence?: boolean;

@doc("Extra storage options for evaluation runs, the options includes Kusto, Blob storage, App Insights, etc.")
extraEvaluationRunStorageOptions?: Array<unknown>;
```
Is this WIP to define storage options instead of unknown?
Yes, we only have a spec for the list of storage options, but we haven't aligned on whether we will only allow an AI Foundry connection.
Well, App Insights is currently a custom configuration for agent evaluations, so at least we would need to support that? In addition to Kusto, ADLS, etc.
```typespec
continuousEvaluationConfig?: ContinuousEvaluationConfig;

@doc("Allow the user to opt-out of evaluation runs being persisted in the AI Foundry. The default value is false. If it's false, the evaluation runs will not be persisted.")
disableEvaluationRunPersistence?: boolean;
```
What does it mean here to be persisted? If it's set by the user, do we persist the runs to the current cache or somewhere else?
If it's "true", the data will be stored as an asset, which will be available until the user deletes it; otherwise, it'll be in a cache for N hours.
So to clarify, if I want to persist results from continuous monitoring (and so presumably the volume is such that I don't want to use assets) then I would want disableEvaluationRunPersistence = false, but extraEvaluationRunStorageOptions non-empty?
If that is the case, we may want to rename it... since disableEvaluationRunPersistence makes me think nothing will be saved anywhere?
I agree, we have multiple storage tiers, from hottest/managed to coolest/customer-owned: 1) cache, 2) asset store, 3) App Insights/ADLS/Kusto, etc.
If enabling persistence allows users to get the results past the cache's 24h expiration mark, then can we name it something like storePermanentApiGet or something of the sort, to signify why someone would want to use assets (and most probably get billed for those)?
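To make the combinations being discussed concrete, here is a rough sketch as plain Python dicts. Only the two property names come from the TypeSpec above; the option shapes, the connection field, and the cache window are assumptions for illustration, not the actual wire format.

```python
# Illustrative configs for the three storage tiers discussed above.
# NOTE: hypothetical shapes; the exact semantics of the flag are still under discussion.

# 1) Cache only: results expire after the cache window (roughly 24h per this thread).
cache_only = {
    "disableEvaluationRunPersistence": True,   # opt out of asset-store persistence
    "extraEvaluationRunStorageOptions": [],    # no customer-owned sinks
}

# 2) Asset store: results are kept until the user deletes them (and may be billed).
asset_store = {
    "disableEvaluationRunPersistence": False,
    "extraEvaluationRunStorageOptions": [],
}

# 3) High-volume continuous monitoring: skip assets and fan results out to
#    customer-owned storage (App Insights / ADLS / Kusto); option shape is TBD.
customer_owned = {
    "disableEvaluationRunPersistence": True,
    "extraEvaluationRunStorageOptions": [
        {"kind": "appInsights", "connectionName": "<foundry-connection>"},  # hypothetical
    ],
}
```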
| @doc("Percentage of sampling per hour (0-100).") | ||
| samplingPercent: float32; | ||
|
|
||
| @doc("Maximum token usage per hour (0 to 1000).") |
Are these ranges enforced anywhere within this spec?
Good point, let me remove "0 to 1000"
```typespec
@doc("Evaluation target in simple JSON format.")
@added(Versions.v2025_05_15_preview)
model EvaluationTargetSimpleJson extends EvaluationTarget {
```
What is this meant to represent? I suppose it's not for agents, but for something else?
It's the simple JSON format we support for evaluation: https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/evaluation/azure-ai-evaluation/samples/data/evaluate_test_data.jsonl. I think it's helpful to keep this flat, simple version to help customers test our product. What do you think?
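For reference, a row of that flat format looks roughly like the following; the field names follow the linked sample, and the values are made up.

```python
import json

# One record of the flat "simple JSON" (JSONL) evaluation input; a JSONL file
# is just one such object per line. Values are invented for illustration.
record = {
    "query": "What is the capital of France?",
    "context": "France is a country in Western Europe. Its capital is Paris.",
    "response": "The capital of France is Paris.",
    "ground_truth": "Paris",
}

print(json.dumps(record))
```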
```typespec
response: string;

@doc("The context for the LLM query.")
context?: string;
```
How is context different from query? Because in the agent world, query has all the context (i.e., the conversation, previous runs, etc.).
It's the "compatible mode" for non-agent (a.k.a. inferencing) as well as custom evaluators, while the AI developer wants to run agentic data, we recommend them to go for another target EvaluationTargetFoundryAgentData
I think the big thing to recognize here is that we are essentially delegating all input schema requirements to individual evaluators. An evaluator will define what "context" or "query" etc should mean, how those should be formatted, whether any additional fields are accepted or required, and so on. So the eval API makes no judgement or validation, that is left up to individual evaluators.
Agree with what Steve & Sandy suggested; we do need a conversation input type. Let me add it shortly.
Maybe I don't follow what you mean exactly; let me try to explain my point of view a bit more. I think there are two high-level approaches:
- a strongly-typed schema. We have certain fields, and we have opinions on what those fields should look like. At the API level, if someone submits a request, we can validate those fields. All evaluators know what to expect from a given field.
- weak or no typing. Essentially all fields are just strings. We won't do much validation on fields at the API layer; evaluators will have to validate fields themselves.
At this point, my feeling is we're most likely looking at option #2, if we need to preserve backward compatibility (which we probably do because some evaluators are GA). For example, for some legacy evaluators, query and response should be simple strings. For a newer evaluator, query and response should be serialized json messages matching a certain format. For another evaluator (ISA), context should be a conversation history, query should be a simple string prompt, and response should be a simple string. For groundedness, context should be some source material. Code vulnerability I can't even remember, but it doesn't match any of the above. And so on. So there's already no consistency between existing evaluators, and if we're going to add 3P evaluators then the situation will get even worse.
If we take as given that we won't have strong typing, then trying to add enough fields to cover all the use cases also seems like an endless task. Right now, if we combine all the fields that any evaluator might want to use (from existing AIFO and Nunatak), it would include:
query
response
conversation
prompt
tool_calls
tool_definitions
invoked_skills
context
ground_truth
additional_metadata <-- this is a catch-all that is used to provide other fields like agent_purpose, agent_name, schema, query_results, chosen_skills, etc.
So -- we could try to make an effort to rationalize different evaluators so that they all map to a common schema (e.g., query and conversation are sometimes interchangeable, and we could go with one and make evaluators migrate). That could involve some breaking changes, but those might be palatable with API and evaluator versioning. Or we could just let the API accept any json, and it's up to the evaluators to deal with it. I don't have a strong opinion, other than that we should make a choice and plan for it.
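A rough sketch of the two approaches, using plain Python dicts with hypothetical field names drawn from the list above (not a committed schema):

```python
# Option 1: strongly-typed schema. The API knows the fields up front and can
# validate a request before any evaluator runs.
typed_item = {
    "query": "How do I reset my password?",
    "response": "Go to Settings > Security and choose 'Reset password'.",
    "context": "Support KB article ...",
    "ground_truth": "Settings > Security > Reset password",
}

def validate_typed(item: dict) -> None:
    """API-level validation is only possible when the schema is fixed."""
    missing = {"query", "response"} - item.keys()
    if missing:
        raise ValueError(f"missing required fields: {sorted(missing)}")

# Option 2: weak/no typing. The API passes the payload through untouched and
# each evaluator interprets or rejects whatever fields it needs.
freeform_item = {
    "conversation": [{"role": "user", "content": "..."}],
    "tool_calls": [],
    "additional_metadata": {"agent_purpose": "support bot"},  # catch-all bucket
}

validate_typed(typed_item)       # passes
# validate_typed(freeform_item)  # would raise: no 'query'/'response' fields
```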
I'm just imagining the matrix in help docs... For the simpleJson type, inputs must look like this and these evaluators are available; for the foundryAgentData type, inputs must look like this and this different set is available; for the customJson type, inputs can look like anything and this other set is available. I feel like this might add more confusion?
Very good question. This is the data input for a single evaluation (which can include multiple evaluators); we expect users to group evaluators that take the same input type together, but we can't prevent people from adding the wrong types, in which case we will return "NOT applicable".
For the input-type and evaluator matching issue:
- The custom input type is only for custom evaluators; if the inputs don't match, we will tell the user.
- Most evaluators will support both simple JSON and foundryAgentData, but there will be exceptions, e.g. we don't expect F1 score to accept foundryAgentData, nor agent evaluators to accept simple JSON. These should be documented properly, and the reason for the error should be included in the evaluation result.
Are there any other good recommendations?
By the way, I suggest separating the input-type and evaluator matching issue from this PR; we can discuss documentation and error handling at the product level. Previously, we were using datasets with no schema, so the matching issue also applied to datasets; it is not new.
From what you described above, you're imagining that evaluators will throw an exception if the input type does not match. And otherwise, it's kind of free-form. Some evaluators will accept both simpleJson and foundryAgentData, some won't, etc. and it's all up to documentation and useful error responses. In that case, I don't see the benefit to the three classes of input. I would just have everything be customJson in that case.
I guess to my mind it's a binary choice -- either we have a strict schema (and break current flexible usage with a bump in api-version), or we have a free schema (and support everything, with the obvious drawbacks). It's fine if that's in another PR, but I would just want to know the direction we're going so we can implement the backend service accordingly (and I would take these explicit fields out of this PR so we don't accidentally commit ourselves to that approach).
I think we should still keep a bias for action: if simpleJson and agent data can help a lot of users, e.g. 60%+ of users (a bit more background: we are expecting more users to use agent data in the near future), then let's add them to help those users, and also leave the free-form type to support custom evaluators?
| "maxHourlyTokenLimit": { | ||
| "type": "integer", | ||
| "format": "int32", | ||
| "description": "Maximum token usage per hour (0 to 1000)." |
1000 tokens per hour would be very little. =)
But more importantly, "token limit" doesn't indicate if it's input tokens or output tokens or a sum of both or a weighted sum of both or what.
Good point, updated as we discussed in Yan's config spec.
I thought where we landed was that Yan was going to talk to customers? Because maxHourlyTokenLimit is confusing, since we have both input tokens and output tokens which are measured and priced separately. There is a tradeoff between fine-grained control and simplicity.
Agreed, the current billing structure is tied directly to the number of input and output tokens produced by evaluation requests. We already know the mapping and can infer the costs at runtime, so shouldn't we be able to allow customers to set a cap on their total spend, e.g. "My limit is $1,000 for every day/week/month," such that when the limit is met, we let them know? That way, it's easier for our customers to plan their COGS.
The problem is I don't think we have access to the billing APIs directly (or such access might be a pain), so we won't know about any discounts customers might have applied. We could have a unit-free number that caps it (e.g., "Evaluation Units"), since we know the ratio between the cost of input and output tokens. But we couldn't express it in dollars.
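A minimal sketch of the unit-free cap idea; the weights and numbers below are placeholders, not real pricing:

```python
# Hypothetical "Evaluation Units": a unit-free number derived from token counts
# via the known input/output cost ratio, so no access to billing APIs is needed.
INPUT_WEIGHT = 1.0   # placeholder weight per input token
OUTPUT_WEIGHT = 4.0  # placeholder: assume output tokens weigh ~4x input tokens

def evaluation_units(input_tokens: int, output_tokens: int) -> float:
    return input_tokens * INPUT_WEIGHT + output_tokens * OUTPUT_WEIGHT

def over_cap(used_units: float, cap_units: float) -> bool:
    """True once the configured per-period cap is reached (notify the customer)."""
    return used_units >= cap_units

used = evaluation_units(input_tokens=120_000, output_tokens=30_000)
print(used, over_cap(used, cap_units=500_000))
```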
| "samplingPercent": { | ||
| "type": "number", | ||
| "format": "float", | ||
| "description": "Percentage of sampling per hour (0-100)." |
Should clarify the sampling grain -- requests, evaluations?
Good point... from the implementation, we know it's per request, but we need a better description.
We can call out the difference between the two, as it's easy to assume that request = evaluation.
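A minimal sketch of what per-request sampling means here, assuming samplingPercent is applied independently to each incoming request (one sampled request can then fan out to several evaluators, so requests and evaluations are not one-to-one):

```python
import random

def should_sample(sampling_percent: float) -> bool:
    """Decide per request whether it enters continuous evaluation."""
    return random.uniform(0.0, 100.0) < sampling_percent

# At samplingPercent = 10, roughly 10% of requests are picked; each picked
# request may still trigger multiple evaluator runs.
picked = sum(should_sample(10.0) for _ in range(1_000))
print(f"sampled ~{picked} of 1000 requests")
```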
| "required": [ | ||
| "id", | ||
| "name", | ||
| "continuousEvaluationConfig" |
What if this config is not for continuous evaluation?
It seems to be an issue with the generated JSON; in the TypeSpec, it's optional...
| @doc("Evaluation target in simple JSON format.") | ||
| @added(Versions.v2025_05_15_preview) | ||
| model EvaluationTargetSimpleJson extends EvaluationTarget { | ||
| @doc("The type of evaluation target.") |
If I create a custom evaluator which takes in question, customer_info, etc., how would I pass data for such an evaluator?
Good question, it seems like we should allow both a structure of query/response/context/groundTruth and also be open to arbitrary fields here... pending a TypeSpec question.
@YusakuNo1 any update here? This will affect how we write some of the backend service.
A new input type InputCustomJson is added; the user can put arbitrary data in it.
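So the earlier custom-evaluator case could be passed along these lines; the payload shape and the discriminator value are illustrative, only the type name InputCustomJson comes from the spec discussion:

```python
# Hypothetical InputCustomJson payload for a custom evaluator that expects
# 'question' and 'customer_info'; the fields are passed through as-is and the
# custom evaluator is responsible for interpreting them.
custom_input = {
    "type": "customJson",  # illustrative discriminator value
    "data": [
        {
            "question": "Can I upgrade my plan mid-cycle?",
            "customer_info": {"tier": "pro", "region": "EU"},
        },
    ],
}
```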
| @doc("Data for evaluation.") | ||
| data: InputData; | ||
| @doc("Deprecated: data for evaluation.") | ||
| data?: InputData; |
How will the case be handled where the customer provides:
- data and a model
In this case the evaluation API does inference and then evaluation (current behaviour).
This use case will be covered by modelPrompt from Prachi's PR: https://github.com/Azure/azure-rest-api-specs/pull/35021/files . As we don't know all the future requirements, Prachi's PR is a good example of how to add new targets in the future. By the way, Nagendra's Red Teaming is another new target (the PR is not finalized yet): https://github.com/Azure/azure-rest-api-specs/pull/35426/files
| "maxHourlyTokenLimit" | ||
| ] | ||
| }, | ||
| "CosmosDBIndex": { |
TypeSpec is missing these definitions. Can you please add them there?
This is an issue outside the current change; let me talk to Darren and address it in a separate PR.
Making a PR for output modeling: https://github.com/Azure/azure-rest-api-specs/pull/35627/files (old: https://github.com/Azure/azure-rest-api-specs/pull/35475/files)
| @doc("Abstract base model for defining evaluation targets.") | ||
| @discriminator("type") | ||
| @added(Versions.v2025_05_15_preview) | ||
| model EvaluationTarget { |
I can't remember where any comments on the issue are -- but I know in the past we talked about including some sort of way for users to include correlation IDs with a request so that they can map evaluation results back to their source data. For agent evals, this is done automatically with runId and threadId; but for generic createEval, when users may not be using agents, they will want to supply sessionId or inferenceId or experimentId or correlationId or... in other words, arbitrary IDs that have some significance in the source application that can be used for correlation and consumption of downstream results. Is that anywhere in here?
Yep, correlationId is added as an optional parameter; we will figure out how to run indexing for it on the backend.
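For example, an item could carry a caller-supplied ID like this (hypothetical shape; only correlationId itself is the field being added):

```python
# A caller-supplied correlation ID lets downstream results be joined back to
# records in the source application (sessions, inferences, experiments, ...).
eval_item = {
    "correlationId": "session-42/inference-0017",  # any ID meaningful to the caller
    "query": "...",
    "response": "...",
}
```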
Force-pushed from 4cc6c83 to bfb404c.
| @doc("Simple JSON as source for evaluation.") | ||
| @added(Versions.v2025_05_15_preview) | ||
| @removed(Versions.v1) | ||
| model InputFoundryAgentData extends InputData { |
InputFoundryAgentData demonstrates how it'll be used, but since the TypeSpec in Azure.AI.Agents is not finalized, this InputData type won't be merged.
Hi, @YusakuNo1. Your PR has had no update for 14 days and is marked as a stale PR. If there is no further update for over 14 days, the bot will close the PR. If you want to refresh the PR, please remove …
Hi, @YusakuNo1. The PR will be closed since it has had no update for 28 days. If you still need the PR review to proceed, please reopen it and @-mention the PR assignee.