specification/ai/Azure.AI.Projects/evaluations/models.tsp (63 additions, 0 deletions)
@@ -46,6 +46,32 @@ model InputDataset extends InputData {
  id: string;
}

@doc("Simple JSON as source for evaluation.")
@added(Versions.v2025_05_15_preview)
@removed(Versions.v1)
model InputSimpleJson extends InputData {
type: "simpleJson";

@doc("The LLM query.")
query: string;

@doc("The LLM response.")
response: string;

@doc("The context for the LLM query.")
context?: string;

How is context different from query? Because in agent world, query has all the context (aka, conversation, previous runs, etc.)

Member Author

It's the "compatible mode" for non-agent (a.k.a. inferencing) as well as custom evaluators, while the AI developer wants to run agentic data, we recommend them to go for another target EvaluationTargetFoundryAgentData


I think the big thing to recognize here is that we are essentially delegating all input schema requirements to individual evaluators. An evaluator will define what "context" or "query" etc. should mean, how those should be formatted, whether any additional fields are accepted or required, and so on. So the eval API makes no judgement and does no validation; that is left up to individual evaluators.

Member Author

Agree with what Steve & Sandy suggested; we do need a conversation input type. Let me add it shortly.


Maybe I don't follow what you mean exactly; let me try to explain my point of view a bit more. I think there are two high-level approaches:

  1. a strongly-typed schema. We have certain fields, and we have opinions on what those fields should look like. At the API level, if someone submits a request, we can validate those fields. All evaluators know what to expect from a given field.

  2. weak or no typing. Essentially all fields are just strings. We won't do much validation on fields at the API layer; evaluators will have to validate fields themselves.

At this point, my feeling is we're most likely looking at option #2, if we need to preserve backward compatibility (which we probably do because some evaluators are GA). For example, for some legacy evaluators, query and response should be simple strings. For a newer evaluator, query and response should be serialized json messages matching a certain format. For another evaluator (ISA), context should be a conversation history, query should be a simple string prompt, and response should be a simple string. For groundedness, context should be some source material. Code vulnerability I can't even remember, but it doesn't match any of the above. And so on. So there's already no consistency between existing evaluators, and if we're going to add 3P evaluators then the situation will get even worse.

If we take as given that we won't have strong typing, then trying to add enough fields to cover all the use cases also seems like an endless task. Right now, if we combine all the fields that any evaluator might want to use (from existing AIFO and Nunatak), it would include:

- query
- response
- conversation
- prompt
- tool_calls
- tool_definitions
- invoked_skills
- context
- ground_truth
- additional_metadata <-- this is a catch-all that is used to provide other fields like agent_purpose, agent_name, schema, query_results, chosen_skills, etc.

So -- we could try to make an effort to rationalize different evaluators so that they all map to a common schema (e.g., query and conversation are sometimes interchangeable, and we could go with one and make evaluators migrate). That could involve some breaking changes, but those might be palatable with API and evaluator versioning. Or we could just let the API accept any json, and it's up to the evaluators to deal with it. I don't have a strong opinion, other than that we should make a choice and plan for it.
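To make that concrete, here is a rough, made-up sketch of how the same field names already carry different payloads for different evaluators (the values and shapes are illustrative, not taken from any evaluator's actual contract):

```python
# Illustrative only: both rows feed "query"/"response"-style fields, but the
# payload shapes differ per evaluator, which is what makes one strict schema hard.

# Legacy / inference-style evaluator: plain strings.
legacy_row = {
    "query": "What is the refund policy?",
    "response": "Refunds are accepted within 30 days.",
    "context": "Refund policy: items may be returned within 30 days of purchase.",
}

# Agent-style evaluator: "query" is a serialized message list, and tool calls
# travel alongside it.
agentic_row = {
    "query": [
        {"role": "user", "content": "What is the refund policy?"},
        {"role": "assistant", "content": "Let me check the knowledge base."},
    ],
    "response": [
        {"role": "assistant", "content": "Refunds are accepted within 30 days."}
    ],
    "tool_calls": [
        {"name": "search_kb", "arguments": {"query": "refund policy"}}
    ],
}
```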


I'm just imagining the matrix in help docs... For simpleJson type, inputs must look like this and these evaluators are available, for foundryAgentData type, inputs must look like this and this different set is available, for customJson type, inputs can look like anything and this other set is available. I feel like this might add more confusion?

Member Author

Very good question. This is a data input for a single evaluation (which includes multiple evaluators). We expect users to group evaluators relevant to the same input type together, but we can't prevent people from adding the wrong types; we will return "NOT applicable" for those cases.

For the input-type and evaluator matching issue:

  1. The custom input type is only for custom evaluators; if the inputs don't match, we will tell the user.

  2. Most evaluators will support both simple JSON and foundryAgentData, but there will be exceptions, e.g. we wouldn't expect F1 score to accept foundryAgentData, nor agent evaluators to accept simple JSON. These should be documented properly, and the reason for the error should also be mentioned in the evaluation result.

Are there any other good recommendations?

Member Author

By the way, I suggest separating the input-type and evaluator matching issue from this PR; we can discuss documentation and error handling at the product level -- previously we were using datasets with no schema, and the matching issue applies to datasets too, so it is not new.


From what you described above, you're imagining that evaluators will throw an exception if the input type does not match. And otherwise, it's kind of free-form. Some evaluators will accept both simpleJson and foundryAgentData, some won't, etc. and it's all up to documentation and useful error responses. In that case, I don't see the benefit to the three classes of input. I would just have everything be customJson in that case.

I guess to my mind it's a binary choice -- either we have a strict schema (and break current flexible usage with a bump in api-version), or we have a free schema (and support everything, with the obvious drawbacks). It's fine if that's in another PR, but I would just want to know the direction we're going so we can implement the backend service accordingly (and I would take these explicit fields out of this PR so we don't accidentally commit ourselves to that approach).

Member Author

I think we should still consider the bias for action: if simpleJson and agent data can help a lot of users, e.g. 60%+ of users (a bit more background: we are expecting more users to use agent data in the near future), then let's add them to help those users, and also leave the free-form option to support custom evaluators?


@doc("The ground truth for the LLM response.")
groundTruth?: string;
}
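For illustration, a hypothetical request body matching the InputSimpleJson model might look like the sketch below (all field values are invented; only the field names and the discriminator come from the model):

```python
import json

# Hypothetical InputSimpleJson payload; values are illustrative only.
simple_json_input = {
    "type": "simpleJson",
    "query": "What is the capital of France?",
    "response": "The capital of France is Paris.",
    "context": "Paris is the capital and most populous city of France.",
    "groundTruth": "Paris",
}

print(json.dumps(simple_json_input, indent=2))
```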

@doc("Simple JSON as source for evaluation.")
@added(Versions.v2025_05_15_preview)
@removed(Versions.v1)
model InputFoundryAgentData extends InputData {

Member Author

InputFoundryAgentData demonstrates how it will be used, but since the TypeSpec in Azure.AI.Agents is not finalized, this InputData type won't be merged.

type: "foundryAgentData";
}

@doc("Evaluation Definition")
@resource("runs")
@added(Versions.v2025_05_15_preview)
@@ -60,6 +86,12 @@ model Evaluation {
@doc("Data for evaluation.")
data: InputData;

@doc("Correlation Id for the evaluation. It is used to correlate the evaluation with other resources, e.g. link DataBricks job Id and run Id with AI Foundry evaluation.")
correlationId?: string;

@doc("Configuration for the evaluation.")
config?: EvaluationConfig;

@doc("Display Name for evaluation. It helps to find the evaluation easily in AI Foundry. It does not need to be unique.")
displayName?: string;

@@ -80,6 +112,37 @@ model Evaluation {
  evaluators: Record<EvaluatorConfiguration>;
}
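For illustration, a hypothetical create-evaluation request body exercising the new correlationId field with a simpleJson data source might look like the sketch below (values are invented, and the evaluator entry is only a placeholder for an EvaluatorConfiguration object defined elsewhere in the spec):

```python
import json

# Hypothetical Evaluation payload; values are illustrative only.
evaluation = {
    "displayName": "groundedness-spot-check",
    "correlationId": "databricks-job-1234-run-5678",
    "data": {
        "type": "simpleJson",
        "query": "What is the capital of France?",
        "response": "The capital of France is Paris.",
    },
    "evaluators": {
        # Placeholder entry; the real shape is defined by EvaluatorConfiguration.
        "groundedness": {},
    },
}

print(json.dumps(evaluation, indent=2))
```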

@doc("The redaction configuration will allow the user to control what is redacted.")
@added(Versions.v2025_05_15_preview)
@removed(Versions.v1)
model EvaluationRedactionConfiguration {
@doc("Redact score properties. If not specified, the default is to redact in production.")
redactScoreProperties?: boolean;
}

@doc("Evaluation Configuration Definition")
@resource("configs")
@added(Versions.v2025_05_15_preview)
@removed(Versions.v1)
model EvaluationConfiguration {
@doc("Identifier of the evaluation configuration.")
@key("id")
@visibility(Lifecycle.Create, Lifecycle.Update, Lifecycle.Read)
id: string;

@doc("Name of the evaluation configuration.")
name: string;

@doc("Allow the user to opt-out of evaluation runs being persisted in the AI Foundry. The default value is false. If it's false, the evaluation runs will not be persisted.")
disableRunPersistence?: boolean;

@doc("Redaction configuration for the evaluation.")
redactionConfig?: EvaluationRedactionConfiguration;

@doc("Extra storage options for evaluation runs, the options includes Kusto, Blob storage, App Insights, etc.")
extraResultStorageOptions?: Array<unknown>;
}
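For illustration, a hypothetical EvaluationConfiguration resource body might look like the sketch below (the id, name, and option values are invented; only the field names come from the model):

```python
import json

# Hypothetical EvaluationConfiguration payload; values are illustrative only.
evaluation_config = {
    "id": "default-config",
    "name": "Default evaluation configuration",
    "disableRunPersistence": False,
    "redactionConfig": {"redactScoreProperties": True},
    # Extra result sinks (Kusto, Blob storage, App Insights, ...); left empty here.
    "extraResultStorageOptions": [],
}

print(json.dumps(evaluation_config, indent=2))
```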

@doc("Definition for sampling strategy.")
@added(Versions.v2025_05_15_preview)
@removed(Versions.v1)