feat(bedrock): add converse mcp support#9134
Conversation
There was a problem hiding this comment.
Pull request overview
Adds Model Context Protocol (MCP) integration to the AWS Bedrock Converse provider so MCP servers can be configured, their tools surfaced as Bedrock tools, and tool-use requests executed (including in streaming). Also updates the Red Team target configuration UI to support Bedrock Invoke vs Converse IDs and configure MCP servers, and adds a Bedrock Converse+MCP example.
Changes:
- Add MCP client initialization/cleanup and tool discovery → Bedrock Converse toolConfig conversion, plus MCP tool execution for tool_use responses (non-streaming + streaming).
- Update Red Team provider configuration UI to support Bedrock as a foundation-model provider, including an InvokeModel/Converse selector and MCP server editor.
- Add tests and a documented example config for Bedrock Converse with MCP (DeepWiki).
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| test/providers/bedrock/converse.test.ts | Adds unit tests covering MCP init/cleanup, toolConfig injection, and tool execution (including streaming). |
| src/providers/fal.ts | Type assertion tweak for fal.config call. |
| src/providers/bedrock/converse.ts | Implements MCP integration: init/cleanup, tool discovery → Bedrock tools, and MCP tool execution paths. |
| src/app/src/pages/redteam/setup/components/Targets/ProviderConfigEditor.tsx | Treats Bedrock as a foundation-model provider; removes MCP config when switching away from Converse IDs. |
| src/app/src/pages/redteam/setup/components/Targets/ProviderConfigEditor.test.tsx | Tests Bedrock foundation-model rendering and MCP config removal when switching IDs. |
| src/app/src/pages/redteam/setup/components/Targets/FoundationModelConfiguration.tsx | Adds Bedrock Invoke/Converse selector and MCP servers configuration UI stored under config.mcp. |
| src/app/src/pages/redteam/setup/components/Targets/FoundationModelConfiguration.test.tsx | Adds tests for Bedrock mode switching and MCP server config persistence. |
| examples/amazon-bedrock/models/README.md | Documents a new Converse+MCP example and configuration snippet. |
| examples/amazon-bedrock/models/promptfooconfig.converse-mcp.yaml | Adds a runnable example using a DeepWiki MCP server with Bedrock Converse. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // Handle MCP tool callbacks before local function callbacks. | ||
| if (this.mcpClient && toolUseBlocks.length > 0) { | ||
| const results: string[] = []; | ||
| let hasSuccessfulCallback = false; | ||
|
|
There was a problem hiding this comment.
Fixed in 228ae15. parseResponse now dispatches each tool_use block individually: MCP for tools matching an MCP-discovered name, functionToolCallbacks for the remaining blocks, and the default tool_use JSON serialization for anything still unhandled. Mixed responses now surface results from both sources. Added a regression test ("should run MCP and functionToolCallbacks for different tools in the same response").
| @@ -1237,6 +1375,10 @@ export class AwsBedrockConverseProvider extends AwsBedrockGenericProvider implem | |||
| prompt: string, | |||
| context?: CallApiContextParams, | |||
| ): Promise<ProviderResponse & { stream?: AsyncIterable<string> }> { | |||
| if (this.initializationPromise != null) { | |||
There was a problem hiding this comment.
Fixed in 228ae15. The JSDoc on callApiStreaming now clarifies that MCP tools ARE executed automatically in streaming mode and only functionToolCallbacks are non-streaming-only.
| const updateCustomTarget = (field: string, value: unknown) => { | ||
| const updatedTarget = { ...provider } as ProviderOptions; | ||
|
|
||
| if (field === 'id') { | ||
| updatedTarget.id = value as string; | ||
| if (providerType === 'bedrock' && !updatedTarget.id.startsWith('bedrock:converse:')) { | ||
| delete updatedTarget.config.mcp; | ||
| } |
There was a problem hiding this comment.
Fixed in 228ae15. updateCustomTarget now shallow-clones provider.config along with provider so subsequent field assignments and the delete updatedTarget.config.mcp no longer mutate React state by reference.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a625053e6a
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if (hasSuccessfulCallback && results.length > 0) { | ||
| return { | ||
| output: results.join('\n'), | ||
| tokenUsage, |
There was a problem hiding this comment.
Process all tool-use blocks before returning MCP output
When a response includes both an MCP tool call and a non-MCP functionToolCallbacks call, the early return after MCP handling exits before the function-callback pass. As a result, non-MCP callbacks are skipped and output is incomplete. This regresses mixed-tool workflows where both callback types are expected to run.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in 228ae15 alongside the matching Copilot comment. Each tool_use block is now dispatched independently — MCP for matching MCP tools, functionToolCallbacks for the remainder, default tool_use JSON for anything still unhandled. New regression test should run MCP and functionToolCallbacks for different tools in the same response.
Address review findings on the new Bedrock Converse MCP integration:
- Wire `shutdown()` into providerRegistry so stdio MCP child processes are
released at end of eval (previously `cleanup()` was implemented but never
invoked). `cleanup()` is now safe to call multiple times and resilient to
a failed init.
- Surface MCP failures via `ProviderResponse.error` instead of embedding the
error message in `output` with `success: true`. Broken MCP servers no longer
produce green eval rows.
- Catch `initializeMCP()` rejection in the constructor so a misconfigured
provider that's never invoked can't trigger an UnhandledPromiseRejection;
the recorded error is reported on the first `callApi`.
- Gate the streaming MCP execution path on `!toolsDisabled` (matching the
non-streaming path) so `toolChoice: 'none'` cannot accidentally invoke MCP.
- Make `parseToolInput` total: malformed JSON tool_use input from the model
yields `{}` with a warning, not a thrown SyntaxError that crashes the row.
- Dedup tool names between `config.tools` and MCP-discovered tools (Bedrock
rejects duplicates with ValidationException). MCP-provided tools win, with
a warning logged on conflict.
- UI: don't seed MCP servers with `command: ''` and don't auto-set
`mcp.enabled: true` until at least one server has a usable transport
(command/path/url). Empty servers no longer crash the eval at provider init.
- Rename `hasSuccessfulCallback` → `hasMcpInvocation` (the variable is set on
errors too, so the old name was misleading).
- Log a debug breadcrumb when `normalizeMCPToolContent` falls through to the
unknown-shape branch.
Docs:
- Add MCP section to `site/docs/providers/aws-bedrock.md` covering config
shape, single-turn execution semantics, tool-name collisions, lifecycle,
and `toolChoice: none` behavior.
- Document single-turn semantics in the example README.
Tests:
- 15 new backend tests covering the contract changes (error propagation, init
failure, cleanup safety, multiple-cleanup, toolChoice=none gating in both
streaming and non-streaming paths, tool-name dedup, malformed JSON safety,
$schema stripping, mixed-content normalization, array-input coercion,
unconfigured-server early-out).
- 2 new frontend tests covering the MCP-enabled toggle behavior.
Verified end-to-end against real AWS Bedrock + DeepWiki MCP:
- Healthy run: 2/2 pass, error=null, $0.0068
- Broken-URL run: 1/1 error with descriptive `error` field
- Drop setTimeout-based microtask flush in the unhandled-rejection test; callApi already awaits initializationPromise internally, so observing the recorded error through the response is enough. - Tighten Promise truthy-check in cleanup() (biome noMisusedPromises). - Reflow toolNames assertion onto one line.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a12baddbdb
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| // Register only after a successful init so we don't keep a zombie reference | ||
| // around if initialization failed. | ||
| if (!this.registeredForShutdown) { | ||
| providerRegistry.register(this); | ||
| this.registeredForShutdown = true; |
There was a problem hiding this comment.
Register provider before MCP init can fail
Because registration happens only after await this.mcpClient.initialize(), any partial init failure (for example, one MCP server connects and a later server fails) leaves MCP resources allocated while the provider is never added to providerRegistry. In that failure path, callApi/callApiStreaming return an error immediately, so evaluator shutdown via providerRegistry.shutdownAll() cannot invoke cleanup(), which can leak stdio child processes or open transports for the rest of the run.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in 228ae15. providerRegistry.register(this) now runs BEFORE await this.mcpClient.initialize(), so a partial init failure (one server connects, a later one fails) still gets cleaned up by providerRegistry.shutdownAll(). cleanup() is already resilient to a rejected initializationPromise. Regression test added: should register with providerRegistry before awaiting MCP init.
| return true; | ||
| }); | ||
|
|
||
| const converseTools = convertToolsToConverseFormat([...mcpTools, ...dedupedConfigTools]); |
There was a problem hiding this comment.
Deduplicate MCP tools across multiple servers
This merge only filters collisions between config.tools and MCP tools, but it never deduplicates mcpTools themselves. If two configured MCP servers expose the same tool name, both are sent in toolConfig.tools, and Bedrock rejects duplicate tool names with ValidationException, causing every request to fail for that multi-server setup.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in 228ae15. transformMCPToolsToBedrockConverse now dedupes by tool name (first server wins) with a logger.warn on collision. Multi-server setups exposing the same tool name no longer trip Bedrock's ValidationException. Regression test: should dedupe MCP tool names exposed by multiple servers.
Generated 17 tests - 17 passedTip New to Tusk Unit Tests? Learn more here. Test Summary
ResultsTusk's tests are all passing and cover the critical paths for the new Bedrock Converse MCP support. The test suite validates that Avg +1% line coverage gain across 1 file
Coverage is calculated by running tests directly associated with each source file, learn more here. Last run on 316c9f5. Comment View check history
Was Tusk helpful? Give feedback by reacting with 👍 or 👎 |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 21d057c9d6
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| if (this.mcpInitError) { | ||
| return { error: `Bedrock Converse MCP initialization failed: ${this.mcpInitError.message}` }; |
There was a problem hiding this comment.
Skip MCP init errors when tool choice disables tools
When mcp.enabled is set but the MCP server fails to initialize, this unconditional early return makes every callApi fail before tool choice is evaluated. That breaks valid configurations where a test case sets toolChoice/tool_choice to none (or otherwise avoids tools), because those requests should still execute without MCP. The same guard exists in callApiStreaming, so both paths are blocked by an MCP outage even when tools are intentionally disabled.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in 228ae15. callApi and callApiStreaming only return the MCP-init-failure error when tools are still active for the request. When the test case sets toolChoice: 'none' (or tool_choice: 'none'), the request proceeds without MCP. Regression test: should still serve toolChoice=none requests when MCP init failed.
Biome's lint/nursery/useReduceTypeParameter rule was flagging StrategyTable's reduce initializer. Pass GroupedStrategies as the reduce type parameter so the empty-object initial value doesn't need a type assertion.
Reduces cognitive complexity of parseResponse and callApiStreaming by hoisting the MCP-dispatch loop into dedicated helpers (executeMcpToolCalls method, formatStreamingToolUseBlocks method) and the JSON parse + error aggregation into module-scope helpers. Behavior unchanged; all 165 unit tests pass.
The reduce<T> generic version didn't actually clear the Style Check error. Reverting to match origin/main exactly so we can isolate the real source.
Three formatting deltas biome ci wanted: - Wrap isMCPServerConfigured signature across lines (was over 100-col). - Inline the constructor's MCP init failure logger.error onto a single line. - Inline formatStreamingToolUseBlocks' MCP catch logger.error onto a single line. Behavior unchanged; all 165 unit tests still pass.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a3c0debff1
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| 'groq', | ||
| 'deepseek', | ||
| 'azure', | ||
| 'bedrock', |
There was a problem hiding this comment.
Restore Bedrock custom-config editing path
Routing bedrock through FoundationModelConfiguration removes the raw CustomTargetConfiguration editor that previously allowed arbitrary Bedrock settings. The new Bedrock form only exposes a small subset of fields, so users can no longer set required per-target options like config.region (or other Bedrock-specific fields such as endpoint/guardrail settings); those targets then fall back to defaults (e.g., us-east-1) and can fail in non-default-region deployments.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in 228ae15. FoundationModelConfiguration now renders a "Bedrock Settings" collapsible section (when providerType === 'bedrock') exposing AWS Region, AWS Profile, and Inference Model Type — the most common Bedrock-specific fields that disappeared with the route change. Region defaults to us-east-1 (unchanged) but is now editable from the UI, fixing non-default-region deployments.
Comments addressed (Copilot + Codex):
P1 — Process all tool_use blocks in mixed dispatch (Copilot, Codex):
parseResponse no longer early-returns after MCP. It now dispatches each
tool_use block to MCP if a matching MCP tool exists, otherwise to a
configured functionToolCallback, otherwise to the default tool_use JSON
serialization. Mixed responses (one MCP tool + one local callback) now
surface both outputs.
P1 — Register provider before MCP init can fail (Codex):
providerRegistry.register(this) now runs BEFORE awaiting
mcpClient.initialize(). A partial init failure (e.g., one server
connects, a later one fails) is still cleaned up by shutdownAll().
P1 — Skip MCP init error when tool choice is none (Codex):
callApi/callApiStreaming no longer block all requests on a failed MCP
init. When the test case sets toolChoice/tool_choice to 'none', the
request executes without MCP — matching the documented contract that
disabling tools opts out of the entire tool path.
P1 — Restore Bedrock custom-config editing path (Codex):
FoundationModelConfiguration now exposes a Bedrock Settings section with
AWS Region, AWS Profile, and Inference Model Type — the
Bedrock-specific fields that disappeared when we routed bedrock through
the foundation-model form. Defaults to us-east-1 when unset (unchanged
from prior behavior).
P1 — ProviderConfigEditor shallow-clone (Copilot):
updateCustomTarget now clones provider.config along with provider so
`delete updatedTarget.config.mcp` and other field assignments don't
mutate React state by reference.
P2 — Deduplicate MCP tools across servers (Codex):
transformMCPToolsToBedrockConverse now drops duplicate tool names with a
warning. Bedrock rejects duplicate tool names with ValidationException;
multi-server setups exposing the same tool name no longer fail every
request.
Doc — callApiStreaming doc comment (Copilot):
Updated the JSDoc to clarify that MCP tool execution does happen in
streaming mode, and only functionToolCallbacks are non-streaming-only.
Tests:
- 4 new backend tests covering: multi-server MCP dedup, mixed
MCP+functionToolCallbacks dispatch, registry registration before
init, and toolChoice=none surviving a failed MCP init.
- All 169 backend tests + 42 frontend tests still pass.
- Real eval against AWS Bedrock + DeepWiki MCP: 2/2 pass.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 228ae158ae
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| if (this.initializationPromise != null) { | ||
| await this.initializationPromise; | ||
| } |
There was a problem hiding this comment.
Bypass MCP init wait when tools are disabled
Fresh evidence relative to the earlier MCP-init discussion: callApi still awaits MCP initialization before checking whether this request disables tools. In configs where mcp.enabled is true but a test sets toolChoice/tool_choice to none, this path can still stall on MCP connect timeouts (or a hung transport) even though no tool calls should run for that request; callApiStreaming has the same ordering. The tool-choice check should happen before awaiting MCP init so tool-disabled requests can proceed immediately without MCP availability.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in d6608ec. The tool-choice resolution now happens BEFORE awaiting initializationPromise, so tool-disabled requests skip both the wait and the init-error check. A hung MCP transport can no longer stall a toolChoice: 'none' request.
Verification:
- New regression test
should not await MCP init for toolChoice=none requestsraces a 200ms timeout against a never-resolving MCP init. - Real-eval against AWS Bedrock with a deliberately broken MCP URL +
tool_choice: none: passes with the model's response ("hello"), confirming the bypass works end-to-end.
The new helper awaitMcpReadyForRequest() resolves the effective tool choice first via isRequestToolsDisabled(); both callApi and callApiStreaming now share it.
Address fresh Codex P1 feedback: a hung MCP transport should never stall a
request that has opted out of tools entirely. Previously `callApi` and
`callApiStreaming` always awaited `initializationPromise` before checking
the effective tool choice — a slow or never-resolving MCP connect could
block tool-disabled requests indefinitely. Now the tool-choice resolution
runs first, and the await only happens when the request actually needs MCP.
Also folds in code-simplifier improvements applied alongside this fix:
- Centralize repeated `err instanceof Error ? err.message : String(err)` as
`errorMessage(err)`.
- Centralize the four `MCP Tool Result/Error (${name}): …` strings as
`formatMcpToolResult` / `formatMcpToolError`.
- Extract `dispatchMcpToolCall` shared by `executeMcpToolCalls` and
`formatStreamingToolUseBlocks` so the try/catch + parse + format logic
exists in one place.
- Fold `parseStreamingToolInput` to return `Record<string, unknown>` so
callers no longer need a redundant second `parseToolInput` pass.
- Hoist `isServerConfigured` to module scope in
`FoundationModelConfiguration` (was reallocated per render).
- Drop unused `invoked` field from `executeMcpToolCalls`'s return value.
Tests:
- New regression test `should not await MCP init for toolChoice=none
requests` races a 200ms timeout against a never-resolving MCP init —
fails fast if the await regresses.
- All 170 backend + 42 frontend tests still pass.
- Real eval verification:
- Healthy run: 2/2 pass, error=null
- Broken-MCP run: 1/1 error with "MCP initialization failed"
- Broken-MCP + tool_choice=none: 1/1 pass, error=null, output="hello"
(confirms the bypass works end-to-end against real Bedrock)
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d6608ec745
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| const { value: parsedInput, failed: parseFailed } = parseStreamingToolInput(toolBlock.input); | ||
| const matchedMcpTool = mcpTools.find((tool) => tool.name === toolBlock.name); | ||
|
|
||
| if (toolsDisabled || !this.mcpClient || !matchedMcpTool) { | ||
| toolUseParts.push( | ||
| JSON.stringify({ | ||
| type: 'tool_use', | ||
| id: toolBlock.toolUseId, | ||
| name: toolBlock.name, | ||
| input: parsedInput, | ||
| }), |
There was a problem hiding this comment.
Preserve raw tool_use input in streaming fallback output
Move JSON parsing behind the MCP-dispatch branch so fallback serialization keeps the model-emitted payload intact. Right now parseStreamingToolInput() runs before checking toolsDisabled/MCP match, so malformed JSON is coerced to {} and then returned as a normal fallback tool_use block for non-MCP tools (including toolChoice: none), which hides the model error and can invalidate assertions/debugging that rely on the actual emitted arguments.
Useful? React with 👍 / 👎.
mldangelo
left a comment
There was a problem hiding this comment.
LGTM, approving. Thanks @jameshiester-oai for the great work on this — the Bedrock Converse MCP integration fills a real gap, and the design (MCP tools merged with config.tools, single-turn execution semantics, clean separation of streaming vs. non-streaming dispatch) holds up well under stress.
What's been verified end-to-end:
- Real eval against
bedrock:converse:us.anthropic.claude-sonnet-4-6in us-east-1 with the DeepWiki MCP server: 2/2 pass, $0.0068,stopReason=tool_use, MCP tool results correctly returned. - Broken-MCP path: a deliberately bad MCP URL surfaces as
success: falsewith a descriptiveProviderResponse.error(instead of green-lighting on an embedded error string). - Broken-MCP +
tool_choice: none: bypasses MCP entirely and returns the model's text response — confirming hung/failed MCP transports don't block tool-disabled requests. - 170 backend + 42 frontend tests green; full CI matrix (36 checks) green; biome
format+tscclean.
Hardening that landed during review (mostly in response to Copilot/Codex feedback):
- Lifecycle:
shutdown()wired intoproviderRegistry; provider registers BEFORE awaitingmcpClient.initialize()so partial init failures still get cleaned up;cleanup()is safe to call multiple times and resilient to a rejected init promise. - Error contract: MCP errors (both
mcpResult.errorand thrown exceptions) propagate intoProviderResponse.errorso downstream consumers (assertions, exit codes, redteam grader) treat broken MCP calls as failures. - Mixed dispatch:
parseResponsenow routes eachtool_useblock independently (MCP for matching tools,functionToolCallbacksfor the remainder, default JSON for anything still unhandled). A response with both MCP and local-callback tools surfaces both outputs. - Tool-name dedup: collisions between
config.toolsand MCP-discovered tools are dropped with a warning; same dedup applied across multiple MCP servers (avoids Bedrock'sValidationException). - Tools-disabled fast path:
tool_choice: 'none'requests skip the MCP init wait AND the init-error check entirely — a hung MCP transport can no longer stall them. - Streaming: now honors
toolChoice: 'none'(gates on!toolsDisabledmatching the non-streaming path); JSDoc updated to reflect that MCP runs in streaming, butfunctionToolCallbacksdon't. - UI: Bedrock InvokeModel/Converse selector + MCP server editor + Bedrock Settings section (region, profile, inferenceModelType — fixes the regression where users couldn't set non-default-region from the form). MCP
enabledflips totrueonly after a server has a usable transport, so empty servers don't crash provider init. - Docs: new MCP section in
site/docs/providers/aws-bedrock.md; example README documents single-turn execution semantics.
Nice work — shipping it. 🚀


Summary
Test plan
git -C promptfoo diff --check origin/main...HEADexamples/amazon-bedrock/models/promptfooconfig.converse-mcp.yamlNote: I did not run the live Bedrock example because it requires AWS Bedrock credentials, model access, and live MCP/network behavior.