diff --git a/mkdocs.yml b/mkdocs.yml
index cccca9fc3..603740daf 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -186,6 +186,7 @@ nav:
     - How-To Guides:
       - Experiment Management: user-guide/evals-sdk/how-to/experiment_management.md
       - Serialization: user-guide/evals-sdk/how-to/serialization.md
+      - Deploy to AgentCore Evaluators: user-guide/evals-sdk/how-to/agentcore-code-evaluators.md
   - Deploy:
     - Operating Agents in Production: user-guide/deploy/operating-agents-in-production.md
     - Amazon Bedrock AgentCore:
diff --git a/src/config/navigation.yml b/src/config/navigation.yml
index 17a09cf15..096a575b4 100644
--- a/src/config/navigation.yml
+++ b/src/config/navigation.yml
@@ -208,6 +208,7 @@ sidebar:
       - docs/user-guide/evals-sdk/how-to/result_caching
       - docs/user-guide/evals-sdk/how-to/experiment_management
       - docs/user-guide/evals-sdk/how-to/serialization
+      - docs/user-guide/evals-sdk/how-to/agentcore-code-evaluators
       - docs/user-guide/versioning-and-support
   - label: Examples
diff --git a/src/content/docs/user-guide/deploy/deploy_to_bedrock_agentcore/index.mdx b/src/content/docs/user-guide/deploy/deploy_to_bedrock_agentcore/index.mdx
index af2efc9e1..d814ec688 100644
--- a/src/content/docs/user-guide/deploy/deploy_to_bedrock_agentcore/index.mdx
+++ b/src/content/docs/user-guide/deploy/deploy_to_bedrock_agentcore/index.mdx
@@ -37,6 +37,10 @@
 Select your preferred programming language to get started with deploying Strands
 
+## Evaluating Deployed Agents
+
+Once your agent is deployed to AgentCore, you can evaluate it with Strands evaluators by wrapping them as AgentCore code-based evaluators. See [Deploy to AgentCore Evaluators](../../evals-sdk/how-to/agentcore-code-evaluators.md).
+
 ## Additional Resources
 
 - [Amazon Bedrock AgentCore Runtime Documentation](https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/what-is-bedrock-agentcore.html)
diff --git a/src/content/docs/user-guide/evals-sdk/evaluators/index.mdx b/src/content/docs/user-guide/evals-sdk/evaluators/index.mdx
index 029fdde80..99ebb46a1 100644
--- a/src/content/docs/user-guide/evals-sdk/evaluators/index.mdx
+++ b/src/content/docs/user-guide/evals-sdk/evaluators/index.mdx
@@ -382,3 +382,4 @@ def compare_agent_versions(cases: list, agents: dict) -> dict:
 - [Simulators Overview](../simulators/index.md): Learn about simulators
 - [Detectors Overview](../detectors/index.md): Automatic failure diagnosis for agent traces
 - [Experiment Generator](../experiment_generator.md): Generate test cases automatically
+- [Deploy to AgentCore Evaluators](../how-to/agentcore-code-evaluators.md): Wrap a Strands evaluator as an AgentCore code-based evaluator
diff --git a/src/content/docs/user-guide/evals-sdk/how-to/agentcore-code-evaluators.mdx b/src/content/docs/user-guide/evals-sdk/how-to/agentcore-code-evaluators.mdx
new file mode 100644
index 000000000..02cecffe9
--- /dev/null
+++ b/src/content/docs/user-guide/evals-sdk/how-to/agentcore-code-evaluators.mdx
@@ -0,0 +1,625 @@
---
title: Deploy Strands Evaluators as AgentCore Code-Based Evaluators
sidebar:
  label: "AgentCore Code-Based Evaluators"
languages: Python
---

This guide is for a Strands developer who has deployed agents to Amazon Bedrock
AgentCore and wants to run Strands evaluators against those agents. It walks
through wrapping an evaluator from the Python Strands Evals SDK
(`strands-agents-evals`) as an AgentCore code-based evaluator — an AWS Lambda
function that AgentCore Evaluations invokes to score sessions, traces, or tool
calls.

## Prerequisites

Before following this guide, make sure you have:

- Python 3.10+
- `strands-agents` installed
- `strands-agents-evals` installed (see the install command below)
- An AWS account with access to AWS Lambda
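
Both packages are published to PyPI, so a plain `pip install` is enough (a
minimal sketch; use your preferred package manager or a virtual environment
as needed):

```bash
pip install strands-agents strands-agents-evals
```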

:::note
Evaluators that call Amazon Bedrock (for example `HelpfulnessEvaluator`)
require model access to be enabled in the Amazon Bedrock console. See the
[model-access instructions](../quickstart.md) in the Evals SDK quickstart.
:::

## How it fits together

The diagram below shows the control flow inside the Lambda handler at runtime,
from the moment AgentCore Evaluations invokes it to the schema-valid response
the handler returns.

```mermaid
sequenceDiagram
    participant AC as AgentCore Evaluations
    participant L as Lambda (sample handler)
    participant M as StrandsInMemory<br/>SessionMapper
    participant E as Strands Evaluator<br/>(HelpfulnessEvaluator)

    AC->>L: event (schemaVersion, evaluationLevel,<br/>sessionSpans, referenceInputs, target)
    L->>L: parse_payload(event)
    L->>M: map sessionSpans → Session
    M-->>L: Session
    L->>E: evaluator.evaluate(EvaluationData)
    alt evaluator supports level
        E-->>L: EvaluationOutput(score, reason)
        L->>L: build_response(score, reason,<br/>level, evaluatorId)
        L-->>AC: schema-valid response JSON
    else unsupported level
        L->>L: error_handling: build "not run" response
        L-->>AC: schema-valid response JSON<br/>(score=None, reason=explanation)
    else evaluator raised
        L->>L: error_handling: catch, log,<br/>build failure response
        L-->>AC: schema-valid response JSON<br/>(score=None, reason=error)
    end
```

## AgentCore request payload

AgentCore Evaluations invokes your Lambda with a JSON event defined by the
AWS [code-based evaluators contract](https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/code-based-evaluators.html).

The handler reads every field listed below:

- `schemaVersion` — contract version; currently `"1.0"`.
- `evaluatorId` — ARN of the registered code-based evaluator. Example:
  `"arn:aws:bedrock-agentcore:us-east-1:123456789012:evaluator/strands-helpfulness"`.
- `evaluatorName` — human-readable label. Example: `"strands-helpfulness"`.
- `evaluationLevel` — one of `TRACE`, `TOOL_CALL`, or `SESSION`. Selects which
  Strands evaluator level the handler can answer.
- `evaluationInput.sessionSpans` — array of OTel-style span dicts the handler
  feeds to `StrandsInMemorySessionMapper` to build a `Session`.
- `evaluationReferenceInputs` — optional reference data for evaluators that
  need it (for example `FaithfulnessEvaluator`). Example:
  `{"reference_output": "Paris"}`.
- `evaluationTarget` — metadata about the target agent or session. Example:
  `{"agentId": "my-agent", "sessionId": "abc-123"}`.

A complete request looks like this:

```json
{
  "schemaVersion": "1.0",
  "evaluatorId": "arn:aws:bedrock-agentcore:us-east-1:...:evaluator/...",
  "evaluatorName": "strands-helpfulness",
  "evaluationLevel": "TRACE",
  "evaluationInput": {
    "sessionSpans": [ /* array of OTel-style span dicts */ ]
  },
  "evaluationReferenceInputs": { /* optional reference data */ },
  "evaluationTarget": { /* metadata about the target agent/session */ }
}
```

## Evaluation-level mapping

AgentCore's `evaluationLevel` must line up with the Strands evaluator's
declared level for the handler to score the request. The table below shows
the one-to-one mapping the handler relies on:

| AgentCore `evaluationLevel` | Strands evaluator level | Typical Strands evaluators |
| --- | --- | --- |
| `TOOL_CALL` | `OUTPUT_LEVEL` | `OutputEvaluator`, deterministic evaluators |
| `TRACE` | `TRACE_LEVEL` | `HelpfulnessEvaluator`, `FaithfulnessEvaluator` |
| `SESSION` | `SESSION_LEVEL` | `TrajectoryEvaluator`, `GoalSuccessRateEvaluator` |

When the incoming level doesn't match the wrapped evaluator, the handler
returns a schema-valid "not-run" response instead of raising — the details of
that response shape are covered in the error-handling section later on this
page.

## Writing the Lambda handler

The rest of this section walks through the sample handler region by region.
Every code block is pulled verbatim from
[`agentcore_code_evaluator_handler.py`][sample-handler] via MkDocs snippet
syntax, so the docs and the runnable sample can't drift apart.

[sample-handler]: https://github.com/strands-agents/docs/blob/main/src/content/docs/user-guide/evals-sdk/how-to/agentcore_code_evaluator_handler.py

### Imports and constants

The `imports` region pulls in the stdlib utilities, the Strands Evals SDK
building blocks the handler needs, and the module-level constants echoed back
in every AgentCore response. The evaluator instance is created at module scope
rather than inside `lambda_handler` so warm Lambda invocations reuse the same
`HelpfulnessEvaluator` — and therefore the same judge-model client —
instead of re-initializing it on every call.

`SCHEMA_VERSION` is the AgentCore code-based evaluator contract version the
handler echoes in responses. `PASS_THRESHOLD_ENV` and `DEFAULT_PASS_THRESHOLD`
give deployers a way to override the numeric cutoff that maps evaluator scores
to `PASS` / `FAIL` without editing code. The `AGENTCORE_LEVEL_*` constants are
the exact string values AgentCore sends for `evaluationLevel`, defined once so
the rest of the handler can reference them symbolically.

```python
--8<-- "user-guide/evals-sdk/how-to/agentcore_code_evaluator_handler.py:imports"
```

### Entry point

`lambda_handler` is intentionally small: parse the event, invoke the
evaluator, build the response. Each step lives in its own helper so the docs
can introduce them one at a time, and so unit tests can exercise each stage
independently.

The outer `try/except` is the key part. AgentCore treats any Lambda error as
an evaluator failure that aborts the evaluation run, so the handler converts
every failure — level mismatches, validation errors, evaluator exceptions —
into a schema-valid response. The two helpers referenced here,
`LevelMismatchError` and `_error_response`, are defined alongside the other
error-handling code further down.

```python
--8<-- "user-guide/evals-sdk/how-to/agentcore_code_evaluator_handler.py:handler"
```

### Parsing the request

`_parse_payload` translates the AgentCore request into a Strands
`EvaluationData`. It starts with a soft version check: an unexpected or
missing `schemaVersion` logs a warning but doesn't abort, because the rest of
the payload may still be readable. `evaluationLevel` is different — it's
required to pick the right evaluator path, so a missing value raises a
`ValueError` that the outer `try/except` turns into an `ERROR` response.

The OTel-style `sessionSpans` are fed to `StrandsInMemorySessionMapper` to
produce the `Session` object that Strands evaluators accept.
Optional passthroughs (`evaluationReferenceInputs` and `evaluationTarget`)
are read only when the evaluator needs them.

The `_recover_prompt` and `_recover_final_output` helpers are illustrative.
They pull the user prompt from the root span's attributes and the final
assistant message from the session's message list, which works for the shape
this sample assumes. Adapt both helpers to match the span and message schema
your agent actually records.

```python
--8<-- "user-guide/evals-sdk/how-to/agentcore_code_evaluator_handler.py:parse_payload"
```

### Invoking the evaluator

`_invoke_evaluator` enforces the level-compatibility check before calling the
evaluator. It reads the evaluator's declared Strands level (for
`HelpfulnessEvaluator` that's `TRACE_LEVEL`) and compares it to the expected
Strands level for the incoming AgentCore level via
`_AGENTCORE_TO_STRANDS_LEVEL`, the single mapping table the handler uses for
this translation.

On a mismatch, the function raises `LevelMismatchError`. Python looks the
class name up at call time, so `LevelMismatchError` can live in the
error-handling region further down the file without any definition-order
problem. The outer handler catches the mismatch and converts it into the
`NOT_RUN` response.

```python
--8<-- "user-guide/evals-sdk/how-to/agentcore_code_evaluator_handler.py:invoke_evaluator"
```

### Building the response

`_build_response` shapes the AgentCore response the Lambda returns on the
happy path. The pass threshold is read per-invocation from
`STRANDS_EVAL_PASS_THRESHOLD` (falling back to `DEFAULT_PASS_THRESHOLD` when
the env var is missing or non-numeric), so tests and deployments can override
it without touching code.

`passed` and `label` follow an explicit precedence. If the evaluator sets
`test_pass` directly — deterministic evaluators often do — that value wins;
otherwise `passed` is derived by comparing `score` to the threshold. The
`label` preserves an evaluator-provided value when one exists, and otherwise
falls back to the AgentCore `PASS` / `FAIL` vocabulary. `schemaVersion`,
`evaluatorId`, `evaluatorName`, and `evaluationLevel` are echoed straight
from the request so AgentCore can correlate the response with the evaluation
it triggered.

```python
--8<-- "user-guide/evals-sdk/how-to/agentcore_code_evaluator_handler.py:build_response"
```

### Error handling

The `error_handling` region collects the three failure modes the handler is
prepared for:

- **Level mismatch.** `LevelMismatchError` is raised by `_invoke_evaluator`
  when the incoming `evaluationLevel` doesn't map to the evaluator's declared
  Strands level. The outer handler catches it and calls `_error_response`
  with `label="NOT_RUN"`, signaling that the evaluator is healthy but
  cannot score this request.
- **Unexpected exceptions.** Any other `Exception` — an evaluator crash,
  a Bedrock error, a bug in the mapping helpers — is caught by the generic
  `except Exception` in `lambda_handler`, logged with `logger.exception(...)`
  so the full traceback lands in CloudWatch, and converted into an `ERROR`
  response.
- **Missing required fields.** When `_parse_payload` encounters a missing
  `evaluationLevel`, it raises `ValueError`. That also falls into the generic
  catch above and surfaces as an `ERROR` response.
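
For concreteness, a level-mismatch request yields a response shaped like the
sketch below, assembled from what `_error_response` builds (the ARN and the
exact `reason` string will vary):

```json
{
  "schemaVersion": "1.0",
  "results": [
    {
      "evaluatorId": "arn:aws:bedrock-agentcore:us-east-1:...:evaluator/...",
      "evaluatorName": "strands-helpfulness",
      "evaluationLevel": "SESSION",
      "score": null,
      "passed": false,
      "label": "NOT_RUN",
      "reason": "LevelMismatchError: Evaluator level 'TRACE_LEVEL' does not support AgentCore evaluationLevel 'SESSION'"
    }
  ]
}
```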
All three failure modes return schema-valid responses because AgentCore
treats an uncaught Lambda error as an evaluator failure that aborts the
evaluation run — surfacing a graceful score with a clear `reason` is almost
always more useful than taking the whole run down.

```python
--8<-- "user-guide/evals-sdk/how-to/agentcore_code_evaluator_handler.py:error_handling"
```

Swapping in a different built-in evaluator is a one-line change. Replace the
module-level `_EVALUATOR = HelpfulnessEvaluator()` with, for example,
`OutputEvaluator()` for `TOOL_CALL`-level requests or `TrajectoryEvaluator()`
for `SESSION`-level requests — the surrounding parse, invoke, and
build-response logic stays the same.

## AgentCore response schema

The handler returns JSON shaped to the AgentCore code-based evaluator response
contract — AgentCore uses this payload to record the score and attach it back
to the evaluation run.

The response carries these fields:

- `schemaVersion` — echoed from the request so AgentCore can confirm the
  contract version the handler answered.
- `results[0].evaluatorId` — echoed from the request; the ARN of the
  registered code-based evaluator.
- `results[0].evaluatorName` — echoed from the request; the human-readable
  evaluator label.
- `results[0].evaluationLevel` — echoed from the request (`TRACE`,
  `TOOL_CALL`, or `SESSION`).
- `results[0].score` — numeric score produced by the Strands evaluator, or
  `null` on the `NOT_RUN` / `ERROR` paths.
- `results[0].passed` — boolean pass indicator. See the defaults table below
  for how the sample handler derives it.
- `results[0].label` — short string label (`PASS` / `FAIL` / `NOT_RUN` /
  `ERROR`) that AgentCore surfaces in the dashboard.
- `results[0].reason` — human-readable rationale taken from the evaluator's
  `EvaluationOutput.reason`.

A successful response looks like this:

```json
{
  "schemaVersion": "1.0",
  "results": [
    {
      "evaluatorId": "arn:aws:bedrock-agentcore:...",
      "evaluatorName": "strands-helpfulness",
      "evaluationLevel": "TRACE",
      "score": 0.83,
      "passed": true,
      "label": "PASS",
      "reason": "Response directly answered the user's question with ..."
    }
  ]
}
```

AgentCore requires a handful of fields that Strands evaluators don't produce
on their own. The sample handler fills them in as follows:

| Field | Default source in sample handler |
| --- | --- |
| `schemaVersion` | Echoed from the request |
| `evaluatorId` / `evaluatorName` | Echoed from the request |
| `evaluationLevel` | Echoed from the request |
| `passed` | `test_pass`, else `score >= threshold` (env: `STRANDS_EVAL_PASS_THRESHOLD`) |
| `label` | `PASS` / `FAIL` from `passed` unless the evaluator sets an explicit `label` |
| `reason` | Taken from `evaluation_output.reason` |

## Round-trip translation

The sample handler preserves a simple round-trip property: parsing a valid
AgentCore payload, running the wrapped Strands evaluator, and serializing the
result produces an AgentCore response whose `score` equals the evaluator's
score and whose `evaluationLevel` equals the request's `evaluationLevel`. When
the evaluator doesn't support the incoming level, the response is still
schema-valid — a `NOT_RUN` result with `score: null` and a clear `reason`
explaining that the evaluator did not run.
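
Before walking through the field-by-field mapping, the sketch below checks
that property locally. It is illustrative only: `StubEvaluator` is
hypothetical test scaffolding that stands in for the judge-backed evaluator,
so no AWS credentials or Bedrock call are needed. It assumes
`strands-agents-evals` is installed and the handler module sits on the Python
path; the spans reuse the shape from the worked example below.

```python
# Minimal local check of the round-trip property (sketch).
from types import SimpleNamespace

import agentcore_code_evaluator_handler as handler


class StubEvaluator:
    """Hypothetical stand-in for a TRACE_LEVEL Strands evaluator."""

    level = handler.STRANDS_TRACE_LEVEL  # matches the TRACE request below

    def evaluate(self, evaluation_data):
        # Mirrors the EvaluationOutput fields _build_response reads.
        return SimpleNamespace(score=0.9, reason="stub", test_pass=None, label=None)


handler._EVALUATOR = StubEvaluator()  # swap in the stub for this check

event = {
    "schemaVersion": "1.0",
    "evaluatorId": "arn:aws:bedrock-agentcore:us-east-1:123456789012:evaluator/demo",
    "evaluatorName": "strands-helpfulness",
    "evaluationLevel": "TRACE",
    "evaluationInput": {
        "sessionSpans": [
            {"name": "invoke_agent", "attributes": {"gen_ai.prompt": "What's the capital of France?"}},
            {"name": "assistant.message", "attributes": {"gen_ai.completion": "The capital of France is Paris."}},
        ]
    },
}

result = handler.lambda_handler(event, None)["results"][0]
assert result["score"] == 0.9  # evaluator score preserved verbatim
assert result["evaluationLevel"] == "TRACE"  # request level echoed back
```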
### Request → evaluator input

When `_parse_payload` processes a request, each AgentCore field lands in a
specific slot on the Strands `EvaluationData` the evaluator receives:

- `evaluationInput.sessionSpans` → `trajectory`: a `Session` built by
  `StrandsInMemorySessionMapper`, used by `TRACE_LEVEL` and `SESSION_LEVEL`
  evaluators.
- the root span's attributes → `input`: the user prompt recovered by
  `_recover_prompt`.
- the final assistant message in the mapped `Session` → `actual_output`,
  recovered by `_recover_final_output`.
- `evaluationReferenceInputs` → `reference_input` when present, so
  reference-based evaluators (for example `FaithfulnessEvaluator`) can read
  the keys they need.
- `evaluationTarget` → read but not mapped onto `EvaluationData`; the handler
  accepts it without rewriting it, and evaluators that need agent or session
  metadata can read it from the request directly.

### Evaluator output → response

When `_build_response` serializes the result, the evaluator's
`EvaluationOutput` maps back into the AgentCore response as follows:

- `score` → `results[0].score`, echoed verbatim (or `null` on the `NOT_RUN`
  and `ERROR` paths).
- `reason` → `results[0].reason`, echoed verbatim.
- `test_pass` / `label` → `results[0].passed` and `results[0].label`, with the
  precedence documented earlier: an evaluator-provided `test_pass` wins over
  the threshold comparison, and an evaluator-provided `label` wins over the
  derived `PASS` / `FAIL` vocabulary.
- `schemaVersion`, `evaluatorId`, `evaluatorName`, `evaluationLevel` → echoed
  straight from the request so AgentCore can correlate the response with the
  evaluation run that triggered it.

### Worked example

Given a `TRACE`-level request targeting the `HelpfulnessEvaluator`:

```json
{
  "schemaVersion": "1.0",
  "evaluatorId": "arn:aws:bedrock-agentcore:us-east-1:...:evaluator/helpfulness",
  "evaluatorName": "strands-helpfulness",
  "evaluationLevel": "TRACE",
  "evaluationInput": {
    "sessionSpans": [
      {
        "name": "invoke_agent",
        "attributes": {
          "gen_ai.prompt": "What's the capital of France?"
        }
      },
      {
        "name": "assistant.message",
        "attributes": {
          "gen_ai.completion": "The capital of France is Paris."
        }
      }
    ]
  },
  "evaluationReferenceInputs": {
    "reference_output": "Paris"
  },
  "evaluationTarget": {
    "agentId": "my-agent",
    "sessionId": "abc-123"
  }
}
```

The sample handler produces this response:

```json
{
  "schemaVersion": "1.0",
  "results": [
    {
      "evaluatorId": "arn:aws:bedrock-agentcore:us-east-1:...:evaluator/helpfulness",
      "evaluatorName": "strands-helpfulness",
      "evaluationLevel": "TRACE",
      "score": 0.9,
      "passed": true,
      "label": "PASS",
      "reason": "Response directly answered the user's question with Paris."
    }
  ]
}
```

## Deploy the Lambda function

With the handler code in place, the remaining work is packaging it, creating
the Lambda, and wiring up IAM so the function can call Bedrock (for
evaluators that need it) and write logs. This section collects the AWS-side
steps a Strands reader needs to get a code-based evaluator Lambda running.

### Package the function

Lambda expects a ZIP archive containing the handler module plus its
dependencies. The simplest approach is to `pip install` the Strands Evals
SDK into a local `package/` directory, drop the handler next to it, and zip
everything together:

```bash
mkdir -p package
pip install --target ./package strands-agents-evals
cp agentcore_code_evaluator_handler.py package/
cd package && zip -r ../function.zip . && cd ..
```

:::note
Lambda imposes a 250 MB unzipped deployment-package size limit (50 MB
zipped for direct upload). If `strands-agents-evals` plus its transitive
dependencies push a deployment package near that ceiling, move the library
into a Lambda layer and deploy the handler by itself. A layer also lets you
share a single `strands-agents-evals` install across multiple evaluator
Lambdas.
:::

### Create the Lambda

Once `function.zip` exists, create the Lambda with the AWS CLI. Use a
Python 3.11 runtime, point `--handler` at
`agentcore_code_evaluator_handler.lambda_handler`, and give the function
enough time and memory to load the SDK and invoke the judge model:

```bash
aws lambda create-function \
  --function-name strands-evals-helpfulness \
  --runtime python3.11 \
  --handler agentcore_code_evaluator_handler.lambda_handler \
  --timeout 60 \
  --memory-size 512 \
  --role arn:aws:iam::ACCOUNT_ID:role/StrandsEvalsLambdaRole \
  --zip-file fileb://function.zip
```

Replace `ACCOUNT_ID` with your AWS account ID and substitute the execution
role you create under *Execution-role IAM policy* below.

### Environment variables

The sample handler reads a small number of environment variables so
deployers can tune runtime behavior without editing code:

- `STRANDS_EVAL_PASS_THRESHOLD` — numeric cutoff the handler uses to derive
  `passed` from `score` when the evaluator doesn't set `test_pass`
  directly. Defaults to `0.5`. Non-numeric values fall back to the default.
- `AWS_REGION` — region the Strands Evals SDK uses for Bedrock model calls.
  Lambda sets this automatically to the function's region and reserves the
  key, so it can't be overridden in the function configuration; to run the
  judge model in a different region than the Lambda, configure the region on
  the judge model's client in code instead.

Evaluators that instantiate `BedrockModel` also respect the standard AWS SDK
environment variables for credentials and endpoint overrides, which is
useful if you need to repoint the judge model's endpoint without redeploying
the handler.

### Execution-role IAM policy

The Lambda execution role needs just two permission families: call the
Bedrock models the evaluator uses, and write logs to CloudWatch. The
minimal policy below covers both:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "InvokeBedrockModel",
      "Effect": "Allow",
      "Action": "bedrock:InvokeModel",
      "Resource": "*"
    },
    {
      "Sid": "CloudWatchLogs",
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "*"
    }
  ]
}
```

Scope `Resource` on `bedrock:InvokeModel` down to the specific model ARNs
your evaluator uses for a tighter policy in production. The wildcard shown
here keeps the example concise.

### Region and additional resources

The Lambda must live in a region that supports both AgentCore Evaluations
and the Bedrock models the chosen evaluator uses. Evaluators that call
Bedrock (for example `HelpfulnessEvaluator`) fail at runtime if the judge
model isn't available in the Lambda's region — pick a region that
satisfies both services before creating the function.

An IAM execution role must exist before `aws lambda create-function` will
succeed. See the AWS Lambda docs on
[creating an execution role](https://docs.aws.amazon.com/lambda/latest/dg/lambda-intro-execution-role.html)
for step-by-step guidance; attach the policy above to the role you create.
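
If you prefer to stay in the CLI, the sketch below creates that role with the
standard Lambda trust policy and attaches the execution policy shown earlier.
The role name matches the `create-function` example above; `policy.json` is
an assumed local file holding that policy document.

```bash
# Sketch: create the execution role with the Lambda trust policy, then
# attach the execution-role policy from the previous section (saved
# locally as policy.json, an assumed file name).
aws iam create-role \
  --role-name StrandsEvalsLambdaRole \
  --assume-role-policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Principal": {"Service": "lambda.amazonaws.com"},
      "Action": "sts:AssumeRole"
    }]
  }'

aws iam put-role-policy \
  --role-name StrandsEvalsLambdaRole \
  --policy-name StrandsEvalsLambdaPolicy \
  --policy-document file://policy.json
```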

:::note
If your evaluator uses Amazon Bedrock (for example `HelpfulnessEvaluator`),
model access must be enabled in the Amazon Bedrock console before the
Lambda can invoke the judge model. See the
[model-access instructions](../quickstart.md) in the Evals SDK quickstart.
:::

## Register the code-based evaluator

With the Lambda deployed, the next step is to register it with AgentCore
Evaluations so that AgentCore knows which function to invoke for a given
evaluator. AgentCore's
[code-based evaluators dev guide](https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/code-based-evaluators.html)
covers the registration procedure in detail; this section summarizes what
registration accomplishes and the AWS-side IAM permissions AgentCore needs
to call the function.

Registration does three things. AgentCore records the Lambda ARN against a
new evaluator resource, assigns that resource an `evaluatorId` (the same
value that flows back into the `evaluatorId` field of every request the
handler receives), and grants itself invoke permissions on the function so
it can dispatch evaluation calls at runtime.

### AgentCore-side IAM policy

AgentCore needs permission to invoke and introspect the deployed Lambda.
The minimal policy below grants both, scoped to a single function ARN:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "InvokeEvaluatorLambda",
      "Effect": "Allow",
      "Action": [
        "lambda:InvokeFunction",
        "lambda:GetFunction"
      ],
      "Resource": "arn:aws:lambda:us-east-1:ACCOUNT_ID:function:strands-evals-helpfulness"
    }
  ]
}
```

AgentCore normally manages this policy automatically when you register the
evaluator through the console or API — the explicit policy above is shown
for reference and for deployments that prefer to attach it manually.

## Run an evaluation

With the Lambda deployed and the evaluator registered, AgentCore Evaluations
can invoke the code-based evaluator against the target agent's traces,
sessions, or tool calls as part of an evaluation run. The CLI invocation
below is a minimal starting point that references the registered evaluator
by ARN and points at an agent and session to score:

```bash
aws bedrock-agentcore-control start-evaluation-run \
  --evaluator-id EVALUATOR_ARN \
  --target agentId=AGENT_ID,sessionId=SESSION_ID \
  --region us-east-1
```

The exact command name and arguments follow the AWS dev guide's current
code-based evaluators API — if the invocation above doesn't match what
your AWS CLI expects, the AWS page linked earlier in this guide is the
source of truth.

The evaluator's scores appear in the AgentCore Evaluations dashboard — see
[AgentCore Evaluation Dashboard](./agentcore_evaluation_dashboard.md) for
how to interpret the results.

## TypeScript support

:::note
This guide covers only the Python Strands Evals SDK
(`strands-agents-evals`). TypeScript coverage is out of scope for now and is
tracked by the upstream feature request at
[strands-agents/evals#204](https://github.com/strands-agents/evals/issues/204).
It will land once the TypeScript Evals SDK ships.
:::

## Related documentation

- [Evals SDK quickstart](../quickstart.md) — install the SDK and write your
  first evaluator.
- [Evaluators overview](../evaluators/index.md) — the built-in evaluators
  available to wrap for AgentCore.
- [Custom evaluators](../evaluators/custom_evaluator.md) — build a
  domain-specific evaluator to wrap.
- [Deploy to AgentCore](../../deploy/deploy_to_bedrock_agentcore/index.md) —
  how to deploy the agent you're evaluating.
- [AgentCore Evaluation Dashboard](./agentcore_evaluation_dashboard.md) —
  interpret the scores your code-based evaluator produces.
diff --git a/src/content/docs/user-guide/evals-sdk/how-to/agentcore_code_evaluator_handler.py b/src/content/docs/user-guide/evals-sdk/how-to/agentcore_code_evaluator_handler.py
new file mode 100644
index 000000000..1d2b81d28
--- /dev/null
+++ b/src/content/docs/user-guide/evals-sdk/how-to/agentcore_code_evaluator_handler.py
@@ -0,0 +1,313 @@
"""AWS Lambda handler that wraps a Strands evaluator as an AgentCore
code-based evaluator.

This sample shows how to deploy an evaluator from the Strands Evals SDK
(``strands-agents-evals``) behind an AWS Lambda function that Amazon
Bedrock AgentCore Evaluations invokes with the code-based evaluator
contract. The handler parses the AgentCore request, runs a Strands
evaluator against the mapped session, and returns a response that
conforms to the AgentCore response schema.

The companion docs page walks through the code region by region:
``src/content/docs/user-guide/evals-sdk/how-to/agentcore-code-evaluators.mdx``.

Snippet references in that page resolve to this file via
``--8<-- "user-guide/evals-sdk/how-to/agentcore_code_evaluator_handler.py:"``.
"""

# --8<-- [start:imports]
import logging
import os
from typing import Any

from strands_evals.evaluators import HelpfulnessEvaluator
from strands_evals.mappers import StrandsInMemorySessionMapper
from strands_evals.types import EvaluationData, EvaluationOutput

logger = logging.getLogger(__name__)

# AgentCore code-based evaluator contract version echoed in responses.
SCHEMA_VERSION = "1.0"

# Env var used to override the pass/fail threshold at deploy time.
PASS_THRESHOLD_ENV = "STRANDS_EVAL_PASS_THRESHOLD"

# Default pass threshold when the evaluator does not set ``test_pass``.
DEFAULT_PASS_THRESHOLD = 0.5

# AgentCore ``evaluationLevel`` values, per the AgentCore contract:
# https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/code-based-evaluators.html
AGENTCORE_LEVEL_TOOL_CALL = "TOOL_CALL"
AGENTCORE_LEVEL_TRACE = "TRACE"
AGENTCORE_LEVEL_SESSION = "SESSION"
# --8<-- [end:imports]

# --8<-- [start:handler]
# Module-level evaluator instance reused across Lambda invocations so
# warm starts don't re-initialize the judge-model client on every call.
_EVALUATOR = HelpfulnessEvaluator()


def lambda_handler(event: dict[str, Any], context: Any) -> dict[str, Any]:
    """Entry point AgentCore Evaluations invokes for each scored unit.

    Parses the AgentCore request, runs the configured Strands evaluator,
    and returns a schema-valid AgentCore response. All failure modes are
    converted into schema-valid responses so AgentCore never observes an
    unhandled Lambda error. ``LevelMismatchError`` and ``_error_response``
    are defined alongside the other error-handling helpers below.
    """
    try:
        evaluation_data, level = _parse_payload(event)
        evaluation_output = _invoke_evaluator(
            _EVALUATOR, evaluation_data, level
        )
        return _build_response(event, evaluation_output, status="OK")
    except LevelMismatchError as exc:
        return _error_response(event, exc, label="NOT_RUN")
    except Exception as exc:  # noqa: BLE001 - graceful failure response
        logger.exception("Evaluator invocation failed")
        return _error_response(event, exc, label="ERROR")
# --8<-- [end:handler]

# --8<-- [start:parse_payload]
def _parse_payload(
    event: dict[str, Any],
) -> tuple[EvaluationData, str]:
    """Translate an AgentCore request into Strands evaluator input.

    Returns the ``EvaluationData`` the evaluator will score and the
    ``evaluationLevel`` string from the request so the caller can
    compare it against the evaluator's declared level.
    """
    # Version check. Unknown versions are logged but not fatal; the
    # request may still carry fields we know how to read.
    schema_version = event.get("schemaVersion")
    if schema_version is None:
        logger.warning("AgentCore request is missing schemaVersion")
    elif schema_version != SCHEMA_VERSION:
        logger.warning(
            "Unexpected schemaVersion %r (handler expects %r)",
            schema_version,
            SCHEMA_VERSION,
        )

    # ``evaluationLevel`` is required to pick the right evaluator path.
    # The outer handler converts ValueError into an ERROR response that
    # AgentCore can surface in the evaluation results.
    evaluation_level = event.get("evaluationLevel")
    if not evaluation_level:
        raise ValueError(
            "AgentCore request is missing required 'evaluationLevel'"
        )

    # Convert the OTel-style session spans into a Strands ``Session``.
    evaluation_input = event.get("evaluationInput") or {}
    session_spans = evaluation_input.get("sessionSpans") or []
    mapper = StrandsInMemorySessionMapper()
    session = mapper.map(session_spans)

    # Optional passthroughs. AgentCore only sends these fields when the
    # evaluation configuration references them.
    reference_inputs = event.get("evaluationReferenceInputs")
    evaluation_target = event.get("evaluationTarget")

    # Illustrative defaults: recover the prompt from the root span's
    # attributes and treat the final assistant message as the output.
    # Adjust these helpers to match the span and message shape your
    # agent actually records.
    input_text = _recover_prompt(session)
    actual_output = _recover_final_output(session)

    evaluation_data = EvaluationData(
        input=input_text,
        actual_output=actual_output,
        trajectory=session,
        reference_input=reference_inputs,
    )

    # ``evaluation_target`` is accepted for future use (for example
    # logging the target agent version alongside the score) but is not
    # consumed by ``EvaluationData`` today.
    _ = evaluation_target

    return evaluation_data, evaluation_level


def _recover_prompt(session: Any) -> str:
    """Pull the user prompt out of the mapped session.

    Illustrative: reads the root span's attributes. Real deployments
    should adapt this to match the span schema their agent emits.
    """
    root_span = getattr(session, "root_span", None)
    attributes = getattr(root_span, "attributes", {}) or {}
    return str(
        attributes.get("input") or attributes.get("prompt") or ""
    )


def _recover_final_output(session: Any) -> str:
    """Pull the final assistant message out of the mapped session.

    Illustrative: reads the last message on the session. Real
    deployments should adapt this to match the message shape their
    agent records.
    """
    messages = getattr(session, "messages", []) or []
    if not messages:
        return ""
    last = messages[-1]
    return str(getattr(last, "content", last) or "")
# --8<-- [end:parse_payload]

# --8<-- [start:invoke_evaluator]
# Strands evaluators expose their native level as a string. The sample
# wraps ``HelpfulnessEvaluator``, which is a ``TRACE_LEVEL`` evaluator.
STRANDS_OUTPUT_LEVEL = "OUTPUT_LEVEL"
STRANDS_TRACE_LEVEL = "TRACE_LEVEL"
STRANDS_SESSION_LEVEL = "SESSION_LEVEL"

# How each AgentCore ``evaluationLevel`` maps to a Strands level.
_AGENTCORE_TO_STRANDS_LEVEL = {
    AGENTCORE_LEVEL_TOOL_CALL: STRANDS_OUTPUT_LEVEL,
    AGENTCORE_LEVEL_TRACE: STRANDS_TRACE_LEVEL,
    AGENTCORE_LEVEL_SESSION: STRANDS_SESSION_LEVEL,
}


def _invoke_evaluator(
    evaluator: Any,
    evaluation_data: EvaluationData,
    level: str,
) -> EvaluationOutput:
    """Run the evaluator when the request level matches its Strands level.

    Raises ``LevelMismatchError`` when the incoming AgentCore level does
    not correspond to the evaluator's declared Strands level. The outer
    handler converts that into a schema-valid "not-run" response so
    AgentCore never observes an unhandled Lambda error. The
    ``LevelMismatchError`` class is defined in the error-handling region
    below; Python resolves the name at call time.
    """
    # Every Strands evaluator declares the level at which it operates.
    # ``HelpfulnessEvaluator`` is ``TRACE_LEVEL``; ``OutputEvaluator`` is
    # ``OUTPUT_LEVEL``; ``TrajectoryEvaluator`` is ``SESSION_LEVEL``.
    # The fallback matches this sample's default evaluator.
    evaluator_level = getattr(evaluator, "level", STRANDS_TRACE_LEVEL)

    # ``.get(...)`` returns ``None`` for unknown AgentCore levels, which
    # never matches ``evaluator_level`` and so raises the mismatch error.
    expected_strands_level = _AGENTCORE_TO_STRANDS_LEVEL.get(level)
    if expected_strands_level != evaluator_level:
        raise LevelMismatchError(
            f"Evaluator level {evaluator_level!r} does not support "
            f"AgentCore evaluationLevel {level!r}"
        )

    return evaluator.evaluate(evaluation_data)
# --8<-- [end:invoke_evaluator]

# --8<-- [start:build_response]
def _build_response(
    event: dict[str, Any],
    evaluation_output: EvaluationOutput,
    status: str,
) -> dict[str, Any]:
    """Assemble an AgentCore response from a Strands evaluator output.

    ``status`` is accepted for symmetry with ``_error_response(label=...)``
    so the docs can explain the two response helpers in parallel. The
    happy path always derives ``label`` from ``passed`` (or an explicit
    evaluator label), so the flag is not consumed here.
    """
    _ = status  # reserved for parity with _error_response

    # Pass threshold for turning numeric scores into pass/fail booleans.
    # Read per-invocation so tests and deploys can override via env var.
    try:
        threshold = float(
            os.environ.get(PASS_THRESHOLD_ENV, DEFAULT_PASS_THRESHOLD)
        )
    except (TypeError, ValueError):
        threshold = DEFAULT_PASS_THRESHOLD

    score = getattr(evaluation_output, "score", None)
    reason = getattr(evaluation_output, "reason", None)

    # Strands evaluators may set ``test_pass`` directly (deterministic
    # evaluators often do); otherwise fall back to the threshold compare.
    explicit_pass = getattr(evaluation_output, "test_pass", None)
    if explicit_pass is not None:
        passed = bool(explicit_pass)
    else:
        passed = score is not None and score >= threshold

    # Prefer an evaluator-provided label when one exists; otherwise map
    # the boolean into the AgentCore ``PASS`` / ``FAIL`` vocabulary.
    explicit_label = getattr(evaluation_output, "label", None)
    label = explicit_label if explicit_label else (
        "PASS" if passed else "FAIL"
    )

    return {
        "schemaVersion": event.get("schemaVersion", SCHEMA_VERSION),
        "results": [
            {
                "evaluatorId": event.get("evaluatorId"),
                "evaluatorName": event.get("evaluatorName"),
                "evaluationLevel": event.get("evaluationLevel"),
                "score": score,
                "passed": passed,
                "label": label,
                "reason": reason,
            }
        ],
    }
# --8<-- [end:build_response]

# --8<-- [start:error_handling]
class LevelMismatchError(Exception):
    """Raised when AgentCore's ``evaluationLevel`` does not map to the
    wrapped evaluator's declared Strands level.

    The outer ``try/except`` in ``lambda_handler`` catches this and
    converts it into a schema-valid "not-run" response (``score=None``,
    ``passed=False``, ``label="NOT_RUN"``) so AgentCore surfaces a
    graceful result instead of an unhandled Lambda error.
    """


def _error_response(
    event: dict[str, Any],
    exc: Exception,
    label: str,
) -> dict[str, Any]:
    """Build a schema-valid AgentCore response for a failed invocation.

    ``label`` is ``"NOT_RUN"`` for level-mismatch failures (the
    evaluator is healthy but cannot score this ``evaluationLevel``) and
    ``"ERROR"`` for everything else (validation failure, evaluator
    raise, Bedrock error). ``score`` is ``None`` and ``passed`` is
    ``False`` in both cases so AgentCore can distinguish a missing
    score from a real ``0.0``. ``reason`` carries a short summary
    (exception class and message); full tracebacks go to CloudWatch
    via ``logger.exception(...)`` in ``lambda_handler``.
    """
    reason = f"{type(exc).__name__}: {exc}"
    return {
        "schemaVersion": event.get("schemaVersion", SCHEMA_VERSION),
        "results": [
            {
                "evaluatorId": event.get("evaluatorId"),
                "evaluatorName": event.get("evaluatorName"),
                "evaluationLevel": event.get("evaluationLevel"),
                "score": None,
                "passed": False,
                "label": label,
                "reason": reason,
            }
        ],
    }
# --8<-- [end:error_handling]
diff --git a/src/content/docs/user-guide/evals-sdk/quickstart.mdx b/src/content/docs/user-guide/evals-sdk/quickstart.mdx
index 5d769b152..08515c005 100644
--- a/src/content/docs/user-guide/evals-sdk/quickstart.mdx
+++ b/src/content/docs/user-guide/evals-sdk/quickstart.mdx
@@ -544,3 +544,4 @@ Ready to dive deeper? Explore these resources:
 - [Experiment Generator](experiment_generator.md) - Automatically generate comprehensive test suites
 - [Serialization](how-to/serialization.md) - Save, load, and version your evaluation experiments
 - [Detectors](detectors/index.md) - Automatic failure detection and root cause analysis for agent traces
+- [Deploy to AgentCore Evaluators](how-to/agentcore-code-evaluators.md) - Wrap a Strands evaluator as an AgentCore code-based evaluator and run it against AgentCore sessions