Agent Persona Exploration - 2026-05-17 #32837
Closed
Replies: 1 comment
This discussion has been marked as outdated by Agent Persona Explorer. A newer discussion is available at Discussion #33589.
Overview
Seven scenarios tested across two runs (2026-05-16 and 2026-05-17), covering all five software worker personas. Overall average score: 4.2/5.0. The agent consistently excels at PR-triggered analysis workflows and struggles with scenarios requiring external infrastructure tooling (cloud CLIs, credentials, network allowlisting).
Key Findings
- `pull_request` triggers with `add-comment` safe-output consistently scored 4.4–4.8/5
- `schedule` + `create-issue`/`create-discussion` with `skip-if-match` dedup works cleanly
- `visual-regression.md` and `test-coverage.md` reference prompts give the agent a major quality boost for those domains
Quality Scores by Scenario
Top Patterns Observed
- `pull_request` (opened/synchronize) with optional `paths:` filter
- `github` (gh-proxy), `bash`, `playwright` (browser scenarios), `cache-memory` (persistence)
- `skip-if-match` deduplication on scheduled runs
High Quality Responses (scores ≥ 4.4)
S5 — Visual Regression (Frontend, 4.6/5)
Near-perfect scenario fit. The agent maps directly to `.github/aw/visual-regression.md` as a reference, applies Playwright with SSRF protection (localhost-only), uses `cache-memory` for baseline persistence, and correctly rate-limits `add-comment` with `max:1`. Only gap: it should ask which app server command to use (`storybook`, `vite`, etc.).
S3 — Test Coverage (QA, 4.8/5)
The agent correctly routes to the `test-coverage.md` prompt and applies coverage diff analysis with clear guidance on thresholds. Best-performing scenario across both runs.
S7 — API Docs (Backend, 4.4/5)
Clean path-filtered trigger, correct token-cost mitigation (scope analysis to changed files only), and the `update-comment` pattern to avoid PR comment spam on re-runs.
Areas for Improvement (scores ≤ 3.8)
S2 — Deployment Monitoring (DevOps, 3.2/5)
The `workflow_run` trigger is correct, but the required `actions:read` permission is easy to miss. The scenario implies multi-stage behavior (wait for deployment to stabilize) that the single-job model cannot support; the agent should explicitly call this out.
S6 — Terraform Drift (DevOps, 3.8/5)
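The framework mechanics are sound, but the agent may generate a syntactically valid workflow that silently fails at runtime because the external infrastructure it depends on (cloud CLIs, credentials, network allowlisting, as noted in the Overview) is not in place.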
The agent should proactively surface these blockers rather than generate a workflow that appears complete.
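One way to surface such blockers up front is a keyword check that runs before any workflow is generated. The sketch below is a hypothetical illustration, not part of the agent or of `.github/aw/create-agentic-workflow.md`: the names `OPS_KEYWORDS`, `CHECKLIST`, and `ops_checklist` and the question wording are assumptions, while the keywords and the four checklist items come from this report's recommendations.

```python
import re

# Keywords named in this report's recommendations as signals of a
# "complex ops" request.
OPS_KEYWORDS = ("cloud", "terraform", "kubernetes", "deploy")

# The four items the report says the agent should ask about.
# The question wording is illustrative, not the agent's actual prompts.
CHECKLIST = (
    "Which cloud provider does this target?",
    "What is the credentials strategy?",
    "Which binaries must be available on the runner?",
    "What is the expected run time?",
)


def ops_checklist(request: str) -> list[str]:
    """Return the checklist questions if the request looks like an ops scenario.

    Hypothetical helper: matches keywords case-insensitively at word starts,
    so "deploy" also matches "deployment".
    """
    text = request.lower()
    for keyword in OPS_KEYWORDS:
        if re.search(rf"\b{re.escape(keyword)}", text):
            return list(CHECKLIST)
    return []
```

A request such as "Detect Terraform drift nightly" would return all four questions, while a pure PR-analysis request would return an empty list and proceed without the extra prompts.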
Recommendations
- Add a "complex ops checklist" to `.github/aw/create-agentic-workflow.md`: when the agent detects keywords like "cloud", "terraform", "kubernetes", or "deploy", it should automatically prompt the user for cloud provider, credentials strategy, required binaries, and expected run time. This prevents silent workflow failures on DevOps scenarios.
- Expand `.github/aw/create-agentic-workflow.md` with `workflow_run` trigger guidance: document the `actions:read` permission requirement and the single-job constraint (no waiting for deployment stabilization). S2 and similar monitoring scenarios are blocked by this knowledge gap.
- Add the `update-comment` pattern to the PR workflow template in `.github/aw/create-agentic-workflow.md`: the "check for existing comment before posting" pattern (to avoid spam on `synchronize` events) came up in multiple scenarios (S1, S5, S7) and should be a first-class recommendation for any PR-triggered workflow that posts comments.
Runs: §25995345892 | Previous: 2026-05-16, 2026-05-15
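The "check for existing comment before posting" pattern from the recommendations can be sketched in a few lines. This is an illustration under stated assumptions, not the framework's `update-comment` implementation: comments are modeled as plain dicts with `id` and `body` keys, and the hidden marker string and the names `MARKER` and `plan_comment` are hypothetical.

```python
from typing import Optional

# Hypothetical hidden marker embedded in every comment the workflow posts,
# so later runs can recognize their own earlier comment.
MARKER = "<!-- agent-report -->"


def plan_comment(existing: list[dict], report: str) -> dict:
    """Decide whether to update an earlier comment or create a new one.

    `existing` is an assumed list of {"id": int, "body": str} dicts for the PR.
    Returns an action dict; performing the API call is left to the caller.
    """
    body = f"{MARKER}\n{report}"
    target: Optional[dict] = next(
        (c for c in existing if MARKER in c.get("body", "")), None
    )
    if target is not None:
        # A previous run already commented: edit it in place.
        return {"action": "update", "id": target["id"], "body": body}
    # First run on this PR: post a fresh comment.
    return {"action": "create", "body": body}
```

With this shape, the first run on a PR resolves to a create action, and every later run triggered by a `synchronize` event resolves to an update, so the PR carries a single comment.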