"It compiles" isn't "it works." So the agent now does the checks you'd do yourself before it tells you a task is finished — and it remembers what it learns.
What changed
- Runs your tests. Before a goal step is marked done, the agent runs the project's test suite. A real failure isn't ignored — it's fed back as guidance and the agent tries again.
- Opens the app in a real browser. It launches the running app in a headless browser and flags console errors and failed requests, so a change that builds but breaks the page doesn't slip through.
- Shows you the proof. Test results and the browser check travel with the task, so you can see why the agent considers the work done.
- Learns each repo. The agent recalls prior work on a repo (past changes, conventions, where things live), so the next task starts with context instead of from scratch. The more you use an agent on a codebase, the more context it has to draw on.
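To make the verify-then-retry behavior above concrete, here is a minimal sketch of such a loop. It is not the agent's actual implementation: the names `Evidence`, `StepResult`, `run_test_command`, and `complete_step`, and the three-attempt limit, are assumptions made up for illustration. The `check` callable stands in for any verification, whether a test run or a browser check.

```python
import subprocess
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Sequence, Tuple


@dataclass
class Evidence:
    """One verification attempt, kept with the task as proof (hypothetical shape)."""
    attempt: int
    passed: bool
    output: str


@dataclass
class StepResult:
    """Outcome of a goal step plus every check that was run for it."""
    done: bool
    evidence: List[Evidence] = field(default_factory=list)


def run_test_command(command: Sequence[str]) -> Tuple[bool, str]:
    """Run a test command; a zero exit code counts as a pass."""
    proc = subprocess.run(command, capture_output=True, text=True)
    return proc.returncode == 0, proc.stdout + proc.stderr


def complete_step(
    apply_change: Callable[[Optional[str]], None],
    check: Callable[[], Tuple[bool, str]],
    max_attempts: int = 3,  # assumed limit, not taken from the release note
) -> StepResult:
    """Apply a change, verify it, and retry with the failure output as guidance."""
    evidence: List[Evidence] = []
    guidance: Optional[str] = None  # no guidance on the first attempt
    for attempt in range(1, max_attempts + 1):
        apply_change(guidance)
        passed, output = check()
        evidence.append(Evidence(attempt, passed, output))
        if passed:
            return StepResult(done=True, evidence=evidence)
        guidance = output  # feed the failure back into the next attempt
    # Attempts exhausted: the step is not marked done, but the evidence is kept.
    return StepResult(done=False, evidence=evidence)
```

Assuming a project whose suite runs with pytest, a call might look like `complete_step(apply_change, lambda: run_test_command(["pytest", "-q"]))`, where `apply_change` is whatever function makes the edit. Returning the evidence list even on failure mirrors the "shows you the proof" idea: the record is there whether or not the step passed.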
Why
Trust comes from evidence. An agent that quietly verifies itself and gets to know your code is one you can actually hand bigger work to.