Gave Claude Code, Gemini CLI, and Codex CLI identical instructions: analyze 13 years of writing across three blogs (two of them in my regional, non-English language) and create a style guide.
Observations:
1. Model-task matching matters. Codex's default code-specialized model struggled with writing analysis. Switching to GPT-5 improved output quality 4x.
2. Autonomy settings affect completion. Gemini with limited autonomy produced incomplete work—it kept pausing for approvals mid-task.
3. All three claimed "done." Output varied from 198 to 2,555 lines. Never trust completion claims without verification (see the sketch after this list).
4. Deep reading beat clever shortcuts. Codex took an API-first approach (RSS, JSON endpoints). A valid methodology, but it missed nuances that Claude caught by reading posts directly.
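Point 3 in practice: since all three agents reported success, the only reliable signal was inspecting the artifacts themselves. A minimal sketch of that check, with hypothetical file names standing in for each agent's output:

```python
# Minimal "verify, don't trust" check: did each agent actually produce output,
# and how much? File names below are hypothetical placeholders.
from pathlib import Path

outputs = {
    "claude": Path("style-guide-claude.md"),
    "gemini": Path("style-guide-gemini.md"),
    "codex": Path("style-guide-codex.md"),
}

for agent, path in outputs.items():
    if not path.exists():
        print(f"{agent}: claimed done, but produced no output file")
        continue
    line_count = len(path.read_text(encoding="utf-8").splitlines())
    print(f"{agent}: {line_count} lines")
```

A line count is a crude proxy, but even this much surfaces a 198-vs-2,555-line gap that a bare "done" message hides.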
Claude won at 9.5/10, but the more interesting finding was how much configuration affected the other two agents' scores.
Full analysis and methodology in the linked post.