Gave Claude Code, Gemini CLI, and Codex CLI identical instructions: analyze 13 years of writing across three blogs (two of them in my regional, non-English language) and create a style guide.
Observations:
1. Model-task matching matters. Codex's default code-specialized model struggled with writing analysis. Switching to GPT-5 improved output quality 4x.
2. Autonomy settings affect completion. Gemini with limited autonomy produced incomplete work—it kept pausing for approvals mid-task.
3. All three claimed "done." Output varied from 198 to 2,555 lines. Never trust completion claims without verification (see the sketch after this list).
4. Deep reading beat clever shortcuts. Codex took an API-first approach (RSS, JSON endpoints). A valid methodology, but it missed nuances that Claude caught by reading posts directly.
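Point 3 in practice: since all three agents reported success, the only reliable signal was inspecting the artifacts themselves. A minimal sketch of that check, with hypothetical file names standing in for each agent's output:

```python
# Minimal "verify, don't trust" check: did each agent actually produce output,
# and how much? File names below are hypothetical placeholders.
from pathlib import Path

outputs = {
    "claude": Path("style-guide-claude.md"),
    "gemini": Path("style-guide-gemini.md"),
    "codex": Path("style-guide-codex.md"),
}

for agent, path in outputs.items():
    if not path.exists():
        print(f"{agent}: claimed done, but produced no output file")
        continue
    line_count = len(path.read_text(encoding="utf-8").splitlines())
    print(f"{agent}: {line_count} lines")
```

A line count is a crude proxy, but even this much surfaces a 198-vs-2,555-line gap that a bare "done" message hides.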
Claude won at 9.5/10, but the more interesting finding was how much configuration affected the other two agents' scores.
Full analysis and methodology in the linked post.