Top AI models fall short when tasked with playing classic Pokémon


Time reports that three of the world’s smartest AI systems—GPT 5.2, Claude Opus 4.5, and Gemini 3 Pro—are attempting, live on Twitch, to beat classic Pokémon games but, by human standards, are slow, overconfident, and often confused. The experiment began last February, when an Anthropic researcher livestreamed Claude Sonnet 3.7 playing Pokémon Red; Sonnet could not get past the opening, and earlier models “wandered aimlessly or got stuck in loops.” Anthropic’s Opus 4.5 is performing much better but still frequently gets stuck—for example, it spent four days circling a gym without entering because it did not realize (or could not see) that it had to cut down a tree.

Google’s Gemini models completed an equivalent game last May, but comparisons are complicated because each model runs inside a different “harness”: Gemini’s harness translates visuals into text and supplies custom tools, while Claude has been given a more minimal one. Pokémon is useful for testing long-term planning because play is turn-based: models receive a screenshot and a prompt, then output a single action. Opus 4.5 has been playing for over 500 hours and is on step 170,000, with each step starting afresh—like an amnesiac relying on post-it notes.

As Joel Zhang puts it, the core challenge is “how well it can stick to doing a task over a long time horizon,” and Peter Whidden adds that models “know everything about Pokémon” but tend to bumble the execution.


Key Topics

Tech, Anthropic, Google, Pokémon, Twitch