Six Models, Six Personalities

I use a lot of models daily now. One thing I've realized: every model has a different personality. Even within the same company, the style changes completely between generations. Figuring out which model is good at what, and where each is prone to fail, has itself become a skill in working with them.

Opus 4.6—Reckless, but Capable

Claude Opus 4.6 is still my most-used model.

Strong on engineering, efficient in execution, and it often gives solid approaches to complex problems. It diagnoses bugs quickly, saving significant time compared to other models. When doing architecture design, it actively asks about your intent: instead of coding blindly, it first digs into six aspects and asks you to weigh in before getting started. On alignment with human intent, it's currently the best among the models I've used.

The downsides are obvious too.

No sense of aesthetics for UI. The web pages it builds are ugly and stiff, and when you ask it to fix them, it doesn't know where to start. It's quite amusing to watch.

Another issue is recklessness. It charges ahead before fully understanding things. You test it: wrong. You tell it it's wrong, it looks back and realizes it missed this or that, fixes it, you test again: still wrong. This back-and-forth happens frequently. It looks fast, but sometimes the speed is hollow; it gets overly optimistic, assumes it's done, and wraps up without checking.

But the ecosystem is well-built. Claude Code plus Chrome extensions, PPT plugins, Excel plugins—one account covers many scenarios. Still the first choice for automation.

GPT 5.4—Can Chew Through Hard Problems, But Doesn't Listen Well

I used Codex 5.3 before; it was mediocre overall, though good at catching bugs and decent at review. 5.4 is very different in style from 5.3.

My impression is that it isn't necessarily efficient, but it is thorough. Tasks that Opus fails at once, twice, or three times, 5.4 can handle in one go. Of course, it takes longer: it keeps testing, checking, and revising its own work, slowly grinding through.

Alignment with human intent is weaker. I had it write an architecture design doc at maximum effort; it finished without asking many questions. When I looked at the result, it was quite different from what I had in mind. Later, when I had it write code according to the doc, it drifted off again: locally neat and tidy, but the overall direction was skewed. At one point it created some inexplicable module and kept polishing it. I stopped it right there.

Many people have told me that 5.4 always wants to innovate, using weird ways to solve problems. You give it SOPs and guidance; it doesn't necessarily follow them.

But Computer Use surprised me. I had it help configure a WeChat customer service backend—no Chrome extension, pure browser automation. I just scanned a QR code to let it into the backend; it clicked through a bunch of configurations, found something wrong and went back to fix it, back and forth for about thirty minutes until everything was configured. I still don't know how it figured it out.

There's another interesting workflow. I have Opus write something, then have 5.4 critique and review it. 5.4 is sharp in its criticism. Then I feed 5.4's comments back to Opus, turn on Max effort, and let it handle it. Every time, Opus says something like "this comment is very sharp, precise, and makes sense," then honestly fixes it. Haha, they actually respect each other.
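The write, review, revise loop described above can be sketched roughly as follows. This is only an illustration: `writer` and `reviewer` are hypothetical callables standing in for whatever client you use to reach each model (say, Opus as the writer and 5.4 as the reviewer), and the prompt wording is an assumption, not something taken from a real tool.

```python
from typing import Callable, List, Tuple

# Hypothetical type: a function that takes a prompt and returns the model's text.
Model = Callable[[str], str]


def write_review_revise(
    task: str, writer: Model, reviewer: Model, rounds: int = 1
) -> Tuple[str, List[str]]:
    """Have one model draft, another critique, then feed the critique back.

    Returns the final draft and the list of all drafts produced.
    """
    draft = writer(task)
    history = [draft]
    for _ in range(rounds):
        critique = reviewer(draft)
        # An empty critique is treated as "nothing left to fix" (an assumption).
        if not critique.strip():
            break
        revise_prompt = (
            f"{task}\n\n"
            f"Previous draft:\n{draft}\n\n"
            f"Reviewer comments:\n{critique}\n\n"
            "Revise the draft to address the comments."
        )
        draft = writer(revise_prompt)
        history.append(draft)
    return draft, history
```

Keeping the two models behind plain callables means the same loop works whichever pair you put in the writer and reviewer seats.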

So what is 5.4 good for? Cracking the hard problems that other models can't handle, then doing final checks and review. It can also give good advice when deep thinking is needed, but letting it lead the direction tends toward over-engineering.

Gemini 3.1 Pro—Good Eye, Bad Hands

Google's Gemini 3.1 Pro. First off, this model has too many barriers to entry—all kinds of verification and roadblocks. Why is using a model such a hassle?

I finally got it working. It has its own ideas about frontend design. Recently I used it to update my website; it looked at the page and noticed that my theme was "weightlessness," but the existing block animation fell under normal gravity, contradicting the theme. It proactively suggested changing it to a weightless effect, which turned out quite interesting after the change. I'd had other models look at the same page before; none of them suggested this.

Here's the problem: after the change, errors. Change again, still errors. Change again, more errors. It can tell what's ugly, but once it has changed something, it can't fix its own bugs. Throw the same error at Opus; it investigates and fixes it in one go. Gemini tells you "it should be fine now, take a look," and you look: still problems. That's roughly how it goes.

GLM 5—Strong Start, Weak Finish

Among domestic (Chinese) models, Zhipu's GLM 5 feels most like early Opus 4.5. Straightforward, gets straight to work, precise and efficient.

But around step twenty, things change. It stops following the initial instructions, and its ability to keep track of the whole task falls behind Opus. It might run off down some weird branch and keep expanding there. Opus 4.6 can execute continuously for two or three hours without drifting much; GLM starts drifting at step twenty. Within twenty steps, the difference between them isn't big; beyond that, it's huge.

MiniMax 2.5—Cheap and Fast, but Overcomplicates Simple Things

MiniMax feels good to use, cheap and fast. But it over-engineers ridiculously: you give it a simple problem, and it comes up with an extremely complex solution. You look at it and think, this should be just a few lines of code, how did it end up like this?

Instruction following is decent, but it tends to give up too early. Push it a bit and it says "can't do this, try another approach." A cost-effective alternative for daily use.

Kimi K2.5—Sweet Talker, Clumsy Hands

I use Kimi K2.5 quite a bit too. Fast, at a hundred tokens per second, with good frontend work and native image understanding.

Lots of small issues. During execution, this is wrong and that's wrong, syntax errors everywhere. Worse still, it's especially good at sweet-talking: what it says sounds plausible but doesn't match how the software actually behaves. You read its response and think "hmm, that makes sense," then execute it and find that's not the case.

But when it writes articles or gives feedback, the language is colloquial and interesting to read. A very amusing model, but if you're relying on it for work, you need to verify multiple times.

So What?

No single model can do everything. Switching models sometimes works better than tuning prompts. My current lineup: Opus as the main workhorse, 5.4 for hard problems and review, and Gemini consulted occasionally for UI. Domestic models offer good value for money and are comfortable for short tasks. Figuring out each model's personality is far more effective than stubbornly sticking with one.
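As a rough illustration of that division of labor, here is a tiny routing table. The task categories and the `pick_model` helper are hypothetical names made up for this sketch, the values are just labels from this post rather than real API identifiers, and putting GLM 5 in the short-task slot is one arbitrary pick among the domestic models.

```python
# Hypothetical task kinds mapped to the models this post would reach for.
ROUTES = {
    "hard_problem": "GPT 5.4",
    "review": "GPT 5.4",
    "ui_design": "Gemini 3.1 Pro",
    "short_task": "GLM 5",  # assumption: any of the domestic models could sit here
}

# The main workhorse handles everything that has no special route.
DEFAULT_MODEL = "Opus 4.6"


def pick_model(task_kind: str) -> str:
    """Return the model label for a task kind, falling back to the workhorse."""
    return ROUTES.get(task_kind, DEFAULT_MODEL)
```

For example, `pick_model("review")` returns the 5.4 label, while an unlisted kind such as `"automation"` falls back to Opus.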

Originally published at https://guanjiawei.ai/en/blog/ai-model-field-notes
