Six Models, Six Temperaments

These days, I'm using quite a few models daily. One thing I've noticed: every model has a different temperament. Even within the same company, the style changes completely from one generation to the next. Figuring out who's good at what and where they tend to fail has become a collaboration skill in itself.

Opus 4.6—Reckless, but Effective

Claude Opus 4.6 is still my most frequently used.

It excels at engineering and executes efficiently, often offering solid approaches to complex problems. When it hits bugs, it diagnoses them quickly, saving significant time compared to other models. During architecture design, it proactively asks about your intentions: rather than just starting to code, it first lays out six different aspects for you to evaluate before getting to work. At aligning with what the human actually wants, it is currently the best of all the models I've used.

The downsides are equally obvious.

It has no aesthetic sense for UI. The web pages it produces are ugly and stiff, and when you ask it to fix them, it doesn't know which direction to take. This model is quite something.

Another issue is recklessness. It charges ahead before fully understanding the requirements. You test it, and it's wrong. You tell it so, it looks back and realizes it missed this or that, fixes it, you test again, and it's still wrong. This back-and-forth happens frequently. It looks fast, but the speed is sometimes illusory: it gets overly optimistic, decides it's done, and wraps up without proper verification.

However, the ecosystem is well-built. Claude Code plus Chrome extensions, PowerPoint plugins, Excel plugins—one account covers many scenarios. It's still the first choice for automation.

GPT 5.4—Can Chew Through Hard Problems, But Doesn't Listen Well

I used Codex 5.3 before; it was mediocre overall, though good at catching bugs and decent at code review. 5.4 is stylistically very different from 5.3.

My impression is that it isn't necessarily efficient, but it is thorough. Tasks that Opus fails at once, twice, or three times, 5.4 can solve in one go. Of course, it takes longer: it keeps testing, checking, and revising its own work, grinding away slowly.

Alignment with humans is weaker. I set it to maximum effort and had it write an architecture design document; it finished without asking many questions. One look, and it was vastly different from what I had in mind. Later, I had it write code according to the document, and halfway through it went off-track again: locally neat and tidy, but the overall direction was skewed. It suddenly created some inexplicable module and kept polishing it. I stopped it right there.

Many people have told me the same: 5.4 always wants to innovate, using strange ways to solve problems. You give it SOPs and guidance; it doesn't necessarily follow them.

But Computer Use surprised me. I had it help configure a WeChat customer service backend—no Chrome extensions, pure browser automation. I just scanned a QR code to let it into the backend; it clicked through a bunch of configurations, found something wrong and went back to fix it, back and forth for about thirty minutes until everything was configured. I don't even know how it figured it out.

There's another interesting play. I have Opus write something, then have 5.4 critique and review it. 5.4 is quite sharp in its criticism. Then I throw 5.4's comments at Opus, turn on Max effort, and let it handle it. Every time, Opus responds with something like "this comment is very sharp, precise, and makes sense," then obediently makes the changes. Haha, they actually respect each other.
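The write, critique, revise loop described above can be sketched in a few lines. This is only an illustration under stated assumptions: a "model" is modeled as any function from prompt text to response text, the names `write_review_revise`, `writer`, and `reviewer` are hypothetical, and no real vendor API is called. Plug in whatever clients you actually use.

```python
from typing import Callable

# Assumption: a "model" is any function that maps a prompt to a response.
# Real API clients are deliberately left out of this sketch.
Model = Callable[[str], str]


def write_review_revise(task: str, writer: Model, reviewer: Model, rounds: int = 1) -> str:
    """Draft with `writer`, critique with `reviewer`, then let `writer` revise."""
    draft = writer(task)
    for _ in range(rounds):
        # The reviewer only sees the draft, not the writer's reasoning.
        critique = reviewer(f"Critique and review the following work:\n\n{draft}")
        # The writer gets the task, its own draft, and the critique back.
        draft = writer(
            f"Task:\n{task}\n\n"
            f"Your previous draft:\n{draft}\n\n"
            f"Reviewer comments:\n{critique}\n\n"
            "Revise the draft to address the comments."
        )
    return draft
```

In the setup from this post, `writer` would wrap Opus and `reviewer` would wrap 5.4. Keeping the two roles as separate functions makes it easy to swap either side without touching the loop.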

So what's 5.4 good for? Chewing through hard problems that other models can't handle, then doing final checks and reviews afterward. It can also give good advice on deep thinking, but letting it lead the direction easily leads to over-engineering.

Gemini 3.1 Pro—Good Eye, Poor Hands

Google's Gemini 3.1 Pro. First off, this model has too many barriers to entry—all kinds of verification and roadblocks. Why is using a model such a hassle?

Finally got it working. For frontend design, it has its own ideas. Recently I used it to update my website; it looked at it and found my theme was "weightlessness," but the original block animation was falling with normal gravity, contradicting the theme. It proactively suggested changing it to a weightless effect, and the result was quite interesting. Previously, I'd shown the same page to other models; none of them suggested this.

Here's the problem—after the change, errors. Change again, still errors. Change again, still errors. It can see what's ugly, but can't fix the bugs after modifying. Throw the same error at Opus, and it diagnoses and fixes it in one go. Gemini tells you "all good, take a look," but when you look, there are still problems. That's about how it goes.

GLM 5—Strong Start, Weak Finish

Among Chinese domestic models, Zhipu's GLM 5 feels most like early Opus 4.5. Straightforward and direct, it gets straight to work, and it is precise and efficient.

But around twenty steps in, things change. It starts deviating from initial instructions, and its global tracking ability falls far behind Opus. After twenty steps, it might run off to some strange branch and keep expanding there. Opus 4.6 can execute continuously for two or three hours without drifting much; GLM starts drifting at twenty steps. Within twenty steps, not much difference; beyond twenty steps, huge difference.

MiniMax 2.5—Cheap and Fast, but Overcomplicates Simple Things

MiniMax feels pretty good—cheap and fast. But it over-engineers ridiculously: you give it a simple problem, it comes up with an extremely complex solution. You look at it and think, this was clearly just a few lines of code—how did it become this?

Instruction following is decent, but it gives up too easily. Push it a bit and it says "can't do this, try another approach." A cost-effective alternative for daily use.

Kimi K2.5—Sweet Talker, Clumsy Hands

I also use Kimi K2.5 quite a bit. It's fast, at one hundred tokens per second, with decent frontend output and native image understanding.

It has lots of small issues. During execution, one thing after another goes wrong, with frequent syntax errors. What's worse is that it's particularly good at sweet-talking: what it says sounds reasonable but is specious, and it doesn't match how the software actually behaves. You read its response and think "hmm, that makes sense," but when you execute it, you find that's not the case.

However, when writing articles or giving feedback, its language is conversational and interesting to read. A very intriguing model, but if you're relying on it to get work done, you need to verify multiple times.

So What?

No single model can do everything. Sometimes switching models works better than tuning prompts. Use Opus as the main workhorse, 5.4 for hard problems and reviews, and occasionally consult Gemini for UI. The Chinese domestic models offer good value for money and are comfortable for short tasks. Understanding each model's temperament is far more effective than stubbornly sticking with one.
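As a rough sketch, the division of labor above amounts to a small routing table. Everything here is illustrative: the `TaskKind` categories and the `candidates` helper are hypothetical, and the strings are display names from this post, not API model identifiers.

```python
from enum import Enum


class TaskKind(Enum):
    GENERAL = "general"
    HARD_PROBLEM = "hard_problem"
    REVIEW = "review"
    UI_DESIGN = "ui_design"
    SHORT_TASK = "short_task"


# Display names only (an assumption for illustration), listed in no
# particular order of preference where there is more than one.
ROUTES: dict[TaskKind, list[str]] = {
    TaskKind.GENERAL: ["Opus 4.6"],
    TaskKind.HARD_PROBLEM: ["GPT 5.4"],
    TaskKind.REVIEW: ["GPT 5.4"],
    TaskKind.UI_DESIGN: ["Gemini 3.1 Pro"],
    TaskKind.SHORT_TASK: ["GLM 5", "MiniMax 2.5", "Kimi K2.5"],
}


def candidates(kind: TaskKind) -> list[str]:
    """Return the models worth trying for a task, falling back to the workhorse."""
    return ROUTES.get(kind, ROUTES[TaskKind.GENERAL])
```

The point of writing it down is less the code than the habit: when a task stalls, change the row you are in before you rewrite the prompt.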

Originally published at https://guanjiawei.ai/en/blog/ai-model-field-notes
