Binary Tree Serialization Test: 11 Models, 7 Perfect Scores, 4 Zeros

All 11 models were given the same binary tree serialization problem, which required them to "return code only, explicitly encode null nodes, and produce consistent stable results." The final scores split sharply: 7 models scored a perfect 100, while 4 scored zero.

Common characteristics of full-score models

Doubao Pro, Qwen3 Max, ERNIE 4.5, Grok 4, Claude Sonnet 4.6, Claude Opus 4.7, and GPT-o3 all used preorder traversal with an explicit null-node marker.

Typical implementations are as follows:

  • Null nodes are uniformly represented as "#" or "null"
  • The serializer returns a comma-separated string directly, with no extra class wrapper
  • Deserialization consumes the tokens with an iterator or pop(0) to reconstruct the tree structure
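
The approach in the list above can be sketched roughly as follows. This is a minimal illustration, not any model's actual output: the TreeNode class, its field names, and the choice of "#" as the null marker are assumptions made for the example.

```python
from typing import Iterator, List, Optional


class TreeNode:
    """Minimal binary tree node (name and fields are assumed for this sketch)."""

    def __init__(self, val: int, left: "Optional[TreeNode]" = None,
                 right: "Optional[TreeNode]" = None) -> None:
        self.val = val
        self.left = left
        self.right = right


def serialize(root: Optional[TreeNode]) -> str:
    """Preorder traversal; every null child is written explicitly as '#'."""
    parts: List[str] = []

    def dfs(node: Optional[TreeNode]) -> None:
        if node is None:
            parts.append("#")
            return
        parts.append(str(node.val))
        dfs(node.left)
        dfs(node.right)

    dfs(root)
    return ",".join(parts)


def deserialize(data: str) -> Optional[TreeNode]:
    """Rebuild the tree by consuming tokens in the same preorder sequence."""
    tokens: Iterator[str] = iter(data.split(","))

    def build() -> Optional[TreeNode]:
        token = next(tokens)
        if token == "#":
            return None
        node = TreeNode(int(token))
        node.left = build()   # left subtree comes first in preorder
        node.right = build()
        return node

    return build()
```

The sketch uses an iterator rather than pop(0) because pop(0) on a Python list shifts every remaining element on each call, while an iterator keeps reconstruction linear in the number of tokens. Because the traversal order is fixed and no unordered containers are involved, the same tree always yields the same string.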

The code generated by these models produced completely consistent output formats across multiple runs, satisfying the hard requirement that "serialization results of the same tree must be stable and consistent."

Fatal issues of zero-score models

Gemini 2.5 Pro, Gemini 3.1 Pro, DeepSeek V4 Pro, and GPT-5.5 scored zero, mainly due to two issues:

  • Using a Codec class wrapper instead of directly providing two independent functions: serialize and deserialize
  • Code snippets were clearly truncated, with the deserialization function incomplete

Both Gemini models used a class structure, DeepSeek V4 Pro also returned a Codec class, and GPT-5.5's output ended in an unfinished dfs function. The problem explicitly required "return code only, no explanations," so the output format alone was enough for these models to fail the evaluation criteria.
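
To make the two failure modes concrete, here is a hypothetical format gate. It is not the checker used in this evaluation, whose details are not described here. The function name check_format and the rule "serialize and deserialize must be top-level functions" are assumptions drawn from the description above. It uses Python's standard ast module.

```python
import ast
from typing import List


def check_format(source: str) -> List[str]:
    """Return a list of format problems; an empty list means the
    submission passes this hypothetical gate."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        # Code cut off mid-statement usually fails to parse at all.
        return [f"does not parse (possibly truncated): {exc.msg}"]

    # Only functions defined at module level count; methods inside a
    # wrapper class such as Codec are not in tree.body.
    top_level = {n.name for n in tree.body if isinstance(n, ast.FunctionDef)}
    return [
        f"missing top-level function: {name}"
        for name in ("serialize", "deserialize")
        if name not in top_level
    ]


# Illustrative inputs (hand-written stand-ins, not real model outputs).
wrapped = (
    "class Codec:\n"
    "    def serialize(self, root):\n"
    "        return '#'\n"
    "    def deserialize(self, data):\n"
    "        return None\n"
)
truncated = (
    "def serialize(root):\n"
    "    def dfs(node):\n"
    "        if node is None:\n"
)

print(check_format(wrapped))
print(check_format(truncated))
```

A gate like this rejects a submission before any test case runs. Note that a snippet truncated exactly at a statement boundary can still parse, so a real harness would also need to execute the functions.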

In engineering practice, format compliance often matters more than algorithmic insight. These models scored zero not because they could not write the code, but because they did not follow the problem's rules.

The real gap in the execution dimension

This evaluation examined only the code execution dimension. The full-score models passed boundary cases such as negative numbers, duplicate values, and empty trees. The zero-score models never reached the test entry point because of their format errors.
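
The boundary cases named above might be exercised with a round-trip check like the one below. The actual test cases of this run are not shown here, so the trees, the Node class, the helper same_tree, and the compact stand-in implementation are all assumptions made for illustration.

```python
class Node:
    """Stand-in tree node for this sketch (not from the evaluation)."""

    def __init__(self, val, left=None, right=None):
        self.val, self.left, self.right = val, left, right


def serialize(root):
    # Compact preorder encoding with "#" as the null marker.
    if root is None:
        return "#"
    return f"{root.val},{serialize(root.left)},{serialize(root.right)}"


def deserialize(data):
    tokens = iter(data.split(","))

    def build():
        token = next(tokens)
        if token == "#":
            return None
        # Arguments are evaluated left to right, so the left subtree
        # is rebuilt before the right one, matching preorder.
        return Node(int(token), build(), build())

    return build()


def same_tree(a, b):
    """Structural equality: same shape and same values."""
    if a is None or b is None:
        return a is b
    return (a.val == b.val
            and same_tree(a.left, b.left)
            and same_tree(a.right, b.right))


cases = {
    "empty tree": None,
    "negative numbers": Node(-1, Node(-20), Node(3)),
    "duplicates, left child": Node(1, Node(1), None),
    "duplicates, right child": Node(1, None, Node(1)),
}

for name, tree in cases.items():
    encoded = serialize(tree)
    assert same_tree(deserialize(encoded), tree), name   # round trip
    assert serialize(tree) == encoded, name              # stable output
```

The two duplicate-value trees hold identical values in different shapes. Only an encoding that writes null nodes explicitly can tell them apart, which is why that requirement appears in the problem statement.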

Notably, although Claude Sonnet 4.6 ultimately scored 100, an intermediate version had an issue with an undefined regular expression, which points to some fluctuation in code completeness. By comparison, GPT-o3 and Claude Opus 4.7 provided the cleanest and most straightforward implementations.

The results suggest that most current mainstream models can already produce stable, usable solutions for strictly constrained code execution tasks. However, 4 of the 11 models (about 36%) still failed at the most basic step of "outputting according to rules."

This once again confirms that a model's readiness for engineering deployment depends first on whether it can read and strictly follow every constraint in the problem statement.


Data source: YZ Index | Run #154