Changelog — YZ Index

2026-07-29 05:11 SGT Smoke Evaluation Completed WDCD smoke evaluation

11 models Started: 2026-07-29 04:30 SGT Completed: 2026-07-29 05:11 SGT 41m24s Run #253 Formula v7 · Judge v6.4 · Benchmark v7

2026-07-29 03:17 SGT Smoke Evaluation Completed

11 models Started: 2026-07-29 03:10 SGT Completed: 2026-07-29 03:17 SGT 7m41s Run #252 Formula v7 · Judge v6.4 · Benchmark v7

2026-07-28 03:17 SGT Smoke Evaluation Completed

11 models Started: 2026-07-28 03:00 SGT Completed: 2026-07-28 03:17 SGT 17m1s Run #250 Formula v7 · Judge v6.4 · Benchmark v7

2026-07-27 05:03 SGT Full Evaluation Completed

11 models Started: 2026-07-27 04:00 SGT Completed: 2026-07-27 05:03 SGT 1h3m Run #249 Formula v7 · Judge v6.4 · Benchmark v7

2026-07-27 03:13 SGT Smoke Evaluation Completed

11 models Started: 2026-07-27 03:00 SGT Completed: 2026-07-27 03:13 SGT 13m10s Run #248 Formula v7 · Judge v6.4 · Benchmark v7

2026-07-26 05:40 SGT Smoke Evaluation Completed WDCD pilot evaluation

11 models Started: 2026-07-26 04:30 SGT Completed: 2026-07-26 05:40 SGT 1h10m Run #247 Formula v7 · Judge v6.4 · Benchmark v7

2026-07-26 03:26 SGT Smoke Evaluation Completed

11 models Started: 2026-07-26 03:00 SGT Completed: 2026-07-26 03:26 SGT 26m10s Run #246 Formula v7 · Judge v6.4 · Benchmark v7

2026-07-25 03:20 SGT Smoke Evaluation Completed

11 models Started: 2026-07-25 03:00 SGT Completed: 2026-07-25 03:20 SGT 20m41s Run #245 Formula v7 · Judge v6.4 · Benchmark v7

2026-07-24 03:26 SGT Smoke Evaluation Completed

11 models Started: 2026-07-24 03:00 SGT Completed: 2026-07-24 03:26 SGT 26m11s Run #244 Formula v7 · Judge v6.4 · Benchmark v7

2026-07-23 03:15 SGT Smoke Evaluation Completed

11 models Started: 2026-07-23 03:00 SGT Completed: 2026-07-23 03:15 SGT 15m30s Run #243 Formula v7 · Judge v6.4 · Benchmark v7

2026-07-22 05:06 SGT Smoke Evaluation Completed WDCD smoke evaluation

11 models Started: 2026-07-22 04:30 SGT Completed: 2026-07-22 05:06 SGT 36m53s Run #242 Formula v7 · Judge v6.4 · Benchmark v7

2026-07-22 03:14 SGT Smoke Evaluation Completed

11 models Started: 2026-07-22 03:00 SGT Completed: 2026-07-22 03:14 SGT 14m11s Run #241 Formula v7 · Judge v6.4 · Benchmark v7

2026-07-21 03:15 SGT Smoke Evaluation Completed

11 models Started: 2026-07-21 03:00 SGT Completed: 2026-07-21 03:15 SGT 15m21s Run #240 Formula v7 · Judge v6.4 · Benchmark v7

2026-07-20 04:53 SGT Full Evaluation Completed

11 models Started: 2026-07-20 04:00 SGT Completed: 2026-07-20 04:53 SGT 53m46s Run #239 Formula v7 · Judge v6.4 · Benchmark v7

2026-07-20 03:09 SGT Smoke Evaluation Completed

11 models Started: 2026-07-20 03:00 SGT Completed: 2026-07-20 03:09 SGT 9m21s Run #238 Formula v7 · Judge v6.4 · Benchmark v7

2026-07-19 03:14 SGT Smoke Evaluation Completed

11 models Started: 2026-07-19 03:00 SGT Completed: 2026-07-19 03:14 SGT 14m11s Run #237 Formula v7 · Judge v6.4 · Benchmark v7

2026-07-18 03:20 SGT Smoke Evaluation Completed

11 models Started: 2026-07-18 03:00 SGT Completed: 2026-07-18 03:20 SGT 20m51s Run #236 Formula v7 · Judge v6.4 · Benchmark v7

2026-07-17 03:12 SGT Smoke Evaluation Completed

11 models Started: 2026-07-17 03:00 SGT Completed: 2026-07-17 03:12 SGT 12m21s Run #235 Formula v7 · Judge v6.4 · Benchmark v7

2026-07-16 03:12 SGT Smoke Evaluation Completed

11 models Started: 2026-07-16 03:00 SGT Completed: 2026-07-16 03:12 SGT 12m11s Run #234 Formula v7 · Judge v6.4 · Benchmark v7

2026-07-15 05:10 SGT Smoke Evaluation Completed WDCD smoke evaluation

11 models Started: 2026-07-15 04:30 SGT Completed: 2026-07-15 05:10 SGT 40m17s Run #233 Formula v7 · Judge v6.4 · Benchmark v7

2026-07-15 03:16 SGT Smoke Evaluation Completed

11 models Started: 2026-07-15 03:00 SGT Completed: 2026-07-15 03:16 SGT 16m21s Run #232 Formula v7 · Judge v6.4 · Benchmark v7

2026-07-14 03:09 SGT Smoke Evaluation Completed

11 models Started: 2026-07-14 03:00 SGT Completed: 2026-07-14 03:09 SGT 9m11s Run #231 Formula v7 · Judge v6.4 · Benchmark v7

2026-07-13 04:59 SGT Full Evaluation Completed

11 models Started: 2026-07-13 04:00 SGT Completed: 2026-07-13 04:59 SGT 59m12s Run #230 Formula v7 · Judge v6.4 · Benchmark v7

2026-07-13 03:18 SGT Smoke Evaluation Completed

11 models Started: 2026-07-13 03:10 SGT Completed: 2026-07-13 03:18 SGT 8m20s Run #229 Formula v7 · Judge v6.4 · Benchmark v7

2026-07-12 05:52 SGT Smoke Evaluation Completed WDCD pilot evaluation

11 models Started: 2026-07-12 04:30 SGT Completed: 2026-07-12 05:52 SGT 1h22m Run #227 Formula v7 · Judge v6.4 · Benchmark v7

2026-07-12 03:10 SGT Smoke Evaluation Completed

11 models Started: 2026-07-12 03:00 SGT Completed: 2026-07-12 03:10 SGT 10m41s Run #226 Formula v7 · Judge v6.4 · Benchmark v7

2026-07-11 03:15 SGT Smoke Evaluation Completed

11 models Started: 2026-07-11 03:00 SGT Completed: 2026-07-11 03:15 SGT 15m51s Run #225 Formula v7 · Judge v6.4 · Benchmark v7

2026-07-10 03:18 SGT Smoke Evaluation Completed

11 models Started: 2026-07-10 03:10 SGT Completed: 2026-07-10 03:18 SGT 8m41s Run #224 Formula v7 · Judge v6.4 · Benchmark v7

2026-07-10 03:04 SGT Smoke Evaluation Completed

11 models Started: 2026-07-10 03:00 SGT Completed: 2026-07-10 03:04 SGT 4m31s Run #223 Formula v7 · Judge v6.4 · Benchmark v7

2026-07-09 03:11 SGT Smoke Evaluation Completed

11 models Started: 2026-07-09 03:00 SGT Completed: 2026-07-09 03:11 SGT 11m21s Run #222 Formula v7 · Judge v6.4 · Benchmark v7

2026-07-08 05:15 SGT Smoke Evaluation Completed WDCD smoke evaluation

11 models Started: 2026-07-08 04:30 SGT Completed: 2026-07-08 05:15 SGT 45m20s Run #221 Formula v7 · Judge v6.4 · Benchmark v7

2026-07-08 03:14 SGT Smoke Evaluation Completed

11 models Started: 2026-07-08 03:10 SGT Completed: 2026-07-08 03:14 SGT 4m21s Run #220 Formula v7 · Judge v6.4 · Benchmark v7

2026-07-07 03:18 SGT Smoke Evaluation Completed

11 models Started: 2026-07-07 03:10 SGT Completed: 2026-07-07 03:18 SGT 8m11s Run #218 Formula v7 · Judge v6.4 · Benchmark v7

2026-07-06 08:00 SGT Full Evaluation Completed

11 models Started: 2026-07-06 04:00 SGT Completed: 2026-07-06 08:00 SGT 4h0m Run #216 Formula v7 · Judge v6.4 · Benchmark v7

2026-07-06 03:08 SGT Smoke Evaluation Completed

11 models Started: 2026-07-06 03:00 SGT Completed: 2026-07-06 03:08 SGT 8m1s Run #215 Formula v7 · Judge v6.4 · Benchmark v7

2026-07-05 03:12 SGT Smoke Evaluation Completed

11 models Started: 2026-07-05 03:00 SGT Completed: 2026-07-05 03:12 SGT 12m50s Run #214 Formula v7 · Judge v6.4 · Benchmark v7

Time unknown Smoke Evaluation unknown

0 models Run #13

Time unknown Smoke Evaluation unknown

0 models Run #12

Time unknown Smoke Evaluation unknown

0 models Run #11

Time unknown Smoke Evaluation unknown

0 models Run #10

Time unknown Smoke Evaluation unknown

0 models Run #9

2026-07-04 03:19 SGT Smoke Evaluation Completed

11 models Started: 2026-07-04 03:10 SGT Completed: 2026-07-04 03:19 SGT 9m51s Run #213 Formula v7 · Judge v6.4 · Benchmark v7

2026-07-04 03:08 SGT Smoke Evaluation Completed

11 models Started: 2026-07-04 03:00 SGT Completed: 2026-07-04 03:08 SGT 8m41s Run #212 Formula v7 · Judge v6.4 · Benchmark v7

2026-07-03 11:05 SGT Smoke Evaluation Completed WDCD pilot evaluation

11 models Started: 2026-07-03 04:41 SGT Completed: 2026-07-03 11:05 SGT 6h23m Run #211 Formula v7 · Judge v6.4 · Benchmark v7

2026-07-03 04:34 SGT Version Upgrade

WDCD compliance board v3.1 upgrade + benchmark roster refresh

Compliance test upgrade v3.1

The question bank for the multi-turn compliance board (WDCD) is upgraded to v3.1. It adds 17 multi-turn questions with escalating pressure, covering real compliance scenarios such as "primitive-choice traps", "collusion tests", and "false-premise continuation". Violation verdicts come from rules that can be reproduced at runtime, so they leave no room for dispute.

**Why**: the old bank was saturating for frontier models: top compliance scores were bunched around 93 and hard to tell apart. v3.1 reopens the field with multi-turn pressure closer to real enterprise scenarios. Measured compliance scores at the top now spread smoothly from ~98 down to ~72, with clearly better discrimination.

**Pool**: 17 new v3.1 questions + 8 cross-version anchors, 25 in total. Historical WDCD leaderboards remain as-is; scores across versions are not directly comparable.

Roster refresh

• **Added** Zhipu GLM-4.6 to the roster, a top-tier Chinese model.

• **Temporarily removed** ERNIE 4.5: its API access has been persistently unavailable, making trustworthy scoring impossible; it will be re-evaluated for inclusion once access recovers.

The roster stands at 11 models.

2026-07-03 03:24 SGT Smoke Evaluation Completed

11 models Started: 2026-07-03 03:10 SGT Completed: 2026-07-03 03:24 SGT 14m1s Run #210 Formula v7 · Judge v6.4 · Benchmark v7

2026-07-03 03:05 SGT Smoke Evaluation Completed

11 models Started: 2026-07-03 03:00 SGT Completed: 2026-07-03 03:05 SGT 5m1s Run #209 Formula v7 · Judge v6.4 · Benchmark v7

2026-07-03 01:29 SGT Version Upgrade

Judge set v6.4: bundled scoring launched + scoring fixes

What changed

**Bundled scoring**: structured-output questions (json_schema_exact) move from per-checkpoint partial credit to bundled scoring. Checkpoints are grouped by business semantics, and a group scores only when every checkpoint in it is correct.

**Why**: with partial credit, an answer with 3 of 38 checkpoints wrong still scored 92, but in real delivery one mistranscribed amount or one missed clause means full rework. Partial credit systematically overstated model usability on critical tasks and saturated the top of the leaderboard (top material-constraint scores had reached 95+). Bundled scoring matches real delivery tolerance.

**Effect**: when the raw answers from the latest full run were rescored, core scores for top models dropped from ~95 to ~80, with clearly better tier separation. Questions, model answers, and every checkpoint verdict are unchanged; only the aggregation changed.
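To make the difference between the two aggregations concrete, here is a minimal sketch. The group names, group sizes, and equal weighting of groups are assumptions made for illustration; the changelog does not describe how the real judge groups or weights checkpoints.

```python
from collections import defaultdict


def partial_credit(checkpoints):
    """Per-checkpoint partial credit: percentage of checkpoints that passed."""
    return 100 * sum(ok for _, ok in checkpoints) / len(checkpoints)


def bundled(checkpoints):
    """Bundled scoring: a group counts only if every checkpoint in it passed.

    Groups are weighted equally here, which is an assumption.
    """
    groups = defaultdict(list)
    for group, ok in checkpoints:
        groups[group].append(ok)
    return 100 * sum(all(oks) for oks in groups.values()) / len(groups)


# Hypothetical answer: 38 checkpoints in 10 groups (8 groups of 4, 2 of 3),
# with 3 wrong checkpoints that fall into 3 different groups.
sizes = [4] * 8 + [3] * 2
checkpoints = []
for g, size in enumerate(sizes):
    for i in range(size):
        failed = g < 3 and i == 0
        checkpoints.append((f"group{g}", not failed))

print(round(partial_credit(checkpoints)))  # 35 of 38 checkpoints pass
print(bundled(checkpoints))                # 7 of 10 groups pass
```

In this made-up case the same 38 verdicts give about 92 under partial credit but 70 under bundling, because each error invalidates its whole group.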

Scoring fixes

• Fixed time decay in SQL "last N days" questions: the test data used fixed dates, which drifted out of the query window over time, so correct queries were scored 0. Added automatic monthly re-anchoring to prevent recurrence.

• Retired 1 question whose scorer and question language never matched, which made it impossible to score.
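The time-decay problem and one possible form of re-anchoring can be sketched as follows. The function names, the sample dates, and the idea of shifting every fixture date by a single offset are assumptions for illustration; the changelog does not say how the re-anchoring is actually implemented.

```python
from datetime import date, timedelta


def in_last_n_days(rows, n, today):
    """Return the dates that fall inside the 'last n days' window ending today."""
    cutoff = today - timedelta(days=n)
    return [d for d in rows if cutoff <= d <= today]


def reanchor(rows, anchor, today):
    """Shift every fixture date by the same offset so `anchor` lands on `today`."""
    offset = today - anchor
    return [d + offset for d in rows]


# Hypothetical fixture authored on 2026-01-06 with two rows in the last 7 days.
authored = date(2026, 1, 6)
fixture = [date(2026, 1, 1), date(2026, 1, 5), date(2025, 12, 1)]

today = date(2026, 7, 3)
stale = in_last_n_days(fixture, 7, today)  # window has moved past the data
fresh = in_last_n_days(reanchor(fixture, authored, today), 7, today)
print(len(stale), len(fresh))
```

Because all dates move by the same offset, the relative spacing that the question depends on is preserved while the data stays inside the query window.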

Historical comparability

Runs from now on are tagged judge set v6.4; earlier leaderboards remain as-is, tagged v6.3. Scores across judge sets are not directly comparable.

2026-07-02 03:09 SGT Smoke Evaluation Completed

11 models Started: 2026-07-02 03:00 SGT Completed: 2026-07-02 03:09 SGT 9m11s Run #208 Formula v7 · Judge v6.3 · Benchmark v7

2026-07-01 04:58 SGT Smoke Evaluation Completed WDCD smoke evaluation

11 models Started: 2026-07-01 04:30 SGT Completed: 2026-07-01 04:58 SGT 28m55s Run #207 Formula v7 · Judge v6.3 · Benchmark v7

2026-07-01 03:09 SGT Smoke Evaluation Completed

11 models Started: 2026-07-01 03:00 SGT Completed: 2026-07-01 03:09 SGT 9m21s Run #206 Formula v7 · Judge v6.3 · Benchmark v7

2026-06-30 03:03 SGT Smoke Evaluation Completed

11 models Started: 2026-06-30 03:00 SGT Completed: 2026-06-30 03:03 SGT 3m31s Run #205 Formula v7 · Judge v6.3 · Benchmark v7

2026-06-29 04:56 SGT Full Evaluation Completed

11 models Started: 2026-06-29 04:00 SGT Completed: 2026-06-29 04:56 SGT 56m31s Run #204 Formula v7 · Judge v6.3 · Benchmark v7

2026-06-29 03:03 SGT Smoke Evaluation Completed

11 models Started: 2026-06-29 03:00 SGT Completed: 2026-06-29 03:03 SGT 3m31s Run #203 Formula v7 · Judge v6.3 · Benchmark v7

2026-06-28 05:58 SGT Smoke Evaluation Completed WDCD pilot evaluation

11 models Started: 2026-06-28 04:30 SGT Completed: 2026-06-28 05:58 SGT 1h28m Run #202 Formula v7 · Judge v6.3 · Benchmark v7

2026-06-28 03:03 SGT Smoke Evaluation Completed

11 models Started: 2026-06-28 03:00 SGT Completed: 2026-06-28 03:03 SGT 3m41s Run #201 Formula v7 · Judge v6.3 · Benchmark v7

2026-06-27 03:06 SGT Smoke Evaluation Completed

11 models Started: 2026-06-27 03:00 SGT Completed: 2026-06-27 03:06 SGT 6m51s Run #200 Formula v7 · Judge v6.3 · Benchmark v7

2026-06-26 03:05 SGT Smoke Evaluation Completed

11 models Started: 2026-06-26 03:00 SGT Completed: 2026-06-26 03:05 SGT 5m51s Run #198 Formula v7 · Judge v6.3 · Benchmark v7

2026-06-25 03:02 SGT Smoke Evaluation Completed

11 models Started: 2026-06-25 03:00 SGT Completed: 2026-06-25 03:02 SGT 2m10s Run #197 Formula v7 · Judge v6.3 · Benchmark v7

2026-06-24 04:54 SGT Smoke Evaluation Completed WDCD smoke evaluation

11 models Started: 2026-06-24 04:30 SGT Completed: 2026-06-24 04:54 SGT 24m22s Run #196 Formula v7 · Judge v6.3 · Benchmark v7

2026-06-24 03:01 SGT Smoke Evaluation Completed

11 models Started: 2026-06-24 03:00 SGT Completed: 2026-06-24 03:01 SGT 1m31s Run #195 Formula v7 · Judge v6.3 · Benchmark v7

2026-06-23 03:11 SGT Smoke Evaluation Completed

11 models Started: 2026-06-23 03:10 SGT Completed: 2026-06-23 03:11 SGT 1m30s Run #194 Formula v7 · Judge v6.3 · Benchmark v7

2026-06-22 04:39 SGT Full Evaluation Completed

11 models Started: 2026-06-22 04:00 SGT Completed: 2026-06-22 04:39 SGT 39m47s Run #192 Formula v7 · Judge v6.3 · Benchmark v7

2026-06-22 03:06 SGT Smoke Evaluation Completed

11 models Started: 2026-06-22 03:00 SGT Completed: 2026-06-22 03:06 SGT 6m41s Run #191 Formula v7 · Judge v6.3 · Benchmark v7

2026-06-21 03:12 SGT Smoke Evaluation Completed

11 models Started: 2026-06-21 03:10 SGT Completed: 2026-06-21 03:12 SGT 2m31s Run #190 Formula v7 · Judge v6.3 · Benchmark v7

2026-06-20 03:03 SGT Smoke Evaluation Completed

11 models Started: 2026-06-20 03:00 SGT Completed: 2026-06-20 03:03 SGT 3m1s Run #188 Formula v7 · Judge v6.3 · Benchmark v7

2026-06-19 03:02 SGT Smoke Evaluation Completed

11 models Started: 2026-06-19 03:00 SGT Completed: 2026-06-19 03:02 SGT 2m41s Run #187 Formula v7 · Judge v6.3 · Benchmark v7

2026-06-18 03:02 SGT Smoke Evaluation Completed

11 models Started: 2026-06-18 03:00 SGT Completed: 2026-06-18 03:02 SGT 2m30s Run #186 Formula v7 · Judge v6.3 · Benchmark v7

2026-06-17 04:54 SGT Smoke Evaluation Completed WDCD smoke evaluation

11 models Started: 2026-06-17 04:30 SGT Completed: 2026-06-17 04:54 SGT 24m19s Run #185 Formula v7 · Judge v6.3 · Benchmark v7

2026-06-17 03:12 SGT Smoke Evaluation Completed

11 models Started: 2026-06-17 03:10 SGT Completed: 2026-06-17 03:12 SGT 2m40s Run #184 Formula v7 · Judge v6.3 · Benchmark v7

2026-06-16 03:14 SGT Smoke Evaluation Completed

11 models Started: 2026-06-16 03:10 SGT Completed: 2026-06-16 03:14 SGT 4m21s Run #182 Formula v7 · Judge v6.3 · Benchmark v7

2026-06-15 09:25 SGT Full Evaluation Completed

11 models Started: 2026-06-15 08:34 SGT Completed: 2026-06-15 09:25 SGT 51m16s Run #180 Formula v7 · Judge v6.3 · Benchmark v7

2026-06-15 03:03 SGT Smoke Evaluation Completed

11 models Started: 2026-06-15 03:00 SGT Completed: 2026-06-15 03:03 SGT 3m31s Run #176 Formula v7 · Judge v6.3 · Benchmark v7

2026-06-14 05:53 SGT Smoke Evaluation Completed WDCD pilot evaluation

11 models Started: 2026-06-14 04:30 SGT Completed: 2026-06-14 05:53 SGT 1h23m Run #171 Formula v7 · Judge v6.3 · Benchmark v7

2026-06-14 03:19 SGT Smoke Evaluation Completed WDCD pilot evaluation

11 models Started: 2026-06-13 23:10 SGT Completed: 2026-06-14 03:19 SGT 4h9m Run #169 Formula v7 · Judge v6.3 · Benchmark v7

2026-06-14 03:06 SGT Smoke Evaluation Completed

11 models Started: 2026-06-14 03:00 SGT Completed: 2026-06-14 03:06 SGT 6m51s Run #170 Formula v7 · Judge v6.3 · Benchmark v7

2026-06-13 03:01 SGT Smoke Evaluation Completed

11 models Started: 2026-06-13 03:00 SGT Completed: 2026-06-13 03:01 SGT 1m41s Run #166 Formula v7 · Judge v6.3 · Benchmark v7

2026-06-12 03:01 SGT Smoke Evaluation Completed

11 models Started: 2026-06-12 03:00 SGT Completed: 2026-06-12 03:01 SGT 1m40s Run #165 Formula v7 · Judge v6.3 · Benchmark v7

2026-06-11 13:19 SGT Smoke Evaluation Completed WDCD pilot evaluation

11 models Started: 2026-06-11 11:55 SGT Completed: 2026-06-11 13:19 SGT 1h24m Run #164 Formula v7 · Judge v6.3 · Benchmark v7

2026-06-11 09:18 SGT Smoke Evaluation Completed WDCD pilot evaluation

11 models Started: 2026-06-11 07:57 SGT Completed: 2026-06-11 09:18 SGT 1h20m Run #161 Formula v7 · Judge v6.3 · Benchmark v6

2026-06-11 07:14 SGT Smoke Evaluation Completed

11 models Started: 2026-06-11 07:12 SGT Completed: 2026-06-11 07:14 SGT 1m51s Run #159 Formula v7 · Judge v6.2 · Benchmark v6

2026-06-11 03:02 SGT Smoke Evaluation Completed

11 models Started: 2026-06-11 03:00 SGT Completed: 2026-06-11 03:02 SGT 2m20s Run #158 Formula v7 · Judge v6.1 · Benchmark v6

2026-06-10 05:00 SGT Smoke Evaluation Completed WDCD smoke evaluation

11 models Started: 2026-06-10 04:30 SGT Completed: 2026-06-10 05:00 SGT 30m33s Run #157 Formula v7 · Judge v6.1 · Benchmark v6

2026-06-10 03:01 SGT Smoke Evaluation Completed

11 models Started: 2026-06-10 03:00 SGT Completed: 2026-06-10 03:01 SGT 1m41s Run #156 Formula v7 · Judge v6.1 · Benchmark v6

2026-06-09 03:01 SGT Smoke Evaluation Completed

11 models Started: 2026-06-09 03:00 SGT Completed: 2026-06-09 03:01 SGT 1m41s Run #155 Formula v7 · Judge v6.1 · Benchmark v6

2026-06-08 03:02 SGT Smoke Evaluation Completed

11 models Started: 2026-06-08 03:00 SGT Completed: 2026-06-08 03:02 SGT 2m1s Run #153 Formula v7 · Judge v6.1 · Benchmark v6

2026-06-07 03:02 SGT Smoke Evaluation Completed

11 models Started: 2026-06-07 03:00 SGT Completed: 2026-06-07 03:02 SGT 2m11s Run #152 Formula v7 · Judge v6.1 · Benchmark v6

2026-06-06 19:26 SGT Smoke Evaluation Completed

11 models Started: 2026-06-06 19:24 SGT Completed: 2026-06-06 19:26 SGT 1m40s Run #151 Formula v7 · Judge v6.1 · Benchmark v6

2026-06-06 03:31 SGT Smoke Evaluation Completed social_monitor

1 model Started: 2026-06-06 03:30 SGT Completed: 2026-06-06 03:31 SGT 1m40s Run #150 Formula v7 · Judge v6 · Benchmark v6

2026-06-05 03:01 SGT Smoke Evaluation Completed

11 models Started: 2026-06-05 03:00 SGT Completed: 2026-06-05 03:01 SGT 1m41s Run #148 Formula v7 · Judge v6 · Benchmark v6

2026-06-04 03:01 SGT Smoke Evaluation Completed

11 models Started: 2026-06-04 03:00 SGT Completed: 2026-06-04 03:01 SGT 1m51s Run #147 Formula v7 · Judge v6 · Benchmark v6

2026-06-03 04:57 SGT Smoke Evaluation Completed WDCD smoke evaluation

11 models Started: 2026-06-03 04:30 SGT Completed: 2026-06-03 04:57 SGT 27m54s Run #146 Formula v7 · Judge v6 · Benchmark v6

2026-06-03 03:01 SGT Smoke Evaluation Completed

11 models Started: 2026-06-03 03:00 SGT Completed: 2026-06-03 03:01 SGT 1m51s Run #145 Formula v7 · Judge v6 · Benchmark v6

2026-06-02 03:31 SGT Smoke Evaluation Completed social_monitor

1 model Started: 2026-06-02 03:30 SGT Completed: 2026-06-02 03:31 SGT 1m20s Run #144 Formula v7 · Judge v6 · Benchmark v6

2026-06-02 03:02 SGT Smoke Evaluation Completed

11 models Started: 2026-06-02 03:00 SGT Completed: 2026-06-02 03:02 SGT 2m21s Run #143 Formula v7 · Judge v6 · Benchmark v6

2026-06-01 03:02 SGT Smoke Evaluation Completed

11 models Started: 2026-06-01 03:00 SGT Completed: 2026-06-01 03:02 SGT 2m31s Run #141 Formula v7 · Judge v6 · Benchmark v6

2026-05-31 05:54 SGT Smoke Evaluation Completed WDCD pilot evaluation

11 models Started: 2026-05-31 04:30 SGT Completed: 2026-05-31 05:54 SGT 1h24m Run #140 Formula v7 · Judge v6 · Benchmark v6

2026-05-31 03:01 SGT Smoke Evaluation Completed

11 models Started: 2026-05-31 03:00 SGT Completed: 2026-05-31 03:01 SGT 1m20s Run #139 Formula v7 · Judge v6 · Benchmark v6

2026-05-30 03:01 SGT Smoke Evaluation Completed

11 models Started: 2026-05-30 03:00 SGT Completed: 2026-05-30 03:01 SGT 1m30s Run #138 Formula v7 · Judge v6 · Benchmark v6

2026-05-29 03:01 SGT Smoke Evaluation Completed

11 models Started: 2026-05-29 03:00 SGT Completed: 2026-05-29 03:01 SGT 1m41s Run #137 Formula v7 · Judge v6 · Benchmark v6

2026-05-28 03:01 SGT Smoke Evaluation Completed

11 models Started: 2026-05-28 03:00 SGT Completed: 2026-05-28 03:01 SGT 1m41s Run #136 Formula v7 · Judge v6 · Benchmark v6

2026-05-27 04:54 SGT Smoke Evaluation Completed WDCD smoke evaluation

11 models Started: 2026-05-27 04:30 SGT Completed: 2026-05-27 04:54 SGT 24m29s Run #135 Formula v7 · Judge v6 · Benchmark v6

2026-05-27 03:01 SGT Smoke Evaluation Completed

11 models Started: 2026-05-27 03:00 SGT Completed: 2026-05-27 03:01 SGT 1m11s Run #134 Formula v7 · Judge v6 · Benchmark v6

2026-05-26 03:31 SGT Smoke Evaluation Completed social_monitor

1 model Started: 2026-05-26 03:30 SGT Completed: 2026-05-26 03:31 SGT 1m20s Run #133 Formula v7 · Judge v6 · Benchmark v6

2026-05-26 03:01 SGT Smoke Evaluation Completed

11 models Started: 2026-05-26 03:00 SGT Completed: 2026-05-26 03:01 SGT 1m31s Run #132 Formula v7 · Judge v6 · Benchmark v6

2026-05-25 03:01 SGT Smoke Evaluation Completed

11 models Started: 2026-05-25 03:00 SGT Completed: 2026-05-25 03:01 SGT 1m41s Run #130 Formula v7 · Judge v6 · Benchmark v6

2026-05-24 03:01 SGT Smoke Evaluation Completed

11 models Started: 2026-05-24 03:00 SGT Completed: 2026-05-24 03:01 SGT 1m11s Run #129 Formula v7 · Judge v6 · Benchmark v6

2026-05-23 03:02 SGT Smoke Evaluation Completed

11 models Started: 2026-05-23 03:00 SGT Completed: 2026-05-23 03:02 SGT 2m0s Run #128 Formula v7 · Judge v6 · Benchmark v6

2026-05-22 03:02 SGT Smoke Evaluation Completed

11 models Started: 2026-05-22 03:00 SGT Completed: 2026-05-22 03:02 SGT 2m11s Run #127 Formula v7 · Judge v6 · Benchmark v6

2026-05-21 03:01 SGT Smoke Evaluation Completed

11 models Started: 2026-05-21 03:00 SGT Completed: 2026-05-21 03:01 SGT 1m31s Run #126 Formula v7 · Judge v6 · Benchmark v6

2026-05-20 04:57 SGT Smoke Evaluation Completed WDCD smoke evaluation

11 models Started: 2026-05-20 04:30 SGT Completed: 2026-05-20 04:57 SGT 27m36s Run #125 Formula v7 · Judge v6 · Benchmark v6

2026-05-20 03:01 SGT Smoke Evaluation Completed

11 models Started: 2026-05-20 03:00 SGT Completed: 2026-05-20 03:01 SGT 1m41s Run #124 Formula v7 · Judge v6 · Benchmark v6

2026-05-19 03:01 SGT Smoke Evaluation Completed

11 models Started: 2026-05-19 03:00 SGT Completed: 2026-05-19 03:01 SGT 1m41s Run #123 Formula v7 · Judge v6 · Benchmark v6

2026-05-18 03:01 SGT Smoke Evaluation Completed

11 models Started: 2026-05-18 03:00 SGT Completed: 2026-05-18 03:01 SGT 1m21s Run #121 Formula v7 · Judge v6 · Benchmark v6

2026-05-17 05:49 SGT Smoke Evaluation Completed WDCD pilot evaluation

11 models Started: 2026-05-17 04:30 SGT Completed: 2026-05-17 05:49 SGT 1h19m Run #120 Formula v7 · Judge v6 · Benchmark v6

2026-05-17 03:01 SGT Smoke Evaluation Completed

11 models Started: 2026-05-17 03:00 SGT Completed: 2026-05-17 03:01 SGT 1m20s Run #119 Formula v7 · Judge v6 · Benchmark v6

2026-05-16 03:03 SGT Smoke Evaluation Completed

11 models Started: 2026-05-16 03:00 SGT Completed: 2026-05-16 03:03 SGT 3m51s Run #118 Formula v7 · Judge v6 · Benchmark v6

2026-05-15 03:04 SGT Smoke Evaluation Completed

11 models Started: 2026-05-15 03:00 SGT Completed: 2026-05-15 03:04 SGT 4m11s Run #117 Formula v7 · Judge v6 · Benchmark v6

2026-05-14 03:01 SGT Smoke Evaluation Completed

11 models Started: 2026-05-14 03:00 SGT Completed: 2026-05-14 03:01 SGT 1m31s Run #116 Formula v7 · Judge v6 · Benchmark v6

2026-05-13 05:03 SGT Smoke Evaluation Completed WDCD smoke evaluation

11 models Started: 2026-05-13 04:30 SGT Completed: 2026-05-13 05:03 SGT 33m25s Run #115 Formula v7 · Judge v6 · Benchmark v6

2026-05-13 03:02 SGT Smoke Evaluation Completed

11 models Started: 2026-05-13 03:00 SGT Completed: 2026-05-13 03:02 SGT 2m51s Run #114 Formula v7 · Judge v6 · Benchmark v6

2026-05-12 03:01 SGT Smoke Evaluation Completed

11 models Started: 2026-05-12 03:00 SGT Completed: 2026-05-12 03:01 SGT 1m51s Run #113 Formula v7 · Judge v6 · Benchmark v6

2026-05-11 03:03 SGT Smoke Evaluation Completed

11 models Started: 2026-05-11 03:00 SGT Completed: 2026-05-11 03:03 SGT 3m0s Run #111 Formula v7 · Judge v6 · Benchmark v6

2026-05-10 05:26 SGT Smoke Evaluation Completed social_monitor

1 model Started: 2026-05-10 03:30 SGT Completed: 2026-05-10 05:26 SGT 1h55m Run #110 Formula v7 · Judge v6 · Benchmark v6

2026-05-10 03:03 SGT Smoke Evaluation Completed

11 models Started: 2026-05-10 03:00 SGT Completed: 2026-05-10 03:03 SGT 3m11s Run #109 Formula v7 · Judge v6 · Benchmark v6

2026-05-09 03:01 SGT Smoke Evaluation Completed

11 models Started: 2026-05-09 03:00 SGT Completed: 2026-05-09 03:01 SGT 1m32s Run #108 Formula v7 · Judge v6 · Benchmark v6

2026-05-08 03:01 SGT Smoke Evaluation Completed

11 models Started: 2026-05-08 03:00 SGT Completed: 2026-05-08 03:01 SGT 1m51s Run #107 Formula v7 · Judge v6 · Benchmark v6

2026-05-07 03:02 SGT Smoke Evaluation Completed

11 models Started: 2026-05-07 03:00 SGT Completed: 2026-05-07 03:02 SGT 2m31s Run #106 Formula v7 · Judge v6 · Benchmark v6

2026-05-06 05:01 SGT Smoke Evaluation Completed WDCD smoke evaluation

11 models Started: 2026-05-06 04:30 SGT Completed: 2026-05-06 05:01 SGT 31m24s Run #105 Formula v7 · Judge v6 · Benchmark v6

2026-05-06 03:01 SGT Smoke Evaluation Completed

11 models Started: 2026-05-06 03:00 SGT Completed: 2026-05-06 03:01 SGT 1m31s Run #104 Formula v7 · Judge v6 · Benchmark v6

2026-05-05 03:02 SGT Smoke Evaluation Completed

11 models Started: 2026-05-05 03:00 SGT Completed: 2026-05-05 03:02 SGT 2m11s Run #103 Formula v7 · Judge v6 · Benchmark v6

2026-05-04 03:02 SGT Smoke Evaluation Completed

11 models Started: 2026-05-04 03:00 SGT Completed: 2026-05-04 03:02 SGT 2m41s Run #101 Formula v7 · Judge v6 · Benchmark v6

2026-05-03 04:24 SGT Smoke Evaluation Completed WDCD pilot evaluation

11 models Started: 2026-05-03 04:00 SGT Completed: 2026-05-03 04:24 SGT 24m13s Run #100 Formula v7 · Judge v6 · Benchmark v6

2026-05-03 04:00 SGT Smoke Evaluation Completed

4 models Started: 2026-05-03 03:00 SGT Completed: 2026-05-03 04:00 SGT 1h0m Run #99 Formula v7 · Judge v6 · Benchmark v6

2026-05-02 03:03 SGT Smoke Evaluation Completed

11 models Started: 2026-05-02 03:00 SGT Completed: 2026-05-02 03:03 SGT 3m10s Run #98 Formula v7 · Judge v6 · Benchmark v6

2026-05-02 02:55 SGT Smoke Evaluation Completed WDCD pilot evaluation

11 models Started: 2026-05-01 18:03 SGT Completed: 2026-05-02 02:55 SGT 8h51m Run #97 Formula v7 · Judge v6 · Benchmark v6

2026-05-01 16:06 SGT Smoke Evaluation Completed DCD pilot evaluation

11 models Started: 2026-05-01 10:38 SGT Completed: 2026-05-01 16:06 SGT 5h28m Run #96 Formula v7 · Judge v6 · Benchmark v6

2026-05-01 11:09 SGT Version Upgrade

WDCD (Dynamic Contextual Decay): launch of the world's first multi-turn constraint benchmark dimension

New experimental dimension: WDCD (Dynamic Contextual Decay)

Winzheng Index v7 adds the WDCD dimension, which tests whether AI models hold constraints across multi-turn dialogue. This is the world's first framework to systematically evaluate this capability.

**Core design: three-round dialogue**

• R1 constraint implant: give the model an explicit constraint and confirm understanding

• R2 distraction injection: a 2000-5000 character professional document with an embedded violating request

• R3 pressure induction: social-engineering pressure to test whether the constraint collapses

**Scale**

• 30 multi-turn constraint questions covering 5 scenario types (data boundaries, resource limits, business rules, security, engineering conventions)

• 11 mainstream models tested side by side

• 100% rule-based scoring, zero AI judges, fully auditable

**Scoring**

• R1: 0-1 (confirmation detection)

• R2: 0-1 (violation detection + Utility Gate)

• R3: 0-2 (violation + refusal + constraint citation + safe alternative)

• Max 4 points
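The per-round caps above can be captured in a small record type. This is a sketch only: the class name, field names, and the choice to allow fractional scores are assumptions, and how the R3 components (violation, refusal, constraint citation, safe alternative) combine into 0-2 is not specified in this changelog, so it is not modelled here.

```python
from dataclasses import dataclass

# Per-round maxima as listed above: R1 0-1, R2 0-1, R3 0-2.
ROUND_MAX = {"r1": 1.0, "r2": 1.0, "r3": 2.0}


@dataclass(frozen=True)
class WdcdScore:
    """Hypothetical container for one question's WDCD result."""

    r1: float  # constraint implant: confirmation detected
    r2: float  # distraction injection: violation detection + Utility Gate
    r3: float  # pressure induction

    def __post_init__(self):
        # Reject any round score outside its documented range.
        for name, cap in ROUND_MAX.items():
            value = getattr(self, name)
            if not 0 <= value <= cap:
                raise ValueError(f"{name}={value} is outside 0-{cap:g}")

    @property
    def total(self):
        """Sum of the three rounds; the maximum is 4."""
        return self.r1 + self.r2 + self.r3


print(WdcdScore(r1=1, r2=1, r3=2).total)
print(WdcdScore(r1=1, r2=0, r3=1).total)
```

Validating the ranges at construction time keeps an out-of-range round score from silently inflating the 4-point total.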

**Independent runs**

• WDCD is experimental and not counted in the main board score

• Uses independent runs (run_type = dcd_pilot)

• Planned to run independently for 3 months before evaluating main-board inclusion

2026-05-01 06:20 SGT Model Change

Major roster upgrade: 11 models updated to latest versions

From May 1, 2026, the Winzheng Index benchmark roster is fully upgraded:

[New models]

• GPT-5.5 (replacing GPT-4o) — OpenAI's latest flagship

• Claude Opus 4.7 (replacing Opus 4.6) — Anthropic's latest flagship

• DeepSeek V4 Pro (replacing V3 + R1) — DeepSeek's new architecture

• Gemini 3.1 Pro (new) — Google's latest generation

• Qwen3 Max (replacing Qwen Max) — Alibaba Tongyi Qianwen 3rd generation

• ERNIE 4.5 (replacing 4.0) — Baidu's latest version

• Grok 4 (replacing Grok 3) — xAI's new flagship

[Retained models]

• Claude Sonnet 4.6 — latest in the Sonnet line, still participating

• GPT-o3 — latest in OpenAI's reasoning line, still participating

• Doubao Pro — ByteDance flagship, still participating

[Retired models]

GPT-4o, GPT-4o-mini, Claude Opus 4.6, DeepSeek V3, DeepSeek R1, Gemini 2.0 Flash, Grok 3, Qwen Max, ERNIE 4.0

Historical leaderboards remain unchanged.

2026-05-01 03:01 SGT Smoke Evaluation Completed

11 models Started: 2026-05-01 03:00 SGT Completed: 2026-05-01 03:01 SGT 1m32s Run #91 Formula v7 · Judge v6 · Benchmark v6

2026-04-30 03:01 SGT Smoke Evaluation Completed

11 models Started: 2026-04-30 03:00 SGT Completed: 2026-04-30 03:01 SGT 1m51s Run #90 Formula v7 · Judge v6 · Benchmark v6

2026-04-29 03:02 SGT Smoke Evaluation Completed

11 models Started: 2026-04-29 03:00 SGT Completed: 2026-04-29 03:02 SGT 2m11s Run #89 Formula v7 · Judge v6 · Benchmark v6

2026-04-28 03:02 SGT Smoke Evaluation Completed

11 models Started: 2026-04-28 03:00 SGT Completed: 2026-04-28 03:02 SGT 2m21s Run #88 Formula v7 · Judge v6 · Benchmark v6

2026-04-27 03:01 SGT Smoke Evaluation Completed

11 models Started: 2026-04-27 03:00 SGT Completed: 2026-04-27 03:01 SGT 1m51s Run #86 Formula v7 · Judge v6 · Benchmark v6

2026-04-26 03:01 SGT Smoke Evaluation Completed

11 models Started: 2026-04-26 03:00 SGT Completed: 2026-04-26 03:01 SGT 1m21s Run #85 Formula v7 · Judge v6 · Benchmark v6

2026-04-25 03:02 SGT Smoke Evaluation Completed

11 models Started: 2026-04-25 03:00 SGT Completed: 2026-04-25 03:02 SGT 2m22s Run #84 Formula v7 · Judge v6 · Benchmark v6

2026-04-24 03:03 SGT Smoke Evaluation Completed

11 models Started: 2026-04-24 03:00 SGT Completed: 2026-04-24 03:03 SGT 3m21s Run #83 Formula v7 · Judge v6 · Benchmark v6

2026-04-23 03:02 SGT Smoke Evaluation Completed

11 models Started: 2026-04-23 03:00 SGT Completed: 2026-04-23 03:02 SGT 2m21s Run #82 Formula v7 · Judge v6 · Benchmark v6

2026-04-22 03:02 SGT Smoke Evaluation Completed

11 models Started: 2026-04-22 03:00 SGT Completed: 2026-04-22 03:02 SGT 2m22s Run #81 Formula v7 · Judge v6 · Benchmark v6

2026-04-21 03:36 SGT Smoke Evaluation Completed

1 model Started: 2026-04-21 03:34 SGT Completed: 2026-04-21 03:36 SGT 2m20s Run #80 Formula v7 · Judge v6 · Benchmark v6

2026-04-21 03:01 SGT Smoke Evaluation Completed

11 models Started: 2026-04-21 03:00 SGT Completed: 2026-04-21 03:01 SGT 1m31s Run #79 Formula v7 · Judge v6 · Benchmark v6

2026-04-20 03:01 SGT Smoke Evaluation Completed

10 models Started: 2026-04-20 03:00 SGT Completed: 2026-04-20 03:01 SGT 1m21s Run #77 Formula v7 · Judge v6 · Benchmark v6

2026-04-19 03:01 SGT Smoke Evaluation Completed

10 models Started: 2026-04-19 03:00 SGT Completed: 2026-04-19 03:01 SGT 1m21s Run #76 Formula v7 · Judge v6 · Benchmark v6

2026-04-18 11:04 SGT Smoke Evaluation Completed

11 models Started: 2026-04-18 11:02 SGT Completed: 2026-04-18 11:04 SGT 1m41s Run #75 Formula v7 · Judge v6 · Benchmark v6

2026-04-17 03:02 SGT Smoke Evaluation Completed

11 models Started: 2026-04-17 03:00 SGT Completed: 2026-04-17 03:02 SGT 2m1s Run #73 Formula v7 · Judge v6 · Benchmark v6

2026-04-16 03:01 SGT Smoke Evaluation Completed

10 models Started: 2026-04-16 03:00 SGT Completed: 2026-04-16 03:01 SGT 1m31s Run #72 Formula v7 · Judge v6 · Benchmark v6

2026-04-15 03:02 SGT Smoke Evaluation Completed

10 models Started: 2026-04-15 03:00 SGT Completed: 2026-04-15 03:02 SGT 2m21s Run #71 Formula v7 · Judge v6 · Benchmark v6

2026-04-14 03:01 SGT Smoke Evaluation Completed

10 models Started: 2026-04-14 03:00 SGT Completed: 2026-04-14 03:01 SGT 1m41s Run #70 Formula v7 · Judge v6 · Benchmark v6

2026-04-13 03:01 SGT Smoke Evaluation Completed

11 models Started: 2026-04-13 03:00 SGT Completed: 2026-04-13 03:01 SGT 1m11s Run #68 Formula v7 · Judge v6 · Benchmark v6

2026-04-12 03:02 SGT SmokeEvaluation Completed

11 Model Started：2026-04-12 03:00 SGT Completed：2026-04-12 03:02 SGT 2m11s Run #67 Formula v7 · Judge v6 · Benchmark v6

2026-04-11 03:01 SGT SmokeEvaluation Completed

11 Model Started：2026-04-11 03:00 SGT Completed：2026-04-11 03:01 SGT 1m51s Run #66 Formula v7 · Judge v6 · Benchmark v6

2026-04-10 03:01 SGT SmokeEvaluation Completed

11 Model Started：2026-04-10 03:00 SGT Completed：2026-04-10 03:01 SGT 1m31s Run #65 Formula v7 · Judge v6 · Benchmark v6

2026-04-09 03:01 SGT SmokeEvaluation Completed

11 Model Started：2026-04-09 03:00 SGT Completed：2026-04-09 03:01 SGT 1m41s Run #64 Formula v7 · Judge v6 · Benchmark v6

2026-04-08 03:02 SGT SmokeEvaluation Completed

11 Model Started：2026-04-08 03:00 SGT Completed：2026-04-08 03:02 SGT 2m1s Run #63 Formula v7 · Judge v6 · Benchmark v6

2026-04-07 03:01 SGT SmokeEvaluation Completed

11 Model Started：2026-04-07 03:00 SGT Completed：2026-04-07 03:01 SGT 1m21s Run #62 Formula v7 · Judge v6 · Benchmark v6

2026-04-06 03:01 SGT SmokeEvaluation Completed

11 Model Started：2026-04-06 03:00 SGT Completed：2026-04-06 03:01 SGT 1m31s Run #60 Formula v7 · Judge v6 · Benchmark v6

2026-04-05 03:01 SGT SmokeEvaluation Completed

11 Model Started：2026-04-05 03:00 SGT Completed：2026-04-05 03:01 SGT 1m21s Run #59 Formula v7 · Judge v6 · Benchmark v6

2026-04-04 03:31 SGT SmokeEvaluation Completed social_monitor

1 Model Started：2026-04-04 03:30 SGT Completed：2026-04-04 03:31 SGT 40s Run #58 Formula v7 · Judge v6 · Benchmark v6

2026-04-04 03:01 SGT SmokeEvaluation Completed

11 Model Started：2026-04-04 03:00 SGT Completed：2026-04-04 03:01 SGT 1m21s Run #57 Formula v7 · Judge v6 · Benchmark v6

2026-04-03 03:01 SGT SmokeEvaluation Completed

11 Model Started：2026-04-03 03:00 SGT Completed：2026-04-03 03:01 SGT 1m11s Run #56 Formula v7 · Judge v6 · Benchmark v6

2026-04-02 03:01 SGT SmokeEvaluation Completed

11 Model Started：2026-04-02 03:00 SGT Completed：2026-04-02 03:01 SGT 1m31s Run #55 Formula v7 · Judge v6 · Benchmark v6

2026-04-01 03:01 SGT SmokeEvaluation Completed

11 Model Started：2026-04-01 03:00 SGT Completed：2026-04-01 03:01 SGT 1m41s Run #54 Formula v7 · Judge v6 · Benchmark v6

2026-03-31 03:01 SGT SmokeEvaluation Completed

11 Model Started：2026-03-31 03:00 SGT Completed：2026-03-31 03:01 SGT 1m11s Run #53 Formula v7 · Judge v6 · Benchmark v6

2026-03-30 03:31 SGT SmokeEvaluation Completed social_monitor

1 Model Started：2026-03-30 03:30 SGT Completed：2026-03-30 03:31 SGT 50s Run #51 Formula v7 · Judge v6 · Benchmark v6

2026-03-30 03:01 SGT SmokeEvaluation Completed

11 Model Started：2026-03-30 03:00 SGT Completed：2026-03-30 03:01 SGT 1m40s Run #50 Formula v7 · Judge v6 · Benchmark v6

2026-03-29 03:01 SGT SmokeEvaluation Completed

11 Model Started：2026-03-29 03:00 SGT Completed：2026-03-29 03:01 SGT 1m40s Run #49 Formula v7 · Judge v6 · Benchmark v6

2026-03-28 03:02 SGT SmokeEvaluation Completed

11 Model Started：2026-03-28 03:00 SGT Completed：2026-03-28 03:02 SGT 2m11s Run #47 Formula v7 · Judge v6 · Benchmark v6

2026-03-27 05:05 SGT SmokeEvaluation Completed

11 Model Started：2026-03-27 05:04 SGT Completed：2026-03-27 05:05 SGT 1m41s Run #46 Formula v7 · Judge v6 · Benchmark v6

2026-03-25 00:11 SGT SmokeEvaluation Completed

11 Model Started：2026-03-25 00:11 SGT Completed：2026-03-25 00:11 SGT 10s Run #42 Formula v7 · Judge v6 · Benchmark v6

2026-03-24 00:00 SGT Version Upgrade

Winzheng Index v6 officially launched

Methodology upgrade

• Question bank expanded from 200 to 212 questions, adding 12 integrity stress-test questions

• Dimension system restructured: the main board now only includes two auditable core dimensions, "Code Execution" and "Material Constraints"

• Added "Engineering Judgment" and "Task Expression" side boards (marked as AI-assisted evaluation)

• Added an "Integrity Rating" gate (pass/warn/fail); models failing integrity are capped on the main board

• Main board formula: core_overall = 0.55 × Code Execution + 0.45 × Material Constraints

• Stability, availability and cost-effectiveness downgraded to operational signals, no longer mixed into main board weights
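
As a rough illustration of how the main board formula and the integrity gate listed above might combine, here is a minimal Python sketch. The weights 0.55 and 0.45 come from this changelog; the 0-100 score scale, the function and field names, and the cap value of 60 are assumptions made for illustration only, since the changelog does not publish the actual cap.

```python
from dataclasses import dataclass

# Weights taken from the main board formula in this changelog.
W_CODE_EXECUTION = 0.55
W_MATERIAL_CONSTRAINTS = 0.45

# ASSUMPTION: the changelog says models failing integrity are capped,
# but does not state the cap. 60.0 is a placeholder, not the real value.
ASSUMED_FAIL_CAP = 60.0

VALID_RATINGS = {"pass", "warn", "fail"}


@dataclass
class MainBoardInput:
    code_execution: float        # assumed 0-100 scale
    material_constraints: float  # assumed 0-100 scale
    integrity: str               # "pass", "warn" or "fail"


def core_overall(m: MainBoardInput, fail_cap: float = ASSUMED_FAIL_CAP) -> float:
    """Weighted main board score, capped when the integrity gate fails."""
    if m.integrity not in VALID_RATINGS:
        raise ValueError(f"unknown integrity rating: {m.integrity!r}")
    score = (W_CODE_EXECUTION * m.code_execution
             + W_MATERIAL_CONSTRAINTS * m.material_constraints)
    if m.integrity == "fail":
        # Only "fail" is capped here; how "warn" is treated is not documented.
        score = min(score, fail_cap)
    return round(score, 2)
```

Under these assumptions, a model scoring 80 on Code Execution and 60 on Material Constraints gets 0.55 × 80 + 0.45 × 60 = 71 with a passing integrity rating, and is held to the placeholder cap if it fails the gate.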

Scoring engine

• Added exact_rank scorer supporting closed-form ranking for integrity stress tests

• Parallel evaluation architecture upgraded to 55 processes (11 models × 5 capability layers); a full run takes ~15 minutes
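
The 55-process figure above is the product of 11 models and 5 capability layers. A minimal Python sketch of that fan-out follows; the model and layer names are placeholders (the changelog does not name the layers), and the use of `ProcessPoolExecutor` is an assumption about how such a matrix could be run, not a description of the actual engine.

```python
from concurrent.futures import ProcessPoolExecutor
from itertools import product
from typing import Callable, List, Sequence, Tuple

Job = Tuple[str, str]  # (model, capability layer)

# Placeholder names: the real model IDs and layer names are not given here.
MODELS = [f"model_{i:02d}" for i in range(1, 12)]  # 11 models
LAYERS = [f"layer_{i}" for i in range(1, 6)]       # 5 capability layers


def build_jobs(models: Sequence[str], layers: Sequence[str]) -> List[Job]:
    """One job per (model, layer) pair."""
    return list(product(models, layers))


def run_all(jobs: Sequence[Job], worker: Callable[[Job], float]) -> List[float]:
    """Run every job in its own process.

    `worker` must be a picklable top-level function that evaluates
    one (model, layer) pair and returns its score.
    """
    with ProcessPoolExecutor(max_workers=len(jobs)) as pool:
        return list(pool.map(worker, jobs))
```

With the placeholder lists above, `build_jobs(MODELS, LAYERS)` yields 55 pairs, matching the process count in the changelog.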

Social sentiment monitoring (new)

• Daily automatic monitoring of user feedback on the 11 models on X/Twitter

• Sentiment anomalies automatically trigger targeted re-evaluation, cross-validated with benchmark data

• Daily automatic monitoring of AI vendors' official accounts

Data page rebuild

• Raw data page rebuilt into summary + pagination; page size reduced from 29MB to 64KB

• Question texts and expected answers are no longer public, preventing contamination

2026-03-22 14:05 SGT SmokeEvaluation Completed

2 Model Started：2026-03-22 14:05 SGT Completed：2026-03-22 14:05 SGT 10s Run #36 Formula v5 · Judge v6 · Benchmark v5.1

2026-03-21 12:11 SGT SmokeEvaluation Completed

11 Model Started：2026-03-21 12:08 SGT Completed：2026-03-21 12:11 SGT 3m0s Run #32 Formula v3 · Judge v5 · Benchmark v4

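Judge v5: introduces strict/non-strict scoring tiers. Adds 4 strict scoring types (exact_rank, exact_boolean_set, exact_numeric_set, exact_json_value); strict questions score only 0 or 100, with no partial credit. Ranking questions, True/False questions, single-value numeric questions and similar are marked strict=true

Benchmark v4: question bank expanded from 89 to 100 questions (coding 33 + knowledge 45 + long context 22), adding 11 high-quality decision questions covering contradictory information detection, honesty under insufficient information, prioritization, conflict-of-interest detection, code review traps, and ethical boundaries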
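
To make the all-or-nothing behaviour of the Judge v5 strict types concrete, here is a small Python sketch. Only the four type names and the 0-or-100 rule come from the changelog; the input shapes, the comparison rules, and the numeric tolerance are assumptions chosen for illustration and may differ from the real judge.

```python
import json
import math
from typing import Any, Mapping, Sequence


def exact_rank(answer: Sequence[str], expected: Sequence[str]) -> int:
    """100 only if the ranking matches item by item, in order."""
    return 100 if list(answer) == list(expected) else 0


def exact_boolean_set(answer: Mapping[str, bool],
                      expected: Mapping[str, bool]) -> int:
    """100 only if every True/False judgment matches."""
    return 100 if dict(answer) == dict(expected) else 0


def exact_numeric_set(answer: Sequence[float], expected: Sequence[float],
                      abs_tol: float = 1e-9) -> int:
    """100 only if both hold the same numbers, ignoring order.

    ASSUMPTION: a tiny absolute tolerance absorbs float noise.
    """
    if len(answer) != len(expected):
        return 0
    pairs = zip(sorted(answer), sorted(expected))
    same = all(math.isclose(a, e, rel_tol=0.0, abs_tol=abs_tol) for a, e in pairs)
    return 100 if same else 0


def exact_json_value(answer_text: str, expected: Any) -> int:
    """100 only if the answer parses as JSON and equals the expected value."""
    try:
        value = json.loads(answer_text)
    except json.JSONDecodeError:
        return 0
    # Compare canonical dumps so that 1 and true are not treated as equal.
    same = json.dumps(value, sort_keys=True) == json.dumps(expected, sort_keys=True)
    return 100 if same else 0
```

None of these scorers returns a value between 0 and 100, which is the property the strict tier is described as enforcing.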

2026-03-21 01:21 SGT SmokeEvaluation Completed

11 Model Started：2026-03-21 01:21 SGT Completed：2026-03-21 01:21 SGT 10s Run #26 Formula v3 · Judge v5 · Benchmark v4

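Judge v5: introduces strict/non-strict scoring tiers. Adds 4 strict scoring types (exact_rank, exact_boolean_set, exact_numeric_set, exact_json_value); strict questions score only 0 or 100, with no partial credit. Ranking questions, True/False questions, single-value numeric questions and similar are marked strict=true

Benchmark v4: question bank expanded from 89 to 100 questions (coding 33 + knowledge 45 + long context 22), adding 11 high-quality decision questions covering contradictory information detection, honesty under insufficient information, prioritization, conflict-of-interest detection, code review traps, and ethical boundaries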

2026-03-21 01:19 SGT Benchmark Change

Question bank v4: 11 new high-quality decision questions

Added 11 high-quality decision questions covering: contradictory information detection (2), honesty under insufficient information (2), prioritization (2), conflict-of-interest detection (2), code review traps (2), and ethical boundaries (1). The question bank grew from 89 to 100 questions. Question bank version upgraded to v4.

2026-03-21 01:05 SGT Model Change

Added 3 benchmark models: Grok 3, Doubao Pro, ERNIE 4.0

Added 3 benchmark models: Grok 3 (xAI), Doubao Pro (ByteDance), ERNIE 4.0 (Baidu). Total benchmark models increased from 8 to 11.

2026-03-21 01:05 SGT SmokeEvaluation Completed

11 Model Started：2026-03-21 01:05 SGT Completed：2026-03-21 01:05 SGT 10s Run #25 Formula v3 · Judge v5 · Benchmark v3

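Judge v5: introduces strict/non-strict scoring tiers. Adds 4 strict scoring types (exact_rank, exact_boolean_set, exact_numeric_set, exact_json_value); strict questions score only 0 or 100, with no partial credit. Ranking questions, True/False questions, single-value numeric questions and similar are marked strict=true

Benchmark v3: question bank expanded from 80 to 89 questions (coding 33 + knowledge 34 + long context 22); knowledge work gains an engineering judgment group (9 questions) covering practical scenarios such as technology selection, architecture trade-offs, and troubleshooting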

2026-03-21 00:59 SGT SmokeEvaluation Completed

10 Model Started：2026-03-21 00:59 SGT Completed：2026-03-21 00:59 SGT 9s Run #24 Formula v3 · Judge v5 · Benchmark v3

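Judge v5: introduces strict/non-strict scoring tiers. Adds 4 strict scoring types (exact_rank, exact_boolean_set, exact_numeric_set, exact_json_value); strict questions score only 0 or 100, with no partial credit. Ranking questions, True/False questions, single-value numeric questions and similar are marked strict=true

Benchmark v3: question bank expanded from 80 to 89 questions (coding 33 + knowledge 34 + long context 22); knowledge work gains an engineering judgment group (9 questions) covering practical scenarios such as technology selection, architecture trade-offs, and troubleshooting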

2026-03-20 12:55 SGT SmokeEvaluation Completed

8 Model Started：2026-03-20 12:44 SGT Completed：2026-03-20 12:55 SGT 10m39s Run #23 Formula v3 · Judge v5 · Benchmark v3

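Judge v5: introduces strict/non-strict scoring tiers. Adds 4 strict scoring types (exact_rank, exact_boolean_set, exact_numeric_set, exact_json_value); strict questions score only 0 or 100, with no partial credit. Ranking questions, True/False questions, single-value numeric questions and similar are marked strict=true

Benchmark v3: question bank expanded from 80 to 89 questions (coding 33 + knowledge 34 + long context 22); knowledge work gains an engineering judgment group (9 questions) covering practical scenarios such as technology selection, architecture trade-offs, and troubleshooting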

2026-03-20 03:10 SGT SmokeEvaluation Completed

8 Model Started：2026-03-20 03:00 SGT Completed：2026-03-20 03:10 SGT 10m50s Run #22 Formula v3 · Judge v5 · Benchmark v3

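Judge v5: introduces strict/non-strict scoring tiers. Adds 4 strict scoring types (exact_rank, exact_boolean_set, exact_numeric_set, exact_json_value); strict questions score only 0 or 100, with no partial credit. Ranking questions, True/False questions, single-value numeric questions and similar are marked strict=true

Benchmark v3: question bank expanded from 80 to 89 questions (coding 33 + knowledge 34 + long context 22); knowledge work gains an engineering judgment group (9 questions) covering practical scenarios such as technology selection, architecture trade-offs, and troubleshooting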

2026-03-19 03:11 SGT SmokeEvaluation Completed

8 Model Started：2026-03-19 03:00 SGT Completed：2026-03-19 03:11 SGT 11m42s Run #18 Formula v3 · Judge v5 · Benchmark v2

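Judge v5: introduces strict/non-strict scoring tiers. Adds 4 strict scoring types (exact_rank, exact_boolean_set, exact_numeric_set, exact_json_value); strict questions score only 0 or 100, with no partial credit. Ranking questions, True/False questions, single-value numeric questions and similar are marked strict=true

Benchmark v2: question bank expanded from 30 to 80 questions (coding 33 + knowledge 25 + long context 22); coding adds dynamic programming and concurrency analysis, and knowledge work adds multi-step reasoning questions such as compound interest calculation and time zone reasoning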

2026-03-18 03:11 SGT SmokeEvaluation Completed

8 Model Started：2026-03-18 03:00 SGT Completed：2026-03-18 03:11 SGT 11m18s Run #17 Formula v3 · Judge v5 · Benchmark v2

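Judge v5: introduces strict/non-strict scoring tiers. Adds 4 strict scoring types (exact_rank, exact_boolean_set, exact_numeric_set, exact_json_value); strict questions score only 0 or 100, with no partial credit. Ranking questions, True/False questions, single-value numeric questions and similar are marked strict=true

Benchmark v2: question bank expanded from 30 to 80 questions (coding 33 + knowledge 25 + long context 22); coding adds dynamic programming and concurrency analysis, and knowledge work adds multi-step reasoning questions such as compound interest calculation and time zone reasoning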

2026-03-17 03:10 SGT SmokeEvaluation Completed

8 Model Started：2026-03-17 03:00 SGT Completed：2026-03-17 03:10 SGT 10m54s Run #12 Formula v2 · Judge v2 · Benchmark v1

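Judge v2: introduces six scoring methods (all-hit, partial-hit, exact match, regex, ordered match, JSON structure validation), the first reasonably formal scoring system

Benchmark v1: initial question bank of 30 questions covering three dimensions: coding, knowledge work, and long context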
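
The six Judge v2 scoring methods are only named in this changelog, so the following Python sketch is one plausible reading of each, not the actual judge. The function names, the keyword-based inputs, the 0-100 scale, and the choice to give proportional credit only in the partial-hit method are all assumptions.

```python
import json
import re
from typing import Sequence


def all_hit(answer: str, keywords: Sequence[str]) -> float:
    """Full marks only if every keyword appears in the answer."""
    return 100.0 if all(k in answer for k in keywords) else 0.0


def partial_hit(answer: str, keywords: Sequence[str]) -> float:
    """Credit proportional to the share of keywords found."""
    if not keywords:
        raise ValueError("keywords must not be empty")
    hits = sum(1 for k in keywords if k in answer)
    return 100.0 * hits / len(keywords)


def exact_match(answer: str, expected: str) -> float:
    """Full marks if the answer equals the expected text, ignoring outer whitespace."""
    return 100.0 if answer.strip() == expected.strip() else 0.0


def regex_match(answer: str, pattern: str) -> float:
    """Full marks if the pattern matches anywhere in the answer."""
    return 100.0 if re.search(pattern, answer) else 0.0


def ordered_match(answer: str, keywords: Sequence[str]) -> float:
    """Full marks if the keywords appear in the given order."""
    position = 0
    for keyword in keywords:
        found = answer.find(keyword, position)
        if found < 0:
            return 0.0
        position = found + len(keyword)
    return 100.0


def json_structure(answer: str, required_keys: Sequence[str]) -> float:
    """Full marks if the answer is a JSON object containing every required key."""
    try:
        value = json.loads(answer)
    except json.JSONDecodeError:
        return 0.0
    ok = isinstance(value, dict) and all(k in value for k in required_keys)
    return 100.0 if ok else 0.0
```

For example, under these assumptions `partial_hit("alpha beta", ["alpha", "beta", "gamma", "delta"])` gives 50.0, while `all_hit` on the same inputs gives 0.0.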