YZ Index

Changelog

Version history of the YZ Index evaluation system.

2026-04-29 03:02 SGT Smoke Evaluation Completed
11 models · Started: 2026-04-29 03:00 SGT · Completed: 2026-04-29 03:02 SGT · 2m11s · Run #89 · Formula v7 · Judge v6 · Benchmark v6
2026-04-28 03:02 SGT Smoke Evaluation Completed
11 models · Started: 2026-04-28 03:00 SGT · Completed: 2026-04-28 03:02 SGT · 2m21s · Run #88 · Formula v7 · Judge v6 · Benchmark v6
2026-04-27 04:18 SGT Full Evaluation Completed
11 models · Started: 2026-04-27 04:00 SGT · Completed: 2026-04-27 04:18 SGT · 18m17s · Run #87 · Formula v7 · Judge v6 · Benchmark v6
2026-04-27 03:01 SGT Smoke Evaluation Completed
11 models · Started: 2026-04-27 03:00 SGT · Completed: 2026-04-27 03:01 SGT · 1m51s · Run #86 · Formula v7 · Judge v6 · Benchmark v6
2026-04-26 03:01 SGT Smoke Evaluation Completed
11 models · Started: 2026-04-26 03:00 SGT · Completed: 2026-04-26 03:01 SGT · 1m21s · Run #85 · Formula v7 · Judge v6 · Benchmark v6
2026-04-25 03:02 SGT Smoke Evaluation Completed
11 models · Started: 2026-04-25 03:00 SGT · Completed: 2026-04-25 03:02 SGT · 2m22s · Run #84 · Formula v7 · Judge v6 · Benchmark v6
2026-04-24 03:03 SGT Smoke Evaluation Completed
11 models · Started: 2026-04-24 03:00 SGT · Completed: 2026-04-24 03:03 SGT · 3m21s · Run #83 · Formula v7 · Judge v6 · Benchmark v6
2026-04-23 03:02 SGT Smoke Evaluation Completed
11 models · Started: 2026-04-23 03:00 SGT · Completed: 2026-04-23 03:02 SGT · 2m21s · Run #82 · Formula v7 · Judge v6 · Benchmark v6
2026-04-22 03:02 SGT Smoke Evaluation Completed
11 models · Started: 2026-04-22 03:00 SGT · Completed: 2026-04-22 03:02 SGT · 2m22s · Run #81 · Formula v7 · Judge v6 · Benchmark v6
2026-04-21 03:36 SGT Smoke Evaluation Completed
1 model · Started: 2026-04-21 03:34 SGT · Completed: 2026-04-21 03:36 SGT · 2m20s · Run #80 · Formula v7 · Judge v6 · Benchmark v6
2026-04-21 03:01 SGT Smoke Evaluation Completed
11 models · Started: 2026-04-21 03:00 SGT · Completed: 2026-04-21 03:01 SGT · 1m31s · Run #79 · Formula v7 · Judge v6 · Benchmark v6
2026-04-20 04:15 SGT Full Evaluation Completed
10 models · Started: 2026-04-20 04:00 SGT · Completed: 2026-04-20 04:15 SGT · 15m31s · Run #78 · Formula v7 · Judge v6 · Benchmark v6
2026-04-20 03:01 SGT Smoke Evaluation Completed
10 models · Started: 2026-04-20 03:00 SGT · Completed: 2026-04-20 03:01 SGT · 1m21s · Run #77 · Formula v7 · Judge v6 · Benchmark v6
2026-04-19 03:01 SGT Smoke Evaluation Completed
10 models · Started: 2026-04-19 03:00 SGT · Completed: 2026-04-19 03:01 SGT · 1m21s · Run #76 · Formula v7 · Judge v6 · Benchmark v6
2026-04-18 11:04 SGT Smoke Evaluation Completed
11 models · Started: 2026-04-18 11:02 SGT · Completed: 2026-04-18 11:04 SGT · 1m41s · Run #75 · Formula v7 · Judge v6 · Benchmark v6
2026-04-17 03:02 SGT Smoke Evaluation Completed
11 models · Started: 2026-04-17 03:00 SGT · Completed: 2026-04-17 03:02 SGT · 2m1s · Run #73 · Formula v7 · Judge v6 · Benchmark v6
2026-04-16 03:01 SGT Smoke Evaluation Completed
10 models · Started: 2026-04-16 03:00 SGT · Completed: 2026-04-16 03:01 SGT · 1m31s · Run #72 · Formula v7 · Judge v6 · Benchmark v6
2026-04-15 03:02 SGT Smoke Evaluation Completed
10 models · Started: 2026-04-15 03:00 SGT · Completed: 2026-04-15 03:02 SGT · 2m21s · Run #71 · Formula v7 · Judge v6 · Benchmark v6
2026-04-14 03:01 SGT Smoke Evaluation Completed
10 models · Started: 2026-04-14 03:00 SGT · Completed: 2026-04-14 03:01 SGT · 1m41s · Run #70 · Formula v7 · Judge v6 · Benchmark v6
2026-04-13 04:19 SGT Full Evaluation Completed
11 models · Started: 2026-04-13 04:00 SGT · Completed: 2026-04-13 04:19 SGT · 19m46s · Run #69 · Formula v7 · Judge v6 · Benchmark v6
2026-04-13 03:01 SGT Smoke Evaluation Completed
11 models · Started: 2026-04-13 03:00 SGT · Completed: 2026-04-13 03:01 SGT · 1m11s · Run #68 · Formula v7 · Judge v6 · Benchmark v6
2026-04-12 03:02 SGT Smoke Evaluation Completed
11 models · Started: 2026-04-12 03:00 SGT · Completed: 2026-04-12 03:02 SGT · 2m11s · Run #67 · Formula v7 · Judge v6 · Benchmark v6
2026-04-11 03:01 SGT Smoke Evaluation Completed
11 models · Started: 2026-04-11 03:00 SGT · Completed: 2026-04-11 03:01 SGT · 1m51s · Run #66 · Formula v7 · Judge v6 · Benchmark v6
2026-04-10 03:01 SGT Smoke Evaluation Completed
11 models · Started: 2026-04-10 03:00 SGT · Completed: 2026-04-10 03:01 SGT · 1m31s · Run #65 · Formula v7 · Judge v6 · Benchmark v6
2026-04-09 03:01 SGT Smoke Evaluation Completed
11 models · Started: 2026-04-09 03:00 SGT · Completed: 2026-04-09 03:01 SGT · 1m41s · Run #64 · Formula v7 · Judge v6 · Benchmark v6
2026-04-08 03:02 SGT Smoke Evaluation Completed
11 models · Started: 2026-04-08 03:00 SGT · Completed: 2026-04-08 03:02 SGT · 2m1s · Run #63 · Formula v7 · Judge v6 · Benchmark v6
2026-04-07 03:01 SGT Smoke Evaluation Completed
11 models · Started: 2026-04-07 03:00 SGT · Completed: 2026-04-07 03:01 SGT · 1m21s · Run #62 · Formula v7 · Judge v6 · Benchmark v6
2026-04-06 04:18 SGT Full Evaluation Completed
11 models · Started: 2026-04-06 04:00 SGT · Completed: 2026-04-06 04:18 SGT · 18m47s · Run #61 · Formula v7 · Judge v6 · Benchmark v6
2026-04-06 03:01 SGT Smoke Evaluation Completed
11 models · Started: 2026-04-06 03:00 SGT · Completed: 2026-04-06 03:01 SGT · 1m31s · Run #60 · Formula v7 · Judge v6 · Benchmark v6
2026-04-05 03:01 SGT Smoke Evaluation Completed
11 models · Started: 2026-04-05 03:00 SGT · Completed: 2026-04-05 03:01 SGT · 1m21s · Run #59 · Formula v7 · Judge v6 · Benchmark v6
2026-04-04 03:31 SGT Smoke Evaluation Completed · social_monitor
1 model · Started: 2026-04-04 03:30 SGT · Completed: 2026-04-04 03:31 SGT · 40s · Run #58 · Formula v7 · Judge v6 · Benchmark v6
2026-04-04 03:01 SGT Smoke Evaluation Completed
11 models · Started: 2026-04-04 03:00 SGT · Completed: 2026-04-04 03:01 SGT · 1m21s · Run #57 · Formula v7 · Judge v6 · Benchmark v6
2026-04-03 03:01 SGT Smoke Evaluation Completed
11 models · Started: 2026-04-03 03:00 SGT · Completed: 2026-04-03 03:01 SGT · 1m11s · Run #56 · Formula v7 · Judge v6 · Benchmark v6
2026-04-02 03:01 SGT Smoke Evaluation Completed
11 models · Started: 2026-04-02 03:00 SGT · Completed: 2026-04-02 03:01 SGT · 1m31s · Run #55 · Formula v7 · Judge v6 · Benchmark v6
2026-04-01 03:01 SGT Smoke Evaluation Completed
11 models · Started: 2026-04-01 03:00 SGT · Completed: 2026-04-01 03:01 SGT · 1m41s · Run #54 · Formula v7 · Judge v6 · Benchmark v6
2026-03-31 03:01 SGT Smoke Evaluation Completed
11 models · Started: 2026-03-31 03:00 SGT · Completed: 2026-03-31 03:01 SGT · 1m11s · Run #53 · Formula v7 · Judge v6 · Benchmark v6
2026-03-30 04:16 SGT Full Evaluation Completed
11 models · Started: 2026-03-30 04:00 SGT · Completed: 2026-03-30 04:16 SGT · 16m17s · Run #52 · Formula v7 · Judge v6 · Benchmark v6
2026-03-30 03:31 SGT Smoke Evaluation Completed · social_monitor
1 model · Started: 2026-03-30 03:30 SGT · Completed: 2026-03-30 03:31 SGT · 50s · Run #51 · Formula v7 · Judge v6 · Benchmark v6
2026-03-30 03:01 SGT Smoke Evaluation Completed
11 models · Started: 2026-03-30 03:00 SGT · Completed: 2026-03-30 03:01 SGT · 1m40s · Run #50 · Formula v7 · Judge v6 · Benchmark v6
2026-03-29 03:01 SGT Smoke Evaluation Completed
11 models · Started: 2026-03-29 03:00 SGT · Completed: 2026-03-29 03:01 SGT · 1m40s · Run #49 · Formula v7 · Judge v6 · Benchmark v6
2026-03-28 03:02 SGT Smoke Evaluation Completed
11 models · Started: 2026-03-28 03:00 SGT · Completed: 2026-03-28 03:02 SGT · 2m11s · Run #47 · Formula v7 · Judge v6 · Benchmark v6
2026-03-27 05:05 SGT Smoke Evaluation Completed
11 models · Started: 2026-03-27 05:04 SGT · Completed: 2026-03-27 05:05 SGT · 1m41s · Run #46 · Formula v7 · Judge v6 · Benchmark v6
2026-03-25 00:12 SGT Full Evaluation Completed
11 models · Started: 2026-03-25 00:11 SGT · Completed: 2026-03-25 00:12 SGT · 16s · Run #43 · Formula v7 · Judge v6 · Benchmark v6
2026-03-25 00:11 SGT Smoke Evaluation Completed
11 models · Started: 2026-03-25 00:11 SGT · Completed: 2026-03-25 00:11 SGT · 10s · Run #42 · Formula v7 · Judge v6 · Benchmark v6
2026-03-24 16:44 SGT Full Evaluation Completed
11 models · Started: 2026-03-24 16:29 SGT · Completed: 2026-03-24 16:44 SGT · 15m31s · Run #41 · Formula v7 · Judge v6 · Benchmark v6
2026-03-24 15:50 SGT Full Evaluation Completed · migration
11 models · Started: 2026-03-24 15:32 SGT · Completed: 2026-03-24 15:50 SGT · 17m31s · Run #40 · Formula v7 · Judge v6 · Benchmark v6
2026-03-24 15:31 SGT Full Evaluation Completed · migration
11 models · Started: 2026-03-24 15:31 SGT · Completed: 2026-03-24 15:31 SGT · 16s · Run #39 · Formula v7 · Judge v6 · Benchmark v6
2026-03-24 15:23 SGT Full Evaluation Completed · migration
11 models · Started: 2026-03-24 15:22 SGT · Completed: 2026-03-24 15:23 SGT · 30s · Run #38 · Formula v7 · Judge v6 · Benchmark v6
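2026-03-24 00:00 SGT Version Upgrade
YZ Index v6 officially launched

Methodology upgrades

• Question bank expanded from 200 to 212 questions, adding 12 integrity stress-test questions
• Dimension system restructured: the main leaderboard now includes only two auditable core dimensions, Code Execution and Material Constraints
• Added side leaderboards for Engineering Judgment and Task Expression (labeled as AI-assisted evaluation)
• Added an integrity rating gate (pass/warn/fail); models that fall short on integrity have their main-leaderboard score capped
• Main leaderboard formula: core_overall = 0.55 × Code Execution + 0.45 × Material Constraints
• Stability, availability, and cost-effectiveness are downgraded to operational signals and no longer count toward main-leaderboard weights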
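
As a rough illustration of the core_overall formula and the integrity gate, here is a minimal Python sketch. The weights come from the changelog; the cap values in ASSUMED_CAPS and the function names are hypothetical, since the changelog does not publish how the cap is applied.

```python
# Hypothetical caps per integrity rating; the real thresholds are not published.
ASSUMED_CAPS = {"pass": 100.0, "warn": 80.0, "fail": 60.0}


def core_overall(code_execution: float, material_constraints: float) -> float:
    """Weighted main-leaderboard score, using the weights stated in the changelog."""
    return 0.55 * code_execution + 0.45 * material_constraints


def gated_score(code_execution: float, material_constraints: float, integrity: str) -> float:
    """Apply the (assumed) integrity cap on top of the weighted score."""
    raw = core_overall(code_execution, material_constraints)
    return min(raw, ASSUMED_CAPS[integrity])


if __name__ == "__main__":
    # A model with strong scores but a failing integrity rating is capped.
    print(round(core_overall(80, 60), 2))
    print(gated_score(90, 90, "fail"))
```

Under these assumptions the operational signals (stability, availability, cost-effectiveness) never enter the function, which matches their removal from the main-leaderboard weights.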

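Judging engine

• Added the exact_rank grader, which supports closed-ended ranking grading for the integrity stress tests
• Parallel evaluation architecture upgraded to 55 processes (11 models × 5 capability tiers); a full run takes about 15 minutes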
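
The 55-process figure is simply the product of models and capability tiers. A minimal sketch of such a fan-out, assuming one worker per (model, tier) pair, might look like this; the model and tier identifiers, evaluate_layer, and run_all are placeholders, not the project's actual code.

```python
from concurrent.futures import ProcessPoolExecutor
from itertools import product

# Placeholder identifiers; the real model and tier names are not listed in the changelog.
MODELS = [f"model-{i:02d}" for i in range(1, 12)]  # 11 models
LAYERS = [f"layer-{i}" for i in range(1, 6)]       # 5 capability tiers


def build_jobs():
    """One job per (model, tier) pair: 11 x 5 = 55 jobs."""
    return list(product(MODELS, LAYERS))


def evaluate_layer(job):
    """Stand-in for running one tier of questions against one model."""
    model, layer = job
    return {"model": model, "layer": layer, "status": "done"}


def run_all():
    """Run every job in its own worker process (assumed one process per job)."""
    jobs = build_jobs()
    with ProcessPoolExecutor(max_workers=len(jobs)) as pool:
        return list(pool.map(evaluate_layer, jobs))
```

When calling run_all() from a script, guard the call with `if __name__ == "__main__":` so worker processes can import the module safely.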

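Social sentiment monitoring (new feature)

• Daily automated monitoring of user feedback on X/Twitter for the 11 models
• Sentiment anomalies automatically trigger targeted retests, which are cross-validated against evaluation data
• Daily automated monitoring of official AI vendor account activity

Data page rebuild

• Raw data page rebuilt in a summary + pagination layout, reducing page size from 29MB to 64KB
• Question text and expected answers are no longer published, to prevent question-bank contamination

SEO and terminology unification

• Legacy dimension names across the site (Programming / Knowledge Work / Long Text) replaced with v6 terminology
• Cleaned up SEO-polluting URLs such as parameter pages and legacy routes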
2026-03-22 14:26 SGT Full Evaluation Completed
11 models · Started: 2026-03-22 14:05 SGT · Completed: 2026-03-22 14:26 SGT · 20m16s · Run #37 · Formula v5 · Judge v6 · Benchmark v5.1
2026-03-22 14:05 SGT Smoke Evaluation Completed
2 models · Started: 2026-03-22 14:05 SGT · Completed: 2026-03-22 14:05 SGT · 10s · Run #36 · Formula v5 · Judge v6 · Benchmark v5.1
2026-03-22 11:38 SGT Full Evaluation Completed · migration
11 models · Started: 2026-03-22 10:44 SGT · Completed: 2026-03-22 11:38 SGT · 53m30s · Run #35 · Formula v5 · Judge v6 · Benchmark v5.1
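2026-03-21 14:09 SGT Full Evaluation Completed
11 models · Started: 2026-03-21 13:35 SGT · Completed: 2026-03-21 14:09 SGT · 33m30s · Run #33 · Formula v3 · Judge v5 · Benchmark v4
Judge v5: Introduced strict grading tiers (strict/non-strict): added 4 strict grader types (exact_rank, exact_boolean_set, exact_numeric_set, exact_json_value); strict questions score only 0 or 100, with no partial credit. Ranking questions, True/False questions, single-value numeric questions, and similar are marked strict=true
Benchmark v4: Question bank expanded from 89 to 100 questions (Programming 33 + Knowledge 45 + Long Context 22), adding 11 high-quality decision questions covering contradictory-information detection, honesty under insufficient information, priority ranking, conflict-of-interest detection, code review traps, and ethical boundaries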
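
The four strict grader types named in the Judge v5 note can be sketched roughly as follows. Only the grader names and the 0-or-100 rule come from the changelog; the input shapes, the comparison semantics, and the tolerance parameter are assumptions made for illustration.

```python
import json
import math


def exact_rank(answer, expected):
    """Full marks only if the ranking matches item for item, in order."""
    return 100 if list(answer) == list(expected) else 0


def exact_boolean_set(answer, expected):
    """Full marks only if every True/False judgment matches (assumed dict input)."""
    return 100 if dict(answer) == dict(expected) else 0


def exact_numeric_set(answer, expected, tol=1e-9):
    """Full marks only if both collections hold the same numbers (order ignored)."""
    a, e = sorted(answer), sorted(expected)
    same = len(a) == len(e) and all(math.isclose(x, y, abs_tol=tol) for x, y in zip(a, e))
    return 100 if same else 0


def exact_json_value(answer_text, expected):
    """Full marks only if the response parses as JSON and equals the expected value."""
    try:
        return 100 if json.loads(answer_text) == expected else 0
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return 0


# Dispatch table keyed by grader name; strict graders never return partial credit.
STRICT_GRADERS = {
    "exact_rank": exact_rank,
    "exact_boolean_set": exact_boolean_set,
    "exact_numeric_set": exact_numeric_set,
    "exact_json_value": exact_json_value,
}
```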
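2026-03-21 13:29 SGT Full Evaluation Completed
11 models · Started: 2026-03-21 10:09 SGT · Completed: 2026-03-21 13:29 SGT · 3h20m · Run #31 · Formula v3 · Judge v5 · Benchmark v4
Judge v5: Introduced strict grading tiers (strict/non-strict): added 4 strict grader types (exact_rank, exact_boolean_set, exact_numeric_set, exact_json_value); strict questions score only 0 or 100, with no partial credit. Ranking questions, True/False questions, single-value numeric questions, and similar are marked strict=true
Benchmark v4: Question bank expanded from 89 to 100 questions (Programming 33 + Knowledge 45 + Long Context 22), adding 11 high-quality decision questions covering contradictory-information detection, honesty under insufficient information, priority ranking, conflict-of-interest detection, code review traps, and ethical boundaries
2026-03-21 12:11 SGT Smoke Evaluation Completed
11 models · Started: 2026-03-21 12:08 SGT · Completed: 2026-03-21 12:11 SGT · 3m0s · Run #32 · Formula v3 · Judge v5 · Benchmark v4
Judge v5: Introduced strict grading tiers (strict/non-strict): added 4 strict grader types (exact_rank, exact_boolean_set, exact_numeric_set, exact_json_value); strict questions score only 0 or 100, with no partial credit. Ranking questions, True/False questions, single-value numeric questions, and similar are marked strict=true
Benchmark v4: Question bank expanded from 89 to 100 questions (Programming 33 + Knowledge 45 + Long Context 22), adding 11 high-quality decision questions covering contradictory-information detection, honesty under insufficient information, priority ranking, conflict-of-interest detection, code review traps, and ethical boundaries
2026-03-21 09:55 SGT Full Evaluation Completed
4 models · Started: 2026-03-21 08:05 SGT · Completed: 2026-03-21 09:55 SGT · 1h50m · Run #30 · Formula v3 · Judge v5 · Benchmark v4
Judge v5: Introduced strict grading tiers (strict/non-strict): added 4 strict grader types (exact_rank, exact_boolean_set, exact_numeric_set, exact_json_value); strict questions score only 0 or 100, with no partial credit. Ranking questions, True/False questions, single-value numeric questions, and similar are marked strict=true
Benchmark v4: Question bank expanded from 89 to 100 questions (Programming 33 + Knowledge 45 + Long Context 22), adding 11 high-quality decision questions covering contradictory-information detection, honesty under insufficient information, priority ranking, conflict-of-interest detection, code review traps, and ethical boundaries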
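2026-03-21 07:53 SGT Full Evaluation Completed
9 models · Started: 2026-03-21 04:57 SGT · Completed: 2026-03-21 07:53 SGT · 2h56m · Run #29 · Formula v3 · Judge v5 · Benchmark v4
Judge v5: Introduced strict grading tiers (strict/non-strict): added 4 strict grader types (exact_rank, exact_boolean_set, exact_numeric_set, exact_json_value); strict questions score only 0 or 100, with no partial credit. Ranking questions, True/False questions, single-value numeric questions, and similar are marked strict=true
Benchmark v4: Question bank expanded from 89 to 100 questions (Programming 33 + Knowledge 45 + Long Context 22), adding 11 high-quality decision questions covering contradictory-information detection, honesty under insufficient information, priority ranking, conflict-of-interest detection, code review traps, and ethical boundaries
2026-03-21 04:24 SGT Full Evaluation Completed
9 models · Started: 2026-03-21 01:30 SGT · Completed: 2026-03-21 04:24 SGT · 2h53m · Run #27 · Formula v3 · Judge v5 · Benchmark v4
Judge v5: Introduced strict grading tiers (strict/non-strict): added 4 strict grader types (exact_rank, exact_boolean_set, exact_numeric_set, exact_json_value); strict questions score only 0 or 100, with no partial credit. Ranking questions, True/False questions, single-value numeric questions, and similar are marked strict=true
Benchmark v4: Question bank expanded from 89 to 100 questions (Programming 33 + Knowledge 45 + Long Context 22), adding 11 high-quality decision questions covering contradictory-information detection, honesty under insufficient information, priority ranking, conflict-of-interest detection, code review traps, and ethical boundaries
2026-03-21 01:21 SGT Smoke Evaluation Completed
11 models · Started: 2026-03-21 01:21 SGT · Completed: 2026-03-21 01:21 SGT · 10s · Run #26 · Formula v3 · Judge v5 · Benchmark v4
Judge v5: Introduced strict grading tiers (strict/non-strict): added 4 strict grader types (exact_rank, exact_boolean_set, exact_numeric_set, exact_json_value); strict questions score only 0 or 100, with no partial credit. Ranking questions, True/False questions, single-value numeric questions, and similar are marked strict=true
Benchmark v4: Question bank expanded from 89 to 100 questions (Programming 33 + Knowledge 45 + Long Context 22), adding 11 high-quality decision questions covering contradictory-information detection, honesty under insufficient information, priority ranking, conflict-of-interest detection, code review traps, and ethical boundaries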
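2026-03-21 01:19 SGT Benchmark Change
Question bank v4: added 11 high-quality decision questions
Added 11 high-quality decision questions covering contradictory-information detection (2), honesty under insufficient information (2), priority ranking (2), conflict-of-interest detection (2), code review traps (2), and ethical boundaries (1). The question bank grows from 89 to 100 questions, and the benchmark version is upgraded to v4.
2026-03-21 01:05 SGT Model Change
Added 3 evaluated models: Grok 3, Doubao Pro (豆包 Pro), ERNIE Bot 4.0 (文心一言 4.0)
Added 3 evaluated models: Grok 3 (xAI), Doubao Pro (ByteDance), and ERNIE Bot 4.0 (Baidu). The number of evaluated models increases from 8 to 11.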
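2026-03-21 01:05 SGT Smoke Evaluation Completed
11 models · Started: 2026-03-21 01:05 SGT · Completed: 2026-03-21 01:05 SGT · 10s · Run #25 · Formula v3 · Judge v5 · Benchmark v3
Judge v5: Introduced strict grading tiers (strict/non-strict): added 4 strict grader types (exact_rank, exact_boolean_set, exact_numeric_set, exact_json_value); strict questions score only 0 or 100, with no partial credit. Ranking questions, True/False questions, single-value numeric questions, and similar are marked strict=true
Benchmark v3: Question bank expanded from 80 to 89 questions (Programming 33 + Knowledge 34 + Long Context 22); Knowledge Work adds an engineering-judgment group (9 questions) covering technology selection, architecture trade-offs, troubleshooting, and other practical scenarios
2026-03-21 00:59 SGT Smoke Evaluation Completed
10 models · Started: 2026-03-21 00:59 SGT · Completed: 2026-03-21 00:59 SGT · 9s · Run #24 · Formula v3 · Judge v5 · Benchmark v3
Judge v5: Introduced strict grading tiers (strict/non-strict): added 4 strict grader types (exact_rank, exact_boolean_set, exact_numeric_set, exact_json_value); strict questions score only 0 or 100, with no partial credit. Ranking questions, True/False questions, single-value numeric questions, and similar are marked strict=true
Benchmark v3: Question bank expanded from 80 to 89 questions (Programming 33 + Knowledge 34 + Long Context 22); Knowledge Work adds an engineering-judgment group (9 questions) covering technology selection, architecture trade-offs, troubleshooting, and other practical scenarios
2026-03-20 12:55 SGT Smoke Evaluation Completed
8 models · Started: 2026-03-20 12:44 SGT · Completed: 2026-03-20 12:55 SGT · 10m39s · Run #23 · Formula v3 · Judge v5 · Benchmark v3
Judge v5: Introduced strict grading tiers (strict/non-strict): added 4 strict grader types (exact_rank, exact_boolean_set, exact_numeric_set, exact_json_value); strict questions score only 0 or 100, with no partial credit. Ranking questions, True/False questions, single-value numeric questions, and similar are marked strict=true
Benchmark v3: Question bank expanded from 80 to 89 questions (Programming 33 + Knowledge 34 + Long Context 22); Knowledge Work adds an engineering-judgment group (9 questions) covering technology selection, architecture trade-offs, troubleshooting, and other practical scenarios
2026-03-20 03:10 SGT Smoke Evaluation Completed
8 models · Started: 2026-03-20 03:00 SGT · Completed: 2026-03-20 03:10 SGT · 10m50s · Run #22 · Formula v3 · Judge v5 · Benchmark v3
Judge v5: Introduced strict grading tiers (strict/non-strict): added 4 strict grader types (exact_rank, exact_boolean_set, exact_numeric_set, exact_json_value); strict questions score only 0 or 100, with no partial credit. Ranking questions, True/False questions, single-value numeric questions, and similar are marked strict=true
Benchmark v3: Question bank expanded from 80 to 89 questions (Programming 33 + Knowledge 34 + Long Context 22); Knowledge Work adds an engineering-judgment group (9 questions) covering technology selection, architecture trade-offs, troubleshooting, and other practical scenarios
2026-03-19 09:57 SGT Full Evaluation Completed
8 models · Started: 2026-03-19 08:07 SGT · Completed: 2026-03-19 09:57 SGT · 1h49m · Run #20 · Formula v3 · Judge v5 · Benchmark v3
Judge v5: Introduced strict grading tiers (strict/non-strict): added 4 strict grader types (exact_rank, exact_boolean_set, exact_numeric_set, exact_json_value); strict questions score only 0 or 100, with no partial credit. Ranking questions, True/False questions, single-value numeric questions, and similar are marked strict=true
Benchmark v3: Question bank expanded from 80 to 89 questions (Programming 33 + Knowledge 34 + Long Context 22); Knowledge Work adds an engineering-judgment group (9 questions) covering technology selection, architecture trade-offs, troubleshooting, and other practical scenarios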
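2026-03-19 03:11 SGT Smoke Evaluation Completed
8 models · Started: 2026-03-19 03:00 SGT · Completed: 2026-03-19 03:11 SGT · 11m42s · Run #18 · Formula v3 · Judge v5 · Benchmark v2
Judge v5: Introduced strict grading tiers (strict/non-strict): added 4 strict grader types (exact_rank, exact_boolean_set, exact_numeric_set, exact_json_value); strict questions score only 0 or 100, with no partial credit. Ranking questions, True/False questions, single-value numeric questions, and similar are marked strict=true
Benchmark v2: Question bank expanded from 30 to 80 questions (Programming 33 + Knowledge 25 + Long Context 22); Programming adds dynamic programming and concurrency analysis, and Knowledge Work adds multi-step reasoning questions such as compound-interest calculation and time-zone reasoning
2026-03-18 03:11 SGT Smoke Evaluation Completed
8 models · Started: 2026-03-18 03:00 SGT · Completed: 2026-03-18 03:11 SGT · 11m18s · Run #17 · Formula v3 · Judge v5 · Benchmark v2
Judge v5: Introduced strict grading tiers (strict/non-strict): added 4 strict grader types (exact_rank, exact_boolean_set, exact_numeric_set, exact_json_value); strict questions score only 0 or 100, with no partial credit. Ranking questions, True/False questions, single-value numeric questions, and similar are marked strict=true
Benchmark v2: Question bank expanded from 30 to 80 questions (Programming 33 + Knowledge 25 + Long Context 22); Programming adds dynamic programming and concurrency analysis, and Knowledge Work adds multi-step reasoning questions such as compound-interest calculation and time-zone reasoning
2026-03-18 01:19 SGT Full Evaluation Completed
8 models · Started: 2026-03-17 23:24 SGT · Completed: 2026-03-18 01:19 SGT · 1h55m · Run #16 · Formula v3 · Judge v5 · Benchmark v2
Judge v5: Introduced strict grading tiers (strict/non-strict): added 4 strict grader types (exact_rank, exact_boolean_set, exact_numeric_set, exact_json_value); strict questions score only 0 or 100, with no partial credit. Ranking questions, True/False questions, single-value numeric questions, and similar are marked strict=true
Benchmark v2: Question bank expanded from 30 to 80 questions (Programming 33 + Knowledge 25 + Long Context 22); Programming adds dynamic programming and concurrency analysis, and Knowledge Work adds multi-step reasoning questions such as compound-interest calculation and time-zone reasoning
2026-03-17 11:23 SGT Full Evaluation Completed
8 models · Started: 2026-03-17 09:43 SGT · Completed: 2026-03-17 11:23 SGT · 1h40m · Run #15 · Formula v3 · Judge v4 · Benchmark v2
Judge v4: Minor scoring-rule adjustments, adding grading logic for the questions introduced in question bank v2
Benchmark v2: Question bank expanded from 30 to 80 questions (Programming 33 + Knowledge 25 + Long Context 22); Programming adds dynamic programming and concurrency analysis, and Knowledge Work adds multi-step reasoning questions such as compound-interest calculation and time-zone reasoning
This was a version-migration run that simultaneously upgraded the formula version (v2→v3), the judge version (v3→v4), and the benchmark version (v1→v2, 30→80 questions). Subsequent regular weekly evaluations will run under the same versions.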
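2026-03-17 09:27 SGT Full Evaluation Completed
8 models · Started: 2026-03-17 07:51 SGT · Completed: 2026-03-17 09:27 SGT · 1h35m · Run #14 · Formula v2 · Judge v3 · Benchmark v1
Judge v3: Tightened scoring: JSON validation now checks whether nested fields are correct, partial matching changed from "one hit earns a high score" to proportional scoring, and some questions gained multiple acceptable correct answers
Benchmark v1: Initial question bank of 30 questions covering three dimensions: Programming, Knowledge Work, and Long Context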
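
The Judge v3 change to proportional partial matching, together with multiple acceptable answers, can be illustrated with a short sketch. The keyword-based matching, the 0 to 100 scale, and both function names are assumptions; the changelog only states that scoring became proportional.

```python
from typing import Sequence


def partial_hit_score(response: str, keywords: Sequence[str]) -> float:
    """Score in proportion to how many expected keywords appear in the response."""
    if not keywords:
        return 0.0
    text = response.lower()
    hits = sum(1 for keyword in keywords if keyword.lower() in text)
    return 100.0 * hits / len(keywords)


def best_over_alternatives(response: str, alternatives: Sequence[Sequence[str]]) -> float:
    """With several acceptable answers, keep the best-scoring alternative."""
    return max((partial_hit_score(response, alt) for alt in alternatives), default=0.0)


if __name__ == "__main__":
    # Two of four expected keywords are present, so the score is half marks.
    print(partial_hit_score("uses a mutex and a queue", ["mutex", "queue", "semaphore", "lock"]))
```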
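2026-03-17 03:13 SGT Full Evaluation Completed
8 models · Started: 2026-03-17 02:32 SGT · Completed: 2026-03-17 03:13 SGT · 40m31s · Run #11 · Formula v2 · Judge v2 · Benchmark v1
Judge v2: Introduced six grading methods (match all, partial match, exact match, regex, ordered match, JSON structure validation), establishing a more formal scoring system
Benchmark v1: Initial question bank of 30 questions covering three dimensions: Programming, Knowledge Work, and Long Context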
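
For readers unfamiliar with these grading styles, here is a hedged sketch of what the six Judge v2 methods might look like. All function names, signatures, and the 0/100 return values are hypothetical; in particular, partial_hit models the early "one hit earns a high score" behavior as full marks on any hit, which is an assumption.

```python
import json
import re


def all_hit(response, keywords):
    """Every keyword must appear in the response."""
    return 100 if all(k in response for k in keywords) else 0


def partial_hit(response, keywords):
    """Any single keyword is enough (assumed pre-v3 behavior)."""
    return 100 if any(k in response for k in keywords) else 0


def exact_match(response, expected):
    """Response must equal the expected text, ignoring surrounding whitespace."""
    return 100 if response.strip() == expected.strip() else 0


def regex_match(response, pattern):
    """Response must contain a match for the regular expression."""
    return 100 if re.search(pattern, response) else 0


def ordered_match(response, items):
    """Items must all appear, each one after the previous."""
    pos = 0
    for item in items:
        idx = response.find(item, pos)
        if idx == -1:
            return 0
        pos = idx + len(item)
    return 100


def json_structure(response, required_keys):
    """Response must be a JSON object containing the required top-level keys."""
    try:
        data = json.loads(response)
    except ValueError:
        return 0
    ok = isinstance(data, dict) and all(k in data for k in required_keys)
    return 100 if ok else 0
```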
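2026-03-17 03:10 SGT Smoke Evaluation Completed
8 models · Started: 2026-03-17 03:00 SGT · Completed: 2026-03-17 03:10 SGT · 10m54s · Run #12 · Formula v2 · Judge v2 · Benchmark v1
Judge v2: Introduced six grading methods (match all, partial match, exact match, regex, ordered match, JSON structure validation), establishing a more formal scoring system
Benchmark v1: Initial question bank of 30 questions covering three dimensions: Programming, Knowledge Work, and Long Context
2026-03-17 02:12 SGT Full Evaluation Completed
8 models · Started: 2026-03-17 01:33 SGT · Completed: 2026-03-17 02:12 SGT · 39m0s · Run #10 · Formula v2 · Judge v2 · Benchmark v1
Judge v2: Introduced six grading methods (match all, partial match, exact match, regex, ordered match, JSON structure validation), establishing a more formal scoring system
Benchmark v1: Initial question bank of 30 questions covering three dimensions: Programming, Knowledge Work, and Long Context
2026-03-17 00:45 SGT Full Evaluation Completed
8 models · Started: 2026-03-16 23:58 SGT · Completed: 2026-03-17 00:45 SGT · 47m30s · Run #9 · Formula v2 · Judge v2 · Benchmark v1
Judge v2: Introduced six grading methods (match all, partial match, exact match, regex, ordered match, JSON structure validation), establishing a more formal scoring system
Benchmark v1: Initial question bank of 30 questions covering three dimensions: Programming, Knowledge Work, and Long Context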