A First Look at Agent-Assisted Development in SGLang

Jul 4, 2026 | Source: LMSYS



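SGLang development is no longer a matter of isolated code changes. Today, a single codebase spans LLM serving, the distributed runtime, GPU kernels, diffusion pipelines, model-specific execution paths, and production incident handling. In the past, many workflows depended on individual developers' experience: how to launch a given model, how to read a profile trace, which kind of logging to add first for a CUDA crash, and which benchmarks a performance PR should include. As agent tooling matures, this experience is being turned into executable SKILL.md files, scripts, benchmark contracts, and review loops.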

Around agent-assisted development for SGLang, the community has built up a set of skills for LLM and diffusion workloads:

SGLang .claude/skills: maintained in the main SGLang repository, covering repository-level workflows such as CUDA crash debugging, kernel integration, tests, CI, profiling, production triage, and source-tree conventions.
SGLang diffusion .claude/skills: focused on diffusion-specific workflows, including adding new diffusion models, benchmarking and profiling denoise paths, tuning performance options, and validating quantized pipelines.
BBuf/AI-Infra-Auto-Driven-SKILLS: covers cross-framework serving benchmarks, capacity planning, profile and pipeline analysis, model compute simulation, SGLang human-style review, production incident triage, and SOTA loops for SGLang and other open-source inference frameworks.
kernel-design-agents: the KDA project, which is also the winning entry in the MLSys 2026 FlashInfer Kernel Contest.
BBuf/KDA-Pilot: applies KDA-style agent kernel workflows to SGLang. Its public B200 diffusion summary currently tracks 10 SGLang kernel tasks. Most results come from KDA-Pilot's public benchmark ledger; for residual_gate_add, the figure is the B200 speedup reported after the SGLang integration PR was merged, because the original task baseline has since changed. Work derived from KDA-Pilot has so far landed in 3 SGLang integration PRs.

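These efforts point in the same direction: the real value of an agent is not to replace developers by "writing code on instinct", but to capture the procedural knowledge of the engineering process, including executable steps, reproducible experiments, and reviewable evidence.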

1. TL;DR: What Agents Are Best Suited For in SGLang

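Agents deliver the most value when they can keep making progress along a clearly defined workflow. Benchmarking, profiling, kernel API logging, adding new diffusion pipelines, production incident replay, and SOTA loops can all be encoded as skills.
An SGLang skill is essentially an executable development workflow. In skills such as debug-cuda-crash, sglang-diffusion-benchmark-profile, and llm-torch-profiler-analysis, what matters is not the prompt itself but the preflight checks, hard failure gates, artifact contracts, reproduction commands, and result formats.
Profile evidence is the core of performance optimization. The SGLang profiler skills produce fixed kernel tables, overlap-opportunity tables, and fuse-pattern tables; KDA-Pilot extends this further with same-ABI baseline/candidate comparison, real workloads, correctness gates, NCU evidence, and per-shape results.
Long-running optimization is entering a Loop Engineering phase. The SGLang SOTA Performance Loop breaks "catching up with SOTA" into fair benchmarking, gap decision, profiling, patching, and revalidation. Humanize/RLCR provides external review, while Codex Goal can run the same kind of loop with lower coordination cost.
Review matters more. Agents can run more experiments, but they also generate more changes that merely "look reasonable". The developer's job shifts toward defining the problem, choosing the evidence, designing the workflow, and judging whether results are good enough to enter production paths.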
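
The loop described above (fair benchmarking, gap decision, profiling, patching, revalidation) can be pictured with a minimal Python sketch. Everything here is an illustrative assumption rather than the actual SGLang SOTA Performance Loop: the names `LoopState`, `relative_gap`, and `run_loop`, the 2% tolerance, and the iteration budget are all hypothetical, and the benchmark and patch steps are passed in as plain callables.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class LoopState:
    """Hypothetical bookkeeping for one optimization loop."""
    target_tput: float              # throughput of the reference stack
    tolerance: float = 0.02         # assumed: "closed" once within 2% of target
    history: List[float] = field(default_factory=list)


def relative_gap(state: LoopState, measured: float) -> float:
    """Fraction by which the measurement falls short of the target."""
    return (state.target_tput - measured) / state.target_tput


def run_loop(
    state: LoopState,
    benchmark: Callable[[], float],
    profile_and_patch: Callable[[float], None],
    max_iters: int = 5,
) -> str:
    """Benchmark, decide on the gap, patch, then revalidate on the next pass."""
    for _ in range(max_iters):
        measured = benchmark()               # same fixed workload every time
        state.history.append(measured)
        gap = relative_gap(state, measured)
        if gap <= state.tolerance:           # gap decision
            return "closed"
        profile_and_patch(gap)               # profiling + patching step
    return "budget_exhausted"                # hand back to a human reviewer
```

The design point is that the loop always ends in an explicit state (`"closed"` or `"budget_exhausted"`) with a recorded history, so a reviewer can see what was measured on every pass.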

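2. Why SGLang Is a Good Fit for Agent-Assisted Development

SGLang is a high-performance serving framework for LLMs and multimodal models. As model families and hardware paths keep expanding, several kinds of problems recur in development.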

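The LLM path is complex

A single performance problem can span the Python runtime, the scheduler, CUDA graphs, Triton/CUDA kernels, FlashInfer/FlashAttention, distributed collectives, and model-specific wrappers. Local observation alone rarely reveals which layer the bottleneck sits in.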

The diffusion path is just as complex

A slow denoise pass can involve pipeline/stage partitioning, DiT blocks, attention backends, torch.compile graph breaks, CFG/SP parallelism, the VAE, or custom fused kernels. Interactions between these modules make the problem harder to track down.

Validation is expensive

Many changes must be validated on real hardware such as H100, H200, B200, or RTX 5090, with real models and real workloads. Local unit tests alone are usually not enough.

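Profiles are hard to reuse by hand

A single trace can contain hundreds of kernel launches. Reading Perfetto by hand makes it easy to lose the mapping from kernels to Python source and to confuse prefill with decode. Over long periods of analysis, developers build up experience, such as which kernel names correspond to which model logic, which launch patterns hint at graph breaks, and which NCCL, attention, and MLP layouts are normal. If this knowledge lives only in individual heads, it cannot be reused for the next task.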
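
To make the idea of a reusable kernel table concrete, here is a minimal sketch that aggregates kernel events by name. It assumes the events follow the Chrome trace event format (complete events with `"ph": "X"` and a `"dur"` field) and that GPU kernels carry a category of `"kernel"`; category naming can differ between traces, so check your own export. The function name `kernel_table` is hypothetical and is not the output format of any SGLang skill.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def kernel_table(events: List[Dict], top: int = 3) -> List[Tuple[str, int, float, float]]:
    """Return (name, launches, total_dur, share) rows, sorted by total duration."""
    totals = defaultdict(lambda: [0, 0.0])      # name -> [launch count, summed dur]
    for ev in events:
        # Keep only complete events tagged as kernels (assumed category name).
        if ev.get("ph") != "X" or str(ev.get("cat", "")).lower() != "kernel":
            continue
        row = totals[ev["name"]]
        row[0] += 1
        row[1] += float(ev.get("dur", 0.0))
    grand_total = sum(dur for _, dur in totals.values()) or 1.0
    rows = [(name, n, dur, dur / grand_total) for name, (n, dur) in totals.items()]
    rows.sort(key=lambda r: r[2], reverse=True)
    return rows[:top]


# Synthetic events for illustration only; kernel names are made up.
sample_events = [
    {"ph": "X", "cat": "kernel", "name": "gemm_kernel", "dur": 30.0},
    {"ph": "X", "cat": "kernel", "name": "attn_kernel", "dur": 60.0},
    {"ph": "X", "cat": "kernel", "name": "gemm_kernel", "dur": 20.0},
    {"ph": "X", "cat": "cpu_op", "name": "aten::mm", "dur": 99.0},
]

for name, launches, total, share in kernel_table(sample_events):
    print(f"{name:12s} launches={launches} total={total:.1f} share={share:.1%}")
```

Collapsing hundreds of launches into a few rows like this is what makes a trace comparable across runs; mapping rows back to Python source would need additional information that this sketch does not model.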

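Performance conclusions depend heavily on context

GPU type, shape, batch size, parallelism, precision, backend, and compile state all change the result. An isolated microbenchmark often cannot prove a real model-level gain, so an end-to-end, long-running test process is needed to repeatedly verify throughput, latency, memory, accuracy, and stability under a fixed workload. This costs both labor and time.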

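These problems are a natural fit for agents. Launching a service, pinning a workload, collecting traces, analyzing profile rows, adding tests, and recording experiment results all have well-defined inputs and outputs, which makes them suitable for scripting and repeated execution. What developers need to do is draw the boundaries: a unified benchmark setup, unified profile interpretation rules, unified accuracy gates, and rules for when the agent must stop modifying code.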
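
One way to picture "rules for when the agent must stop" is as an explicit, checkable object rather than prose in a prompt. The sketch below is an assumption for illustration: `StopRules`, `violated_rules`, and the three thresholds are hypothetical names and values, not part of any SGLang skill.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class StopRules:
    """Hypothetical boundaries a developer sets before the agent starts."""
    min_accuracy: float          # accuracy gate on a fixed eval set
    max_latency_regress: float   # allowed fractional latency increase vs baseline
    max_attempts: int            # patch budget before handing back to a human


def violated_rules(
    rules: StopRules,
    accuracy: float,
    baseline_ms: float,
    candidate_ms: float,
    attempts: int,
) -> List[str]:
    """Return the names of every rule the current state violates."""
    violations = []
    if accuracy < rules.min_accuracy:
        violations.append("accuracy")
    if candidate_ms > baseline_ms * (1.0 + rules.max_latency_regress):
        violations.append("latency")
    if attempts >= rules.max_attempts:
        violations.append("attempt_budget")
    return violations
```

An empty list means the agent may continue; any entry is a reason to stop editing code and report, which keeps the decision auditable.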

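So the agent here is not a free-wheeling "automatic programmer" but an executor constrained by an engineering workflow. Recurring SGLang development workflows can be captured as skills, letting the agent perform repetitive operations, collect evidence, and maintain state, while developers remain responsible for setting goals, judging evidence, and reviewing whether a change belongs on the real serving path.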

3. From Prompt Engineering to SKILL: Turning Engineering Workflows into Protocols

Within the SGLang framework, a useful skill needs to answer at least five kinds of questions:

Question	What the skill should capture
When to use it	Trigger scenarios, supported models, supported hardware, and the cases that require a hard stop
How to start	Preflight checks, environment variables, repository state, dependency checks, and model configuration
How to validate	Benchmark commands, profile commands, test entry points, and accuracy gates
How to decide	Output tables, failure modes, priorities, risk categories, and fallback conditions
How to deliver	Artifact directories, result schemas, PR descriptions, reproduction commands, and review requirements
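
The five questions in the table can be read as fields of a small contract object. The following sketch is only an assumption about how such a contract might look in Python; the real skills are SKILL.md files plus scripts, and `SkillContract`, `HardStop`, and all field names here are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


class HardStop(RuntimeError):
    """Raised when a preflight check fails and the skill must not proceed."""


@dataclass
class SkillContract:
    name: str
    when_to_use: str                          # trigger scenario, in prose
    preflight: Dict[str, Callable[[], bool]]  # how to start: label -> check
    validate_commands: List[str]              # how to validate (placeholders here)
    required_artifacts: List[str]             # how to deliver

    def run_preflight(self) -> None:
        """Evaluate every check and fail hard if any returns False."""
        failed = [label for label, check in self.preflight.items() if not check()]
        if failed:
            raise HardStop(f"{self.name}: preflight failed: {', '.join(failed)}")

    def missing_artifacts(self, produced: List[str]) -> List[str]:
        """List required artifacts that the run did not produce."""
        return [a for a in self.required_artifacts if a not in produced]
```

Encoding the hard stop as an exception rather than a warning reflects the point of the table: the skill defines when work must not continue, not just what to try.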

SGLang-related skills span several levels: some sit close to source changes, such as debugging, testing, adding a new diffusion model, and benchmark/profile workflows; others target cross-framework benchmarking, capacity planning, compute simulation, production incident triage, PR optimization knowledge, SGLang human-style review, and higher-level workflows such as Humanize/RLCR.

4. The Current Skill Stack: From Crash Debugging to a Closed Performance Loop

The commonly used SGLang agent-related skills fall into the following groups:

Layer	Representative skill / project	Problem addressed
CUDA crash	`debug-cuda-crash`	Records inputs, exceptions, and dumps at the custom op/kernel API boundary, turning intermittent crashes into samples that can be analyzed offline.
LLM benchmark	`llm-serving-auto-benchmark`	Runs a fair, bounded, resumable serving benchmark search across SGLang and other OpenAI-compatible inference stacks.
Capacity planning	`llm-serving-capacity-planner`	Parses the startup logs of SGLang and other inference frameworks and explains weight memory, KV cache budget, CUDA graph overhead, request capacity, and OOM pressure.
Trace triage	`llm-torch-profiler-analysis`	Produces fixed kernel, overlap-opportunity, and fuse-pattern tables and maps kernels back to Python source; the same unified workflow also exists in AI-Infra for cross-framework use.
Pipeline/layer analysis	`llm-pipeline-analysis`	Splits torch profiler traces into forward passes, layers, and kernel flows to locate steady-state passes, bottleneck layer types, and Perfetto time ranges.
Model compute simulation	`model-compute-simulation`	Builds operator-level compute templates for LLMs and estimates tensor shapes, FLOPs, MFU, kernel-to-op mapping, and parallelism what-ifs.
Diffusion benchmark/profile	`sglang-diffusion-benchmark-profile`	Captures denoise latency, perf dumps, and torch profiler traces, while first checking that execution actually used the native SGLang diffusion backend.
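
As a rough illustration of the arithmetic behind the capacity-planning row, the sketch below estimates a KV cache token budget. It uses the standard per-token KV size for MHA/GQA attention (keys plus values, per layer, per KV head) and a deliberately simplified memory model; it is not how `llm-serving-capacity-planner` or SGLang itself computes capacity, and the function names and the example numbers are assumptions. Architectures with compressed KV (for example MLA) need a different formula.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache one token occupies: K and V for every layer and KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_token_capacity(gpu_mem_gib: float, mem_fraction: float,
                      weight_gib: float, overhead_gib: float,
                      bytes_per_token: int) -> int:
    """Tokens that fit after subtracting weights and fixed overhead (simplified)."""
    budget_bytes = (gpu_mem_gib * mem_fraction - weight_gib - overhead_gib) * 1024**3
    return max(0, int(budget_bytes // bytes_per_token))


# Hypothetical model and GPU numbers, chosen only to make the arithmetic easy.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
capacity = kv_token_capacity(gpu_mem_gib=80, mem_fraction=0.875,
                             weight_gib=16, overhead_gib=4,
                             bytes_per_token=per_token)
print(per_token, capacity)  # 131072 bytes per token, 409600 tokens
```

With these made-up inputs, each token costs 128 KiB and 50 GiB remain for the cache, which is where the 409600-token figure comes from. A real planner would take the weight and overhead terms from startup logs rather than from guesses.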

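What these skills have in common is that they do not just tell the agent "what to do"; they spell out "how to start, how to fail, how to record, how to reproduce, and how to deliver". This makes the agent's output closer to reviewable engineering evidence than to a one-off answer.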
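
"How to record" and "how to reproduce" can be illustrated with a small decorator that dumps inputs and the exception when a call at an API boundary fails. This is a pure-Python sketch under stated assumptions: `log_api_boundary` and `scaled_div` are hypothetical, it is not the implementation of `debug-cuda-crash`, and it only captures failures that surface as Python exceptions at the call site. For real tensors one would record shapes, dtypes, and devices rather than `repr` output.

```python
import functools
import json
import os
import tempfile
import time


def log_api_boundary(dump_dir: str):
    """Wrap a function so that a failing call leaves a JSON record on disk."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            try:
                return fn(*args, **kwargs)
            except Exception as exc:
                os.makedirs(dump_dir, exist_ok=True)
                record = {
                    "op": fn.__name__,
                    "args": [repr(a) for a in args],
                    "kwargs": {k: repr(v) for k, v in kwargs.items()},
                    "exception": repr(exc),
                    "unix_time": time.time(),
                }
                path = os.path.join(dump_dir, f"{fn.__name__}_{time.time_ns()}.json")
                with open(path, "w") as f:
                    json.dump(record, f, indent=2)
                raise  # re-raise so the caller still sees the original failure
        return wrapper
    return decorator


# Usage: a stand-in op that fails for one particular input.
dump_dir = tempfile.mkdtemp(prefix="crash_dumps_")


@log_api_boundary(dump_dir)
def scaled_div(x, scale=1):
    return x / scale


scaled_div(6, scale=3)        # succeeds, nothing is written
try:
    scaled_div(6, scale=0)    # fails, one JSON record is written
except ZeroDivisionError:
    pass
print(os.listdir(dump_dir))
```

The record turns a one-off failure into a file that can be replayed or attached to a report, which is the property the paragraph above describes.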

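5. KDA-Pilot: Turning Kernel Optimization into a Verifiable Workflow

KDA-Pilot brings the KDA-style agent kernel workflow to SGLang diffusion. The focus is not just generating kernels but establishing baseline/candidate comparisons, real-workload validation, correctness gates, NCU evidence, and per-shape results. Its public B200 diffusion kernel results already cover 10 SGLang kernel tasks, and the related work has made its way into multiple SGLang integration PRs.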

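This kind of workflow shows that the key value of agents in performance engineering is not "automatically finding a magic optimization" but lowering the cost of repeated experiments, result collation, and evidence maintenance. For tasks like kernel optimization, which depend heavily on hardware, shape, and ABI, workflow constraints matter more than the quality of any single generation.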
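
The shape of a same-ABI baseline/candidate comparison with a correctness gate and per-shape results can be sketched in plain Python. This is an assumption-laden toy, not KDA-Pilot's harness: `compare_per_shape`, `best_time`, and the two `scaled_add_*` functions are hypothetical, the "kernels" are list operations on the CPU, and wall-clock timing stands in for proper GPU timing and NCU evidence.

```python
import math
import time
from typing import Callable, Dict, List, Sequence


def best_time(fn: Callable, inputs: Sequence, repeats: int = 5) -> float:
    """Smallest wall-clock time over several runs, floored to avoid dividing by zero."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*inputs)
        best = min(best, time.perf_counter() - start)
    return max(best, 1e-12)


def compare_per_shape(baseline: Callable, candidate: Callable,
                      shapes: List[int], make_inputs: Callable,
                      rel_tol: float = 1e-6) -> List[Dict]:
    """Run both implementations with identical arguments for every shape."""
    results = []
    for shape in shapes:
        inputs = make_inputs(shape)
        expected, actual = baseline(*inputs), candidate(*inputs)
        correct = len(expected) == len(actual) and all(
            math.isclose(a, e, rel_tol=rel_tol, abs_tol=1e-9)
            for a, e in zip(actual, expected)
        )
        # Correctness gate: no speedup is reported for a wrong candidate.
        speedup = best_time(baseline, inputs) / best_time(candidate, inputs) if correct else None
        results.append({"shape": shape, "correct": correct, "speedup": speedup})
    return results


def scaled_add_baseline(x, y, alpha):
    out = []
    for a, b in zip(x, y):
        out.append(a + alpha * b)
    return out


def scaled_add_candidate(x, y, alpha):
    return [a + alpha * b for a, b in zip(x, y)]  # same signature, same result


def make_inputs(n):
    return [float(i) for i in range(n)], [float(2 * i) for i in range(n)], 0.5


for row in compare_per_shape(scaled_add_baseline, scaled_add_candidate,
                             [16, 1024], make_inputs):
    print(row)
```

Reporting one row per shape, and refusing to report a speedup when the gate fails, is what keeps a single favorable shape from being presented as a general win.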

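6. Conclusion: Agents Make Engineering Experience Executable; Developers Still Make the Judgment Calls

The SGLang experience shows that the core of agent-assisted development is not taking developers out of the loop but turning the procedural experience they have built up over time into executable, reproducible, reviewable engineering assets. For a complex system like SGLang, which involves LLM serving, a distributed runtime, GPU kernels, diffusion pipelines, and production incident handling all at once, agents are best suited to repeated execution, evidence collection, and state tracking.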

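At the same time, review becomes more important, not less. Agents can produce patches, benchmarks, and profiles faster, but those results may still suffer from insufficient context, poorly chosen metrics, or underestimated production risk. The developer's role will lean further toward defining problem boundaries, designing skill workflows, choosing trustworthy evidence, setting stop conditions, and deciding whether a given optimization truly belongs on the production path.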

This article is from the LMSYS blog, translated in full by Winzheng (winzheng.com). When republishing the translation, please credit the source.