LiveClawBench: Benchmarking LLM Agents on Complex, Real-World Assistant Tasks

A high-fidelity "standardized exam" for LLM agents: 134 cases across 10 domains and 22 reusable mock services, with 6 complexity factors that make task-distribution fidelity inspectable, auditable, and extensible.

LiveClawBench Overview

LiveClawBench overview. The benchmark first characterizes tasks with the Triple-Axis Complexity Framework, which covers Environment Complexity, Cognitive Demand, and Runtime Adaptability. Each instruction is executed in an image-pinned Docker environment, where an agent interacts with full-stack mock services, bash tools, and a file system through a cross-server environment framework. The coupled environment supports stateful workflow execution, while database verification and rubric judging jointly determine the final score. A task example is shown in the figure below.

Task Demonstration

Leaderboard Preview

Top models on LiveClawBench.

Rank  Model              Overall  Easy  Medium  Hard
1     gpt-5.5            73.8     95.2  70.1    23.7
2     qwen3.6-plus       70.6     96.8  63.8    15.3
3     qwen3.6-flash      70.2     95.1  64.5    15.6
4     deepseek-v4-pro    65.6     87.6  61.4    15.0
5     deepseek-v4-flash  60.5     87.8  49.9    13.2

Data Distribution

Task Data Distribution

134 Tasks

Difficulty Distribution

Difficulty  Tasks  Share
Easy        53     39.6%
Medium      58     43.3%
Hard        23     17.2%

Domain Distribution

Domain                    Tasks  Share
Documents & Knowledge     12     9.0%
Communication & Email     10     7.5%
E-commerce & Daily Svcs   22     16.4%
Calendar & Task Mgmt      10     7.5%
Coding & Software Dev     10     7.5%
DevOps & Env Repair       18     13.4%
Deep Research & Report    17     12.7%
Health & Fitness          11     8.2%
Social Media              11     8.2%
Finance & Data Analytics  13     9.7%

Complexity Factor Distribution

Tasks are tagged with complexity factors that make them challenging for agents. A single task may match multiple factors.

Factor                                        Tasks  Share
A1: Cross-Service Dependency                  45     33.6%
A2: Contaminated Initial State                38     28.4%
B1: Implicit Goal Resolution                  43     32.1%
B2: Knowledge System Maintenance              17     12.7%
C1: Environmental State Invalidation          7      5.2%
C2: Outcome Verification under Altered State  6      4.5%

Multi-Factor Overlap Distribution

How many factors each task has enabled simultaneously.

Factors per task  Tasks  Share
0                 41     30.6%
1                 44     32.8%
2                 35     26.1%
3                 14     10.4%
4+                0      0.0%
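
The factor distribution and the overlap distribution can be cross-checked against each other: a task carrying k factors contributes k tags, so the per-factor counts must sum to the overlap-weighted total. A minimal sketch using only the counts reported above (the variable names are illustrative, not part of the benchmark):

```python
# Counts copied from the Complexity Factor Distribution above.
factor_counts = {"A1": 45, "A2": 38, "B1": 43, "B2": 17, "C1": 7, "C2": 6}

# Tasks grouped by how many factors they carry (the "4+" bucket is empty).
overlap_counts = {0: 41, 1: 44, 2: 35, 3: 14}

total_tasks = sum(overlap_counts.values())

# Total number of factor tags, computed two independent ways.
tags_from_factors = sum(factor_counts.values())
tags_from_overlap = sum(k * n for k, n in overlap_counts.items())

mean_factors_per_task = tags_from_factors / total_tasks

print(total_tasks, tags_from_factors, tags_from_overlap)
print(round(mean_factors_per_task, 2))
```

Both totals come to 156 tags over 134 tasks, roughly 1.16 factors per task, which is why the per-factor shares add up to more than 100%.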

Analysis

Factor Delta

Per-model performance impact of each complexity factor. Error bars indicate 95% CI.
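
The page does not state how the factor delta or its confidence interval is computed. One plausible reading, sketched here purely as an assumption, is the difference in mean reward between tasks tagged with a factor and tasks without it, with a percentile bootstrap for the 95% CI. The function names and the reward values are hypothetical.

```python
import random
from statistics import mean


def factor_delta(with_factor, without_factor):
    """Mean reward on tasks tagged with the factor minus mean reward on the rest."""
    return mean(with_factor) - mean(without_factor)


def bootstrap_ci(with_factor, without_factor, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the delta (an assumed method, not the authors')."""
    rng = random.Random(seed)
    deltas = sorted(
        factor_delta(
            rng.choices(with_factor, k=len(with_factor)),
            rng.choices(without_factor, k=len(without_factor)),
        )
        for _ in range(n_resamples)
    )
    lower = deltas[round(alpha / 2 * (n_resamples - 1))]
    upper = deltas[round((1 - alpha / 2) * (n_resamples - 1))]
    return lower, upper


# Hypothetical per-task rewards in [0, 1] for one model; not benchmark data.
rewards_with = [0.2, 0.5, 0.4, 0.0, 0.6]
rewards_without = [0.9, 0.8, 1.0, 0.7, 0.6, 0.9]

delta = factor_delta(rewards_with, rewards_without)
ci_low, ci_high = bootstrap_ci(rewards_with, rewards_without)
print(f"delta={delta:.3f}, 95% CI=[{ci_low:.3f}, {ci_high:.3f}]")
```

A negative delta whose interval excludes zero would indicate that the factor measurably lowers that model's reward; with small per-factor task counts such as C1 (7 tasks) and C2 (6 tasks), wide intervals are to be expected.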

Factor Behavior Frontier

Average behaviour-metric delta per complexity factor, derived from frontier models (gpt-5.5, deepseek-v4-pro, deepseek-v4-flash). Columns are grouped by what the metric measures (Effort, Looping, Diversity, Errors, State awareness, Termination); cell text is the raw metric delta, and cell colour encodes its z-score-standardized value.
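
The caption does not say over which set of cells the z-score is taken. A minimal sketch, assuming each metric column is standardized across the six factors so that colours are comparable between metrics with different units; the function name and the delta values are hypothetical.

```python
from statistics import mean, pstdev


def zscore_column(deltas):
    """Standardize one metric column: subtract the mean, divide by the population std."""
    values = list(deltas.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no contrast, so every cell gets a neutral colour.
        return {factor: 0.0 for factor in deltas}
    return {factor: (value - mu) / sigma for factor, value in deltas.items()}


# Hypothetical raw deltas for a single metric; not benchmark data.
raw_delta = {"A1": 3.1, "A2": 1.4, "B1": 2.2, "B2": 0.5, "C1": 4.0, "C2": 3.6}
colour_value = zscore_column(raw_delta)

for factor, raw in raw_delta.items():
    print(f"{factor}: text={raw:+.1f}, colour z={colour_value[factor]:+.2f}")
```

Under this assumption the cell text keeps the metric's original units, while the colour only shows how unusual a factor is relative to the other factors for that metric.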

Size Scaling

The cost-quality scatter of mean reward against mean agent steps per trial.

Evaluation

Submitting to the LiveClawBench leaderboard

Leaderboard logs are stored in this Hugging Face repo. To submit your results, open a PR there, following the instructions in its README.

Citation

@article{long2026liveclawbench,
  title={LiveClawBench: Benchmarking LLM agents on complex, real-world assistant tasks},
  author={Long, Xiang and Du, Li and Xu, Yilong and Liu, Fangcheng and Wang, Haoqing and Ding, Ning and Li, Ziheng and Guo, Jianyuan and Tang, Yehui},
  journal={arXiv preprint arXiv:2604.13072},
  year={2026}
}