User Instruction
Please check if I have received an email from brian.griffin in my company email system (http://localhost:5174/, open it in your browser). He told me to buy a birthday gift for stewie griffin. If you have indeed received his email, please purchase a gift from Mosi Shop (address: http://localhost:1234/, open it in your browser) following his advice in the email. Please choose the correct gift, add it to the cart, and finally make the payment.
Task Description
Read brian.griffin's email, then buy the cheapest smartwatch on Mosi Shop among those with a rating of at least 4.6.
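The selection rule is a filter (right category, rating at least 4.6) followed by taking the minimum price. The sketch below illustrates only that rule: the `Product` record, its field names, and the sample catalog are hypothetical and do not reflect Mosi Shop's real data model or listings.

```python
from dataclasses import dataclass


@dataclass
class Product:
    # Hypothetical product record; Mosi Shop's real schema is not documented here.
    name: str
    category: str
    rating: float
    price: float


def pick_gift(products, category="smartwatch", min_rating=4.6):
    """Return the cheapest product in `category` rated at least `min_rating`, or None."""
    # Keep only items that satisfy both constraints from the task description.
    eligible = [p for p in products if p.category == category and p.rating >= min_rating]
    # Among the eligible items, take the lowest price (first one wins on ties).
    return min(eligible, key=lambda p: p.price, default=None)


# Invented sample catalog for illustration only.
catalog = [
    Product("Watch A", "smartwatch", 4.8, 199.0),
    Product("Watch B", "smartwatch", 4.6, 149.0),
    Product("Watch C", "smartwatch", 4.5, 99.0),
    Product("Earbuds", "audio", 4.9, 59.0),
]
print(pick_gift(catalog).name)  # Watch B
```

Note that the cheapest smartwatch overall ("Watch C") is excluded because its rating falls below the threshold, which is exactly the trap the rule guards against.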
Complexity Factors
| Code | Factor | Applies |
|---|---|---|
| A1 | Cross-Service Dependency | ✓ |
| A2 | Contaminated Initial State | ✗ |
| B1 | Implicit Goal Resolution | ✗ |
| B2 | Knowledge System Maintenance | ✗ |
| C1 | Environmental State Invalidation | ✗ |
| C2 | Outcome Verification under Altered State | ✗ |
Evaluation
Verifier type: verify.py
Partial credit: Yes
Reward range: 0–1
Results for This Task
| Model | Avg Score | Attempts | All Passed |
|---|---|---|---|
| deepseek-v4-flash | 1 | 3 | ✓ |
| deepseek-v4-pro | 1 | 3 | ✓ |
| gpt-5.5 | 1 | 3 | ✓ |
| qwen3.5-flash | 1 | 3 | ✓ |
| qwen3.5-397b-a17b | 1 | 3 | ✓ |
| qwen3.6-27b | 1 | 3 | ✓ |
| qwen3.6-flash | 1 | 3 | ✓ |
| qwen3.6-plus | 1 | 3 | ✓ |
| qwen3.5-27b | 0.667 | 3 | ✗ |
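The Avg Score and All Passed columns can be read as simple aggregates over per-attempt rewards. The sketch below assumes that Avg Score is the mean reward rounded to three decimals and that All Passed means every attempt earned the full reward of 1; the benchmark's actual aggregation code is not shown on this page.

```python
from statistics import mean


def summarize(rewards: list[float]) -> tuple[float, bool]:
    """Assumed aggregation: mean reward (3 decimals) and whether every attempt scored 1."""
    avg = round(mean(rewards), 3)
    all_passed = all(r == 1.0 for r in rewards)
    return avg, all_passed


print(summarize([1.0, 1.0, 1.0]))  # (1.0, True)
print(summarize([1.0, 1.0, 0.0]))  # (0.667, False)
```

Because partial credit is enabled, an average of 0.667 does not by itself show that one attempt failed outright: rewards of [1.0, 0.5, 0.5] produce the same average as [1.0, 1.0, 0.0].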
Public Trajectories
Trajectories from runs of this task are hosted on HuggingFace.
Source Files
task.toml: tasks/email-watch-shop/task.toml
instruction: tasks/email-watch-shop/instruction.md
environment: tasks/email-watch-shop/environment/Dockerfile