email-watch-shop Case #21

Difficulty: Easy
Domain: E-commerce & Daily Svcs
Tags: shop, email

User Instruction

Please check whether I have received an email from brian.griffin in my company email system (http://localhost:5174/, open it in your browser). He told me to buy a birthday gift for Stewie Griffin. If his email has arrived, please purchase a gift from Mosi Shop (address: http://localhost:1234/, open it in your browser), following the advice in his email. Choose the correct gift, add it to the cart, and complete the payment.

Task Description

Read brian.griffin's email and buy the lowest-priced smartwatch from Mosi Shop that has a rating of at least 4.6.
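The selection rule in the task description (filter by rating, then take the cheapest item) can be sketched as follows. This is only an illustration: the `Product` class, the `pick_gift` helper, and the catalog entries are hypothetical and are not taken from Mosi Shop or from the benchmark's verifier.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Product:
    # Hypothetical product record; the real shop's fields may differ.
    name: str
    price: float
    rating: float


def pick_gift(products: List[Product], min_rating: float = 4.6) -> Optional[Product]:
    """Return the cheapest product whose rating is at least min_rating."""
    eligible = [p for p in products if p.rating >= min_rating]
    if not eligible:
        return None  # nothing meets the rating threshold
    return min(eligible, key=lambda p: p.price)


# Made-up catalog, for illustration only.
catalog = [
    Product("Watch A", 199.0, 4.8),
    Product("Watch B", 149.0, 4.5),  # cheapest overall, but rated below 4.6
    Product("Watch C", 179.0, 4.6),
]

choice = pick_gift(catalog)
```

With this made-up catalog, the cheapest item overall is excluded by the rating threshold, so the rule selects the cheapest of the remaining items rather than the cheapest in the shop.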

Complexity Factors

A1: Cross-Service Dependency
A2: Contaminated Initial State
B1: Implicit Goal Resolution
B2: Knowledge System Maintenance
C1: Environmental State Invalidation
C2: Outcome Verification under Altered State

Evaluation

Verifier Type: verify.py
Partial Credit: Yes
Reward Range: 0 – 1

Results for This Task

Model               Avg Score   Attempts   All Passed
deepseek-v4-flash   1           3
deepseek-v4-pro     1           3
gpt-5.5             1           3
qwen3.5-flash       1           3
qwen3.5-397b-a17b   1           3
qwen3.6-27b         1           3
qwen3.6-flash       1           3
qwen3.6-plus        1           3
qwen3.5-27b         0.667       3

Public Trajectories

Run trajectories for this task live on HuggingFace.
