watch-shop Case #18

Difficulty: Easy
Domain: E-commerce & Daily Services (shop)

User Instruction

Please help me purchase a smart watch from the Mosi Shop (address: http://localhost:1234/, open it in your browser). The product must be the cheapest among those with a rating of 4.6 or higher. Please complete the selection process and then add to cart and make payment.

Task Description

EN: Buy the cheapest smartwatch on Mosi Shop among those rated 4.6 or higher.

ZH (translated): I want to buy a smartwatch from Mosi Shop; it must have a rating of at least 4.6 and be the cheapest one available.
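
The selection rule in the task description (filter by rating first, then take the lowest price) can be sketched as follows. This is only an illustration: the `Product` class, the `pick_cheapest_qualifying` helper, and the catalog entries are all hypothetical and do not come from Mosi Shop.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Product:
    # Hypothetical product record; the real shop's fields may differ.
    name: str
    price: float
    rating: float


def pick_cheapest_qualifying(
    products: Iterable[Product], min_rating: float = 4.6
) -> Optional[Product]:
    """Return the lowest-priced product whose rating is >= min_rating."""
    qualifying = [p for p in products if p.rating >= min_rating]
    if not qualifying:
        return None  # nothing meets the rating threshold
    return min(qualifying, key=lambda p: p.price)


# Made-up catalog, used only to exercise the rule.
catalog = [
    Product("Watch A", 199.0, 4.8),
    Product("Watch B", 149.0, 4.5),  # cheapest overall, but rated below 4.6
    Product("Watch C", 179.0, 4.6),
    Product("Watch D", 259.0, 4.9),
]

choice = pick_cheapest_qualifying(catalog)
print(choice.name)  # Watch C
```

The order of the two steps matters: taking the cheapest item first and checking its rating afterwards would select Watch B, which fails the rating requirement.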

Complexity Factors

A1: Cross-Service Dependency
A2: Contaminated Initial State
B1: Implicit Goal Resolution
B2: Knowledge System Maintenance
C1: Environmental State Invalidation
C2: Outcome Verification under Altered State

Evaluation

Verifier Type: verify.py
Partial Credit: Yes
Reward Range: 0 – 1
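
A verifier that awards partial credit within a 0 to 1 range could, for example, score a run as the weighted fraction of checks it passes. The sketch below is an assumption for illustration only: the contents of `verify.py` are not shown here, and the check names and weights are hypothetical.

```python
from typing import Mapping


def partial_credit(checks: Mapping[str, bool], weights: Mapping[str, float]) -> float:
    """Weighted fraction of passed checks, always within [0, 1]."""
    total = sum(weights.values())
    if total <= 0:
        return 0.0
    earned = sum(w for name, w in weights.items() if checks.get(name, False))
    return earned / total


# Hypothetical checks and weights; the real verifier may use different ones.
weights = {"correct_product_ordered": 0.5, "payment_completed": 0.5}

full = partial_credit(
    {"correct_product_ordered": True, "payment_completed": True}, weights
)
half = partial_credit(
    {"correct_product_ordered": True, "payment_completed": False}, weights
)
print(full, half)  # 1.0 0.5
```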

Results for This Task

Model               Avg Score   Attempts   All Passed
deepseek-v4-flash   1           3
gpt-5.5             1           3
qwen3.5-flash       1           3
qwen3.5-397b-a17b   1           3
qwen3.6-27b         1           3
qwen3.6-flash       1           3
qwen3.6-plus        1           3
deepseek-v4-pro     0.833       3
qwen3.5-27b         0.667       3
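
The average score is presumably the mean of the per-attempt rewards, each between 0 and 1. The per-attempt values are not listed on this page, so the inputs below are assumed combinations that merely reproduce the reported averages; other combinations are possible (for instance, 1, 0.5, 0.5 also averages to 0.667).

```python
from statistics import mean
from typing import Sequence


def avg_score(attempt_scores: Sequence[float]) -> float:
    """Mean reward over attempts, rounded to three decimals."""
    if not attempt_scores:
        raise ValueError("at least one attempt is required")
    if any(not 0.0 <= s <= 1.0 for s in attempt_scores):
        raise ValueError("each reward must lie in [0, 1]")
    return round(mean(attempt_scores), 3)


# Assumed per-attempt rewards, chosen only to match the reported averages.
print(avg_score([1.0, 1.0, 0.5]))  # 0.833
print(avg_score([1.0, 1.0, 0.0]))  # 0.667
```

An average of 0.833 over three attempts implies a reward total of 2.5, which is only possible if at least one attempt received a fractional reward, consistent with partial credit being enabled.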

Public Trajectories

Run trajectories for this task are published on HuggingFace.