Task Browser
Browse and filter all benchmark tasks.
134 tasks
| ID | Name | Description | Difficulty | Domain | Factors | Mock Apps |
|---|---|---|---|---|---|---|
| 1 | skill-creation | Identify patterns from interaction history and create a SKILL from scratch | hard | Documents & Knowledge | B2 | |
| 2 | skill-supplementation | Supplement and update an existing SKILL based on interaction history | medium | Documents & Knowledge | B2 | |
| 3 | skill-conflict-resolution | Update existing SKILL based on interaction history: identify conflicting content, identify potentially misleading content in the old SKILL version, and update accordingly | easy | Documents & Knowledge | B2 | |
| 4 | skill-repository-curation | Organize existing SKILL library: merge redundant SKILLs, remove obsolete ones | easy | Documents & Knowledge | B2 | |
| 5 | skill-dependency-fix | After a user modifies a lower-level SKILL, identify the dependency relationships of higher-level SKILLs that call the lower-level SKILL, and update the higher-level SKILLs accordingly | easy | Documents & Knowledge | B2 | |
| 6 | email-writing | Help the user compose and send an email to a specified recipient | easy | Communication & Email | | email |
| 7 | email-reply | Check if an expected email exists; if so, compose and send a reply | easy | Communication & Email | | email |
| 8 | flight-booking | Help the user book a flight for a specified date and route | medium | E-commerce & Daily Svcs | | airline |
| 9 | flight-seat-selection | Help the user check in and select a seat for a specified booked flight; first retrieve the booking details from email, then select a seat on the airline website according to user requirements | medium | E-commerce & Daily Svcs | A1 | airline, email |
| 10 | flight-seat-selection-failed | Help the user check in and select a seat for a specified booked flight; first retrieve the booking details from email, then select a seat on the airline website; when user seat requirements cannot be met, find the best alternative | medium | E-commerce & Daily Svcs | A1, B1 | airline, email |
| 11 | flight-cancel-claim | Check the status of a booked flight; if the flight is cancelled, help the user file a compensation claim according to the airline website announcement | hard | E-commerce & Daily Svcs | A1, B1 | airline, email |
| 12 | flight-info-change-notice | Check the inbox for flight status change notifications from the airline; if found, check the calendar for affected plans and notify co-travelers | hard | Calendar & Task Mgmt | A1, B1 | airline, email, todolist |
| 13 | baggage-tracking-application | Find a way to register lost luggage with the airline on its website | easy | E-commerce & Daily Svcs | | airline |
| 14 | schedule-change-request | A schedule change requires the agent to find conflicting calendar entries, email relevant parties to inform them of the change, and explain the reason | medium | Calendar & Task Mgmt | A1 | airline, email, todolist |
| 15 | blog-site-from-scratch | Build a blog site from scratch: Astro frontend, Node.js backend, SQLite database; support regular user/editor/admin roles; support posting, keyword search, Markdown rendering, email registration/login, personal dashboard; editors can delete posts, admins can appoint editors, all pages support mutual navigation | easy | Coding & Software Dev | | |
| 16 | blog-site-completion-from-starter | Starting from a blog code skeleton, implement complete blog features: Astro frontend, Node.js backend, SQLite database; fill in user/editor/admin role permissions, search, Markdown rendering and other functionality | easy | Coding & Software Dev | A2 | |
| 17 | washer-shop | Buy a portable washing machine from Mosi Shop with a rating of at least 4.6 | easy | E-commerce & Daily Svcs | | shop |
| 18 | watch-shop | Buy the lowest-priced smartwatch from Mosi Shop with a rating of at least 4.6 | easy | E-commerce & Daily Svcs | | shop |
| 19 | washer-change | The purchased item is not portable enough; exchange it for a portable washing machine rated at least 4.6, completing both the return and the new order | easy | E-commerce & Daily Svcs | | shop |
| 20 | info-change | Change shipping address to '4278 Maple View Drive, Sacramento, CA 95814, USA' and phone number to '12345678901' | easy | E-commerce & Daily Svcs | | shop |