Task Browser

Browse and filter all benchmark tasks.

134 tasks
| ID | Name | Description | Difficulty | Domain | Factors | Mock Apps |
|---|---|---|---|---|---|---|
| 1 | skill-creation | Identify patterns from interaction history and create a SKILL from scratch | hard | Documents & Knowledge | B2 | |
| 2 | skill-supplementation | Supplement and update an existing SKILL based on interaction history | medium | Documents & Knowledge | B2 | |
| 3 | skill-conflict-resolution | Update existing SKILL based on interaction history: identify conflicting content, identify potentially misleading content in the old SKILL version, and update accordingly | easy | Documents & Knowledge | B2 | |
| 4 | skill-repository-curation | Organize existing SKILL library: merge redundant SKILLs, remove obsolete ones | easy | Documents & Knowledge | B2 | |
| 5 | skill-dependency-fix | After a user modifies a lower-level SKILL, identify the dependency relationships of higher-level SKILLs that call the lower-level SKILL, and update the higher-level SKILLs accordingly | easy | Documents & Knowledge | B2 | |
| 6 | email-writing | Help the user compose and send an email to a specified recipient | easy | Communication & Email | | email |
| 7 | email-reply | Check if an expected email exists; if so compose and send a reply | easy | Communication & Email | | email |
| 8 | flight-booking | Help the user book a flight for a specified date and route | medium | E-commerce & Daily Svcs | | airline |
| 9 | flight-seat-selection | Help the user check in and select a seat for a specified booked flight; first retrieve the booking details from email, then select a seat on the airline website according to user requirements | medium | E-commerce & Daily Svcs | A1 | airline, email |
| 10 | flight-seat-selection-failed | Help the user check in and select a seat for a specified booked flight; first retrieve the booking details from email, then select a seat on the airline website; when user seat requirements cannot be met, find the best alternative | medium | E-commerce & Daily Svcs | A1, B1 | airline, email |
| 11 | flight-cancel-claim | Check the status of a booked flight; if the flight is cancelled, help the user file a compensation claim according to the airline website announcement | hard | E-commerce & Daily Svcs | A1, B1 | airline, email |
| 12 | flight-info-change-notice | Check the inbox for flight status change notifications from the airline; if found, check the calendar for affected plans and notify co-travelers | hard | Calendar & Task Mgmt | A1, B1 | airline, email, todolist |
| 13 | baggage-tracking-application | Find a way to register lost luggage with the airline on its website | easy | E-commerce & Daily Svcs | | airline |
| 14 | schedule-change-request | A schedule change requires the agent to find conflicting calendar entries, email relevant parties to inform them of the change, and explain the reason | medium | Calendar & Task Mgmt | A1 | airline, email, todolist |
| 15 | blog-site-from-scratch | Build a blog site from scratch: Astro frontend, Node.js backend, SQLite database; support regular user/editor/admin roles; support posting, keyword search, Markdown rendering, email registration/login, personal dashboard; editors can delete posts, admins can appoint editors, all pages support mutual navigation | easy | Coding & Software Dev | | |
| 16 | blog-site-completion-from-starter | Starting from a blog code skeleton, implement complete blog features: Astro frontend, Node.js backend, SQLite database; fill in user/editor/admin role permissions, search, Markdown rendering and other functionality | easy | Coding & Software Dev | A2 | |
| 17 | washer-shop | Buy a washing machine from Mosi Shop with a rating of at least 4.6 and portable | easy | E-commerce & Daily Svcs | | shop |
| 18 | watch-shop | Buy a smartwatch from Mosi Shop with a rating of at least 4.6 and the lowest price | easy | E-commerce & Daily Svcs | | shop |
| 19 | washer-change | The purchased item is not portable enough; exchange it for a washing machine rated at least 4.6 and portable, completing the return and re-ordering | easy | E-commerce & Daily Svcs | | shop |
| 20 | info-change | Change shipping address to '4278 Maple View Drive, Sacramento, CA 95814, USA' and phone number to '12345678901' | easy | E-commerce & Daily Svcs | | shop |