Will METR report a frontier AI model with a 50%-success task time horizon above 40 hours by the end of 2027?

Resolves Feb 1, 2028 (in 2y) · $317 volume · 1 trade

Yes

Market Insights

House estimate

5% YES

This market asks whether METR will publish a measurement showing a frontier AI model with a 50%-success task time horizon above 40 hours by the end of 2027. METR's public data puts the current best model (Claude 3.7 Sonnet) at approximately 55 minutes, far below the 40-hour threshold. The 50% time horizon has been growing exponentially with a doubling time of about 7 months, but even if that trend continues, going from the current ~1-hour baseline to 40 hours would take roughly 5 to 6 doublings (35 to 42 months), more than the ~18 months remaining through 2027. Additionally, METR explicitly notes that measurements above 16 hours remain unreliable with its current task suite, suggesting methodological limitations may persist. The MirrorCode benchmark, which shows agents completing weeks-long tasks, is promising but is early-stage research not yet integrated into METR's standard time-horizon measurements.
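The doubling arithmetic in the estimate above can be checked with a short sketch. It assumes the time horizon keeps growing at a constant exponential rate, and it uses only the figures quoted in the rationale (55 minutes, 40 hours, 7-month doubling time). The function name `months_to_reach` is hypothetical and chosen for illustration.

```python
import math


def months_to_reach(current_hours, target_hours, doubling_months):
    """Return (doublings, months) needed to grow from current to target.

    Assumes pure exponential growth with a fixed doubling time, which is
    a simplification of the trend described in the rationale.
    """
    doublings = math.log2(target_hours / current_hours)
    return doublings, doublings * doubling_months


# Figures taken from the rationale: ~55-minute baseline, 40-hour target,
# 7-month doubling time.
doublings, months = months_to_reach(55 / 60, 40, 7)
print(f"{doublings:.2f} doublings, about {months:.0f} months")
```

Under these assumptions the result falls inside the 5 to 6 doublings and 35 to 42 months quoted above. A shorter doubling time passed as `doubling_months` would shrink the required months proportionally, which is the main sensitivity in the estimate.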

Latest context: MirrorCode Benchmark Shows AI Can Complete Weeks-Long Coding Tasks · 2mo ago

MirrorCode Benchmark Shows AI Can Complete Weeks-Long Coding Tasks · 2mo ago
METR released early results from the MirrorCode benchmark showing that AI agents can complete weeks-long coding tasks, including reimplementing a 16,000-line codebase.
- METR
- Epoch
METR Releases Time Horizon 1.1 Updated Methodology · 1y ago
METR updated its time horizon measurement methodology with a larger task suite, providing a more comprehensive evaluation of frontier AI models.
- METR
- Epoch