METR

METR
Formation	2022
Founder	Beth Barnes
Type	Nonprofit research institute
Legal status	501(c)(3) tax exempt charity
Purpose	AI safety research and model evaluation
Location	Berkeley, California;
Website	metr.org

Model Evaluation and Threat Research (METR) (MEE-tər), is a nonprofit research institute, based in Berkeley, California,^[1] that evaluates frontier AI models' capabilities to carry out long-horizon, agentic tasks that some researchers argue could pose catastrophic risks to society.^[2]^[3] They have worked with leading AI companies to conduct pre-deployment model evaluations and contribute to system cards, including OpenAI's o3, o4-mini, GPT-4o and GPT-4.5, and Anthropic's Claude models.^[3]^[4]^[5]^[6]^[7]

METR's CEO and founder is Beth Barnes, a former alignment researcher at OpenAI who left in 2022 to form ARC Evals, the evaluation division of Paul Christiano's Alignment Research Center. In December 2023, ARC Evals was then spun off into an independent 501(c)(3) nonprofit and renamed METR.^[8]^[9]^[10]

Research

A substantial amount of METR's research is focused on evaluating the capabilities of AI systems to conduct research and development of AI systems themselves, including RE-Bench, a benchmark designed to test whether AIs can "solve research engineering tasks and accelerate AI R&D".^[11]^[12]

Doubling time estimates

In March 2025, METR published a paper noting that the length of software engineering tasks that the leading AI model could complete had a doubling time of around 7 months between 2019 and 2024.^[14]

In January 2026, METR has released a new version of their time horizon estimates model (Time Horizon 1.1). According to their new model the rate of progress of AI capabilities has increased since 2023. They now estimate that the post-2023 doubling-time is 130.8 days (4.3 months). Progress is thus estimated to be 20% more rapid.^[15]

Time horizon measurements

METR releases a "task-completion time horizon" for analysed AI models. This measures the "task duration (measured by human expert completion time) at which an AI agent is predicted to succeed with a given level of reliability."^[16] They release it in two variants: The 50%-time horizon, which gives the task duration at which an AI model is estimated to succeed 50% of the time and the 80%-time horizon, which gives the task duration at which an AI model is estimated to succeed 80% of the time.^[16] They have two versions of horizon estimates: Time Horizon 1.1, introduced in January 2026, and the original Time Horizon 1.0.^[16]

As of 21 February 2026 the best performing model is Claude Opus 4.6 with a 14 hours 30 minutes 50%-time horizon and a 80%-time horizon of 1 hour and 3 minutes.^[16] The following table provides the time horizon estimates ordered by the model's release date:^[16]

Task duration (for humans)
Model	Release date	Time Horizon 1.1		Time Horizon 1.0
Model	Release date	50%	80%	50%	80%
GPT-2	February 2019	—	—	2 seconds	0 seconds
GPT-3	May 2020	—	—	9 seconds	2 seconds
GPT-3.5	March 2022	—	—	36 seconds	10 seconds
GPT-4	March 2023	4 minutes	37 seconds	5 minutes	1 minute
GPT-4 (November 2023)	November 2023	4 minutes	34 seconds	9 minutes	1 minute
Claude 3 Opus	March 2024	4 minutes	29 seconds	6 minutes	1 minute
GPT-4 Turbo	April 2024	3 minutes	37 seconds	7 minutes	2 minutes
GPT-4o	May 2024	6 minutes	57 seconds	9 minutes	2 minutes
Qwen2-72B	June 2024	—	—	2 minutes	25 seconds
Claude 3.5 Sonnet (Old)	June 2024	11 minutes	1 minute	19 minutes	3 minutes
Qwen2.5-72B	September 2024	—	—	5 minutes	56 seconds
o1-preview	September 2024	19 minutes	3 minutes	22 minutes	5 minutes
Claude 3.5 Sonnet (New)	October 2024	20 minutes	2 minutes	30 minutes	5 minutes
DeepSeek-V3	December 2024	—	—	18 minutes	4 minutes
o1	December 2024	38 minutes	6 minutes	41 minutes	6 minutes
Claude 3.7 Sonnet	February 2025	1 hour	10 minutes	56 minutes	15 minutes
o3	April 2025	2 hours 1 minute	24 minutes	1 hour 34 minutes	21 minutes
o4-mini	April 2025	—	—	1 hour 19 minutes	16 minutes
Claude Opus 4	May 2025	1 hour 41 minutes	17 minutes	1 hour 26 minutes	21 minutes
DeepSeek-R1-0528	May 2025	—	—	32 minutes	4 minutes
Gemini 2.5 Pro Preview	June 2025	—	—	40 minutes	9 minutes
Grok 4	July 2025	—	—	1 hour 49 minutes	15 minutes
Claude Opus 4.1	August 2025	1 hour 41 minutes	19 minutes	—	—
GPT-5	August 2025	3 hours 34 minutes	32 minutes	2 hours 18 minutes	27 minutes
gpt-oss-120b	August 2025	—	—	45 minutes	7 minutes
Claude Sonnet 4.5	September 2025	—	—	2 hours 2 minutes	21 minutes
Gemini 3 Pro	November 2025	3 hours 57 minutes	43 minutes	—	—
Claude Opus 4.5	November 2025	5 hours 20 minutes	42 minutes	4 hours 49 minutes	27 minutes
GPT-5.1-Codex-Max	November 2025	3 hours 57 minutes	41 minutes	2 hours 53 minutes	32 minutes
Kimi K2 Thinking (inference via Novita AI)	November 2025	—	—	58 minutes	12 minutes
GPT-5.2 (high)	December 2025	6 hours 34 minutes	55 minutes	—	—
Claude Opus 4.6	February 2026	11 hours 59 minutes	1 hour 10 minutes	—	—
GPT-5.3-Codex (high)	February 2026	6 hours 30 minutes	47 minutes	—	—

References

^ Witt, Stephen (10 October 2025). "The A.I. Prompt That Could End the World". The New York Times. Archived from the original on 29 October 2025. Retrieved 29 October 2025.
^ "About METR". METR. Archived from the original on 15 June 2025. Retrieved 15 June 2025.
^ ^a ^b "OpenAI o3 and o4-mini System Card". OpenAI. Archived from the original on 15 June 2025. Retrieved 15 June 2025.
^ "GPT-4.5 system card". OpenAI. Retrieved 15 June 2025.
^ "Introducing Claude 3.5 Sonnet". Anthropic. Archived from the original on 6 February 2025. Retrieved 15 June 2025.
^ "Details about METR's preliminary evaluation of Claude 3.7". METR's Autonomy Evaluation Resources. 4 April 2025. Archived from the original on 15 June 2025. Retrieved 15 June 2025.
^ Robison, Kylie (2024-08-08). "OpenAI says its latest GPT-4o model is 'medium' risk" Archived 6 February 2026 at the Wayback Machine. The Verge Archived 21 October 2025 at the Wayback Machine. Retrieved 2025-10-29.
^ "ARC Evals is now METR". METR Blog. 4 December 2023. Archived from the original on 15 June 2025. Retrieved 15 June 2025.
^ Booth, Harry (5 September 2024). "TIME100 AI 2024: Beth Barnes". TIME. Archived from the original on 15 June 2025. Retrieved 15 June 2025.
^ Henshall, Will (21 March 2024). "Nobody Knows How to Safety-Test AI". TIME. Archived from the original on 15 June 2025. Retrieved 15 June 2025.
^ "Claude 3.7 Sonnet System Card". Anthropic. 24 February 2025. Retrieved 15 June 2025.
^ "Gemini 2.5 Pro Preview Model Card". Google. 6 June 2025. Archived from the original on 28 May 2025. Retrieved 15 June 2025.
^ "Measuring AI Ability to Complete Long Tasks". METR Blog. 19 March 2025. Archived from the original on 15 June 2025. Retrieved 15 June 2025.
^ Lovely, Garrison (19 March 2025). "AI could soon tackle projects that take humans weeks". Nature. doi:10.1038/d41586-025-00831-8. ISSN 1476-4687. Archived from the original on 1 July 2025. Retrieved 15 June 2025.
^ "Time Horizon 1.1". METR Blog. 29 January 2026. Archived from the original on 12 February 2026. Retrieved 14 February 2026.
^ ^a ^b ^c ^d ^e "Task-Completion Time Horizons of Frontier AI Models". METR. February 2026. Retrieved 20 February 2026.

External links

Official website

[1] Witt, Stephen (10 October 2025). "The A.I. Prompt That Could End the World". The New York Times. Archived from the original on 29 October 2025. Retrieved 29 October 2025.

[2] "About METR". METR. Archived from the original on 15 June 2025. Retrieved 15 June 2025.

[:0-3] "OpenAI o3 and o4-mini System Card". OpenAI. Archived from the original on 15 June 2025. Retrieved 15 June 2025.

[4] "GPT-4.5 system card". OpenAI. Retrieved 15 June 2025.

[5] "Introducing Claude 3.5 Sonnet". Anthropic. Archived from the original on 6 February 2025. Retrieved 15 June 2025.

[6] "Details about METR's preliminary evaluation of Claude 3.7". METR's Autonomy Evaluation Resources. 4 April 2025. Archived from the original on 15 June 2025. Retrieved 15 June 2025.

[7] Robison, Kylie (2024-08-08). "OpenAI says its latest GPT-4o model is 'medium' risk" Archived 6 February 2026 at the Wayback Machine. The Verge Archived 21 October 2025 at the Wayback Machine. Retrieved 2025-10-29.

[8] "ARC Evals is now METR". METR Blog. 4 December 2023. Archived from the original on 15 June 2025. Retrieved 15 June 2025.

[9] Booth, Harry (5 September 2024). "TIME100 AI 2024: Beth Barnes". TIME. Archived from the original on 15 June 2025. Retrieved 15 June 2025.

[10] Henshall, Will (21 March 2024). "Nobody Knows How to Safety-Test AI". TIME. Archived from the original on 15 June 2025. Retrieved 15 June 2025.

[11] "Claude 3.7 Sonnet System Card". Anthropic. 24 February 2025. Retrieved 15 June 2025.

[12] "Gemini 2.5 Pro Preview Model Card". Google. 6 June 2025. Archived from the original on 28 May 2025. Retrieved 15 June 2025.

[13] "Measuring AI Ability to Complete Long Tasks". METR Blog. 19 March 2025. Archived from the original on 15 June 2025. Retrieved 15 June 2025.

[14] Lovely, Garrison (19 March 2025). "AI could soon tackle projects that take humans weeks". Nature. doi:10.1038/d41586-025-00831-8. ISSN 1476-4687. Archived from the original on 1 July 2025. Retrieved 15 June 2025.

[15] "Time Horizon 1.1". METR Blog. 29 January 2026. Archived from the original on 12 February 2026. Retrieved 14 February 2026.

[:1-16] "Task-Completion Time Horizons of Frontier AI Models". METR. February 2026. Retrieved 20 February 2026.

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]

[10]

[11]

[12]

[14]

[15]

[16]

Existential risk from artificial intelligence
Concepts	AGI AI alignment AI boom AI capability control AI safety AI takeover Effective accelerationism Ethics of artificial intelligence Existential risk from artificial intelligence Friendly artificial intelligence Instrumental convergence Intelligence explosion Longtermism Machine ethics Suffering risks Superintelligence Technological singularity Vulnerable world hypothesis
Organizations	AI Futures Project Alignment Research Center Center for AI Safety Center for Applied Rationality Center for Human-Compatible Artificial Intelligence Centre for the Study of Existential Risk Future of Humanity Institute Future of Life Institute Google DeepMind Humanity+ Institute for Ethics and Emerging Technologies Leverhulme Centre for the Future of Intelligence Machine Intelligence Research Institute METR OpenAI PauseAI Safe Superintelligence Stop AI
People	Scott Alexander Sam Altman Dario Amodei Yoshua Bengio Nick Bostrom Paul Christiano Eric Drexler Owain Evans Sam Harris Stephen Hawking Dan Hendrycks Geoffrey Hinton Bill Joy Daniel Kokotajlo Shane Legg Jan Leike Elon Musk Steve Omohundro Toby Ord Huw Price Martin Rees Stuart J. Russell Nate Soares Ilya Sutskever Jaan Tallinn Max Tegmark Alan Turing Frank Wilczek Roman Yampolskiy Eliezer Yudkowsky
Books	Do You Trust This Computer? Human Compatible If Anyone Builds It, Everyone Dies Our Final Invention Superintelligence: Paths, Dangers, Strategies The Precipice: Existential Risk and the Future of Humanity
Other	Artificial Intelligence Act Open letter on artificial intelligence Regulation of artificial intelligence US Statement on AI Risk
Category