A multidimensional adaptive test for the psychometric assessment of LLM capabilities

At a glance

Project duration
06/2026 – 05/2029
DFG classification of subject areas

General, Cognitive and Mathematical Psychology

Funded by

DFG Priority Programme

Project description

With the rise of Large Language Models (LLMs), new models are being released at a rapid pace. This is accompanied by an equally rapid release of new benchmark datasets for assessing model performance in various domains – from language processing and problem solving to more specialized capabilities such as emotion detection and theory of mind. In this dynamic environment, assessing each new model on the entire item pool of every relevant benchmark is not only a technical challenge but also raises fundamental concerns about the scalability, resource demands, and sustainability of benchmark-based performance assessment.

In the present project, we address these issues by adapting the multidimensional item response theory (mIRT) framework, developed for psychometric assessment, to LLM benchmarking. In the IRT framework, population-invariant, item-specific difficulty and discrimination parameters are estimated for each item from the empirical performance of a norming sample, which allows us to identify the items that are most informative about LLMs’ latent abilities. The mIRT framework extends this approach to multiple ability dimensions. We will collect the responses of a norming sample of LLMs to a diverse set of benchmark items from various domains and use mIRT to estimate the item parameters. We will then use the most informative items to implement a computerized adaptive test (CAT) for LLM capabilities, in which items are presented successively until the LLM’s capability parameters are estimated with sufficient confidence, allowing for a maximally efficient capability assessment. This assessment infrastructure – which will be designed as a future-proof “living environment” where new items can be added and items that become uninformative over time can be removed – will be made available as local software packages and via an online interface.
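To make the adaptive testing idea concrete, the following is a minimal sketch and not the project's implementation. It assumes a two-dimensional compensatory two-parameter logistic model, a standard-normal prior on the abilities, grid-based posterior estimation, and item selection by the determinant of the accumulated information matrix (D-optimality). The item bank, the simulated LLM, and all function names are hypothetical and serve only as illustration.

```python
import math
import random

def p_correct(theta, a, d):
    """Assumed model: P(correct) = sigmoid(a . theta + d), two ability dimensions."""
    z = a[0] * theta[0] + a[1] * theta[1] + d
    return 1.0 / (1.0 + math.exp(-z))

def item_information(theta, a, d):
    """Fisher information matrix of one item at theta: P(1 - P) * a a^T."""
    p = p_correct(theta, a, d)
    w = p * (1.0 - p)
    return [[w * a[0] * a[0], w * a[0] * a[1]],
            [w * a[0] * a[1], w * a[1] * a[1]]]

def det2(m):
    """Determinant of a 2x2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

GRID = [-4.0 + 0.25 * i for i in range(33)]  # ability grid from -4 to 4

def posterior_summary(responses, items):
    """Posterior mean and SD per dimension, computed on the grid."""
    points, weights = [], []
    for t1 in GRID:
        for t2 in GRID:
            log_w = -0.5 * (t1 * t1 + t2 * t2)  # standard-normal prior (assumption)
            for j, correct in responses:
                p = p_correct((t1, t2), *items[j])
                log_w += math.log(p if correct else 1.0 - p)
            points.append((t1, t2))
            weights.append(math.exp(log_w))
    total = sum(weights)
    mean = [sum(w * pt[k] for w, pt in zip(weights, points)) / total
            for k in (0, 1)]
    sd = [math.sqrt(sum(w * (pt[k] - mean[k]) ** 2
                        for w, pt in zip(weights, points)) / total)
          for k in (0, 1)]
    return mean, sd

def run_cat(items, answer, sd_target=0.4, max_items=30):
    """Present items one at a time until both posterior SDs fall below
    sd_target or the item budget is exhausted."""
    responses = []
    theta, sd = posterior_summary(responses, items)
    while len(responses) < min(max_items, len(items)) and max(sd) > sd_target:
        used = {j for j, _ in responses}
        # Prior precision (identity) plus information of the items given so far,
        # evaluated at the current ability estimate.
        info = [[1.0, 0.0], [0.0, 1.0]]
        for j in used:
            given = item_information(theta, *items[j])
            info = [[info[r][c] + given[r][c] for c in (0, 1)] for r in (0, 1)]

        def gain(j):
            cand = item_information(theta, *items[j])
            return det2([[info[r][c] + cand[r][c] for c in (0, 1)]
                         for r in (0, 1)])

        best = max((j for j in range(len(items)) if j not in used), key=gain)
        responses.append((best, answer(best)))
        theta, sd = posterior_summary(responses, items)
    return theta, sd, [j for j, _ in responses]

# Hypothetical item bank: (discriminations a, intercept d) per item.
rng = random.Random(0)
item_bank = [((rng.uniform(0.2, 2.0), rng.uniform(0.2, 2.0)),
              rng.uniform(-2.0, 2.0)) for _ in range(60)]
true_theta = (1.0, -0.5)  # abilities of the simulated LLM (made up)

def simulated_llm(j):
    """Stand-in for scoring a real model's answer to item j."""
    return rng.random() < p_correct(true_theta, *item_bank[j])

theta_hat, sd_hat, administered = run_cat(item_bank, simulated_llm)
print(len(administered), [round(t, 2) for t in theta_hat],
      [round(s, 2) for s in sd_hat])
```

A single item's information matrix has rank one, so its determinant is zero on its own; adding it to the prior precision and the information already collected is what makes the determinant a usable selection criterion here. In a real application the item parameters would come from the mIRT calibration on the norming sample rather than from random draws.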

Principal investigator

  • Person

    Dr. Fritz Günther

    • Department of Psychology
    • Psychological Methods

Participating institutions