Expected Outcome:
Project results are expected to contribute to some of the following expected outcomes:
- New assessment and validation methodologies that enable the evaluation of the capabilities and risks of General Purpose AI (GPAI) models and systems, including multimodal systems.
- Use of the research outcomes by GPAI providers, policymakers, public institutions, and other relevant stakeholders to evaluate the capabilities and risks of GPAI models and systems.
- Support for the AI Office in its function of conducting evaluations of GPAI models, with a view to enforcing the AI Act's rules for GPAI models, and facilitation of self-evaluation by GPAI model developers to ensure compliance with AI Act requirements.
Scope:
The rapid advancement of artificial intelligence (AI) has led to the development of increasingly sophisticated general-purpose AI (GPAI) models and systems. These models, such as large language models and multimodal AI systems, demonstrate remarkable capabilities across a wide range of tasks. However, assessing the capabilities of these models remains a significant challenge. Traditional evaluation methods often fail to capture the full spectrum of abilities exhibited by GPAI models and systems. Therefore, there is a pressing need for the development of new assessment frameworks, methodologies and tools that can comprehensively evaluate these models in terms of their trustworthy and ethical behaviour and operation, ensuring their reliability, fairness, and alignment with human values.
This topic aims to develop robust assessment tools, techniques, and benchmarks specifically designed to rigorously evaluate GPAI models and systems, including multimodal systems. Proposals should cover one or more of the following research areas:
- Innovative methods for proactively identifying and forecasting emergent capabilities in GPAI models and systems. This encompasses the identification of capabilities with both beneficial and potentially detrimental uses (a minimal forecasting sketch follows this list).
- Assessment of GPAI capabilities with a significant economic impact or potential for misuse. This includes assessing capabilities that drive beneficial innovation and societal good, as well as evaluating potential risks in areas such as chemical, biological, radiological, and nuclear (CBRN) hazards or cybersecurity threats.
- Assessment techniques that illuminate the underlying mechanisms of emergent capabilities in AI systems, with an emphasis on interpretability and explainability.
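By way of illustration for the first research area above, the following is a minimal sketch of one possible forecasting approach: fitting a logistic curve to benchmark accuracy observed at increasing model scales and extrapolating to a capability-emergence threshold. The data points, the `sigmoid` parameterisation, and the use of `scipy.optimize.curve_fit` are illustrative assumptions, not a prescribed methodology.

```python
# Minimal sketch (illustrative only): forecasting a capability's emergence
# by fitting a logistic curve to benchmark accuracy vs. log model scale.
# All data points below are hypothetical.
import numpy as np
from scipy.optimize import curve_fit

def sigmoid(log_n, a, b):
    """Accuracy modelled as a logistic function of log10(parameter count)."""
    return 1.0 / (1.0 + np.exp(-a * (log_n - b)))

# Hypothetical (parameter count, benchmark accuracy) observations.
params = np.array([1e8, 1e9, 1e10, 1e11])
accuracy = np.array([0.02, 0.05, 0.21, 0.58])

(a, b), _ = curve_fit(sigmoid, np.log10(params), accuracy, p0=[1.0, 11.0])

# Extrapolate: at what scale would accuracy cross a 90% threshold?
threshold = 0.9
log_n_star = b + np.log(threshold / (1 - threshold)) / a
print(f"Projected ~{threshold:.0%} accuracy at ~10^{log_n_star:.1f} parameters")
```

Such extrapolations are only one candidate technique; projects may equally pursue elicitation-based probing, red-teaming, or mechanistic analyses under this research area.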
Projects should generate example benchmark tests that examine trained AI models and systematically uncover latent capabilities. These benchmarks will be made available to GPAI providers, policymakers, and other relevant stakeholders so that they can implement robust evaluation tools.
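For illustration only, the sketch below shows the skeleton such a benchmark test might take: a small suite of probe tasks with pass/fail checks, run against any text-generation callable. The `Task` structure, the arithmetic probes, and the `model` interface are hypothetical placeholders; real benchmarks would require far larger task suites, capability-elicitation strategies, and more robust scoring.

```python
# Minimal sketch (illustrative only) of a capability-probing benchmark.
# `model` stands in for any text-generation callable; real evaluations
# would use a proper inference API and far larger task suites.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    prompt: str
    check: Callable[[str], bool]  # returns True if the output passes

# Hypothetical probe tasks targeting a latent capability (here: arithmetic).
TASKS = [
    Task("What is 17 * 23? Answer with the number only.",
         lambda out: "391" in out),
    Task("What is 1024 / 8? Answer with the number only.",
         lambda out: "128" in out),
]

def run_benchmark(model: Callable[[str], str], tasks: list[Task]) -> float:
    """Return the fraction of tasks whose output passes its check."""
    passed = sum(task.check(model(task.prompt)) for task in tasks)
    return passed / len(tasks)

if __name__ == "__main__":
    # Trivial stand-in model so the sketch runs end to end.
    echo_model = lambda prompt: "391" if "17" in prompt else "128"
    print(f"pass rate: {run_benchmark(echo_model, TASKS):.0%}")
```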
This topic strongly encourages the formation of interdisciplinary teams combining the necessary technical expertise. Such a collaborative approach will ensure that assessments accurately capture real-world use cases, including capabilities elicitation techniques, and that the developed frameworks, methodologies and tools are responsive to the concerns of all relevant stakeholders.
This topic requires the effective contribution of social sciences and humanities (SSH) disciplines, the involvement of SSH experts and institutions, and the inclusion of relevant SSH expertise, in order to produce meaningful and significant effects enhancing the societal impact of the related research activities.
Proposals must adhere to Horizon Europe's requirements regarding Open Science. Open access to research outputs should be provided unless there is a legitimate reason or constraint; in such cases, the proposal should detail how GPAI providers, policymakers, and other stakeholders will access the research outcomes.
All proposals are expected to incorporate mechanisms for assessing and demonstrating progress, including qualitative and quantitative KPIs, benchmarking, and progress monitoring. This should include participation in international evaluation contests and the presentation of illustrative application use cases that demonstrate concrete potential added value. Communicable results should be shared with the European R&D community through the AI-on-demand platform and, if necessary, other relevant digital resource platforms, in order to bolster the European AI, Data, and Robotics ecosystem by disseminating results and best practices.
This topic implements the co-programmed European Partnership on AI, Data and Robotics (ADRA), and all proposals are expected to allocate tasks to cohesion activities with ADRA and the CSA HORIZON-CL4-2025-03-HUMAN-18: GenAI4EU central Hub.
Proposals should also build on or seek collaboration with existing projects and develop synergies with other relevant international, European, national, or regional initiatives. Regarding European programmes, projects are expected to develop synergies and complementarities with relevant projects funded not only under Horizon Europe but also under the Digital Europe Programme (DEP).