TP2: Vision, Language and Multimodal Challenges

PIs:

  • Rita Cucchiara (CNR – Università di Modena e Reggio Emilia)
  • Roberto Navigli (Sapienza Università di Roma)

Computer Vision, Natural Language Processing, and Multimodal data processing are key disciplines in AI that have recently become closely connected through the introduction of so-called Foundation Models: powerful language-only and vision-language models that integrate visual information both as input and output, while offering a dialogue-based interface and instruction-following capabilities.
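To make the notion of an instruction-following vision-language model concrete, the sketch below shows how such a model can be queried with an image and a textual instruction through the Hugging Face transformers library. The checkpoint and the image URL are placeholders chosen purely for illustration; they are not models or data produced by the TP.

```python
# Minimal sketch: instruction-following inference with an open vision-language model.
# The checkpoint is an example (LLaVA 1.5) and the image URL is a placeholder.
import requests
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # example open checkpoint, not a TP model
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)

# Placeholder image; any RGB picture works.
image = Image.open(requests.get("https://example.com/photo.jpg", stream=True).raw)
prompt = "USER: <image>\nDescribe the scene in one sentence. ASSISTANT:"

# The processor fuses the visual and textual inputs into a single multimodal prompt.
inputs = processor(text=prompt, images=image, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=60)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```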
In this context, the FAIR Transversal Project (TP) on “Vision, Language and Multimodal Challenges” aims to bring together the Italian scientific and industrial community and to achieve important breakthroughs on both the theoretical and the applicative side. Ultimately, the objective of the TP is the development of a family of language models that treat the Italian language as a first-class citizen and support multimodal inputs and outputs. The TP also covers the creation of language-only and multimodal benchmarks curated by Italian experts and specifically tailored to the Italian language.

Overall, the TP will equip the Italian scientific community with the fundamental skills needed to train and evaluate large-scale models, both from a theoretical and a practical point of view. For this reason, the activities are carried out in collaboration with national HPC infrastructures, and especially with CINECA, which provides computational support to the community through ISCRA-B projects and dedicated budgets on the Leonardo infrastructure.
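As an illustration of the kind of evaluation such Italian-language benchmarks enable, the following sketch scores a multiple-choice item by selecting the candidate answer with the highest average log-likelihood under a causal language model. The checkpoint and the example question are assumptions made purely for demonstration; an Italian-first model and a curated benchmark would be used in practice.

```python
# Minimal sketch of multiple-choice benchmark scoring via per-option log-likelihood.
# The checkpoint and the question below are placeholders, not TP deliverables.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder checkpoint; an Italian-first model would be used in practice
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def option_logprob(question: str, option: str) -> float:
    """Average log-probability of the option tokens, conditioned on the question."""
    prompt_ids = tokenizer(question, return_tensors="pt").input_ids
    full_ids = tokenizer(question + " " + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Score only the tokens belonging to the option (shifted by one for next-token prediction).
    option_len = full_ids.shape[1] - prompt_ids.shape[1]
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = full_ids[0, 1:]
    token_scores = log_probs[torch.arange(targets.shape[0]), targets]
    return token_scores[-option_len:].mean().item()

# Hypothetical Italian multiple-choice item, for illustration only.
question = "Qual è la capitale d'Italia?"
options = ["Roma", "Milano", "Napoli"]
best = max(options, key=lambda o: option_logprob(question, o))
print(best)
```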