- Multimedia Understanding with Pre-trained Models
Multimedia Understanding with Pre-trained Models
Multi-modal understanding plays a crucial role in enabling the machine to perceive the physical world with multiple sensor cues as humans. Recently, large-scale pre-trained models (PTMs) has become a research hotspot in the field of artificial intelligence. Existing techniques follow the self-supervised learning paradigm achieve great success on the uni-modal scenes, such as computer vision (CV) and natural language process (NLP). The recent advances in large-scale pre-trained models inspire the researchers to explore more and deeper pre-training techniques for the multi-modal understanding problem. In this workshop, we aim to bring together researchers from the field of multimedia to discuss recent research and future directions on pre-trained models with self-supervised learning for multimedia understanding.
In recent years, we have witnessed the great success of pre-trained models (PTM) in natural language processing (NLP), such as GPT3, BERT, Roberta, DEBERTA, etc. It motivates the researchers in the multimedia community to leverage the idea of PTM to address multi-modal tasks. The scope of this workshop is focused on pre-trained models with self-supervised learning for multimedia understanding. The potential topics include architecture design for multi-modal PTM, pre-text task design for self-supervised learning, multi-modal data modeling, efficiency enhancing for PTM, interpretability of PTM, etc.