[Paper] VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text
https://arxiv.org/abs/2104.11178

Written after reading this paper ..
Lab study
2024. 6. 10.