Cross-Modal Learning for Free-Text Video Search
Abstract
This article focuses on cross-modal video retrieval, a technology with wide-ranging applications for media networks, security organizations, and individuals managing large personal video collections. The authors discuss the concept of cross-modal video learning and give an overview of deep neural network architectures in the literature, focusing on methods that combine visual and textual representations for cross-modal video retrieval. They also examine the impact of vision transformers, a learning paradigm that has significantly improved cross-modal learning performance. Finally, they present a novel cross-modal network architecture for free-text video retrieval called T×V+Objects. This method extends an existing state-of-the-art network by incorporating object-based video encoding using transformers. It leverages multiple latent spaces and combines detected objects with textual features, creating a joint embedding space for improved text-video similarity.
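To make the idea of text-video similarity over multiple latent spaces concrete, the following is a minimal sketch, not the T×V+Objects implementation. It assumes that the text and video encoders (not shown) have already produced one embedding per latent space, and that the per-space cosine similarities are simply summed; the abstract does not state how the spaces are combined, so that choice and the function names `multi_space_similarity` and `rank_videos` are illustrative assumptions.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def multi_space_similarity(text_embs, video_embs):
    """Sum cosine similarities over several joint latent spaces.

    text_embs[k]  has shape (n_text, d_k)
    video_embs[k] has shape (n_video, d_k)
    Returns an (n_text, n_video) similarity matrix.

    Assumption: spaces are combined by an unweighted sum; the
    source abstract does not specify the combination rule.
    """
    total = None
    for t, v in zip(text_embs, video_embs):
        sim = l2_normalize(t) @ l2_normalize(v).T  # cosine similarity
        total = sim if total is None else total + sim
    return total


def rank_videos(text_embs, video_embs):
    """For each text query, return video indices from best to worst match."""
    sims = multi_space_similarity(text_embs, video_embs)
    return np.argsort(-sims, axis=1)
```

In such a setup, a free-text query is answered by encoding it once per latent space and returning the videos in the order given by `rank_videos`.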