Humans can easily spot the moments when a favorite actor appears or speaks in a movie, but computer vision systems struggle with this task: a person's appearance, facial expressions, pose, and illumination all change as a video progresses.
A recent study proposes a novel dataset and benchmark for audiovisual person retrieval in long untrimmed videos.
The dataset consists of 15-minute video clips from movies annotated with person identities, where each identity is matched to both the person's face and voice. A two-stream model that predicts a person's identity from audiovisual cues serves as a baseline, sketched below.
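The paper's exact architecture is not reproduced here, but the general idea of a two-stream audiovisual model can be illustrated with a minimal PyTorch sketch: one branch embeds face crops, the other embeds voice spectrograms, and the two embeddings are fused for identity prediction. All layer sizes, input shapes, and the `TwoStreamAVModel` name are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

class TwoStreamAVModel(nn.Module):
    """Hypothetical two-stream baseline: a visual branch for face crops and
    an audio branch for voice spectrograms, fused for identity classification."""

    def __init__(self, num_identities: int, embed_dim: int = 256):
        super().__init__()
        # Visual stream: a small CNN over face crops (3 x 112 x 112).
        self.visual = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )
        # Audio stream: a small CNN over log-mel spectrograms (1 x 64 x T).
        self.audio = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )
        # Late fusion of the two embeddings, followed by an identity classifier.
        self.classifier = nn.Linear(2 * embed_dim, num_identities)

    def forward(self, face: torch.Tensor, voice: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([self.visual(face), self.audio(voice)], dim=1)
        return self.classifier(fused)

# Toy forward pass with random tensors standing in for a face crop and a spectrogram.
model = TwoStreamAVModel(num_identities=100)
faces = torch.randn(4, 3, 112, 112)
voices = torch.randn(4, 1, 64, 128)
logits = model(faces, voices)  # shape: (4, 100)
```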
Benchmarks are introduced for two tasks, Seen and Seen & Heard, which aim to retrieve all segments in which a queried person appears on screen or both appears and speaks. The authors show that the new dataset complements previous datasets, which focus on visual analysis only.
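Once segment-level embeddings are available, retrieval of this kind can be framed as ranking segments by similarity to a query embedding. The sketch below shows one such approach using cosine similarity; the threshold, embedding dimension, and the `retrieve_segments` helper are illustrative assumptions rather than the paper's evaluation protocol.

```python
import numpy as np

def retrieve_segments(query_embedding: np.ndarray,
                      segment_embeddings: np.ndarray,
                      threshold: float = 0.5) -> list[int]:
    """Rank video segments by cosine similarity to the query embedding and
    return indices of segments whose similarity exceeds the threshold."""
    q = query_embedding / np.linalg.norm(query_embedding)
    s = segment_embeddings / np.linalg.norm(segment_embeddings, axis=1, keepdims=True)
    sims = s @ q                      # cosine similarity per segment
    ranked = np.argsort(-sims)        # best matches first
    return [int(i) for i in ranked if sims[i] >= threshold]

# Toy example: a 256-d query embedding against 10 segment embeddings.
rng = np.random.default_rng(0)
query = rng.normal(size=256)
segments = rng.normal(size=(10, 256))
print(retrieve_segments(query, segments, threshold=0.0))
```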
Research paper: Alcázar, J. L., et al., "APES: Audiovisual Person Search in Untrimmed Video", 2021. Link: https://arxiv.org/abs/2106.01667