Representation-learning methods have enabled researchers to generate images “from scratch,” yet video generation remains a difficult task. A recent paper on arXiv.org combines Video Textures, a classic video synthesis method for creating simple, repetitive videos, with recent advances in self-supervised learning.
The approach synthesizes textures by resampling frames from a single input video. A deep model is trained on that video alone to learn features that capture its spatial and temporal structure. To synthesize the texture, the video is represented as a graph in which the individual frames are nodes and the edges carry transition probabilities. Output videos are generated by randomly traversing edges with high transition probabilities.
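A rough sketch of this sampling step is shown below. The function name synthesize_texture, the top_k cutoff, and the plain NumPy representation of the transition matrix are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def synthesize_texture(transition_probs, start, length, top_k=5, seed=None):
    """Generate a sequence of frame indices by randomly walking a frame-transition graph.

    transition_probs[i, j] is the probability of jumping from frame i to frame j.
    At each step, only the top_k most likely transitions are kept and one of them is
    sampled, which keeps the output temporally smooth while still allowing novel cuts.
    """
    rng = np.random.default_rng(seed)
    sequence = [start]
    current = start
    for _ in range(length - 1):
        probs = transition_probs[current]
        # Restrict to the k most probable next frames, renormalize, and sample one.
        candidates = np.argsort(probs)[-top_k:]
        weights = probs[candidates] / probs[candidates].sum()
        current = int(rng.choice(candidates, p=weights))
        sequence.append(current)
    return sequence
```

Because each step samples among several plausible continuations rather than always taking the single best one, repeated runs yield diverse output sequences from the same input video.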
In one application, a new video is generated from a source video with associated audio and a new conditioning audio track, so that the output synchronizes with the new audio. The approach outperforms previous methods in human perceptual studies.
We introduce a non-parametric approach for infinite video texture synthesis using a representation learned via contrastive learning. We take inspiration from Video Textures, which showed that plausible new videos could be generated from a single one by stitching its frames together in a novel yet consistent order. This classic work, however, was constrained by its use of hand-designed distance metrics, limiting its use to simple, repetitive videos. We draw on recent techniques from self-supervised learning to learn this distance metric, allowing us to compare frames in a manner that scales to more challenging dynamics, and to condition on other data, such as audio. We learn representations for video frames and frame-to-frame transition probabilities by fitting a video-specific model trained using contrastive learning. To synthesize a texture, we randomly sample frames with high transition probabilities to generate diverse temporally smooth videos with novel sequences and transitions. The model naturally extends to an audio-conditioned setting without requiring any finetuning. Our model outperforms baselines on human perceptual scores, can handle a diverse range of input videos, and can combine semantic and audio-visual cues in order to synthesize videos that synchronize well with an audio signal.
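As a hedged illustration of how contrastively learned embeddings could be mapped to frame-to-frame transition probabilities, the sketch below scores a jump from frame i to frame j by the cosine similarity between the embedding of i's true successor and the embedding of j, then applies a temperature-scaled softmax. The transition_matrix helper and this particular scoring are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def transition_matrix(frame_embeddings, temperature=0.1):
    """Map per-frame embeddings to frame-to-frame transition probabilities.

    A jump from frame i to frame j is scored by the cosine similarity between the
    embedding of the frame that actually follows i in the source video and the
    embedding of frame j, so that chosen transitions look like plausible
    continuations. Scores become probabilities via a temperature-scaled softmax,
    mirroring the form of contrastive objectives.
    """
    # L2-normalize so dot products are cosine similarities.
    emb = frame_embeddings / np.linalg.norm(frame_embeddings, axis=1, keepdims=True)
    n = emb.shape[0]
    sims = emb[1:] @ emb.T                          # (n - 1, n): successor of i vs. every j
    logits = sims / temperature
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    # Row i gives P(next frame = j | current frame = i); the last frame has no successor.
    return np.vstack([probs, np.zeros((1, n))])
```

The resulting matrix could then be passed to a sampling routine like the one sketched earlier to produce an output frame sequence.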
Research paper: Narasimhan, M., Ginosar, S., Owens, A., Efros, A. A., and Darrell, T., “Strumming to the Beat: Audio-Conditioned Contrastive Video Textures”, 2021. Link: https://arxiv.org/abs/2104.02687