AVCaps: An audio-visual dataset with modality-specific captions

The AVCaps dataset is an audio-visual captioning resource designed to advance research in multimodal machine perception. Derived from the VidOR dataset, it features 2061 video clips spanning a total of 28.8 hours.

For each clip, the dataset provides:

- Audio Captions: Up to 5 textual captions describing only the audio content, crowdsourced from annotators.
- Visual Captions: Up to 5 textual captions focusing solely on the visual content, annotated without access to audio.
- Audio-Visual Captions: Up to 5 captions describing the combined audio and visual content, capturing multimodal interactions.
- GPT-4 Generated Captions: Three additional audio-visual captions per clip, synthesized from the crowdsourced captions.

AVCaps is a valuable resource for researchers working on tasks such as multimodal captioning, audio-visual alignment, and video content understanding. By providing separate and combined modality-specific annotations, it enables fine-grained studies in the interaction and alignment of audio and visual modalities.

The video clips are provided in three ZIP files:

- train_videos.zip: 1661 training clips.
- val_videos.zip: 200 validation clips.
- test_videos.zip: 200 testing clips.

The captions are available in three JSON files:

- train_captions.json
- val_captions.json
- test_captions.json

Each JSON file contains entries with video filenames as keys, and the corresponding values include audio captions, visual captions, audio-visual captions, and LLM-generated audio-visual captions.
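As a rough illustration of the JSON layout described above, the following Python sketch loads one captions file and counts the captions stored under each annotation field. It assumes each top-level value is itself a mapping from a field name to either a list of caption strings or a single string; the description does not state the exact field names or value types, so the code discovers the field names from the file instead of hard-coding them. The function name `summarize_captions` is hypothetical and not part of the dataset.

```python
import json
from collections import Counter


def summarize_captions(json_path):
    """Return (number of clips, captions counted per annotation field).

    Assumed layout (not confirmed by the dataset description):
        {"<video filename>": {"<field name>": [caption, ...], ...}, ...}
    """
    with open(json_path, encoding="utf-8") as f:
        data = json.load(f)

    counts = Counter()
    for filename, fields in data.items():
        for field_name, captions in fields.items():
            # Count a list by its length; treat any other value
            # (for example a single string) as one caption.
            if isinstance(captions, list):
                counts[field_name] += len(captions)
            else:
                counts[field_name] += 1
    return len(data), dict(counts)
```

Calling `summarize_captions("train_captions.json")` after downloading the file would return the number of entries and a per-field caption count. If every training clip has an entry, the first value should match the 1661 clips in train_videos.zip, which makes this a quick consistency check between the captions and the video archives.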
