AVCaps: An audio-visual dataset with modality-specific captions
Description
The AVCaps dataset is an audio-visual captioning resource designed to advance research in multimodal machine perception. Derived from the VidOR dataset, it features 2061 video clips spanning a total of 28.8 hours.

For each clip, the dataset provides:
Audio Captions: Up to 5 textual captions describing only the audio content, crowdsourced from annotators.
Visual Captions: Up to 5 textual captions focusing solely on the visual content, annotated without access to audio.
Audio-Visual Captions: Up to 5 captions describing the combined audio and visual content, capturing multimodal interactions.
GPT-4 Generated Captions: Three additional audio-visual captions per clip, synthesized from the crowdsourced captions.

AVCaps is intended for researchers working on tasks such as multimodal captioning, audio-visual alignment, and video content understanding. Because it provides both modality-specific and combined annotations, it enables fine-grained studies of how the audio and visual modalities interact and align.

The video clips are provided in three ZIP files:
train_videos.zip: 1661 training clips.
val_videos.zip: 200 validation clips.
test_videos.zip: 200 testing clips.

The captions are available in three JSON files:
train_captions.json
val_captions.json
test_captions.json

Each JSON file is keyed by video filename. The value for each key holds that clip's audio captions, visual captions, audio-visual captions, and LLM-generated audio-visual captions.
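The caption file layout described above can be explored with a short script. This is a minimal sketch, not an official loader: the description does not list the field names used inside each entry, so the code discovers them at runtime instead of hard-coding them. The helper names `load_captions` and `caption_counts` are hypothetical, and each entry is assumed to be a JSON object whose fields hold either a list of captions or a single string.

```python
import json
from collections import Counter
from pathlib import Path


def load_captions(path):
    """Load one caption file into a dict keyed by video filename."""
    with Path(path).open(encoding="utf-8") as f:
        return json.load(f)


def caption_counts(captions):
    """Count captions per field across all clips.

    Field names are read from the data, since the dataset description
    does not spell them out.
    """
    counts = Counter()
    for entry in captions.values():
        if not isinstance(entry, dict):
            continue  # assumption: entries are objects; skip anything else
        for field, value in entry.items():
            # A field may hold a list of captions or a single string.
            counts[field] += len(value) if isinstance(value, list) else 1
    return counts


if __name__ == "__main__":
    path = Path("train_captions.json")  # file name taken from the description
    if path.exists():
        captions = load_captions(path)
        print(len(captions), "clips")
        for field, n in sorted(caption_counts(captions).items()):
            print(f"{field}: {n} captions")
```

Running the script next to train_captions.json prints the number of clips and the caption total for each field it finds, which is a quick way to confirm the per-clip limits stated above (up to 5 crowdsourced captions per modality and three GPT-4 captions).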
Publication Details
Subfield: Language and Linguistics
Field: Arts and Humanities
Domain: Social Sciences
Confidence Score: 61%
Source: Scholar Data Model