AVCaps: An audio-visual dataset with modality-specific captions

Sudarsanam, Parthasaarathy;Martín Morató, Irene;Hakala, Aapo;Virtanen, Tuomas

Description

The AVCaps dataset is an audio-visual captioning resource designed to advance research in multimodal machine perception. Derived from the VidOR dataset, it features 2061 video clips spanning a total of 28.8 hours.

For each clip, the dataset provides:

- Audio Captions: Up to 5 textual captions describing only the audio content, crowdsourced from annotators.
- Visual Captions: Up to 5 textual captions focusing solely on the visual content, annotated without access to audio.
- Audio-Visual Captions: Up to 5 captions describing the combined audio and visual content, capturing multimodal interactions.
- GPT-4 Generated Captions: Three additional audio-visual captions per clip, synthesized from the crowdsourced captions.

AVCaps is a valuable resource for researchers working on tasks such as multimodal captioning, audio-visual alignment, and video content understanding. By providing separate and combined modality-specific annotations, it enables fine-grained studies of the interaction and alignment of audio and visual modalities.

The video clips are provided in three ZIP files:

- train_videos.zip: 1661 training clips.
- val_videos.zip: 200 validation clips.
- test_videos.zip: 200 testing clips.

The captions are available in three JSON files:

- train_captions.json
- val_captions.json
- test_captions.json

Each JSON file contains entries with video filenames as keys, and the corresponding values include audio captions, visual captions, audio-visual captions, and LLM-generated audio-visual captions.
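The caption file layout described above can be explored with a short Python sketch. This is only an illustration under stated assumptions: the description says entries are keyed by video filename, but it does not give the key names used inside each entry, so the sketch reads field names from the data instead of hard-coding them. The helper names `load_captions` and `count_captions`, and the sample entry with its keys, are hypothetical.

```python
import json
from collections import Counter
from pathlib import Path

# Split sizes as stated in the dataset description.
SPLIT_SIZES = {"train": 1661, "val": 200, "test": 200}


def load_captions(path):
    """Load one caption file (e.g. train_captions.json) into a dict
    keyed by video filename."""
    with Path(path).open(encoding="utf-8") as f:
        return json.load(f)


def count_captions(captions):
    """Count captions per field across all clips.

    Field names are taken from the data rather than hard-coded, because
    the exact key names inside each entry are an unknown here.
    """
    totals = Counter()
    for entry in captions.values():
        for field, value in entry.items():
            # A field may hold a list of captions or a single string.
            totals[field] += len(value) if isinstance(value, list) else 1
    return totals


# Hypothetical entry for illustration only; real key names may differ.
example = {
    "clip_0001.mp4": {
        "audio_captions": ["A dog barks twice."],
        "visual_captions": [
            "A dog runs across a lawn.",
            "A brown dog plays outside.",
        ],
    }
}

print(count_captions(example))
print(sum(SPLIT_SIZES.values()))  # total number of clips across the three splits
```

With the real files, calling `count_captions(load_captions("train_captions.json"))` should show which fields are present and how many captions each holds, which is a quick way to confirm the actual key names before writing a data loader.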

Metrics

Dataset Index: 0.3
FAIR Score: 79%
Citations: 0
Mentions: 0

Publication Details

DOI

Publisher: Zenodo

Assigned Domain

Subfield: Language and Linguistics
Field: Arts and Humanities
Domain: Social Sciences
Confidence Score: 61%
Source: Scholar Data Model

Normalization Factors

FT: 15.38
CTw: 1.00
MTw: 1.00