Automated Author Profile
Cartwright, Mark
New York University
0000-0002-5908-390x
Current S-Index
Sum of Dataset Indices for all datasets
Average Dataset Index per Dataset
Average Dataset Index per dataset
Total Datasets
Total datasets for this author
Average FAIR Score
Average FAIR Score per dataset
Total Citations
Total citations to the author's datasets
Total Mentions
Total mentions of the author's datasets
S-Index Interpretation
The S-Index (Sharing Index) is a comprehensive metric that represents the cumulative impact of all your datasets. It is calculated as the sum of Dataset Index scores across all your claimed datasets.
What it means:
- A higher S-index indicates greater overall impact of your datasets relative to typical datasets in their fields of research
- The S-Index grows as you add more datasets or as existing datasets gain more citations and mentions
- It provides a single number to track your research data impact over time
Current S-Index: 40.6 (sum of the Dataset Index scores of 21 datasets)
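The calculation described above can be sketched in a few lines. This is only an illustration of the stated definition (a sum of Dataset Index scores); the function names and the example scores are hypothetical and are not taken from the profile.

```python
def s_index(dataset_indices):
    """Return the S-Index: the sum of the per-dataset Dataset Index scores."""
    return sum(dataset_indices)


def average_dataset_index(dataset_indices):
    """Return the mean Dataset Index per dataset (0.0 for an empty list)."""
    if not dataset_indices:
        return 0.0
    return s_index(dataset_indices) / len(dataset_indices)


# Hypothetical Dataset Index scores for three datasets (invented values).
example_scores = [2.5, 0.5, 1.0]
print(s_index(example_scores))                # 4.0
print(average_dataset_index(example_scores))
```

Applied to the figures reported above, an S-Index of 40.6 over 21 datasets corresponds to an average Dataset Index of roughly 1.93 per dataset.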
S-Index Over Time
Cumulative Citations Over Time
Cumulative Mentions Over Time
Datasets
Audio files for the paper:
Jonathan Morse, Azadeh Naderi, Swen Gaudl, Mark Cartwright, Amy K. Hoover, Mark J. Nelson (2025). Expressive range characterization of open text-to-audio models. In: Proceedings of the 21st AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment.
Contents:
- fig1_samples.zip: Generated audio for the examples in Fig. 1. Two prompts; two models; 100 samples for each.
- thunder_samples.zip: Generated audio for the running "thunder" example. One prompt; two models; 100 samples for each. Source for Figs. 2-4.
- esc50_samples.zip: Generated audio for the prompt "Sound of X" for each label X in the ESC-50 environmental audio dataset. Fifty prompts; three models; 100 samples for each. Source for Figs. 5-6 and Table 1.
- generation_scripts.zip: Python scripts used to generate audio from the three models.
Authors
- Morse, Jonathan ;
- Naderi, Azadeh ;
- Gaudl, Swen ;
- Cartwright, Mark ;
- Hoover, Amy K. ;
- Nelson, Mark
Version 1.0, October 2024
Created by
Mithun Manivannan (1), Vignesh Nethrapalli (1), Mark Cartwright (1)
1. Sound Interaction and Computing Lab, New Jersey Institute of Technology
Publication
If using this data in an academic work, please reference the DOI and version, as well as cite the following paper, which presented the data collection procedure and the first version of the dataset:
Manivannan, M., Nethrapalli, V., Cartwright, M. EmotionCaps: Enhancing Audio Captioning Through Emotion-Augmented Data Generation. arXiv preprint arXiv:2410.12028, 2024.
Description
EmotionCaps is a ChatGPT-assisted, weakly-labeled audio captioning dataset developed to bridge the gap between soundscape emotion recognition (SER) and automated audio captioning (AAC). Created through a three-stage pipeline, the dataset leverages ground-truth annotations from AudioSet SL, which are enhanced by ChatGPT using tailored prompts and emotions assigned via a soundscape emotion recognition model trained on the Emo-Soundscapes dataset. It comprises four subsets of captions for 120,071 audio clips, each reflecting a different prompt variation: WavCaps-like, Scene-Focused, Emotion Addon, and Emotion Rewrite. The average word counts for these subsets are: WavCaps-like (12.61), Scene-Focused (14.04), Emotion Addon (18.35), and Emotion Rewrite (18.65). The higher word counts for the emotion prompts show how much longer captions become when emotion information is integrated.
Audio Data
The audio data is from AudioSet SL, the strongly-labeled subset of 120,071 audio clips from the larger AudioSet dataset.
Synthetic Captions
The synthetic captions were generated using a three-stage pipeline, beginning with training a soundscape emotion recognition model. This model assesses the valence and arousal of each audio clip, mapping the resulting vector to an emotion identifier. Next, we leveraged the ground-truth annotations from AudioSet SL and extracted the list of sound events. Using these sound events, we employed ChatGPT to create different variations of captions by applying distinct prompts.
We first used the WavCaps prompt for AudioSet SL as a base, the output of which we call WavCaps-like. Building on this, we created three new prompt variations: (1) scene-focused, which is a modified WavCaps prompt that describes the scene; (2) emotion addon, which is an extension of the scene-focused prompt, where an emotion is appended to the list of sound events to guide the caption generation; and (3) emotion rewrite, which consists of a two-step prompt where ChatGPT first generates the scene-focused caption, then is instructed to rewrite it with a specific emotion in mind.
Using these four prompt styles (WavCaps, Scene-Focused, Emotion Addon, and Emotion Rewrite), along with the AudioSet SL sound events and predicted emotions, we employed ChatGPT-3.5 Turbo to generate four corresponding caption variations for the dataset.
Each caption variation has been organized into a separate CSV file for clarity and accessibility. All files correspond to the same set of audio clips from AudioSet SL, with the key distinction being the caption variation associated with each clip. The different subsets are designed to be used independently, as they each fulfill specific roles in understanding the impact of emotion in audio captions.
- wavcaps-like.csv: Contains captions generated using the WavCaps prompt, serving as the baseline before emotion is introduced.
- scene-focused.csv: Provides captions focused on describing the scene or environment of the audio clip, without emotion integration.
- emotion-addon.csv: Captions where emotion data is appended to the scene-focused base caption.
- emotion-rewrite.csv: Captions that are completely rewritten based on the scene-focused base caption and the assigned emotion.
This structure allows users to explore how emotional content influences captioning models by comparing the variations both with and without emotional enrichment.
Columns in CSV files
- segment_id: The ID of the audio recording in AudioSet SL. These are in the form
- caption: The caption generated for each audio clip, corresponding to the specific subset (e.g., WavCaps, Scene-Focused, Emotion Addon, or Emotion Rewrite) as indicated by the file name.
Conditions of use
Dataset created by Mithun Manivannan, Vignesh Nethrapalli, Mark Cartwright
The EmotionCaps dataset is offered free of charge under the terms of the Creative Commons Attribution 4.0 International (CC BY 4.0) license:
https://creativecommons.org/licenses/by/4.0/
The dataset and its contents are made available on an “as is” basis and without warranties of any kind, including without limitation satisfactory quality and conformity, merchantability, fitness for a particular purpose, accuracy or completeness, or absence of errors. Subject to any liability that may not be excluded or limited by law, New Jersey Institute of Technology is not liable for, and expressly excludes all liability for, loss or damage however and whenever caused to anyone by any use of the EmotionCaps dataset or any part of it.
Feedback
Please help us improve EmotionCaps by sending your feedback to:
- Mithun Manivannan: [email protected]
- Mark Cartwright: [email protected]
In case of a problem, please include as many details as possible.
Acknowledgments
This work was partially supported by the New Jersey Institute of Technology Honors Summer Research Institute (HSRI).
Authors
- Manivannan, Mithun ;
- Nethrapalli, Vignesh ;
- Cartwright, Mark
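As a rough illustration of how the EmotionCaps caption files might be compared, the sketch below computes the average caption length for a CSV with the `segment_id` and `caption` columns described above. The function name, the whitespace-based word counting, and the sample rows are assumptions made for this example; the authors' own word-count procedure is not specified here, so the result may not match the reported averages exactly.

```python
import csv
import io


def average_word_count(csv_file):
    """Mean number of whitespace-separated words in the `caption` column."""
    reader = csv.DictReader(csv_file)
    counts = [len(row["caption"].split()) for row in reader]
    return sum(counts) / len(counts) if counts else 0.0


# Hypothetical stand-in for one caption file; the IDs and captions are invented.
sample = io.StringIO(
    "segment_id,caption\n"
    "clip_a,A dog barks while cars pass by\n"
    "clip_b,Rain falls steadily on a roof\n"
)
print(average_word_count(sample))  # 6.5
```

For a real file, the same function could be called on an open file handle, for example `open("emotion-rewrite.csv", newline="")`, and the results compared across the four subsets.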
Open-set Tagging (OST) is a synthetic dataset of 1s clips used to evaluate source-centric representation learning models in the paper Compositional Audio Representation Learning.
Due to the size of the dataset, we share only the source files and provide scripts to generate the dataset.
The dataset generation process is as follows:
1. From single-source FSD50K audio files, we generate a dataset of 10s soundscapes called Open-set Soundscapes (OSS) using Scaper.
2. We then center a 1s window on each sound event in the 10s soundscapes to generate Open-set Tagging (OST), which contains ~500k clips. If you are not going to use OSS, you can choose to synthesize it without audio; this will synthesize only the JAMS annotation files needed for the 1s clips. Using the OSS JAMS files, OST clips can be generated deterministically.
There are five dataset variants (~17GB each), each with a different random assignment of classes to the known and unknown class categories. For further details, refer to our previous paper Multi-label open-set audio classification. In this work, OST dataset variant 1 is referred to as OST for simplicity. We also introduce a tiny version of the dataset called OST-Tiny, which contains ~20k clips and only 10 known classes. This is convenient for faster prototyping and for evaluating models in a more challenging open-set classification scenario.
Authors
- Sridhar, Sripathi ;
- Cartwright, Mark
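The windowing step for OST can be sketched as follows. This is a minimal illustration, not the authors' generation script: the function name and parameters are hypothetical, and shifting the window so that it stays inside the 10s soundscape is an assumption, since the description above does not say how events near the edges are handled.

```python
def centered_window(event_onset, event_duration, window=1.0, soundscape_duration=10.0):
    """Return (start, end) in seconds of a window centered on a sound event."""
    center = event_onset + event_duration / 2.0
    start = center - window / 2.0
    # Assumption: shift the window so it never extends past the soundscape edges.
    start = min(max(start, 0.0), soundscape_duration - window)
    return start, start + window


print(centered_window(3.0, 2.0))  # (3.5, 4.5)
print(centered_window(0.0, 0.4))  # (0.0, 1.0), shifted to stay inside the soundscape
```

Because the window depends only on the annotated onset and duration of each event, the same annotations always yield the same clips, which is consistent with the deterministic generation from JAMS files described above.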
Version 1.0, March 2024
Created by
Lloyd May (1), Keita Ohshiro (2,3), Khang Dang (2,3), Sripathi Sridhar (2,3), Jhanvi Pai (2,3), Magdalena Fuentes (4), Sooyeon Lee (3), Mark Cartwright (2,3,4)
1. Center for Computer Research in Music and Acoustics, Stanford University
2. Sound Interaction and Computing Lab, New Jersey Institute of Technology
3. Department of Informatics, New Jersey Institute of Technology
4. Music and Audio Research Lab, New York University
Publication
If using this data in an academic work, please reference the DOI and version, as well as cite the following paper, which presented the data collection procedure and the first version of the dataset:
May, L., Ohshiro, K., Dang, K., Sridhar, S., Pai, J., Fuentes, M., Lee, S., Cartwright, M. Unspoken Sound: Identifying Trends in Non-Speech Audio Captioning on YouTube. In Proceedings of the ACM Conference on Human Factors in Computing Systems (CHI), 2024.
Description
The YouTube NSI Captioning Dataset was developed to analyze the contemporary and historical state of non-speech information (NSI) captioning on YouTube. NSI includes information about non-speech sounds such as environmental sounds, sound effects, incidental sounds, and music, as well as additional narrative information and extra-speech information (ESI), which gives context to spoken or signed language such as manner of speech (e.g., "[Whispering] Oh no") or speaker label (e.g., "[Juan] Oh no"). The dataset contains measures of estimated and annotated NSI in the captions of two different samples of videos: a popular video sample and a studio video sample. The aim of the popular sample is to understand the captioning practices in a broad spectrum of popular, impactful videos on YouTube. In contrast, the aim of the studio sample is to examine captioning practices among the top-tier production houses, often viewed as industry benchmarks due to their influence and the vast resources they have available for accessibility.
Using the YouTube API, we queried for videos in these two samples for each month from 2013 to 2022. We then estimated which captions contain NSI by searching for non-alphanumeric symbols that are indicative of NSI, e.g., "[" and "]" (see Section 3.2 of the paper for a full list). In addition, the research team manually annotated which captions have NSI from a subset of approximately 1800 videos from years 2013, 2018, and 2022. Please see Section 3.3 of the paper for details of the annotation process.
The resulting YouTube NSI Captioning Dataset consists of NSI information from ~715k videos containing ~273M lines of captions, ~6M of which are estimated instances of NSI. These videos span 10 years and 21 topics. The annotated subset consists of 1799 videos with a total of ~36k annotated caption lines, ~114k of which are instances of NSI annotated on 7 different categories. These videos span 3 years (2013, 2018, and 2022) and 20 YouTube-assigned topics. Each video was annotated by two annotators along with the consensus annotation. The dataset contains the links to the YouTube videos, video metadata from the YouTube API, and measures of both estimated and annotated NSI. Due to copyright concerns, we are only publicly releasing data consisting of summary NSI measures for each video. If you need access to the raw data used to create these summary NSI measures, contact Mark Cartwright at [email protected]
- _full_set_aggregate.csv: Data file containing the full set of video data with measures of estimated NSI.
- annotated_subset_aggregate.csv: Data file containing the smaller annotated subset of video data with measures of both annotated and estimated NSI.
Columns
The following columns are present in both data files.
- video_id: The YouTube video ID.
- year: The year associated with the time period from which the video was sampled.
- sample: The sample which the video is from (i.e., popular or studio).
- sampling_period_start_date: The start date of the time period from which the video was sampled.
- sampling_period_end_date: The end date of the time period from which the video was sampled.
- caption_type: This can take one of three values: auto, which indicates a caption was provided by YouTube's automated caption system; manual, which indicates a caption was provided by the uploader; or none, which indicates that no captions are present for the video.
- duration_minutes: The duration of the video in minutes.
- channel_id: The ID that YouTube uses to uniquely identify the channel.
- published_datetime: The date and time at which the video was published on YouTube.
- youtube_topics: The YouTube-provided list of Wikipedia URLs that provide a description of the video's content.
- category_id: The YouTube video category associated with the video.
- view_count: The count of views on YouTube at the time of sampling (Spring 2023).
- like_count: The count of likes on YouTube at the time of sampling (Spring 2023).
- comment_count: The count of comments on YouTube at the time of sampling (Spring 2023).
- high_level_topics: List of topics at a higher semantic level than youtube_topics that provide a description of the video's content. See paper for details on the mapping between youtube_topics and high_level_topics.
- __ : The remainder of the columns take this form, with the values listed below.
Values for :
- estimated_nsi: This NSI type is an estimation of NSI based on the presence of particular non-alphanumeric characters that are indicative of NSI as described in Section 3.2 of the paper.
- general_nsi (only in annotated_subset_aggregate.csv): The most general of NSI types that is inclusive of music_nsi, environmental_nsi, additionalnarrative_nsi, and quotedspeech_nsi. All of these NSI types are included in the calculation of measures associated with general_nsi. Note that misc_nsi and nonenglish_captions are not included as those may or may not contain NSI, and thus, we opt for precision over recall. Not present for the unlabeled
- music_nsi (only in annotated_subset_aggregate.csv): Any genre of music, whether diegetic or not.
- environmental_nsi (only in annotated_subset_aggregate.csv): Environmental sounds, sound effects, and incidental sounds, i.e., non-music and non-speech sounds. This includes non-verbal vocalizations like laughter, grunts, and crying, provided they aren't used to modify speech.
- extraspeech_nsi (only in annotated_subset_aggregate.csv): Extra-speech Information (ESI), i.e., text that gives added context to spoken or signed language.
- additionalnarrative_nsi (only in annotated_subset_aggregate.csv): Additional narrative information in the form of descriptive text that doesn't pertain directly to sounds.
- quotedspeech_nsi (only in annotated_subset_aggregate.csv): Quoted speech, i.e., captions containing internal quotation marks.
- misc_nsi (only in annotated_subset_aggregate.csv): Unsure, misc, or ambiguous, i.e., instances where the appropriate label is unclear or the caption doesn't fit current categories.
- nonenglish_captions (only in annotated_subset_aggregate.csv): Captions that are not written in English and thus have uncertain NSI status.
Values for :
- count: The number of captions identified as containing NSI of the specified type in the video.
- presence: Indication of whether there is NSI of the specified type present in the video. 1 if present (i.e., count > 0), 0 if not present (i.e., count == 0).
- count_per_minute: A measure of the density of NSI captions. count_per_minute = count / duration_minutes
- count_per_minute_if_present: If presence == 1, then count_per_minute; otherwise, NaN. This is used for computing the aggregate CPMIP measure, which, as discussed in the paper, is intended to be a measure of the quality of NSI captions based on the assumption that more frequently captioned NSI within a video is an indicator of better NSI captioning. See Section 5 of the paper for details.
Conditions of use
Dataset created by Lloyd May, Keita Ohshiro, Khang Dang, Sripathi Sridhar, Jhanvi Pai, Magdalena Fuentes, Sooyeon Lee, and Mark Cartwright
The YouTube NSI Captioning Dataset is offered free of charge under the terms of the Creative Commons Attribution 4.0 International (CC BY 4.0) license: https://creativecommons.org/licenses/by/4.0/
Feedback
Please help us improve the YouTube NSI Captioning Dataset by sending your feedback to:
- Mark Cartwright: [email protected]
In case of a problem, please include as many details as possible.
Authors
- May, Lloyd ;
- Ohshiro, Keita ;
- Dang, Khang ;
- Sridhar, Sripathi ;
- Pai, Jhanvi ;
- Fuentes, Magdalena ;
- Lee, Sooyeon ;
- Cartwright, Mark
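The estimated-NSI measures defined in the YouTube NSI Captioning Dataset description can be illustrated with a small sketch. It is not the authors' code: the function name and dictionary keys are hypothetical, only the two marker symbols named in the description ("[" and "]") are used (the paper's full list is longer), and the video duration is assumed to be greater than zero.

```python
import math

# Only the two symbols named in the description; the paper's full list is longer.
NSI_MARKERS = ("[", "]")


def estimated_nsi_measures(caption_lines, duration_minutes):
    """Compute count, presence, and density measures for one video's captions."""
    # A caption line counts as estimated NSI if it contains any marker symbol.
    count = sum(1 for line in caption_lines if any(m in line for m in NSI_MARKERS))
    presence = 1 if count > 0 else 0
    count_per_minute = count / duration_minutes  # assumes duration_minutes > 0
    return {
        "count": count,
        "presence": presence,
        "count_per_minute": count_per_minute,
        # NaN when no NSI is present, so such videos drop out of averages.
        "count_per_minute_if_present": count_per_minute if presence == 1 else math.nan,
    }


# Invented caption lines for a hypothetical two-minute video.
captions = ["[Whispering] Oh no", "Hello there", "[music]"]
print(estimated_nsi_measures(captions, duration_minutes=2.0))
```

Using NaN rather than 0 for videos without NSI means that an average of `count_per_minute_if_present` that skips NaN values reflects caption density only among videos that caption NSI at all.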
Version 1.0, March 2024Created byLloyd May (1), Keita Ohshiro (2,3), Khang Dang (2,3), Sripathi Sridhar (2,3), Jhanvi Pai (2,3), Magdalena Fuentes (4), Sooyeon Lee (3), Mark Cartwright (2,3,4)Center for Computer Research in Music and Acoustics, Stanford UniversitySound Interaction and Computing Lab, New Jersey Institute of TechnologyDepartment of Informatics, New Jersey Institute of TechnologyMusic and Audio Research Lab, New York UniversityPublicationIf using this data in an academic work, please reference the DOI and version, as well as cite the following paper, which presented the data collection procedure and the first version of the dataset:May, L., Ohshiro, K., Dang, K., Sridhar, S., Pai, J., Fuentes, M., Lee, S., Cartwright, M. Unspoken Sound: Identifying Trends in Non-Speech Audio Captioning on YouTube. In Proceedings of the ACM Conference on Human Factors in Computing Systems (CHI), 2024.DescriptionThe YouTube NSI Captioning Dataset was developed to analyze the contemporary and historical state of non-speech information (NSI) captioning on YouTube. NSI includes information about non-speech sounds such as environmental sounds, sound effects, incidental sounds, and music, as well as additional narrative information and extra-speech information (ESI), which gives context to spoken or signed language such as manner of speech (e.g. "[Whispering] Oh no") or speaker label (e.g., "[Juan] Oh no"). The dataset contains measures of estimated and annotated NSI in the captions of two different samples of videos: a popular video sample and a studio video sample. The aim of the popular sample is to understand the captioning practices in a broad spectrum of popular, impactful videos on YouTube. In contrast, the aim of the studio sample is to examine captioning practices among the top-tier production houses, often viewed as industry benchmarks due to their influence and vast resources available for accessibility. 
Using the YouTube API, we queried for videos in these two samples for each month from 2013 to 2022. We then estimated which captions contain NSI by searching for non-alphanumeric symbols that are indicative of NSI, e.g., "[" and "]" (see Section 3.2 of the paper for a full list). In addition, the research team manually annotated which captions have NSI from a subset of approximately 1800 videos from years 2013, 2018, and 2022. Please see the Section 3.3 of the paper for details of the annotation process.The resulting YouTube NSI Captioning Dataset consists of NSI information from ~715k videos containing ~273M lines of captions, ~ 6M of which are estimated instances of NSI. These videos span 10 years and 21 topics. The annotated subset consists of 1799 videos with a total of ~36k annotated captions lines, ~114k of which are instances of NSI annotated on 7 different categories. These videos span 3 years (2013, 2018, and 2022) and 20 YouTube-assigned topics. Each video was annotated by two annotators along with the consensus annotation. The dataset contains the links to the YouTube videos, video metadata from the YouTube API, and measures of both estimated and annotated NSI. Due to copyright concerns, we are only publicly releasing data consisting of summary NSI measures for each video. 
If you need access to the raw data used to create these summary NSI measures, contact Mark Cartwright at [email protected]_full_set_aggregate.csv : Data file containing the full set of video data with measures of estimated NSI.annotated_subset_aggregate.csv : Data file containing the smaller annotated subset of video data with measures of both annotated and estimated NSI.ColumnsThe following columns are present in both data files.video_id : The YouTube video IDyear : The year associated with the time period from which the video was sampled.sample : The sample which the video is from (i.e., popular or studio)sampling_period_start_date : The start date of the time period from which the video was sampled.sampling_period_end_date : The end date of the time period from which the video was sampled.caption_type : This can take one of three values: auto which indicates a caption was provided by YouTube's automated caption system, manual which indicates a caption was provided by the uploader, or none which indicates that no captions are present for the video.duration_minutes : The duration of the video in minutes.channel_id : The ID that YouTube uses to uniquely identify the channel.published_datetime : The date and time at which the video was published on YouTube.youtube_topics : The YouTube-provided list of Wikipedia URLs that provide a description of the video's content.category_id : The YouTube video category associated with the video.view_count : The count of views on YouTube at the time of sampling (Spring 2023).like_count : The count of likes on YouTube at the time of sampling (Spring 2023).comment_count : The count of comments on YouTube at the time of sampling (Spring 2023).high_level_topics : List of topics at a higher semantic level than youtube_topics that provide a description of the video's content. 
See paper for details on the mapping between youtube_topics and high_level_topics.
__ : The remainder of the columns take this form, combining an NSI type and a measure, with the values listed below.
Values for the NSI type:
- estimated_nsi : An estimate of NSI based on the presence of particular non-alphanumeric characters that are indicative of NSI, as described in Section 3.2 of the paper.
- general_nsi (only in annotated_subset_aggregate.csv) : The most general NSI type, inclusive of music_nsi, environmental_nsi, additionalnarrative_nsi, and quotedspeech_nsi. All of these NSI types are included in the calculation of measures associated with general_nsi. Note that misc_nsi and nonenglish_captions are not included because they may or may not contain NSI; we thus opt for precision over recall. Not present for the unlabeled
- music_nsi (only in annotated_subset_aggregate.csv) : Any genre of music, whether diegetic or not.
- environmental_nsi (only in annotated_subset_aggregate.csv) : Environmental sounds, sound effects, and incidental sounds, i.e., non-music and non-speech sounds. This includes non-verbal vocalizations like laughter, grunts, and crying, provided they aren't used to modify speech.
- extraspeech_nsi (only in annotated_subset_aggregate.csv) : Extra-speech Information (ESI), i.e., text that gives added context to spoken or signed language.
- additionalnarrative_nsi (only in annotated_subset_aggregate.csv) : Additional narrative information in the form of descriptive text that doesn't pertain directly to sounds.
- quotedspeech_nsi (only in annotated_subset_aggregate.csv) : Quoted speech, i.e., captions containing internal quotation marks.
- misc_nsi (only in annotated_subset_aggregate.csv) : Unsure, misc, or ambiguous, i.e., instances where the appropriate label is unclear or the caption doesn't fit current categories.
- nonenglish_captions (only in annotated_subset_aggregate.csv) : Captions that are not written in English and thus have uncertain NSI status.
Values for the measure:
- count : The number of captions in the video identified as containing NSI of the specified type.
- presence : Whether NSI of the specified type is present in the video: 1 if present (i.e., count > 0), 0 if not present (i.e., count == 0).
- count_per_minute : A measure of the density of NSI captions: count_per_minute = count / duration_minutes.
- count_per_minute_if_present : If presence == 1, then count_per_minute; otherwise NaN. This is used for computing the aggregate CPMIP measure, which, as discussed in the paper, is intended to measure the quality of NSI captions, on the assumption that more frequently captioned NSI within a video indicates better NSI captioning. See Section 5 of the paper for details.
Conditions of use
Dataset created by Lloyd May, Keita Ohshiro, Khang Dang, Sripathi Sridhar, Jhanvi Pai, Magdalena Fuentes, Sooyeon Lee, and Mark Cartwright.
The YouTube NSI Captioning Dataset is offered free of charge under the terms of the Creative Commons Attribution 4.0 International (CC BY 4.0) license: https://creativecommons.org/licenses/by/4.0/
Feedback
Please help us improve the YouTube NSI Captioning Dataset by sending your feedback to:
Mark Cartwright: [email protected]
In case of a problem, please include as many details as possible.
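The per-video measures described above (count, presence, count_per_minute, and count_per_minute_if_present) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the example values are hypothetical, and aggregating CPMIP as a mean over videos with NSI present is an assumption made here; Section 5 of the paper gives the actual definition.

```python
import math


def nsi_measures(count, duration_minutes):
    """Per-video measures for one NSI type, following the column descriptions."""
    presence = 1 if count > 0 else 0
    count_per_minute = count / duration_minutes
    # NaN marks videos without NSI so they can be skipped when aggregating.
    cpmip = count_per_minute if presence == 1 else math.nan
    return {
        "count": count,
        "presence": presence,
        "count_per_minute": count_per_minute,
        "count_per_minute_if_present": cpmip,
    }


def mean_ignoring_nan(values):
    """Mean over the non-NaN entries (assumed aggregation, not from the paper)."""
    kept = [v for v in values if not math.isnan(v)]
    return sum(kept) / len(kept) if kept else math.nan


# Hypothetical (count, duration_minutes) pairs for three videos.
videos = [(6, 3.0), (0, 10.0), (2, 4.0)]
rows = [nsi_measures(c, d) for c, d in videos]
aggregate_cpmip = mean_ignoring_nan(
    [r["count_per_minute_if_present"] for r in rows]
)
```

Keeping NaN rather than 0 for videos without NSI means the aggregate reflects only videos where NSI was captioned at all, which matches the stated intent of count_per_minute_if_present.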
Authors
- May, Lloyd ;
- Ohshiro, Keita ;
- Dang, Khang ;
- Sridhar, Sripathi ;
- Pai, Jhanvi ;
- Fuentes, Magdalena ;
- Lee, Sooyeon ;
- Cartwright, Mark
Created by Yu Wang, Mark Cartwright, and Juan Pablo Bello
Publication
If using this data in academic work, please cite the following paper, which presented this dataset:
Y. Wang, M. Cartwright, and J. P. Bello. "Active Few-Shot Learning for Sound Event Detection", INTERSPEECH, 2022
Description
SONYC-FSD-SED is an open dataset of programmatically mixed audio clips that simulates audio data in an environmental sound monitoring system, where sound class occurrences and co-occurrences exhibit seasonal periodic patterns. We use recordings collected from the Sound of New York City (SONYC) acoustic sensor network [1] as backgrounds, and single-labeled clips in the FSD50K dataset [2] as foreground events, to generate 576,591 10-second strongly-labeled soundscapes with Scaper (including 111,294 additional test soundscapes for the sampling-window experiment). Instead of sampling foreground sound events uniformly, we simulate the occurrence probability of each class at different times in a year, creating more realistic temporal characteristics.
Source material and annotations
Due to the large size of the dataset, instead of releasing the raw audio files, we release the source material and soundscape annotations in JAMS format, which can be used to reproduce SONYC-FSD-SED using Scaper with the script in the project repository.
Background material from SONYC recordings
We pick a sensor from the SONYC sensor network and subsample from the recordings it collected within one year (2017). We categorize these ∼550k 10-second clips into 96 bins based on timestamps, where each bin represents a unique combination of month of the year (12), day of the week (weekday or weekend), and time of day (four 6-hour blocks). Next, we run a pre-trained urban sound event classifier over all recordings and filter out clips with active sound classes. We do not filter out footstep and bird since they appear too frequently; instead, we remove these two classes from the foreground sound material. Then, from each bin, we choose the clip with the lowest sound pressure level, yielding 96 background clips.
Foreground material from FSD50K
We follow the same filtering process as in FSD-MIX-SED to get the subset of FSD50K with short single-labeled clips. In addition, we remove two classes, "Chirp_and_tweet" and "Walk_and_footsteps", that exist in our SONYC background recordings. This results in 87 sound classes. vocab.json contains the list of 87 classes; each class is labeled by its index in the list. 0-42: train, 43-56: val, 57-86: test.
Occurrence probability modelling
For each class, we model its occurrence probability within a year. We use von Mises probability density functions to simulate the probability distribution over different weeks in a year and hours in a day, considering their cyclic characteristics: \(f(x \mid \mu, \kappa) = \frac{e^{\kappa \cos(x - \mu)}}{2\pi I_0(\kappa)}\), where \(I_0(\kappa)\) is the modified Bessel function of the first kind of order 0, and \(\mu\) and \(1/\kappa\) are analogous to the mean and variance in the normal distribution. We randomly sample \((\mu_{year}, \mu_{day})\) from \([-\pi, \pi]\) and \((\kappa_{year}, \kappa_{day})\) from \([0, 10]\). We also randomly assign \(p_{weekday} \in [0, 1]\) and \(p_{weekend} = 1 - p_{weekday}\) to simulate the probability distribution over different days in a week. Finally, we get the probability distribution over the entire year with a 1-hour resolution. At a given timestamp, we integrate \(f_{year}\) and \(f_{day}\) over the 1-hour window and multiply them together with \(p_{weekday}\) or \(p_{weekend}\), depending on the day. To speed up the subsequent sampling process, we scale the final probability distribution using a temperature parameter randomly sampled from \([2, 3]\).
Files
- SONYC_FSD_SED.source.tar.gz: 96 SONYC backgrounds and 10,158 foreground sounds in .wav format. The original file size is 2GB.
- SONYC_FSD_SED.annotations.tar.gz: 465,467 JAMS files. The original file size is 57GB.
- SONYC_FSD_SED_add_test.annotations.tar.gz: 111,294 JAMS files for additional test data. The original file size is 14GB.
- vocab.json: 87 classes.
- occ_prob_per_cl.pkl: Occurrence probability for each foreground sound class.
References
[1] J. P. Bello, C. T. Silva, O. Nov, R. L. DuBois, A. Arora, J. Salamon, C. Mydlarz, and H. Doraiswamy. "SONYC: A system for monitoring, analyzing, and mitigating urban noise pollution", Commun. ACM, 2019.
[2] E. Fonseca, X. Favory, J. Pons, F. Font, and X. Serra. "FSD50K: an Open Dataset of Human-Labeled Sound Events", arXiv:2010.00475, 2020.
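As a rough illustration of the occurrence probability modelling described above, the following Python sketch evaluates the von Mises density, integrates it over a 1-hour window, and multiplies the yearly and daily terms by the weekday or weekend weight. It is not the authors' script. The linear mapping of time onto \([-\pi, \pi]\), the midpoint-rule integration, and all function names are assumptions made for this example, and the temperature scaling is left out because the description does not give its exact form.

```python
import math
import random
from datetime import datetime, timedelta


def bessel_i0(kappa, terms=60):
    """Modified Bessel function of the first kind, order 0 (power series)."""
    total, term = 0.0, 1.0
    for k in range(terms):
        total += term
        term *= (kappa / 2.0) ** 2 / (k + 1) ** 2
    return total


def make_von_mises_pdf(mu, kappa):
    """Return f(x) = exp(kappa * cos(x - mu)) / (2 * pi * I0(kappa))."""
    norm = 2.0 * math.pi * bessel_i0(kappa)
    return lambda x: math.exp(kappa * math.cos(x - mu)) / norm


def integrate(f, a, b, steps=20):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))


def to_angle(position, period):
    """Map a position in [0, period] linearly onto [-pi, pi] (assumed mapping)."""
    return -math.pi + 2.0 * math.pi * position / period


def sample_class_params(rng):
    """Draw one class's parameters from the ranges given in the description."""
    p_weekday = rng.uniform(0.0, 1.0)
    return {
        "mu_year": rng.uniform(-math.pi, math.pi),
        "mu_day": rng.uniform(-math.pi, math.pi),
        "kappa_year": rng.uniform(0.0, 10.0),
        "kappa_day": rng.uniform(0.0, 10.0),
        "p_weekday": p_weekday,
        "p_weekend": 1.0 - p_weekday,
    }


HOURS_PER_YEAR = 365 * 24  # 2017 is not a leap year


def hourly_probability(hour_of_year, params):
    """Unnormalised occurrence probability for one 1-hour window of 2017."""
    f_year = make_von_mises_pdf(params["mu_year"], params["kappa_year"])
    f_day = make_von_mises_pdf(params["mu_day"], params["kappa_day"])
    hour_of_day = hour_of_year % 24
    p_year = integrate(f_year, to_angle(hour_of_year, HOURS_PER_YEAR),
                       to_angle(hour_of_year + 1, HOURS_PER_YEAR))
    p_day = integrate(f_day, to_angle(hour_of_day, 24),
                      to_angle(hour_of_day + 1, 24))
    # Saturday and Sunday count as weekend (weekday() returns 5 or 6).
    when = datetime(2017, 1, 1) + timedelta(hours=hour_of_year)
    p_week = params["p_weekend"] if when.weekday() >= 5 else params["p_weekday"]
    return p_year * p_day * p_week


# Hypothetical usage: one class, probability for the window starting at noon on 1 January.
params = sample_class_params(random.Random(0))
print(hourly_probability(12, params))
```

The released occ_prob_per_cl.pkl holds the probabilities actually used for the dataset; a sketch like this is mainly useful for understanding how the yearly, daily, and weekday terms combine.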
Authors
- Wang, Yu ;
- Cartwright, Mark ;
- Bello, Juan Pablo