Scholar Data

First DIHARD Challenge Evaluation - Nine Sources

Introduction

First DIHARD Challenge Evaluation - Nine Sources was developed by the Linguistic Data Consortium (LDC) and contains approximately 18 hours of English and Chinese speech data along with corresponding annotations used in support of the First DIHARD Challenge.

The First DIHARD Challenge was an attempt to reinvigorate work on diarization through a shared task focusing on "hard" diarization; that is, speech diarization for challenging corpora where there was an expectation that existing state-of-the-art systems would fare poorly. As such, it included speech from a wide sampling of domains representing diversity in number of speakers, speaker demographics, interaction style, recording quality, and environmental conditions, including, but not limited to: clinical interviews, extended child language acquisition recordings, YouTube recordings, and conversations collected in restaurants.

Data

This release, when combined with First DIHARD Challenge Evaluation - SEEDLingS (LDC2019S13), contains the evaluation set audio data and annotation as well as the official scoring tool. The development data for the First DIHARD Challenge is also available from LDC as Eight Sources (LDC2019S09) and SEEDLingS (LDC2019S10).

The source data was drawn from the following (all sources are in English unless otherwise indicated):

Autism Diagnostic Observation Schedule (ADOS) interviews

Conversations in Restaurants

DCIEM/HCRC map task (LDC96S38)

Audiobook recordings from LibriVox

Meeting speech collected by LDC in 2001 for the ROAR project (see, e.g., ISL Meeting Speech Part 1 (LDC2004S05))

2001 U.S. Supreme Court oral arguments

Mixer 6 Speech (LDC2013S02)

Chinese video collected by LDC as part of the Video Annotation for Speech Technologies (VAST) project

YouthPoint radio interviews

All audio is provided in the form of 16 kHz, mono-channel FLAC files. The diarization for each recording is stored as a NIST Rich Transcription Time Marked (RTTM) file. RTTM files are space-separated text files containing one turn per line. Segmentation files are stored as HTK label files, each of which contains one speech segment per line. Both annotation file types are encoded as UTF-8. More information about the file formats is in the included documentation.
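As a rough illustration of the two annotation formats, the sketch below reads RTTM turns and label-file segments from text. It assumes the standard ten-field RTTM SPEAKER record (type, file ID, channel, onset, duration, two placeholder fields, speaker name, two placeholder fields) and assumes that label files hold "onset offset label" with times in seconds; the included documentation is the authority on both. The names Turn, parse_rttm, and parse_lab, and the sample records, are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Turn:
    file_id: str
    onset: float     # turn start, in seconds
    duration: float  # turn length, in seconds
    speaker: str


def parse_rttm(text: str) -> List[Turn]:
    """Parse SPEAKER records from RTTM text (one turn per line)."""
    turns = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and any record that is not a SPEAKER turn.
        if len(fields) < 8 or fields[0] != "SPEAKER":
            continue
        turns.append(Turn(fields[1], float(fields[3]), float(fields[4]), fields[7]))
    return turns


def parse_lab(text: str) -> List[Tuple[float, float, str]]:
    """Parse 'onset offset label' lines (assumed layout, times in seconds)."""
    segments = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        segments.append((float(fields[0]), float(fields[1]), fields[2]))
    return segments


# Made-up records for illustration only; not taken from the corpus.
example_rttm = (
    "SPEAKER rec01 1 0.50 2.25 <NA> <NA> spk1 <NA> <NA>\n"
    "SPEAKER rec01 1 3.00 1.50 <NA> <NA> spk2 <NA> <NA>\n"
)
turns = parse_rttm(example_rttm)
segments = parse_lab("0.50 2.75 speech\n3.00 4.50 speech\n")
```

Real files would be opened with encoding="utf-8", matching the encoding stated above, and their contents passed to these functions.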

Samples

Please view the following samples:

Speech

Segmentation

Diarization

Updates

None at this time.

Portions © 1995 Defence and Civil Institute of Environmental Medicine, © 2002 Interactive Systems Laboratories, Carnegie Mellon University, © 2003 SIL International (IPA93 Fonts), © 2011-2018 YouTube, LLC, © 1996, 2001, 2004, 2009-2010, 2013, 2018, 2019 Trustees of the University of Pennsylvania

Authors

Ryant, Neville ;
Liberman, Mark ;
Fiumara, James ;
Cieri, Christopher

0 Citations, 0 Mentions, 35% FAIR, 0.9 Dataset Index

10.35111/1bsf-4c55 (July 2019)

First DIHARD Challenge Evaluation - SEEDLingS

Introduction

First DIHARD Challenge Evaluation - SEEDLingS was developed by Duke University and the Linguistic Data Consortium (LDC) and contains approximately two hours of English child language recordings along with corresponding annotations used in support of the First DIHARD Challenge.

Data

The source data was drawn from the SEEDLingS (The Study of Environmental Effects on Developing Linguistic Skills) corpus, designed to investigate how infants' early linguistic and environmental input plays a role in their learning. Recordings were generated in the home environment of infants in the Rochester, New York area. A subset of that data was annotated by LDC for use in the First DIHARD Challenge.

This release, when combined with First DIHARD Challenge Evaluation - Nine Sources (LDC2019S12), contains the evaluation set audio data and annotation as well as the official scoring tool. The development data for the First DIHARD Challenge is also available from LDC as Eight Sources (LDC2019S09) and SEEDLingS (LDC2019S10).

Updates

None at this time.

Authors

Ryant, Neville ;
Liberman, Mark ;
Fiumara, James ;
Cieri, Christopher

0 Citations, 0 Mentions, 35% FAIR, 0.8 Dataset Index

10.35111/qa6w-kx44 (July 2019)

First DIHARD Challenge Development - Eight Sources

Introduction

First DIHARD Challenge Development - Eight Sources was developed by the Linguistic Data Consortium (LDC) and contains approximately 17 hours of English and Chinese speech data along with corresponding annotations used in support of the First DIHARD Challenge.

Data

This release, when combined with First DIHARD Challenge Development - SEEDLingS (LDC2019S10), contains the development set audio data and annotation as well as the official scoring tool. The evaluation data for the First DIHARD Challenge is also available from LDC as Nine Sources (LDC2019S12) and SEEDLingS (LDC2019S13).

The source data was drawn from the following (all sources are in English unless otherwise indicated):

Autism Diagnostic Observation Schedule (ADOS) interviews

DCIEM/HCRC map task (LDC96S38)

Audiobook recordings from LibriVox

Meeting speech from 2004 Spring NIST Rich Transcription (RT-04S) Development (LDC2007S11) and Evaluation (LDC2007S12) releases

2001 U.S. Supreme Court oral arguments

Sociolinguistic interviews from SLX Corpus of Classic Sociolinguistic Interviews (LDC2003T15)

Chinese video collected by LDC as part of the Video Annotation for Speech Technologies (VAST) project

YouthPoint radio interviews

Samples

Please view the following samples:

Speech

Segmentation

Diarization

Updates

None at this time.

Portions © 1995 Defence and Civil Institute of Environmental Medicine, © 2002 Interactive Systems Laboratories, Carnegie Mellon University, © 2000-2001 International Computer Science Institute, © 2003 SIL International (IPA93 Fonts), © 2011-2018 YouTube, LLC, © 1996, 2001, 2003, 2004, 2007, 2019 Trustees of the University of Pennsylvania

Authors

Ryant, Neville ;
Liberman, Mark ;
Fiumara, James ;
Cieri, Christopher

0 Citations, 0 Mentions, 35% FAIR, 0.9 Dataset Index

10.35111/t7z1-2m30 (June 2019)

First DIHARD Challenge Development - SEEDLingS

Introduction

First DIHARD Challenge Development - SEEDLingS was developed by Duke University and the Linguistic Data Consortium (LDC) and contains approximately two hours of English child language recordings along with corresponding annotations used in support of the First DIHARD Challenge. This release, when combined with First DIHARD Challenge Development - Eight Sources (LDC2019S09), contains the development set audio data and annotation as well as the official scoring tool. The evaluation data for the First DIHARD Challenge is also available from LDC as Nine Sources (LDC2019S12) and SEEDLingS (LDC2019S13).

Data

The source data was drawn from the SEEDLingS (The Study of Environmental Effects on Developing Linguistic Skills) corpus, designed to investigate how infants' early linguistic and environmental input plays a role in their learning. Recordings were generated in the home environment of infants in the Rochester, New York area. A subset of that data was annotated by LDC for use in the First DIHARD Challenge.

Updates

None at this time.

Authors

Ryant, Neville ;
Liberman, Mark ;
Fiumara, James ;
Cieri, Christopher

0 Citations, 0 Mentions, 35% FAIR, 0.9 Dataset Index

10.35111/9mth-hy18 (June 2019)

Mandarin Chinese Phonetic Segmentation and Tone

Introduction

Mandarin Chinese Phonetic Segmentation and Tone was developed by the Linguistic Data Consortium (LDC) and contains 7,849 Mandarin Chinese "utterances" and their phonetic segmentation and tone labels separated into training and test sets. The utterances were derived from 1997 Mandarin Broadcast News Speech and Transcripts (HUB4-NE) (LDC98S73 and LDC98T24, respectively). That collection consists of approximately 30 hours of Chinese broadcast news recordings from Voice of America, China Central TV and KAZN-AM, a commercial radio station based in Los Angeles, CA.

The ability to use large speech corpora for research in phonetics, sociolinguistics and psychology, among other fields, depends on the availability of phonetic segmentation and transcriptions. This corpus was developed to investigate the use of phone boundary models in forced alignment of Mandarin Chinese. Using embedded tone modeling (an approach also used to incorporate tones in automatic speech recognition), the forced-alignment performance of tone-dependent and tone-independent models was compared.

Data

Utterances were defined as the time-stamped between-pause units in the transcribed news recordings. Those with background noise, music, unidentified speakers or accented speakers were excluded. A test set was developed with 300 utterances randomly selected from six speakers (50 utterances for each speaker). The remaining 7,549 utterances formed the training set.
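A per-speaker held-out split of this kind can be sketched as follows. This is only an illustration under stated assumptions: the source does not say how the six speakers or the random utterances were actually chosen, and the function name split_by_speaker, the seed, and the toy speaker and utterance IDs are all hypothetical.

```python
import random
from collections import defaultdict


def split_by_speaker(utterances, test_speakers, per_speaker=50, seed=0):
    """Hold out `per_speaker` randomly chosen utterances from each test speaker.

    `utterances` is a list of (utterance_id, speaker_id) pairs.
    Returns (train_ids, test_ids).
    """
    rng = random.Random(seed)
    by_speaker = defaultdict(list)
    for utt_id, speaker in utterances:
        by_speaker[speaker].append(utt_id)

    test = set()
    for speaker in test_speakers:
        # random.sample draws without replacement.
        test.update(rng.sample(by_speaker[speaker], per_speaker))

    # Everything not held out stays in the training set.
    train = [utt_id for utt_id, _ in utterances if utt_id not in test]
    return train, sorted(test)


# Toy data: 8 made-up speakers with 100 utterances each.
toy = [(f"spk{s}_utt{i:03d}", f"spk{s}") for s in range(8) for i in range(100)]
train_ids, test_ids = split_by_speaker(toy, [f"spk{s}" for s in range(6)])
```

With six test speakers and 50 utterances each, the held-out set has 300 items, mirroring the proportions described above; the corpus itself has 7,849 utterances in total (300 test plus 7,549 training).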

The utterances in the test set were manually labeled and segmented into initials and finals in Pinyin, a Roman alphabet system for transcribing Chinese characters. Tones were marked on the finals, including Tone1 through Tone4, and Tone0 for the neutral tone. The Sandhi Tone3 was labeled as Tone2. The training set was automatically segmented and transcribed using the LDC forced aligner, which is a Hidden Markov Model (HMM) aligner trained on the same utterances (Yuan et al. 2014). The aligner achieved 93.1% agreement (of phone boundaries) within 20 ms on the test set compared to manual segmentation. The quality of the phonetic transcription and tone labels of the training set was evaluated by checking 100 utterances randomly selected from it. The 100 utterances contained 1,252 syllables: 15 syllables had mistaken tone transcriptions; two syllables showed mistaken transcriptions of the final, and there were no syllables with transcription errors on the initial.
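The agreement figure and the spot-check counts can be made concrete with a small sketch. It assumes that manual and automatic boundaries are already paired one-to-one and that agreement is the share of pairs differing by at most 20 ms; the source does not spell out the exact scoring procedure, and the function name boundary_agreement and the toy boundary times are hypothetical.

```python
def boundary_agreement(reference, predicted, tolerance=0.020):
    """Fraction of paired boundaries whose times differ by at most `tolerance` seconds."""
    if len(reference) != len(predicted):
        raise ValueError("boundary lists must have the same length")
    if not reference:
        raise ValueError("at least one boundary is required")
    hits = sum(1 for r, p in zip(reference, predicted) if abs(r - p) <= tolerance)
    return hits / len(reference)


# Toy boundary times in seconds (made up for illustration).
manual = [0.10, 0.25, 0.40, 0.62]
aligned = [0.11, 0.25, 0.45, 0.63]
agreement = boundary_agreement(manual, aligned)  # 3 of 4 pairs are within 20 ms

# Spot-check rates from the counts reported above (1,252 syllables checked).
tone_error_pct = round(15 / 1252 * 100, 2)   # about 1.2 percent
final_error_pct = round(2 / 1252 * 100, 2)   # about 0.16 percent
```

On the counts reported for the 100 checked utterances, roughly 1.2% of syllables had a tone error and about 0.16% had an error on the final.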

Each utterance has three associated files: a FLAC-compressed WAV file, a word transcript file, and a phonetic boundaries and label file.

Samples

Please view this audio sample, transcript sample and phonetic labels sample.

Acknowledgement

This work was supported in part by National Science Foundation Grant No. IIS-0964556.

Updates

None at this time.

Additional Licensing Instructions

This 'members-only' corpus is available to current members, who can request the data at the listed reduced-license fee. Contact [email protected] for information about becoming a member.

Authors

Yuan, Jiahong ;
Ryant, Neville ;
Liberman, Mark

0 Citations, 0 Mentions, 35% FAIR, 0.9 Dataset Index

10.35111/djnc-2014 (April 2015)

Automated Author Profile
Ryant, Neville

