Automated Author Profile
Linguistic Data Consortium
Current S-Index
Sum of Dataset Indices for all datasets
Average Dataset Index per Dataset
Average Dataset Index per dataset
Total Datasets
Total datasets for this author
Average FAIR Score
Average FAIR Score per dataset
Total Citations
Total citations to the author's datasets
Total Mentions
Total mentions of the author's datasets
S-Index Interpretation
The S-Index (Sharing Index) is a comprehensive metric that represents the cumulative impact of all your datasets. It is calculated as the sum of Dataset Index scores across all your claimed datasets.
What it means:
- A higher S-index indicates greater overall impact of your datasets relative to typical datasets in their fields of research
- The S-Index grows as you add more datasets or as existing datasets gain more citations and mentions
- It provides a single number to track your research data impact over time
Current S-Index: 28.4 (sum of the Dataset Index scores of 43 datasets)
More information here.
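As a rough illustration of the definition above, the S-Index is a plain sum of per-dataset scores, and the average Dataset Index is that sum divided by the number of datasets. The sketch below assumes hypothetical per-dataset scores (the page does not list them individually) and uses illustrative function names.

```python
from statistics import mean

def s_index(dataset_indices):
    """S-Index: the sum of Dataset Index scores over all claimed datasets."""
    return sum(dataset_indices)

def average_dataset_index(dataset_indices):
    """Average Dataset Index per dataset (0.0 when there are no datasets)."""
    return mean(dataset_indices) if dataset_indices else 0.0

# Hypothetical scores for three datasets; not taken from this profile.
scores = [0.5, 1.25, 0.75]
print(s_index(scores), average_dataset_index(scores))

# Using the figures reported on this page: an S-Index of 28.4 over 43 datasets.
print(round(28.4 / 43, 2))
```

Because the index is additive, claiming a new dataset or gaining citations on an existing one can only raise the total, which matches the behavior described in the list above.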
S-Index Over Time
Cumulative Citations Over Time
Cumulative Mentions Over Time
Datasets
Introduction
LDC (Linguistic Data Consortium) Spoken Language Sampler - Fifth Release contains samples from 19 corpora published by LDC between 1996 and 2019.
LDC distributes a wide and growing assortment of resources for researchers, engineers and educators whose work is concerned with human languages. Historically, most linguistic resources were not generally available to interested researchers but were restricted to single laboratories or to a limited number of users. Inspired by the success of selected readily-available and well-known data sets, such as the Brown University text corpus, LDC was founded in 1992 to provide a new mechanism for large-scale corpus development and resource sharing. With the support of its members, LDC provides critical services to the language research community that include: maintaining the LDC data archives, producing and distributing data via media or web download, negotiating intellectual property agreements with potential information providers and maintaining relations with other like-minded groups around the world.
Resources available from LDC include speech, text, video and lexicons in multiple languages, as well as software tools to facilitate the use of corpus materials. For a complete view of LDC's publications, browse the Catalog.
The sampler is available as a free download.
Data
The LDC Spoken Language Sampler - Fifth Release provides speech and transcript samples and is designed to illustrate the variety and breadth of the speech-related resources available from the LDC Catalog. The sound files included in this release are excerpts that have been modified in various ways relative to the original data as published by LDC:
- Most excerpts are truncated to be much shorter than the original files, typically about 2 minutes. Samples shorter than this typically represent the entirety of a single file.
- Signal amplitude has been adjusted where necessary to normalize playback volume.
- Some corpora are published in compressed form, but all samples here are uncompressed.
- Some text files are presented as images to ensure foreign character sets display properly.
In the table below, the link for the catalog number takes you to the catalog entry for that corpus.

| Catalog number | Title | Description |
| --- | --- | --- |
| LDC2018S06 | 2011 NIST Language Recognition Evaluation Test Set | 2011 NIST Language Recognition Evaluation Test Set contains selected training data and the evaluation test set for the 2011 NIST Language Recognition Evaluation. It consists of approximately 204 hours of conversational telephone speech and broadcast audio collected by the Linguistic Data Consortium (LDC) in the following 24 languages and dialects: Arabic (Iraqi), Arabic (Levantine), Arabic (Maghrebi), Arabic (Standard), Bengali, Czech, Dari, English (American), English (Indian), Farsi, Hindi, Lao, Mandarin, Punjabi, Pashto, Polish, Russian, Slovak, Spanish, Tamil, Thai, Turkish, Ukrainian and Urdu. |
| LDC2018S14 | AISHELL-1 | AISHELL-1 contains approximately 520 hours of Chinese Mandarin speech from 400 speakers recorded simultaneously on three different devices with associated transcripts. The goal of the collection was to support speech recognition system development in domains such as smart homes, autonomous driving, entertainment, finance, and science and technology. |
| LDC2018S15 | Avatar Education Portuguese | Avatar Education Portuguese contains approximately 80 minutes of Brazilian Portuguese microphone speech with phonetic and orthographic transcriptions. The data was developed for Avatar Education, an animated virtual assistant designed to enhance communication and interaction in educational contexts, such as online learning. |
| LDC96S60 | CALLFRIEND Vietnamese | CALLFRIEND Vietnamese consists of approximately 60 unscripted telephone conversations between native speakers of Vietnamese. The duration of each conversation was between 5 and 30 minutes. The corpus also includes documentation describing speaker information (sex, age, education, callee telephone number) and call information (channel quality, number of speakers). |
| LDC2019S07 | CIEMPIESS Experimentation | CIEMPIESS (Corpus de Investigación en Español de México del Posgrado de Ingeniería Eléctrica y Servicio Social) Experimentation was developed at the National Autonomous University of Mexico (UNAM) and consists of approximately 22 hours of Mexican Spanish broadcast and read speech with associated transcripts. The goal of this work was to create acoustic models for automatic speech recognition. |
| LDC97S63 | The CMU Kids Corpus | The CMU Kids Corpus was developed in 1995-1996 and is a database of sentences read aloud by 76 children, totaling 5,180 utterances. This data set was designed as a training set of children's speech for the SPHINX II automatic speech recognizer in the LISTEN project at Carnegie Mellon University. |
| LDC2008S01 | CSLU: Portland Cellular Telephone Speech Version 1.3 | Created by the Center for Spoken Language Understanding (CSLU) at Oregon Health and Science University, CSLU: Portland Cellular Telephone Speech Version 1.3 is a collection of cellular telephone speech (7,571 utterances) and corresponding orthographic and phonetic transcriptions. |
| LDC2018S01 | DIRHA English WSJ Audio | DIRHA English WSJ Audio is comprised of approximately 85 hours of real and simulated read speech by six native American English speakers. It was developed as part of the Distant-Speech Interaction for Robust Home Applications (DIRHA) Project, which addressed natural spontaneous speech interaction with distant microphones in a domestic environment. |
| LDC2019S14 | The DKU-JNU-EMA Electromagnetic Articulography Database | The DKU-JNU-EMA Electromagnetic Articulography Database was developed by Duke Kunshan University and Jinan University and contains approximately 10 hours of articulography and speech data in Mandarin, Cantonese, Hakka, and Teochew Chinese from two to seven native speakers for each dialect. |
| LDC2002S28 | Emotional Prosody Speech and Transcripts | Emotional Prosody Speech and Transcripts was developed by LDC and contains audio recordings and corresponding transcripts, designed to support research in emotional prosody and collected over an eight-month period in 2000-2001. The recordings consist of professional actors reading a series of semantically neutral utterances (dates and numbers) spanning 14 distinct emotional categories. |
| LDC2019S09 | First DIHARD Challenge Development - Eight Sources | First DIHARD Challenge Development - Eight Sources was developed by LDC and contains approximately 17 hours of English and Chinese speech data along with corresponding annotations used in support of the First DIHARD Challenge. This release, when combined with First DIHARD Challenge Development - SEEDLingS (LDC2019S10), contains the development set audio data and annotation (diarization, segmentation) as well as the official scoring tool. |
| LDC2017S19 | IARPA Babel Zulu Language Pack IARPA-babel206b-v0.1e | IARPA Babel Zulu Language Pack IARPA-babel206b-v0.1e was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 211 hours of Zulu conversational and scripted telephone speech collected in 2012 and 2013 along with corresponding transcripts. |
| LDC2004S02 | ICSI Meeting Speech | ICSI Meeting Speech contains approximately 72 hours of speech from 53 unique speakers in 75 meetings collected at Berkeley’s International Computer Science Institute (ICSI) in 2000-2002. The recordings were made during regular weekly meetings of various ICSI working teams, including the team working on the ICSI Meeting Project. The speech files range in length from 17 to 103 minutes, but in general are less than one hour each. |
| LDC2012S04 | Malto Speech and Transcripts | Malto Speech and Transcripts contains approximately 8 hours of Malto speech data collected between 2005 and 2009 from 27 speakers (22 males, 5 females), accompanying transcripts, English translations and glosses for 6 hours of the collection. Speakers were asked to talk about themselves, their lives, rituals and folklore; elicitation interviews were then conducted. The goal of the work was to present the current state and dialectal variation of Malto. |
| LDC2018S08 | Multi-Language Conversational Telephone Speech 2011 -- Central European | Multi-Language Conversational Telephone Speech 2011 -- Central European was developed by the Linguistic Data Consortium (LDC) and is comprised of approximately 44 hours of telephone speech in two distinct language varieties of Central Europe: Czech and Slovak. The data was collected to support research and technology evaluation in automatic language identification, specifically language pair discrimination for closely related languages/dialects. Portions of these telephone calls were used in the NIST 2011 Language Recognition Evaluation. |
| LDC2006S13 | N4 NATO Native and Non-Native Speech | N4 NATO Native and Non-Native Speech corpus was developed by the NATO research group on Speech and Language Technology in order to provide a military-oriented database for multilingual and non-native speech processing studies. It consists of 115 native and non-native speakers using NATO English procedure between ships and reading from a text, "The North Wind and the Sun," in both English and the speaker's native language. |
| LDC2018S10 | RATS Language Identification | RATS Language Identification was developed by the Linguistic Data Consortium (LDC) and is comprised of approximately 5,400 hours of Levantine Arabic, Farsi, Dari, Pashto and Urdu conversational telephone speech with annotation of speech segments. The corpus was created to provide training, development and initial test sets for the Language Identification (LID) task in the DARPA RATS (Robust Automatic Transcription of Speech) program. |
| LDC2012S06 | Turkish Broadcast News Speech and Transcripts | Turkish Broadcast News Speech and Transcripts was developed by Boğaziçi University, Istanbul, Turkey and contains approximately 130 hours of Voice of America (VOA) Turkish radio broadcasts and corresponding transcripts. This is part of a larger corpus of Turkish broadcast news data collected and transcribed with the goal to facilitate research in Turkish automatic speech recognition and its applications. The VOA material was collected between December 2006 and June 2009 using a PC and TV/radio card setup. The data collected during the period 2006-2008 was recorded from analog FM radio; the 2009 broadcasts were recorded from digital satellite transmissions. |
| LDC2017S17 | Vehicle City Voices Corpus – Part I | Vehicle City Voices Corpus – Part I was developed at the University of Michigan-Flint, and is an ongoing oral history project and survey of English language variation in Flint, Michigan. It contains approximately 16 hours of speech with corresponding transcripts from 21 interviews of Flint residents conducted between 2012 and 2015. The corpus was designed to provide high-quality recordings for acoustic analysis and to examine narrative structure and discursive construction of individual and collective identity in urban spaces. |
Portions © 2019 Trustees of the University of Pennsylvania
Authors
- Linguistic Data Consortium
Introduction
HUB5 Mandarin Telephone Speech and Transcripts Second Edition was developed by the Linguistic Data Consortium (LDC) in support of US government projects for language recognition and Large Vocabulary Conversational Speech Recognition (LVCSR). The first edition was released by LDC in two data sets, HUB5 Mandarin Telephone Speech Corpus (LDC98S69) and HUB5 Mandarin Transcripts (LDC98T26). This second edition merges the speech and transcript releases, updates the audio format and adds Pinyin transcripts, forced alignment and updated documentation and metadata.
Data
This release consists of (1) approximately 19 hours of Mandarin speech from 42 unscripted telephone conversations between native speakers of Mandarin from CALLFRIEND Mandarin Chinese-Mainland Dialect (LDC96S55), which has also been released in a second, updated edition (LDC2018S09) and (2) associated transcripts of contiguous 5-30 minute segments from those telephone conversations.
Audio data was collected before July 1997. Participants could speak with a person of their choice on any topic; most called family members and friends. All calls originated in North America. The recorded conversations lasted up to 30 minutes.
The audio data was recorded as 8kHz u-law SPH encoded stereo files with one end of the phone call on each channel. In this release, files were converted to WAV format, and information from the original SPH headers is included with the corpus. SPH files are not included in this second edition.
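For readers unfamiliar with u-law audio, the following minimal sketch shows how 8-bit u-law bytes from a two-channel recording might be expanded to 16-bit linear samples and separated by channel. It follows the standard G.711 mu-law expansion and assumes raw interleaved sample bytes with any file header already removed; the function names are illustrative, and this is not the conversion tooling used to produce the release.

```python
BIAS = 0x84  # bias constant (132) used by G.711 mu-law coding

def ulaw_to_linear(byte: int) -> int:
    """Expand one 8-bit mu-law code to a signed linear sample."""
    byte = ~byte & 0xFF              # mu-law codes are stored bit-inverted
    sign = byte & 0x80               # top bit marks a negative sample
    exponent = (byte >> 4) & 0x07    # 3-bit segment number
    mantissa = byte & 0x0F           # 4-bit step within the segment
    magnitude = (((mantissa << 3) + BIAS) << exponent) - BIAS
    return -magnitude if sign else magnitude

def split_stereo_ulaw(raw: bytes) -> tuple[list[int], list[int]]:
    """Decode interleaved two-channel mu-law bytes into per-channel samples."""
    first = [ulaw_to_linear(b) for b in raw[0::2]]   # one end of the call
    second = [ulaw_to_linear(b) for b in raw[1::2]]  # the other end
    return first, second

# Four interleaved bytes: two frames of two channels each.
side_a, side_b = split_stereo_ulaw(bytes([0xFF, 0x80, 0x00, 0xFF]))
print(side_a, side_b)
```

Keeping the two sides of the call in separate sample lists mirrors the one-speaker-per-channel layout described above.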
Completed calls passed through two human audits. The first audit was conducted to verify that the target language was spoken by the participants and to check the quality of the recordings. The second audit was conducted by a native speaker familiar with Mainland and Taiwan Mandarin dialects to classify the conversations under one of the two categories. Audit information is available in the corpus documentation.
Transcripts were created manually by native Mandarin speakers in the GB2312 encoding schema. This release adds Pinyin translations of the transcripts in UTF-8 and includes the original transcripts converted to UTF-8. For forced alignment, files were converted to linear-PCM encoding, and the speaker channels were split into separate files to avoid overlapping. The aligned files are presented in tab-separated files and in TextGrid files. Alignment data is provided in UTF-8.
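The GB2312-to-UTF-8 conversion described above can be sketched with Python's built-in codecs. This is only an illustration of the transcoding step, not the procedure used for the release; the function name is hypothetical, and real legacy files sometimes contain characters outside GB2312, in which case a superset codec such as `gb18030` may be needed.

```python
def gb2312_to_utf8(data: bytes) -> bytes:
    """Re-encode GB2312 transcript bytes as UTF-8."""
    text = data.decode("gb2312")   # raises UnicodeDecodeError on invalid input
    return text.encode("utf-8")

# Round trip on a short Mandarin greeting built in memory for the example.
legacy_bytes = "你好".encode("gb2312")
utf8_bytes = gb2312_to_utf8(legacy_bytes)
print(len(legacy_bytes), len(utf8_bytes))
```

GB2312 uses two bytes per Chinese character while UTF-8 uses three for the same characters, so converted transcripts are larger than the originals.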
Samples
Please view the following samples:
Updates
None at this time.
Portions © 1996, 1998, 2018 Trustees of the University of Pennsylvania
Authors
- Linguistic Data Consortium
Introduction
TRAD Arabic-French Parallel Text -- Newswire was developed by ELDA as part of the PEA-TRAD project. It contains French translations of a subset of approximately 20,000 Arabic words from NIST 2008 Open Machine Translation (OpenMT) Evaluation (LDC2010T21).
The PEA-TRAD project (Translation as a Support for Document Analysis) was supported by the French Ministry of Defense (DGA). Its purpose was to develop speech-to-speech translation technology for multiple languages (e.g., Arabic, Chinese, Pashto) from a variety of domains. ELDA developed several corpora for this effort.
The Linguistic Data Consortium (LDC) has also released the following TRAD corpora:
- TRAD Chinese-French Parallel Text -- Blog (LDC2018T02)
- TRAD Arabic-French Parallel Text -- Newsgroup (LDC2018T13)
- TRAD Chinese-French Parallel Text -- Broadcast News (LDC2018T17)
Data
This release consists of 813 segments (translation units) from 74 documents. The source data is Arabic newswire text collected and translated into English by LDC. Information about the ELDA translation team, translation guidelines and validation results is contained in the documentation accompanying this release.
The Arabic source file contains 19,902 words and the French reference translation contains 29,104 words. The data is presented in two unicode-encoded XML files along with an associated DTD.
Samples
Please view this Arabic sample and French sample.
Updates
None at this time.
Portions © 2007 Agence France Presse, Al-Ahram, Al Hayat, An Nahar, Al Quds-Al Arabi, Asharq Al-Awsat, Assabah, Xinhua News Agency, © 2018 ELDA, © 2007, 2009, 2010, 2018 Trustees of the University of Pennsylvania
Authors
- Linguistic Data Consortium
- ELDA
Introduction
TRAD Chinese-French Parallel Text -- Broadcast News was developed by ELDA as part of the PEA-TRAD project. It contains French translations of a subset of approximately 30,000 Chinese characters from GALE Phase 1 Chinese Broadcast News Parallel Text - Part 3 (LDC2008T18).
The PEA-TRAD project (Translation as a Support for Document Analysis) was supported by the French Ministry of Defense (DGA). Its purpose was to develop speech-to-speech translation technology for multiple languages (e.g., Arabic, Chinese, Pashto) from a variety of domains. ELDA developed several corpora for this effort.
The Linguistic Data Consortium (LDC) has also released the following TRAD corpora:
- TRAD Chinese-French Parallel Text -- Blog (LDC2018T02)
- TRAD Arabic-French Parallel Text -- Newsgroup (LDC2018T13)
- TRAD Arabic-French Parallel Text -- Newswire (LDC2018T21)
Data
This release consists of 977 segments (translation units) from 139 documents. The source data is Chinese broadcast news collected and translated into English by LDC for the DARPA GALE (Global Autonomous Language Exploitation) program. Information about the ELDA translation team, translation guidelines and validation results is contained in the documentation accompanying this release.
The Chinese source file contains 33,571 characters and the French reference translation contains 22,424 words. The data is presented in two unicode-encoded XML files along with an associated DTD.
Samples
Please view this source sample and reference sample.
Updates
None at this time.
Portions © 2005, 2006 China Central TV, © 2005, 2006 Phoenix TV, © 2018 ELDA, © 2005-2006, 2008, 2018 Trustees of the University of Pennsylvania
Authors
- Linguistic Data Consortium
- ELDA
Introduction
TRAD Arabic-French Parallel Text -- Newsgroup was developed by ELDA as part of the PEA-TRAD project. It contains French translations of a subset of approximately 10,000 Arabic words from GALE Phase 1 Arabic Newsgroup Parallel Text - Part 1 (LDC2009T03).
The PEA-TRAD project (Translation as a Support for Document Analysis) was supported by the French Ministry of Defense (DGA). Its purpose was to develop speech-to-speech translation technology for multiple languages (e.g., Arabic, Chinese, Pashto) from a variety of domains. ELDA developed several corpora for this effort.
The Linguistic Data Consortium (LDC) has also released the following TRAD corpora:
- TRAD Chinese-French Parallel Text -- Blog (LDC2018T02)
- TRAD Chinese-French Parallel Text -- Broadcast News (LDC2018T17)
- TRAD Arabic-French Parallel Text -- Newswire (LDC2018T21)
Data
This release consists of 398 segments (translation units) from 17 documents. The source data is Arabic newsgroup text collected and translated into English by the Linguistic Data Consortium for the DARPA GALE (Global Autonomous Language Exploitation) program. Information about the ELDA translation team, translation guidelines and validation results is contained in the documentation accompanying this release.
The Arabic source file contains 10,706 words and the French reference translation contains 15,843 words. The data is presented in two unicode-encoded XML files along with an associated DTD.
Samples
Please view this source sample and reference sample.
Updates
None at this time.
Portions © 2018 ELDA, © 2005-2007, 2009, 2018 Trustees of the University of Pennsylvania
Authors
- Linguistic Data Consortium
- ELDA
Introduction
2007 CoNLL Shared Task - Arabic & English consists of dependency treebanks in two languages used as part of the CoNLL 2007 shared task on multi-lingual dependency parsing and domain adaptation. The languages covered in this release are Arabic and English.
LDC also released the following 2006 & 2007 CoNLL Shared Task corpora:
- 2007 CoNLL Shared Task - Greek, Hungarian & Italian (LDC2018T07)
- 2007 CoNLL Shared Task - Basque, Catalan, Czech & Turkish (LDC2018T06)
- 2006 CoNLL Shared Task - Ten Languages (LDC2015T11)
- 2006 CoNLL Shared Task - Arabic & Czech (LDC2015T12)
This corpus is cross listed with ELRA as ELRA-W0123.
The Conference on Computational Natural Language Learning (CoNLL) is accompanied every year by a shared task intended to promote natural language processing applications and evaluate them in a standard setting. In 2006 and 2007, the shared tasks were devoted to the parsing of syntactic dependencies using corpora from up to thirteen languages. The task aimed to define and extend the then-current state of the art in dependency parsing, a technology that complemented previous tasks by producing a different kind of syntactic description of input text. The 2007 shared task added a domain adaptation track for English in addition to the multilingual track. More information about the 2007 shared task is available at the CoNLL Previous Tasks web site.
LDC has released data sets from other CoNLL shared tasks. 2008 CoNLL Shared Task Data (LDC2009T12) contains the English material used in the 2008 shared task, which focused on English, employed a unified dependency-based formalism and merged the tasks of syntactic dependency parsing, identifying semantic arguments and labeling them with semantic roles. 2009 CoNLL Shared Task Data Parts 1 and 2 (LDC2012T03 and LDC2012T04) consists of the English, Catalan, Chinese, Czech, German and Spanish resources used in the 2009 task, which included a comparison of time and space complexity based on participants' input and learning curve comparison for languages with large datasets. 2015-2016 CoNLL Shared Task (LDC2017T13) contains Chinese and English resources used in the 2015 and 2016 shared tasks on shallow discourse parsing.
Data
The source data in the treebanks in this release consists principally of various texts (e.g., textbooks, news, literature) annotated in dependency format. In general, dependency grammar is based on the idea that the verb is the center of the clause structure and that other units in the sentence are connected to the verb as directed links or dependencies. This is a one-to-one correspondence: for every element in the sentence there is one node in the sentence structure that corresponds to that element. In constituency or phrase structure grammars, on the other hand, clauses are divided into noun phrases and verb phrases and in each sentence, one or more nodes may correspond to one element. The Penn Treebank (LDC99T42) is an example of a constituency or phrase structure approach. All of the data sets in this release are dependency treebanks.
The individual data sets are:
- Prague Arabic Dependency Treebank (Arabic)
- CHILDES (English)
- PennBioIE Oncology 1.0 (English)
- Treebank-3 (English)
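To make the dependency representation described above concrete, here is a minimal sketch that reads one sentence in a tab-separated, ten-column layout of the kind used by the CoNLL-X shared task format (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL). The sample sentence and its labels are invented for illustration, and the exact column contents of the files in this release should be checked against the corpus documentation.

```python
from typing import NamedTuple

class Token(NamedTuple):
    idx: int      # position of the word in the sentence (1-based)
    form: str     # the word itself
    head: int     # index of the governing word; 0 marks the root
    deprel: str   # label of the dependency link to the head

def parse_sentence(block: str) -> list[Token]:
    """Parse one sentence of tab-separated, ten-column dependency data."""
    tokens = []
    for line in block.strip().splitlines():
        cols = line.split("\t")
        tokens.append(Token(int(cols[0]), cols[1], int(cols[6]), cols[7]))
    return tokens

# Invented example sentence; unused columns are filled with underscores.
SAMPLE = "\n".join([
    "1\tShe\t_\t_\t_\t_\t2\tSBJ\t_\t_",
    "2\treads\t_\t_\t_\t_\t0\tROOT\t_\t_",
    "3\tbooks\t_\t_\t_\t_\t2\tOBJ\t_\t_",
])

tokens = parse_sentence(SAMPLE)
root = next(t for t in tokens if t.head == 0)
dependents = [t.form for t in tokens if t.head == root.idx]
print(root.form, dependents)
```

Each word appears exactly once as a node, and the verb is the root that the other words attach to, which is the one-to-one correspondence the description refers to.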
Samples
Please view these samples:
Updates
None at this time.
Portions © 2000 Agence France Presse, © 2001 Al Hayat, © 2002 An Nahar, © 1987-1989 Dow Jones & Company, Inc., © 2002 Ummah Press Service, © 2003 Xinhua News Agency, © 1999, 2000-2008, 2018 Trustees of the University of Pennsylvania
Authors
- Linguistic Data Consortium
Introduction
TRAD Chinese-French Parallel Text -- Blog was developed by ELDA as part of the PEA-TRAD project. It contains French translations of a subset of approximately 10,000 Chinese words from GALE Phase 1 Chinese Blog Parallel Text (LDC2008T06).
The PEA-TRAD project (Translation as a Support for Document Analysis) was supported by the French Ministry of Defense (DGA). Its purpose was to develop speech-to-speech translation technology for multiple languages (e.g., Arabic, Chinese, Pashto) from a variety of domains. ELDA developed several corpora for this effort.
The Linguistic Data Consortium (LDC) has also released the following TRAD corpora:
- TRAD Arabic-French Parallel Text -- Newsgroup (LDC2018T13)
- TRAD Chinese-French Parallel Text -- Broadcast News (LDC2018T17)
- TRAD Arabic-French Parallel Text -- Newswire (LDC2018T21)
Data
This release consists of 444 segments (translation units) from 17 documents. The source data is Chinese blog text collected and translated into English by LDC for the DARPA GALE (Global Autonomous Language Exploitation) program. Information about the ELDA translation team, translation guidelines and validation results is contained in the documentation accompanying this release.
The Chinese source file contains 15,809 characters and the French reference translation contains 11,769 words. The data is presented in two unicode-encoded XML files along with an associated DTD.
Samples
Please view this source sample and reference sample.
Updates
None at this time.
Portions © 2018 ELDA, © 2005-2007, 2008, 2018 Trustees of the University of Pennsylvania
Authors
- Linguistic Data Consortium
- ELDA
Introduction
ASpIRE Development and Development Test Sets was developed for the Automatic Speech recognition In Reverberant Environments (ASpIRE) Challenge sponsored by IARPA (the Intelligence Advanced Research Projects Activity). It contains approximately 226 hours of English speech with transcripts and scoring files.
The ASpIRE challenge asked solvers to develop innovative speech recognition systems that could be trained on conversational telephone speech, and yet work well on far-field microphone data from noisy, reverberant rooms. Participants had the opportunity to evaluate their techniques on a common set of challenging data that included significant room noise and reverberation.
Data
The audio data is a subset of Mixer 6 Speech (LDC2013S03), audio recordings of interviews, transcript readings and conversational telephone speech collected by the Linguistic Data Consortium in 2009 and 2010 from native English speakers local to the Philadelphia area. The transcripts were developed by Appen for the ASpIRE challenge.
Data is divided into development and development test sets.
Audio is presented as single channel, 16kHz 16-bit Signed Integer PCM *.wav files. Transcripts are plain text tdf files. Scoring files are also included.
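A quick way to confirm that a file matches the audio format stated above is to read its header with Python's standard `wave` module. The sketch below builds a tiny silent file in memory so that it runs without corpus data; the helper names are illustrative and not part of any corpus tooling.

```python
import io
import wave

def describe_wav(fileobj):
    """Return (channels, bits per sample, sample rate) from a WAV header."""
    with wave.open(fileobj, "rb") as wav:
        return wav.getnchannels(), wav.getsampwidth() * 8, wav.getframerate()

def is_mono_16k_16bit(fileobj) -> bool:
    """Check for single-channel, 16-bit, 16 kHz PCM audio."""
    return describe_wav(fileobj) == (1, 16, 16000)

# Build 10 ms of silence in memory with the expected parameters.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as out:
    out.setnchannels(1)        # single channel
    out.setsampwidth(2)        # 2 bytes = 16 bits per sample
    out.setframerate(16000)    # 16 kHz
    out.writeframes(b"\x00\x00" * 160)
buffer.seek(0)

print(is_mono_16k_16bit(buffer))
```

For files on disk, a path can be passed in place of the in-memory buffer, since `wave.open` accepts either.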
Samples
Please view this audio sample and transcript sample.
Updates
None at this time.
Portions © 2014 U.S. Government, © 2009-2010, 2013, 2017 Trustees of the University of Pennsylvania
Authors
- Linguistic Data Consortium
- Appen Pty Ltd
Introduction
LDC (Linguistic Data Consortium) Spoken Language Sampler - Fourth Release, LDC catalog number LDC2017S16 and ISBN 1-58563-811-0, contains samples from 18 different corpora published by LDC between 1996 and 2017.
LDC distributes a wide and growing assortment of resources for researchers, engineers and educators whose work is concerned with human languages. Historically, most linguistic resources were not generally available to interested researchers but were restricted to single laboratories or to a limited number of users. Inspired by the success of selected readily-available and well-known data sets, such as the Brown University text corpus, LDC was founded in 1992 to provide a new mechanism for large-scale corpus development and resource sharing. With the support of its members, LDC provides critical services to the language research community that include: maintaining the LDC data archives, producing and distributing data via media or web download, negotiating intellectual property agreements with potential information providers and maintaining relations with other like-minded groups around the world.
Resources available from LDC include speech, text, video data and lexicons in multiple languages, as well as software tools to facilitate the use of corpus materials. For a complete view of LDC's publications, browse the Catalog.
The sampler is available as a free download.
Data
The LDC Spoken Language Sampler - Fourth Release provides speech and transcript samples and is designed to illustrate the variety and breadth of the speech-related resources available from the LDC Catalog. The sound files included in this release are excerpts that have been modified in various ways relative to the original data as published by LDC:
- Most excerpts are truncated to be much shorter than the original files, typically between 1.5 and 2 minutes.
- Signal amplitude has been adjusted where necessary to normalize playback volume.
- Some corpora are published in compressed form, but all samples here are uncompressed.
- Some text files are presented as images to ensure foreign character sets display properly.
- In some publications, NIST SPHERE file format is used for audio data, but the audio files in this sampler are MS-WAV/audio (RIFF) file format for compatibility with typical browser audio utilities. FLAC files have been expanded into their wav form as well.
The link for the catalog number takes you to the catalog entry, and the link for the title takes you to further documentation for that corpus.

| Catalog number | Title | Description |
| --- | --- | --- |
| LDC2017S06 | 2010 NIST Speaker Recognition Evaluation Test Set | 2010 NIST Speaker Recognition Evaluation Test Set was developed by LDC and NIST (National Institute of Standards and Technology). It contains 2,255 hours of American English telephone speech and interview speech recorded over a microphone channel used as test data in the NIST-sponsored 2010 Speaker Recognition Evaluation (SRE). |
| LDC2015S10 | Arabic Learner Corpus | Arabic Learner Corpus was developed at the University of Leeds and consists of written essays and spoken recordings by Arabic learners collected in Saudi Arabia in 2012 and 2013. The corpus includes 282,732 words in 1,585 materials, produced by 942 students from 67 nationalities studying at pre-university and university levels. The average length of an essay is 178 words. |
| LDC2015S12 | Articulation Index LSCP | Articulation Index LSCP was developed by researchers at Laboratoire de Sciences Cognitives et Psycholinguistique (LSCP), Ecole Normale Supérieure. It revises and enhances a subset of Articulation Index (AIC) (LDC2005S22), a corpus of persons speaking English syllables. Changes include the addition of forced alignment to sound files, time alignment of syllable utterances and format conversions. |
| LDC2014S01 | CALLFRIEND Farsi Second Edition Speech | CALLFRIEND Farsi Second Edition Speech was developed by LDC and consists of approximately 42 hours of telephone conversation (100 recordings) among native Farsi speakers. The CALLFRIEND project supported the development of language identification technology. Each CALLFRIEND corpus consists of unscripted telephone conversations lasting between 5 and 30 minutes. |
| LDC2016S04 | CHM150 | CHM150 (Corpus Hecho en México 150) was developed by the Speech Processing Laboratory of the Faculty of Engineering at the National Autonomous University of Mexico (UNAM) and consists of approximately 1.63 hours of Mexican Spanish speech, associated transcripts, and speaker metadata. The goal of this work was to support spoken term detection and forensic speaker identification. |
| LDC2007S18 | CSLU: Kids' Speech Version 1.1 | CSLU: Kids' Speech Version 1.1 is a collection of spontaneous and prompted speech from 1100 children between Kindergarten and Grade 10 in the Forest Grove School District in Oregon. Approximately 100 children at each grade level read around 60 items from a total list of 319 phonetically-balanced but simple words, sentences or digit strings. Each utterance of spontaneous speech begins with a recitation of the alphabet and contains a monologue of about one minute in length. This release consists of 1017 files containing approximately 8-10 minutes of speech per speaker. Corresponding word-level transcriptions are also included. |
| LDC2016S12 | IARPA Babel Georgian Language Pack IARPA-babel404b-v1.0a | IARPA Babel Georgian Language Pack IARPA-babel404b-v1.0a was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 190 hours of Georgian conversational and scripted telephone speech collected in 2014-2015 along with corresponding transcripts. |
| LDC2003S07 | Korean Telephone Conversations Complete (S), (T), (L) | The Korean telephone conversations were originally recorded as part of the CALLFRIEND project. Korean Telephone Conversations Speech consists of 100 telephone conversations, 49 of which were published in 1996 as CALLFRIEND Korean, while the remaining 51 are previously unexposed calls. Korean Telephone Conversations Transcripts consists of 100 text files, totaling approximately 190K words and 25K unique words. All files are in Korean orthography: orthographic Korean characters are in Hangul, encoded in the KSC5601 (Wansung) system. The complete set of Korean Telephone Conversations also includes a transcript corpus (LDC2003T08) and a lexicon corpus (LDC2003L02). |
| LDC2012S04 | Malto Speech and Transcripts | Malto Speech and Transcripts contains approximately 8 hours of Malto speech data collected between 2005 and 2009 from 27 speakers (22 males, 5 females). Also included are accompanying transcripts, English translations and glosses for 6 hours of the collection. Speakers were asked to talk about themselves, their lives, rituals and folklore; elicitation interviews were then conducted. The goal of the work was to present the current state and dialectal variation of Malto. |
| LDC2015S04 | Mandarin-English Code-Switching in South-East Asia | Mandarin-English Code-Switching in South-East Asia was developed by Nanyang Technological University and Universiti Sains Malaysia and includes approximately 192 hours of Mandarin-English code-switching speech from 156 speakers with associated transcripts. |
| LDC2017S11 | Metalogue Multi-Issue Bargaining Dialogue | Metalogue Multi-Issue Bargaining Dialogue was developed by the Metalogue Consortium under the European Community's Seventh Framework Programme for Research and Technological Development. This release consists of approximately 2.5 hours of semantically annotated English dialogue data that includes speech and transcripts. |
| LDC2016S11 | Multi-Language Conversational Telephone Speech 2011 -- Slavic Group | Multi-Language Conversational Telephone Speech 2011 -- Slavic Group was developed by LDC and is comprised of approximately 60 hours of telephone speech in Polish, Russian and Ukrainian. The data was collected to support research and technology evaluation in automatic language identification, specifically language pair discrimination for closely related languages/dialects. Portions of these telephone calls were used in the NIST 2011 Language Recognition Evaluation. |
| LDC2017S09 | Multi-Language Conversational Telephone Speech 2011 -- Turkish | Multi-Language Conversational Telephone Speech 2011 -- Turkish was developed by LDC and is comprised of approximately 18 hours of telephone speech in Turkish. The data was collected primarily to support research and technology evaluation in automatic language identification, specifically language pair discrimination for closely related languages/dialects. |
| LDC2004S09 | NIST Meeting Pilot Corpus Speech | The audio data included in this corpus was collected in the NIST Meeting Data Collection Laboratory for the NIST Automatic Meeting Recognition Project. The corresponding transcripts are available as the NIST Meeting Pilot Corpus Transcripts and Metadata (LDC2004T13), while the video files will be published later as NIST Meeting Pilot Corpus Video. For more information regarding the data collection conditions, meeting scenarios, transcripts, speaker information, recording logs, errata, and other ancillary data for the corpus, please consult the NIST project website for this corpus. |
| LDC2017S04 | Noisy TIMIT Speech | Noisy TIMIT Speech was developed by the Florida Institute of Technology and contains approximately 322 hours of speech from the TIMIT Acoustic-Phonetic Continuous Speech Corpus (LDC93S1) modified with different additive noise levels. Only the audio has been modified; the original arrangement of the TIMIT corpus is still as described by the TIMIT documentation. |
| LDC2015S08 | The Walking Around Corpus | The Walking Around Corpus was developed by Stony Brook University and is comprised of approximately 33 hours of navigational telephone dialogues from 72 speakers (36 speaker pairs). Participants were Stony Brook University students who identified themselves as native English speakers. |
| LDC2012S02 | TORGO Database of Dysarthric Articulation | TORGO contains approximately 23 hours of English speech data, accompanying transcripts and documentation from 8 speakers (5 males, 3 females) with cerebral palsy or amyotrophic lateral sclerosis and from 7 speakers (4 males, 3 females) from a non-dysarthric control group. |
| LDC2014S04 | USC-SFI MALACH Interviews and Transcripts Czech | USC-SFI MALACH Interviews and Transcripts Czech was developed by The University of Southern California Shoah Foundation Institute (USC-SFI) and the University of West Bohemia as part of the MALACH (Multilingual Access to Large Spoken ArCHives) Project. It contains approximately 229 hours of interviews from 420 interviewees along with transcripts and other documentation. |
Portions © 2017 Trustees of the University of Pennsylvania
Authors
- Linguistic Data Consortium
Introduction
LDC (Linguistic Data Consortium) Spoken Language Sampler - Third Release contains samples from 20 different corpora published by LDC between 1996 and 2015.
LDC distributes a wide and growing assortment of resources for researchers, engineers and educators whose work is concerned with human languages. Historically, most linguistic resources were not generally available to interested researchers but were restricted to single laboratories or to a limited number of users. Inspired by the success of selected readily-available and well-known data sets, such as the Brown University text corpus, LDC was founded in 1992 to provide a new mechanism for large-scale corpus development and resource sharing. With the support of its members, LDC provides critical services to the language research community that include: maintaining the LDC data archives, producing and distributing data via media or web download, negotiating intellectual property agreements with potential information providers and maintaining relations with other like-minded groups around the world.
Resources available from LDC include speech, text, video data and lexicons in multiple languages, as well as software tools to facilitate the use of corpus materials. For a complete view of LDC's publications, browse the Catalog.
The sampler is available as a free download.
Data
The LDC Spoken Language Sampler - Third Release provides speech and transcript samples and is designed to illustrate the variety and breadth of the speech-related resources available from the LDC Catalog. The sound files included in this release are excerpts that have been modified in various ways relative to the original data as published by LDC:
- Most excerpts are truncated to be much shorter than the original files, typically between 1.5 and 2 minutes.
- Signal amplitude has been adjusted where necessary to normalize playback volume.
- Some corpora are published in compressed form, but all samples here are uncompressed.
- Some text files are presented as images to ensure foreign character sets display properly.
- Some publications use the NIST SPHERE file format for audio data, but the audio files in this sampler are in MS-WAV (RIFF) format for compatibility with typical browser audio utilities. FLAC files have likewise been expanded to WAV.
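As a rough illustration of the first two modifications in the list above (truncation and volume normalization), the sketch below shortens a 16-bit PCM WAV file and scales it to a target peak level. It is not LDC's actual procedure: the function name `make_excerpt`, the 120-second limit and the 0.9 peak target are assumptions chosen for the example, and format conversion from SPHERE or FLAC is not covered.

```python
import array
import sys
import wave


def make_excerpt(src_path, dst_path, max_seconds=120.0, target_peak=0.9):
    """Truncate a 16-bit PCM WAV file and peak-normalize it.

    Hypothetical helper for illustration; parameter defaults are assumptions.
    Returns the number of frames written.
    """
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        if params.sampwidth != 2:
            raise ValueError("this sketch handles 16-bit PCM only")
        # Truncation: read at most max_seconds worth of frames.
        n_frames = min(params.nframes, int(max_seconds * params.framerate))
        raw = src.readframes(n_frames)

    samples = array.array("h")
    samples.frombytes(raw)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian

    # Peak normalization: scale so the loudest sample hits target_peak
    # of full scale, clamping to the valid 16-bit range.
    peak = max((abs(s) for s in samples), default=0)
    if peak > 0:
        gain = target_peak * 32767 / peak
        samples = array.array(
            "h",
            (max(-32768, min(32767, round(s * gain))) for s in samples),
        )

    if sys.byteorder == "big":
        samples.byteswap()
    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(params.nchannels)
        dst.setsampwidth(2)
        dst.setframerate(params.framerate)
        dst.writeframes(samples.tobytes())
    return n_frames
```

Calling `make_excerpt("full.wav", "sample.wav")` with placeholder file names would write an excerpt of at most two minutes. Peak normalization is only one way to even out playback volume; a loudness-based measure could be substituted.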
The link for the catalog number takes you to the catalog entry.
| LDC2014S06 | 2009 NIST Language Recognition Evaluation Test Set | The 2009 evaluation contains approximately 215 hours of conversational telephone speech and radio broadcast conversation collected by LDC in the following 23 languages and dialects: Amharic, Bosnian, Cantonese, Creole (Haitian), Croatian, Dari, English (American), English (Indian), Farsi, French, Georgian, Hausa, Hindi, Korean, Mandarin, Pashto, Portuguese, Russian, Spanish, Turkish, Ukrainian, Urdu and Vietnamese. |
| LDC2014S01 | CALLFRIEND Farsi Second Edition Speech | CALLFRIEND Farsi Second Edition Speech was developed by LDC and consists of approximately 42 hours of telephone conversation (100 recordings) among native Farsi speakers. The CALLFRIEND project supported the development of language identification technology. Each CALLFRIEND corpus consists of unscripted telephone conversations lasting between 5 and 30 minutes. |
| LDC96S37 | CALLHOME Japanese | A corpus of 120 unscripted telephone conversations between native Japanese speakers and a corpus of associated transcripts. |
| LDC2013S09 | CSC Deceptive Speech | CSC Deceptive Speech was developed by Columbia University, SRI International and University of Colorado Boulder. It consists of 32 hours of audio interviews from 32 native speakers of Standard American English (16 male, 16 female) recruited from the Columbia University student population and the community. The purpose of the study was to distinguish deceptive speech from non-deceptive speech using machine learning techniques on extracted features from the corpus. |
| LDC2007S18 | CSLU Kids' Speech | Developed at Oregon State University's Center for Spoken Language Understanding, this corpus is a collection of spontaneous and prompted speech from 1100 children from Kindergarten through Grade 10. |
| LDC2010S01 | Fisher Spanish Speech | Fisher Spanish Speech consists of audio files covering roughly 163 hours of telephone speech from 136 native Caribbean Spanish and non-Caribbean Spanish speakers. |
| LDC2014S02 | King Saud University Arabic Speech Database | King Saud University Arabic Speech Database contains 590 hours of recorded Arabic speech from 269 male and female Saudi and non-Saudi speakers. The utterances include read and spontaneous speech recorded in quiet and noisy environments. The recordings were collected via different microphones and a mobile phone and averaged between 16 and 19 minutes. |
| LDC2003S07 | Korean Telephone Conversations Complete | The Korean telephone conversations were originally recorded as part of the CALLFRIEND project. Korean Telephone Conversations Speech consists of 100 telephone conversations, 49 of which were published in 1996 as CALLFRIEND Korean, while the remaining 51 are previously unexposed calls. Korean Telephone Conversations Transcripts (LDC2003T08) consists of 100 text files, totaling approximately 190K words and 25K unique words. All files are in Korean orthography: orthographic Korean characters are in Hangul, encoded in the KSC5601 (Wansung) system. The complete data set also includes a lexicon (LDC2003L02). |
| LDC2012S04 | Malto Speech and Transcripts | Malto Speech and Transcripts contains approximately 8 hours of Malto speech data collected between 2005 and 2009 from 27 speakers (22 males, 5 females). Also included are accompanying transcripts, English translations and glosses for 6 hours of the collection. Malto is principally spoken in northeastern India and Bangladesh. |
| LDC2015S05 | Mandarin Chinese Phonetic Segmentation and Tone | Mandarin Chinese Phonetic Segmentation and Tone was developed by LDC and contains 7,849 Mandarin Chinese "utterances" and their phonetic segmentation and tone labels separated into training and test sets. The utterances were derived from 1997 Mandarin Broadcast News Speech and Transcripts (HUB4-NE) (LDC98S73 and LDC98T24, respectively). That collection consists of approximately 30 hours of Chinese broadcast news recordings from Voice of America, China Central TV and KAZN-AM, a commercial radio station based in Los Angeles, CA. This corpus was developed to investigate the use of phone boundary models on forced alignment in Mandarin Chinese. |
| LDC2015S04 | Mandarin-English Code-Switching in South-East Asia | Mandarin-English Code-Switching in South-East Asia was developed by Nanyang Technological University and Universiti Sains Malaysia and includes approximately 192 hours of Mandarin-English code-switching speech from 156 speakers with associated transcripts. |
| LDC2013S03 | Mixer 6 Speech | Mixer 6 Speech was developed by LDC and is comprised of 15,863 hours of telephone speech, interviews and transcript readings from 594 distinct native English speakers. This material was collected by LDC in 2009 and 2010 as part of the Mixer project, specifically phase 6, the focus of which was on native American English speakers local to the Philadelphia area. |
| LDC2014S03 | Multi-Channel WSJ Audio | Multi-Channel WSJ Audio was developed by the Centre for Speech Technology Research at The University of Edinburgh and contains approximately 100 hours of recorded speech from 45 British English speakers. Participants read Wall Street Journal texts published in 1987-1989 in three recording scenarios: a single stationary speaker, two stationary overlapping speakers and one single moving speaker. |
| LDC2004S09 | NIST Meeting Pilot Corpus Speech | This data set contains speech and transcriptions from topical discussions in meeting settings, including complete descriptive metadata and detailed descriptions of the physical environment in which the discussions took place. |
| LDC2015S02 | RATS Speech Activity Detection | RATS Speech Activity Detection was developed by LDC and is comprised of approximately 3,000 hours of Levantine Arabic, English, Farsi, Pashto, and Urdu conversational telephone speech with automatic and manual annotation of speech segments. The corpus was created to provide training, development and initial test sets for the Speech Activity Detection (SAD) task in the DARPA RATS (Robust Automatic Transcription of Speech) program. |
| LDC2015S03 | The Subglottal Resonances Database | The Subglottal Resonances Database was developed by Washington University and University of California Los Angeles and consists of 45 hours of simultaneous microphone and subglottal accelerometer recordings of 25 adult male and 25 adult female speakers of American English between 22 and 25 years of age. |
| LDC2012S02 | TORGO Database of Dysarthric Articulation | TORGO contains approximately 23 hours of English speech data, accompanying transcripts and documentation from 8 speakers (5 males, 3 females) with cerebral palsy or amyotrophic lateral sclerosis and from 7 speakers (4 males, 3 females) from a non-dysarthric control group. |
| LDC2012S06 | Turkish Broadcast News Speech and Transcripts | Turkish Broadcast News Speech and Transcripts contains approximately 130 hours of Voice of America Turkish radio broadcasts and corresponding transcripts. |
| LDC2014S08 | United Nations Proceedings Speech | United Nations Proceedings Speech was developed by the United Nations (UN) and contains approximately 8,500 hours of recorded proceedings in the six official UN languages, Arabic, Chinese, English, French, Russian and Spanish. The data was recorded in 2009-2012 from sessions 64-66 of the General Assembly and First Committee (Disarmament and International Security), and meetings 6434-6763 of the Security Council. |
| LDC2014S04 | USC-SFI MALACH Interviews and Transcripts Czech | USC-SFI MALACH Interviews and Transcripts Czech was developed by The University of Southern California Shoah Foundation Institute (USC-SFI) and the University of West Bohemia as part of the MALACH (Multilingual Access to Large Spoken ArCHives) Project. It contains approximately 229 hours of interviews from 420 interviewees along with transcripts and other documentation. |
Portions © 2015 Trustees of the University of Pennsylvania
Authors
- Linguistic Data Consortium