Automated Author Profile

Linguistic Data Consortium

Current S-Index

28.4

Sum of Dataset Indices for all datasets

Average Dataset Index per Dataset

0.7

Average Dataset Index per dataset

Total Datasets

43

Total datasets for this author

Average FAIR Score

31.8%

Average FAIR Score per dataset

Total Citations

7

Total citations to the author's datasets

Total Mentions

0

Total mentions of the author's datasets
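
The summary figures above can be cross-checked against one another. The following is a minimal sketch, assuming the Average Dataset Index is simply the S-Index divided by the number of datasets and displayed to one decimal place (the page does not state the rounding rule, and the function name is illustrative):

```python
def average_dataset_index(s_index: float, total_datasets: int) -> float:
    """Average Dataset Index, assuming it equals the S-Index divided by the dataset count."""
    if total_datasets <= 0:
        raise ValueError("total_datasets must be positive")
    return s_index / total_datasets


# Values reported on this profile: S-Index 28.4 across 43 datasets.
average = average_dataset_index(28.4, 43)
print(f"{average:.1f}")  # prints 0.7
```

Under that assumption, the reported 0.7 is consistent with the reported S-Index and dataset count.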

S-Index Interpretation

S-Index Over Time

Cumulative Citations Over Time

Cumulative Mentions Over Time

Datasets

LDC Spoken Language Sampler - Fifth Release

Introduction


LDC (Linguistic Data Consortium) Spoken Language Sampler - Fifth Release contains samples from 19 corpora published by LDC between 1996 and 2019.


LDC distributes a wide and growing assortment of resources for researchers, engineers and educators whose work is concerned with human languages. Historically, most linguistic resources were not generally available to interested researchers but were restricted to single laboratories or to a limited number of users. Inspired by the success of selected readily-available and well-known data sets, such as the Brown University text corpus, LDC was founded in 1992 to provide a new mechanism for large-scale corpus development and resource sharing. With the support of its members, LDC provides critical services to the language research community that include: maintaining the LDC data archives, producing and distributing data via media or web download, negotiating intellectual property agreements with potential information providers and maintaining relations with other like-minded groups around the world.


Resources available from LDC include speech, text, video and lexicons in multiple languages, as well as software tools to facilitate the use of corpus materials. For a complete view of LDC's publications, browse the Catalog.


The sampler is available as a free download.


Data


The LDC Spoken Language Sampler - Fifth Release provides speech and transcript samples and is designed to illustrate the variety and breadth of the speech-related resources available from the LDC Catalog. The sound files included in this release are excerpts that have been modified in various ways relative to the original data as published by LDC:



  • Most excerpts are truncated to be much shorter than the original files, typically about 2 minutes. Samples shorter than this typically represent the entirety of a single file.

  • Signal amplitude has been adjusted where necessary to normalize playback volume.

  • Some corpora are published in compressed form, but all samples here are uncompressed.

  • Some text files are presented as images to ensure foreign character sets display properly.
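
The truncation and volume adjustment listed above can be sketched as follows. This is only an illustration: the source does not say which normalization method LDC used, so the example assumes simple peak normalization of mono 16-bit samples, and `make_excerpt`, `max_seconds` and `target_peak` are hypothetical names and defaults.

```python
from array import array


def make_excerpt(samples: array, sample_rate: int,
                 max_seconds: float = 120.0, target_peak: int = 29491) -> array:
    """Truncate mono 16-bit samples, then scale so the loudest sample reaches target_peak.

    target_peak defaults to roughly 90% of full scale (an assumed value).
    """
    excerpt = samples[: int(sample_rate * max_seconds)]
    peak = max((abs(s) for s in excerpt), default=0)
    if peak == 0:
        return array("h", excerpt)  # silence or empty input: nothing to scale
    gain = target_peak / peak
    # Clamp after scaling so rounding can never overflow the 16-bit range.
    return array("h", (max(-32768, min(32767, round(s * gain))) for s in excerpt))


# Tiny synthetic signal at 2 samples per second, cut to 1.5 seconds (3 samples).
demo = make_excerpt(array("h", [0, 1000, -2000, 500]), sample_rate=2, max_seconds=1.5)
```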


In the table below, the link for the catalog number takes you to the catalog entry for that corpus.

Catalog number | Title | Description
LDC2018S06 | 2011 NIST Language Recognition Evaluation Test Set | 2011 NIST Language Recognition Evaluation Test Set contains selected training data and the evaluation test set for the 2011 NIST Language Recognition Evaluation. It consists of approximately 204 hours of conversational telephone speech and broadcast audio collected by the Linguistic Data Consortium (LDC) in the following 24 languages and dialects: Arabic (Iraqi), Arabic (Levantine), Arabic (Maghrebi), Arabic (Standard), Bengali, Czech, Dari, English (American), English (Indian), Farsi, Hindi, Lao, Mandarin, Punjabi, Pashto, Polish, Russian, Slovak, Spanish, Tamil, Thai, Turkish, Ukrainian and Urdu.
LDC2018S14 | AISHELL-1 | AISHELL-1 contains approximately 520 hours of Chinese Mandarin speech from 400 speakers recorded simultaneously on three different devices with associated transcripts. The goal of the collection was to support speech recognition system development in domains such as smart homes, autonomous driving, entertainment, finance, and science and technology.
LDC2018S15 | Avatar Education Portuguese | Avatar Education Portuguese contains approximately 80 minutes of Brazilian Portuguese microphone speech with phonetic and orthographic transcriptions. The data was developed for Avatar Education, an animated virtual assistant designed to enhance communication and interaction in educational contexts, such as online learning.
LDC96S60 | CALLFRIEND Vietnamese | CALLFRIEND Vietnamese consists of approximately 60 unscripted telephone conversations between native speakers of Vietnamese. Each conversation lasted between 5 and 30 minutes. The corpus also includes documentation describing speaker information (sex, age, education, callee telephone number) and call information (channel quality, number of speakers).
LDC2019S07 | CIEMPIESS Experimentation | CIEMPIESS (Corpus de Investigación en Español de México del Posgrado de Ingeniería Eléctrica y Servicio Social) Experimentation was developed at the National Autonomous University of Mexico (UNAM) and consists of approximately 22 hours of Mexican Spanish broadcast and read speech with associated transcripts. The goal of this work was to create acoustic models for automatic speech recognition.
LDC97S63 | The CMU Kids Corpus | The CMU Kids Corpus was developed in 1995-1996 and is a database of sentences read aloud by 76 children, totaling 5,180 utterances. This data set was designed as a training set of children's speech for the SPHINX II automatic speech recognizer in the LISTEN project at Carnegie Mellon University.
LDC2008S01 | CSLU: Portland Cellular Telephone Speech Version 1.3 | Created by the Center for Spoken Language Understanding (CSLU) at Oregon Health and Science University, CSLU: Portland Cellular Telephone Speech Version 1.3 is a collection of cellular telephone speech (7,571 utterances) and corresponding orthographic and phonetic transcriptions.
LDC2018S01 | DIRHA English WSJ Audio | DIRHA English WSJ Audio is comprised of approximately 85 hours of real and simulated read speech by six native American English speakers. It was developed as part of the Distant-Speech Interaction for Robust Home Applications (DIRHA) Project, which addressed natural spontaneous speech interaction with distant microphones in a domestic environment.
LDC2019S14 | The DKU-JNU-EMA Electromagnetic Articulography Database | The DKU-JNU-EMA Electromagnetic Articulography Database was developed by Duke Kunshan University and Jinan University and contains approximately 10 hours of articulography and speech data in Mandarin, Cantonese, Hakka, and Teochew Chinese from two to seven native speakers for each dialect.
LDC2002S28 | Emotional Prosody Speech and Transcripts | Emotional Prosody Speech and Transcripts was developed by LDC and contains audio recordings and corresponding transcripts, designed to support research in emotional prosody and collected over an eight-month period in 2000-2001. The recordings consist of professional actors reading a series of semantically neutral utterances (dates and numbers) spanning 14 distinct emotional categories.
LDC2019S09 | First DIHARD Challenge Development - Eight Sources | First DIHARD Challenge Development - Eight Sources was developed by LDC and contains approximately 17 hours of English and Chinese speech data along with corresponding annotations used in support of the First DIHARD Challenge. This release, when combined with First DIHARD Challenge Development - SEEDLingS (LDC2019S10), contains the development set audio data and annotation (diarization, segmentation) as well as the official scoring tool.
LDC2017S19 | IARPA Babel Zulu Language Pack IARPA-babel206b-v0.1e | IARPA Babel Zulu Language Pack IARPA-babel206b-v0.1e was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 211 hours of Zulu conversational and scripted telephone speech collected in 2012 and 2013 along with corresponding transcripts.
LDC2004S02 | ICSI Meeting Speech | ICSI Meeting Speech contains approximately 72 hours of speech from 53 unique speakers in 75 meetings collected at Berkeley’s International Computer Science Institute (ICSI) in 2000-2002. The recordings were made during regular weekly meetings of various ICSI working teams, including the team working on the ICSI Meeting Project. The speech files range in length from 17 to 103 minutes, but in general are less than one hour each.
LDC2012S04 | Malto Speech and Transcripts | Malto Speech and Transcripts contains approximately 8 hours of Malto speech data collected between 2005 and 2009 from 27 speakers (22 males, 5 females), accompanying transcripts, English translations and glosses for 6 hours of the collection. Speakers were asked to talk about themselves, their lives, rituals and folklore; elicitation interviews were then conducted. The goal of the work was to present the current state and dialectal variation of Malto.
LDC2018S08 | Multi-Language Conversational Telephone Speech 2011 -- Central European | Multi-Language Conversational Telephone Speech 2011 -- Central European was developed by the Linguistic Data Consortium (LDC) and is comprised of approximately 44 hours of telephone speech in two distinct language varieties of Central Europe: Czech and Slovak. The data was collected to support research and technology evaluation in automatic language identification, specifically language pair discrimination for closely related languages/dialects. Portions of these telephone calls were used in the NIST 2011 Language Recognition Evaluation.
LDC2006S13 | N4 NATO Native and Non-Native Speech | N4 NATO Native and Non-Native Speech corpus was developed by the NATO research group on Speech and Language Technology in order to provide a military-oriented database for multilingual and non-native speech processing studies. It consists of 115 native and non-native speakers using NATO English procedure between ships and reading from a text, "The North Wind and the Sun," in both English and the speaker's native language.
LDC2018S10 | RATS Language Identification | RATS Language Identification was developed by the Linguistic Data Consortium (LDC) and is comprised of approximately 5,400 hours of Levantine Arabic, Farsi, Dari, Pashto and Urdu conversational telephone speech with annotation of speech segments. The corpus was created to provide training, development and initial test sets for the Language Identification (LID) task in the DARPA RATS (Robust Automatic Transcription of Speech) program.
LDC2012S06 | Turkish Broadcast News Speech and Transcripts | Turkish Broadcast News Speech and Transcripts was developed by Boğaziçi University, Istanbul, Turkey and contains approximately 130 hours of Voice of America (VOA) Turkish radio broadcasts and corresponding transcripts. This is part of a larger corpus of Turkish broadcast news data collected and transcribed with the goal to facilitate research in Turkish automatic speech recognition and its applications. The VOA material was collected between December 2006 and June 2009 using a PC and TV/radio card setup. The data collected during the period 2006-2008 was recorded from analog FM radio; the 2009 broadcasts were recorded from digital satellite transmissions.
LDC2017S17 | Vehicle City Voices Corpus – Part I | Vehicle City Voices Corpus – Part I was developed at the University of Michigan-Flint, and is an ongoing oral history project and survey of English language variation in Flint, Michigan. It contains approximately 16 hours of speech with corresponding transcripts from 21 interviews of Flint residents conducted between 2012 and 2015. The corpus was designed to provide high-quality recordings for acoustic analysis and to examine narrative structure and discursive construction of individual and collective identity in urban spaces.

Portions © 2019 Trustees of the University of Pennsylvania

Authors

  • Linguistic Data Consortium
0 Citations | 0 Mentions | 35% FAIR | 0.9 Dataset Index
10.35111/mdhj-5p59 | 2019

HUB5 Mandarin Telephone Speech and Transcripts Second Edition

Introduction


HUB5 Mandarin Telephone Speech and Transcripts Second Edition was developed by the Linguistic Data Consortium (LDC) in support of US government projects for language recognition and Large Vocabulary Conversational Speech Recognition (LVCSR). The first edition was released by LDC in two data sets, HUB5 Mandarin Telephone Speech Corpus (LDC98S69) and HUB5 Mandarin Transcripts (LDC98T26). This second edition merges the speech and transcript releases, updates the audio format and adds Pinyin transcripts, forced alignment and updated documentation and metadata.


Data


This release consists of (1) approximately 19 hours of Mandarin speech from 42 unscripted telephone conversations between native speakers of Mandarin from CALLFRIEND Mandarin Chinese-Mainland Dialect (LDC96S55), which has also been released in a second, updated edition (LDC2018S09) and (2) associated transcripts of contiguous 5-30 minute segments from those telephone conversations.


Audio data was collected before July 1997. Participants could speak with a person of their choice on any topic; most called family members and friends. All calls originated in North America. The recorded conversations lasted up to 30 minutes.


The audio data was recorded as 8kHz u-law SPH encoded stereo files with one end of the phone call on each channel. In this release, files were converted to WAV format, and information from the original SPH headers is included with the corpus. SPH files are not included in this second edition.
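
Working with two-channel u-law telephone audio of this kind generally involves two steps: expanding each u-law byte to a linear PCM sample, and separating the interleaved channels so that each side of the call can be handled on its own. The sketch below shows both on raw sample bytes. It uses the standard G.711 mu-law expansion; the function names are illustrative, and it is not a description of the tools LDC actually used.

```python
def ulaw_to_pcm16(byte: int) -> int:
    """Decode one G.711 mu-law byte to a signed 16-bit linear sample."""
    byte = ~byte & 0xFF                  # mu-law bytes are stored bit-inverted
    sign = byte & 0x80                   # top bit: sign
    exponent = (byte >> 4) & 0x07        # next three bits: segment number
    mantissa = byte & 0x0F               # low four bits: step within the segment
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


def split_stereo_ulaw(frames: bytes) -> tuple[list[int], list[int]]:
    """De-interleave stereo mu-law data (one byte per sample) into two PCM channels."""
    first = [ulaw_to_pcm16(b) for b in frames[0::2]]   # samples at even offsets
    second = [ulaw_to_pcm16(b) for b in frames[1::2]]  # samples at odd offsets
    return first, second


# Two stereo frames of synthetic data: (0xFF, 0x80) and (0x00, 0xFF).
channel_a, channel_b = split_stereo_ulaw(bytes([0xFF, 0x80, 0x00, 0xFF]))
```

The input here is the bare sample data, without any SPH or WAV header; reading the header is a separate step.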


Completed calls passed through two human audits. The first audit was conducted to verify that the target language was spoken by the participants and to check the quality of the recordings. The second audit was conducted by a native speaker familiar with Mainland and Taiwan Mandarin dialects to classify the conversations under one of the two categories. Audit information is available in the corpus documentation.


Transcripts were created manually by native Mandarin speakers in the GB2312 encoding scheme. This release adds Pinyin versions of the transcripts in UTF-8 and includes the original transcripts converted to UTF-8. For forced alignment, files were converted to linear-PCM encoding, and the speaker channels were split into separate files to avoid overlap. The aligned files are presented as tab-separated files and as TextGrid files. Alignment data is provided in UTF-8.
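
A conversion from GB2312 to UTF-8 of the kind described can be sketched with Python's built-in codecs. This is a minimal illustration, not LDC's conversion procedure, and `gb2312_to_utf8` is a hypothetical helper name.

```python
def gb2312_to_utf8(raw: bytes) -> bytes:
    """Re-encode GB2312 transcript bytes as UTF-8."""
    # Decoding fails loudly (UnicodeDecodeError) on bytes outside GB2312,
    # which is usually preferable to silently corrupting a transcript.
    return raw.decode("gb2312").encode("utf-8")


# Round trip on a two-character sample string (not taken from the corpus).
legacy_bytes = "你好".encode("gb2312")
utf8_bytes = gb2312_to_utf8(legacy_bytes)
```

If a legacy file contains characters outside strict GB2312, decoding with the superset codecs `gbk` or `gb18030` is a common fallback.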


Samples


Please view the following samples:



Updates


None at this time.


Portions © 1996, 1998, 2018 Trustees of the University of Pennsylvania

Authors

  • Linguistic Data Consortium
0 Citations | 0 Mentions | 35% FAIR | 0.9 Dataset Index
10.35111/4js2-xd38 | 2018

TRAD Arabic-French Parallel Text -- Newswire

Introduction


TRAD Arabic-French Parallel Text -- Newswire was developed by ELDA as part of the PEA-TRAD project. It contains French translations of a subset of approximately 20,000 Arabic words from NIST 2008 Open Machine Translation (OpenMT) Evaluation (LDC2010T21).


The PEA-TRAD project (Translation as a Support for Document Analysis) was supported by the French Ministry of Defense (DGA). Its purpose was to develop speech-to-speech translation technology for multiple languages (e.g., Arabic, Chinese, Pashto) from a variety of domains. ELDA developed several corpora for this effort.


The Linguistic Data Consortium (LDC) has also released the following TRAD corpora:



  • TRAD Chinese-French Parallel Text -- Blog (LDC2018T02)

  • TRAD Arabic-French Parallel Text -- Newsgroup (LDC2018T13)

  • TRAD Chinese-French Parallel Text -- Broadcast News (LDC2018T17)


Data


This release consists of 813 segments (translation units) from 74 documents. The source data is Arabic newswire text collected and translated into English by LDC. Information about the ELDA translation team, translation guidelines and validation results is contained in the documentation accompanying this release.


The Arabic source file contains 19,902 words and the French reference translation contains 29,104 words. The data is presented in two Unicode-encoded XML files along with an associated DTD.
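
Pairing source and reference segments from two such XML files might look like the sketch below. The element and attribute names (`doc`, `docid`, `seg`, `id`) are assumptions chosen for illustration and are not confirmed by this description; the DTD shipped with the corpus is the authority on the real structure. The sample strings are invented.

```python
import xml.etree.ElementTree as ET


def read_segments(xml_text: str) -> dict:
    """Map (document id, segment id) to segment text.

    Assumes <doc docid="..."> elements containing <seg id="..."> elements (hypothetical names).
    """
    root = ET.fromstring(xml_text)
    segments = {}
    for doc in root.iter("doc"):
        for seg in doc.iter("seg"):
            key = (doc.get("docid"), seg.get("id"))
            segments[key] = "".join(seg.itertext()).strip()
    return segments


def pair_segments(source_xml: str, reference_xml: str) -> list:
    """Return (source text, reference text) pairs for segments present in both files."""
    source = read_segments(source_xml)
    reference = read_segments(reference_xml)
    return [(text, reference[key]) for key, text in source.items() if key in reference]


# Invented two-segment example, not taken from the corpus.
SOURCE = '<srcset><doc docid="d1"><seg id="1">مرحبا</seg><seg id="2">شكرا</seg></doc></srcset>'
REFERENCE = '<refset><doc docid="d1"><seg id="1">Bonjour</seg><seg id="2">Merci</seg></doc></refset>'
pairs = pair_segments(SOURCE, REFERENCE)
```

Keying on both document and segment identifiers keeps the pairing correct even if segment numbering restarts in each document.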


Samples


Please view this Arabic sample and French sample.


Updates


None at this time.


Portions © 2007 Agence France Presse, Al-Ahram, Al Hayat, An Nahar, Al Quds-Al Arabi, Asharq Al-Awsat, Assabah, Xinhua News Agency, © 2018 ELDA, © 2007, 2009, 2010, 2018 Trustees of the University of Pennsylvania

Authors

  • Linguistic Data Consortium
  • ELDA
0 Citations | 0 Mentions | 35% FAIR | 0.9 Dataset Index
10.35111/z1wg-9x78 | 2018

TRAD Chinese-French Parallel Text -- Broadcast News

Introduction


TRAD Chinese-French Parallel Text -- Broadcast News was developed by ELDA as part of the PEA-TRAD project. It contains French translations of a subset of approximately 30,000 Chinese characters from GALE Phase 1 Chinese Broadcast News Parallel Text - Part 3 (LDC2008T18).


The PEA-TRAD project (Translation as a Support for Document Analysis) was supported by the French Ministry of Defense (DGA). Its purpose was to develop speech-to-speech translation technology for multiple languages (e.g., Arabic, Chinese, Pashto) from a variety of domains. ELDA developed several corpora for this effort.


The Linguistic Data Consortium (LDC) has also released the following TRAD corpora:



  • TRAD Chinese-French Parallel Text -- Blog (LDC2018T02)

  • TRAD Arabic-French Parallel Text -- Newsgroup (LDC2018T13)

  • TRAD Arabic-French Parallel Text -- Newswire (LDC2018T21)


Data


This release consists of 977 segments (translation units) from 139 documents. The source data is Chinese broadcast news collected and translated into English by LDC for the DARPA GALE (Global Autonomous Language Exploitation) program. Information about the ELDA translation team, translation guidelines and validation results is contained in the documentation accompanying this release.


The Chinese source file contains 33,571 characters and the French reference translation contains 22,424 words. The data is presented in two Unicode-encoded XML files along with an associated DTD.


Samples


Please view this source sample and reference sample.


Updates


None at this time.


Portions © 2005, 2006 China Central TV, © 2005, 2006 Phoenix TV, © 2018 ELDA, © 2005-2006, 2008, 2018 Trustees of the University of Pennsylvania

Authors

  • Linguistic Data Consortium
  • ELDA
0 Citations | 0 Mentions | 35% FAIR | 0.9 Dataset Index
10.35111/7fw4-ev85 | 2018

TRAD Arabic-French Parallel Text -- Newsgroup

Introduction


TRAD Arabic-French Parallel Text -- Newsgroup was developed by ELDA as part of the PEA-TRAD project. It contains French translations of a subset of approximately 10,000 Arabic words from GALE Phase 1 Arabic Newsgroup Parallel Text - Part 1 (LDC2009T03).


The PEA-TRAD project (Translation as a Support for Document Analysis) was supported by the French Ministry of Defense (DGA). Its purpose was to develop speech-to-speech translation technology for multiple languages (e.g., Arabic, Chinese, Pashto) from a variety of domains. ELDA developed several corpora for this effort.


The Linguistic Data Consortium (LDC) has also released the following TRAD corpora:



  • TRAD Chinese-French Parallel Text -- Blog (LDC2018T02)

  • TRAD Chinese-French Parallel Text -- Broadcast News (LDC2018T17)

  • TRAD Arabic-French Parallel Text -- Newswire (LDC2018T21)


Data


This release consists of 398 segments (translation units) from 17 documents. The source data is Arabic newsgroup text collected and translated into English by the Linguistic Data Consortium for the DARPA GALE (Global Autonomous Language Exploitation) program. Information about the ELDA translation team, translation guidelines and validation results is contained in the documentation accompanying this release.


The Arabic source file contains 10,706 words and the French reference translation contains 15,843 words. The data is presented in two Unicode-encoded XML files along with an associated DTD.


Samples


Please view this source sample and reference sample.


Updates


None at this time.


Portions © 2018 ELDA, © 2005-2007, 2009, 2018 Trustees of the University of Pennsylvania

Authors

  • Linguistic Data Consortium
  • ELDA
0 Citations | 0 Mentions | 35% FAIR | 0.9 Dataset Index
10.35111/55s4-ym31 | 2018

2007 CoNLL Shared Task - Arabic & English

Introduction


2007 CoNLL Shared Task - Arabic & English consists of dependency treebanks in two languages used as part of the CoNLL 2007 shared task on multi-lingual dependency parsing and domain adaptation. The languages covered in this release are Arabic and English.


LDC also released the following 2006 & 2007 CoNLL Shared Task corpora:



  • 2007 CoNLL Shared Task - Greek, Hungarian & Italian (LDC2018T07)

  • 2007 CoNLL Shared Task - Basque, Catalan, Czech & Turkish (LDC2018T06)

  • 2006 CoNLL Shared Task - Ten Languages (LDC2015T11)

  • 2006 CoNLL Shared Task - Arabic & Czech (LDC2015T12)


This corpus is cross-listed with ELRA as ELRA-W0123.


The Conference on Computational Natural Language Learning (CoNLL) is accompanied every year by a shared task intended to promote natural language processing applications and evaluate them in a standard setting. In 2006 and 2007, the shared tasks were devoted to the parsing of syntactic dependencies using corpora from up to thirteen languages. The task aimed to define and extend the then-current state of the art in dependency parsing, a technology that complemented previous tasks by producing a different kind of syntactic description of input text. The 2007 shared task added a domain adaptation track for English in addition to the multilingual track. More information about the 2007 shared task is available at the CoNLL Previous Tasks web site.


LDC has released data sets from other CoNLL shared tasks. 2008 CoNLL Shared Task Data (LDC2009T12) contains the English material used in the 2008 shared task which focused on English, employed a unified dependency-based formalism and merged the tasks of syntactic dependency parsing, identifying semantic arguments and labeling them with semantic roles. 2009 CoNLL Shared Task Data Parts 1 and 2 (LDC2012T03 and LDC2012T04) consists of the English, Catalan, Chinese, Czech, German and Spanish resources used in the 2009 task which included a comparison of time and space complexity based on participants' input and learning curve comparison for languages with large datasets. 2015-2016 CoNLL Shared Task (LDC2017T13) contains Chinese and English resources used in the 2015 and 2016 shared tasks on dependency parsing.


Data


The source data in the treebanks in this release consists principally of various texts (e.g., textbooks, news, literature) annotated in dependency format. In general, dependency grammar is based on the idea that the verb is the center of the clause structure and that other units in the sentence are connected to the verb as directed links or dependencies. This is a one-to-one correspondence: for every element in the sentence there is one node in the sentence structure that corresponds to that element. In constituency or phrase structure grammars, on the other hand, clauses are divided into noun phrases and verb phrases and in each sentence, one or more nodes may correspond to one element. The Penn Treebank (LDC99T42) is an example of a constituency or phrase structure approach. All of the data sets in this release are dependency treebanks.
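
The one-head-per-token property described above is easy to see in the tab-separated, ten-column layout used for the CoNLL-X and CoNLL 2007 shared tasks (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL). The sketch below assumes the files in this release follow that layout. The sentence is invented rather than taken from the treebanks, and `Token` and `parse_sentence` are hypothetical names.

```python
from typing import NamedTuple


class Token(NamedTuple):
    token_id: int   # 1-based position in the sentence
    form: str       # word form
    head: int       # token_id of the governing token, 0 for the root
    deprel: str     # dependency relation to the head


def parse_sentence(block: str) -> list:
    """Parse one sentence in the ten-column CoNLL-X layout into Token records."""
    tokens = []
    for line in block.strip().splitlines():
        cols = line.split("\t")
        tokens.append(Token(int(cols[0]), cols[1], int(cols[6]), cols[7]))
    return tokens


# Invented example sentence: "John saw Mary".
SAMPLE = (
    "1\tJohn\t_\tN\tNNP\t_\t2\tSBJ\t_\t_\n"
    "2\tsaw\t_\tV\tVBD\t_\t0\tROOT\t_\t_\n"
    "3\tMary\t_\tN\tNNP\t_\t2\tOBJ\t_\t_\n"
)
tokens = parse_sentence(SAMPLE)
root_forms = [t.form for t in tokens if t.head == 0]
```

Every token carries exactly one HEAD value, so the structure has one node per word, and in this example the verb is the only token whose head is 0.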


The individual data sets are:



Samples


Please view these samples:



Updates


None at this time.


Portions © 2000 Agence France Presse, © 2001 Al Hayat, © 2002 An Nahar, © 1987-1989 Dow Jones & Company, Inc., © 2002 Ummah Press Service, © 2003 Xinhua News Agency, © 1999, 2000-2008, 2018 Trustees of the University of Pennsylvania

Authors

  • Linguistic Data Consortium
0 Citations | 0 Mentions | 35% FAIR | 0.9 Dataset Index
10.35111/nb2s-5a36 | 2018

TRAD Chinese-French Parallel Text -- Blog

Introduction


TRAD Chinese-French Parallel Text -- Blog was developed by ELDA as part of the PEA-TRAD project. It contains French translations of a subset of approximately 10,000 Chinese words from GALE Phase 1 Chinese Blog Parallel Text (LDC2008T06).


The PEA-TRAD project (Translation as a Support for Document Analysis) was supported by the French Ministry of Defense (DGA). Its purpose was to develop speech-to-speech translation technology for multiple languages (e.g., Arabic, Chinese, Pashto) from a variety of domains. ELDA developed several corpora for this effort.


The Linguistic Data Consortium (LDC) has also released the following TRAD corpora:



  • TRAD Arabic-French Parallel Text -- Newsgroup (LDC2018T13)

  • TRAD Chinese-French Parallel Text -- Broadcast News (LDC2018T17)

  • TRAD Arabic-French Parallel Text -- Newswire (LDC2018T21)


Data


This release consists of 444 segments (translation units) from 17 documents. The source data is Chinese blog text collected and translated into English by LDC for the DARPA GALE (Global Autonomous Language Exploitation) program. Information about the ELDA translation team, translation guidelines and validation results is contained in the documentation accompanying this release.


The Chinese source file contains 15,809 characters and the French reference translation contains 11,769 words. The data is presented in two Unicode-encoded XML files along with an associated DTD.


Samples


Please view this source sample and reference sample.


Updates


None at this time.


Portions © 2018 ELDA, © 2005-2007, 2008, 2018 Trustees of the University of Pennsylvania

Authors

  • Linguistic Data Consortium
  • ELDA
0 Citations | 0 Mentions | 35% FAIR | 0.9 Dataset Index
10.35111/n41t-3944 | 2018

ASpIRE Development and Development Test Sets

Introduction


ASpIRE Development and Development Test Sets was developed for the Automatic Speech recognition In Reverberant Environments (ASpIRE) Challenge sponsored by IARPA (the Intelligence Advanced Research Projects Activity). It contains approximately 226 hours of English speech with transcripts and scoring files.


The ASpIRE challenge asked solvers to develop innovative speech recognition systems that could be trained on conversational telephone speech, and yet work well on far-field microphone data from noisy, reverberant rooms. Participants had the opportunity to evaluate their techniques on a common set of challenging data that included significant room noise and reverberation.


Data


The audio data is a subset of Mixer 6 Speech (LDC2013S03), audio recordings of interviews, transcript readings and conversational telephone speech collected by the Linguistic Data Consortium in 2009 and 2010 from native English speakers local to the Philadelphia area. The transcripts were developed by Appen for the ASpIRE challenge.


Data is divided into development and development test sets.


Audio is presented as single channel, 16kHz 16-bit Signed Integer PCM *.wav files. Transcripts are plain text tdf files. Scoring files are also included.
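
A quick way to confirm that a file matches the audio format stated above is to inspect its header with Python's standard `wave` module. This is a minimal sketch; `matches_described_format` is a hypothetical helper, and the demonstration uses an in-memory file of silence rather than corpus audio.

```python
import io
import wave


def matches_described_format(fileobj) -> bool:
    """Return True for single-channel, 16 kHz, 16-bit PCM WAV data."""
    with wave.open(fileobj, "rb") as wav:
        return (
            wav.getnchannels() == 1          # single channel
            and wav.getframerate() == 16000  # 16 kHz sampling rate
            and wav.getsampwidth() == 2      # 2 bytes per sample = 16-bit
        )


def make_silent_wav(sample_rate: int) -> io.BytesIO:
    """Build a short in-memory mono 16-bit WAV of silence for demonstration."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * 160)
    buffer.seek(0)
    return buffer


ok_16k = matches_described_format(make_silent_wav(16000))
ok_8k = matches_described_format(make_silent_wav(8000))
```

The function accepts either a path or an open binary file object, since `wave.open` handles both.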


Samples


Please view this audio sample and transcript sample.


Updates


None at this time.


Portions © 2014 U.S. Government, © 2009-2010, 2013, 2017 Trustees of the University of Pennsylvania

Authors

  • Linguistic Data Consortium
  • Appen Pty Ltd
0 Citations | 0 Mentions | 35% FAIR | 0.9 Dataset Index
10.35111/5893-bd53 | 2017

LDC Spoken Language Sampler - Fourth Release

Introduction


LDC (Linguistic Data Consortium) Spoken Language Sampler - Fourth Release, LDC catalog number LDC2017S16 and ISBN 1-58563-811-0, contains samples from 18 different corpora published by LDC between 1996 and 2017.


LDC distributes a wide and growing assortment of resources for researchers, engineers and educators whose work is concerned with human languages. Historically, most linguistic resources were not generally available to interested researchers but were restricted to single laboratories or to a limited number of users. Inspired by the success of selected readily-available and well-known data sets, such as the Brown University text corpus, LDC was founded in 1992 to provide a new mechanism for large-scale corpus development and resource sharing. With the support of its members, LDC provides critical services to the language research community that include: maintaining the LDC data archives, producing and distributing data via media or web download, negotiating intellectual property agreements with potential information providers and maintaining relations with other like-minded groups around the world.


Resources available from LDC include speech, text, video data and lexicons in multiple languages, as well as software tools to facilitate the use of corpus materials. For a complete view of LDC's publications, browse the Catalog.


The sampler is available as a free download.


Data


The LDC Spoken Language Sampler - Fourth Release provides speech and transcript samples and is designed to illustrate the variety and breadth of the speech-related resources available from the LDC Catalog. The sound files included in this release are excerpts that have been modified in various ways relative to the original data as published by LDC:



  • Most excerpts are truncated to be much shorter than the original files, typically between 1.5 and 2 minutes.

  • Signal amplitude has been adjusted where necessary to normalize playback volume.

  • Some corpora are published in compressed form, but all samples here are uncompressed.

  • Some text files are presented as images to ensure foreign character sets display properly.

  • In some publications, NIST SPHERE file format is used for audio data, but the audio files in this sampler are MS-WAV/audio (RIFF) file format for compatibility with typical browser audio utilities. FLAC files have been expanded into their wav form as well.


The link for the catalog number takes you to the catalog entry, and the link for the title takes you to further documentation for that corpus.

Catalog number | Title | Description
LDC2017S06 | 2010 NIST Speaker Recognition Evaluation Test Set | 2010 NIST Speaker Recognition Evaluation Test Set was developed by LDC and NIST (National Institute of Standards and Technology). It contains 2,255 hours of American English telephone speech and interview speech recorded over a microphone channel used as test data in the NIST-sponsored 2010 Speaker Recognition Evaluation (SRE).
LDC2015S10 | Arabic Learner Corpus | Arabic Learner Corpus was developed at the University of Leeds and consists of written essays and spoken recordings by Arabic learners collected in Saudi Arabia in 2012 and 2013. The corpus includes 282,732 words in 1,585 materials, produced by 942 students from 67 nationalities studying at pre-university and university levels. The average length of an essay is 178 words.
LDC2015S12 | Articulation Index LSCP | Articulation Index LSCP was developed by researchers at Laboratoire de Sciences Cognitives et Psycholinguistique (LSCP), Ecole Normale Supérieure. It revises and enhances a subset of Articulation Index (AIC) (LDC2005S22), a corpus of persons speaking English syllables. Changes include the addition of forced alignment to sound files, time alignment of syllable utterances and format conversions.
LDC2014S01 | CALLFRIEND Farsi Second Edition Speech | CALLFRIEND Farsi Second Edition Speech was developed by LDC and consists of approximately 42 hours of telephone conversation (100 recordings) among native Farsi speakers. The CALLFRIEND project supported the development of language identification technology. Each CALLFRIEND corpus consists of unscripted telephone conversations lasting between 5 and 30 minutes.
LDC2016S04 | CHM150 | CHM150 (Corpus Hecho en México 150) was developed by the Speech Processing Laboratory of the Faculty of Engineering at the National Autonomous University of Mexico (UNAM) and consists of approximately 1.63 hours of Mexican Spanish speech, associated transcripts, and speaker metadata. The goal of this work was to support spoken term detection and forensic speaker identification.
LDC2007S18 | CSLU: Kids' Speech Version 1.1 | CSLU: Kids' Speech Version 1.1 is a collection of spontaneous and prompted speech from 1100 children between Kindergarten and Grade 10 in the Forest Grove School District in Oregon. Approximately 100 children at each grade level read around 60 items from a total list of 319 phonetically-balanced but simple words, sentences or digit strings. Each utterance of spontaneous speech begins with a recitation of the alphabet and contains a monologue of about one minute in length. This release consists of 1017 files containing approximately 8-10 minutes of speech per speaker. Corresponding word-level transcriptions are also included.
LDC2016S12 | IARPA Babel Georgian Language Pack IARPA-babel404b-v1.0a | IARPA Babel Georgian Language Pack IARPA-babel404b-v1.0a was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 190 hours of Georgian conversational and scripted telephone speech collected in 2014-2015 along with corresponding transcripts.
LDC2003S07 | Korean Telephone Conversations Complete (S), (T), (L) | The Korean telephone conversations were originally recorded as part of the CALLFRIEND project. Korean Telephone Conversations Speech consists of 100 telephone conversations, 49 of which were published in 1996 as CALLFRIEND Korean, while the remaining 51 are previously unexposed calls. Korean Telephone Conversations Transcripts consists of 100 text files, totaling approximately 190K words and 25K unique words. All files are in Korean orthography: orthographic Korean characters are in Hangul, encoded in KSC5601 (Wansung) system. The complete set of Korean Telephone Conversations also includes a transcript (LDC2003T08) and lexicon (LDC2003L02) corpus.
LDC2012S04 | Malto Speech and Transcripts | Malto Speech and Transcripts contains approximately 8 hours of Malto speech data collected between 2005 and 2009 from 27 speakers (22 males, 5 females), accompanying transcripts, English translations and glosses for 6 hours of the collection. Speakers were asked to talk about themselves, their lives, rituals and folklore; elicitation interviews were then conducted. The goal of the work was to present the current state and dialectal variation of Malto.
LDC2015S04Mandarin-English Code-Switching in South-East AsiaMandarin-English Code-Switching in South-East Asia was developed by Nanyang Technological University and Universiti Sains Malaysia and includes approximately 192 hours of Mandarin-English code-switching speech from 156 speakers with associated transcripts.
LDC2017S11Metalogue Multi-Issue Bargaining DialogueMetalogue Multi-Issue Bargaining Dialogue was developed by the Metalogue Consortium under the European Community's Seventh Framework Programme for Research and Technological Development. This release consists of approximately 2.5 hours of semantically annotated English dialogue data that includes speech and transcripts.
LDC2016S11Multi-Language Conversational Telephone Speech 2011 -- Slavic GroupMulti-Language Conversational Telephone Speech 2011 – Slavic Group was developed by LDC and is comprised of approximately 60 hours of telephone speech in Polish, Russian and Ukrainian. The data was collected to support research and technology evaluation in automatic language identification, specifically language pair discrimination for closely related languages/dialects. Portions of these telephone calls were used in the NIST 2011 Language Recognition Evaluation.
LDC2017S09Multi-Language Conversational Telephone Speech 2011Multi-Language Conversational Telephone Speech 2011 -- Turkish was developed by LDC and is comprised of approximately 18 hours of telephone speech in Turkish. The data was collected primarily to support research and technology evaluation in automatic language identification, specifically language pair discrimination for closely related languages/dialects.
LDC2004S09NIST Meeting Pilot Corpus SpeechThe audio data included in this corpus was collected in the NIST Meeting Data Collection Laboratory for the NIST Automatic Meeting Recognition Project. The corresponding transcripts are available as the NIST Meeting Pilot Corpus Transcripts and Metadata (LDC2004T13), while the video files will be published later as NIST Meeting Pilot Corpus Video. For more information regarding the data collection conditions, meeting scenarios, transcripts, speaker information, recording logs, errata, and other ancillary data for the corpus, please consult the NIST project website for this corpus.
LDC2017S04Noisy TIMIT SpeechNoisy TIMIT Speech was developed by the Florida Institute of Technology and contains approximately 322 hours of speech from the TIMIT Acoustic-Phonetic Continuous Speech Corpus (LDC93S1) modified with different additive noise levels. Only the audio has been modified; the original arrangement of the TIMIT corpus is still as described by the TIMIT documentation.
LDC2015S08The Walking Around CorpusThe Walking Around Corpus was developed by Stony Brook University and is comprised of approximately 33 hours of navigational telephone dialogues from 72 speakers (36 speaker pairs). Participants were Stony Brook University students who identified themselves as native English speakers.
LDC2012S02TORGO Database of Dysarthric ArticulationTORGO contains approximately 23 hours of English speech data, accompanying transcripts and documentation from 8 speakers (5 males, 3 females) with cerebral palsy or amyotrophic lateral sclerosis and from 7 speakers (4 males, 3 females) from a non-dysarthric control group.
LDC2014S04 USC-SFI MALACH Interviews and Transcripts CzechUSC-SFI MALACH Interviews and Transcripts Czech was developed by The University of Southern California Shoah Foundation Institute (USC-SFI) and the University of West Bohemia as part of the MALACH (Multilingual Access to Large Spoken ArCHives) Project. It contains approximately 229 hours of interviews from 420 interviewees along with transcripts and other documentation.

Portions © 2017 Trustees of the University of Pennsylvania

Authors

  • Linguistic Data Consortium
0 Citations, 0 Mentions, 35% FAIR, 0.9 Dataset Index
10.35111/94k3-cf05, 2017

LDC Spoken Language Sampler - Third Release

Introduction


LDC (Linguistic Data Consortium) Spoken Language Sampler - Third Release contains samples from 20 different corpora published by LDC between 1996 and 2015.


LDC distributes a wide and growing assortment of resources for researchers, engineers and educators whose work is concerned with human languages. Historically, most linguistic resources were not generally available to interested researchers but were restricted to single laboratories or to a limited number of users. Inspired by the success of selected readily-available and well-known data sets, such as the Brown University text corpus, LDC was founded in 1992 to provide a new mechanism for large-scale corpus development and resource sharing. With the support of its members, LDC provides critical services to the language research community that include: maintaining the LDC data archives, producing and distributing data via media or web download, negotiating intellectual property agreements with potential information providers and maintaining relations with other like-minded groups around the world.


Resources available from LDC include speech, text, video data and lexicons in multiple languages, as well as software tools to facilitate the use of corpus materials. For a complete view of LDC's publications, browse the Catalog.


The sampler is available as a free download.


Data


The LDC Spoken Language Sampler - Third Release provides speech and transcript samples and is designed to illustrate the variety and breadth of the speech-related resources available from the LDC Catalog. The sound files included in this release are excerpts that have been modified in various ways relative to the original data as published by LDC:



  • Most excerpts are truncated to be much shorter than the original files, typically between 1.5 and 2 minutes.

  • Signal amplitude has been adjusted where necessary to normalize playback volume.

  • Some corpora are published in compressed form, but all samples here are uncompressed.

  • Some text files are presented as images to ensure foreign character sets display properly.

  • Some publications use the NIST SPHERE file format for audio data, but the audio files in this sampler are in MS-WAV (RIFF) format for compatibility with typical browser audio utilities. FLAC files have also been expanded to WAV.
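The first two modifications in the list above (truncation and amplitude adjustment) can be illustrated with a minimal sketch. It does not reproduce LDC's actual processing, which is not described here. The function name `truncate_and_normalize`, the 16-bit mono PCM input, the 120-second cutoff and the 0.9 target peak are all assumptions chosen for illustration.

```python
from array import array


def truncate_and_normalize(samples, sample_rate, max_seconds=120.0, target_peak=0.9):
    """Return a shortened, peak-normalized copy of 16-bit mono PCM samples.

    Hypothetical helper: the cutoff and target peak are illustrative defaults,
    not values taken from the sampler documentation.
    """
    # Keep only the first max_seconds of audio.
    limit = int(sample_rate * max_seconds)
    excerpt = array("h", samples[:limit])

    # Find the loudest sample; pure silence needs no scaling.
    peak = max((abs(s) for s in excerpt), default=0)
    if peak == 0:
        return excerpt

    # Scale so the loudest sample reaches target_peak of full scale,
    # clamping to the signed 16-bit range to guard against overflow.
    gain = target_peak * 32767 / peak
    return array("h", (max(-32768, min(32767, round(s * gain))) for s in excerpt))
```

Peak normalization is only one way to even out playback volume; loudness-based normalization is a common alternative and may behave differently on recordings with isolated loud transients.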


The link for the catalog number takes you to the catalog entry.

LDC2014S06 | 2009 NIST Language Recognition Evaluation Test Set | The 2009 evaluation contains approximately 215 hours of conversational telephone speech and radio broadcast conversation collected by LDC in the following 23 languages and dialects: Amharic, Bosnian, Cantonese, Creole (Haitian), Croatian, Dari, English (American), English (Indian), Farsi, French, Georgian, Hausa, Hindi, Korean, Mandarin, Pashto, Portuguese, Russian, Spanish, Turkish, Ukrainian, Urdu and Vietnamese.
LDC2014S01 | CALLFRIEND Farsi Second Edition Speech | CALLFRIEND Farsi Second Edition Speech was developed by LDC and consists of approximately 42 hours of telephone conversation (100 recordings) among native Farsi speakers. The CALLFRIEND project supported the development of language identification technology. Each CALLFRIEND corpus consists of unscripted telephone conversations lasting between 5 and 30 minutes.
LDC96S37 | CALLHOME Japanese | A corpus of 120 unscripted telephone conversations between native Japanese speakers and a corpus of associated transcripts.
LDC2013S09 | CSC Deceptive Speech | CSC Deceptive Speech was developed by Columbia University, SRI International and University of Colorado Boulder. It consists of 32 hours of audio interviews from 32 native speakers of Standard American English (16 male, 16 female) recruited from the Columbia University student population and the community. The purpose of the study was to distinguish deceptive speech from non-deceptive speech using machine learning techniques on extracted features from the corpus.
LDC2007S18 | CSLU Kids' Speech | Developed at Oregon State University's Center for Spoken Language Understanding, this corpus is a collection of spontaneous and prompted speech from 1100 children from Kindergarten through Grade 10.
LDC2010S01 | Fisher Spanish Speech | Fisher Spanish Speech consists of audio files covering roughly 163 hours of telephone speech from 136 native Caribbean Spanish and non-Caribbean Spanish speakers.
LDC2014S02 | King Saud University Arabic Speech Database | King Saud University Arabic Speech Database contains 590 hours of recorded Arabic speech from 269 male and female Saudi and non-Saudi speakers. The utterances include read and spontaneous speech recorded in quiet and noisy environments. The recordings were collected via different microphones and a mobile phone and averaged between 16 and 19 minutes.
LDC2003S07 | Korean Telephone Conversations Complete | The Korean telephone conversations were originally recorded as part of the CALLFRIEND project. Korean Telephone Conversations Speech consists of 100 telephone conversations, 49 of which were published in 1996 as CALLFRIEND Korean, while the remaining 51 are previously unexposed calls. Korean Telephone Conversations Transcripts (LDC2003T08) consists of 100 text files, totaling approximately 190K words and 25K unique words. All files are in Korean orthography: orthographic Korean characters are in Hangul, encoded in the KSC5601 (Wansung) system. The complete data set also includes a lexicon (LDC2003L02).
LDC2012S04 | Malto Speech and Transcripts | Malto Speech and Transcripts contains approximately 8 hours of Malto speech data collected between 2005 and 2009 from 27 speakers (22 males, 5 females). Also included are accompanying transcripts, English translations and glosses for 6 hours of the collection. Malto is principally spoken in northeastern India and Bangladesh.
LDC2015S05 | Mandarin Chinese Phonetic Segmentation and Tone | Mandarin Chinese Phonetic Segmentation and Tone was developed by LDC and contains 7,849 Mandarin Chinese "utterances" and their phonetic segmentation and tone labels separated into training and test sets. The utterances were derived from 1997 Mandarin Broadcast News Speech and Transcripts (HUB4-NE) (LDC98S73 and LDC98T24, respectively). That collection consists of approximately 30 hours of Chinese broadcast news recordings from Voice of America, China Central TV and KAZN-AM, a commercial radio station based in Los Angeles, CA. This corpus was developed to investigate the use of phone boundary models on forced alignment in Mandarin Chinese.
LDC2015S04 | Mandarin-English Code-Switching in South-East Asia | Mandarin-English Code-Switching in South-East Asia was developed by Nanyang Technological University and Universiti Sains Malaysia and includes approximately 192 hours of Mandarin-English code-switching speech from 156 speakers with associated transcripts.
LDC2013S03 | Mixer 6 Speech | Mixer 6 Speech was developed by LDC and is comprised of 15,863 hours of telephone speech, interviews and transcript readings from 594 distinct native English speakers. This material was collected by LDC in 2009 and 2010 as part of the Mixer project, specifically phase 6, the focus of which was on native American English speakers local to the Philadelphia area.
LDC2014S03 | Multi-Channel WSJ Audio | Multi-Channel WSJ Audio was developed by the Centre for Speech Technology Research at The University of Edinburgh and contains approximately 100 hours of recorded speech from 45 British English speakers. Participants read Wall Street Journal texts published in 1987-1989 in three recording scenarios: a single stationary speaker, two stationary overlapping speakers and one single moving speaker.
LDC2004S09 | NIST Meeting Pilot Corpus Speech | This data set contains speech and transcriptions from topical discussions in meeting settings, including complete descriptive metadata and detailed descriptions of the physical environment in which the discussions took place.
LDC2015S02 | RATS Speech Activity Detection | RATS Speech Activity Detection was developed by LDC and is comprised of approximately 3,000 hours of Levantine Arabic, English, Farsi, Pashto, and Urdu conversational telephone speech with automatic and manual annotation of speech segments. The corpus was created to provide training, development and initial test sets for the Speech Activity Detection (SAD) task in the DARPA RATS (Robust Automatic Transcription of Speech) program.
LDC2015S03 | The Subglottal Resonances Database | The Subglottal Resonances Database was developed by Washington University and University of California Los Angeles and consists of 45 hours of simultaneous microphone and subglottal accelerometer recordings of 25 adult male and 25 adult female speakers of American English between 22 and 25 years of age.
LDC2012S02 | TORGO Database of Dysarthric Articulation | TORGO contains approximately 23 hours of English speech data, accompanying transcripts and documentation from 8 speakers (5 males, 3 females) with cerebral palsy or amyotrophic lateral sclerosis and from 7 speakers (4 males, 3 females) from a non-dysarthric control group.
LDC2012S06 | Turkish Broadcast News Speech and Transcripts | Turkish Broadcast News Speech and Transcripts contains approximately 130 hours of Voice of America Turkish radio broadcasts and corresponding transcripts.
LDC2014S08 | United Nations Proceedings Speech | United Nations Proceedings Speech was developed by the United Nations (UN) and contains approximately 8,500 hours of recorded proceedings in the six official UN languages: Arabic, Chinese, English, French, Russian and Spanish. The data was recorded in 2009-2012 from sessions 64-66 of the General Assembly and First Committee (Disarmament and International Security), and meetings 6434-6763 of the Security Council.
LDC2014S04 | USC-SFI MALACH Interviews and Transcripts Czech | USC-SFI MALACH Interviews and Transcripts Czech was developed by The University of Southern California Shoah Foundation Institute (USC-SFI) and the University of West Bohemia as part of the MALACH (Multilingual Access to Large Spoken ArCHives) Project. It contains approximately 229 hours of interviews from 420 interviewees along with transcripts and other documentation.

Portions © 2015 Trustees of the University of Pennsylvania

Authors

  • Linguistic Data Consortium
0 Citations, 0 Mentions, 35% FAIR, 0.9 Dataset Index
10.35111/b2v8-ej56, 2015