Automated Author Profile
Bernstein, Jared
Current S-Index
Sum of Dataset Indices for all datasets
Average Dataset Index per Dataset
Average Dataset Index per dataset
Total Datasets
Total datasets for this author
Average FAIR Score
Average FAIR Score per dataset
Total Citations
Total citations to the author's datasets
Total Mentions
Total mentions of the author's datasets
S-Index Interpretation
The S-Index (Sharing Index) is a comprehensive metric that represents the cumulative impact of all your datasets. It is calculated as the sum of Dataset Index scores across all your claimed datasets.
What it means:
- A higher S-index indicates greater overall impact of your datasets relative to typical datasets in their fields of research
- The S-Index grows as you add more datasets or as existing datasets gain more citations and mentions
- It provides a single number to track your research data impact over time
Current S-Index: 7.4 (sum of the Dataset Index scores of 6 datasets)
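The sum described above can be sketched in a few lines of Python. The per-dataset scores below are hypothetical placeholders, not this author's actual Dataset Index values, and the function names are illustrative assumptions rather than part of the platform.

```python
from statistics import mean


def s_index(dataset_indices):
    """Return the S-Index: the sum of the Dataset Index scores."""
    return sum(dataset_indices)


def average_dataset_index(dataset_indices):
    """Return the mean Dataset Index per dataset (0.0 if there are none)."""
    return mean(dataset_indices) if dataset_indices else 0.0


# Hypothetical scores for three datasets (illustrative values only).
example_scores = [2.0, 1.5, 0.5]
print(s_index(example_scores))
print(average_dataset_index(example_scores))
```

Because the S-Index is a plain sum, adding a dataset can only raise it or leave it unchanged, provided Dataset Index scores are non-negative (an assumption here), whereas the average can move in either direction.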
[Charts: S-Index Over Time, Cumulative Citations Over Time, Cumulative Mentions Over Time]
Datasets
Introduction
Hispanic-English Database contains approximately 30 hours of English and Spanish conversational and read speech with transcripts (24 hours) and metadata collected from 22 non-native English speakers between 1996 and 1998. The corpus was developed by Entropic Research Laboratory, Inc., a developer of speech recognition and speech synthesis software toolkits that was acquired by Microsoft in 1999.
Participants were adult native speakers of Spanish as spoken in Central America and South America who resided in the Palo Alto, California area, had lived in the United States for at least one year and demonstrated a basic ability to understand, read and speak English. They read a total of 2,200 sentences, 50 each in Spanish and English per speaker. The Spanish sentence prompts were a subset of the materials in LATINO-40 Spanish Read News, and the English sentence prompts were taken from the TIMIT database. Conversations were task-oriented, drawing on exercises similar to those used in English-as-a-second-language instruction and designed to engage the speakers in collaborative, problem-solving activities.
Data
Read speech was recorded on two wideband channels with a Shure SM10A head-mounted microphone in a quiet laboratory environment. The conversational speech was simultaneously recorded on four channels, two of which were used to place phone calls to each subject in two separate offices and to record the incoming speech of the two channels into separate files. The audio was originally saved in the Entropic Audio (ESPS) format using a 16 kHz sampling rate and 16-bit samples. Audio files were converted from the ESPS format to FLAC-compressed .wav files. The ESPS headers were removed and are presented in this release as *.hdr files that include demographic and technical data.
Transcripts were developed with the Entropic Annotator tool and are time-aligned with speaker turns. The transcription conventions were based on those used in the LDC Switchboard and CALLHOME collections. Transcript files are denoted with a .lab extension.
Data files and their corresponding label files are stored in subdirectories named using a speaker-pair id and session number. The first three letters of the speaker-pair id identify the speaker on channel A; the last three letters identify the speaker on channel B. Wideband audio files contain *.wb.flac in their file name, and narrowband audio files contain *.nb.flac.
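The naming conventions above can be applied programmatically. The following is a minimal sketch, assuming a six-letter speaker-pair id; the source does not specify how the session number is attached to the directory name, so that part is left out, and the function names and example strings are hypothetical.

```python
from pathlib import PurePosixPath
from typing import Optional, Tuple


def band_of(filename: str) -> Optional[str]:
    """Classify an audio file as wideband or narrowband from its name."""
    name = PurePosixPath(filename).name
    if name.endswith(".wb.flac"):
        return "wideband"
    if name.endswith(".nb.flac"):
        return "narrowband"
    return None  # not an audio file under this convention (e.g. .lab, .hdr)


def split_speaker_pair(pair_id: str) -> Tuple[str, str]:
    """Split a six-letter speaker-pair id into (channel A, channel B) speakers."""
    if len(pair_id) != 6:
        raise ValueError("expected a six-letter speaker-pair id")
    return pair_id[:3], pair_id[3:]


# Hypothetical names, for illustration only.
print(band_of("abcxyz/example.wb.flac"))
print(split_speaker_pair("abcxyz"))
```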
Updates
None at this time.
Portions © 2014 Trustees of the University of Pennsylvania
Authors
- Byrne, William
- Knodt, Eva
- Bernstein, Jared
- Emami, Farzhad
This database provides a set of recordings for training speaker-independent systems that recognize Latin-American Spanish. It was recorded by the Entropic Research Laboratory from July 11 through September 9, 1994, in Palo Alto, California. The database comprises about 5,000 utterance files: about 125 utterances from each of 40 different speakers, 20 male and 20 female.
The recordings were all made with a high-quality, head-mounted microphone (Shure SM10A) in an office environment, and the utterances were digitized in 16-bit samples at 16 kHz.
The Linguistic Data Consortium provided 13,000 sentences that had been selected from Latin American newspaper text by people working at Texas Instruments. The sentences are all shorter than 80 characters and are not grouped into larger constituents such as paragraphs or stories. The speech files have NIST SPHERE headers and are presented in compressed format, using the shorten speech compression algorithm developed by Tony Robinson at Cambridge University, as implemented in the NIST SPHERE software package. This software is included with the data.
Portions © 1995 Trustees of the University of Pennsylvania
Authors
- Bernstein, Jared
- Grundy, Bill
- Rosenfeld, Elizabeth
- Najmi, Amir
- Mankoski, Psi
Introduction
MACROPHONE consists of approximately 200,000 utterances by 5,000 speakers. It is designed to provide material sufficient and suitable for research, development and evaluation of automatic speech recognition technology for common telephone applications, such as shopping, transportation, database access and autodialing. In addition to application-oriented phrases and numerous digit strings, seven sentences are spoken by each talker to provide ensemble phoneme, diphone and triphone coverage of the language. The spoken material also refers to times, locations, monetary amounts, spellings and interactive operations.
Data
The utterances were collected automatically over the telephone network by recording directly from a T1 connection in 8 kHz, 8-bit mu-law format. The participants, roughly equal numbers of males and females, were solicited by a marketing firm from all regions of the United States. They ranged in age from the teens to the seventies and represented a broad range of educations and incomes as well. Each recorded utterance is accompanied by an orthographic transcription which also notes any unusual acoustic events or anomalies. Macrophone is the American English contribution to an international database of telephone speech corpora called POLYPHONE. Similar data sets are expected for major languages of the world and at least some of these will be made available through LDC. Prospects are currently good for American Spanish (by early 1995), Dutch, Standard French, Standard German, Japanese, Mandarin Chinese, Swiss French and Danish versions of POLYPHONE, all with basically the same structure and methods of collection.
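For readers unfamiliar with the 8-bit mu-law encoding mentioned above, the following is a minimal sketch of expanding mu-law bytes to linear PCM samples using the standard G.711 expansion. It assumes raw mu-law bytes as input; how the samples are packaged in the corpus files (headers, compression) is not covered here, and the function names are illustrative.

```python
def ulaw_to_linear(byte: int) -> int:
    """Expand one 8-bit mu-law code to a signed linear PCM sample (G.711)."""
    u = ~byte & 0xFF              # mu-law codes are stored bit-inverted
    sign = u & 0x80               # top bit: sign
    exponent = (u >> 4) & 0x07    # next three bits: segment number
    mantissa = u & 0x0F           # low four bits: step within the segment
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


def decode_ulaw(data: bytes) -> list:
    """Decode a buffer of mu-law bytes into a list of linear samples."""
    return [ulaw_to_linear(b) for b in data]


# 0xFF encodes silence; 0x80 and 0x00 are the positive and negative extremes.
print(decode_ulaw(bytes([0xFF, 0x80, 0x00])))  # [0, 32124, -32124]
```

At 8 kHz with one byte per sample, one second of speech in this format occupies 8,000 bytes before any further compression.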
MACROPHONE was collected at SRI under LDC sponsorship. A paper describing it was presented at ICASSP-94: "Macrophone: An American English Telephone Speech Corpus for the POLYPHONE Project," by Jared Bernstein, Kelsey Taussig and Jack Godfrey.
Updates
None at this time.
Portions © 1994 Trustees of the University of Pennsylvania
Authors
- Bernstein, Jared
- Taussig, Kelsey
- Godfrey, Jack
LDC93S3A - Resource Management Complete Set 2.0
LDC93S3B - Resource Management (RM1) 2.0
LDC93S3C - Resource Management (RM2) 2.0
The DARPA Resource Management corpora (RM) consist of digitized and transcribed speech for use in designing and evaluating continuous speech recognition systems. There are two main parts, often referred to as RM1 and RM2. RM1 contains three sections, Speaker-Dependent (SD) training data, Speaker-Independent (SI) training data and test and evaluation data. RM2 has an additional and larger SD data set, including test material. Resource Management Complete Set 2.0 contains RM1 and RM2.
All RM material consists of read sentences modeled after a naval resource management task. The complete corpus contains over 25,000 utterances from more than 160 speakers representing a variety of American dialects. The material was recorded at 16 kHz, with 16-bit resolution, using a Sennheiser HMD-414 headset microphone. All discs conform to the ISO-9660 data format.
Resource Management SD and SI Training and Test Data (RM1)
The Speaker-Dependent (SD) Training Data contains 12 subjects, each reading a set of 600 "training sentences," two "dialect" sentences and ten "rapid adaptation" sentences, for a total of 7,344 recorded sentence utterances. The 600 sentences designated as training cover 97 of the lexical items in the corpus.
The Speaker-Independent (SI) Training Data contains 80 speakers, each reading two "dialect" sentences plus 40 sentences from the Resource Management text corpus, for a total of 3,360 recorded sentence utterances. Any given sentence from a set of 1,600 Resource Management sentence texts was recorded by two subjects, while no sentence was read twice by the same subject.
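The utterance totals quoted for the SD and SI training data follow directly from the per-speaker counts. A short sketch of the arithmetic, with variable names chosen here for illustration:

```python
# Speaker-Dependent (SD) training data: 12 subjects, each reading
# 600 training, 2 dialect and 10 rapid adaptation sentences.
sd_per_speaker = 600 + 2 + 10
sd_total = 12 * sd_per_speaker

# Speaker-Independent (SI) training data: 80 speakers, each reading
# 2 dialect sentences plus 40 Resource Management sentences.
si_per_speaker = 2 + 40
si_total = 80 * si_per_speaker

# Each of the 1,600 sentence texts was recorded by two subjects, which
# accounts for the 80 x 40 Resource Management readings.
si_rm_readings = 1600 * 2
assert si_rm_readings == 80 * 40

print(sd_total, si_total)  # 7344 3360
```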
RM1 contains all SD and SI system test material used in five DARPA benchmark tests conducted in March and October of 1987, June 1988, and February and October 1989, along with scoring and diagnostic software and documentation for those tests. Documentation is also provided outlining use of the Resource Management training and test material at CMU in development of the SPHINX system. Example output and scored results for state-of-the-art speaker-dependent and speaker-independent systems (i.e. the BBN BYBLOS and CMU SPHINX systems) for the October 1989 benchmark tests are included.
Extended Resource Management Speaker-Dependent Corpus (RM2)
This set forms a speaker-dependent extension to the Resource Management (RM1) corpus. The corpus consists of a total of 10,508 sentence utterances (two male and two female speakers each speaking 2,652 sentence texts). These include the 600 "standard" Resource Management speaker-dependent training sentences, two dialect calibration sentences, ten rapid adaptation sentences, 1,800 newly-generated extended training sentences, 120 newly-generated development-test sentences and 120 newly-generated evaluation-test sentences. The evaluation-test material on this disc was used as the test set for the June 1990 DARPA SLS Resource Management Benchmark Tests (see the Proceedings).
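The per-speaker figure of 2,652 can be checked from the breakdown above. Note that four speakers times 2,652 texts gives 10,608, which is 100 more than the stated total of 10,508; the description does not explain the difference, so this sketch only reports both numbers. Variable names are illustrative.

```python
# Sentence texts read by each RM2 speaker, as itemized in the description.
rm2_per_speaker = (
    600      # standard SD training sentences
    + 2      # dialect calibration sentences
    + 10     # rapid adaptation sentences
    + 1800   # extended training sentences
    + 120    # development-test sentences
    + 120    # evaluation-test sentences
)

rm2_speakers = 2 + 2             # two male and two female speakers
rm2_product = rm2_speakers * rm2_per_speaker
rm2_stated_total = 10508         # total as given in the description

print(rm2_per_speaker)                  # 2652
print(rm2_product)                      # 10608
print(rm2_product - rm2_stated_total)   # 100
```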
The RM2 corpus was recorded at Texas Instruments. The NIST speech recognition scoring software originally distributed on the RM1 "Test" Disc was adapted for RM2 sentences and is included in this publication.
Portions © 1993 Trustees of the University of Pennsylvania
Authors
- Price, P.
- Fisher, W. M.
- Bernstein, Jared
- Pallett, D. S.
LDC93S3A - Resource Management Complete Set 2.0
LDC93S3B - Resource Management (RM1) 2.0
LDC93S3C - Resource Management (RM2) 2.0
The DARPA Resource Management corpora (RM) consist of digitized and transcribed speech for use in designing and evaluating continuous speech recognition systems. There are two main parts, often referred to as RM1 and RM2. RM1 contains three sections, Speaker-Dependent (SD) training data, Speaker-Independent (SI) training data and test and evaluation data. RM2 has an additional and larger SD data set, including test material. Resource Management Complete Set 2.0 contains RM1 and RM2.
All RM material consists of read sentences modeled after a naval resource management task. The complete corpus contains over 25,000 utterances from more than 160 speakers representing a variety of American dialects. The material was recorded at 16 kHz, with 16-bit resolution, using a Sennheiser HMD-414 headset microphone.
Resource Management SD and SI Training and Test Data (RM1)
The Speaker-Dependent (SD) Training Data contains 12 subjects, each reading a set of 600 "training sentences," two "dialect" sentences and ten "rapid adaptation" sentences, for a total of 7,344 recorded sentence utterances. The 600 sentences designated as training cover 97 of the lexical items in the corpus.
The Speaker-Independent (SI) Training Data contains 80 speakers, each reading two "dialect" sentences plus 40 sentences from the Resource Management text corpus, for a total of 3,360 recorded sentence utterances. Any given sentence from a set of 1,600 Resource Management sentence texts was recorded by two subjects, while no sentence was read twice by the same subject.
RM1 contains all SD and SI system test material used in five DARPA benchmark tests conducted in March and October of 1987, June 1988, and February and October 1989, along with scoring and diagnostic software and documentation for those tests. Documentation is also provided outlining use of the Resource Management training and test material at CMU in development of the SPHINX system. Example output and scored results for state-of-the-art speaker-dependent and speaker-independent systems (i.e. the BBN BYBLOS and CMU SPHINX systems) for the October 1989 benchmark tests are included.
Extended Resource Management Speaker-Dependent Corpus (RM2)
This set forms a speaker-dependent extension to the Resource Management (RM1) corpus. The corpus consists of a total of 10,508 sentence utterances (two male and two female speakers each speaking 2,652 sentence texts). These include the 600 "standard" Resource Management speaker-dependent training sentences, two dialect calibration sentences, ten rapid adaptation sentences, 1,800 newly-generated extended training sentences, 120 newly-generated development-test sentences and 120 newly-generated evaluation-test sentences. The evaluation-test material was used as the test set for the June 1990 DARPA SLS Resource Management Benchmark Tests (see the Proceedings).
The RM2 corpus was recorded at Texas Instruments. The NIST speech recognition scoring software originally distributed on the RM1 "Test" Disc was adapted for RM2 sentences and is included in this publication.
Portions © 1993 Trustees of the University of Pennsylvania
Authors
- Price, P.
- Fisher, W. M.
- Bernstein, Jared
- Pallett, D. S.
LDC93S3A - Resource Management Complete Set 2.0
LDC93S3B - Resource Management (RM1) 2.0
LDC93S3C - Resource Management (RM2) 2.0
The DARPA Resource Management corpora (RM) consist of digitized and transcribed speech for use in designing and evaluating continuous speech recognition systems. There are two main parts, often referred to as RM1 and RM2. RM1 contains three sections, Speaker-Dependent (SD) training data, Speaker-Independent (SI) training data and test and evaluation data. RM2 has an additional and larger SD data set, including test material. Resource Management Complete Set 2.0 contains RM1 and RM2.
All RM material consists of read sentences modeled after a naval resource management task. The complete corpus contains over 25,000 utterances from more than 160 speakers representing a variety of American dialects. The material was recorded at 16 kHz, with 16-bit resolution, using a Sennheiser HMD-414 headset microphone. All discs conform to the ISO-9660 data format.
Resource Management SD and SI Training and Test Data (RM1)
The Speaker-Dependent (SD) Training Data contains 12 subjects, each reading a set of 600 "training sentences," two "dialect" sentences and ten "rapid adaptation" sentences, for a total of 7,344 recorded sentence utterances. The 600 sentences designated as training cover 97 of the lexical items in the corpus.
The Speaker-Independent (SI) Training Data contains 80 speakers, each reading two "dialect" sentences plus 40 sentences from the Resource Management text corpus, for a total of 3,360 recorded sentence utterances. Any given sentence from a set of 1,600 Resource Management sentence texts was recorded by two subjects, while no sentence was read twice by the same subject.
RM1 contains all SD and SI system test material used in five DARPA benchmark tests conducted in March and October of 1987, June 1988 and February and October 1989, along with scoring and diagnostic software and documentation for those tests. Documentation is also provided outlining use of the Resource Management training and test material at CMU in development of the SPHINX system. Example output and scored results for state-of-the-art speaker-dependent and speaker-independent systems (i.e. the BBN BYBLOS and CMU SPHINX systems) for the October 1989 benchmark tests are included.
Extended Resource Management Speaker-Dependent Corpus (RM2)
This set forms a speaker-dependent extension to the RM1 corpus. The corpus consists of a total of 10,508 sentence utterances (two male and two female speakers each speaking 2,652 sentence texts). These include the 600 "standard" Resource Management speaker-dependent training sentences, two dialect calibration sentences, ten rapid adaptation sentences, 1,800 newly-generated extended training sentences, 120 newly-generated development-test sentences and 120 newly-generated evaluation-test sentences. The evaluation-test material on this disc was used as the test set for the June 1990 DARPA SLS Resource Management Benchmark Tests (see the Proceedings).
The RM2 corpus was recorded at Texas Instruments. The NIST speech recognition scoring software originally distributed on the RM1 "Test" Disc was adapted for RM2 sentences and is included in this publication.
Portions © 1993 Trustees of the University of Pennsylvania
Authors
- Price, P.
- Fisher, W. M.
- Bernstein, Jared
- Pallett, D. S.