Automated Author Profile
Universiti Sains Malaysia
Current S-Index
Sum of Dataset Indices for all datasets
Average Dataset Index per Dataset
Average Dataset Index per dataset
Total Datasets
Total datasets for this author
Average FAIR Score
Average FAIR Score per dataset
Total Citations
Total citations to the author's datasets
Total Mentions
Total mentions of the author's datasets
S-Index Interpretation
The S-Index (Sharing Index) is a comprehensive metric that represents the cumulative impact of all your datasets. It is calculated as the sum of Dataset Index scores across all your claimed datasets.
What it means:
- A higher S-Index indicates greater overall impact of your datasets relative to typical datasets in their fields of research
- The S-Index grows as you add more datasets or as existing datasets gain more citations and mentions
- It provides a single number to track your research data impact over time
Current S-Index: 0.9 (the sum of Dataset Index scores across 1 dataset)
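As a minimal sketch of the arithmetic described above, the S-Index is a plain sum of per-dataset Dataset Index scores, and the average is that sum divided by the number of datasets. The function names are hypothetical, and returning 0.0 for an author with no datasets is an assumption. How each Dataset Index is itself derived is not covered here.

```python
def s_index(dataset_indices):
    """Return the S-Index: the sum of Dataset Index scores."""
    return sum(dataset_indices)


def average_dataset_index(dataset_indices):
    """Return the mean Dataset Index per dataset.

    Assumption: an author with no datasets gets 0.0 rather than an error.
    """
    if not dataset_indices:
        return 0.0
    return s_index(dataset_indices) / len(dataset_indices)


# One claimed dataset with a Dataset Index of 0.9, as on this profile.
scores = [0.9]
print(s_index(scores))                # 0.9
print(average_dataset_index(scores))  # 0.9
```

With a single dataset, the S-Index and the average Dataset Index coincide, which matches the figure shown for this profile.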
S-Index Over Time
Cumulative Citations Over Time
Cumulative Mentions Over Time
Datasets
Introduction
Mandarin-English Code-Switching in South-East Asia was developed by Nanyang Technological University and Universiti Sains Malaysia in Singapore and Malaysia, respectively. It comprises approximately 192 hours of Mandarin-English code-switching speech from 156 speakers with associated transcripts.
Code-switching refers to the practice of shifting between languages or language varieties during conversation. This corpus focuses on the shift between Mandarin and English by Malaysian and Singaporean speakers. Speakers engaged in unscripted conversations and interviews. In the conversational speech segments, two speakers conversed freely with each other. The interviews consisted of questions from an interviewer and answers from an interviewee; only the interviewee's speech was recorded. Topics discussed include hobbies, friends, and daily activities.
Data
The speakers were gender-balanced (49.7% female, 50.3% male) and between 19 and 33 years of age. Over 60% of the speakers were Singaporean; the rest were Malaysian.
The speech recordings were conducted in a quiet room using several microphones and recording devices. Details about the recording conditions are contained in the documentation provided with this release. The audio files in this corpus are 16 kHz, 16-bit recordings in FLAC-compressed WAV format, between 20 and 120 minutes in length.
Selected segments of the audio recordings were transcribed. Most of those segments contain code-switching utterances. The transcription file for each audio file is stored in UTF-8 tab-separated text file format.
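The column layout of the transcript files is documented in the release itself, not here, so the following is only a sketch of reading one such UTF-8 tab-separated file into rows of strings without assuming what each column means. The function name `read_transcript` is hypothetical.

```python
import csv


def read_transcript(path):
    """Read a UTF-8 tab-separated transcript file into a list of rows.

    Each row is a list of column strings. Blank lines are skipped.
    Quote characters are treated as ordinary text, since transcribed
    speech may contain them.
    """
    with open(path, encoding="utf-8", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        return [row for row in reader if row]
```

Disabling quote handling is a deliberate choice: the default CSV dialect would try to interpret a stray double quote inside an utterance as the start of a quoted field.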
Development and training divisions are available as a separate download (SEAME_train_dev_division.zip) and on the provider's GitHub page.
Samples
Please view this audio sample and transcript sample.
Updates
As of 12/14/2015, an additional set of transcription files was added for all the audio. These transcriptions update the originals by adding the previously untranscribed utterances. A language label was also added for each utterance in the transcriptions. File directories were changed to reflect the update; specifically, the change was made under /data/{recording_type}/transcript/{phase_number}/
Where
- {recording_type} is either 'conversation' or 'interview'
- {phase_number} is either 'phaseI' or 'phaseII'
+) 'phaseI' contains all the existing transcriptions from the first release
+) 'phaseII' contains the newly updated transcriptions, in which some typos and wrong boundary markers were corrected. Untranscribed segments, which are normally monolingual, and a language label for each segment were added.
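The directory pattern above can be sketched as a small path builder. The helper name `transcript_dir` and the `root` parameter are hypothetical; only the pattern /data/{recording_type}/transcript/{phase_number}/ and the allowed values come from the description above.

```python
from pathlib import PurePosixPath

# Allowed values, as listed in the update notes.
RECORDING_TYPES = ("conversation", "interview")
PHASE_NUMBERS = ("phaseI", "phaseII")


def transcript_dir(recording_type, phase_number, root="/data"):
    """Build the transcript directory for a recording type and phase."""
    if recording_type not in RECORDING_TYPES:
        raise ValueError(f"unknown recording type: {recording_type!r}")
    if phase_number not in PHASE_NUMBERS:
        raise ValueError(f"unknown phase number: {phase_number!r}")
    return PurePosixPath(root) / recording_type / "transcript" / phase_number


print(transcript_dir("interview", "phaseII"))
# /data/interview/transcript/phaseII
```

Passing a different `root` lets the same helper point at wherever the corpus has been unpacked locally.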
The documentation for the corpus was also updated to include a detailed description of the new update in Section 3, Transcription.
Portions © 2015 Nanyang Technological University, Universiti Sains Malaysia, Trustees of the University of Pennsylvania
Authors
- Nanyang Technological University
- Universiti Sains Malaysia