Version v0.1

The Maaloula Aramaic Speech Corpus (MASC)

Eid, Ghattas; Seyffarth, Esther; Rihan, Emad; Arnold, Werner; Plag, Ingo

Description

This dataset contains the first electronic speech corpus of Maaloula Aramaic, an endangered Western Neo-Aramaic variety spoken in Syria. This 64,845-word corpus is available in four formats: (1) transliteration, (2) lemmatized transliteration, (3) audio files and time-aligned phonetic transcriptions, and (4) an SQLite database. The transliteration files are a digitized and corrected version of authentic transcriptions of tape-recorded narratives that were collected on a fieldwork trip in the 1980s and published in the early 1990s (Arnold, 1991a, 1991b). They contain no annotation, except for some informative tagging (e.g. to mark loanwords and misspoken words). In the lemmatized version of the files, each word form is followed by its lemma in angle brackets. The time-aligned TextGrid annotations consist of four tiers: the sentence level (Tier 1), the word level (Tiers 2 and 3), and the segment level (Tier 4). These TextGrid files are downloadable together with their denoised audio files (for the original source of the audio data, see Arnold, 2003). The SQLite database enables users to access the data on the level of tokens, types, lemmas, sentences, stories, or speakers. For more information, please see our paper (submitted): The Maaloula Aramaic Speech Corpus (MASC): From Printed Material to a Lemmatized and Time-Aligned Corpus.
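As a rough illustration of how the lemmatized transliteration might be read programmatically, the sketch below pairs each word form with the lemma that follows it in angle brackets. The exact file layout is not specified in this description, so the token shape `form<lemma>` (with optional whitespace before the bracket), the function names `parse_lemmatized` and `lemma_counts`, and the sample string are all assumptions made for illustration. They are not taken from the corpus.

```python
import re
from collections import Counter

# Assumed token shape: a word form, optional whitespace, then its lemma in
# angle brackets, e.g. "form<lemma>". The real files may differ (for example,
# they may carry extra tags for loanwords or misspoken words).
TOKEN_RE = re.compile(r"([^\s<>]+)\s*<([^<>]+)>")


def parse_lemmatized(line):
    """Return a list of (word_form, lemma) pairs found in one line of text."""
    return [(form, lemma.strip()) for form, lemma in TOKEN_RE.findall(line)]


def lemma_counts(lines):
    """Count how often each lemma occurs across an iterable of lines."""
    counts = Counter()
    for line in lines:
        counts.update(lemma for _, lemma in parse_lemmatized(line))
    return counts


# Invented placeholder tokens, not real Maaloula Aramaic data.
sample = "wordA<lemma1> wordB<lemma2> wordC<lemma1>"
pairs = parse_lemmatized(sample)
counts = lemma_counts([sample])
```

With real files, the same two functions could be applied line by line after opening a file with `encoding="utf-8"`. The regular expression would probably need adjusting once the actual tagging conventions are known.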

Metrics

Dataset Index

2.0

FAIR Score

73%

Citations

1

Mentions

0

Publication Details

DOI

Publisher

Zenodo

Assigned Domain

Subfield

Language and Linguistics

Field

Arts and Humanities

Domain

Social Sciences

Confidence Score

100%

Source

OpenAlex

Keywords

Maaloula Aramaic; Western Neo-Aramaic; speech corpus; language documentation corpus; lemmatization; time alignment

Normalization Factors

FT

15.38

CTw

1.00

MTw

1.00