Automated Author Profile
Li, Xuansong

Current S-Index

27.4

Sum of Dataset Indices for all datasets

Average Dataset Index per Dataset

1.0

Average Dataset Index per dataset

Total Datasets

Total datasets for this author

Average FAIR Score

34.6%

Average FAIR Score per dataset

Total Citations

Total citations to the author's datasets

Total Mentions

Total mentions of the author's datasets

S-Index Interpretation

The S-Index (Sharing Index) is a comprehensive metric that represents the cumulative impact of all your datasets. It is calculated as the sum of Dataset Index scores across all your claimed datasets.

What it means:

A higher S-Index indicates greater overall impact of your datasets relative to typical datasets in their fields of research
The S-Index grows as you add more datasets or as existing datasets gain more citations and mentions
It provides a single number to track your research data impact over time

Current S-Index: 27.4 (sum of the Dataset Index scores of 27 datasets)
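As a minimal sketch of the calculation described above, the snippet below sums Dataset Index scores and derives the per-dataset average. The list holds only the ten Dataset Index values shown for the datasets on this page, so its total is lower than the profile's S-Index of 27.4, which covers all 27 datasets. The function names `s_index` and `average_dataset_index` are illustrative and not part of any published API.

```python
def s_index(dataset_indices):
    """Sum of Dataset Index scores, following the definition above."""
    return sum(dataset_indices)


def average_dataset_index(dataset_indices):
    """Mean Dataset Index per dataset; 0.0 for an empty list."""
    if not dataset_indices:
        return 0.0
    return s_index(dataset_indices) / len(dataset_indices)


# Dataset Index values of the ten datasets listed on this page.
listed = [1.2, 0.9, 0.9, 0.4, 0.9, 0.9, 0.9, 1.3, 0.9, 1.2]

print(round(s_index(listed), 1))
print(round(average_dataset_index(listed), 2))

# Profile-level figures: 27.4 over 27 datasets rounds to the reported 1.0.
print(round(27.4 / 27, 1))
```

Rounding is applied only for display, since summing decimal fractions in floating point can leave small residual errors.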


S-Index Over Time

Cumulative Citations Over Time

Cumulative Mentions Over Time

Datasets

BOLT Egyptian Arabic-English Word Alignment -- Conversational Telephone Speech Training

Introduction

BOLT Egyptian Arabic-English Word Alignment -- Conversational Telephone Speech Training was developed by the Linguistic Data Consortium (LDC) and consists of 153,171 words of Egyptian Arabic and English parallel text enhanced with linguistic tags to indicate word relations.

The DARPA BOLT (Broad Operational Language Translation) program developed machine translation and information retrieval for less formal genres, focusing particularly on user-generated content. LDC supported the BOLT program by collecting informal data sources -- discussion forums, text messaging and chat -- in Chinese, Egyptian Arabic and English. The collected data was translated and annotated for various tasks including word alignment, treebanking, propbanking and co-reference.

Data

The source data in this release consists of transcripts of Egyptian Arabic conversational telephone speech (CTS) from LDC's CALLHOME and CALLFRIEND collections (LDC97S45, LDC97T19, LDC2002S37, LDC2002T38, LDC96S49) that were translated into English by professional translation agencies and annotated for the word alignment task.

The BOLT word alignment task was built on treebank annotation. Specifically, Egyptian Arabic source tree tokens were automatically extracted from tree files in LDC's BOLT Egyptian Arabic Treebank. Those tree files had been tagged for part-of-speech and syntactically annotated. That data was then aligned and annotated for the word alignment task.

The data profile broken down by words, tree tokens and segments appears below:

Language	Genre	Files	Words	Tree-tokens	Segments
Egyptian Arabic	CTS	176	153,171	215,896	20,010

Acknowledgement

This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR0011-11-C-0145. The content does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.

Samples

Please view the following samples:

Egyptian-Arabic Token Sample

English Token Sample

Word Alignment Token

Updates

None at this time.

Authors

Li, Xuansong ;
Grimes, Stephen ;
Strassel, Stephanie

1 Citation | 0 Mentions | 35% FAIR | 1.2 Dataset Index

DOI: 10.35111/1jbk-p616 (2020)

BOLT Egyptian Arabic-English Word Alignment -- SMS/Chat Training

Introduction

BOLT Egyptian Arabic-English Word Alignment -- SMS/Chat Training was developed by the Linguistic Data Consortium (LDC) and consists of 349,414 words of Egyptian Arabic and English parallel text enhanced with linguistic tags to indicate word relations.

Data

This release consists of Egyptian Arabic source text message and chat conversations collected using two methods: new collection via LDC's collection platform, and donation of SMS or chat archives from BOLT collection participants. The source data is released as BOLT Egyptian Arabic SMS/Chat and Transliteration (LDC2017T07).

The data profile broken down by words, tree/POS tokens and segments appears below:

Language	Genre	Files	Words	Tree/POS-tokens	Segments
Egyptian Arabic	SMS/Chat	1367	349,414	475,665	74,814

Acknowledgement

Samples

Please view the following samples:

Egyptian Arabic Source

English Translation

Word Alignment

Updates

None at this time.

Authors

Li, Xuansong ;
Grimes, Stephen ;
Strassel, Stephanie

0 Citations | 0 Mentions | 35% FAIR | 0.9 Dataset Index

DOI: 10.35111/1kea-zq24 (2019)

BOLT Chinese-English Word Alignment and Tagging -- SMS/Chat Training

Introduction

BOLT Chinese-English Word Alignment and Tagging -- SMS/Chat Training was developed by the Linguistic Data Consortium (LDC) and consists of 388,027 words of Chinese and English parallel text enhanced with linguistic tags to indicate word relations.

Data

This release consists of Chinese source text message and chat conversations collected using two methods: new collection via LDC's collection platform and donation of SMS and chat archives from BOLT collection participants. The source data is released as BOLT Chinese SMS/Chat (LDC2018T15).

The BOLT word alignment task was built on treebank annotation. Specifically, LDC automatically extracted Chinese source tokens, including empty categories/traces, from word-segmented files provided by the BOLT Chinese Treebank annotation team at Brandeis University. The word-segmented tokens were then used to automatically generate ctb (Chinese Treebank) alignment and were also tokenized for character alignment by inserting white spaces to separate characters.
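The character-tokenization step described above can be sketched as follows. This is an illustrative assumption about the procedure, not LDC's actual tooling: it starts from one whitespace-separated, word-segmented line and inserts a space between every character. The function name `char_tokenize` is hypothetical, and the handling of empty categories/traces is not modeled.

```python
def char_tokenize(segmented_line):
    """Split each whitespace-separated token into single characters.

    Assumes the input is one word-segmented line; returns the characters
    joined by single spaces.
    """
    chars = []
    for token in segmented_line.split():
        chars.extend(token)  # a str iterates character by character
    return " ".join(chars)


print(char_tokenize("我们 喜欢 语言学"))
```

Keeping both the word-segmented and the character-separated form of the same line is what allows alignments to be expressed at either granularity.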

The data profile broken down by character tokens, ctb tokens and segments appears below:

Language	Genre	Files	Words	CharTokens	CTBTokens	Segments
Chinese	SMS/chat	1359	388,027	582,043	419,406	59,564

Acknowledgement

Samples

Please view the following samples:

English Tokenized

CTB-Based Word Alignment

Character-Based Word Alignment

Chinese CTB-based Tokenized

Chinese Character Tokenized

Updates

None at this time.

Authors

Li, Xuansong ;
Grimes, Stephen ;
Strassel, Stephanie

0 Citations | 0 Mentions | 35% FAIR | 0.9 Dataset Index

DOI: 10.35111/zbdg-8t66 (2019)

HAVIC MED Progress Test -- Videos, Metadata and Annotation

Introduction

HAVIC MED Progress Test -- Videos, Metadata and Annotation was developed by the Linguistic Data Consortium (LDC) and is comprised of approximately 3,650 hours of user-generated videos with annotation and metadata.

To advance multimodal event detection and related technologies, LDC developed, in collaboration with NIST (the National Institute of Standards and Technology), a large, heterogeneous, annotated multimodal corpus for HAVIC (the Heterogeneous Audio Visual Internet Collection) that was used in the NIST-sponsored MED (Multimedia Event Detection) task for several years. HAVIC MED Progress Test is a subset of that corpus, specifically, a collection of event and background videos for the HAVIC project originally released to support the 2012, 2013, 2014, and 2015 Multimedia Event Detection tasks.

Data

The data consists of videos of various events (event videos) and videos completely unrelated to events (background videos) harvested by a large team of human annotators. Each event video was manually annotated with a set of judgments describing its event properties and other salient features. Background videos were labeled with topic and genre categories.

All video files are in .mp4 format (h.264), with varying bit-rates and levels of audio fidelity and video resolution. Metadata and annotation for the videos are stored in a .tsv file.

Samples

Please view this video sample and annotation sample.

Updates

None at this time.

Additional Licensing Instructions

This members-only corpus is available to current members. Contact LDC for information about becoming a member.

Authors

Morris, Amanda ;
Strassel, Stephanie ;
Li, Xuansong ;
Antonishek, Brian ;
Fiscus, Jonathan G.

0 Citations | 0 Mentions | 35% FAIR | 0.4 Dataset Index

DOI: 10.35111/fnzz-kn07 (2019)

BOLT Egyptian-English Word Alignment -- Discussion Forum Training

Introduction

BOLT Egyptian-English Word Alignment -- Discussion Forum Training was developed by the Linguistic Data Consortium (LDC) and consists of 400,448 words of Egyptian Arabic and English parallel text enhanced with linguistic tags to indicate word relations.

Data

This release consists of Egyptian source discussion forum threads harvested from the Internet by LDC using a combination of manual and automatic processes. The source data is released as BOLT Arabic Discussion Forums (LDC2018T10).

The BOLT word alignment task was built on treebank annotation. Specifically, Egyptian source tree tokens for word alignment were automatically extracted from tree files of BOLT Egyptian Arabic Treebank annotation on the source discussion forum data harvested by LDC. Human annotators then followed LDC guidelines to link words and phrases in Arabic to those in English.

The data profile broken down by words, tree tokens and segments appears below:

Language	Genre	Files	Words	Tree-tokens	Segments
Egyptian Arabic	discussion forum	724	400,448	593,723	31,454

Acknowledgement

Samples

Please view the following samples:

Egyptian Tokenized

English Tokenized

Word Alignment

Updates

None at this time.

Authors

Li, Xuansong ;
Peterson, Katherine ;
Grimes, Stephen ;
Strassel, Stephanie

0 Citations | 0 Mentions | 35% FAIR | 0.9 Dataset Index

DOI: 10.35111/me9d-jr38 (2019)

HAVIC MED Event E051-E060 -- Videos, Metadata and Annotation

Introduction

HAVIC MED Event E051-E060 -- Videos, Metadata and Annotation was developed by the Linguistic Data Consortium (LDC) and is comprised of approximately 53 hours of user-generated videos with annotation and metadata.

To advance multimodal event detection and related technologies, LDC developed, in collaboration with NIST (the National Institute of Standards and Technology), a large, heterogeneous, annotated multimodal corpus for HAVIC (the Heterogeneous Audio Visual Internet Collection) that was used in the NIST-sponsored MED (Multimedia Event Detection) task for several years. HAVIC MED Event E051-E060 is a subset of that corpus, specifically, a collection of event videos for the HAVIC Project originally released to support the 2016 Multimedia Event Detection task.

Data

All video files are in .mp4 format (h.264), with varying bit-rates and levels of audio fidelity and video resolution. Metadata and annotation for the videos are stored in a .tsv file.

Samples

Please view this video sample and annotation sample.

Updates

None at this time.

Authors

Morris, Amanda ;
Strassel, Stephanie ;
Li, Xuansong ;
Antonishek, Brian ;
Fiscus, Jonathan G.

0 Citations | 0 Mentions | 35% FAIR | 0.9 Dataset Index

DOI: 10.35111/m8n7-xy13 (2018)

GALE English-Chinese Parallel Aligned Treebank -- Training

Introduction

GALE English-Chinese Parallel Aligned Treebank -- Training was developed by the Linguistic Data Consortium (LDC) and contains 196,123 tokens of word aligned English and Chinese parallel text with treebank annotations. This material was used as training data in the DARPA GALE (Global Autonomous Language Exploitation) program.

Parallel aligned treebanks are treebanks annotated with morphological and syntactic structures aligned at the sentence level and the sub-sentence level. Such data sets are useful for natural language processing and related fields, including automatic word alignment system training and evaluation, transfer-rule extraction, word sense disambiguation, translation lexicon extraction and cultural heritage and cross-linguistic studies. With respect to machine translation system development, parallel aligned treebanks may improve system performance with enhanced syntactic parsers, better rules and knowledge about language pairs and reduced word error rate.

The English source data was translated into Chinese. Chinese and English treebank annotations were performed independently. The parallel texts were then word aligned. The material in this release corresponds to portions of the treebanked data in OntoNotes 3.0 (LDC2009T24) and OntoNotes 4.0 (LDC2011T03).

Data

This release consists of English source broadcast programming (CNN, NBC/MSNBC) and web data collected by LDC in 2005 and 2006. The distribution by genre, words, character tokens, treebank tokens and segments appears below:

Genre	Files	Words	CharTokens	CTBTokens	Segments
bc	6	60,061	90,092	62,438	3,763
wb	15	70,687	106,031	69,309	3,238
Total	21	130,748	196,123	131,747	7,001

Note that all token counts are based on the Chinese data only. One token is equivalent to one character and one word is equivalent to 1.5 characters.
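The relationship in the note above can be checked against the table with a short sketch. It assumes, as an inference from the figures rather than a documented rule, that each word count was derived by dividing the character-token count by 1.5 and rounding down. The function name `estimate_words` is illustrative.

```python
def estimate_words(char_tokens):
    """Estimate a Chinese word count as character tokens / 1.5, rounded down."""
    return (char_tokens * 2) // 3  # integer form of char_tokens / 1.5


# (genre, words, character tokens); the bc word count is 60,061, the value
# implied by the Total row (130,748 - 70,687).
rows = [("bc", 60061, 90092), ("wb", 70687, 106031)]

for genre, words, char_tokens in rows:
    print(genre, words, estimate_words(char_tokens))
```

Under this assumption the "Words" column is an estimate derived from the character count, not an independent count of segmented words.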

The word alignment task consisted of the following components:

Identifying, aligning, and tagging eight different types of links

Identifying, attaching, and tagging local-level unmatched words

Identifying and tagging sentence/discourse-level unmatched words

Identifying and tagging all instances of Chinese 的 (DE) except when they were a part of a semantic link

This release contains nine types of files - English raw source files, Chinese raw translation files, Chinese character tokenized files, Chinese CTB tokenized files, English tokenized files, Chinese treebank files, English treebank files, character-based word alignment files, and CTB-based word alignment files.

Samples

Please view the following samples:

English raw source

Chinese raw translation

Chinese character tokenized

Chinese CTB tokenized

English tokenized

Chinese treebank

English treebank

Character-based word alignment

CTB-based word alignment

Sponsorship

This work was supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this publication does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.

Updates

04/12/2017 - The Chinese raw translation files for broadcast conversation were updated to the correct files. All downloads received after this date are fully up to date.

Authors

Li, Xuansong ;
Grimes, Stephen ;
Strassel, Stephanie ;
Ma, Xiaoyi ;
Xue, Nianwen ;
Marcus, Mitch ;
Taylor, Ann

0 Citations | 0 Mentions | 35% FAIR | 0.9 Dataset Index

DOI: 10.35111/w1wt-fc49 (2017)

BOLT Chinese-English Word Alignment and Tagging -- Discussion Forum Training

Introduction

BOLT Chinese-English Word Alignment and Tagging -- Discussion Forum Training was developed by the Linguistic Data Consortium (LDC) and consists of 448,094 words of Chinese and English parallel text enhanced with linguistic tags to indicate word relations.

Data

This release consists of Chinese source discussion forum threads harvested from the Internet by LDC using a combination of manual and automatic processes. The source data is released as BOLT Chinese Discussion Forums (LDC2016T05).

The data profile broken down by character tokens, ctb tokens and segments appears below:

Language	Genre	Files	Words	CharTokens	CTBTokens	Segments
Chinese	forum	570	448,094	672,140	442,520	20,819

Acknowledgement

Samples

Please view the following samples:

Chinese Character Tokenized

Chinese CTB-Based Tokenized

English Tokenized

Character-Based Word Alignment

CTB-Based Word Alignment

Updates

None at this time.

Authors

Li, Xuansong ;
Peterson, Katherine ;
Grimes, Stephen ;
Strassel, Stephanie

1 Citation | 0 Mentions | 35% FAIR | 1.3 Dataset Index

DOI: 10.35111/s5ae-pn38 (2016)

HAVIC Pilot Transcription

Introduction

HAVIC Pilot Transcription was developed by the Linguistic Data Consortium (LDC) and is comprised of approximately 72 hours of user-generated videos with transcripts based on the English speech audio extracted from the videos. This data set was created in collaboration with NIST (the National Institute of Standards and Technology) as part of the HAVIC (the Heterogeneous Audio Visual Internet Collection) project, the goal of which was to advance multimodal event detection and related technologies.

LDC has developed a large, heterogeneous, annotated multimodal corpus for HAVIC that has been used in the NIST-sponsored MED (Multimedia Event Detection) task for several years. HAVIC Pilot Transcription supported an experiment to produce a verbatim transcript (quick and rich transcription) based on audio extracted from user-generated videos. It contains the pilot transcripts for selected MED 2011 video files as well as the associated videos.

Data

NIST designated the videos to be transcribed. Annotators generated the transcripts using XTrans, which supports manual transcription across multiple channels, languages and platforms. HAVIC transcription guidelines are included in the documentation for this release.

Each file was transcribed by a single annotator with no corpus-wide second pass. File samples from each annotator were checked for various errors, including missing transcription, improper mark-up, poor segmentation and missing/added words.

All transcription files are in .tdf format, a plain-text, flat-table format with 13 tab-delimited fields. All video files are in .mp4 format (h.264), with varying bit-rates and levels of audio fidelity and video resolution.
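A minimal sketch of reading such a transcription file, assuming only what is stated above: plain text, one record per line, 13 tab-delimited fields. The function name `read_tdf_rows`, the UTF-8 encoding and the convention that lines starting with `;;` are header or comment lines are assumptions made for illustration; the documentation included with the release defines the actual layout and field meanings.

```python
import csv
import io

EXPECTED_FIELDS = 13  # per the format description above


def read_tdf_rows(stream):
    """Yield each data line as a list of 13 strings.

    Skips blank lines and lines starting with ';;' (assumed here to be
    header/comment lines); raises ValueError on any other line that does
    not have exactly 13 tab-delimited fields.
    """
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for line_no, row in enumerate(reader, start=1):
        if not row or row[0].startswith(";;"):
            continue
        if len(row) != EXPECTED_FIELDS:
            raise ValueError(
                f"line {line_no}: expected {EXPECTED_FIELDS} fields, got {len(row)}"
            )
        yield row


# Synthetic two-line example; a real file would be opened with
# open(path, encoding="utf-8", newline="").
sample = ";; comment line\n" + "\t".join(f"f{i}" for i in range(13)) + "\n"
rows = list(read_tdf_rows(io.StringIO(sample)))
print(len(rows), len(rows[0]))
```

Validating the field count on every line is a cheap way to catch truncated or mis-delimited records before any field is interpreted.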

Samples

Please view these video and transcript samples.

Updates

None at this time.

Authors

Tracey, Jennifer ;
Strassel, Stephanie ;
Morris, Amanda ;
Li, Xuansong ;
Antonishek, Brian ;
Fiscus, Jonathan G.

0 Citations | 0 Mentions | 35% FAIR | 0.9 Dataset Index

DOI: 10.35111/cn82-n503 (2016)

GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 4

Introduction

GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 4 was developed by the Linguistic Data Consortium (LDC) and contains 243,038 tokens of word aligned Chinese and English parallel text enriched with linguistic tags. This material was used as training data in the DARPA GALE (Global Autonomous Language Exploitation) program.

Some approaches to statistical machine translation include the incorporation of linguistic knowledge in word aligned text as a means to improve automatic word alignment and machine translation quality. This is accomplished with two annotation schemes: alignment and tagging. Alignment identifies minimum translation units and translation relations by using minimum-match and attachment annotation approaches. A set of word tags and alignment link tags are designed in the tagging scheme to describe these translation units and relations. Tagging adds contextual, syntactic and language-specific features to the alignment annotation.

Other releases available in this series are:

GALE Chinese-English Word Alignment and Tagging Training Part 1 -- Newswire and Web (LDC2012T16)

GALE Chinese-English Word Alignment and Tagging Training Part 2 -- Newswire (LDC2012T20)

GALE Chinese-English Word Alignment and Tagging Training Part 3 -- Web (LDC2012T24)

GALE Chinese-English Word Alignment and Tagging Training Part 4 -- Web (LDC2013T05)

GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 1 (LDC2013T23)

GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 2 (LDC2014T25)

GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 3 (LDC2015T04)

Data

This release consists of Chinese source broadcast conversation (BC) and broadcast news (BN) programming collected by LDC in 2008 and 2009. The distribution by genre, words, character tokens and segments appears below:

Language	Genre	Files	Words	CharTokens	Segments
Chinese	BC	69	67,782	101,674	2,276
Chinese	BN	29	94,242	141,364	3,152
Total		98	162,024	243,038	5,428

Note that all token counts are based on the Chinese data only. One token is equivalent to one character and one word is equivalent to 1.5 characters.

The Chinese word alignment tasks consisted of the following components:

Identifying, aligning, and tagging eight different types of links

Identifying, attaching, and tagging local-level unmatched words

Identifying and tagging sentence/discourse-level unmatched words

Identifying and tagging all instances of Chinese 的 (DE) except when they were a part of a semantic link

Samples

Please view the following samples:

Chinese Raw

Chinese Token

English Raw

English Token

Word Alignment

Sponsorship

Updates

None at this time.

Authors

Li, Xuansong ;
Grimes, Stephen ;
Strassel, Stephanie

1 Citation | 0 Mentions | 35% FAIR | 1.2 Dataset Index

DOI: 10.35111/jt1b-sv88 (2015)
