Automated Author Profile

Li, Xuansong

Current S-Index: 27.4 (sum of Dataset Indices for all datasets)

Average Dataset Index per Dataset: 1.0

Total Datasets: 27

Average FAIR Score: 34.6% (per dataset)

Total Citations: 10 (citations to the author's datasets)

Total Mentions: 0 (mentions of the author's datasets)
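
The summary figures above are related by simple arithmetic. The following is a minimal sketch of that relationship, assuming the displayed average is the S-Index divided by the dataset count and rounded to one decimal place; the variable names are illustrative and not part of the profile.

```python
# Values as reported in the profile summary.
s_index = 27.4        # sum of Dataset Indices over all datasets
total_datasets = 27   # number of datasets attributed to the author

# Average Dataset Index per dataset; rounding to one decimal is an assumption
# about how the profile displays the figure.
average_dataset_index = round(s_index / total_datasets, 1)
print(average_dataset_index)
```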

Datasets

BOLT Egyptian Arabic-English Word Alignment -- Conversational Telephone Speech Training

Introduction


BOLT Egyptian Arabic-English Word Alignment -- Conversational Telephone Speech Training was developed by the Linguistic Data Consortium (LDC) and consists of 153,171 words of Egyptian Arabic and English parallel text enhanced with linguistic tags to indicate word relations.


The DARPA BOLT (Broad Operational Language Translation) program developed machine translation and information retrieval for less formal genres, focusing particularly on user-generated content. LDC supported the BOLT program by collecting informal data sources -- discussion forums, text messaging and chat -- in Chinese, Egyptian Arabic and English. The collected data was translated and annotated for various tasks including word alignment, treebanking, propbanking and co-reference.


Data


The source data in this release consists of transcripts of Egyptian Arabic conversational telephone speech (CTS) from LDC's CALLHOME and CALLFRIEND collections (LDC97S45, LDC97T19, LDC2002S37, LDC2002T38, LDC96S49) that were translated into English by professional translation agencies and annotated for the word alignment task.


The BOLT word alignment task was built on treebank annotation. Specifically, Egyptian Arabic source tree tokens were automatically extracted from tree files in LDC's BOLT Egyptian Arabic Treebank. Those tree files had been tagged for part-of-speech and syntactically annotated. That data was then aligned and annotated for the word alignment task.


The data profile, broken down by words, tree tokens and segments, appears below:

Language         Genre  Files  Words    Tree-tokens  Segments
Egyptian Arabic  CTS    176    153,171  215,896      20,010

Acknowledgement


This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR0011-11-C-0145. The content does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.


Updates


None at this time.


Portions © 1996, 1997, 2002, 2012-2015, 2020 Trustees of the University of Pennsylvania

Authors

  • Li, Xuansong
  • Grimes, Stephen
  • Strassel, Stephanie

1 Citation, 0 Mentions, 35% FAIR, 1.2 Dataset Index
DOI: 10.35111/1jbk-p616 (2020)

BOLT Egyptian Arabic-English Word Alignment -- SMS/Chat Training

Introduction


BOLT Egyptian Arabic-English Word Alignment -- SMS/Chat Training was developed by the Linguistic Data Consortium (LDC) and consists of 349,414 words of Egyptian Arabic and English parallel text enhanced with linguistic tags to indicate word relations.


The DARPA BOLT (Broad Operational Language Translation) program developed machine translation and information retrieval for less formal genres, focusing particularly on user-generated content. LDC supported the BOLT program by collecting informal data sources -- discussion forums, text messaging and chat -- in Chinese, Egyptian Arabic and English. The collected data was translated and annotated for various tasks including word alignment, treebanking, propbanking and co-reference.


Data


This release consists of Egyptian Arabic source text message and chat conversations collected using two methods: new collection via LDC's collection platform, and donation of SMS or chat archives from BOLT collection participants. The source data is released as BOLT Egyptian Arabic SMS/Chat and Transliteration (LDC2017T07).


The BOLT word alignment task was built on treebank annotation. Specifically, Egyptian Arabic source tree tokens were automatically extracted from tree files in LDC's BOLT Egyptian Arabic Treebank. Those tree files had been tagged for part-of-speech and syntactically annotated. That data was then aligned and annotated for the word alignment task.


The data profile, broken down by words, tree/POS tokens and segments, appears below:

Language         Genre     Files  Words    Tree/POS-tokens  Segments
Egyptian Arabic  SMS/Chat  1,367  349,414  475,665          74,814

Acknowledgement


This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR0011-11-C-0145. The content does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.


Updates


None at this time.


Portions © 2019 Trustees of the University of Pennsylvania

Authors

  • Li, Xuansong
  • Grimes, Stephen
  • Strassel, Stephanie

0 Citations, 0 Mentions, 35% FAIR, 0.9 Dataset Index
DOI: 10.35111/1kea-zq24 (2019)

BOLT Chinese-English Word Alignment and Tagging -- SMS/Chat Training

Introduction


BOLT Chinese-English Word Alignment and Tagging -- SMS/Chat Training was developed by the Linguistic Data Consortium (LDC) and consists of 388,027 words of Chinese and English parallel text enhanced with linguistic tags to indicate word relations.


The DARPA BOLT (Broad Operational Language Translation) program developed machine translation and information retrieval for less formal genres, focusing particularly on user-generated content. LDC supported the BOLT program by collecting informal data sources -- discussion forums, text messaging and chat -- in Chinese, Egyptian Arabic and English. The collected data was translated and annotated for various tasks including word alignment, treebanking, propbanking and co-reference.


Data


This release consists of Chinese source text message and chat conversations collected using two methods: new collection via LDC's collection platform and donation of SMS and chat archives from BOLT collection participants. The source data is released as BOLT Chinese SMS/Chat (LDC2018T15).


The BOLT word alignment task was built on treebank annotation. Specifically, LDC automatically extracted Chinese source tokens, including empty categories/traces, from word-segmented files provided by the BOLT Chinese Treebank annotation team at Brandeis University. The word-segmented tokens were then used to automatically generate ctb (Chinese Treebank) alignment and were also tokenized for character alignment by inserting white spaces to separate characters.
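
The character tokenization step described above can be illustrated with a minimal sketch. It assumes the input is a list of word-segmented tokens and simply separates every character with a space; the function name and the sample sentence are hypothetical, and the handling of empty categories/traces in the actual LDC pipeline is not described here, so the sketch does not treat them specially.

```python
def char_tokenize(words):
    """Join all characters of the word-segmented tokens with single spaces."""
    return " ".join(ch for word in words for ch in word)


# Hypothetical word-segmented input (not taken from the corpus).
ctb_tokens = ["我们", "今天", "去", "北京"]

print(" ".join(ctb_tokens))       # word-level (CTB-style) tokenization
print(char_tokenize(ctb_tokens))  # character-level tokenization
```

Keeping both views of the same sentence is what allows one set of alignments at the CTB token level and another at the character level.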


The data profile broken down by character tokens, ctb tokens and segments appears below:

Language  Genre     Files  Words    CharTokens  CTBTokens  Segments
Chinese   SMS/chat  1,359  388,027  582,043     419,406    59,564

Acknowledgement


This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR0011-11-C-0145. The content does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.


Updates


None at this time.


Portions © 2012-2015, 2018, 2019 Trustees of the University of Pennsylvania

Authors

  • Li, Xuansong
  • Grimes, Stephen
  • Strassel, Stephanie

0 Citations, 0 Mentions, 35% FAIR, 0.9 Dataset Index
DOI: 10.35111/zbdg-8t66 (2019)

HAVIC MED Progress Test -- Videos, Metadata and Annotation

Introduction


HAVIC MED Progress Test -- Videos, Metadata and Annotation was developed by the Linguistic Data Consortium (LDC) and comprises approximately 3,650 hours of user-generated videos with annotation and metadata.


To advance multimodal event detection and related technologies, LDC developed, in collaboration with NIST (the National Institute of Standards and Technology), a large, heterogeneous, annotated multimodal corpus for HAVIC (the Heterogeneous Audio Visual Internet Collection) that was used in the NIST-sponsored MED (Multimedia Event Detection) task for several years. HAVIC MED Progress Test is a subset of that corpus, specifically, a collection of event and background videos for the HAVIC project originally released to support the 2012, 2013, 2014, and 2015 Multimedia Event Detection tasks.


Data


The data consists of videos of various events (event videos) and videos completely unrelated to events (background videos) harvested by a large team of human annotators. Each event video was manually annotated with a set of judgments describing its event properties and other salient features. Background videos were labeled with topic and genre categories.


All video files are in .mp4 format (H.264), with varying bit rates and levels of audio fidelity and video resolution. Metadata and annotation for the videos are stored in a .tsv file.


Samples


Please view this video sample and annotation sample.


Updates


None at this time.


Additional Licensing Instructions


This members-only corpus is available to current members. Contact LDC for information about becoming a member.


Portions © 2011-2016 YouTube, LLC, © 2011-2016, 2019 Trustees of the University of Pennsylvania

Authors

  • Morris, Amanda
  • Strassel, Stephanie
  • Li, Xuansong
  • Antonishek, Brian
  • Fiscus, Jonathan G.

0 Citations, 0 Mentions, 35% FAIR, 0.4 Dataset Index
DOI: 10.35111/fnzz-kn07 (2019)

BOLT Egyptian-English Word Alignment -- Discussion Forum Training

Introduction


BOLT Egyptian-English Word Alignment -- Discussion Forum Training was developed by the Linguistic Data Consortium (LDC) and consists of 400,448 words of Egyptian Arabic and English parallel text enhanced with linguistic tags to indicate word relations.


The DARPA BOLT (Broad Operational Language Translation) program developed machine translation and information retrieval for less formal genres, focusing particularly on user-generated content. LDC supported the BOLT program by collecting informal data sources -- discussion forums, text messaging and chat -- in Chinese, Egyptian Arabic and English. The collected data was translated and annotated for various tasks including word alignment, treebanking, propbanking and co-reference.


Data


This release consists of Egyptian source discussion forum threads harvested from the Internet by LDC using a combination of manual and automatic processes. The source data is released as BOLT Arabic Discussion Forums (LDC2018T10).


The BOLT word alignment task was built on treebank annotation. Specifically, Egyptian source tree tokens for word alignment were automatically extracted from tree files of BOLT Egyptian Arabic Treebank annotation on the source discussion forum data harvested by LDC. Human annotators then followed LDC guidelines to link words and phrases in Arabic to those in English.


The data profile, broken down by words, tree tokens and segments, appears below:

Language         Genre             Files  Words    Tree-tokens  Segments
Egyptian Arabic  discussion forum  724    400,448  593,723      31,454

Acknowledgement


This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR0011-11-C-0145. The content does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.


Updates


None at this time.


Portions © 2012-2015, 2018, 2019 Trustees of the University of Pennsylvania

Authors

  • Li, Xuansong
  • Peterson, Katherine
  • Grimes, Stephen
  • Strassel, Stephanie

0 Citations, 0 Mentions, 35% FAIR, 0.9 Dataset Index
DOI: 10.35111/me9d-jr38 (2019)

HAVIC MED Event E051-E060 -- Videos, Metadata and Annotation

Introduction


HAVIC MED Event E051-E060 -- Videos, Metadata and Annotation was developed by the Linguistic Data Consortium (LDC) and comprises approximately 53 hours of user-generated videos with annotation and metadata.


To advance multimodal event detection and related technologies, LDC developed, in collaboration with NIST (the National Institute of Standards and Technology), a large, heterogeneous, annotated multimodal corpus for HAVIC (the Heterogeneous Audio Visual Internet Collection) that was used in the NIST-sponsored MED (Multimedia Event Detection) task for several years. HAVIC MED Event E051-E060 is a subset of that corpus, specifically, a collection of event videos for the HAVIC Project originally released to support the 2016 Multimedia Event Detection task.


Data


The data consists of videos of various events (event videos) and videos completely unrelated to events (background videos) harvested by a large team of human annotators. Each event video was manually annotated with a set of judgments describing its event properties and other salient features. Background videos were labeled with topic and genre categories.


All video files are in .mp4 format (H.264), with varying bit rates and levels of audio fidelity and video resolution. Metadata and annotation for the videos are stored in a .tsv file.


Samples


Please view this video sample and annotation sample.


Updates


None at this time.


Portions © 2011-2016 YouTube, LLC, © 2011-2018 Trustees of the University of Pennsylvania

Authors

  • Morris, Amanda
  • Strassel, Stephanie
  • Li, Xuansong
  • Antonishek, Brian
  • Fiscus, Jonathan G.

0 Citations, 0 Mentions, 35% FAIR, 0.9 Dataset Index
DOI: 10.35111/m8n7-xy13 (2018)

GALE English-Chinese Parallel Aligned Treebank -- Training

Introduction


GALE English-Chinese Parallel Aligned Treebank -- Training was developed by the Linguistic Data Consortium (LDC) and contains 196,123 tokens of word aligned English and Chinese parallel text with treebank annotations. This material was used as training data in the DARPA GALE (Global Autonomous Language Exploitation) program.


Parallel aligned treebanks are treebanks annotated with morphological and syntactic structures aligned at the sentence level and the sub-sentence level. Such data sets are useful for natural language processing and related fields, including automatic word alignment system training and evaluation, transfer-rule extraction, word sense disambiguation, translation lexicon extraction and cultural heritage and cross-linguistic studies. With respect to machine translation system development, parallel aligned treebanks may improve system performance with enhanced syntactic parsers, better rules and knowledge about language pairs and reduced word error rate.


The English source data was translated into Chinese. Chinese and English treebank annotations were performed independently. The parallel texts were then word aligned. The material in this release corresponds to portions of the treebanked data in OntoNotes 3.0 (LDC2009T24) and OntoNotes 4.0 (LDC2011T03).


Data


This release consists of English source broadcast programming (CNN, NBC/MSNBC) and web data collected by LDC in 2005 and 2006. The distribution by genre, words, character tokens, treebank tokens and segments appears below:

Genre  Files  Words    CharTokens  CTBTokens  Segments
bc     6      60,061   90,092      62,438     3,763
wb     15     70,687   106,031     69,309     3,238
Total  21     130,748  196,123     131,747    7,001

Note that all token counts are based on the Chinese data only. One token is equivalent to one character and one word is equivalent to 1.5 characters.
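
The stated ratio of 1.5 characters per word can be checked against the table with a short sketch. The helper name is hypothetical, and rounding down is an assumption; it happens to reproduce the word counts reported for the wb and Total rows (70,687 and 130,748).

```python
def estimate_words(char_tokens: int) -> int:
    """Estimate the word count from a character-token count.

    Uses the stated ratio of 1.5 characters per word; integer arithmetic
    (chars * 2 // 3) is floor(chars / 1.5) without floating-point error.
    """
    return char_tokens * 2 // 3


print(estimate_words(106_031))  # wb row character tokens
print(estimate_words(196_123))  # Total row character tokens
```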


The word alignment task consisted of the following components:



  • Identifying, aligning, and tagging eight different types of links

  • Identifying, attaching, and tagging local-level unmatched words

  • Identifying and tagging sentence/discourse-level unmatched words

  • Identifying and tagging all instances of Chinese 的 (DE) except when they were a part of a semantic link


This release contains nine types of files: English raw source files, Chinese raw translation files, Chinese character tokenized files, Chinese CTB tokenized files, English tokenized files, Chinese treebank files, English treebank files, character-based word alignment files, and CTB-based word alignment files.


Sponsorship


This work was supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this publication does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.


Updates


04/12/2017 - The Chinese raw translation files for broadcast conversation were updated to the correct files. All downloads received after this date are fully up to date.


Portions © 2005 Cable News Network, LP, LLLP, © 2006 National Broadcasting Company, Inc., © 2005, 2006, 2011, 2017 Trustees of the University of Pennsylvania

Authors

  • Li, Xuansong
  • Grimes, Stephen
  • Strassel, Stephanie
  • Ma, Xiaoyi
  • Xue, Nianwen
  • Marcus, Mitch
  • Taylor, Ann

0 Citations, 0 Mentions, 35% FAIR, 0.9 Dataset Index
DOI: 10.35111/w1wt-fc49 (2017)

BOLT Chinese-English Word Alignment and Tagging -- Discussion Forum Training

Introduction


BOLT Chinese-English Word Alignment and Tagging -- Discussion Forum Training was developed by the Linguistic Data Consortium (LDC) and consists of 448,094 words of Chinese and English parallel text enhanced with linguistic tags to indicate word relations.


The DARPA BOLT (Broad Operational Language Translation) program developed machine translation and information retrieval for less formal genres, focusing particularly on user-generated content. LDC supported the BOLT program by collecting informal data sources -- discussion forums, text messaging and chat -- in Chinese, Egyptian Arabic and English. The collected data was translated and annotated for various tasks including word alignment, treebanking, propbanking and co-reference.


Data


This release consists of Chinese source discussion forum threads harvested from the Internet by LDC using a combination of manual and automatic processes. The source data is released as BOLT Chinese Discussion Forums (LDC2016T05).


The BOLT word alignment task was built on treebank annotation. Specifically, LDC automatically extracted Chinese source tokens, including empty categories/traces, from word-segmented files provided by the BOLT Chinese Treebank annotation team at Brandeis University. The word-segmented tokens were then used to automatically generate ctb (Chinese Treebank) alignment and were also tokenized for character alignment by inserting white spaces to separate characters.


The data profile broken down by character tokens, ctb tokens and segments appears below:

Language  Genre  Files  Words    CharTokens  CTBTokens  Segments
Chinese   forum  570    448,094  672,140     442,520    20,819

Acknowledgement


This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR0011-11-C-0145. The content does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.


Updates


None at this time.


Portions © 2012-2016 Trustees of the University of Pennsylvania

Authors

  • Li, Xuansong
  • Peterson, Katherine
  • Grimes, Stephen
  • Strassel, Stephanie

1 Citation, 0 Mentions, 35% FAIR, 1.3 Dataset Index
DOI: 10.35111/s5ae-pn38 (2016)

HAVIC Pilot Transcription

Introduction


HAVIC Pilot Transcription was developed by the Linguistic Data Consortium (LDC) and comprises approximately 72 hours of user-generated videos with transcripts based on the English speech audio extracted from the videos. This data set was created in collaboration with NIST (the National Institute of Standards and Technology) as part of the HAVIC (the Heterogeneous Audio Visual Internet Collection) project, the goal of which was to advance multimodal event detection and related technologies.


LDC has developed a large, heterogeneous, annotated multimodal corpus for HAVIC that has been used in the NIST-sponsored MED (Multimedia Event Detection) task for several years. HAVIC Pilot Transcription supported an experiment to produce a verbatim transcript (quick and rich transcription) based on audio extracted from user-generated videos. It contains the pilot transcripts for selected MED 2011 video files as well as the associated videos.


Data


NIST designated the videos to be transcribed. Annotators generated the transcripts using XTrans, which supports manual transcription across multiple channels, languages and platforms. HAVIC transcription guidelines are included in the documentation for this release.


Each file was transcribed by a single annotator with no corpus-wide second pass. File samples from each annotator were checked for various errors, including missing transcription, improper mark-up, poor segmentation and missing/added words.


All transcription files are in .tdf format, a plain-text, flat-table format with 13 tab-delimited fields. All video files are in .mp4 format (H.264), with varying bit rates and levels of audio fidelity and video resolution.
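
As a rough illustration of working with a flat table of 13 tab-delimited fields, the sketch below reads such rows with Python's standard csv module and keeps only lines that have exactly 13 fields. The function name is hypothetical, the field names are deliberately not assumed, and the sample data is synthetic placeholder text rather than corpus content.

```python
import csv
import io

EXPECTED_FIELDS = 13  # field count stated in the corpus description


def read_tdf_rows(stream):
    """Yield rows that contain exactly 13 tab-delimited fields."""
    # QUOTE_NONE keeps quotation marks in transcript text as literal characters.
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) == EXPECTED_FIELDS:
            yield row


# Synthetic input: one well-formed row and one short line that is skipped.
sample = "\t".join(f"field{i}" for i in range(1, 14)) + "\n" + "short\tline\n"
rows = list(read_tdf_rows(io.StringIO(sample)))
print(len(rows), len(rows[0]))
```

For a real file, the same function could be given a handle from `open(path, encoding="utf-8", newline="")`; the UTF-8 encoding is an assumption, so check the release documentation before relying on it.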


Samples


Please view these video and transcript samples.


Updates


None at this time.


Portions © 2011-2016 YouTube, LLC, © 2011-2016 Trustees of the University of Pennsylvania

Authors

  • Tracey, Jennifer
  • Strassel, Stephanie
  • Morris, Amanda
  • Li, Xuansong
  • Antonishek, Brian
  • Fiscus, Jonathan G.

0 Citations, 0 Mentions, 35% FAIR, 0.9 Dataset Index
DOI: 10.35111/cn82-n503 (2016)

GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 4

Introduction


GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 4 was developed by the Linguistic Data Consortium (LDC) and contains 243,038 tokens of word aligned Chinese and English parallel text enriched with linguistic tags. This material was used as training data in the DARPA GALE (Global Autonomous Language Exploitation) program.


Some approaches to statistical machine translation include the incorporation of linguistic knowledge in word aligned text as a means to improve automatic word alignment and machine translation quality. This is accomplished with two annotation schemes: alignment and tagging. Alignment identifies minimum translation units and translation relations by using minimum-match and attachment annotation approaches. A set of word tags and alignment link tags are designed in the tagging scheme to describe these translation units and relations. Tagging adds contextual, syntactic and language-specific features to the alignment annotation.


Other releases available in this series are:



  • GALE Chinese-English Word Alignment and Tagging Training Part 1 -- Newswire and Web (LDC2012T16)

  • GALE Chinese-English Word Alignment and Tagging Training Part 2 -- Newswire (LDC2012T20)

  • GALE Chinese-English Word Alignment and Tagging Training Part 3 -- Web (LDC2012T24)

  • GALE Chinese-English Word Alignment and Tagging Training Part 4 -- Web (LDC2013T05)

  • GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 1 (LDC2013T23)

  • GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 2 (LDC2014T25)

  • GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 3 (LDC2015T04)


Data


This release consists of Chinese source broadcast conversation (BC) and broadcast news (BN) programming collected by LDC in 2008 and 2009. The distribution by genre, words, character tokens and segments appears below:

Language  Genre  Files  Words    CharTokens  Segments
Chinese   BC     69     67,782   101,674     2,276
Chinese   BN     29     94,242   141,364     3,152
Total            98     162,024  243,038     5,428


Note that all token counts are based on the Chinese data only. One token is equivalent to one character and one word is equivalent to 1.5 characters.


The Chinese word alignment tasks consisted of the following components:



  • Identifying, aligning, and tagging eight different types of links

  • Identifying, attaching, and tagging local-level unmatched words

  • Identifying and tagging sentence/discourse-level unmatched words

  • Identifying and tagging all instances of Chinese 的 (DE) except when they were a part of a semantic link


Samples


Please view the following sample.



Sponsorship


This work was supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this publication does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.


Updates


None at this time.


Portions © 2008-2009 China Central TV, © 2008 Hubei TV, © 2008, 2009, 2015 Trustees of the University of Pennsylvania

Authors

  • Li, Xuansong
  • Grimes, Stephen
  • Strassel, Stephanie

1 Citation, 0 Mentions, 35% FAIR, 1.2 Dataset Index
DOI: 10.35111/jt1b-sv88 (2015)