Automated Author Profile
Li, Xuansong
Current S-Index
Sum of Dataset Indices for all datasets
Average Dataset Index per Dataset
Average Dataset Index per dataset
Total Datasets
Total datasets for this author
Average FAIR Score
Average FAIR Score per dataset
Total Citations
Total citations to the author's datasets
Total Mentions
Total mentions of the author's datasets
S-Index Interpretation
The S-Index (Sharing Index) is a comprehensive metric that represents the cumulative impact of all your datasets. It is calculated as the sum of Dataset Index scores across all your claimed datasets.
What it means:
- A higher S-Index indicates greater overall impact of your datasets relative to typical datasets in their fields of research
- The S-Index grows as you add more datasets or as existing datasets gain more citations and mentions
- It provides a single number to track your research data impact over time
Current S-Index: 27.4 (sum of the Dataset Index scores of 27 datasets)
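The definition above (the S-Index as the sum of Dataset Index scores over all claimed datasets) can be sketched as follows. This is a minimal illustration, not the profile's actual implementation: the `Dataset` class, its field names and the example scores are all hypothetical, and how each Dataset Index is derived from citations, mentions and FAIR scores is not described here.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    title: str            # hypothetical field
    dataset_index: float  # hypothetical per-dataset score


def s_index(datasets):
    """Sum of Dataset Index scores across all claimed datasets."""
    return sum(d.dataset_index for d in datasets)


def average_dataset_index(datasets):
    """Average Dataset Index per dataset (0.0 for an empty profile)."""
    return s_index(datasets) / len(datasets) if datasets else 0.0


# Made-up scores, for illustration only.
example = [Dataset("A", 1.5), Dataset("B", 0.5), Dataset("C", 1.0)]
print(s_index(example), average_dataset_index(example))
```

Because the index is a plain sum, adding a dataset or raising any single Dataset Index can only increase it, which matches the behaviour described in the list above.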
Charts: S-Index Over Time, Cumulative Citations Over Time, Cumulative Mentions Over Time
Datasets
Introduction
BOLT Egyptian Arabic-English Word Alignment -- Conversational Telephone Speech Training was developed by the Linguistic Data Consortium (LDC) and consists of 153,171 words of Egyptian Arabic and English parallel text enhanced with linguistic tags to indicate word relations.
The DARPA BOLT (Broad Operational Language Translation) program developed machine translation and information retrieval for less formal genres, focusing particularly on user-generated content. LDC supported the BOLT program by collecting informal data sources -- discussion forums, text messaging and chat -- in Chinese, Egyptian Arabic and English. The collected data was translated and annotated for various tasks including word alignment, treebanking, propbanking and co-reference.
Data
The source data in this release consists of transcripts of Egyptian Arabic conversational telephone speech (CTS) from LDC's CALLHOME and CALLFRIEND collections (LDC97S45, LDC97T19, LDC2002S37, LDC2002T38, LDC96S49) that were translated into English by professional translation agencies and annotated for the word alignment task.
The BOLT word alignment task was built on treebank annotation. Specifically, Egyptian Arabic source tree tokens were automatically extracted from tree files in LDC's BOLT Egyptian Arabic Treebank. Those tree files had been tagged for part-of-speech and syntactically annotated. That data was then aligned and annotated for the word alignment task.
The data profile broken down by words, tree tokens and segments appears below:
| Language | Genre | Files | Words | Tree-tokens | Segments |
| Egyptian Arabic | CTS | 176 | 153,171 | 215,896 | 20,010 |
Acknowledgement
This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR0011-11-C-0145. The content does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.
Updates
None at this time.
Portions © 1996, 1997, 2002, 2012-2015, 2020 Trustees of the University of Pennsylvania
Authors
- Li, Xuansong
- Grimes, Stephen
- Strassel, Stephanie
Introduction
BOLT Egyptian Arabic-English Word Alignment -- SMS/Chat Training was developed by the Linguistic Data Consortium (LDC) and consists of 349,414 words of Egyptian Arabic and English parallel text enhanced with linguistic tags to indicate word relations.
The DARPA BOLT (Broad Operational Language Translation) program developed machine translation and information retrieval for less formal genres, focusing particularly on user-generated content. LDC supported the BOLT program by collecting informal data sources -- discussion forums, text messaging and chat -- in Chinese, Egyptian Arabic and English. The collected data was translated and annotated for various tasks including word alignment, treebanking, propbanking and co-reference.
Data
This release consists of Egyptian Arabic source text message and chat conversations collected using two methods: new collection via LDC's collection platform, and donation of SMS or chat archives from BOLT collection participants. The source data is released as BOLT Egyptian Arabic SMS/Chat and Transliteration (LDC2017T07).
The BOLT word alignment task was built on treebank annotation. Specifically, Egyptian Arabic source tree tokens were automatically extracted from tree files in LDC's BOLT Egyptian Arabic Treebank. Those tree files had been tagged for part-of-speech and syntactically annotated. That data was then aligned and annotated for the word alignment task.
The data profile broken down by words, tree/POS tokens and segments appears below:
| Language | Genre | Files | Words | Tree/POS-tokens | Segments |
| Egyptian Arabic | SMS/Chat | 1367 | 349,414 | 475,665 | 74,814 |
Acknowledgement
This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR0011-11-C-0145. The content does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.
Updates
None at this time.
Portions © 2019 Trustees of the University of Pennsylvania
Authors
- Li, Xuansong
- Grimes, Stephen
- Strassel, Stephanie
Introduction
BOLT Chinese-English Word Alignment and Tagging -- SMS/Chat Training was developed by the Linguistic Data Consortium (LDC) and consists of 388,027 words of Chinese and English parallel text enhanced with linguistic tags to indicate word relations.
The DARPA BOLT (Broad Operational Language Translation) program developed machine translation and information retrieval for less formal genres, focusing particularly on user-generated content. LDC supported the BOLT program by collecting informal data sources -- discussion forums, text messaging and chat -- in Chinese, Egyptian Arabic and English. The collected data was translated and annotated for various tasks including word alignment, treebanking, propbanking and co-reference.
Data
This release consists of Chinese source text message and chat conversations collected using two methods: new collection via LDC's collection platform and donation of SMS and chat archives from BOLT collection participants. The source data is released as BOLT Chinese SMS/Chat (LDC2018T15).
The BOLT word alignment task was built on treebank annotation. Specifically, LDC automatically extracted Chinese source tokens, including empty categories/traces, from word-segmented files provided by the BOLT Chinese Treebank annotation team at Brandeis University. The word-segmented tokens were then used to automatically generate ctb (Chinese Treebank) alignment and were also tokenized for character alignment by inserting white spaces to separate characters.
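The character tokenization step described above (inserting white space between characters) can be sketched as below. This is an assumption-laden illustration rather than LDC's tooling: the function names are hypothetical, and the sketch splits every character uniformly, whereas the corpus may treat Latin text, digits or empty categories/traces differently.

```python
def char_tokenize(segment: str) -> str:
    """Separate every non-whitespace character with a single space."""
    return " ".join(ch for ch in segment if not ch.isspace())


def ctb_to_char_tokens(ctb_tokens):
    """Flatten word-segmented (CTB-style) tokens into one token per character."""
    return [ch for token in ctb_tokens for ch in token]


# Hypothetical word-segmented input.
words = ["我们", "去", "吃饭"]
print(char_tokenize("".join(words)))
print(ctb_to_char_tokens(words))
```

Keeping both views of the same segment is what allows the release to provide CTB-based and character-based alignments side by side.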
The data profile broken down by character tokens, ctb tokens and segments appears below:
| Language | Genre | Files | Words | CharTokens | CTBTokens | Segments |
| Chinese | SMS/chat | 1359 | 388,027 | 582,043 | 419,406 | 59,564 |
Acknowledgement
This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR0011-11-C-0145. The content does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.
Samples
Please view the following samples:
- English Tokenized
- CTB-Based Word Alignment
- Character-Based Word Alignment
- Chinese CTB-based Tokenized
- Chinese Character Tokenized
Updates
None at this time.
Portions © 2012-2015, 2018, 2019 Trustees of the University of Pennsylvania
Authors
- Li, Xuansong
- Grimes, Stephen
- Strassel, Stephanie
Introduction
HAVIC MED Progress Test -- Videos, Metadata and Annotation was developed by the Linguistic Data Consortium (LDC) and is comprised of approximately 3,650 hours of user-generated videos with annotation and metadata.
To advance multimodal event detection and related technologies, LDC developed, in collaboration with NIST (the National Institute of Standards and Technology), a large, heterogeneous, annotated multimodal corpus for HAVIC (the Heterogeneous Audio Visual Internet Collection) that was used in the NIST-sponsored MED (Multimedia Event Detection) task for several years. HAVIC MED Progress Test is a subset of that corpus, specifically, a collection of event and background videos for the HAVIC project originally released to support the 2012, 2013, 2014, and 2015 Multimedia Event Detection tasks.
Data
The data consists of videos of various events (event videos) and videos completely unrelated to events (background videos) harvested by a large team of human annotators. Each event video was manually annotated with a set of judgments describing its event properties and other salient features. Background videos were labeled with topic and genre categories.
All video files are in .mp4 format (h.264), with varying bit-rates and levels of audio fidelity and video resolution. Metadata and annotation for the videos are stored in a .tsv file.
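A tab-separated metadata file of this kind can be read with Python's standard `csv` module. The sketch below assumes the file has a header row; the column names (`clip_id`, `kind`, `topic`) and the sample rows are invented for illustration and are not taken from the corpus documentation.

```python
import csv
import io


def read_metadata(handle):
    """Return one dict per data row, keyed by the header row's column names."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


# Hypothetical stand-in for the .tsv file; real column names will differ.
sample = io.StringIO(
    "clip_id\tkind\ttopic\n"
    "clip_0001\tevent\t\n"
    "clip_0002\tbackground\tsports\n"
)
rows = read_metadata(sample)
events = [r for r in rows if r["kind"] == "event"]
print(len(rows), len(events))
```

For a file on disk, the same function can be given `open(path, newline="", encoding="utf-8")`, where `path` and the encoding are again assumptions to check against the release documentation.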
Samples
Please view this video sample and annotation sample.
Updates
None at this time.
Additional Licensing Instructions
This members-only corpus is available to current members. Contact [email protected] for information about becoming a member.
Portions © 2011-2016 YouTube, LLC, © 2011-2016, 2019 Trustees of the University of Pennsylvania
Authors
- Morris, Amanda
- Strassel, Stephanie
- Li, Xuansong
- Antonishek, Brian
- Fiscus, Jonathan G.
Introduction
BOLT Egyptian-English Word Alignment -- Discussion Forum Training was developed by the Linguistic Data Consortium (LDC) and consists of 400,448 words of Egyptian Arabic and English parallel text enhanced with linguistic tags to indicate word relations.
The DARPA BOLT (Broad Operational Language Translation) program developed machine translation and information retrieval for less formal genres, focusing particularly on user-generated content. LDC supported the BOLT program by collecting informal data sources -- discussion forums, text messaging and chat -- in Chinese, Egyptian Arabic and English. The collected data was translated and annotated for various tasks including word alignment, treebanking, propbanking and co-reference.
Data
This release consists of Egyptian source discussion forum threads harvested from the Internet by LDC using a combination of manual and automatic processes. The source data is released as BOLT Arabic Discussion Forums (LDC2018T10).
The BOLT word alignment task was built on treebank annotation. Specifically, Egyptian source tree tokens for word alignment were automatically extracted from tree files of BOLT Egyptian Arabic Treebank annotation on the source discussion forum data harvested by LDC. Human annotators then followed LDC guidelines to link words and phrases in Arabic to those in English.
The data profile broken down by words, tree tokens and segments appears below:
| Language | Genre | Files | Words | Tree-tokens | Segments |
| Egyptian Arabic | discussion forum | 724 | 400,448 | 593,723 | 31,454 |
Acknowledgement
This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR0011-11-C-0145. The content does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.
Updates
None at this time.
Portions © 2012-2015, 2018, 2019 Trustees of the University of Pennsylvania
Authors
- Li, Xuansong
- Peterson, Katherine
- Grimes, Stephen
- Strassel, Stephanie
Introduction
HAVIC MED Event E051-E060 -- Videos, Metadata and Annotation was developed by the Linguistic Data Consortium (LDC) and is comprised of approximately 53 hours of user-generated videos with annotation and metadata.
To advance multimodal event detection and related technologies, LDC developed, in collaboration with NIST (the National Institute of Standards and Technology), a large, heterogeneous, annotated multimodal corpus for HAVIC (the Heterogeneous Audio Visual Internet Collection) that was used in the NIST-sponsored MED (Multimedia Event Detection) task for several years. HAVIC MED Event E051-E060 is a subset of that corpus, specifically, a collection of event videos for the HAVIC Project originally released to support the 2016 Multimedia Event Detection task.
Data
The data consists of videos of various events (event videos) and videos completely unrelated to events (background videos) harvested by a large team of human annotators. Each event video was manually annotated with a set of judgments describing its event properties and other salient features. Background videos were labeled with topic and genre categories.
All video files are in .mp4 format (h.264), with varying bit-rates and levels of audio fidelity and video resolution. Metadata and annotation for the videos are stored in a .tsv file.
Samples
Please view this video sample and annotation sample.
Updates
None at this time.
Portions © 2011-2016 YouTube, LLC, © 2011-2018 Trustees of the University of Pennsylvania
Authors
- Morris, Amanda
- Strassel, Stephanie
- Li, Xuansong
- Antonishek, Brian
- Fiscus, Jonathan G.
Introduction
GALE English-Chinese Parallel Aligned Treebank -- Training was developed by the Linguistic Data Consortium (LDC) and contains 196,123 tokens of word aligned English and Chinese parallel text with treebank annotations. This material was used as training data in the DARPA GALE (Global Autonomous Language Exploitation) program.
Parallel aligned treebanks are treebanks annotated with morphological and syntactic structures and aligned at the sentence and sub-sentence levels. Such data sets are useful for natural language processing and related fields, including automatic word alignment system training and evaluation, transfer-rule extraction, word sense disambiguation, translation lexicon extraction, and cultural heritage and cross-linguistic studies. For machine translation system development, parallel aligned treebanks may improve system performance through enhanced syntactic parsers, better rules and knowledge about language pairs, and reduced word error rate.
The English source data was translated into Chinese. Chinese and English treebank annotations were performed independently. The parallel texts were then word aligned. The material in this release corresponds to portions of the treebanked data in OntoNotes 3.0 (LDC2009T24) and OntoNotes 4.0 (LDC2011T03).
Data
This release consists of English source broadcast programming (CNN, NBC/MSNBC) and web data collected by LDC in 2005 and 2006. The distribution by genre, words, character tokens, treebank tokens and segments appears below:
| Genre | Files | Words | CharTokens | CTBTokens | Segments |
| bc | 6 | 60,061 | 90,092 | 62,438 | 3,763 |
| wb | 15 | 70,687 | 106,031 | 69,309 | 3,238 |
| Total | 21 | 130,748 | 196,123 | 131,747 | 7,001 |
Note that all token counts are based on the Chinese data only. One token is equivalent to one character and one word is equivalent to 1.5 characters.
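The table and the 1.5 characters-per-word note can be cross-checked with a few lines of arithmetic. In this sketch the bc word count is taken as 60,061, the value implied by the Total row (130,748 minus 70,687); the dictionary layout and function names are illustrative only.

```python
rows = {
    "bc": {"files": 6, "words": 60_061, "char_tokens": 90_092,
           "ctb_tokens": 62_438, "segments": 3_763},
    "wb": {"files": 15, "words": 70_687, "char_tokens": 106_031,
           "ctb_tokens": 69_309, "segments": 3_238},
}
total = {"files": 21, "words": 130_748, "char_tokens": 196_123,
         "ctb_tokens": 131_747, "segments": 7_001}


def column_sums(rows):
    """Add up every column across the genre rows."""
    keys = next(iter(rows.values())).keys()
    return {k: sum(r[k] for r in rows.values()) for k in keys}


def chars_per_word(row):
    """Ratio of character tokens to words for one row."""
    return row["char_tokens"] / row["words"]


print(column_sums(rows) == total)
for genre, row in rows.items():
    print(genre, round(chars_per_word(row), 3))
```

Each ratio comes out within rounding of 1.5, which suggests the word counts were derived from the character counts rather than counted independently.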
The word alignment task consisted of the following components:
- Identifying, aligning, and tagging eight different types of links
- Identifying, attaching, and tagging local-level unmatched words
- Identifying and tagging sentence/discourse-level unmatched words
- Identifying and tagging all instances of Chinese 的 (DE) except when they were a part of a semantic link
This release contains nine types of files: English raw source files, Chinese raw translation files, Chinese character tokenized files, Chinese CTB tokenized files, English tokenized files, Chinese treebank files, English treebank files, character-based word alignment files, and CTB-based word alignment files.
Samples
Please view the following samples:
- English raw source
- Chinese raw translation
- Chinese character tokenized
- Chinese CTB tokenized
- English tokenized
- Chinese treebank
- English treebank
- Character-based word alignment
- CTB-based word alignment
Sponsorship
This work was supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this publication does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.
Updates
04/12/2017 - The Chinese raw translation files for broadcast conversation were updated to the correct files. All downloads received after this date are fully up to date.
Portions © 2005 Cable News Network, LP, LLLP, © 2006 National Broadcasting Company, Inc., © 2005, 2006, 2011, 2017 Trustees of the University of Pennsylvania
Authors
- Li, Xuansong
- Grimes, Stephen
- Strassel, Stephanie
- Ma, Xiaoyi
- Xue, Nianwen
- Marcus, Mitch
- Taylor, Ann
Introduction
BOLT Chinese-English Word Alignment and Tagging -- Discussion Forum Training was developed by the Linguistic Data Consortium (LDC) and consists of 448,094 words of Chinese and English parallel text enhanced with linguistic tags to indicate word relations.
The DARPA BOLT (Broad Operational Language Translation) program developed machine translation and information retrieval for less formal genres, focusing particularly on user-generated content. LDC supported the BOLT program by collecting informal data sources -- discussion forums, text messaging and chat -- in Chinese, Egyptian Arabic and English. The collected data was translated and annotated for various tasks including word alignment, treebanking, propbanking and co-reference.
Data
This release consists of Chinese source discussion forum threads harvested from the Internet by LDC using a combination of manual and automatic processes. The source data is released as BOLT Chinese Discussion Forums (LDC2016T05).
The BOLT word alignment task was built on treebank annotation. Specifically, LDC automatically extracted Chinese source tokens, including empty categories/traces, from word-segmented files provided by the BOLT Chinese Treebank annotation team at Brandeis University. The word-segmented tokens were then used to automatically generate ctb (Chinese Treebank) alignment and were also tokenized for character alignment by inserting white spaces to separate characters.
The data profile broken down by character tokens, ctb tokens and segments appears below:
| Language | Genre | Files | Words | CharTokens | CTBTokens | Segments |
| Chinese | forum | 570 | 448,094 | 672,140 | 442,520 | 20,819 |
Acknowledgement
This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR0011-11-C-0145. The content does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.
Samples
Please view the following samples:
- Chinese Character Tokenized
- Chinese CTB-Based Tokenized
- English Tokenized
- Character-Based Word Alignment
- CTB-Based Word Alignment
Updates
None at this time.
Portions © 2012-2016 Trustees of the University of Pennsylvania
Authors
- Li, Xuansong
- Peterson, Katherine
- Grimes, Stephen
- Strassel, Stephanie
Introduction
HAVIC Pilot Transcription was developed by the Linguistic Data Consortium (LDC) and is comprised of approximately 72 hours of user-generated videos with transcripts based on the English speech audio extracted from the videos. This data set was created in collaboration with NIST (the National Institute of Standards and Technology) as part of the HAVIC (the Heterogeneous Audio Visual Internet Collection) project, the goal of which was to advance multimodal event detection and related technologies.
LDC has developed a large, heterogeneous, annotated multimodal corpus for HAVIC that has been used in the NIST-sponsored MED (Multimedia Event Detection) task for several years. HAVIC Pilot Transcription supported an experiment to produce a verbatim transcript (quick and rich transcription) based on audio extracted from user-generated videos. It contains the pilot transcripts for selected MED 2011 video files as well as the associated videos.
Data
NIST designated the videos to be transcribed. Annotators generated the transcripts using XTrans, which supports manual transcription across multiple channels, languages and platforms. HAVIC transcription guidelines are included in the documentation for this release.
Each file was transcribed by a single annotator with no corpus-wide second pass. File samples from each annotator were checked for various errors, including missing transcription, improper mark-up, poor segmentation and missing/added words.
All transcription files are in .tdf format, a plain-text, flat-table format with 13 tab-delimited fields. All video files are in .mp4 format (h264), with varying bit-rates and levels of audio fidelity and video resolution.
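Since each .tdf record is a flat row of 13 tab-delimited fields, a basic structural check is straightforward. The sketch below only verifies the field count stated above; it does not assume field names or meanings, and it treats any non-blank line with a different number of fields as malformed, which may be too strict if real files carry header or comment lines.

```python
EXPECTED_FIELDS = 13  # field count stated in the corpus description


def read_tdf(lines):
    """Split lines on tabs; return (records, line numbers with a wrong field count)."""
    records, malformed = [], []
    for line_no, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) == EXPECTED_FIELDS:
            records.append(fields)
        else:
            malformed.append(line_no)
    return records, malformed


# Synthetic lines, not real transcript data.
good = "\t".join(f"f{i}" for i in range(EXPECTED_FIELDS))
bad = "only\tthree\tfields"
records, malformed = read_tdf([good + "\n", "\n", bad + "\n"])
print(len(records), malformed)
```

Returning the malformed line numbers instead of raising lets a caller inspect unexpected lines before deciding how to handle them.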
Samples
Please view these video and transcript samples.
Updates
None at this time.
Portions © 2011-2016 YouTube, LLC, © 2011-2016 Trustees of the University of Pennsylvania
Authors
- Tracey, Jennifer
- Strassel, Stephanie
- Morris, Amanda
- Li, Xuansong
- Antonishek, Brian
- Fiscus, Jonathan G.
Introduction
GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 4 was developed by the Linguistic Data Consortium (LDC) and contains 243,038 tokens of word aligned Chinese and English parallel text enriched with linguistic tags. This material was used as training data in the DARPA GALE (Global Autonomous Language Exploitation) program.
Some approaches to statistical machine translation include the incorporation of linguistic knowledge in word aligned text as a means to improve automatic word alignment and machine translation quality. This is accomplished with two annotation schemes: alignment and tagging. Alignment identifies minimum translation units and translation relations by using minimum-match and attachment annotation approaches. A set of word tags and alignment link tags are designed in the tagging scheme to describe these translation units and relations. Tagging adds contextual, syntactic and language-specific features to the alignment annotation.
Other releases available in this series are:
- GALE Chinese-English Word Alignment and Tagging Training Part 1 -- Newswire and Web (LDC2012T16)
- GALE Chinese-English Word Alignment and Tagging Training Part 2 -- Newswire (LDC2012T20)
- GALE Chinese-English Word Alignment and Tagging Training Part 3 -- Web (LDC2012T24)
- GALE Chinese-English Word Alignment and Tagging Training Part 4 -- Web (LDC2013T05)
- GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 1 (LDC2013T23)
- GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 2 (LDC2014T25)
- GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 3 (LDC2015T04)
Data
This release consists of Chinese source broadcast conversation (BC) and broadcast news (BN) programming collected by LDC in 2008 and 2009. The distribution by genre, words, character tokens and segments appears below:
| Language | Genre | Files | Words | CharTokens | Segments |
| Chinese | BC | 69 | 67,782 | 101,674 | 2,276 |
| Chinese | BN | 29 | 94,242 | 141,364 | 3,152 |
| Total | | 98 | 162,024 | 243,038 | 5,428 |
Note that all token counts are based on the Chinese data only. One token is equivalent to one character and one word is equivalent to 1.5 characters.
The Chinese word alignment tasks consisted of the following components:
- Identifying, aligning, and tagging eight different types of links
- Identifying, attaching, and tagging local-level unmatched words
- Identifying and tagging sentence/discourse-level unmatched words
- Identifying and tagging all instances of Chinese 的 (DE) except when they were a part of a semantic link
Samples
Please view the following sample.
Sponsorship
This work was supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this publication does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.
Updates
None at this time.
Portions © 2008-2009 China Central TV, © 2008 Hubei TV, © 2008, 2009, 2015 Trustees of the University of Pennsylvania
Authors
- Li, Xuansong
- Grimes, Stephen
- Strassel, Stephanie