Automated Author Profile
Gormley, Matthew R.
Current S-Index
Sum of Dataset Indices for all datasets
Average Dataset Index per Dataset
Average Dataset Index per dataset
Total Datasets
Total datasets for this author
Average FAIR Score
Average FAIR Score per dataset
Total Citations
Total citations to the author's datasets
Total Mentions
Total mentions of the author's datasets
S-Index Interpretation
The S-Index (Sharing Index) is a comprehensive metric that represents the cumulative impact of all your datasets. It is calculated as the sum of Dataset Index scores across all your claimed datasets.
What it means:
- A higher S-Index indicates greater overall impact of your datasets relative to typical datasets in their fields of research
- The S-Index grows as you add more datasets or as existing datasets gain more citations and mentions
- It provides a single number to track your research data impact over time
Current S-Index: 2.8 (sum of the Dataset Index scores of 2 datasets)
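The calculation described above can be sketched in a few lines of Python. This is only an illustration of the stated definition (a sum of per-dataset Dataset Index scores). The function names and the two individual scores are hypothetical: the profile reports only the total of 2.8 over 2 datasets, not the per-dataset values.

```python
from statistics import mean


def s_index(dataset_indices):
    """Return the S-Index: the sum of the per-dataset Dataset Index scores."""
    return sum(dataset_indices)


def average_dataset_index(dataset_indices):
    """Return the mean Dataset Index per dataset (0.0 if there are no datasets)."""
    return mean(dataset_indices) if dataset_indices else 0.0


# Hypothetical per-dataset scores, chosen only so that they sum to the
# reported total of 2.8; the real individual scores are not given.
example_scores = [1.5, 1.3]

print(round(s_index(example_scores), 2))
print(round(average_dataset_index(example_scores), 2))
```

Because the S-Index is a plain sum, adding a dataset with a positive Dataset Index raises it, which matches the statement that the index grows as datasets are added.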
[Charts: S-Index Over Time; Cumulative Citations Over Time; Cumulative Mentions Over Time]
Datasets
Introduction
Concretely Annotated English Gigaword was developed by Johns Hopkins University's Human Language Technology Center of Excellence (JHU). It adds multiple kinds and instances of automatically-generated syntactic, semantic and coreference annotations to English Gigaword Fifth Edition (LDC2011T07).
Concrete is a schema for representing structured, hierarchical and overlapping linguistic annotations. This release provides the outputs of multiple tools that produce the same annotation types, stored as different annotation theories under a shared tokenization.
The Linguistic Data Consortium (LDC) has also released Annotated English Gigaword (LDC2012T21), earlier work by JHU researchers to create a standardized corpus for knowledge extraction and distributional semantics by using then-state-of-the-art tools to add automatically-generated syntactic and discourse structure annotation to English Gigaword Fifth Edition.
Data
Concretely Annotated English Gigaword contains the nearly ten million documents (over four billion words) of the original English Gigaword Fifth Edition, which consists of newswire stories from seven sources collected by LDC between 1994 and 2010.
The following layers of annotation were added under the Concrete schema:
- Segmented sentences and Penn Treebank-style tokenized words
- Treebank-style constituent parse trees
- Four different syntactic dependency trees
- Named entities
- Part of speech tags
- Lemmas
- In-document entity coreference chains
- Three different frame semantic parses
The data is stored in a binary form called Concrete, which is based on Apache Thrift. Concrete can be read and written in many common programming languages, such as Java, Python, JavaScript and C++. Concrete also provides a number of utilities for accessing and viewing the data in human-readable forms.
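To illustrate the idea of several tools contributing the same annotation type over a shared tokenization, here is a minimal Python sketch. It is not the Concrete schema or its Thrift-generated API: the class names, field names, tool names and head-index encoding are all hypothetical and exist only to show the layering concept.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """One tool's output for one annotation type (hypothetical structure)."""
    tool: str     # name of the producing tool (illustrative only)
    kind: str     # annotation type, e.g. "pos" or "dependency"
    labels: list  # one label per token of the shared tokenization


@dataclass
class Sentence:
    """A tokenized sentence carrying any number of annotation theories."""
    tokens: list
    theories: list = field(default_factory=list)

    def add_theory(self, tool, kind, labels):
        # Every theory must align with the single shared tokenization.
        if len(labels) != len(self.tokens):
            raise ValueError("labels must align one-to-one with tokens")
        self.theories.append(Annotation(tool, kind, labels))

    def theories_of(self, kind):
        """Return all theories of the given annotation type."""
        return [t for t in self.theories if t.kind == kind]


sentence = Sentence(tokens=["Stocks", "rose", "."])
# Two hypothetical parsers disagree on where the punctuation attaches.
# Labels are head indices; -1 marks the root.
sentence.add_theory("parser_a", "dependency", [1, -1, 1])
sentence.add_theory("parser_b", "dependency", [1, -1, 0])
sentence.add_theory("tagger_a", "pos", ["NNS", "VBD", "."])

for theory in sentence.theories_of("dependency"):
    print(theory.tool, theory.labels)
```

Keeping the tokens in one place and attaching each tool's output as a separate theory is what lets competing analyses be compared token by token.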
Reference
Users of this corpus must cite the following paper:
Francis Ferraro, Max Thomas, Matthew Gormley, Travis Wolfe, Craig Harman, and Benjamin Van Durme. "Concretely Annotated Corpora." In The Proceedings of the NIPS Workshop on Automated Knowledge Base Construction (AKBC). NIPS Workshop 2014.
Additional Licensing Instructions
Any organization that licensed English Gigaword Fifth Edition (LDC2011T07) or Annotated English Gigaword (LDC2012T21) may request a copy of Concretely Annotated English Gigaword (LDC2018T20) for a $250 media fee. Contact LDC for licensing.
Portions © 1994-2010 Agence France Presse, © 1994-2010 The Associated Press, © 1997-2010 Central News Agency (Taiwan), © 1994-1998, 2003-2009 Los Angeles Times-Washington Post News Service, Inc., © 1994-2010 New York Times, © 2010 The Washington Post News Service with Bloomberg News, © 1995-2010 Xinhua News Agency, © 2003, 2005, 2007, 2009, 2011, 2018 Trustees of the University of Pennsylvania
Authors
- Ferraro, Francis
- Thomas, Max
- Gormley, Matthew R.
- Wolfe, Travis
- Harman, Craig
- Van Durme, Benjamin
Introduction
Annotated English Gigaword was developed by Johns Hopkins University's Human Language Technology Center of Excellence. It adds automatically-generated syntactic and discourse structure annotation to English Gigaword Fifth Edition (LDC2011T07) and also contains an API and tools for reading the dataset's XML files. The goal of the annotation is to provide a standardized corpus for knowledge extraction and distributional semantics which enables broader involvement in large-scale knowledge-acquisition efforts by researchers.
Data
Annotated English Gigaword contains the nearly ten million documents (over four billion words) of the original English Gigaword Fifth Edition from seven news sources:
- Agence France-Presse, English Service (afp_eng)
- Associated Press Worldstream, English Service (apw_eng)
- Central News Agency of Taiwan, English Service (cna_eng)
- Los Angeles Times/Washington Post Newswire Service (ltw_eng)
- Washington Post/Bloomberg Newswire Service (wpb_eng)
- New York Times Newswire Service (nyt_eng)
- Xinhua News Agency, English Service (xin_eng)
The following layers of annotation were added:
- Tokenized and segmented sentences
- Treebank-style constituent parse trees
- Syntactic dependency trees
- Named entities
- In-document coreference chains
The annotation was performed in a three-step process: (1) the data was preprocessed and sentences were selected for annotation (sentences with more than 100 tokens were excluded); (2) syntactic parses were derived; and (3) the parsed output was post-processed to derive syntactic dependencies, named entities and coreference chains. Over 183 million sentences were parsed.
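The sentence-selection rule in step (1) can be sketched as a simple length filter. This is an illustration only, not the authors' preprocessing code: the function name is hypothetical, and whitespace splitting stands in for the real tokenizer, which the description above does not specify in detail.

```python
MAX_TOKENS = 100  # sentences with more than 100 tokens are excluded


def select_for_annotation(sentences, max_tokens=MAX_TOKENS):
    """Return the tokenized sentences that are short enough to be parsed.

    Assumption: tokens are approximated by whitespace splitting, which is
    a stand-in for the actual tokenization used to build the corpus.
    """
    selected = []
    for sentence in sentences:
        tokens = sentence.split()
        if len(tokens) <= max_tokens:
            selected.append(tokens)
    return selected


candidates = [
    "Stocks rose on Tuesday .",
    " ".join(["word"] * 100),  # exactly 100 tokens: kept
    " ".join(["word"] * 101),  # 101 tokens: excluded
]
kept = select_for_annotation(candidates)
print(len(kept))
```

A sentence with exactly 100 tokens is kept, since the stated rule excludes only sentences with more than 100 tokens.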
The data is stored in a form similar to the Gigaword SGML format, with XML annotations containing the additional markup. The included API provides object representations for the contents of the XML files.
Additional Licensing Information
Any 2011 member organization that licensed English Gigaword Fifth Edition (LDC2011T07) may request a no-cost copy of Annotated English Gigaword. Any non-member organization that licensed English Gigaword Fifth Edition may request a copy of Annotated English Gigaword for a $250 media fee. Please contact LDC for licensing or with any additional questions.
Updates
None at this time.
Portions © 1994-2010 Agence France Presse, © 1994-2010 The Associated Press, © 1997-2010 Central News Agency (Taiwan), © 1994-1998, 2003-2009 Los Angeles Times-Washington Post News Service, Inc., © 1994-2010 New York Times, © 2010 The Washington Post News Service with Bloomberg News, © 1995-2010 Xinhua News Agency, © 2012 Matthew R. Gormley, © 2003, 2005, 2007, 2009, 2011, 2012 Trustees of the University of Pennsylvania
Authors
- Napoles, Courtney
- Gormley, Matthew R.
- Van Durme, Benjamin