Scholar Data

Machine Reading Phase 1 IC Training Data

Introduction

Machine Reading Phase 1 IC Training Data was developed by the Linguistic Data Consortium and contains 248 English source documents and 116 standoff annotation files used in the DARPA (Defense Advanced Research Projects Agency) Machine Reading program.

The Machine Reading (MR) program aimed to develop automated reading systems to bridge the gap between knowledge contained in natural language texts and knowledge accessible to formal reasoning systems. The reading systems designed by program participants were required to extract and reason about facts from text in multiple domains.

The data in this release constitutes the training data for the IC (Core Domain) task. The IC Use Cases tested the core domain by extracting information about Entities (people, organizations, geopolitical entities or "GPEs") and their involvement in four types of Relations: Attack Relations (e.g. bombings), Biographical Relations (e.g. being a citizen of a country), Affiliation Relations (e.g. being a leader of an organization), and Family Relations (e.g. having a spouse) as described in newswire text. This information was then aligned with an IC Use Cases ontology that would allow automated reasoning about the extracted Entities and Relations.

Data

This release contains 248 source documents (108,960 words) from English newswire stories in English Gigaword Fourth Edition (LDC2009T13). Roughly half of those documents (116) were annotated for IC/Core Use Cases. Annotation was non-exhaustive, but an attempt was made to provide instances of all relations and their arguments where explicitly stated in a single sentence, as well as some non-explicit relations, which were marked with an "Inferred" tag by the annotator.

Annotations are in GUI XML (traditional annotation) and RDF XML (formal knowledge representation) formats. A second set of GUI XML is provided with additional, unofficial annotations. All source and annotation files are presented as UTF-8 encoded XML files with associated dtds, schemas or ontologies.

Acknowledgments

The Linguistic Data Consortium gratefully acknowledges the support of the Defense Advanced Research Projects Agency (DARPA) Machine Reading Program under Air Force Research Laboratory (AFRL) prime contract no. FA8750-09 C-xxxx. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of DARPA, AFRL, or the US government.

Samples

Please view the following samples:

Source

RDF XML

GUI XML

GUI XML Extended

Updates

None at this time.

Portions © 1994-1997, 2001-2006 Agence France Presse, © 2002 An Nahar, © 1995-1998, 2000-2001, 2005-2006 The Associated Press, © 1996-1998, 2004, 2006 Los Angeles Times-Washington Post News Service, Inc., © 1994-2002, 2004-2006 New York Times, © 1994 Reuters America, Inc., © 1995-2006 Xinhua News Agency, © 2009, 2020 Trustees of the University of Pennsylvania

Authors

Simpson, Heather ;
Strassel, Stephanie ;
Wright, Jonathan ;
Griffitt, Kira

0 Citations, 0 Mentions, 35% FAIR, 0.8 Dataset Index

10.35111/tj3x-ce20, February 2020

Machine Reading Phase 1 NFL Scoring Training Data

Introduction

Machine Reading Phase 1 NFL Scoring Training Data was developed by the Linguistic Data Consortium (LDC) and contains 110 US NFL (National Football League) scoring source documents and 110 standoff annotation files used in the DARPA (Defense Advanced Research Projects Agency) Machine Reading program.

The Machine Reading program aimed to develop automated reading systems to bridge the gap between knowledge contained in natural language texts and knowledge accessible to formal reasoning systems. The reading systems designed by program participants were required to extract and reason about facts from text in multiple domains.

The data in this release constitutes the training data for the NFL Scoring Use Cases evaluation. The NFL Scoring Use Cases tested the sports domain by extracting information about scoring events and outcomes of US NFL games and by aligning that information with an NFL Scoring ontology.

Data

This release contains 110 source documents (70,233 words) from English newswire stories. The files were manually annotated for instances of NFL Scoring annotation categories defined with respect to the NFL Scoring ontology.

Annotations are in GUI XML (traditional annotation) and RDF XML (formal knowledge representation) formats. All source and annotation files are presented as UTF-8 encoded XML files with associated dtds.

Acknowledgments

Samples

Please view the following samples:

Source Sample

GUI XML Sample

RDF XML Sample

Updates

None at this time.

Portions © 1995-1996, 2002-2005 Agence France Presse, © 1998, 2000-2001 The Associated Press, © 1994, 1996, 1998, 2005 New York Times, © 2003, 2005, 2007, 2009, 2011, 2019 Trustees of the University of Pennsylvania

Authors

Simpson, Heather ;
Strassel, Stephanie ;
Wright, Jonathan ;
Griffitt, Kira

0 Citations, 0 Mentions, 35% FAIR, 0.9 Dataset Index

10.35111/8pye-2w87, September 2019

TAC KBP Reference Knowledge Base

Introduction

TAC KBP Reference Knowledge Base was developed by the Linguistic Data Consortium (LDC) in support of the NIST-sponsored TAC-KBP evaluation series. It is a knowledge base built from English Wikipedia articles and their associated infoboxes and covers over 800,000 entities. LDC also released TAC KBP Spanish Cross-lingual Entity Linking - Comprehensive Training and Evaluation Data 2012-2014 (LDC2016T26).

TAC (Text Analysis Conference) is a series of workshops organized by NIST (the National Institute of Standards and Technology) to encourage research in natural language processing and related applications by providing a large test collection, common evaluation procedures, and a forum for researchers to share their results. TAC's KBP track (Knowledge Base Population) encourages the development of systems that can match entities mentioned in natural texts with those appearing in a knowledge base and extract novel information about entities from a document collection and add it to a new or existing knowledge base.

Consult the LDC TAC-KBP project page for further information about LDC's resource development for the TAC-KBP program.

Data

The source data (Wikipedia infoboxes and articles) was taken from an October 2008 snapshot of Wikipedia.

TAC KBP Reference Knowledge Base contains a set of entities, each with a canonical name and title for the Wikipedia page, an entity type, an automatically parsed version of the data from the infobox in the entity's Wikipedia article, and a stripped version of the text of the Wikipedia article. Each entity is assigned one of four types: PER (person), ORG (organization), GPE (geo-political entity) or UKN (unknown).

All data files are presented as UTF-8 encoded XML.
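
The entity record described above can be pictured with a small Python sketch. It is illustrative only: the element and attribute names (`knowledge_base`, `entity`, `name`, `wiki_title`, `type`, `facts`, `fact`, `wiki_text`) and the sample record are assumptions made for this example, not the layout documented in the release; consult the files shipped with the corpus for the actual structure.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from enum import Enum


class EntityType(Enum):
    """The four entity types named in the corpus description."""
    PER = "person"
    ORG = "organization"
    GPE = "geo-political entity"
    UKN = "unknown"


@dataclass
class Entity:
    name: str                 # canonical name
    wiki_title: str           # title of the Wikipedia page
    entity_type: EntityType
    facts: dict = field(default_factory=dict)  # parsed infobox data
    wiki_text: str = ""       # stripped article text


# Hypothetical XML layout and invented sample data, for illustration only.
SAMPLE = """<knowledge_base>
  <entity name="Example Person" wiki_title="Example_Person" type="PER">
    <facts>
      <fact name="occupation">Writer</fact>
    </facts>
    <wiki_text>Example Person is a writer.</wiki_text>
  </entity>
</knowledge_base>"""


def parse_entities(xml_text):
    """Parse entity records from XML text that follows the assumed layout."""
    root = ET.fromstring(xml_text)
    entities = []
    for node in root.iter("entity"):
        # Collect infobox facts as a name -> value mapping.
        facts = {f.get("name"): (f.text or "").strip() for f in node.iter("fact")}
        entities.append(Entity(
            name=node.get("name", ""),
            wiki_title=node.get("wiki_title", ""),
            # A missing type attribute falls back to UKN.
            entity_type=EntityType[node.get("type", "UKN")],
            facts=facts,
            wiki_text=(node.findtext("wiki_text") or "").strip(),
        ))
    return entities
```

Calling `parse_entities(SAMPLE)` yields a list of `Entity` objects. Modeling the type as an enum keeps the four-way PER/ORG/GPE/UKN distinction explicit and rejects any other label with a `KeyError`.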

Samples

Please view the following sample.

Updates

None at this time.

Authors

Simpson, Heather ;
Ellis, Joe ;
Parker, Robert ;
Strassel, Stephanie

0 Citations, 0 Mentions, 35% FAIR, 0.9 Dataset Index

10.35111/4yac-wb16, August 2014

Automated Author Profile
Simpson, Heather

