Published on 10 November 2022
The English Headline Treebank corpus
Benton, Adrian; Shi, Tianze; Irsoy, Ozan; Malioutov, Igor
Description
This repository contains the evaluation sets used in:

A Benton, T Shi, O İrsoy, and I Malioutov. "Weakly Supervised Headline Dependency Parsing". Findings of EMNLP. 2022.

This dataset contains parse annotations for English news headlines and a script to produce conllu files joined with the original headline text. Parse annotations are joined to the corresponding text by running:

    LDC_NYT_DIR="/PATH/TO/UNTARRED/LDC2008T19/"  # path to untarred LDC2008T19
    python build_eht.py --nyt_dir ${LDC_NYT_DIR} --num_proc 4

This will download the Google sentence compression (GSC) dataset and build conllu files for the GSC examples. If you have the New York Times Annotated Corpus (LDC2008T19) untarred locally, this will also join annotations to the NYT examples (location passed via --nyt_dir). Increase the argument to --num_procs to process more shards from the NYT corpus in parallel and reduce build time. The above was tested with Python 3.9.7.

The EHT evaluation sets, with gold-annotated POS tags and dependency relations, are built as EHT/gsc.test.conllu and EHT/nyt.test.conllu.

Silver projected trees, which we used to train and validate our models, are built under GSC_projected. These are not gold parse trees (they are predictions projected from the article lead sentence) and are shared purely for reproducibility's sake.
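Once the build has produced the conllu files, they can be read with a few lines of Python. The following is a minimal sketch, not part of the repository: it assumes the files follow the standard 10-column, tab-separated CoNLL-U layout, and the names read_conllu and attachment_scores are hypothetical helpers introduced here for illustration. The two-token sample is invented and is not taken from the dataset.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# Standard CoNLL-U column names (assumed layout: 10 tab-separated fields per token).
COLUMNS = ["id", "form", "lemma", "upos", "xpos",
           "feats", "head", "deprel", "deps", "misc"]

Token = Dict[str, str]


def read_conllu(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield each sentence as a list of token dicts (hypothetical helper)."""
    sentence: List[Token] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Sentence-level comment line (e.g. "# text = ..."); skipped here.
            continue
        fields = line.split("\t")
        if len(fields) != len(COLUMNS):
            raise ValueError(f"expected {len(COLUMNS)} columns, got {len(fields)}")
        if "-" in fields[0] or "." in fields[0]:
            # Skip multiword-token ranges (1-2) and empty nodes (1.1).
            continue
        sentence.append(dict(zip(COLUMNS, fields)))
    if sentence:
        yield sentence


def attachment_scores(gold: List[List[Token]],
                      pred: List[List[Token]]) -> Tuple[float, float]:
    """Return (UAS, LAS) over two aligned lists of sentences."""
    total = uas = las = 0
    for g_sent, p_sent in zip(gold, pred):
        if len(g_sent) != len(p_sent):
            raise ValueError("gold and predicted sentences differ in length")
        for g, p in zip(g_sent, p_sent):
            total += 1
            if g["head"] == p["head"]:
                uas += 1  # correct head
                if g["deprel"] == p["deprel"]:
                    las += 1  # correct head and relation label
    if total == 0:
        raise ValueError("no tokens to score")
    return uas / total, las / total


# Invented two-token example, for illustration only.
SAMPLE = (
    "# text = Stocks rally\n"
    "1\tStocks\tstock\tNOUN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\trally\trally\tVERB\t_\t_\t0\troot\t_\t_\n"
    "\n"
)

sentences = list(read_conllu(SAMPLE.splitlines()))
for token in sentences[0]:
    print(token["id"], token["form"], token["upos"], token["head"], token["deprel"])
```

To read a built file instead of the sample, pass an open file handle, for example `with open("EHT/gsc.test.conllu", encoding="utf-8") as f: gold = list(read_conllu(f))`. The attachment_scores helper compares head and relation columns token by token, which is one common way to score a parser against gold trees of this kind; it is not necessarily the evaluation procedure used in the paper.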
Metrics
Dataset Index: 2.1
FAIR Score: 85%
Citations: 0
Mentions: 0
Publication Details
Publisher: Zenodo
Assigned Domain
Topic Name: Natural Language Processing Techniques
Subfield: Artificial Intelligence
Field: Computer Science
Domain: Physical Sciences
Keywords
Headline, NPL, Syntax, Natural Language Processing, UD, Dependency, Parse
Normalization Factors
FT: 13.46
CTw: 1.00
MTw: 1.00