Automated Author Profile

Badia, Toni

Current S-Index

3.9

Sum of Dataset Indices for all datasets

Average Dataset Index per Dataset

1.0

Average Dataset Index per dataset

Total Datasets

4

Total datasets for this author

Average FAIR Score

34.6%

Average FAIR Score per dataset

Total Citations

1

Total citations to the author's datasets

Total Mentions

0

Total mentions of the author's datasets

S-Index Interpretation

S-Index Over Time

Cumulative Citations Over Time

Cumulative Mentions Over Time

Datasets

NewSoMe Corpus of Opinion in Blogs

Introduction


NewSoMe Corpus of Opinion in Blogs was compiled at Barcelona Media and consists of English and Spanish blogs annotated for opinions. It is part of the NewSoMe (News and Social Media) set of corpora presenting opinion annotations across several genres and covering multiple languages. NewSoMe is the result of an effort to build a unifying annotation framework for analyzing opinion in different genres, ranging from controlled text, such as news reports, to diverse types of user-generated content that includes blogs, product reviews and microblogs.


LDC has also released NewSoMe Corpus of Opinion in News Reports (LDC2015T17).


Data


The source data in this corpus was obtained by means of the Google Blog Search API. Spanish blogs were taken from wordpress.com and blogspot.com blogs. The English data was extracted from those same two domains and from asiawrites.org.


This release consists of 108 English documents and 191 Spanish documents. The annotation was carried out manually through the crowdsourcing platform CrowdFlower with seven annotations per layer that were aggregated for this data set. The layers annotated were topic, segment, cue, subjectivity, polarity and intensity.
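The description above does not say how the seven annotations per layer were aggregated. As an illustration only, the sketch below assumes simple majority voting over categorical labels; the function name `aggregate_label` and the sample judgments are hypothetical and are not taken from the corpus documentation.

```python
from collections import Counter


def aggregate_label(judgments):
    """Return the most frequent label, or None when there is no single winner.

    Majority voting is an assumption made for illustration; the corpus
    documentation does not specify the aggregation method actually used.
    """
    ranked = Counter(judgments).most_common()
    if not ranked:
        return None  # no judgments were supplied
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie between the top two labels
    return ranked[0][0]


# Seven hypothetical polarity judgments for a single segment.
polarity_judgments = [
    "positive", "positive", "negative", "positive",
    "neutral", "positive", "negative",
]
aggregated = aggregate_label(polarity_judgments)
```

With an odd number of annotators, ties are less likely for two-way layers, but they can still occur when a layer has three or more possible values, which is why the sketch returns `None` rather than picking a label arbitrarily.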


Data is presented as UTF-8 either as plain text or in CSV files.


Samples


Please view these samples:



Updates


None at this time.


Portions © 2016 Fundació Barcelona Media, © 2016 Trustees of the University of Pennsylvania

Authors

  • Sauri, Roser ;
  • Domingo, Judith ;
  • Badia, Toni
0 Citations, 0 Mentions, 35% FAIR, 0.9 Dataset Index
10.35111/60j1-yd62 (2016)

NewSoMe Corpus of Opinion in News Reports

Introduction


NewSoMe Corpus of Opinion in News Reports was compiled at Barcelona Media and consists of Spanish, Catalan and Portuguese news reports annotated for opinions. It is part of the NewSoMe (News and Social Media) set of corpora presenting opinion annotations across several genres and covering multiple languages. NewSoMe is the result of an effort to build a unifying annotation framework for analyzing opinion in different genres, ranging from controlled text, such as news reports, to diverse types of user-generated content that includes blogs, product reviews and microblogs.


Data


The source data in this release was obtained from various newspaper websites and consists of approximately 200 documents in each of Spanish, Catalan and Portuguese. The annotation was carried out manually through the crowdsourcing platform CrowdFlower with seven annotations per layer that were aggregated for this data set. The layers annotated were topic, segment, cue, subjectivity, polarity and intensity.


Data is presented as UTF-8 either as plain text or in CSV files.


Samples


Please view the following samples.



Updates


None at this time.


Portions © 2015 Fundació Barcelona Media, © 2015 Trustees of the University of Pennsylvania

Authors

  • Sauri, Roser ;
  • Domingo, Judith ;
  • Badia, Toni
0 Citations, 0 Mentions, 35% FAIR, 0.9 Dataset Index
10.35111/y013-n255 (2015)

Spanish TimeBank 1.0

Introduction

Spanish TimeBank 1.0 was developed by researchers at Barcelona Media and consists of Spanish texts in the AnCora corpus annotated with temporal and event information according to the TimeML specification language.

TimeML (Pustejovsky et al., 2005) is a schema for annotating eventualities and time expressions in natural language as well as the temporal relations among them, thus facilitating the task of extraction, representation and exchange of temporal information. Spanish TimeBank 1.0 is annotated at three levels, marking events, time expressions and event metadata. The TimeML annotation scheme was tailored for the specifics of the Spanish language. Temporal relations in Spanish present distinctions of verbal mood (e.g., indicative, subjunctive, conditional) and grammatical aspect (e.g., imperfective) which are absent in English. Spanish TimeBank 1.0 joins the family of TimeBank annotated corpora which includes languages such as English, Italian, French, Korean and Chinese. Through their common layer of annotation, these corpora provide resources useful for multilingual temporal extraction and processing, such as multilingual text entailment, opinion mining or question answering. Spanish TimeBank 1.0 is the Spanish language complement to Catalan TimeBank 1.0 (LDC2012T10).

LDC has released other corpora incorporating TimeBank annotation: TimeBank 1.2 (LDC2006T08), FactBank 1.0 (LDC2009T23) and ModeS TimeBank 1.0 (LDC2012T01).

Data

Spanish TimeBank 1.0 contains stand-off annotations for 210 documents with over 75,800 tokens (including punctuation marks) and 68,000 tokens (excluding punctuation). The source documents are news stories and fiction from the AnCora corpus.
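To illustrate what stand-off annotation means in practice, the sketch below keeps the tokens untouched and stores annotations separately as references to token positions. The `StandoffSpan` layout, the token-index convention and the example sentence are hypothetical and do not reproduce the file format of this release; only the tag names `EVENT` and `TIMEX3` come from TimeML.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StandoffSpan:
    tag: str    # TimeML tag name, e.g. "EVENT" or "TIMEX3"
    start: int  # index of the first token covered
    end: int    # index one past the last token covered


def resolve(tokens, span):
    """Return the text a stand-off span points to, without modifying the tokens."""
    return " ".join(tokens[span.start:span.end])


# Hypothetical tokenized sentence; the source text itself carries no markup.
tokens = ["La", "reunión", "terminó", "el", "lunes", "."]

# Annotations live apart from the text and refer to it by token position.
spans = [
    StandoffSpan("EVENT", 2, 3),
    StandoffSpan("TIMEX3", 3, 5),
]

resolved = [(span.tag, resolve(tokens, span)) for span in spans]
```

Because the annotations only point into the text, other annotation layers over the same tokens, such as those in AnCora, can in principle be aligned with them by position.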

The AnCora corpus is the largest multilayer annotated corpus of Spanish and Catalan. AnCora contains 400,000 words in Spanish and 275,000 words in Catalan. The AnCora documents are annotated on many linguistic levels including structure, syntax, dependencies, semantics and pragmatics. That information is not included in this release, but it can be mapped to the present annotations. The data contained in the AnCora corpus has been used in several international natural language processing evaluations such as CoNLL-2006, CoNLL-2007 and SemEval-2007. The corpus is freely available from the Centre de Llenguatge i Computació (CLiC).

Samples


Portions © 2012 Roser Saurí, Toni Badia, © 2012 Trustees of the University of Pennsylvania

Authors

  • Sauri, Roser ;
  • Badia, Toni
0 Citations, 0 Mentions, 35% FAIR, 0.9 Dataset Index
10.35111/6sfh-f762 (2012)

Catalan TimeBank 1.0

Catalan TimeBank 1.0 was developed by researchers at Barcelona Media and consists of Catalan texts in the AnCora corpus annotated with temporal and event information according to the TimeML specification language.

TimeML (Pustejovsky et al., 2005) is a schema for annotating eventualities and time expressions in natural language as well as the temporal relations among them, thus facilitating the task of extraction, representation and exchange of temporal information. Catalan TimeBank 1.0 is annotated at three levels, marking events, time expressions and event metadata. The TimeML annotation scheme was tailored for the specifics of the Catalan language. Temporal relations in Catalan present distinctions of verbal mood (e.g., indicative, subjunctive, conditional) and grammatical aspect (e.g., imperfective) which are absent in English. Catalan TimeBank 1.0 joins the family of TimeBank annotated corpora which includes languages such as English, Spanish, Italian, French, Korean and Chinese. Through their common layer of annotation, these corpora provide resources useful for multilingual temporal extraction and processing, such as multilingual text entailment, opinion mining or question answering.

LDC has released the following corpora incorporating TimeBank annotation: TimeBank 1.2 (LDC2006T08), FactBank 1.0 (LDC2009T23) and ModeS TimeBank 1.0 (LDC2012T01).

Data

Catalan TimeBank 1.0 contains stand-off annotations for 210 documents with over 75,800 tokens (including punctuation marks) and 68,000 tokens (excluding punctuation). The source documents are from the EFE news agency, the ACN Catalan news agency and the Catalan version of the El Periódico newspaper, and span the period from January to December 2000.

The AnCora corpus is the largest multilayer annotated corpus of Spanish and Catalan. AnCora contains 400,000 words in Spanish and 275,000 words in Catalan. The AnCora documents are annotated on many linguistic levels including structure, syntax, dependencies, semantics and pragmatics. That information is not included in this release, but it can be mapped to the present annotations. The data contained in the AnCora corpus has been used in several international natural language processing evaluations such as CoNLL-2006, CoNLL-2007 and SemEval-2007. The corpus is freely available from the Centre de Llenguatge i Computació (CLiC).

Samples


Updates

Additional information, updates and bug fixes may be available in the LDC catalog entry for this corpus, LDC2012T10.


Portions © 2012 Roser Saurí, Toni Badia, © 2012 Trustees of the University of Pennsylvania

Authors

  • Sauri, Roser ;
  • Badia, Toni
1 Citation, 0 Mentions, 35% FAIR, 1.4 Dataset Index
10.35111/a183-hk46 (2012)