Automated Author Profile
Badia, Toni
Current S-Index
Sum of Dataset Indices for all datasets
Average Dataset Index per Dataset
Average Dataset Index per dataset
Total Datasets
Total datasets for this author
Average FAIR Score
Average FAIR Score per dataset
Total Citations
Total citations to the author's datasets
Total Mentions
Total mentions of the author's datasets
S-Index Interpretation
The S-Index (Sharing Index) is a comprehensive metric that represents the cumulative impact of all your datasets. It is calculated as the sum of Dataset Index scores across all your claimed datasets.
What it means:
- A higher S-Index indicates greater overall impact of your datasets relative to typical datasets in their fields of research
- The S-Index grows as you add more datasets or as existing datasets gain more citations and mentions
- It provides a single number to track your research data impact over time
Current S-Index: 3.9 (sum of the Dataset Index scores of 4 datasets)
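The calculation described above can be sketched in a few lines. This is a minimal illustration and not the profile's actual implementation: the function name `s_index` and the example scores are hypothetical, and the scores are not the values behind the figure shown on this page.

```python
from statistics import mean


def s_index(dataset_indices):
    """Sum the Dataset Index scores of all claimed datasets."""
    return sum(dataset_indices)


# Hypothetical Dataset Index scores, chosen only for illustration.
example_scores = [1.0, 2.0, 0.5]

total = s_index(example_scores)
average = mean(example_scores)
print(f"S-Index: {total:.1f}, average per dataset: {average:.2f}")
```

Because the S-Index is a plain sum, adding a dataset or raising any single Dataset Index raises the total, which matches the growth behaviour listed above.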
S-Index Over Time
Cumulative Citations Over Time
Cumulative Mentions Over Time
Datasets
Introduction
NewSoMe Corpus of Opinion in Blogs was compiled at Barcelona Media and consists of English and Spanish blogs annotated for opinions. It is part of the NewSoMe (News and Social Media) set of corpora presenting opinion annotations across several genres and covering multiple languages. NewSoMe is the result of an effort to build a unifying annotation framework for analyzing opinion in different genres, ranging from controlled text, such as news reports, to diverse types of user-generated content that includes blogs, product reviews and microblogs.
LDC has also released NewSoMe Corpus of Opinion in News Reports (LDC2015T17).
Data
The source data in this corpus was obtained by means of the Google Blog Search API. Spanish blogs were taken from wordpress.com and blogspot.com blogs. The English data was extracted from those same two domains and from asiawrites.org.
This release consists of 108 English documents and 191 Spanish documents. The annotation was carried out manually through the crowdsourcing platform CrowdFlower with seven annotations per layer that were aggregated for this data set. The layers annotated were topic, segment, cue, subjectivity, polarity and intensity.
Data is presented as UTF-8 either as plain text or in CSV files.
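The description does not say how the seven crowd annotations per layer were aggregated. As one common possibility, a simple majority vote could look like the sketch below. The function name `aggregate_by_majority` and the example votes are hypothetical and are not taken from the corpus.

```python
from collections import Counter


def aggregate_by_majority(labels):
    """Return the most frequent label among the crowd judgments.

    Assumption: ties are resolved in favour of the label seen first.
    """
    if not labels:
        raise ValueError("no labels to aggregate")
    return Counter(labels).most_common(1)[0][0]


# Hypothetical polarity judgments from seven annotators for one segment.
polarity_votes = [
    "positive", "positive", "negative", "positive",
    "neutral", "positive", "negative",
]
print(aggregate_by_majority(polarity_votes))
```

Other schemes, such as weighting annotators by reliability, are equally plausible; the majority vote is shown only because it is the simplest to state.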
Samples
Please view these samples:
Updates
None at this time.
Portions © 2016 Fundació Barcelona Media, © 2016 Trustees of the University of Pennsylvania
Authors
- Sauri, Roser ;
- Domingo, Judith ;
- Badia, Toni
Introduction
NewSoMe Corpus of Opinion in News Reports was compiled at Barcelona Media and consists of Spanish, Catalan and Portuguese news reports annotated for opinions. It is part of the NewSoMe (News and Social Media) set of corpora presenting opinion annotations across several genres and covering multiple languages. NewSoMe is the result of an effort to build a unifying annotation framework for analyzing opinion in different genres, ranging from controlled text, such as news reports, to diverse types of user-generated content that includes blogs, product reviews and microblogs.
Data
The source data in this release was obtained from various newspaper websites and consists of approximately 200 documents in each of Spanish, Catalan and Portuguese. The annotation was carried out manually through the crowdsourcing platform CrowdFlower with seven annotations per layer that were aggregated for this data set. The layers annotated were topic, segment, cue, subjectivity, polarity and intensity.
Data is presented as UTF-8 either as plain text or in CSV files.
Samples
Please view the following samples.
Updates
None at this time.
Portions © 2015 Fundació Barcelona Media, © 2015 Trustees of the University of Pennsylvania
Authors
- Sauri, Roser ;
- Domingo, Judith ;
- Badia, Toni
Introduction
Spanish TimeBank 1.0 was developed by researchers at Barcelona Media and consists of Spanish texts in the AnCora corpus annotated with temporal and event information according to the TimeML specification language.
TimeML (Pustejovsky et al., 2005) is a schema for annotating eventualities and time expressions in natural language as well as the temporal relations among them, thus facilitating the task of extraction, representation and exchange of temporal information. Spanish TimeBank 1.0 is annotated at three levels, marking events, time expressions and event metadata. The TimeML annotation scheme was tailored for the specifics of the Spanish language. Temporal relations in Spanish present distinctions of verbal mood (e.g., indicative, subjunctive, conditional, etc.) and grammatical aspect (e.g., imperfective) which are absent in English. Spanish TimeBank 1.0 joins the family of TimeBank annotated corpora which includes languages such as English, Italian, French, Korean and Chinese. Through their common layer of annotation, these corpora provide resources useful for multilingual temporal extraction and processing, such as multilingual text entailment, opinion mining or question answering. Spanish TimeBank 1.0 is the Spanish language complement to Catalan TimeBank 1.0 (LDC2012T10).
LDC has released other corpora incorporating TimeBank annotation: TimeBank 1.2 (LDC2006T08), FactBank 1.0 (LDC2009T23) and ModeS TimeBank 1.0 (LDC2012T01).
Data
Spanish TimeBank 1.0 contains stand-off annotations for 210 documents with over 75,800 tokens (including punctuation marks) and 68,000 tokens (excluding punctuation). The source documents are news stories and fiction from the AnCora corpus.
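In stand-off annotation, the markup is stored separately from the source text and refers back to it by position. The sketch below illustrates the idea only. The class `StandoffAnnotation`, the token-index convention and the example sentence are hypothetical and do not reproduce the file format of this release; `EVENT` and `TIMEX3` are TimeML element names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StandoffAnnotation:
    tag: str    # TimeML element name, e.g. "EVENT" or "TIMEX3"
    start: int  # index of the first token covered
    end: int    # index one past the last token covered


def covered_text(tokens, annotation):
    """Return the tokens an annotation points at, joined by spaces."""
    return " ".join(tokens[annotation.start:annotation.end])


# Hypothetical tokenized sentence; the annotations live outside the text.
tokens = ["El", "presidente", "viajó", "ayer", "."]
annotations = [
    StandoffAnnotation("EVENT", 2, 3),
    StandoffAnnotation("TIMEX3", 3, 4),
]

for ann in annotations:
    print(ann.tag, covered_text(tokens, ann))
```

Keeping annotations apart from the text in this way is what allows other layers, such as those in AnCora, to be mapped onto the same tokens.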
The AnCora corpus is the largest multilayer annotated corpus of Spanish and Catalan. AnCora contains 400,000 words in Spanish and 275,000 words in Catalan. The AnCora documents are annotated on many linguistic levels including structure, syntax, dependencies, semantics and pragmatics. That information is not included in this release, but it can be mapped to the present annotations. The data contained in the AnCora corpus has been used in several international natural language processing evaluations such as CoNLL-2006, CoNLL-2007 and SemEval-2007. The corpus is freely available from the Centre de Llenguatge i Computació (CLiC).
Samples
Portions © 2012 Roser Saurí, Toni Badia, © 2012 Trustees of the University of Pennsylvania
Authors
- Sauri, Roser ;
- Badia, Toni
Introduction
Catalan TimeBank 1.0 was developed by researchers at Barcelona Media and consists of Catalan texts in the AnCora corpus annotated with temporal and event information according to the TimeML specification language.
TimeML (Pustejovsky et al., 2005) is a schema for annotating eventualities and time expressions in natural language as well as the temporal relations among them, thus facilitating the task of extraction, representation and exchange of temporal information. Catalan TimeBank 1.0 is annotated at three levels, marking events, time expressions and event metadata. The TimeML annotation scheme was tailored for the specifics of the Catalan language. Temporal relations in Catalan present distinctions of verbal mood (e.g., indicative, subjunctive, conditional, etc.) and grammatical aspect (e.g., imperfective) which are absent in English. Catalan TimeBank 1.0 joins the family of TimeBank annotated corpora which includes languages such as English, Spanish, Italian, French, Korean and Chinese. Through their common layer of annotation, these corpora provide resources useful for multilingual temporal extraction and processing, such as multilingual text entailment, opinion mining or question answering.
LDC has released the following corpora incorporating TimeBank annotation: TimeBank 1.2 (LDC2006T08), FactBank 1.0 (LDC2009T23) and ModeS TimeBank 1.0 (LDC2012T01).
Data
Catalan TimeBank 1.0 contains stand-off annotations for 210 documents with over 75,800 tokens (including punctuation marks) and 68,000 tokens (excluding punctuation). The source documents are from the EFE news agency, the ACN Catalan news agency and the Catalan version of the El Periódico newspaper, and span the period from January to December 2000.
The AnCora corpus is the largest multilayer annotated corpus of Spanish and Catalan. AnCora contains 400,000 words in Spanish and 275,000 words in Catalan. The AnCora documents are annotated on many linguistic levels including structure, syntax, dependencies, semantics and pragmatics. That information is not included in this release, but it can be mapped to the present annotations. The data contained in the AnCora corpus has been used in several international natural language processing evaluations such as CoNLL-2006, CoNLL-2007 and SemEval-2007. The corpus is freely available from the Centre de Llenguatge i Computació (CLiC).
Samples
Updates
Additional information, updates and bug fixes may be available in the LDC catalog entry for this corpus at LDC2012T10.
Portions © 2012 Roser Saurí, Toni Badia, © 2012 Trustees of the University of Pennsylvania
Authors
- Sauri, Roser ;
- Badia, Toni