Scholar Data

1993-2007 United Nations Parallel Text

Introduction

1993-2007 United Nations Parallel Text was developed by Google Research. It consists of United Nations (UN) parliamentary documents from 1993 through 2007 in the official languages of the UN: Arabic, Chinese, English, French, Russian, and Spanish. There are 673,670 raw text documents and 520,283 word alignment documents.

UN parliamentary documents are available from the UN Official Document System (UN ODS) at http://ods.un.org/. UN ODS, in its main UNDOC database, contains the full text of all types of UN parliamentary documents. It has complete coverage dating from 1993 and variable coverage before that. Documents exist in one or more of the official languages of the UN: Arabic, Chinese, English, French, Russian, and Spanish. UN ODS also contains a large number of German documents, marked with the language "other", but these are not included in this dataset.

For more information, see the UN ODS documentation at http://documents.un.org/help_E.htm.

For more details of the UN bibliographic systems, see http://www.un.org/depts/dhl/unbisref_manual/.

LDC has also released parallel UN parliamentary documents in English, French and Spanish spanning the period 1988-1993 as UN Parallel Text (Complete) (LDC94T4A).

Data

The data is presented as raw text and word-aligned text. The raw text is very close to what was extracted from the original word processing documents in UN ODS (e.g., Word, WordPerfect, PDF), converted to UTF-8 encoding.

The word-aligned text was normalized, tokenized, aligned at the sentence level, further broken into sub-sentential chunk pairs, and then aligned at the word level. The sentence, chunk, and word alignment operations were performed separately for each individual language pair.

The files are presented in tar files and compressed using the bzip2 compression utility. The bzip2 utility is standard in most Linux releases. For Windows users, there are a variety of decompression software options. 7-Zip will decompress tar and bzip2 formats.

Note that in the data/aligned folder, the en-zh-1993.tar.bz2 and en-zh-1994.tar.bz2 archives decompress into empty folders. This is intentional as there is no Chinese aligned data for those two years.
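As an alternative to a command-line tool, the archives can be inspected from Python's standard `tarfile` module. The following is a minimal sketch, not part of the release: the helper names `list_archive_members` and `is_effectively_empty` are hypothetical, and the archive path is whatever local path the file was downloaded to.

```python
import tarfile


def list_archive_members(archive_path):
    """Return the names of the regular files inside a .tar.bz2 archive."""
    # Mode "r:bz2" opens a tar archive with bzip2 decompression.
    with tarfile.open(archive_path, mode="r:bz2") as archive:
        return [m.name for m in archive.getmembers() if m.isfile()]


def is_effectively_empty(archive_path):
    """True when the archive holds only directories and no regular files."""
    return not list_archive_members(archive_path)
```

A check such as `is_effectively_empty("en-zh-1993.tar.bz2")` would be one way to confirm that an archive unpacking to an empty folder is expected rather than a damaged download.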

Samples

Samples are available for raw English text, raw French text, and aligned English-French text.

Updates

None at this time.

Authors

Franz, Alex ;
Kumar, Shankar ;
Brants, Thorsten

0 Citations, 0 Mentions, 35% FAIR, 0.9 Dataset Index

10.35111/2ntv-xb56, 2013

Web 1T 5-gram, 10 European Languages Version 1

Introduction

Web 1T 5-gram, 10 European Languages Version 1 was created by Google, Inc. It consists of word n-grams and their observed frequency counts for ten European languages: Czech, Dutch, French, German, Italian, Polish, Portuguese, Romanian, Spanish and Swedish. The length of the n-grams ranges from unigrams (single words) to five-grams. The n-gram counts were generated from approximately one hundred billion word tokens of text for each language, or approximately one trillion total tokens.

The n-grams were extracted from publicly accessible web pages from October 2008 to December 2008. This dataset contains only n-grams that appeared at least 40 times in the processed sentences. Less frequent n-grams were discarded. While the aim was to identify and collect pages from the specific target languages only, it is likely that some text from other languages is in the final data. This dataset will be useful for statistical language modeling, including machine translation, speech recognition and other uses.
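The counting and frequency cutoff described above can be illustrated with a small sketch. This is not the pipeline used to build the corpus: the function names `count_ngrams` and `prune` are hypothetical, input is assumed to be already tokenized sentences, and no sentence-boundary markers are added.

```python
from collections import Counter


def count_ngrams(sentences, max_order=5):
    """Count all n-grams of order 1..max_order within each sentence."""
    counts = Counter()
    for tokens in sentences:
        for order in range(1, max_order + 1):
            # Slide a window of `order` tokens across the sentence.
            for start in range(len(tokens) - order + 1):
                counts[tuple(tokens[start:start + order])] += 1
    return counts


def prune(counts, min_count=40):
    """Keep only n-grams observed at least `min_count` times."""
    return {ngram: c for ngram, c in counts.items() if c >= min_count}
```

With `min_count=40`, an n-gram seen 39 times is dropped while one seen exactly 40 times is kept, matching the "at least 40 times" wording.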

Data

The input encoding of documents was automatically detected, and all text was converted to UTF-8.

The following table contains statistics for the entire release.

File sizes (entire corpus): approximately 27.9 GB compressed (bzip2) text files

Total number of tokens:	1,306,807,412,486
Total number of sentences:	150,727,365,731
Total number of unigrams:	95,998,281
Total number of bigrams:	646,439,858
Total number of trigrams:	1,312,972,925
Total number of fourgrams:	1,396,154,236
Total number of fivegrams:	1,149,361,413
Total number of n-grams:	4,600,926,713
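As a quick consistency check on the table above, the per-order counts can be summed and compared with the reported total, and an average sentence length derived from the token and sentence counts. The variable names are illustrative only; the figures are copied from the table.

```python
# Figures copied from the statistics table for the entire release.
ngram_counts = {
    "unigrams": 95_998_281,
    "bigrams": 646_439_858,
    "trigrams": 1_312_972_925,
    "fourgrams": 1_396_154_236,
    "fivegrams": 1_149_361_413,
}
reported_total = 4_600_926_713
total_tokens = 1_306_807_412_486
total_sentences = 150_727_365_731

# The reported total should equal the sum over all n-gram orders.
computed_total = sum(ngram_counts.values())
# Average number of tokens per processed sentence.
tokens_per_sentence = total_tokens / total_sentences

print(computed_total == reported_total)
print(round(tokens_per_sentence, 2))
```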

Samples

For an example of the data in this corpus please examine this sample file.

Authors

Brants, Thorsten ;
Franz, Alex

6 Citations, 0 Mentions, 35% FAIR, 4.4 Dataset Index

10.35111/mesn-fv79, 2009

Web 1T 5-gram Version 1

Introduction

Web 1T 5-gram Version 1 was contributed by Google Inc. and contains English word n-grams and their observed frequency counts for approximately 1 trillion tokens. The length of the n-grams ranges from unigrams (single words) to five-grams. This data is expected to be useful for statistical language modeling, e.g., for machine translation or speech recognition, as well as for other uses.

Data

The n-gram counts were generated from text taken from publicly accessible Web pages.

The input encoding of documents was automatically detected, and all text was converted to UTF-8. The data was tokenized in a manner similar to the tokenization of the Wall Street Journal portion of the Penn Treebank. Notable exceptions include the following:

Hyphenated words are usually separated, and hyphenated numbers usually form one token.

Sequences of numbers separated by slashes (e.g. in dates) form one token.

Sequences that look like URLs or email addresses form one token.
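The exceptions listed above can be approximated with a regular expression. The sketch below is only an illustration of those three rules, not the tokenizer used for the corpus: the patterns for what "looks like" a URL or email address are assumptions, the name `tokenize` is hypothetical, and other Penn Treebank conventions (such as splitting contractions) are not handled.

```python
import re

# Alternatives are tried left to right, so the single-token exceptions
# come before the generic word and punctuation patterns.
TOKEN_RE = re.compile(
    r"""
    (?:https?://|www\.)\S+            # URL-like sequence: one token
  | [\w.+-]+@[\w-]+(?:\.[\w-]+)+      # email-like sequence: one token
  | \d+(?:[-/]\d+)+                   # numbers joined by hyphens or slashes
  | \w+                               # a plain word or number
  | [^\w\s]                           # any other single non-space character
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Split text into tokens following the exceptions sketched above."""
    return TOKEN_RE.findall(text)
```

Under these assumptions, `tokenize("state-of-the-art")` splits the word at each hyphen, while `tokenize("12/25/2006")` and `tokenize("555-1234")` each return a single token.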

The release comprises 24 GB of compressed (gzipped) text files containing the following:

Tokens	1,024,908,267,229
Sentences	95,119,665,584
Unigrams	13,588,391
Bigrams	314,843,401
Trigrams	977,069,902
Fourgrams	1,313,818,354
Fivegrams	1,176,470,663

Samples

For an example of the 3-gram data in this corpus, please review this text sample (TXT).

For an example of the 4-gram data in this corpus, please review this text sample (TXT).

Updates

None at this time.

Authors

Brants, Thorsten ;
Franz, Alex

109 Citations, 0 Mentions, 35% FAIR, 63.5 Dataset Index

10.35111/cqpa-a498, 2006

Automated Author Profile
Brants, Thorsten

