The CLEF-IP 2011 collection consists of multilingual patent documents, including extracts of the MAREC dataset by IRF. The patents are stored as a collection of XML files encoding patent documents, with partial translations of the patent text between German, English and French (the EPO’s official languages).
The collection contains approximately 3.5 million XML documents covering 1.5 million patents. Most of the patent documents were published by the European Patent Office (EPO), but the collection also includes some patents published by the World Intellectual Property Organization (WIPO).
I have extracted multilingual parallel segments (English-German, English-French, and French-German) from the CLEF-IP 2011 collection and made the resulting corpus available to the research community.
Licensing
All CLEF-IP collections, as extracts of MAREC, are released under the Creative Commons License (see the paragraph below) and are freely available for download. If you use the corpus in your work, please cite (WANG et al., 2014).
--------------------------------------------------------------
Release v1
On 11 Oct 2013, I released the first version of the corpus. This first version of the patent parallel corpus covers 3 languages and 6 language pairs, and is available in TMX format.
Table 1 shows the number of segments and words extracted from the title and claims fields on the source and target sides after segment alignment.
Language pair | Title segments | Title words (source) | Title words (target) | Claims segments | Claims words (source) | Claims words (target) |
de-en | 311,298 | 2,038,785 | 2,582,703 | 1,696,498 | 62 M | 71 M |
de-fr | 311,184 | 2,036,112 | 2,482,257 | 1,661,419 | 79 M | 86 M |
en-de | 884,759 | 6,661,481 | 5,508,289 | 5,218,024 | 332 M | 296 M |
en-fr | 884,727 | 6,661,322 | 8,538,012 | 5,373,452 | 330 M | 380 M |
fr-de | 106,211 | 963,508 | 1,204,439 | 572,356 | 36 M | 37 M |
fr-en | 106,246 | 1,285,467 | 1,048,374 | 586,498 | 38 M | 37 M |
Table 1: Number of segments and words extracted on the source and target sides after segment alignment in the <title> and <claims> fields
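As a quick sanity check on Table 1, the average title length can be derived by dividing the source-side word count by the segment count. The sketch below does this for the <title> field only. The figures are copied from the table, while the dictionary layout and the function name are illustrative choices, not part of the corpus release.

```python
# (aligned segments, source-side words) for the <title> field, copied from Table 1.
# The first language in each pair name is taken to be the source side.
TITLE_STATS = {
    "de-en": (311_298, 2_038_785),
    "de-fr": (311_184, 2_036_112),
    "en-de": (884_759, 6_661_481),
    "en-fr": (884_727, 6_661_322),
    "fr-de": (106_211, 963_508),
    "fr-en": (106_246, 1_285_467),
}


def mean_words_per_segment(segments, words):
    """Average number of words per aligned segment."""
    return words / segments


averages = {
    pair: mean_words_per_segment(segments, words)
    for pair, (segments, words) in TITLE_STATS.items()
}

for pair, avg in averages.items():
    print(f"{pair}: {avg:.2f} source words per title segment")
```

The claims columns are given only in rounded millions of words, so the same calculation on claims would be correspondingly approximate.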
--------------------------------------------------------------
Download
Name | Language pairs | Format |
CLEF-Title | 6 language pairs | tmx |
CLEF-Abstract | 4 language pairs | tmx, txt |
CLEF-Claim | de-en | tmx, txt |
CLEF-Claim | de-fr | tmx, txt |
CLEF-Claim | en-de | tmx, txt |
CLEF-Claim | en-fr | tmx, txt |
CLEF-Claim | fr-de | tmx, txt |
CLEF-Claim | fr-en | tmx, txt |
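
Since the corpus is distributed in TMX, the following is a minimal sketch of how aligned segment pairs might be read with Python's standard library. It assumes the files follow the usual TMX layout (`<tu>` units containing `<tuv>` variants, each with a `<seg>`) and that languages are marked with `xml:lang` or the older `lang` attribute. The function name and the inline sample are hypothetical; the sample segment is invented for illustration and not taken from the corpus.

```python
import io
import xml.etree.ElementTree as ET

# ElementTree expands the xml:lang attribute to this qualified name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Invented two-language sample standing in for a downloaded TMX file.
SAMPLE_TMX = b"""<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header srclang="en" datatype="plaintext" segtype="sentence"
          adminlang="en" creationtool="example" creationtoolversion="1"
          o-tmf="none"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Method for producing a coating</seg></tuv>
      <tuv xml:lang="de"><seg>Verfahren zur Herstellung einer Beschichtung</seg></tuv>
    </tu>
  </body>
</tmx>"""


def iter_segment_pairs(source, src_lang, tgt_lang):
    """Yield (source_text, target_text) for each <tu> that has both languages.

    `source` is a file name or a binary file object. The file is streamed,
    so large TMX files do not have to fit in memory.
    """
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "tu":
            continue
        texts = {}
        for tuv in elem.iter("tuv"):
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # Keep only the primary subtag, e.g. "en-US" becomes "en".
                texts[lang[:2]] = "".join(seg.itertext()).strip()
        if src_lang in texts and tgt_lang in texts:
            yield texts[src_lang], texts[tgt_lang]
        elem.clear()  # release the processed translation unit


pairs = list(iter_segment_pairs(io.BytesIO(SAMPLE_TMX), "en", "de"))
for src, tgt in pairs:
    print(src, "|||", tgt)
```

To use this on a downloaded file, pass its path in place of the `io.BytesIO` object; the language codes actually used in the released files should be checked first, since they are not stated here.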