The CLEF-IP 2011 collection is composed of multilingual patent documents, including extracts of the MAREC dataset by IRF. The patents are physically stored as a collection of XML files encoding patent documents, which have partial translations of the patent text between German, English and French (the EPO’s official languages).

The collection contains approximately 3.5 million XML documents, referring to 1.5 million patents. Most of the patent documents were published by the European Patent Office (EPO), but the collection also includes some patents published by the World Intellectual Property Organization (WIPO).

I have extracted multilingual parallel segments (English-German, English-French, and French-German) from the CLEF-IP 2011 collection and made the resulting corpus available to the research community.

Licensing

All CLEF-IP collections, as extracts of MAREC, are released under a Creative Commons license (see the paragraph below) and are now free to download. Please cite (WANG et al., 2014) if you use the corpus in your work.

MAREC by IRF is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. Permissions beyond the scope of this license may be available at marec@fandan.net.

--------------------------------------------------------------

Release v1

On 11 Oct 2013, I released the first version of the corpus. This first version of the patent parallel corpus covers 3 languages and 6 language pairs, and is available in TMX format.
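Since the corpus is distributed as TMX, a short reading example may be useful. The following is a minimal sketch, not the tooling used to build the corpus: it assumes the files follow the usual TMX layout (<tu> units holding one <tuv> per language, each with a <seg>), and the function name read_tmx_pairs and the sample document are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under this namespaced key.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_tmx_pairs(source, src_lang, tgt_lang):
    """Yield (source_text, target_text) for each <tu> that has both languages.

    `source` is a file name or a binary file object. The file is streamed
    with iterparse so that large TMX files need not fit in memory.
    """
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "tu":
            continue
        segs = {}
        for tuv in elem.iter("tuv"):
            # Older TMX versions use a plain "lang" attribute instead of xml:lang.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # Keep only the two-letter prefix, so "en-US" matches "en".
                segs[lang[:2]] = "".join(seg.itertext()).strip()
        if src_lang in segs and tgt_lang in segs:
            yield segs[src_lang], segs[tgt_lang]
        elem.clear()  # release the processed translation unit


# Hypothetical one-unit TMX document, used only to exercise the function.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header srclang="en" datatype="plaintext" segtype="sentence"
          adminlang="en" creationtool="example" creationtoolversion="1"
          o-tmf="none"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Method for producing a coating</seg></tuv>
      <tuv xml:lang="de"><seg>Verfahren zur Herstellung einer Beschichtung</seg></tuv>
    </tu>
  </body>
</tmx>"""

pairs = list(read_tmx_pairs(io.BytesIO(SAMPLE), "en", "de"))
```

For a downloaded file, the same function can be given the file path in place of the in-memory sample.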

Table 1 shows the number of segments and words extracted from the title and claims fields, on the source and target sides, after segment alignment.



Language pair  Language  Title                  Claims
                         Segments   Words       Segments   Words
de-en          de        311,298    2,038,785   1,696,498  62 M
               en                   2,582,703              71 M
de-fr          de        311,184    2,036,112   1,661,419  79 M
               fr                   2,482,257              86 M
en-de          en        884,759    6,661,481   5,218,024  332 M
               de                   5,508,289              296 M
en-fr          en        884,727    6,661,322   5,373,452  330 M
               fr                   8,538,012              380 M
fr-de          fr        106,211    963,508     572,356    36 M
               de                   1,204,439              37 M
fr-en          fr        106,246    1,285,467   586,498    38 M
               en                   1,048,374              37 M

Table 1: Number of segments and words extracted from the <title> and <claims> fields, for source and target, after segment alignment
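Counts of this kind can be reproduced approximately from a list of aligned segment pairs. The sketch below is an illustration only: it assumes words are split on whitespace, which is not stated anywhere in this description, so its figures may differ from those in Table 1. The function name corpus_stats and the example pairs are made up.

```python
def corpus_stats(pairs):
    """Return (segments, source_words, target_words) for aligned pairs.

    Each aligned pair counts as one segment; words are counted separately
    on each side. Whitespace tokenization is an assumption of this sketch.
    """
    segments = src_words = tgt_words = 0
    for src, tgt in pairs:
        segments += 1
        src_words += len(src.split())
        tgt_words += len(tgt.split())
    return segments, src_words, tgt_words


# Made-up English-German pairs, for illustration only.
example = [
    ("A method for coating", "Verfahren zur Beschichtung"),
    ("The device of claim 1", "Vorrichtung nach Anspruch 1"),
]
stats = corpus_stats(example)  # (2, 9, 7)
```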



--------------------------------------------------------------

Download



Name           Language pair     Format
CLEF-Title     6 language pairs  tmx
CLEF-Abstract  4 language pairs  tmx, txt
CLEF-Claim     de-en             tmx, txt
CLEF-Claim     de-fr             tmx, txt
CLEF-Claim     en-de             tmx, txt
CLEF-Claim     en-fr             tmx, txt
CLEF-Claim     fr-de             tmx, txt
CLEF-Claim     fr-en             tmx, txt
CLEF-DE-FR CLEF-FR-DE CLEF-FR-EN