CLEF-IP 2011 Parallel Corpus

The CLEF-IP 2011 collection is composed of multilingual patent documents, including extracts of the MAREC dataset by IRF. It is physically stored as a collection of XML files encoding patent documents that have partial translations of the patent text between German, English and French (the EPO’s official languages).

The collection contains approximately 3.5 million XML documents, referring to 1.5 million patents. Most of the patent documents were published by the European Patent Office (EPO), but the collection also includes certain patents published by the World Intellectual Property Organization (WIPO).

I have extracted multilingual parallel segments (English-German, English-French, and French-German) from the CLEF-IP 2011 collection and made the resulting corpus available to the research community.

Licensing

All CLEF-IP collections, as extracts of MAREC, are available under a Creative Commons license (see the paragraph below) and are freely available to download. Please cite (WANG et al., 2014) if you use the corpus in your work.

MAREC by IRF is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. Permissions beyond the scope of this license may be available at marec@fandan.net.

--------------------------------------------------------------

Release v1

On 11 Oct 2013, I released the first version of the corpus. This first version of the patent parallel corpus covers 3 languages and 6 language pairs, and is available in TMX format.

Table 1 shows the number of segments and words extracted from the title and claims fields, on the source and target sides, after segment alignment.

Language pair	Language	Title segments	Title words	Claims segments	Claims words
de-en	de	311,298	2,038,785	1,696,498	62 M
	en		2,582,703		71 M
de-fr	de	311,184	2,036,112	1,661,419	79 M
	fr		2,482,257		86 M
en-de	en	884,759	6,661,481	5,218,024	332 M
	de		5,508,289		296 M
en-fr	en	884,727	6,661,322	5,373,452	330 M
	fr		8,538,012		380 M
fr-de	fr	106,211	963,508	572,356	36 M
	de		1,204,439		37 M
fr-en	fr	106,246	1,285,467	586,498	38 M
	en		1,048,374		37 M

Table 1: Number of extracted segments as source and target after segment aligning in the <title> and <claims> fields

--------------------------------------------------------------

Download

Name	Language pair	Format
CLEF-Title	6 language pairs	tmx
CLEF-Abstract	4 language pairs	tmx txt
CLEF-Claim	de-en	tmx txt
CLEF-Claim	de-fr	tmx txt
CLEF-Claim	en-de	tmx txt
CLEF-Claim	en-fr	tmx txt
CLEF-Claim	fr-de	tmx txt
CLEF-Claim	fr-en	tmx txt
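
For readers who want to load the TMX files, here is a minimal Python sketch of reading aligned segment pairs. It assumes the files follow the usual TMX layout (each <tu> translation unit holding one <tuv> per language, with the text in <seg>) and that languages are tagged with xml:lang or the older lang attribute. The exact tagging used in this corpus is not confirmed here. The function name read_tmx_pairs and the sample document are hypothetical and for illustration only; the sample is not taken from the corpus.

```python
import io
import xml.etree.ElementTree as ET

# xml:lang lives in the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_tmx_pairs(source, src_lang, tgt_lang):
    """Yield (source_text, target_text) for each translation unit.

    `source` is a file path or a binary file object. Units that lack
    either language are skipped. Parsing is incremental, so large
    files do not have to fit in memory.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "tu":
            continue
        texts = {}
        for tuv in elem.findall("tuv"):
            # Assumption: language codes look like "de" or "de-DE".
            lang = tuv.get(XML_LANG) or tuv.get("lang") or ""
            lang = lang.lower().split("-")[0]
            seg = tuv.find("seg")
            if seg is not None:
                texts[lang] = "".join(seg.itertext()).strip()
        if src_lang in texts and tgt_lang in texts:
            yield texts[src_lang], texts[tgt_lang]
        elem.clear()  # release the processed unit


# Made-up sample in TMX layout (not real corpus data).
sample = b"""<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header/>
  <body>
    <tu>
      <tuv xml:lang="de"><seg>Verfahren zur Herstellung</seg></tuv>
      <tuv xml:lang="en"><seg>Method of production</seg></tuv>
    </tu>
  </body>
</tmx>"""

pairs = list(read_tmx_pairs(io.BytesIO(sample), "de", "en"))
for src, tgt in pairs:
    print(src, "|||", tgt)
```

To read a downloaded file instead of the sample, pass its path as the first argument. If the files tag languages differently, the lookup inside the loop would need adjusting.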

CLEF-DE-FR CLEF-FR-DE CLEF-FR-EN

WANG Lingxiao 王凌霄
