Home page of Clément Grimal

Clément Grimal

Data

On this page you can find the datasets I generated. You are free to use them for any purpose.

Reuters Multilingual dataset containing 6 samples of 1200 documents over 6 labels, and described by 5 views of 2000 words each. [Readme][Download archive]

Cora dataset containing 2708 documents over 7 labels, and described by 2 views (content and citations). [Readme][Download archive]

CiteSeer dataset containing 3312 documents over 6 labels, and described by 2 views (content and citations). [Readme][Download archive]

WebKB datasets containing 4 subsets of documents over 6 labels, and described by 2 views (content and citations).
- Cornell: [Readme][Download archive]
- Texas: [Readme][Download archive]
- Washington: [Readme][Download archive]
- Wisconsin: [Readme][Download archive]

Movies dataset containing 617 movies over 17 labels, described by 2 views (keywords and actors). [Readme][Download archive]

Newsgroup datasets containing subsets of the NG20 dataset, each with 3 different preprocessing methods. The description of the subsets, as well as details on the preprocessing steps, can be found in our ICMLA’2010 publication (see Publications page).
- Supervised Mutual Information preprocessing: [Readme][Download archive]
- Partitioning Around Medoids preprocessing: [Readme][Download archive]
- Unsupervised Mutual Information preprocessing: [Readme][Download archive]

Other Newsgroup datasets containing subsets of the NG20 dataset that were built to validate a splitting approach. [Readme][Download archive ~140MB]
