PhD student in Machine Learning at Grenoble University
Data
This page lists the datasets I generated. You are free to use them for any purpose.
Reuters Multilingual dataset containing 6 samples of 1200 documents over 6 labels, described by 5 views of 2000 words each. [Readme][Download archive]
Cora dataset containing 2708 documents over 7 labels, described by 2 views (content and citations). [Readme][Download archive]
CiteSeer dataset containing 3312 documents over 6 labels, described by 2 views (content and citations). [Readme][Download archive]
WebKB datasets containing 4 subsets of documents over 6 labels, described by 2 views (content and citations).
Movies dataset containing 617 movies over 17 labels, described by 2 views (keywords and actors). [Readme][Download archive]
Newsgroup datasets containing subsets of the NG20 dataset, each with 3 different preprocessing schemes. The subsets and the preprocessing steps are described in our ICMLA’2010 publication (see Publications page).