===== Presentation =====

These subsets have been extracted from the 20-Newsgroup dataset, which can be found at http://people.csail.mit.edu/jrennie/20Newsgroups/

authors:
  * Syed Fawad Hussain http://sites.google.com/site/fawadsyed/
  * Clément Grimal http://membres-lig.imag.fr/grimal/

Questions, suggestions or comments are appreciated!

See: An Improved Co-Similarity Measure for Document Clustering, Syed Fawad Hussain, Clément Grimal, Gilles Bisson, ICMLA'2010.

date: October, 2010

===== Description =====

The archive contains 6 subsets:
  * M2: talk.politics.mideast, talk.politics.misc (500 documents)
  * M5: comp.graphics, rec.motorcycles, rec.sport.baseball, sci.space, talk.politics.mideast (500 documents)
  * M10: alt.atheism, comp.sys.mac.hardware, misc.forsale, rec.autos, rec.sport.hockey, sci.crypt, sci.electronics, sci.med, sci.space, talk.politics.guns (500 documents)
  * NG1: rec.sport.baseball, rec.sport.hockey (400 documents)
  * NG2: comp.os.ms-windows.misc, comp.windows.x, rec.motorcycles, sci.crypt, sci.space (1000 documents)
  * NG3: comp.os.ms-windows.misc, comp.windows.x, misc.forsale, rec.motorcycles, sci.crypt, sci.space, talk.politics.mideast, talk.religion.misc (1600 documents)

Every subset contains 10 samples. The documents have been selected randomly, and the words have been selected based on supervised mutual information.

===== Files =====

All the files are encoded in UTF-8.

  * _.txt -- the document-word matrix, containing the number of co-occurrences.
  * _act.txt -- the list of assignments of the documents to topics.
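
The two kinds of files described above can be read together with a few lines of Python. This is only a minimal sketch: it assumes the matrix file holds one document per line with whitespace-separated counts, and that the assignment file holds one whitespace- or newline-separated label per document. Neither layout is stated in this README, so check a file before relying on it. The function name `load_subset` and its parameters are hypothetical.

```python
def load_subset(matrix_path, labels_path):
    """Read a document-word count matrix and its topic assignments.

    ASSUMED layout (not confirmed by the README):
      - matrix file: one document per line, counts separated by whitespace
      - assignment file: one label per document, separated by whitespace
        or newlines
    """
    matrix = []
    with open(matrix_path, encoding="utf-8") as f:
        for line in f:
            fields = line.split()
            if fields:  # skip blank lines
                # int(float(x)) accepts counts written as "3" or "3.0"
                matrix.append([int(float(x)) for x in fields])

    with open(labels_path, encoding="utf-8") as f:
        # Labels are kept as strings, since their encoding is not documented
        labels = f.read().split()

    if len(matrix) != len(labels):
        raise ValueError(
            f"{len(matrix)} documents but {len(labels)} assignments"
        )
    return matrix, labels
```

A call would look like `matrix, labels = load_subset(path_to_matrix, path_to_assignments)`, where both paths are placeholders for one sample's pair of files. The length check is a cheap guard: if the assumed orientation is wrong (words as rows instead of documents), it will usually fail immediately.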