===== Presentation =====
These subsets were extracted from the 20 Newsgroups dataset,
which can be found at http://people.csail.mit.edu/jrennie/20Newsgroups/

authors: Syed Fawad Hussain <Fawad.Hussain@imag.fr>
		 http://sites.google.com/site/fawadsyed/
		 Clément Grimal <Clement.Grimal@imag.fr>
		 http://membres-lig.imag.fr/grimal/
		 Questions, suggestions or comments are appreciated!
		 
See: An Improved Co-Similarity Measure for Document Clustering, 
Syed Fawad Hussain, Clément Grimal, Gilles Bisson, ICMLA'2010.
		
date: October, 2010


===== Description =====
The archive contains 6 subsets:
	* M2:  talk.politics.mideast, talk.politics.misc (500 documents)
	* M5:  comp.graphics, rec.motorcycles, rec.sport.baseball, 
	       sci.space, talk.politics.mideast  (500 documents)
	* M10: alt.atheism, comp.sys.mac.hardware, misc.forsale, rec.autos, 
	       rec.sport.hockey, sci.crypt, sci.electronics, sci.med, sci.space, 
	       talk.politics.guns  (500 documents)
	* NG1: rec.sport.baseball, rec.sport.hockey  (400 documents)
	* NG2: comp.os.ms-windows.misc, comp.windows.x, rec.motorcycles, 
	       sci.crypt, sci.space  (1000 documents)
	* NG3: comp.os.ms-windows.misc, comp.windows.x, misc.forsale, 
	       rec.motorcycles, sci.crypt, sci.space, talk.politics.mideast, 
	       talk.religion.misc  (1600 documents)
Every subset contains 10 samples. The documents were selected randomly,
and the words were selected based on supervised mutual information.
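
For readers unfamiliar with mutual-information word selection, the sketch
below shows one common variant: score each word by the mutual information
between its presence in a document and the document's topic, then keep the
k best-scoring words. This is an illustration only. The binarisation of
counts, the function names (mutual_information, select_words) and the
parameter k are assumptions, not necessarily the procedure used to build
this archive.

```python
import math
from collections import Counter


def mutual_information(presence, labels):
    """Mutual information (in nats) between a binary word-presence
    indicator and the topic label, estimated from raw frequencies."""
    n = len(labels)
    joint = Counter(zip(presence, labels))  # counts of (presence, topic) pairs
    p_w = Counter(presence)                 # counts of presence values
    p_c = Counter(labels)                   # counts of topics
    mi = 0.0
    for (w, c), count in joint.items():
        p_wc = count / n
        mi += p_wc * math.log(p_wc / ((p_w[w] / n) * (p_c[c] / n)))
    return mi


def select_words(matrix, labels, k):
    """Return the column indices of the k words with the highest score.

    Assumption: matrix is a list of rows, one row per document,
    one column per word, holding occurrence counts.
    """
    scores = []
    for j in range(len(matrix[0])):
        presence = [1 if row[j] > 0 else 0 for row in matrix]
        scores.append((mutual_information(presence, labels), j))
    # Highest score first; ties are broken by column index.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [j for _, j in scores[:k]]
```

A word that appears in every document, or equally often in all topics,
scores close to zero and is dropped first.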


===== Files =====
All the files are encoded in UTF-8.

<subset>_<sample>.txt -- 
	the document-word matrix, containing the number of co-occurrences.

<subset>_act.txt --
	contains the list of the assignments of the documents to topics.
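
A minimal loading sketch in Python follows. The exact file layout is not
specified above, so the code assumes one document per line in the matrix
file, with counts separated by whitespace or commas, and one topic label
per document in the <subset>_act.txt file. The function names
(parse_matrix, parse_labels, load_sample) are hypothetical; adjust the
parsing if the actual files differ.

```python
from pathlib import Path


def parse_matrix(text):
    """Parse a document-word count matrix.

    Assumed layout: one document per line, counts separated by
    whitespace or commas.
    """
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        rows.append([int(float(tok)) for tok in line.replace(",", " ").split()])
    return rows


def parse_labels(text):
    """Parse topic assignments.

    Assumed layout: one whitespace-separated label per document.
    Labels are kept as strings.
    """
    return text.split()


def load_sample(directory, subset, sample):
    """Load <subset>_<sample>.txt and <subset>_act.txt from directory."""
    base = Path(directory)
    matrix = parse_matrix(
        (base / f"{subset}_{sample}.txt").read_text(encoding="utf-8"))
    labels = parse_labels(
        (base / f"{subset}_act.txt").read_text(encoding="utf-8"))
    if len(matrix) != len(labels):
        # Under the assumed layout there is one label per matrix row.
        raise ValueError("number of matrix rows and labels differ")
    return matrix, labels
```

For example, load_sample("data", "M2", 1) would read data/M2_1.txt and
data/M2_act.txt, assuming the samples are numbered from 1 and the archive
was unpacked into a directory named data.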