Datasets for matching identities



KDD15 Dataset


Due to privacy constraints, we can provide only some of the datasets used in our KDD15 paper. Specifically, we can share the matching identities that users willingly provide on Google+. If you are interested in using this data, please send us an email as described in the Contact Us section and indicate which of the following parts you need.

  • Google+ sitemap (contains all the users in Google+ in 2010): profile information for 475,257 random Google+ identities. Each line corresponds to a user and contains the links to the user's matching identities on other social networks.

  • Profile information on other social networks for the matching identities of the users in the above Google+ dataset. Each line corresponds to a (Facebook/LinkedIn/Twitter/Flickr/Myspace) identity. The linked_username element holds the ID of the user's Google+ identity (i.e., the key that links to the google_profiles.json.tar.gz dataset).

  • For 1037 Twitter identities, the list of Facebook identities that have the same or a similar name. Each file corresponds to one Twitter identity and lists the Facebook identities whose names are the same as or similar to that of the Twitter identity. The ID of the Twitter identity is in the name of the file.
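As a rough illustration of how the linked_username key could be used, the sketch below groups per-network profile records by the Google+ ID they point to. It assumes that each line is a standalone JSON object, which the .json file name suggests but the description does not confirm. Apart from linked_username, every field name and value in the sample is made up for illustration.

```python
import json
from collections import defaultdict


def group_by_google_id(lines):
    """Group per-network profile records by the Google+ ID they link to.

    Assumes one JSON object per line (an assumption, not a documented format).
    """
    grouped = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        google_id = record.get("linked_username")
        if google_id is None:
            continue  # record carries no link to a Google+ identity
        grouped[google_id].append(record)
    return dict(grouped)


# Hypothetical records: only "linked_username" comes from the dataset
# description; "network" and "screen_name" are illustrative placeholders.
sample_lines = [
    '{"linked_username": "g1", "network": "twitter", "screen_name": "alice"}',
    '{"linked_username": "g1", "network": "flickr", "screen_name": "alice_photos"}',
    '{"linked_username": "g2", "network": "facebook", "screen_name": "bob"}',
]
profiles = group_by_google_id(sample_lines)
```

With real data, the lines would come from the extracted file rather than from an in-memory list, and the field names should be checked against the actual records first.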

WWW13 Dataset


  • List of matching identities: the first column is the Twitter ID, the second column the Flickr ID, and the third column the Yelp ID. A missing ID in a row means that we did not find the matching identity.

  • Profile information of users: each line corresponds to a different profile.

  • User posts and their metadata: each line corresponds to a tweet and contains the ID of the user who generated the tweet.
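The three-column list of matching identities could be read along the following lines. This is a minimal sketch that assumes tab-separated columns and an empty field for a missing ID. Neither the delimiter nor the encoding of missing values is stated in the description, so both should be adjusted to the actual file. The IDs in the sample are invented.

```python
import csv
import io


def read_matches(handle, delimiter="\t"):
    """Read (twitter, flickr, yelp) rows; an empty field becomes None.

    The tab delimiter is an assumption, not a documented property of the file.
    """
    matches = []
    for row in csv.reader(handle, delimiter=delimiter):
        if not row:
            continue  # skip blank lines
        row = (row + ["", "", ""])[:3]  # pad short rows to three columns
        twitter_id, flickr_id, yelp_id = (field.strip() or None for field in row)
        matches.append({"twitter": twitter_id, "flickr": flickr_id, "yelp": yelp_id})
    return matches


# Invented sample: the second row lacks a Flickr ID, the third a Yelp ID.
sample = io.StringIO("111\tf_aaa\ty_xxx\n222\t\ty_yyy\n333\tf_ccc\t\n")
matches = read_matches(sample)

# Keep only users for whom all three identities were found.
full_triples = [m for m in matches if all(m.values())]
```

Mapping missing IDs to None makes it easy to select users matched on all three sites, or on any pair of them, without special-casing empty strings.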

Contact Us


If you are interested in using this data, please send us an email at oana.goga@mpi-sws.org to get the link where you can download the data. In the email:

  1. Explicitly state whether you accept the terms and conditions of the data. If you are a student or intern, please involve your advisor in confirming that you will respect the terms and conditions.

  2. Describe for what purpose you will use the data.

  3. Indicate which dataset you need and which parts of the dataset.


Terms and Conditions

  1. You will use the data solely for the purpose of non-profit research or non-profit education.

  2. You will respect the privacy of end users and organizations that may be identified in the data. You will not attempt to reverse engineer, decrypt, de-anonymize, derive or otherwise re-identify anonymized information.

  3. You will not distribute the data beyond your immediate research group.

  4. If you create a publication using our datasets, please cite our papers as follows.

    • If you use the WWW13 Dataset, please cite our WWW13 paper.

    • If you use the KDD15 Dataset, please cite our KDD15 paper.

@inproceedings{Goga:2013:EIA:2488388.2488428,
	author = {Goga, Oana and Lei, Howard and Parthasarathi, Sree Hari Krishnan and Friedland, Gerald and Sommer, Robin and Teixeira, Renata},
	title = {Exploiting Innocuous Activity for Correlating Users Across Sites},
	booktitle = {Proceedings of the 22nd International Conference on World Wide Web},
	series = {WWW '13},
	year = {2013},
	isbn = {978-1-4503-2035-1},
	location = {Rio de Janeiro, Brazil},
	pages = {447--458},
	numpages = {12},
	url = {http://doi.acm.org/10.1145/2488388.2488428},
	doi = {10.1145/2488388.2488428},
	acmid = {2488428},
	publisher = {ACM},
	address = {New York, NY, USA},
	keywords = {account correlation, geotags, language, location, online social networks, privacy, user profiles},
} 

@inproceedings{Goga:2015:RPM:2783258.2788601,
	author = {Goga, Oana and Loiseau, Patrick and Sommer, Robin and Teixeira, Renata and Gummadi, Krishna P.},
	title = {On the Reliability of Profile Matching Across Large Online Social Networks},
	booktitle = {Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining},
	series = {KDD '15},
	year = {2015},
	isbn = {978-1-4503-3664-2},
	location = {Sydney, NSW, Australia},
	pages = {1799--1808},
	numpages = {10},
	url = {http://doi.acm.org/10.1145/2783258.2788601},
	doi = {10.1145/2783258.2788601},
	acmid = {2788601},
	publisher = {ACM},
	address = {New York, NY, USA},
	keywords = {matching accounts, online social networks, reliability},
}