Datasets - ADA2018

1	Name	Source	Group	Size	Typology	Description	Comments	In the cluster?	Referent TA	Mattermost channel

2	Spinn3r	http://www.icwsm.org/data/	Social	1.6TB	NET/JSON	The dataset consists of over 386 million blog posts, news articles, classifieds, forum posts and social media content collected between January 13th and February 14th, 2011. It spans events such as the Tunisian revolution and the Egyptian protests (see http://en.wikipedia.org/wiki/January_2011 for a more detailed list of events spanning the dataset's time period).		YES
3	Million song	https://labrosa.ee.columbia.edu/millionsong/	Audio	280 GB	AUDIO/TXT	Audio and metadata of a million songs		YES
4	Wikipedia (all and analytics)	https://dumps.wikimedia.org/	Wiki	Analytics (~1.5 GB per day), dumps vary	STATS/NET/TXT	Wiki contents and analytics		YES
5	GDELT v2 (conflicts, news, etc.)	https://www.gdeltproject.org/data.html#rawdatafiles	Conflict	112 GB	STATS	Broad knowledge base of human activity, since 2013		YES
6	IRA Tweets	https://www.kaggle.com/fivethirtyeight/russian-troll-tweets	Social		CSV	Tweets of IRA (Russian) trolls meddling in U.S. politics		No
7	Reddit comments	http://academictorrents.com/details/85a5bd50e4c365f8df70240ffd4ecc7dec59912b	Social	304GB				YES
8	Liar Dataset	https://arxiv.org/pdf/1705.00648.pdf		3M	CSV	Fake-news labeled dataset (around 12k statements, with source)		No
9	Open Food Facts database	https://world.openfoodfacts.org/data		1.6GB	CSV	Open Food Facts is a food products database made by everyone, for everyone.		YES
10	Clinton emails	https://www.kaggle.com/kaggle/hillary-clinton-emails	Emails	51 MB	TXT	Clinton's 2016 7k email dataset	Also see: https://medium.com/mit-media-lab/what-i-learned-from-visualizing-hillary-clintons-leaked-emails-d13a0908e05e	No
11	ENRON	https://www.cs.cmu.edu/~./enron/	Emails	423 MB	TXT	ENRON senior management emails		No
12	Cooking recipes	http://infolab.stanford.edu/~west1/from-cookies-to-cooks/recipePages.zip	Food	2.5 GB (zipped)	HTML/STATS	Cooking recipes		YES
13	Panama papers	https://www.occrp.org/en/panamapapers/database	Gov	352 MB	NET	Over 500'000 offshore entities with leaked information about companies, persons, and relationships (officers, intermediaries, ...), some of which are fraudulent. The public dataset is only a fraction of the leaked data.		YES
14	News On the Web	https://www.corpusdata.org/intro.asp	News		TXT	Textual corpora from a variety of sources		YES
15	Patent citations	http://snap.stanford.edu/data/cit-Patents.html	Patents	81 MB	NET	Network of patent citations	A better dataset with more info on patents can be crawled here: http://www.patentsview.org/api/doc.html	No
16	FMA: A Dataset For Music Analysis	https://icitdocs.epfl.ch/display/clusterdocs/FMA%3A+A+Dataset+For+Music+Analysis	Metadata	15.6 GB	CSV	The dataset is a dump of the Free Music Archive.		YES
17	OpenSubtitles	https://icitdocs.epfl.ch/display/clusterdocs/OpenSubtitles	Text	31GB	TXT	The OpenSubtitles2018 dataset consists of a database dump of the OpenSubtitles.org repository of subtitles, comprising a total of 3.74 million subtitle files over 62 languages.		YES
18	UbuntuOne Trace	http://cloudspaces.eu/results/datasets		759GB		Back-end activity of a large-scale Personal Cloud (UbuntuOne)		No
19	Gitential Datasets for Open Source Projects	https://github.com/gitential/datasets			TXT	Datasets on public projects on GitHub		YES
20	Swiss open data	https://opendata.swiss/en/	Gov	Varies	STATS/TXT	Variety of Swiss Gov open datasets		No
21	Twitter				TXT	1% of the tweets of 2017		YES
22	Amazon reviews	http://jmcauley.ucsd.edu/data/amazon/	Amazon	20 GB	STATS/TXT	Amazon reviews	Raw review data + metadata	YES