| Name | Source | Group | Size | Typology | Description | Comments | In the cluster? | Referent TA | Mattermost channel |
|---|---|---|---|---|---|---|---|---|---|
| Spinn3r | http://www.icwsm.org/data/ | Social | 1.6 TB | NET/JSON | The dataset consists of over 386 million blog posts, news articles, classifieds, forum posts and social media content between January 13th and February 14th. It spans events such as the Tunisian revolution and the Egyptian protests (see http://en.wikipedia.org/wiki/January_2011 for a more detailed list of events spanning the dataset's time period). | | Yes | | |
| Million song | https://labrosa.ee.columbia.edu/millionsong/ | Audio | 280 GB | AUDIO/TXT | Audio and metadata of a million songs | | Yes | | |
| Wikipedia (all and analytics) | https://dumps.wikimedia.org/ | Wiki | Analytics (~1.5 GB per day); dump sizes vary | STATS/NET/TXT | Wiki contents and analytics | | Yes | | |
| GDELT v2 (conflicts, news, etc.) | https://www.gdeltproject.org/data.html#rawdatafiles | Conflict | 112 GB | STATS | Broad knowledge base of human activity, since 2013 | | Yes | | |
| IRA Tweets | https://www.kaggle.com/fivethirtyeight/russian-troll-tweets | Social | | CSV | Tweets of IRA (Russian) trolls meddling in U.S. politics | | No | | |
| Reddit comments | http://academictorrents.com/details/85a5bd50e4c365f8df70240ffd4ecc7dec59912b | Social | 304 GB | | | | Yes | | |
| Liar Dataset | https://arxiv.org/pdf/1705.00648.pdf | | 3M | CSV | Fake-news labeled dataset (around 12k statements, with source) | | No | | |
| Open Food Facts database | https://world.openfoodfacts.org/data | | 1.6 GB | CSV | Open Food Facts is a food products database made by everyone, for everyone. | | Yes | | |
| Clinton emails | https://www.kaggle.com/kaggle/hillary-clinton-emails | Emails | 51 MB | TXT | Clinton's 2016 dataset of 7k emails | Also see: https://medium.com/mit-media-lab/what-i-learned-from-visualizing-hillary-clintons-leaked-emails-d13a0908e05e | No | | |
| ENRON | https://www.cs.cmu.edu/~./enron/ | Emails | 423 MB | TXT | ENRON senior management emails | | No | | |
| Cooking recipes | http://infolab.stanford.edu/~west1/from-cookies-to-cooks/recipePages.zip | Food | 2.5 GB (zipped) | HTML/STATS | Cooking recipes | | Yes | | |
| Panama papers | https://www.occrp.org/en/panamapapers/database | Gov | 352 MB | NET | Over 500,000 offshore entities with leaked information about companies, persons and relationships (officers, intermediaries, ...), some of which are fraudulent. The public dataset is only a fraction of the leaked data. | | Yes | | |
| News On the Web | https://www.corpusdata.org/intro.asp | News | | TXT | Textual corpora from a variety of sources | | Yes | | |
| Patent citations | http://snap.stanford.edu/data/cit-Patents.html | Patents | 81 MB | NET | Network of patent citations | A better dataset with more info on patents can be crawled here: http://www.patentsview.org/api/doc.html | No | | |
| FMA: A Dataset For Music Analysis | https://icitdocs.epfl.ch/display/clusterdocs/FMA%3A+A+Dataset+For+Music+Analysis | Metadata | 15.6 GB | CSV | The dataset is a dump of the Free Music Archive. | | Yes | | |
| OpenSubtitles | https://icitdocs.epfl.ch/display/clusterdocs/OpenSubtitles | Text | 31 GB | TXT | The OpenSubtitles2018 dataset consists of a database dump of the OpenSubtitles.org repository of subtitles, comprising a total of 3.74 million subtitle files over 62 languages. | | Yes | | |
| UbuntuOne Trace | http://cloudspaces.eu/results/datasets | | 759 GB | | Back-end activity of a large-scale Personal Cloud (UbuntuOne) | | No | | |
| Gitential Datasets for Open Source Projects | https://github.com/gitential/datasets | | | TXT | Datasets on public GitHub projects | | Yes | | |
| Swiss open data | https://opendata.swiss/en/ | Gov | Varies | STATS/TXT | Variety of Swiss Gov open datasets | | No | | |
| Twitter | | | | TXT | 1% of the tweets of 2017 | | Yes | | |
| Amazon reviews | http://jmcauley.ucsd.edu/data/amazon/ | Amazon | 20 GB | STATS/TXT | Amazon reviews | Raw review data + metadata | Yes | | |