d301: Machine Learning Datasets

Source: https://docs.google.com/spreadsheets/d/1AQvZ7-Kg0lSZtG1wlgbIsrm90HaTZrJGQMz-uKRRlFw/edit#gid=0

Name Link Purpose
20 Newsgroups http://kdd.ics.uci.edu/databases/20newsgroups/20newsgroups.html The text from 20,000 messages taken from 20 Usenet newsgroups for text analysis, classification, etc.
Amazon Reviews http://jmcauley.ucsd.edu/data/amazon/ Over 142 million product reviews for sentiment analysis, recommender systems, and more.
Football Strategy https://www.crowdflower.com/wp-content/uploads/2016/03/Football-Scenarios-DFE-832307.csv Thousands of scenarios to make the best coaching decisions.
Horses for Courses https://www.kaggle.com/lukebyrne/horses-for-courses Horse-racing data for predicting race results.
Human Activity Recognition with Smartphones https://www.kaggle.com/uciml/human-activity-recognition-with-smartphones Sensor data for recognizing human activities (walking, sitting, etc.).
Labeled Faces in the Wild http://vis-www.cs.umass.edu/lfw/ 13,000 named faces for facial recognition. Multiple training and test sets.
National Survey on Drug Use and Health http://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/34933
NORB 3D Object Recognition http://www.cs.nyu.edu/~ylclab/data/norb-v1.0/ Binocular images of 50 toy figurines for 3D object recognition from image.
One Million Songs http://labrosa.ee.columbia.edu/millionsong/ Audio features and metadata for a subset (10,000) of the one million popular songs dataset for recognition/classification.
SMS Spam Collection http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/ A collection of 5,574 SMS (text) messages, some spam, some normal, for spam filtering.
Hate Speech Identification https://www.crowdflower.com/wp-content/uploads/2016/03/twitter-hate-speech-classifier-DFE-a845520.csv A sampling of Twitter posts that have been judged based on whether they are offensive or contain hate speech, as a training set for text analysis.
Hidden Beauty of Flickr Pictures http://www.di.unito.it/~schifane/dataset/beauty-icwsm15/ 15,000 Flickr photo IDs that have received ratings based on aesthetics, for image analysis.
Yahoo Instant Messenger Friends Connectivity Graph http://webscope.sandbox.yahoo.com/catalog.php?datatype=g Connections between Yahoo users who communicate with each other using Yahoo Messenger. Can be used to identify key social contacts/influencers. Add the dataset to the cart to access it.
Record of Heart Sound http://mldata.org/repository/data/viewslug/record-of-heart-sound/ Recordings of normal and abnormal heartbeats, used to recognize heart murmur, etc.
Prostate Cancer http://mldata.org/repository/data/viewslug/prostate-cancer/ Tumor and nontumor samples, used to recognize prostate cancer.
Wine Quality http://archive.ics.uci.edu/ml/datasets/Wine+Quality Chemical properties of red and white wines (separately) and quality, for classification.
Mushroom Identification http://archive.ics.uci.edu/ml/datasets/Mushroom For hypothetically classifying mushrooms as edible or poisonous based on their characteristics.
UFO Reports https://github.com/planetsig/ufo-reports 80,000 historic reports for classification or regression. This dataset has been standardized from the source data at nuforc.org.
Militarized Interstate Disputes http://www.correlatesofwar.org/data-sets/MIDs Nearly 200 years of international threats, conflicts, etc. for modelling or prediction. Includes action taken, level of hostility, fatalities, and outcomes.
NBA & MLB Stats http://www.dougstats.com/ Current and past season stats for teams and players for fantasy sports predictions.
Sign Language http://www-i6.informatik.rwth-aachen.de/~dreuw/database.php
MusicNet http://homes.cs.washington.edu/~thickstn/musicnet.html MusicNet is a collection of 330 freely-licensed classical music recordings, together with over 1 million annotated labels indicating the precise time of each note in every recording, the instrument that plays each note, and the note’s position in the metrical structure of the composition. The labels are acquired from musical scores aligned to recordings by dynamic time warping. The labels are verified by trained musicians; we estimate a labeling error rate of 4%. We offer the MusicNet labels to the machine learning and music communities as a resource for training models and a common benchmark for comparing results.
ProductHunt https://data.world/producthunt/product-hunt-research
Reddit https://www.reddit.com/r/datasets/comments/3bxlg7/i_have_every_publicly_available_reddit_comment/ 1.7 billion Reddit comments
VQA2 https://arxiv.org/pdf/1612.00837.pdf Visual question answering dataset, now 2X larger.
UCI ML Repo https://archive.ics.uci.edu/ml/datasets.html 351 datasets
Hacker News http://aaron-hoffman.blogspot.com/2016/10/hacker-news-dataset-october-2016.html Full comment dump of HN
FIRE http://www.ics.forth.gr/cvrl/fire/ Fundus Image Registration Dataset
LASIESTA http://www.gti.ssr.upm.es/data/LASIESTA Labeled and Annotated Sequences for Integral Evaluation of SegmenTation Algorithms
LAKH MIDI Dataset http://colinraffel.com/projects/lmd/ Its goal is to facilitate large-scale music information retrieval, both symbolic (using the MIDI files alone) and audio content-based (using information extracted from the MIDI files as annotations for the matched audio files).
Lamem http://memorability.csail.mit.edu/ Large-scale Image Memorability
Pratheepan dataset http://cs-chan.com/project1.htm Human Skin Detection dataset
COCO-Stuff dataset http://calvin.inf.ed.ac.uk/datasets/coco-stuff COCO-Stuff semantic segmentation dataset
NewsQA http://datasets.maluuba.com/NewsQA Maluuba’s NewsQA is a new machine reading comprehension dataset for developing algorithms capable of answering questions requiring human-level comprehension and reasoning skills. This dataset of CNN news articles has over 110K Q&A pairs. Questions are written by humans in natural language. Questions may not have answers, and answers may be multiword passages.
Awesome Public Datasets https://github.com/caesar0301/awesome-public-datasets A massive GitHub repo of accessible, public datasets. The datasets are not, by nature, completely clean and purpose-built for ML.
ImageNet http://www.image-net.org/ The ImageNet project is a large visual database designed for use in visual object recognition software research.
Element List Scientific Data Directory http://www.elementlist.com/scientific_data/ An online repository of links to free, publicly available scientific datasets, mostly from university, industry, and government research programs.
IMDB dataset ftp://ftp.fu-berlin.de/pub/misc/movies/database/
MSCOCO http://mscoco.org/ Image segmentation and object recognition
Google Books Ngrams https://aws.amazon.com/datasets/google-books-ngrams/
OpenML repository http://www.openml.org/search?type=data Almost 20k datasets
Enron Email Corpus https://en.wikipedia.org/wiki/Enron_Corpus The Enron Corpus is a large database of over 600,000 emails generated by 158 employees of the Enron Corporation and acquired by the Federal Energy Regulatory Commission during its investigation after the company’s collapse.
German Traffic Signs http://benchmark.ini.rub.de/ German Traffic Sign Recognition Benchmark (GTSRB) and German Traffic Sign Detection Benchmark (GTSDB). The first was used in a competition at IJCNN 2011.
SYNTHIA http://www.synthia-dataset.net 500,000 frames of annotated video from a virtual city, with labels for stereo, optical flow, semantic segmentation, odometry…
Elektra http://adas.cvc.uab.es/elektra Over 20 different autonomous driving datasets: pedestrians, semantic segmentation, stereo…
Cornell Movie–Dialogs Corpus http://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html This corpus contains a large metadata-rich collection of fictional conversations extracted from raw movie scripts.
Virtual KITTI http://www.xrce.xerox.com/Our-Research/Computer-Vision/Proxy-Virtual-Worlds Large photo-realistic synthetic video understanding dataset (high res. videos @30FPS generated with the Unity Game Engine). Automatically, exactly, and fully annotated for all 2D and 3D ground truths at the pixel level (object detection & tracking, segmentation, optical flow, depth, structure from motion, …).
Bureau of Labor Statistics http://www.bls.gov/data/ Dozens of longitudinal datasets provided by the US Department of Labor (CPI, PPI, employment, population, pay, etc.)
KITTI Vision Benchmark Suite http://www.cvlibs.net/datasets/kitti/ Computer vision benchmarks: stereo, flow, odometry, object detection or tracking
Allen Institute for Artificial Intelligence Datasets http://allenai.org/data.html Datasets for computer vision, reasoning and inference, question answering, and natural language understanding
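
Each row above follows the order name, link, purpose, so the list can be turned into structured records by splitting on the URL. The following is a minimal sketch under that assumption, not a definitive parser. `DatasetRow` and `parse_row` are hypothetical names introduced here for illustration, and rows whose name itself contains a URL would need extra handling.

```python
import re
from typing import NamedTuple, Optional


class DatasetRow(NamedTuple):
    """One row of the list (hypothetical helper type)."""
    name: str
    link: str
    purpose: str  # empty string when the row has no description


# Assumption: the link is the first http://, https://, or ftp:// token
# in the row and runs up to the next whitespace character.
_URL = re.compile(r"(?:https?|ftp)://\S+")


def parse_row(line: str) -> Optional[DatasetRow]:
    """Split a 'name link purpose' row; return None if the line has no URL."""
    match = _URL.search(line)
    if match is None:
        return None
    name = line[:match.start()].strip()
    purpose = line[match.end():].strip()
    return DatasetRow(name, match.group(0), purpose)


rows = [
    "SMS Spam Collection http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/ "
    "A collection of 5,574 SMS (text) messages, some spam, some normal, for spam filtering.",
    "IMDB dataset ftp://ftp.fu-berlin.de/pub/misc/movies/database/",
    "Link Purpose",  # a header-style line without a URL is skipped
]

parsed = [parse_row(row) for row in rows]
for record in parsed:
    if record is not None:
        print(record.name, "|", record.link, "|", record.purpose or "(no description)")
```

Splitting on the URL rather than on whitespace keeps multi-word names and descriptions intact, and rows that have no description simply yield an empty `purpose`.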

Feel free to update with your own Machine Learning Datasets.