The Most Complete List of Big Data Sources (Datasets) for Download (Continuously Updated)

United States Census Data: The United States Census publishes reams of demographic data at the state, city, and even zip code level. The data set is fantastic for creating geographic data visualizations and can be accessed on the Census website. Alternatively, the data can be accessed via an API. One convenient way to use that API is through the choroplethr package for R. In general, this data is very clean and very comprehensive.
FBI Crime Data: The FBI crime data set is fascinating. If you’re interested in analyzing time series data, you can use it to chart changes in crime rates at the national level over a 20-year period. Alternatively, you can look at the data geographically.
CDC Cause of Death: The Centers for Disease Control and Prevention maintains a database on causes of death. The data can be segmented in almost every way imaginable: age, race, year, and so on.
Medicare Hospital Quality: Medicare maintains a database on complication rates by hospital that provides for interesting comparisons.
SEER Cancer Incidence: The US government also has data about cancer incidence, again segmented by age, race, gender, year, and other factors.
Bureau of Labor Statistics: Many important economic indicators for the United States (like unemployment and inflation) can be found on the Bureau of Labor Statistics website. Most of the data can be segmented both by time and by geography.
The Bureau of Economic Analysis: The Bureau of Economic Analysis also has national and regional economic data, like GDP and exchange rates.
IMF Economic Data: If you want a view of international data, you can find it on the IMF website.
Dow Jones Weekly Returns: Predicting stock prices is a major application of data analysis and machine learning. One dataset to explore is the weekly returns of the Dow Jones Index.
Boston Housing Data: The Boston Housing Data Set contains median housing prices in Boston suburbs as well as 13 attributes that contribute to those prices. It’s an excellent set for experimenting with various types of regressions.
Enron Emails: After the collapse of Enron, a dataset of roughly 500,000 emails with message text and metadata was released. The dataset is now famous and provides an excellent testing ground for text-related analysis. It has the messiness of real-world data.
Google N-Grams: If you’re interested in truly massive data, the Google n-grams dataset counts the frequency of words and phrases by year across a huge number of text sources. The resulting file is 2.2 TB.
Sentence Sentiments: Researchers have labeled 3,000 sentences as expressing positive or negative sentiments. If you’re interested in classifying text, this is a great place to start.
Reddit Comments: Reddit released a dataset of every comment that has ever been made on the site. That’s over a terabyte of data uncompressed, so if you want a smaller dataset to work with, Kaggle hosts the comments from May 2015 on its site.
Wikipedia: Wikipedia provides instructions for downloading the text of English language articles.
Lending Club: Lending Club provides data about loan applications it has rejected as well as the performance of loans that it issued. The dataset lends itself both to classification techniques (will a given loan default?) and to regressions (how much will be paid back on a given loan?).
Walmart: Walmart has released store-level sales data for 98 items across 45 stores. This is an excellent dataset for time series analysis and has interesting seasonal components as well.
Airbnb: This website offers several datasets on Airbnb listings in different cities.
Yelp: Yelp releases an academic dataset that contains information for the areas around 30 universities.
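
Several entries above suggest regression experiments, for example predicting a housing price from its attributes. As a minimal sketch of that kind of experiment, the following fits a one-variable least-squares line in plain Python. The numbers are synthetic placeholders invented for illustration and are not rows from the Boston Housing data; the names `fit_line` and `r_squared` are hypothetical helpers, not part of any dataset's tooling.

```python
# One-variable ordinary least squares, standard library only.
# The data below is synthetic and only stands in for a real table.

def fit_line(xs, ys):
    """Return (slope, intercept) minimizing squared error for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(xs, ys, slope, intercept):
    """Return the fraction of variance in ys explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Hypothetical attribute (rooms per dwelling) and target (price in $1000s).
rooms = [4.0, 5.0, 6.0, 7.0, 8.0]
prices = [12.0, 17.0, 22.0, 27.0, 32.0]

slope, intercept = fit_line(rooms, prices)
fit_quality = r_squared(rooms, prices, slope, intercept)
print(slope, intercept, fit_quality)
```

With a real dataset you would replace the two lists with columns read from the downloaded file, and a multi-attribute regression would normally be done with a numerical library rather than by hand.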

Cross-disciplinary data repositories, data collections and data search engines:

Single datasets and data repositories

http://archive.ics.uci.edu/ml/
http://crawdad.org/
http://data.austintexas.gov
http://data.cityofchicago.org
http://data.govloop.com
http://data.gov.uk/
http://data.gov.in
http://data.medicare.gov
http://data.seattle.gov
http://data.sfgov.org
http://data.sunlightlabs.com
https://datamarket.azure.com/
http://developer.yahoo.com/geo/g…
http://econ.worldbank.org/datasets
http://en.wikipedia.org/wiki/Wik…
http://factfinder.census.gov/ser…
http://ftp.ncbi.nih.gov/
http://gettingpastgo.socrata.com
http://googleresearch.blogspot.c…
http://books.google.com/ngrams/
http://medihal.archives-ouvertes.fr
http://public.resource.org/
http://rechercheisidore.fr
http://snap.stanford.edu/data/in…
http://timetric.com/public-data/
https://wist.echo.nasa.gov/~wist…
http://www2.jpl.nasa.gov/srtm
http://www.archives.gov/research…
http://www.bls.gov/
http://www.crunchbase.com/
http://www.dartmouthatlas.org/
http://www.data.gov/
http://www.datakc.org
http://dbpedia.org
http://www.delicious.com/jbaldwi…
http://www.faa.gov/data_research/
http://www.factual.com/
http://research.stlouisfed.org/f…
http://www.freebase.com/
http://www.google.com/publicdata…
http://www.guardian.co.uk/news/d…
http://www.infochimps.com
http://www.kaggle.com/
http://build.kiva.org/
http://www.nationalarchives.gov….
http://www.nyc.gov/html/datamine…
http://www.ordnancesurvey.co.uk/…
http://www.philwhln.com/how-to-g…
http://www.imdb.com/interfaces
http://imat-relpred.yandex.ru/en…
http://www.dados.gov.pt/pt/catal…
http://knoema.com
http://daten.berlin.de/
http://www.qunb.com
http://databib.org/
http://datacite.org/
http://data.reegle.info/
http://data.wien.gv.at/
http://data.gov.bc.ca
https://pslcdatashop.web.cmu.edu/ (interaction data in learning environments)
http://www.icpsr.umich.edu/icpsrweb/CPES/ – Collaborative Psychiatric Epidemiology Surveys: (A collection of three national surveys focused on each of the major ethnic groups to study psychiatric illnesses and health services use)
http://www.dati.gov.it
http://dati.trentino.it
http://www.databagg.com/
http://networkrepository.com – Network/ML data repository w/ visual interactive analytics
United Nations Environment Programme GRID-Geneva: many GIS datasets

More than 1 TB

The 1000 Genomes project makes 260 TB of human genome data available [13]
The Internet Archive is making an 80 TB web crawl available for research [17]
The TREC conference made the ClueWeb09 [3] dataset available a few years back. You’ll have to sign an agreement and pay a nontrivial fee (up to $610) to cover the sneakernet data transfer. The data is about 5 TB compressed.
ClueWeb12 [21] is now available, as are the Freebase annotations, FACC1 [22]
CNetS at Indiana University makes a 2.5 TB click dataset available [19]
ICWSM made a large corpus of blog posts available for their 2011 conference [2]. You’ll have to register (with a paper form, not an online one), but it’s free. It’s about 2.1 TB compressed.
The Yahoo News Feed dataset is 1.5 TB compressed and 13.5 TB uncompressed.
The Proteome Commons makes several large datasets available. The largest, the Personal Genome Project [11], is 1.1 TB in size. There are several others over 100 GB in size.

More than 1 GB

The Reference Energy Disaggregation Data Set [12] has data on home energy use; it’s about 500 GB compressed.
The Tiny Images dataset [10] has 227 GB of image data and 57 GB of metadata.
The ImageNet dataset [18] is pretty big.
The MOBIO dataset [14] is about 135 GB of video and audio data.
The Yahoo! Webscope program [7] makes several 1 GB+ datasets available to academic researchers, including an 83 GB data set of Flickr image features and the dataset used for the 2011 KDD Cup [9], from Yahoo! Music, which is a bit over 1 GB.
Google made a dataset mapping words to Wikipedia URLs (i.e., concepts) [15]. The dataset is about 10 GB compressed.
Yandex has recently made a very large web search click dataset available [1]. You’ll have to register online for the contest to download. It’s about 5.6 GB compressed.
Freebase makes regular data dumps available [5]. The largest is their Quad dump [4], which is about 3.6 GB compressed.
The Open American National Corpus [8] is about 4.8 GB uncompressed.
Wikipedia made a dataset containing information about edits available for a recent Kaggle competition [6]. The training dataset is about 2.0 GB uncompressed.
The Research and Innovative Technology Administration (RITA) has made available a dataset about the on-time performance of domestic flights operated by large carriers. The ASA compressed this dataset and makes it available for download [16].
The wiki-links data made available by Google is about 1.75 GB total [20].
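
Many of the downloads listed above are distributed compressed and are too large to load into memory at once. A minimal sketch of one way to handle that is to stream a gzip file line by line, as below. The file layout (tab-separated text with a key in the first field) and the names `count_first_field` and `write_sample` are assumptions made for illustration; none of the datasets above is claimed to use this exact format.

```python
import gzip
import os
import tempfile
from collections import Counter


def count_first_field(path, delimiter="\t"):
    """Stream a gzip-compressed text file and count values of the first field."""
    counts = Counter()
    with gzip.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:  # decompresses incrementally, one line at a time
            line = line.rstrip("\n")
            if not line:
                continue
            counts[line.split(delimiter, 1)[0]] += 1
    return counts


def write_sample(path):
    """Write a tiny made-up file so the sketch runs without any download."""
    rows = ["the\t2008\t10", "the\t2009\t12", "data\t2009\t3"]
    with gzip.open(path, mode="wt", encoding="utf-8") as handle:
        handle.write("\n".join(rows) + "\n")


sample_path = os.path.join(tempfile.mkdtemp(), "sample.tsv.gz")
write_sample(sample_path)
result = count_first_field(sample_path)
print(result)
```

The same loop works unchanged on a multi-gigabyte file, because only the current line and the running counts are held in memory.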