Large, Free, and Interesting SQL-ready Datasets?
Jon H asks: "I'd like to teach myself various platforms or technologies that involve accessing databases. The problem is, my ideas for projects to learn on usually are boring, toy projects, involving lots of boring data entry in order to create a useful database. Things like personal library databases. This doesn't particularly interest me. It would be much easier if I had a big, interesting dataset which I could load into an SQL database without too much trouble. Then I could spend my time on the PHP, or WebObjects, or JBoss, or whatever.
I'd like something more real than the usual toy demo databases. Something weighty, 20 megabytes and up, big enough for poor software design to cause performance issues which might not be seen in smaller databases.
Ideally, it'd be in a form that could easily be loaded into an SQL database, perhaps even including a schema. Any links would be appreciated. Do such beasts exist?"
Use a random number generator (Score:3, Interesting)
As far as real-world data goes, you'll have to collect it yourself. Start with incoming ethernet packets, and file them away into multiple tables. That'll give you a good dataset. You may also want to try your hand at incoming email, HTTP requests, etc...
Or a non-random name generator (Score:3, Informative)
Automate the generation (Score:2, Interesting)
If this isn't down to earth enough, write
Re:Automate the generation (Score:1)
Re:Automate the generation (Score:2)
But it should be reasonably easy to generate using a script. Even the process of creating such a script will teach you a lot. Let's take the CDRs: about 50 columns per row (username, account, usage data, CLID, ANI,
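A generator script along those lines might start out like this. It is only a sketch: the table name `cdr`, the five columns (a real CDR would have far more), and the value ranges are all made up for illustration, and SQLite is used simply because it ships with Python.

```python
import random
import sqlite3


def generate_cdrs(conn, n_rows, seed=0):
    """Insert n_rows of made-up call detail records into a `cdr` table."""
    rng = random.Random(seed)  # seeded so repeated runs give the same data
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cdr ("
        "username TEXT, account INTEGER, clid TEXT, ani TEXT, "
        "duration_secs INTEGER)"
    )
    # Hypothetical value ranges; adjust to whatever distribution you want to test.
    rows = (
        (
            f"user{rng.randrange(1000)}",
            rng.randrange(100000),
            f"555{rng.randrange(10**7):07d}",
            f"555{rng.randrange(10**7):07d}",
            rng.randrange(1, 3600),
        )
        for _ in range(n_rows)
    )
    conn.executemany("INSERT INTO cdr VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()


cdr_conn = sqlite3.connect(":memory:")
generate_cdrs(cdr_conn, 10000)
```

Point `sqlite3.connect` at a file and raise `n_rows` into the millions once you want the database to be big enough to hurt.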
USDA (Score:5, Informative)
You're going to have to google it yourself, though.
Re:USDA (Score:2, Informative)
You're going to have to google it yourself, though.
...or click here [usda.gov].
If you're gonna get snarky... (Score:3)
Re:If you're gonna get snarky... (Score:2)
NIST databases (Score:5, Interesting)
Real data hard to come by. (Score:2)
Re:John, post your email if you want a reply.. (Score:2)
One word... (Score:2, Insightful)
Re:One word... (Score:1)
Re:One word... (Score:2)
Census Data (Score:5, Interesting)
ftp://ftp2.census.gov/census_2000/datasets/
IMDb (Score:5, Informative)
baseball stats (Score:3, Interesting)
Frankly, 20 MB is not going to give you performance issues. To realistically test the performance of your engine and your queries/schemas, you need at least enough data to fill main memory and cause disks to be used. Much more would be much better.
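To put a rough number on "enough data to fill main memory", here is a back-of-the-envelope helper. The function name, the average row size, and the safety factor of 2 are assumptions for illustration, not measurements of any particular engine, and real on-disk size also depends on indexes and storage overhead.

```python
import math


def rows_to_exceed_memory(ram_bytes, avg_row_bytes, factor=2.0):
    """Estimate how many rows are needed for the raw data to be
    `factor` times larger than main memory."""
    return math.ceil(ram_bytes * factor / avg_row_bytes)


# Example: a machine with 1 GiB of RAM and rows averaging 200 bytes.
needed = rows_to_exceed_memory(2**30, 200)
```

With those example figures the answer is on the order of ten million rows, which is a long way past 20 MB.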
Re:baseball stats (Score:4, Interesting)
Plenty of data is right under your nose. (Score:2, Interesting)
Your web logs or even your system logs are good candidates as well, as are the package description and dependency databases for any given Linux distribution, and the bug reports for same. One cool project might be to load the Debian, Red Hat, etc. dependency databases and merge them together and report t
SPAM (Score:1)
dmoz (Score:3, Informative)
Re:dmoz (Score:2, Interesting)
Note: I think I had to use Catalog 1.01 because 1.02 didn't work.
Internet Movie Database & Yahoo Stock Data (Score:1)
http://www.imdb.com/interfaces [imdb.com]
My father has a large VHS collection of movies whose info he keeps updated in a spreadsheet. It was a lot of fun to link his database to the IMDB one, and start searching for movies that included my favorite actors (and were already in the house).
Another source of data that I use from the internet is daily stock data. If you go to this page Yahoo S&P 500 Info [yahoo.com] and click Historic [yahoo.com]
Re:Internet Movie Database & Yahoo Stock Data (Score:2)
Damn, that's cool!
I kinda held a grudge against the IMDB folks for a while, since the data used to be a collection of text files distributed on USENET (eventually binary DB files, along with a simple query tool, distributed via ftp). I contributed a fair bit of stuff back in those days.
I thought that once they went "commercial" on the web, years ago, that it meant the end of all that community-created data. I'm
Here's a possible source (Score:2)
Machine Learning Databases (Score:2, Interesting)
Some are large, some are interesting, some are simple, but plenty of data.
http://www.ics.uci.edu/~mlearn/MLSummary.html [uci.edu]
Re:Machine Learning Databases (Score:2)
A similar set that I didn't see anyone mention was the KDD (Knowledge Discovery in Databases) Cup datasets. Every year for the ACM SIGKDD conference they make a dataset available and see who in the KDD community can do the best job of mining it (for instance, do the best classification on the test data). There isn't a central repository as different organizations host the challenge every year, but a quick Google search for "KDD Cup" will give you the list. The last four years had data sets such as:
Spec Int (Score:2)
Newer SPECint datasets can help you decide between Pentium 4 and Athlon machines, and between their chipsets. So there's some pretty cool and useful information there.
Also heard of telemetry information, on th
RSS/RDF Feeds (Score:2)
Audioscrobbler (Score:2)
http://www.audioscrobbler.com [audioscrobbler.com]
Audioscrobbler is a computer system that builds up a detailed profile of your musical taste. After installing an Audioscrobbler Plugin, your computer sends the name of every song you play to the Audioscrobbler Server. With this information, the Audioscrobbler server builds you a 'Musical Profile'. Statistics from your Musical Profile are shown on your Audioscrobbler User Page, available for everyone t
Stock Performance (Score:2)
I didn't feel like entering in all the data by hand so I compiled a database of various stocks, and how they have performed over a few months.
All of my data was obtained through finance.yahoo.com. They allow you to download historical data for numerous stocks, and they provide it to you as a comma-separated file.
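Getting a comma-separated file like that into SQL takes only a few lines. This is a sketch under assumptions: the column headers (`Date`, `Open`, `High`, `Low`, `Close`, `Volume`), the `prices` table, the ticker `XYZ`, and the two sample rows are invented for the example, so check them against whatever the file you download actually contains.

```python
import csv
import io
import sqlite3

# Made-up sample in the assumed layout; replace with open("yourfile.csv").
SAMPLE_CSV = """Date,Open,High,Low,Close,Volume
2004-01-02,10.0,10.5,9.8,10.2,1000
2004-01-05,10.2,10.9,10.1,10.8,1500
"""


def load_prices(conn, symbol, csv_file):
    """Load one stock's daily rows from a CSV file object; return the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS prices ("
        "symbol TEXT, date TEXT, open REAL, high REAL, low REAL, "
        "close REAL, volume INTEGER)"
    )
    rows = [
        (symbol, r["Date"], float(r["Open"]), float(r["High"]),
         float(r["Low"]), float(r["Close"]), int(r["Volume"]))
        for r in csv.DictReader(csv_file)
    ]
    conn.executemany("INSERT INTO prices VALUES (?, ?, ?, ?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


price_conn = sqlite3.connect(":memory:")
loaded = load_prices(price_conn, "XYZ", io.StringIO(SAMPLE_CSV))
```

Call `load_prices` once per downloaded file with a different `symbol` and the table grows quickly.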
NOAA (Score:4, Interesting)
Firewall output (Score:3, Interesting)
I use iptables for this, but I'm sure you can do this with all the rest. You could even (as an exercise) try to log it directly to a database. I just occasionally scan logs left by syslog-ng.
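For the "scan the logs and load them" route, something like the following would do as a starting point. It assumes the lines carry the usual KEY=VALUE fields written by the iptables LOG target (SRC, DST, PROTO, DPT); the sample line, the `fw_log` table, and the function name are invented for illustration, so compare against your own syslog output first.

```python
import re
import sqlite3

# Matches KEY=VALUE pairs; the value may be empty (as in "OUT= ").
KV = re.compile(r"(\w+)=(\S*)")


def parse_line(line):
    """Return (src, dst, proto, dport) for a firewall log line, or None."""
    fields = dict(KV.findall(line))
    if "SRC" not in fields or "DST" not in fields:
        return None  # not a packet log line
    dport = int(fields["DPT"]) if fields.get("DPT") else None
    return (fields["SRC"], fields["DST"], fields.get("PROTO"), dport)


# Made-up example line in the assumed format.
SAMPLE_LINE = (
    "Jan  1 12:00:00 gw kernel: IN=eth0 OUT= MAC=00:11:22:33:44:55 "
    "SRC=192.0.2.10 DST=192.0.2.1 LEN=60 TTL=64 PROTO=TCP SPT=40000 DPT=22 SYN"
)

fw_conn = sqlite3.connect(":memory:")
fw_conn.execute(
    "CREATE TABLE fw_log (src TEXT, dst TEXT, proto TEXT, dport INTEGER)"
)
parsed = parse_line(SAMPLE_LINE)
if parsed is not None:
    fw_conn.execute("INSERT INTO fw_log VALUES (?, ?, ?, ?)", parsed)
    fw_conn.commit()
```

Loop that over a real log file and you can start asking which ports and source addresses get hit most.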
Robert
here's some genomic data (Score:1)
Datasets (Score:5, Interesting)
You can pull tables for rivers near you, and see how often they flood.
With a bit of work, you can pull all sorts of things out of the current tiger [census.gov] dataset - for example, there are about 4.8 million unique street/zipcode combinations in the US.
See how many streets near where you live are unique (two streets just down the road from me - Kentvale [ofdoom.com] and Uthers [ofdoom.com] - appear to be unique).
There's lots of interesting data out there, keep poking around in
Re:Datasets (Score:2)
Re:Datasets (Score:2)
Lots of street names show up hundreds, or thousands of times, all over the US.
However, there are quite a few which only show up once (Analog [ofdoom.com] for example), or all of the streets with that name are concentrated in a small area (take hells gate [ofdoom.com] as an example - they're all near each other in Texas).
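Once the street names are in a table, finding the ones that show up only once is a single GROUP BY. A minimal sketch, assuming a hypothetical `streets(name, zip)` table that you have already extracted from the TIGER files; the rows and zip codes below are made-up stand-ins, not real data.

```python
import sqlite3

street_conn = sqlite3.connect(":memory:")
street_conn.execute("CREATE TABLE streets (name TEXT, zip TEXT)")
street_conn.executemany(
    "INSERT INTO streets VALUES (?, ?)",
    [
        # Invented sample rows: one common name, one that appears once.
        ("Main", "10001"),
        ("Main", "60601"),
        ("Main", "94105"),
        ("Analog", "00001"),
    ],
)

# Names that occur exactly once across the whole table.
unique_names = [
    row[0]
    for row in street_conn.execute(
        "SELECT name FROM streets "
        "GROUP BY name HAVING COUNT(*) = 1 ORDER BY name"
    )
]
```

On a table with millions of rows this is also a nice query for seeing what an index on `name` does to the run time.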
Re:Datasets (Score:2)
Some of the fun ones have already been posted. (Score:3, Informative)
There are some great collections of historical climate data out there for free. Here's a source for the Western US [dri.edu] (a similar compilation for the entire US would be great). Some earthquake data can be found here [noaa.gov].
Heck, just enter "raw data" into google, along with your topic of choice, and have fun.
How about the open directory project rdf dump? (Score:2)
There are numerous [cgi-interactive-uk.com] utilities to put it [rainwaterr...eranch.org] into a database [amix.dk] for you.
SQL Server? (Score:1)
DAFIF data (Score:4, Interesting)
https://164.214.2.62/products/digitalaero/index.c
Make some neat tools for that and a zillion simmers (and lots of poor pilots like me) will love you forever. Check out X-Plane [x-plane.com] while you're at it. Or even better, the open-source Flight Gear [flightgear.org] could probably use your help!
NIH Human Genome (Score:2, Interesting)
Re:NIH Human Genome (Score:1)
Bioinformatics (Score:4, Interesting)
Have a look at the wonderful world of bioinformatics, where (hopefully) you should be able to find an array of academic institutions publishing their data for peer review.
To pick just the one place I'm vaguely familiar with, try Boston University's BMERC lab [bu.edu], which publishes both raw genomes [bu.edu] and MySQL databases [bu.edu]. BMERC's main genome.sql.gz file is 119,294,059 bytes (113 MB, compressed), which should be well into the "large dataset" category you're asking for :-)
There are surely many other schools publishing similar data if you poke around a bit.
Of course, at that point, you start to be bound by the problem domain. Sure, you have lots of data and that's all well & good, but what does it mean? What sorts of analysis can you usefully do on it? Without a biology background, maybe not much, but it's an interesting field and you should be able to give yourself enough of a crash course to make something useful out of it...
Have fun!
IP - DNS - Geographic Location dataset . . . (Score:2)
Big data sets with a lot of the information being available through publicly queryable sources.
There are lots of approaches to mapping it all out (start with class A, then to B, and onward or you can start with specific ranges . .
Northwind Database. (Score:3, Informative)
Northwind Access 2000 Download page [microsoft.com]
Wikipedia (Score:2)
easy (Score:3, Funny)
step 2: turn on application crash logging.
step 3: give to mother
step 4: ??? (aka wait 3 months)
step 5: collect 3gb dataset.
Here you go. (Score:1)
astronomical surveys (Score:2)
Look around for an astronomical survey that interests you. Learn a little of the science so that you can think of interesting things to play with (or just read some of the associated papers to repea
Zen-Cart (Score:2)
Zen is a full-featured e-commerce solution; it is open source and written in PHP. When you install it you can choose to have it populate a demo database which has quite a bit of stuff in it (100s of products).
I don't think it's the pinnacle of database engineering, but if all you are looking for is mock data, there you go.
FCC License databases (Score:2)
There are a few related tables available for download, but I mostly use it for Name/Address test data.
Dean
freedb.org (Score:2)
GIS Data Depot (Score:1)
Have fun.
IRC (Score:1)
one more (Score:2)
Agency (NIMA)'s database of placenames:
http://earth-info.nga.mil/gns/html/cntry_files.html
This is the db (200M compressed, 700M un) of foreign placenames. I actually used the US placenames file that NIMA has (don't have the URL, but it should be free for US residents) for zips, counties and cities of the US. Each "placename" has the latitude and longitude as well.
Note that db's like these need to be scrubbed and massaged before they can be properly read into a relationa
I hear . . . (Score:3, Funny)
how about using a benchmark ? (Score:2)
Wikipedia (Score:2)
~~~~
get the do not call list (Score:2)