Large, Free, and Interesting SQL-ready Datasets?

Jon H asks: "I'd like to teach myself various platforms and technologies that involve accessing databases. The problem is that my ideas for learning projects are usually boring toy projects, involving lots of tedious data entry just to create a useful database. Things like personal library databases. This doesn't particularly interest me. It would be much easier if I had a big, interesting dataset that I could load into an SQL database without too much trouble. Then I could spend my time on the PHP, or WebObjects, or JBoss, or whatever. I'd like something more real than the usual toy demo databases. Something weighty, 20 megabytes and up, big enough for poor software design to cause performance issues that might not show up in smaller databases. Ideally, it'd be in a form that could easily be loaded into an SQL database, perhaps even including a schema. Any links would be appreciated. Do such beasts exist?"
  • by jgardn ( 539054 ) <jgardn@alumni.washington.edu> on Tuesday July 06, 2004 @03:37PM (#9624758) Homepage Journal
    I am sure that you will get all the data you ever need from that.

    As far as real world data, you'll have to collect it yourself. Start with incoming Ethernet packets, and file them away into multiple tables. That'll give you a good dataset. You may also want to try your hand at incoming email, HTTP requests, etc...
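    A minimal sketch of that idea in Python, using scapy for the capture (the table layout here is made up, and sniffing needs root):

      import sqlite3
      from scapy.all import sniff, IP

      db = sqlite3.connect("packets.db")
      db.execute("""CREATE TABLE IF NOT EXISTS packets
                    (ts REAL, src TEXT, dst TEXT, proto INTEGER, length INTEGER)""")

      def log_packet(pkt):
          # only file away IP packets; everything else is ignored
          if pkt.haslayer(IP):
              db.execute("INSERT INTO packets VALUES (?,?,?,?,?)",
                         (float(pkt.time), pkt[IP].src, pkt[IP].dst,
                          pkt[IP].proto, len(pkt)))
              db.commit()

      sniff(prn=log_packet, store=False, count=1000)  # stop after 1000 packets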
    • Want a whole bunch of (most of the) registered domain names in the world? You'll need to fill out some forms and wait maybe a week (except for .edu), but it's worth it. Click for biz [neulevel.biz], edu, int [internic.net], info [afilias.info], org [pir.org], com, net [verisign.com]. These files are whoppers for the most part; Perl would not read the .com file under Red Hat 6, it's so big. I use them for my surf engine, iconsurf.com [iconsurf.com].
  • Why not just write a simple script to populate your database? You might not want pure nonsense in there, so you could use a dictionary file or something like that to grab random words. You could also use a fortune file to fill your DB. The way I see it, you just want a large dataset to see how your application will perform; this can easily be automated (see the sketch below), and manual data entry is clearly not needed since I see no reason you would need 'real' data.

    If this isn't down to earth enough, write
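    For instance, a throwaway Python sketch along those lines (paths and sizes are arbitrary):

      import random, sqlite3

      words = [w.strip() for w in open("/usr/share/dict/words")]
      db = sqlite3.connect("test.db")
      db.execute("CREATE TABLE IF NOT EXISTS entries (id INTEGER PRIMARY KEY, body TEXT)")

      # crank the loop count up until the database is as heavy as you want
      for i in range(500000):
          sentence = " ".join(random.choice(words) for _ in range(20))
          db.execute("INSERT INTO entries (body) VALUES (?)", (sentence,))
      db.commit()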

    • Why not get a CueCat and scan your books? You can run an application against Barnes and Noble or Amazon and parse out all kinds of stuff. I wrote one that even grabbed the front cover image to stuff in my database. Good luck!
    • The ideal would be if you could get your hands on some of the data from telcos - they store grotesque amounts of information: CDRs, invoices, intermediate processing, ... They love data. However, the chances of getting access to it are practically nil due to privacy concerns.

      But it should be reasonably easy to generate using a script, and even the process of creating such a script will teach you a lot. Let's take the CDRs: about 50 columns per row (username, account, usage data, CLID, ANI, ...), 3.5 million per day.
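      A quick Python generator in that spirit (the column subset and value formats are invented; a real CDR has ~50 fields):

        import csv, random, time

        fields = ["username", "account", "start_ts", "duration_s", "clid", "ani"]
        with open("cdrs.csv", "w", newline="") as f:
            w = csv.writer(f)
            w.writerow(fields)
            now = time.time()
            for i in range(3500000):  # roughly one day's volume
                w.writerow(["user%05d" % random.randrange(10000),
                            "acct%04d" % random.randrange(1000),
                            "%.0f" % (now - random.uniform(0, 86400)),
                            random.randrange(1, 3600),
                            "555%07d" % random.randrange(10**7),
                            "555%07d" % random.randrange(10**7)])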
  • USDA (Score:5, Informative)

    by L. VeGas ( 580015 ) on Tuesday July 06, 2004 @03:41PM (#9624800) Homepage Journal
    One I've used for laughs is the USDA Nutrient Database. It gives you, well, nutrient information on just about any food you can think of. It's normalized, and just complicated enough to have fun with.

    You're going to have to google it yourself, though.
  • NIST databases (Score:5, Interesting)

    by bmwm3nut ( 556681 ) on Tuesday July 06, 2004 @03:41PM (#9624803)
    NIST has a bunch of data. I remember a while ago downloading handwritten characters to make handwriting recognition software. They have data for just about everything; the chemistry data is probably some of the best to put in a relational database. Check out: http://www.nist.gov/srd/index.htm
  • I'm currently generating my own test data and you'll probably have to do the same. Data is a precious commodity, and the more real it is, the less likely it is to be made public. I would be fired if I made my current test data public.
  • One word... (Score:2, Insightful)

    by slappy ( 31445 )
    ...northwind
  • Census Data (Score:5, Interesting)

    by HotNeedleOfInquiry ( 598897 ) on Tuesday July 06, 2004 @03:46PM (#9624859)
    Here's all the data you'll ever need, free of charge from the gov. Some appears to be freely available and some is restricted. Have fun.

    ftp://ftp2.census.gov/census_2000/datasets/
  • IMDb (Score:5, Informative)

    by br0ck ( 237309 ) on Tuesday July 06, 2004 @03:46PM (#9624866)
    Use IMDbPY [sourceforge.net] to populate a database with all data from the downloadable files [imdb.com] from the Internet Movie Database.
  • baseball stats (Score:3, Interesting)

    by Gilk180 ( 513755 ) on Tuesday July 06, 2004 @03:47PM (#9624873)
    Sports stats [baseball1.com] are always good.

    Frankly, 20 MB is not going to give you performance issues. To realistically test the performance of your engine and your queries/schemas, you need at least enough data to fill main memory and cause disks to be used. Much more would be much better.
  • Any collection of mailing lists or newsgroups is a good candidate. You've got a wealth of data in the headers, as well as a nice free-form body and a spaghetti maze of parent and child linkage.

    Your web logs or even your system logs are good candidates as well, as are the package description and dependency databases for any given Linux distribution, and the bug reports for same. One cool project might be to load the Debian, Red Hat, etc. dependency databases, merge them together, and report t

  • Do you have a spam folder? Save all your spam emails with headers intact. Also export your email address list into a text table. Write Perl scripts to parse the emails and check for whatever variables you are interested in: what type/size of attachment, is the sender in your address book, does the sender address appear to be valid or not. You could even try to combine this with your web browser history. The key is that while you still need to build the DB yourself, there's a lot of interesting information you can
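    A Python take on that parser, using the standard mailbox module (the comment suggests Perl, but the idea is the same; the mbox path and columns are made up):

      import mailbox, sqlite3

      db = sqlite3.connect("spam.db")
      db.execute("""CREATE TABLE IF NOT EXISTS spam
                    (sender TEXT, subject TEXT, date TEXT, n_attachments INTEGER)""")

      for msg in mailbox.mbox("spam.mbox"):
          # count the MIME parts flagged as attachments
          n_attach = sum(1 for part in msg.walk()
                         if part.get_content_disposition() == "attachment")
          db.execute("INSERT INTO spam VALUES (?,?,?,?)",
                     (msg["From"], msg["Subject"], msg["Date"], n_attach))
      db.commit()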
  • dmoz (Score:3, Informative)

    by heydude ( 156181 ) on Tuesday July 06, 2004 @03:51PM (#9624930)
    Some folks have used the dmoz [dmoz.org] data. It is in RDF, so it should be fairly easy to get into most databases using most languages and an RDF library.
    • Re:dmoz (Score:2, Interesting)

      by vigilology ( 664683 )
      I second that. Use Catalog [cpan.org] to pump the dmoz files [dmoz.org] into MySQL. This should give you a nice big database, well over 1GB.

      Note: I think I had to use Catalog 1.01 because 1.02 didn't work.

  • If you want Movie information, you can grab the data files from the Internet Movie Database.
    http://www.imdb.com/interfaces [imdb.com]

    My father has a large VHS collection of movies whose info he keeps updated in a spreadsheet. It was a lot of fun to link his database to the IMDB one, and start searching for movies that included my favorite actors (and were already in the house).

    Another source of data that I use from the internet is daily stock data. If you go to this page Yahoo S&P 500 Info [yahoo.com] and click Historic [yahoo.com]
    • If you want Movie information, you can grab the data files from the Internet Movie Database.

      Damn, that's cool!

      I kinda held a grudge for the IMDB folks for a while, since the data used to be a collection of text files distributed on USENET (eventually binary DB files, along with a simple query tool, distributed via ftp). I contributed a fair bit of stuff back in those days.

      I thought that once they went "commercial" on the web, years ago, that it meant the end of all that community-created data. I'm

  • I hear the RIAA has some pretty interesting databases. Obtaining them might be challenging, though...
  • You could probably write a script to use the data from the machine learning database collection from UCI.

    Some are large, some are interesting, some are simple, but there's plenty of data.

    http://www.ics.uci.edu/~mlearn/MLSummary.html [uci.edu]
    • A similar set that I didn't see anyone mention was the KDD (Knowledge Discovery in Databases) Cup datasets. Every year for the ACM SIGKDD conference they make a dataset available and see who in the KDD community can do the best job of mining it (for instance, do the best classification on the test data). There isn't a central repository as different organizations host the challenge every year, but a quick Google search for "KDD Cup" will give you the list. The last four years had data sets such as:

      • "clicks
  • Try www.spec.org; you'll find big HTML tables or CSV files with the specs of various computers and CPUs. This is pretty useful information, say, if you want to compare a Pentium 1's floating point with a SPARCstation, or if you're looking on eBay to buy a cheap 64-bit CPU machine that can do a decent job as a firewall.

    Newer SPECint datasets can help you decide between Pentium 4 and Athlon machines, and their chipsets. So pretty cool and useful information there.

    Also heard of telemetry information, on th
  • ...are ideal. They come in a standard XML format, which is easily mutable using Perl. While I was learning how to clunk about in databases, I had a simple script which made a call to several sites with RDF/RSS, grabbed a copy of their feed, and checked to see if an entry already existed in the DB. You can even learn to account for repeat results, and update existing entries. Watch out, though; making too many calls to a website's feed can make the webmaster unhappy with you. My script ran nightly, at 2 AM EST, and grabbe
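    A minimal version of such a nightly script in Python (the feed list and table layout are placeholders; feedparser is a third-party module):

      import feedparser, sqlite3

      db = sqlite3.connect("feeds.db")
      db.execute("""CREATE TABLE IF NOT EXISTS entries
                    (guid TEXT PRIMARY KEY, feed TEXT, title TEXT, link TEXT)""")

      for url in ["http://example.com/rss.xml"]:
          feed = feedparser.parse(url)
          for e in feed.entries:
              guid = e.get("id") or e.get("link")
              # INSERT OR IGNORE handles the repeat-results case
              db.execute("INSERT OR IGNORE INTO entries VALUES (?,?,?,?)",
                         (guid, url, e.get("title"), e.get("link")))
      db.commit()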
  • The audioscrobbler database is available under a creative commons license [creativecommons.org].
    http://www.audioscrobbler.com [audioscrobbler.com]

    Audioscrobbler is a computer system that builds up a detailed profile of your musical taste. After installing an Audioscrobbler plugin, your computer sends the name of every song you play to the Audioscrobbler server. With this information, the Audioscrobbler server builds you a 'Musical Profile'. Statistics from your Musical Profile are shown on your Audioscrobbler user page, available for everyone to see.
  • Last semester, my final was to develop a database, and write a front end client to interact with the database.

    I didn't feel like entering all the data by hand, so I compiled a database of various stocks and how they performed over a few months.
    All of my data was obtained through finance.yahoo.com; they allow you to download historical data for numerous stocks and provide it as a comma-separated file.
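    Loading one of those CSVs into SQLite takes only a few lines of Python (the Date/Open/High/Low/Close/Volume column names are from memory; check them against your download):

      import csv, sqlite3

      db = sqlite3.connect("stocks.db")
      db.execute("""CREATE TABLE IF NOT EXISTS quotes
                    (symbol TEXT, date TEXT, open REAL, high REAL,
                     low REAL, close REAL, volume INTEGER)""")

      with open("table.csv") as f:
          for row in csv.DictReader(f):
              db.execute("INSERT INTO quotes VALUES (?,?,?,?,?,?,?)",
                         ("IBM", row["Date"], row["Open"], row["High"],
                          row["Low"], row["Close"], row["Volume"]))
      db.commit()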
  • NOAA (Score:4, Interesting)

    by QuantumRiff ( 120817 ) on Tuesday July 06, 2004 @04:15PM (#9625169)
    Go to the NOAA Web Site [noaa.gov] and download all the weather data for your area going back many, many years. It's fascinating to take the datasets and plot the ranges in temperature, humidity, etc.
  • Firewall output (Score:3, Interesting)

    by Gadzinka ( 256729 ) <rrw@hell.pl> on Tuesday July 06, 2004 @04:19PM (#9625229) Journal
    Catch all ``unusual'' packets on your firewall and log them. Lots of data, and interesting things to do in order to find patterns in this apparent chaos.

    I use iptables for this, but I'm sure you can do it with all the rest. You could even (as an exercise) try to log it directly to a database. I just occasionally scan the logs left by syslog-ng.

    Robert
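    Logging those syslog lines into a database could look roughly like this in Python (the regex matches the usual SRC=/DST=/PROTO=/DPT= fragments; adjust it and the log path to your setup):

      import re, sqlite3

      pat = re.compile(r"SRC=(\S+) DST=(\S+).*?PROTO=(\S+)(?:.*?DPT=(\d+))?")

      db = sqlite3.connect("fw.db")
      db.execute("""CREATE TABLE IF NOT EXISTS hits
                    (src TEXT, dst TEXT, proto TEXT, dport INTEGER)""")

      for line in open("/var/log/firewall.log"):
          m = pat.search(line)
          if m:
              db.execute("INSERT INTO hits VALUES (?,?,?,?)", m.groups())
      db.commit()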
  • You can always use genomic data - there's plenty of it to go around for everyone. Here is a link to some downloads for MySQL: http://www.ensembl.org/Download/ [ensembl.org]
  • Datasets (Score:5, Interesting)

    by Pathwalker ( 103 ) * <hotgrits@yourpants.net> on Tuesday July 06, 2004 @04:21PM (#9625261) Homepage Journal
    The USGS has a huge database of Streamflow [usgs.gov] data online.
    You can pull tables for rivers near you, and see how often they flood.

    With a bit of work, you can pull all sorts of things out of the current TIGER [census.gov] dataset - for example, there are about 4.8 million unique street/zipcode combinations in the US (a query along these lines is sketched below).
    See how many streets near where you live are unique ( two streets just down the road from me - Kentvale [ofdoom.com] and Uthers [ofdoom.com] - appear to be unique).

    There's lots of interesting data out there, keep poking around in .gov sites, and you'll find all sorts of stuff.
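    The "unique streets" check can be reproduced with a query, assuming you've loaded the TIGER records into a hypothetical streets(name, zip) table:

      import sqlite3

      db = sqlite3.connect("tiger.db")
      # distinct street/zipcode combinations across the whole country
      combos = db.execute(
          "SELECT COUNT(*) FROM (SELECT DISTINCT name, zip FROM streets)").fetchone()[0]
      # names that occur in exactly one zip code anywhere in the US
      unique_names = db.execute("""
          SELECT name FROM (SELECT name, COUNT(DISTINCT zip) AS n
                            FROM streets GROUP BY name)
          WHERE n = 1""").fetchall()
      print(combos, "street/zip combinations;",
            len(unique_names), "street names appear in only one zip code")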
    • Unique how? I know in my state, all street names in the COUNTY have to be unique for 911 purposes.
      • Unique in the entire US.

        Lots of street names show up hundreds, or thousands of times, all over the US.
        • Main shows up just under 10,000 times
        • Washington shows up just under 6000 times

        However there are quite a few which only show up once ( Analog [ofdoom.com] for example), or all of the streets with that name are concentrated in a small area (take hells gate [ofdoom.com] as an example - they're all near each other in Texas).

  • by Deagol ( 323173 ) on Tuesday July 06, 2004 @04:24PM (#9625301) Homepage
    The US Census has tons of great info, as does the USDA Nutrition Database. Mortality stats gathered by the CDC are fun in a morbid kind of way. :)

    There are some great collections of historical climate data out there for free. Here's a source for the Western US [dri.edu] (a similar compilation for the entire US would be great). Some earthquake data can be found here [noaa.gov].

    Heck, just enter "raw data" into google, along with your topic of choice, and have fun.

  • How about importing the Open Directory Project [dmoz.org]? It has an RDF (XML) dump [dmoz.org] of its current data. It is a couple hundred megs compressed and a couple gigs uncompressed.

    There are numerous [cgi-interactive-uk.com] utilities to put it [rainwaterr...eranch.org] into a database [amix.dk] for you.
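    A streaming loader for that dump might look like this in Python (the ExternalPage/Title element names are from memory -- verify them against the actual RDF before relying on this):

      import sqlite3
      import xml.etree.ElementTree as ET

      db = sqlite3.connect("dmoz.db")
      db.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT, title TEXT)")

      # iterparse streams the multi-gig file instead of loading it into RAM
      for event, elem in ET.iterparse("content.rdf.u8"):
          if elem.tag.endswith("ExternalPage"):
              title = next((c.text for c in elem if c.tag.endswith("Title")), None)
              db.execute("INSERT INTO pages VALUES (?,?)",
                         (elem.get("about"), title))
              elem.clear()  # free memory as we go
      db.commit()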

  • If you're using MS SQL Server you can get the BigPubs2000 database. If you've ever used SQL Server then you probably know what the Pubs database is, and this is a really big version of it. It's about 200 MB. You can Google to find a download site for it.
  • DAFIF data (Score:4, Interesting)

    by orn ( 34773 ) on Tuesday July 06, 2004 @04:38PM (#9625510)
    DAFIF data contains all sorts of aviation related airspace and airport information. Here's a link:

    https://164.214.2.62/products/digitalaero/index.cfm [164.214.2.62]

    Make some neat tools for that and a zillion simmers (and lots of poor pilots like me) will love you forever. Check out X-Plane [x-plane.com] while you're at it. Or even better, the open-source Flight Gear [flightgear.org] could probably use your help!

  • NIH Human Genome (Score:2, Interesting)

    by Matt Perry ( 793115 )
    You could download the human genome database from the NCBI [nih.gov]. All the files are here [nih.gov].
    • Coool! I'm downloading the XML format of the human genome now, then I can write a couple of java apps to browse through the genome. Really coool!
  • Bioinformatics (Score:4, Interesting)

    by babbage ( 61057 ) <cdeversNO@SPAMcis.usouthal.edu> on Tuesday July 06, 2004 @04:39PM (#9625536) Homepage Journal

    Have a look at the wonderful world of bioinformatics, where (hopefully) you should be able to find an array of academic institutions publishing their data for peer review.

    To pick just the one place I'm vaguely familiar with, try Boston University's BMERC lab [bu.edu], which publishes both raw genomes [bu.edu] and MySQL databases [bu.edu]. BMERC's main genome.sql.gz file is 119,294,059 bytes (113 MB, compressed), which should be well into the "large dataset" category you're asking for :-)

    There are surely many other schools publishing similar data if you poke around a bit.

    Of course, at that point, you start to be bound by the problem domain. Sure, you have lots of data and that's all well & good, but what does it mean? What sorts of analysis can you usefully do on it? Without a biology background, maybe not much, but it's an interesting field and you should be able to give yourself enough of a crash course to make something useful out of it...

    Have fun!

  • A problem I enjoy pulling out to play with now and then is building a usable data set that can do reverse IP lookup to geographic location . . . there are services that offer it, but there isn't any reason you couldn't write the stuff to do this yourself.

    Big data sets with a lot of the information being available through publicly queryable sources.

    There are lots of approaches to mapping it all out (start with class A, then to B, and onward or you can start with specific ranges . . .). Between whois info, dns,
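    The core lookup is simple once the ranges are in a table: convert the dotted quad to a 32-bit integer and do a range query. A sketch, with an invented schema:

      import sqlite3, socket, struct

      def ip_to_int(ip):
          # dotted quad -> 32-bit integer, so ranges sort and compare
          return struct.unpack("!I", socket.inet_aton(ip))[0]

      db = sqlite3.connect("geoip.db")
      db.execute("""CREATE TABLE IF NOT EXISTS blocks
                    (start INTEGER, end INTEGER, country TEXT, city TEXT)""")

      def lookup(ip):
          n = ip_to_int(ip)
          return db.execute("""SELECT country, city FROM blocks
                               WHERE start <= ? AND ? <= end
                               ORDER BY start DESC LIMIT 1""", (n, n)).fetchone()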
  • Northwind Database. (Score:3, Informative)

    by Randolpho ( 628485 ) on Tuesday July 06, 2004 @04:44PM (#9625600) Homepage Journal
    30 posts and nobody mentions the Northwind Database that comes with MS Access? You can download it and set it up as an ODBC source, which you can use with pretty much anything. Well, provided you're using Windows. ;)

    Northwind Access 2000 Download page [microsoft.com]
  • A dump of the Wikipedia database is available [wikimedia.org]. It's certainly big, and the content is interesting, although the structure isn't.
  • easy (Score:3, Funny)

    by Jonny 290 ( 260890 ) <brojames@@@ductape...net> on Tuesday July 06, 2004 @04:57PM (#9625782) Homepage
    step 1: install windows xp on PC.
    step 2: turn on application crash logging.
    step 3: give to mother
    step 4: ??? (aka wait 3 months)
    step 5: collect 3gb dataset.
  • You can download the entire Wikipedia database [wikimedia.org]. It weighs in at 15GB if you include all revisions, or 600MB with just the newest copy of everything. Have fun.
  • There are plenty of astronomical surveys that are in the public domain. For example, the Sloan Digital Sky Survey (google "SDSS SQL") has its public release in SQL format. I am not sure how they bundle it for download and offline access -- the full set is in the terabytes, so you would have to make some serious cuts.

    Look around for an astronomical survey that interests you. Learn a little of the science so that you can think of interesting things to play with (or just read some of the associated papers to repea
  • My recommendation is to install Zen Cart on the system.

    Zen Cart is a full-featured e-commerce solution, open source and written in PHP. When you install it you can choose to have it populate a demo database, which has quite a bit of stuff in it (hundreds of products).

    I don't think it's the pinnacle of database engineering, but if all you are looking for is mock data, there you go.
  • I use the FCC License databases for most of my small to medium database testing. They can be downloaded at FCC Universal Licensing System [fcc.gov], and are in BCP format (for SQL Server, easy to import into anything else). Layout files and schema create scripts are also available for download.

    There are a few related tables available for download, but I mostly use it for Name/Address test data.

    Dean

  • The freedb.org [freedb.org] database is available for download [freedb.org].
  • I do a little work with environmental data, and in my travels I found this site [geocomm.com]. It has a huge amount of GIS data collected from various sources for the purposes of mapping various parameters, such as political boundaries, vertical relief, drainage networks, and rail/road/piping networks. The only downside is that the data is aging, and the free stuff only goes down to 1:1,000,000, but it's a neat repository if you need a rough map drawn.

    Have fun.
  • by norculf ( 146473 )
    Build a log bot, run it in a few large channels on Freenode.
  • Check out the Nat'l Imaging and Mapping Agency (NIMA)'s database of placenames:

    http://earth-info.nga.mil/gns/html/cntry_files.html

    This is the DB (200 MB compressed, 700 MB uncompressed) of foreign placenames. I actually used the US placenames file that NIMA has (don't have the URL, but it should be free for US residents) for the zips, counties and cities of the US. Each "placename" has the latitude and longitude as well.

    Note that DBs like these need to be scrubbed and massaged before they can be properly read into a relational database.
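    The scrubbing pass can be as simple as this Python sketch (the tab delimiter and the FULL_NAME/LAT/LONG column names are guesses -- check the layout file that ships with the download):

      import csv, sqlite3

      db = sqlite3.connect("places.db")
      db.execute("CREATE TABLE IF NOT EXISTS places (name TEXT, lat REAL, lon REAL)")

      with open("geonames.txt", encoding="utf-8", errors="replace") as f:
          for row in csv.DictReader(f, delimiter="\t"):
              name = (row.get("FULL_NAME") or "").strip()
              if not name:
                  continue  # drop junk rows -- the "massaging" step
              db.execute("INSERT INTO places VALUES (?,?,?)",
                         (name, row.get("LAT"), row.get("LONG")))
      db.commit()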
  • by acceleriter ( 231439 ) on Tuesday July 06, 2004 @11:58PM (#9628965)
    . . . the U.S. Department of Justice has a foreign lobbyist database that should be big enough to test with.
  • Isn't the TPC-C or some similar SQL-based (like crashme) database benchmark closer to what you need?
  • Wikipedia [wikipedia.org] has weekly MySQL database dumps of their content.


  • I think that the do not call list [donotcall.gov] would be a great data set with a bunch of records and non-random data.

"If it ain't broke, don't fix it." - Bert Lantz

Working...