Community Test Data Repository?

Posted by Cliff on Thursday January 27, 2005 @04:21AM from the reinvention-prevention dept.

BlizzyMadden inputs this query: "Currently I am working on a small utility to convert HTML to plain text. As I test this, I create more and more different types of HTML files to regression test it. I wonder to myself if these test files that I make would be beneficial to other developers who may be doing similar work. To expand on this thought, I wonder if there is a community-based repository of test data anywhere that developers can use and contribute to. Just curious if anyone knows of any project website out there that offers this."

"Such a repository would be useful for files like the following:

Complex HTML files.
RTF and Word files with lots of formatting.
Large text files.
Excel files with complex equations and macros.

Files like this would be great if developers were to share them to debug their own applications."

This discussion has been archived. No new comments can be posted.

You could try Mangleme (Score:5, Informative)

by Gopal.V ( 532678 ) writes: on Thursday January 27, 2005 @04:29AM (#11490170) Homepage Journal

Mangleme [freshmeat.net] generates Malformed HTML used for testing browsers.
Another good idea is to pull a couple hundred websites with Wget -r :)
Of course, Slashdot belongs in the "Broken HTML No-Css Table Mess" variety of HTML (just like they call Crushed Bean No-Froth Dark Latte - a coffee)

- Re:You could try Mangleme (Score:5, Informative)
  
  by Gopal.V ( 532678 ) writes: on Thursday January 27, 2005 @04:34AM (#11490193) Homepage Journal
  
  >Another good idea is to pull a couple hundred websites with Wget -r :)
  Feels weird replying to my own post... but I remembered something else that I had. A copy of the Google Programming contest data files. Get a whopping 16000 web pages in one shot from research.google.com [google.com]. (wish they'd gzipped it - but content-encoding: gzip works too)
  Sadly, all those pages are from .edu websites :)
  
- Re:You could try Mangleme (Score:3, Insightful)
  
  by LardBrattish ( 703549 ) writes:
  
  These are all useful resources but it's not what he's asking. What he wants to know is: is there a project that deliberately collects test data in a GPL sort of way so developers don't have to generate the test data themselves...
- Re:You could try Mangleme (Score:2)
  
  by JohnFluxx ( 413620 ) writes:
  
  Also try the pages that kde use. They are in the kde cvs tree:
  
  http://webcvs.kde.org/khtmltests/
here's mine (Score:4, Funny)

by DrSkwid ( 118965 ) writes: on Thursday January 27, 2005 @05:03AM (#11490294) Journal

sed s'/<[^>]+>//g'

=)

- - Re:here's mine (Score:1)
    
    by DrSkwid ( 118965 ) writes:
    
    garbage in, garbage out
    
    you also forgot
    
    <a href="/"
    >click here</a>
    
    did you notice the =) ?
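The split-tag example above is easy to reproduce. sed applies its expression one line at a time, so a tag whose closing `>` sits on the next line is never matched (and `+` only acts as a repetition operator in sed when extended regular expressions are enabled). Here is a minimal Python sketch of the same pattern, applied once to the whole string and once per line. The function names are made up for illustration, and this is still a joke-grade stripper, not an HTML parser.

```python
import re

# Same pattern as the sed expression: "<", one or more non-">" characters, ">".
TAG = re.compile(r"<[^>]+>")

def strip_tags(html):
    """Remove anything that looks like a tag; no entity or comment handling."""
    return TAG.sub("", html)

def strip_tags_per_line(html):
    """Mimic line-at-a-time processing, as sed does by default."""
    return "\n".join(TAG.sub("", line) for line in html.split("\n"))

sample = '<a href="/"\n>click here</a>'
print(repr(strip_tags(sample)))           # the match crosses the newline
print(repr(strip_tags_per_line(sample)))  # the split opening tag survives
```

Inputs like this one are exactly the kind of file a shared test corpus would carry.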
or you could... (Score:3, Informative)

by cyborch ( 524661 ) writes: on Thursday January 27, 2005 @05:13AM (#11490319) Homepage Journal

... just use lynx --dump.

- Re:or you could... (Score:1, Funny)
  
  by Anonymous Coward writes:
  
  Yeah, but who wants all the bloated dependencies of that?
- Re:or you could... (Score:1)
  
  by Christopheles ( 803724 ) writes:
  
  What is this? 1999? elinks -dump
- Re:or you could... (Score:3, Interesting)
  
  by seanyboy ( 587819 ) * writes:
  
  Not if you're in the UK. [boingboing.net]
- Re:or you could... (Score:1)
  
  by RabidSquirrel ( 643514 ) writes:
  
  We did some tests with lynx and links and found that links produced slightly nicer looking output.
  
  Unfortunately both lynx and links had a maximum line length when dumping (width option). This created random line breaks in the middle of paragraphs and what not.
  
  You can increase the max size of the width by editing the links source and recompiling. Clip from an email I sent to my coworker about changing this limit:
  
  "In case you want to play with links before I get there in the morning.
  
  LI character replaceme
  - Re:or you could... (Score:2)
    
    by cyborch ( 524661 ) writes:
    
    I set it to 65536 which should be the max size of an integer right? For some reason it lets me go over that though. wtf?"
    
    The max size of a 16-bit unsigned integer is 65535. Today most integers are 32 bits or larger, leaving you with a maximum of at least 4294967295, though I wouldn't recommend a max line length that high: lynx most likely (I didn't look at the lynx source) allocates enough memory to store the entire line, and a 4 GB memory footprint per line of output seems a bit excessive.
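The limits quoted above follow from 2^n - 1 for an n-bit unsigned integer. A small sketch of the arithmetic, with `max_unsigned` as a name invented for illustration:

```python
def max_unsigned(bits):
    """Largest value an unsigned integer of the given width can hold."""
    return 2 ** bits - 1

for bits in (16, 32):
    print(bits, max_unsigned(bits))

# A buffer of 2**32 bytes works out to 4 GiB.
print(2 ** 32 // 2 ** 30, "GiB")
```

Since 65536 is one past the 16-bit maximum, a width setting that accepts it is presumably stored in something wider, though that is a guess without reading the source.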
Sourceforge? (Score:5, Interesting)

by LardBrattish ( 703549 ) writes: on Thursday January 27, 2005 @05:18AM (#11490331) Homepage

If there isn't a test data project maybe you could start one. If people agree that it's a good idea then it'll grow... if not...

I believe the idea has merit and should be done. This would be useful for the developers of many FOSS applications. A "torture test" of nasty Excel files or Word files would help Open Office etc. HTML files would be good for the Mozilla team. Maybe they would be interested in providing the first few sets of data.

I'd also recommend tying the automated regression tests to this open source test data so every developer could download the source & the test data and make sure the new feature doesn't break anything...

Any new troublesome files could be added to the test data and new tests could be built to ensure that the software deals with them.
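As a rough illustration of tying regression tests to a shared corpus, here is a minimal Python sketch. The directory layout (each `name.html` paired with a `name.expected.txt`), the function names, and the toy converter are all assumptions made for this example and do not describe any existing project.

```python
import pathlib
import tempfile

def run_regression(corpus_dir, convert):
    """Return the names of input files whose converted output differs
    from the stored known-good output.

    Assumed layout (hypothetical): for every name.html in corpus_dir
    there is a name.expected.txt holding the expected result.
    """
    failures = []
    for source in sorted(pathlib.Path(corpus_dir).glob("*.html")):
        expected = source.with_name(source.stem + ".expected.txt")
        actual = convert(source.read_text(encoding="utf-8"))
        if not expected.exists() or actual != expected.read_text(encoding="utf-8"):
            failures.append(source.name)
    return failures

def demo():
    """Build a two-file throwaway corpus and run a toy converter over it."""
    with tempfile.TemporaryDirectory() as tmp:
        root = pathlib.Path(tmp)
        (root / "ok.html").write_text("hello", encoding="utf-8")
        (root / "ok.expected.txt").write_text("HELLO", encoding="utf-8")
        (root / "bad.html").write_text("world", encoding="utf-8")
        (root / "bad.expected.txt").write_text("planet", encoding="utf-8")
        # str.upper stands in for the real converter under test.
        return run_regression(root, str.upper)

print(demo())
```

Adding a newly discovered troublesome file then means dropping in one input and one expected output; the harness itself does not change.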

- Re:Sourceforge? (Score:2)
  
  by Kiaser Zohsay ( 20134 ) writes:
  
  Mozilla has done a tiny bit [mozilla.org] of html testing.
Great idea. (Score:3, Insightful)

by seanyboy ( 587819 ) * writes: on Thursday January 27, 2005 @05:57AM (#11490436)

Not only that, but it'd be great to see things like lists of made up addresses and other test data.

- Re:Great idea. (Score:4, Interesting)
  
  by seanyboy ( 587819 ) * writes: on Thursday January 27, 2005 @07:27AM (#11490726)
  
  Why the hell is that a troll? In the past I've wanted 100,000 or so mailing addresses to test an indexing routine on, and have ended up spending time writing a random address generator. If I'd been able to go to a site (like lorem ipsum), ask for 100,000 addresses in CSV format, and download them as a zipped file, it'd have saved time. I'm sure I'm not the only developer this has happened to. Jeez.
  
  - Re:Great idea. (Score:2)
    
    by Justice8096 ( 673052 ) writes:
    
    Yes - especially if you had common misspellings of the streets.
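A random address generator of the kind described above can be quite small. The sketch below is one possible shape, not anyone's actual tool: the word lists, the column names, and the function names are all invented for the example, and a fixed seed keeps the output reproducible so a regression run sees the same data every time.

```python
import csv
import io
import random

# Tiny made-up vocabulary; a real corpus would want far more variety.
STREETS = ["Oak", "Maple", "Cedar", "Elm", "Pine"]
SUFFIXES = ["St", "Ave", "Rd", "Ln"]
CITIES = ["Springfield", "Riverton", "Lakeside", "Fairview"]

def fake_addresses(count, seed=0):
    """Yield (street, city, postcode) tuples; same seed, same data."""
    rng = random.Random(seed)
    for _ in range(count):
        street = f"{rng.randint(1, 9999)} {rng.choice(STREETS)} {rng.choice(SUFFIXES)}"
        yield street, rng.choice(CITIES), f"{rng.randint(0, 99999):05d}"

def addresses_csv(count, seed=0):
    """Render the generated addresses as CSV text with a header row."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["street", "city", "postcode"])
    writer.writerows(fake_addresses(count, seed))
    return out.getvalue()

print(addresses_csv(3))
```

Calling `addresses_csv(100000)` gives the bulk file; deliberate street misspellings could be layered on as a separate pass over the generated rows.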
- IAWTP (Score:3, Interesting)
  
  by Clover_Kicker ( 20761 ) writes:
  
  I once needed a few thousand names for test data. The only big list I could find was the list of men killed in Vietnam [war-stories.com]
  Anyone have a less disturbing list of real or fake names? I suppose someone could grab some data from a genealogy site, strip out just the names, and use that.
  If anyone knows of (or starts) a project like this I'd probably contribute.
  - Re:IAWTP (Score:2)
    
    by archeopterix ( 594938 ) * writes:
    
    I once needed a few thousand names for test data. The only big list I could find was the list of men killed in Vietnam
    Anyone have a less disturbing list of real or fake names? I suppose someone could grab some data from a genealogy site, strip out just the names, and use that.
    I would just scan and OCR a few pages from a phone book. As far as I know, the data in the phonebook cannot be copyrighted, although there might be some privacy protection laws that forbid keeping databases of personal data withou
  - Re:IAWTP (Score:2)
    
    by Thng ( 457255 ) writes:
    
    I used the US Census bureau list of names [census.gov] for a school project once (this is the 1990 listing). Wrote a small perl script that took random names from each file and put them together for a full name.
    There are last names, men's first names and women's first names files.
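The approach described above (pick a random first name and a random last name and join them) might look like this in Python. It is a sketch under assumptions: the input rows are taken to have the name as the first whitespace-separated field, the sample rows and their numeric columns are dummies rather than real census figures, and the function names are invented here.

```python
import random

def first_field(lines):
    """Take the first whitespace-separated token of each non-empty line."""
    return [line.split()[0] for line in lines if line.strip()]

def random_full_names(first_names, last_names, count, seed=0):
    """Pair random first and last names, title-cased, reproducibly."""
    rng = random.Random(seed)
    return [
        f"{rng.choice(first_names).title()} {rng.choice(last_names).title()}"
        for _ in range(count)
    ]

# Stand-in rows with dummy columns; the assumed layout is "NAME <other columns>".
first = first_field(["JAMES 0.0 0.0 1", "MARY 0.0 0.0 2"])
last = first_field(["SMITH 0.0 0.0 1", "JOHNSON 0.0 0.0 2"])
print(random_full_names(first, last, 5))
```

With real name files, the two lists would come from reading each file's lines instead of the inline samples.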
Mozilla has it (Score:1, Informative)

by Anonymous Coward writes:

Mozilla has a plaintext serializer for HTML.

Vidar Braut Haarr
http://www.q1n.org/ [q1n.org]
css zen garden (Score:1)

by Free_Trial_Thinking ( 818686 ) writes:

Here's a python script I wrote to download all the zen garden examples. It works by incrementing the url and getting the next page. (myutils.pad turns '1' into '001') This puts all the pages into one big file, but you could easily make it do separate files:

import os,sys,time,urllib2,urlparse,re
import myutils

baseurl=r'http://www.csszengarden.com/'
for i in range(1,146):
    paddedi=myutils.pad(str(i),3,'0',True)
    url=baseurl + paddedi + '/' + paddedi + '.css'
    print 'trying: ' + url
    try:
- Re:css zen garden (Score:2)
  
  by DrSkwid ( 118965 ) writes:
  
  those who don't know unix .......
  
  curl -f 'http://www.csszengarden.com/[001-146]/[001-146].css' -o 'csszengarden#1.css'
  
  ok it does a 145 * 145 extra requests but hey, who cares !!
  
  btw how can you trust the design advice of a site that has dark brown text on a lighter brown background and grey body text. awful, try reading that when you're over 65 and your eyes get 30% less contrast!
  - Re:css zen garden (Score:1)
    
    by neglige ( 641101 ) writes:
    
    try reading that when you're over 65 and your eyes get 30% less contrast!
    
    Don't read. "Become one with the web." (SCNR)
- - Re:css zen garden (Score:1)
    
    by Free_Trial_Thinking ( 818686 ) writes:
    
    Thanks, good idea. How does it work? Is it some kind of formatting instruction deal?
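On the "how does it work" question: curl expands a bracketed range such as `[001-146]` in the URL into each zero-padded number in turn, and `#1` in the `-o` name is replaced by the current value of the first range. Two independent ranges produce every combination, which is where the extra requests come from. A Python sketch that builds only the matching pairs is below; it uses `str.zfill` in place of the `myutils.pad` helper, whose source is not shown, so its behavior is assumed from the "turns '1' into '001'" description.

```python
BASE_URL = "http://www.csszengarden.com/"

def zen_garden_css_urls(first=1, last=145):
    """Build URLs of the form BASE/NNN/NNN.css for first..last inclusive."""
    urls = []
    for i in range(first, last + 1):
        padded = str(i).zfill(3)  # '1' -> '001'
        urls.append(f"{BASE_URL}{padded}/{padded}.css")
    return urls

urls = zen_garden_css_urls()
print(len(urls), urls[0], urls[-1])
```

The default bounds mirror `range(1,146)` from the original script, which stops at 145.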
Proprietary Problems... (Score:2)

by Justice8096 ( 673052 ) writes:

I'd like to see something like this centralized for everything... (databases, C++ compilers, etc...) but there would need to be a way to anonymously post, because otherwise corporate counterintelligence could be gleaned from checking which things most companies check for (and don't check for).
For your purposes, check out www.org . They have "test suites" that check the web standard compliances of browsers, readers, HTML, CSS, etc... I've used them whenever I do web sites as a way of assuring that my displa
yes (Score:4, Funny)

by Tom7 ( 102298 ) writes: on Thursday January 27, 2005 @10:43AM (#11491842) Homepage Journal

I hear that the internet is a community-driven repository of html

And what about media tags? (Score:2)

by ciroknight ( 601098 ) writes:

To expand the original prompt: how about media tags? EXIF, ID3, etc?
Interesting Idea, but basically useless (Score:5, Insightful)

by Jahz ( 831343 ) writes: on Thursday January 27, 2005 @12:19PM (#11492940) Homepage Journal

The idea of a testing repository is quite interesting, but, in practice, a useless one.

Such a repository would end up as no more than a garbage collection. Additionally, it is generally not too hard to create test data for most projects. Also, the chance that someone else has created test data for the exact problem you are working on is quite slim. And then there is always the most important point of them all:

If someone has already created test data for your specific problem, they have probably already solved your problem! Enter repositories like CPAN and SourceForge.

- Re:Interesting Idea, but basically useless (Score:2)
  
  by thpr ( 786837 ) writes:
  
  Such a repository would end up as no more than a garbage collection.
  I fear that this is a significant problem, but disagree with some of the rest of your analysis.
  If someone has already created test data for your specific problem, they have probably already solved your problem! Enter repositories like CPAN and SourceForge.
  You have a powerful point, but that one solution may not work for everyone. It may not be in a suitable programming language or it might be an unusable license. Also, just because one
  - Re:Interesting Idea, but basically useless (Score:1)
    
    by Jahz ( 831343 ) writes:
    
    Good point.
    
    Although CPAN and SourceForge host almost only GPL'd (or MIT'd etc) code. Thus you should not have a problem using it as long as you license the derivative works under an equal or lesser restricting license.
    
    Also your point about other solutions being very close to what is needed, but not close enough, was interesting. Such a collection would be far more beneficial if the testing files came with a list of OSS that used them. That way you can see how other developers used the testing code.
    
    A
- Re:Interesting Idea, but basically useless (Score:3, Insightful)
  
  by dubl-u ( 51156 ) * writes:
  The idea of a testing repository is quite interesting, but, in practice, a useless one.
  
  Your imagination is pretty limited. This is of use in any area where developers will use similar data for lots of different things, especially in areas of active research. Some examples include:
  
  mailing addresses - all sorts of apps need to parse international mailing addresses: wouldn't it be better to test with real samples?
  email - corpuses of known good email and known spam email are necessary for any spam recogniti
  - Re:Interesting Idea, but basically useless (Score:2)
    
    by Meostro ( 788797 ) writes:
    
    Also, something like facial recognition [cmu.edu] needs large test datasets, and it's never a "solved" problem. There's always a way to do it faster or better or more easily. Other things like Canterbury Corpus [canterbury.ac.nz] or Calgary Corpus [uwaterloo.ca] are datasets used for comparison between compression algorithms. Meaningful comparisons can be made between different algorithms based on how well they perform on them simply because they've been used enough and are standard enough.
    
    I'm so interested in this that I just registered gpldata.
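To make the compression-corpus point concrete, here is a hedged sketch of how a shared dataset lets different algorithms be compared on equal terms. It uses Python's standard `zlib`, `bz2`, and `lzma` modules; the inline sample is a placeholder, not a file from either corpus, and `compression_ratios` is a name invented for the example.

```python
import bz2
import lzma
import zlib

def compression_ratios(data):
    """Return compressed_size / original_size per algorithm (lower is better)."""
    compressors = {
        "zlib": zlib.compress,
        "bz2": bz2.compress,
        "lzma": lzma.compress,
    }
    return {name: len(fn(data)) / len(data) for name, fn in compressors.items()}

# Placeholder input; a fair comparison would read the corpus files instead.
sample = b"the quick brown fox jumps over the lazy dog\n" * 200
for name, ratio in sorted(compression_ratios(sample).items(), key=lambda kv: kv[1]):
    print(f"{name}: {ratio:.3f}")
```

The numbers only mean something when everyone runs against the same bytes, which is the argument for a standard, shared dataset.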
I'd love to help... (Score:1)

by Ciaran_H ( 579351 ) * writes:

I would definitely be interested in helping with making something like this if it turns out there isn't one (or if there is, I'd be interested in helping to maintain it). It sounds like a good idea.
Crystal reports? (Score:1, Offtopic)

by Destoo ( 530123 ) writes:

I need a Crystal report to plain text converter.

Anyone can cook up a script or something? I really can't make sense out of them...

just drop me a note at my gmail if you'd like to try to help.
Just found an existing repository (Score:1)

by malcomvetter ( 851474 ) writes:

It's located at www.theWholeDangInternet.com. For an older copy try the Internet Archive [archive.org].
- Re:Just found an existing repository (Score:1)
  
  by BlizzyMadden ( 814008 ) writes:
  
  Oh, snap! You got me. No, my whole point is test data in all sorts of formats, such as PDF and Excel. Test data that is known to cause all sorts of problems. For example, if I have a Word file that crashes OO.org at one point, why not offer those files publicly to other developers of other products (Abiword, KWord) to see if they perhaps have similar problems.
Great Idea (Score:1)

by Agent_9191 ( 812909 ) writes:

I think this would be a great idea because it would give developers a great starting point for applications. Specifically if your application can handle the files in the repository the way you expect them to, then you've done good. Granted you may be reinventing the wheel because if it's up there, it means that somebody else may have solved the problem already. OR they used the file to solve a different problem...Either way it'd be a great thing to have. Sounds like it'd be prime for Sourceforge (as lo
Use html2text (Score:1)

by devillion ( 831115 ) writes:

Works fine for me.
link [tucows.com]
International languages (Score:1)

by tungwaiyip ( 608795 ) writes:

It would be great to have web pages for all natural languages that the current computer infrastructure supports.
Test pages from the W3C (Score:1)

by vil ( 144488 ) writes:

This might be what you're looking for:

http://www.w3.org/MarkUp/Test/ [w3.org]
