Community Test Data Repository? 50

BlizzyMadden inputs this query: "Currently I am working on a small utility to convert HTML to plain text. As I test this, I create more and more different types of HTML files to regression test it. I wonder to myself if these test files that I make would be beneficial to other developers who may be doing similar work. To expand on this thought, I wonder if there is a community-based repository of test data anywhere that developers can use and contribute to. Just curious if anyone knows of any project website out there that offers this."
"Such a repository would be useful for files like the following:
Complex HTML files.
RTF and Word files with lots of formatting.
Large text files.
Excel files with complex equations and macros.
Files like this would be great if developers were to share them to debug their own applications."
This discussion has been archived. No new comments can be posted.

  • by Gopal.V ( 532678 ) on Thursday January 27, 2005 @04:29AM (#11490170) Homepage Journal
    Mangleme [freshmeat.net] generates malformed HTML used for testing browsers.

    Another good idea is to pull a couple hundred websites with wget -r :)

    Of course, Slashdot belongs in the "Broken HTML No-Css Table Mess" variety of HTML (just like they call Crushed Bean No-Froth Dark Latte - a coffee)
  • here's mine (Score:4, Funny)

    by DrSkwid ( 118965 ) on Thursday January 27, 2005 @05:03AM (#11490294) Journal
    sed -E 's/<[^>]+>//g'

    =)
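
For comparison with the sed one-liner above, here is a minimal sketch of the same job done with a real parser, using only Python's standard html.parser module. The names TextExtractor and html_to_text are made up for this illustration, and it makes no attempt at layout (no line breaks for paragraphs, no link handling). What it does show is one case a plain tag-stripping regex gets wrong: the contents of script and style elements are dropped instead of leaking into the output.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and skips the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text of an HTML string with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())
```

Something like html_to_text('<p>a &amp; b</p>') should give back "a & b", which is exactly the kind of input/output pair a shared test repository could store.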

  • or you could... (Score:3, Informative)

    by cyborch ( 524661 ) on Thursday January 27, 2005 @05:13AM (#11490319) Homepage Journal
    ... just use lynx --dump.
    • by Anonymous Coward
      Yeah, but who wants all the bloated dependencies of that?
    • What is this? 1999? elinks -dump
    • Re:or you could... (Score:3, Interesting)

      by seanyboy ( 587819 ) *
      Not if you're in the UK. [boingboing.net]
    • We did some tests with lynx and links and found that links produced slightly nicer looking output.

      Unfortunately both lynx and links had a maximum line length when dumping (width option). This created random line breaks in the middle of paragraphs and whatnot.

      You can increase the max size of the width by editing the links source and recompiling. Clip from an email I sent to my coworker about changing this limit:

      "In case you want to play with links before I get there in the morning.

      LI character replaceme
      • I set it to 65536 which should be the max size of an integer right? For some reason it lets me go over that though. wtf?"

        The maximum value of a 16-bit unsigned integer is 65535. Today most integers are 32 bits or larger, leaving you with a maximum of at least 4294967295, though I wouldn't recommend a max line length that high since lynx most likely (I didn't look at the lynx source) allocates enough memory to store the entire line, and a 4 GB memory footprint per line of output seems a bit excessive.
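
The arithmetic behind those two limits is easy to check. A small sketch (max_unsigned is just a helper name picked for this example): an unsigned integer with n bits tops out at 2^n - 1, so 65536 itself already needs a 17th bit, which is why it only fits if the variable is wider than 16 bits.

```python
def max_unsigned(bits):
    """Largest value an unsigned integer of the given width can hold."""
    return (1 << bits) - 1


print(max_unsigned(16))        # 65535
print(max_unsigned(32))        # 4294967295
print((65536).bit_length())    # 17
```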

  • Sourceforge? (Score:5, Interesting)

    by LardBrattish ( 703549 ) on Thursday January 27, 2005 @05:18AM (#11490331) Homepage
    If there isn't a test data project maybe you could start one. If people agree that it's a good idea then it'll grow... if not...

    I believe the idea has merit and should be done. This would be useful for the developers of many FOSS applications. A "torture test" of nasty Excel files or Word files would help Open Office etc. HTML files would be good for the Mozilla team. Maybe they would be interested in providing the first few sets of data.

    I'd also recommend tying the automated regression tests to this open source test data so every developer could download the source & the test data and make sure the new feature doesn't break anything...

    Any new troublesome files could be added to the test data and new tests could be built to ensure that the software deals with them.
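
As a rough sketch of what tying regression tests to shared test data could look like: assume (purely for illustration) that the repository holds each input file next to a file with the expected output, for example page.html beside page.txt. The function name run_regression, the directory layout, and the file extensions are all assumptions, not an existing project's conventions.

```python
from pathlib import Path


def run_regression(case_dir, convert, in_ext=".html", out_ext=".txt"):
    """Return the names of inputs whose converted output differs from the
    stored expected output, or that have no expected output at all."""
    failures = []
    for source in sorted(Path(case_dir).glob("*" + in_ext)):
        expected_path = source.with_suffix(out_ext)
        if not expected_path.exists():
            # A troublesome new file with no agreed answer yet counts as failing.
            failures.append(source.name)
            continue
        actual = convert(source.read_text(encoding="utf-8"))
        if actual != expected_path.read_text(encoding="utf-8"):
            failures.append(source.name)
    return failures
```

A developer would pass in their own converter as convert and fail the build if the returned list is not empty. Adding a newly discovered problem file is then just a matter of dropping two files into the directory.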
  • Great idea. (Score:3, Insightful)

    by seanyboy ( 587819 ) * on Thursday January 27, 2005 @05:57AM (#11490436)
    Not only that, but it'd be great to see things like lists of made up addresses and other test data.
    • Re:Great idea. (Score:4, Interesting)

      by seanyboy ( 587819 ) * on Thursday January 27, 2005 @07:27AM (#11490726)
      Why the hell is that a troll? In the past I've wanted 100,000 or so mailing addresses to test an indexing routine on, and have ended up spending time writing a random address generator. If I'd been able to go to a site (like lorem ipsum), ask for 100,000 addresses in CSV format and download them as a zipped file, it'd have saved time. I'm sure I'm not the only developer this has happened to. Jeez.
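
For a sense of how small such a random address generator can be, here is a sketch that writes fake addresses as CSV. Everything in it is invented for the example: the word lists, the column names, and the five-digit postcode format. Real test data would need far more variety (international formats, unit numbers, odd punctuation), which is the argument for sharing it.

```python
import csv
import random

# Made-up building blocks; none of these are meant to be real addresses.
STREET_NAMES = ["Oak", "Maple", "Cedar", "Elm", "Willow", "Birch"]
STREET_TYPES = ["Street", "Avenue", "Road", "Lane", "Court"]
CITIES = ["Springfield", "Riverton", "Lakeside", "Fairview"]
FIELDS = ["street", "city", "postcode"]


def random_address(rng):
    """Build one fake address as a dict keyed by FIELDS."""
    number = rng.randint(1, 9999)
    street = f"{number} {rng.choice(STREET_NAMES)} {rng.choice(STREET_TYPES)}"
    return {
        "street": street,
        "city": rng.choice(CITIES),
        "postcode": f"{rng.randint(0, 99999):05d}",
    }


def write_addresses(out, count, seed=0):
    """Write a header row plus `count` fake addresses to a text stream.
    A fixed seed makes the output repeatable between test runs."""
    rng = random.Random(seed)
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    for _ in range(count):
        writer.writerow(random_address(rng))
```

Calling write_addresses on a file opened with newline="" and count=100000 would produce the kind of file described above.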
    • IAWTP (Score:3, Interesting)

      I once needed a few thousand names for test data. The only big list I could find was the list of men killed in Vietnam [war-stories.com]

      Anyone have a less disturbing list of real or fake names? I suppose someone could grab some data from a genealogy site, strip out just the names, and use that.

      If anyone knows of (or starts) a project like this I'd probably contribute.

      • I once needed a few thousand names for test data. The only big list I could find was the list of men killed in Vietnam

        Anyone have a less disturbing list of real or fake names? I suppose someone could grab some data from a genealogy site, strip out just the names, and use that.

        I would just scan and OCR a few pages from a phone book. As far as I know, the data in the phonebook cannot be copyrighted, although there might be some privacy protection laws that forbid keeping databases of personal data withou

      • I used the US Census bureau list of names [census.gov] for a school project once (this is the 1990 listing). Wrote a small perl script that took random names from each file and put them together for a full name.
        There are last names, men's first names and women's first names files.
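
The original Perl script isn't shown, but the idea can be sketched in a few lines of Python. This assumes each line of a name file starts with the name, followed by whitespace-separated statistics columns; that layout and the function names load_names and random_full_name are assumptions for the example.

```python
import random


def load_names(lines):
    """Take the first whitespace-separated field of each non-empty line."""
    return [line.split()[0].title() for line in lines if line.strip()]


def random_full_name(first_names, last_names, rng=random):
    """Combine one random first name with one random last name."""
    return f"{rng.choice(first_names)} {rng.choice(last_names)}"
```

In use, you would open the last-name file and one of the first-name files, pass each to load_names, and call random_full_name as many times as you need test records.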
  • Mozilla has it (Score:1, Informative)

    by Anonymous Coward
    Mozilla has a plaintext serializer for HTML.

    Vidar Braut Haarr
    http://www.q1n.org/ [q1n.org]
  • Here's a python script I wrote to download all the zen garden examples. It works by incrementing the url and getting the next page. (myutils.pad turns '1' into '001') This puts all the pages into one big file, but you could easily make it do separate files:

    import os,sys,time,urllib2,urlparse,re
    import myutils

    baseurl=r'http://www.csszengarden.com/'
    for i in range(1,146):
        paddedi=myutils.pad(str(i),3,'0',True)
        url=baseurl + paddedi + '/' + paddedi + '.css'
        print 'trying: ' + url
        try:
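
The script above is cut off at the try block, so the rest of it is unknown. As a hedged sketch of the same idea in Python 3 (not the poster's code), the version below builds the same URL pattern with str.zfill instead of myutils.pad, writes each stylesheet to its own file, and skips numbers that fail to download. The function names are invented here, and whether the site still serves these URLs is not something this sketch checks.

```python
import urllib.request
from pathlib import Path

BASE_URL = "http://www.csszengarden.com/"


def stylesheet_urls(first=1, last=145):
    """Yield URLs such as .../001/001.css for each number in the range."""
    for i in range(first, last + 1):
        padded = str(i).zfill(3)  # '1' -> '001'
        yield f"{BASE_URL}{padded}/{padded}.css"


def fetch(url):
    """Download one URL and return its body as bytes."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def download_all(out_dir, first=1, last=145, fetch=fetch):
    """Save each stylesheet as a separate file; return the paths written."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    for url in stylesheet_urls(first, last):
        try:
            data = fetch(url)
        except OSError:
            # urllib's URLError and HTTPError are OSError subclasses;
            # a missing design number is simply skipped.
            continue
        target = out / url.rsplit("/", 1)[1]
        target.write_bytes(data)
        saved.append(target)
    return saved
```

Passing fetch in as a parameter means the loop can be exercised with a fake downloader, without touching the network.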

    • those who don't know unix .......

      curl -f 'http://www.csszengarden.com/[001-146]/[001-146].css' -o 'csszengarden#1.css'

      ok it does about 145 * 145 extra requests but hey, who cares !!

      btw how can you trust the design advice of a site that has dark brown text on a lighter brown background and grey body text. awful, try reading that when you're over 65 and your eyes get 30% less contrast!
  • I'd like to see something like this centralized for everything... (databases, C++ compilers, etc...) but there would need to be a way to anonymously post, because otherwise corporate counterintelligence could be gleaned from checking which things most companies check for (and don't check for).
    For your purposes, check out www.org . They have "test suites" that check the web standards compliance of browsers, readers, HTML, CSS, etc... I've used them whenever I do web sites as a way of assuring that my displa
  • yes (Score:4, Funny)

    by Tom7 ( 102298 ) on Thursday January 27, 2005 @10:43AM (#11491842) Homepage Journal
    I hear that the internet is a community-driven repository of html
  • To expand the original prompt: how about media tags? EXIF, ID3, etc?
  • by Jahz ( 831343 ) on Thursday January 27, 2005 @12:19PM (#11492940) Homepage Journal
    The idea of a testing repository is quite interesting, but, in practice, a useless one.

    Such a repository would end up as no more than a garbage collection. Additionally, it is generally not too hard to create test data for most projects. Also, the chance that someone else has created test data for the exact problem you are working on is quite slim. And then there is always the most important point of them all:

    If someone has already created test data for your specific problem, they have probably already solved your problem! Enter repositories like CPAN and SourceForge.
    • Such a repository would end up as no more than a garbage collection.

      I fear that this is a significant problem, but disagree with some of the rest of your analysis.

      If someone has already created test data for your specific problem, they have probably already solved your problem! Enter repositories like CPAN and SourceForge.

      You have a powerful point, but that one solution may not work for everyone. It may not be in a suitable programming language or it might be an unusable license. Also, just because one

      • Good point.

        Although CPAN and SourceForge host almost only GPL'd (or MIT'd etc) code. Thus you should not have a problem using it as long as you license the derivative works under an equally or less restrictive license.

        Also your point about other solutions being very close to what is needed, but not close enough, was interesting. Such a collection would be far more beneficial if the testing files came with a list of OSS that used them. That way you can see how other developers used the testing code.

        A
    • The idea of a testing repository is quite interesting, but, in practice, a useless one.

      Your imagination is pretty limited. This is of use in any area where developers will use similar data for lots of different things, especially in areas of active research. Some examples include:
      • mailing addresses - all sorts of apps need to parse international mailing addresses: wouldn't it be better to test with real samples?
      • email - corpuses of known good email and known spam email are necessary for any spam recogniti
      • Also, something like facial recognition [cmu.edu] needs large test datasets, and it's never a "solved" problem. There's always a way to do it faster or better or more easily. Other things like Canterbury Corpus [canterbury.ac.nz] or Calgary Corpus [uwaterloo.ca] are datasets used for comparison between compression algorithms. Meaningful comparisons can be made between different algorithms based on how well they perform on them simply because they've been used enough and are standard enough.

        I'm so interested in this that I just registered gpldata.
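
To make the compression-corpus point concrete, here is a small sketch that runs the same bytes through three compressors from Python's standard library and reports compressed size divided by original size. The function name compression_ratios is made up for the example, and the three algorithms are just the ones that ship with Python, not the ones the corpora were originally designed around.

```python
import bz2
import lzma
import zlib

COMPRESSORS = {
    "zlib": zlib.compress,
    "bz2": bz2.compress,
    "lzma": lzma.compress,
}


def compression_ratios(data):
    """Map each compressor name to compressed size / original size."""
    if not data:
        raise ValueError("need at least one byte of input")
    return {name: len(compress(data)) / len(data)
            for name, compress in COMPRESSORS.items()}
```

Run over every file in a shared corpus, the numbers become comparable between people because everyone is feeding in identical input.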
  • I would definitely be interested in helping with making something like this if it turns out there isn't one (or if there is, I'd be interested in helping to maintain it). It sounds like a good idea.
  • by Destoo ( 530123 )
    I need a Crystal report to plain text converter.

    Anyone can cook up a script or something? I really can't make sense out of them...

    just drop me a note at my gmail if you'd like to try to help.
  • It's located at www.theWholeDangInternet.com. For an older copy try the Internet Archive [archive.org].
    • Oh, snap! You got me. No, my whole point is test data in all sorts of formats, such as PDF and Excel. Test data that is known to cause all sorts of problems. For example, if I have Word files that crash OO.org at one point, why not offer those files publicly to developers of other products (Abiword, KWord) to see if they perhaps have similar problems?
  • I think this would be a great idea because it would give developers a great starting point for applications. Specifically if your application can handle the files in the repository the way you expect them to, then you've done good. Granted you may be reinventing the wheel because if it's up there, it means that somebody else may have solved the problem already. OR they used the file to solve a different problem...Either way it'd be a great thing to have. Sounds like it'd be prime for Sourceforge (as lo
  • Works fine for me.

    link [tucows.com]

  • It would be great to have web pages for all natural languages that the current computer infrastructure supports.
  • This might be what you're looking for:

    http://www.w3.org/MarkUp/Test/ [w3.org]