Community Test Data Repository?
BlizzyMadden asks: "Currently I am working on a small utility to convert HTML to plain text. As I test this, I create more and more different types of HTML files to regression test it. I wonder to myself if these test files that I make would be beneficial to other developers who may be doing similar work. To expand on this thought, I wonder if there is a community-based repository of test data anywhere that developers can use and contribute to. Just curious if anyone knows of any project website out there that offers this."
"Such a repository would be useful for files like the following:
Complex HTML files.
RTF and Word files with lots of formatting.
Large text files.
Excel files with complex equations and macros.
Files like this would be great if developers were to share them to debug their own applications."
You could try Mangleme (Score:5, Informative)
Another good idea is to pull a couple hundred websites with wget -r :)
Of course, slashdot belongs in the "Broken HTML No-Css Table Mess" variety of HTML (just like they call Crushed Bean No-Froth Dark Latte - a coffee)
Re:You could try Mangleme (Score:5, Informative)
Feels weird replying to my own post... but I remembered something else that I had: a copy of the Google Programming contest data files. Get a whopping 16000 web pages in one shot from research.google.com [google.com]. (wish they'd gzipped it - but content-encoding: gzip works too)
Sadly, all those pages are from
Re:You could try Mangleme (Score:3, Insightful)
Re:You could try Mangleme (Score:2)
http://webcvs.kde.org/khtmltests/
here's mine (Score:4, Funny)
=)
Re:here's mine (Score:1)
you also forgot
<a href="/"
>click here</a>
did you notice the =) ?
or you could... (Score:3, Informative)
Re:or you could... (Score:1, Funny)
Re:or you could... (Score:1)
Re:or you could... (Score:3, Interesting)
Re:or you could... (Score:1)
Unfortunately both lynx and links had a maximum line length when dumping (width option). This created random line breaks in the middle of paragraphs and whatnot.
You can increase the max size of the width by editing the links source and recompiling. Clip from an email I sent to my coworker about changing this limit:
"In case you want to play with links before I get there in the morning.
LI character replaceme
Re:or you could... (Score:2)
I set it to 65536 which should be the max size of an integer right? For some reason it lets me go over that though. wtf?"
Max size of a 16-bit unsigned integer is 65535. Today most integers are 32 bits or larger, leaving you with a maximum of at least 4294967295, though I wouldn't recommend a max line length that high: lynx most likely (I didn't look at the lynx source) allocates enough memory to store the entire line, and a 4 GB memory footprint per line of output seems a bit excessive.
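Both limits come from 2^n - 1 for an n-bit unsigned integer. A quick Python check of the arithmetic (max_unsigned is just an illustrative helper name, not anything from the lynx or links source):

```python
def max_unsigned(bits):
    """Largest value an unsigned integer of the given bit width can hold."""
    return 2 ** bits - 1

print(max_unsigned(16))  # 65535
print(max_unsigned(32))  # 4294967295

# At one byte per character, a line of the 32-bit maximum length
# would need just under 4 GiB of memory.
print(max_unsigned(32) / 2 ** 30)
```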
Sourceforge? (Score:5, Interesting)
I believe the idea has merit and should be done. This would be useful for the developers of many FOSS applications. A "torture test" of nasty Excel files or Word files would help Open Office etc. HTML files would be good for the Mozilla team. Maybe they would be interested in providing the first few sets of data.
I'd also recommend tying the automated regression tests to this open source test data so every developer could download the source & the test data and make sure the new feature doesn't break anything...
Any new troublesome files could be added to the test data and new tests could be built to ensure that the software deals with them.
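A rough sketch of what tying regression tests to a shared data set could look like, in Python. Everything named here is an assumption for illustration: the run_regression helper, the directory layout, and the *.expected.txt naming do not come from any existing project.

```python
from pathlib import Path

def run_regression(convert, data_dir):
    """Run convert() on every *.html file and compare with its expected text.

    For page.html the expectation is read from page.expected.txt.
    Returns the names of the files whose output differs.
    """
    failures = []
    for source in sorted(Path(data_dir).glob("*.html")):
        expected = source.with_name(source.stem + ".expected.txt")
        if not expected.exists():
            continue  # no recorded expectation yet, so nothing to compare
        actual = convert(source.read_text(encoding="utf-8"))
        if actual != expected.read_text(encoding="utf-8"):
            failures.append(source.name)
    return failures
```

A new troublesome file would then be added by dropping it into the data directory together with its expected output, and every project sharing the data would pick it up on the next test run.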
Re:Sourceforge? (Score:2)
Great idea. (Score:3, Insightful)
Re:Great idea. (Score:4, Interesting)
Re:Great idea. (Score:2)
IAWTP (Score:3, Interesting)
Anyone have a less disturbing list of real or fake names? I suppose someone could grab some data from a genealogy site, strip out just the names, and use that.
If anyone knows of (or starts) a project like this I'd probably contribute.
Re:IAWTP (Score:2)
I would just scan and OCR a few pages from a phone book. As far as I know, the data in the phonebook cannot be copyrighted, although there might be some privacy protection laws that forbid keeping databases of personal data withou
Re:IAWTP (Score:2)
There are last names, men's first names and women's first names files.
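Combining name lists like these into throwaway test names only takes a few lines of Python. A minimal sketch, assuming the lists have already been loaded from the files; the fake_names helper and the sample names are made up for illustration.

```python
import random

def fake_names(first_names, last_names, count, seed=0):
    """Return `count` made-up full names built from the two lists."""
    rng = random.Random(seed)  # fixed seed so test runs are repeatable
    return [f"{rng.choice(first_names)} {rng.choice(last_names)}"
            for _ in range(count)]

# Hypothetical sample data; in practice these would be read from the name files.
firsts = ["Alice", "Bob", "Carol"]
lasts = ["Smith", "Jones", "Nguyen"]
print(fake_names(firsts, lasts, 5))
```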
Mozilla has it (Score:1, Informative)
Vidar Braut Haarr
http://www.q1n.org/ [q1n.org]
css zen garden (Score:1)
import os,sys,time,urllib2,urlparse,re
import myutils
baseurl=r'http://www.csszengarden.com/'
for i in range(1,146):
    paddedi=myutils.pad(str(i),3,'0',True)
    url=baseurl + paddedi + '/' + paddedi + '.css'
    print 'trying: ' + url
    try:
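The listing above stops at the try: line. A minimal Python 3 sketch of the same loop, assuming the missing part simply downloaded each stylesheet and skipped failures. css_urls and fetch_all are made-up names, and str.zfill stands in for the private myutils.pad helper on the assumption that it left-pads with zeros.

```python
import urllib.request

BASE_URL = "http://www.csszengarden.com/"

def css_urls(first=1, last=145):
    """Build the stylesheet URLs, e.g. .../001/001.css (zero-padded to 3 digits)."""
    urls = []
    for i in range(first, last + 1):
        padded = str(i).zfill(3)  # assumed equivalent of myutils.pad(str(i), 3, '0', True)
        urls.append(f"{BASE_URL}{padded}/{padded}.css")
    return urls

def fetch_all(urls):
    """Try each URL, save whatever downloads, and skip the ones that fail."""
    for url in urls:
        print("trying: " + url)
        try:
            with urllib.request.urlopen(url, timeout=10) as response:
                data = response.read()
        except OSError as err:  # URLError and HTTPError are OSError subclasses
            print("failed: %s" % err)
            continue
        # Save under the last path component, e.g. 001.css
        with open(url.rsplit("/", 1)[-1], "wb") as out:
            out.write(data)

# fetch_all(css_urls())  # uncomment to actually download the files
```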
Re:css zen garden (Score:2)
those who don't know unix
curl -f 'http://www.csszengarden.com/[001-146]/[001-146].
ok, it does 146 * 145 extra requests (every combination of the two ranges), but hey, who cares !!
btw how can you trust the design advice of a site that has dark brown text on a lighter brown background and grey body text. awful, try reading that when you're over 65 and your eyes get 30% less contrast!
Re:css zen garden (Score:1)
Don't read. "Become one with the web." (SCNR)
Re:css zen garden (Score:1)
Proprietary Problems... (Score:2)
For your purposes, check out www.org . They have "test suites" that check the web standards compliance of browsers, readers, HTML, CSS, etc. I've used them whenever I do web sites as a way of assuring that my displa
yes (Score:4, Funny)
And what about media tags? (Score:2)
Interesting Idea, but basically useless (Score:5, Insightful)
Such a repository would end up as no more than a garbage collection. Additionally, it is generally not too hard to create test data for most projects. Also, the chance that someone else has created test data for the exact problem you are working on is quite slim. And then there is always the most important point of them all:
If someone has already created test data for your specific problem, they have probably already solved your problem! Enter repositories like CPAN and SourceForge.
Re:Interesting Idea, but basically useless (Score:2)
I fear that this is a significant problem, but disagree with some of the rest of your analysis.
If someone has already created test data for your specific problem, they have probably already solved your problem! Enter repositories like CPAN and SourceForge.
You have a powerful point, but that one solution may not work for everyone. It may not be in a suitable programming language or it might be under an unusable license. Also, just because one
Re:Interesting Idea, but basically useless (Score:1)
CPAN and SourceForge host almost only GPL'd (or MIT'd, etc.) code, though. Thus you should not have a problem using it as long as you license the derivative works under an equally or less restrictive license.
Also your point about other solutions being very close to what is needed, but not close enough, was interesting. Such a collection would be far more beneficial if the testing files came with a list of OSS that used them. That way you can see how other developers used the testing code.
A
Re:Interesting Idea, but basically useless (Score:3, Insightful)
Your imagination is pretty limited. This is of use in any area where developers will use similar data for lots of different things, especially in areas of active research. Some examples include:
Re:Interesting Idea, but basically useless (Score:2)
I'm so interested in this that I just registered gpldata.
I'd love to help... (Score:1)
Crystal reports? (Score:1, Offtopic)
Can anyone cook up a script or something? I really can't make sense out of them...
just drop me a note at my gmail if you'd like to try to help.
Just found an existing repository (Score:1)
. For an older copy try the Internet Archive [archive.org].
Re:Just found an existing repository (Score:1)
Great Idea (Score:1)
Use html2text (Score:1)
link [tucows.com]
International languages (Score:1)
Test pages from the W3C (Score:1)
http://www.w3.org/MarkUp/Test/ [w3.org]