Do-It-Yourself Internet Archiving?
A moron asks: "Web pages change and disappear all the time. For legal and historical purposes, I need to have accessible archives of the websites I maintain. I'm basically looking for a do-it-yourself version of the Internet Archive's Wayback Machine, which provides simple versioning and access through a web interface. Is there already software that does this? If not, what ideas does Slashdot have to make such a system possible? How should it work? What existing tools can be used together to make a workable system?"
"There are all sorts of tools out there that will archive web pages, and each has other necessary features such as making links relative. I don't always have filesystem access to the pages, so tools that rely on such access won't work. There are some obvious tools that do part of the job, such as:
But grabbing pages is only part of my needs, and I suspect many other people's. The other pieces include intelligently archiving the pages and making them accessible. If a page or a page element hasn't changed, there is no need to store multiple copies. The archives need to be easy for end users to navigate, search, and link."
In five lines or less... (Score:5, Informative)
ARCDIR=`date +%y%m%d`
cd /var/www/archives
mkdir $ARCDIR
cd $ARCDIR
wget -r http://mysite.com
Add error-checking and season to taste.
If you want to be more efficient, as the poster wanted, you could easily have it always fetch to the same directory and just use CVS to check the results in. This eliminates duplicate storage. There are many free web-based CVS browsers out there with date searching and similar features. It might not be quite as nice as the Wayback Machine, but it definitely does the job for free.
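A minimal sketch of that variant, with the basic error checking mentioned above, assuming the mirror has already been imported into CVS and checked out to /var/www/archives/mysite (the path and module name are made up for illustration):
cd /var/www/archives/mysite || exit 1   # existing CVS working copy of the mirror
wget -r -N -nH http://mysite.com/ || exit 1
cvs -q commit -m "automated snapshot"   # only changed files get new revisions
Files that show up for the first time still need a cvs add before the commit will record them, so a real script would check for unknown files (cvs -nq update flags them with a leading "?") and add them first.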
A lot of folks are doing a simple version of the above to maintain SCO mirrors so there can be no erasing of history before the trial. God bless you all -- it will make the case that much stronger for us.
Re:In five lines or less... (Score:5, Informative)
If archiving SCO or other such pr0n sites, or if you have no-robots policies set on your own site that you're archiving, you'll need to tell wget to be a little rude. He needs to go where robots aren't meant to go. I figure if you were going to visit every page yourself anyway, it's not so impolite. And besides, robots.txt is for other people. You know... the ones we make ride the back of the internet.
To accomplish this: echo "robots = off" >> ~/.wgetrc
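If you'd rather not touch your .wgetrc, the same setting can be passed for a single run with wget's -e option, which takes a .wgetrc-style command:
wget -e robots=off -r http://mysite.com/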
CPAN is your friend (Score:2, Informative)
Re:wget (Score:2)
Put each site in a CVS repository. Check it out, wget the live site, check the copy of the live site back into the CVS repository.
wget is the right path (Score:1, Redundant)
Re:wget is the right path (Score:2, Informative)
But personally I don't think wget and CVS are very helpful in this case. I think it would be better to use something like Perl or Ruby to write a custom spider, and then use cp -lR to make iterative snapshot copies of your working archive tree (you use cp -l so the copies don't take up extra space). This way you can write hooks to test whether content has changed before writing it out -- see the sketch below.
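A minimal sketch of the snapshot half of that idea, assuming the spider maintains a working tree at /var/www/archive/current (the paths are made up for illustration):
cd /var/www/archive || exit 1
mkdir -p snapshots
cp -lR current snapshots/`date +%Y%m%d`   # hard-linked copy: unchanged files share storage
Because the snapshot shares inodes with the working tree, history is only preserved if updates replace changed files (write to a temp file, then rename) rather than truncating them in place -- which is exactly the sort of detail a custom spider lets you control.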
CVS? (Score:3, Informative)
Re:CVS? (Score:1)
There's no real need for console access, unless it's a dynamic site, in which case you need to store the source of your scripts as well as maintain versions of the database!
At this point it's nothing more than keeping multi-versioned backups of your website and database files. Check out rdiff-backup [stanford.edu]
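A quick sketch of how that looks, with made-up paths -- rdiff-backup keeps the newest version as a plain mirror plus reverse diffs, so older versions stay restorable without full copies:
rdiff-backup /var/www /backups/www                          # run nightly; only changes are stored
rdiff-backup --list-increments /backups/www                 # list available snapshots
rdiff-backup -r 2W /backups/www/index.html old-index.html   # restore a file as of two weeks ago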
Best of Luck.
sourceforge (Score:3, Interesting)
Re:sourceforge (Score:4, Informative)
Re:sourceforge (Score:2, Interesting)
Check the URL you referenced, and you'll notice that the last release was made on 2001-11-04. And the code released there is actually even older than that, as the release date got updated when they moved it from the original Alexandria project.
SourceForge intentionally killed off public development of the SourceForge code, and then did an excellent job of convincing people that it was still an Open Source project. They kept promising and promising...
Linkrot (Score:3, Informative)
I like Adobe Acrobat for this job: you just point it at a URL, tell it how many levels you want to archive, and go. You can even archive externally linked pages if you uncheck "stay on same server," or you can select other options like "Archive Whole Site."
Use the Archive's crawler (Score:2, Informative)
Re:Use the Archive's crawler (Score:2)
I need to crawl/archive a set of websites, can I use Heritrix?
Eventually. For now, the crawler is still in early development; you would only want to use the current software if you are comfortable grabbing code directly from CVS, wrestling with incomplete documentation, and running into undocumented limitations.