Developing a Niche Online-Content Indexing System?
tebee writes "One of my hobbies has benefited for 20
years or so by the existence of an online index to all magazine
articles on the subject since the 1930s. It lets you list the
articles in any particular magazine or search for an article by
keyword, title or author, refining the search if necessary by
magazine and/or date. Unfortunately the firm which hosts the
index has recently pulled it from their website, citing security
worries and incompatibilities with the rest of their e-commerce
website: the heart of the system is a 20-year-old DOS program! They
have no plans to replace it as the original data is in an unknown
format. So we are talking about putting together
a team to build an open-source replacement for this, probably using
PHP and MySQL. The governing body for the hobby has agreed to host
this and we are in negotiations to try and get the original data. We
hope that by having volunteers crowd-source the conversion, we will be
able to do what was commercially impossible." Tebee is looking for ideas about the best way to go about this, and for leads to existing approaches; read on for more.
tebee continues:
"It occurs to me that there could be
existing open-source projects that do roughly what we want to do —
maybe something indexing academic papers. But two days of trawling
through script sites and googling has not produced any results.
Remember that here we only point to the original article; we don't have its text online, though it has been suggested that we expand to do this. Unfortunately I think copyright considerations will prevent us from doing it, unless we can get our own version of the Google book agreement!
So does anyone know of anything that will save us the effort of writing our system or at least provide a starting point for us to work on?"
Ask Pubmed guys (Score:3, Interesting)
Ask the guys behind PubMed:
http://www.ncbi.nlm.nih.gov/pubmed [nih.gov]
The database of scientific articles in the field of medicine and biology.
NCBI has the most generous software code licensing that is possible: the code is absolutely free, absolutely no restriction for distributing, changing, selling, even closing it. All because we, taxpayers, paid for it already.
I am surprised none of them has reacted yet; I am sure they read /.
Re:Sphinx or Lucene (Score:4, Interesting)
If this isn't what you have in mind, please elaborate.
Drupal, hands down. (Score:2, Interesting)
Wayback (Score:4, Interesting)
Re:Sphinx or Lucene (Score:4, Interesting)
If you have relatively little but highly structured data, running it through a general search engine like Lucene or Sphinx doesn't seem like the ideal solution, because it doesn't make it easy to do structured queries ("give me all articles in Magazine including 'foo' in the title, published between 1950 and 1966").
A bibliography indexer would probably be a better choice. Two good free ones are Refbase [refbase.net] and Aigaion [aigaion.nl]. Both are targeted mainly at databases of scientific literature, though, so they might need some tweaking for this purpose.
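To make the "structured query" point concrete, here is a minimal sketch of that kind of lookup against a plain relational table. It uses Python with SQLite purely for illustration; the table layout, the column names, the `build_index` and `search` helpers, and the sample rows are all assumptions made up for this example, not the format of the original index or the schema of Refbase or Aigaion.

```python
import sqlite3


def build_index(rows):
    """Create an in-memory article index (hypothetical schema, for illustration)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE articles ("
        " id INTEGER PRIMARY KEY,"
        " magazine TEXT NOT NULL,"
        " year INTEGER NOT NULL,"
        " title TEXT NOT NULL,"
        " author TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO articles (magazine, year, title, author) VALUES (?, ?, ?, ?)",
        rows,
    )
    return conn


def search(conn, magazine, keyword, first_year, last_year):
    """Return (year, title, author) rows for one magazine, a title keyword
    and an inclusive year range, oldest first."""
    cur = conn.execute(
        "SELECT year, title, author FROM articles"
        " WHERE magazine = ? AND title LIKE ? AND year BETWEEN ? AND ?"
        " ORDER BY year, title",
        (magazine, f"%{keyword}%", first_year, last_year),
    )
    return cur.fetchall()


# Made-up sample rows; none of these are real articles.
SAMPLE_ROWS = [
    ("Magazine", 1948, "Early foo designs", "A. Writer"),
    ("Magazine", 1955, "Building a foo", "B. Author"),
    ("Magazine", 1960, "Bar maintenance", "B. Author"),
    ("Other Monthly", 1958, "Foo for beginners", "C. Scribe"),
    ("Magazine", 1970, "Foo revisited", "A. Writer"),
]

if __name__ == "__main__":
    conn = build_index(SAMPLE_ROWS)
    # "All articles in Magazine with 'foo' in the title, 1950 to 1966."
    for row in search(conn, "Magazine", "foo", 1950, 1966):
        print(row)
```

The same WHERE clause should carry over to MySQL more or less unchanged. If full text or scans are added later, a full-text engine could probably sit alongside a table like this rather than replace it.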
Re:Sphinx or Lucene (Score:3, Interesting)
Like someone else pointed out, though, if at some point he expects to get access to the full text or even just scans of the articles, he'd better have chosen a system that can easily expand to handle that.
hoarding == massive replication (Score:3, Interesting)
Re:Sphinx or Lucene (Score:3, Interesting)
I do the same thing for tropical fish and wrote a shitload of C code. If this is an old DOS program it should port to C/UNIX really stupid easy.
Drop me a line if you want to and I'll ask you to send me some sample data. This might be really easy.
Re:Just migrate it to VMware or KVM (Score:3, Interesting)
If you do get the original data, I'll volunteer to either disassemble the exe or RE the data format or preferably both. Just for the fun of it. Contact me at the /. nick over in the google mail system.
Offer to let them host a redirect if they want - interstitial advert page with a 'we have moved', and offer to redirect to that page if they are not the referrer for a certain timeframe. They get some advert money, you get the data, I have something to entertain myself with.
Gimme just the DOS program at least, I'll get you the format.