Developing a Niche Online-Content Indexing System? 134
tebee writes "One of my hobbies has benefited for 20
years or so by the existence of an online index to all magazine
articles on the subject since the 1930s. It lets you list the
articles in any particular magazine or search for an article by
keyword, title or author, refining the search if necessary by
magazine and/or date. Unfortunately the firm which hosts the
index have recently pulled it from their website, citing security
worries and incompatibilities with the rest of their e-commerce
website: the heart of the system is a 20-year-old DOS program! They
have no plans to replace it as the original data is in an unknown
format. So we are talking about putting together
a team to build a open source replacement for this – probably using
PHP and MySQL. The governing body for the hobby has agreed to host
this and we are in negotiations to try and get the original data. We
hope that by volunteers crowd-sourcing the conversion, we will be
able to do what was commercially impossible." Tebee is looking for ideas about the best way to go about this, and for leads to existing approaches; read on for more.
tebee continues:
"It occurs to me that there could be
existing open-source projects that do roughly what we want to do —
maybe something indexing academic papers. But two days of trawling
through script sites and googling has not produced any results.
Remember that here we only point to the original article, we don't have the text of it online, though it has been suggested that we expand to do this. Unfortunately I think copyright considerations will prevent us from doing it, unless we can get our own version of the Google book agreement!
So does anyone know of anything that will save us the effort of writing our system or at least provide a starting point for us to work on?"
It would help (Score:3, Insightful)
Developing a Niche Online-Content Indexing System? (Score:4, Insightful)
File format, not the implementation details (Score:3, Insightful)
Using a Howitzer to Hunt Squirrels (Score:3, Insightful)
Lots of people here are recommending using tools that are built for very large scale projects. Based on the fact you have a DOS based system that likely used a pretty common library for storing the data (something like c-tree, btrieve, a dbase library or simply saving binary data using whatever language the app was written in), using any RDBMS like MySQL or even SQLite probably would do the job. PHP, Python, Ruby and Perl would probably make writing the actual application a snap - and be able to handle more of a load that the DOS app could.
Here's to hoping you can get the data. Hopefully the vendor that pulled the database down realizes how important to marketing it is and reverses course.
No, no, NO! (Score:4, Insightful)
Your suggestions make sense, but suggesting to store comma-delimited plain text in a SQL table is wrong by any and all database standards & best practises. You fail to reach even the first normalized form.
Read http://en.wikipedia.org/wiki/Database_normalization [wikipedia.org]
You want to define a table "tags" or something with id, article_id, name, comment. Make the combination of id, parent_id, name unique.
* id is on auto-increase, not NULL
* article_id is a foreign key to the id of the article, not NULL
* name is the name of the tag, not NULL
* comment is an optional comment explaining the tag (for example in the mouse-over or on the site listing everything with that tag), may be NULL
Not only is that easier to maintain in the long run (think of parsing plain text out of a VARCHAR. argh!), but all of a sudden, you have the data you _store_ available to _access_.
How many artcles are tagged electric? SELECT count (1) FROM article_tags WHERE name = "electric";
Give me a list of all article relating to foo and bar? SELECT article_id FROM article_tags WHERE name = "foo" OR name = "bar".
etc pp.
If you want to go really fancy with multi-level tags, replace article_id with parent_id (referring to the id in the same table) and create a relation table as glue. If you want all upper levels to apply, throw in a transitive closure:
http://en.wikipedia.org/wiki/Transitive_closure [wikipedia.org]
Generally speaking, you want a table for magazines with their names, publication dates, publisher, whatnot; and only refer to them via foreign keys. Same goes for train models (which you could cross-ref via tags. Yay for clean db design!), authors, collectors, train clubs and and pretty much everything else.
One last word of advice: No matter what anyone tells you: Either you use a proper framework or you _ALWAYS_ use prepared statements. You get some performance benefits and SQL injection becomes impossible, for free! Repeat: Even if you ignore all the other tips above, you _MUST heed this.
http://en.wikipedia.org/wiki/SQL_injection [wikipedia.org]
Richard
PS: You are more than welcome to reply to this post once you have your DB design hammered out. I will have a look & optimize, if you want.