
Using Relational Databases as Virtual Filesystems?

Pogie asks: "At my office, we've got what one could only describe as a huge network-attached storage infrastructure. We're talking multiple terabytes of applications, user trees, data files, Sybase and Oracle databases, etc. 'In the beginning' it was a conscious decision to create a shared NFS infrastructure using NetApp Filers (I humbly recommend them over SAN solutions any day...flame on!), but our data center has grown so large, and there are so many interdependencies, that we're becoming concerned that if the wrong filer goes down, our production network would be, to say the least, hosed." To combat this problem, Pogie wants to implement his filesystem in a relational database...Oracle, to be precise. Read on for his reasoning.

"To conquer our fears we're trying to get a handle on exactly what is where, with the goal of reorganizing the true physical locations of data to minimize the business impact if any single NFS server goes down. At the moment, the plan of attack is to construct a relational Oracle 8.1.6 database on Linux which will mirror the filesystem in a DB. To accomplish this, I'm writing a horde of scripts using the Perl DBI which will poll the entirety of the NFS filesystems on our network and create what basically amounts to a virtual filesystem in the DB, which we can then drill into for specific information in much less time than it would take us to search through the actual filesystems in question. In addition, we gain the ability to maintain historical data, which allows us, among other things, to know exactly what went wrong if a luser rm's, mv's, or cp's the wrong thing to the wrong place.
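A rough sketch of the kind of crawler being described (the poster uses Perl DBI against Oracle; this substitutes Python and SQLite so it runs standalone, and the table and column names are purely illustrative, not his actual schema):

```python
import os
import sqlite3

def build_catalog(db_path, roots):
    """Walk each root and mirror basic file metadata into one table."""
    con = sqlite3.connect(db_path)
    con.execute("""
        CREATE TABLE IF NOT EXISTS files (
            path  TEXT PRIMARY KEY,  -- full pathname
            size  INTEGER,           -- bytes
            mtime REAL,              -- last-modified time
            uid   INTEGER            -- numeric owner
        )""")
    for root in roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                full = os.path.join(dirpath, name)
                try:
                    st = os.lstat(full)
                except OSError:
                    continue  # file vanished between readdir and stat
                con.execute(
                    "INSERT OR REPLACE INTO files VALUES (?, ?, ?, ?)",
                    (full, st.st_size, st.st_mtime, st.st_uid))
    con.commit()
    return con
```

Once populated, a question like "which files over a gigabyte live under /home?" becomes a single indexed query against the catalog instead of a find crawl over NFS.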

Has anyone tried this before? And is this even a good idea? Does anyone know of existing packages that will do this? I'm really curious what the Slashdot community thinks of the idea. I was several hours into this before someone said to me, 'Do you realize you're writing a filesystem in SQL?'"

  • Take a peek over at linux-ha.org [linux-ha.org] and do some reading.

    Distributed filesystems like CODA and/or GFS might go quite a ways to solving some of your problems without resorting to implementing a filesystem on Oracle. In the end, what extra capabilities is Oracle going to give you that the right sort of filesystem wouldn't?

    Plus, if your data should be stored in a database, then store it in a database. Don't store database data in a filesystem, and don't store filesystem data in a database. They're two very different ideas for data storage (hierarchical vs relational is just for starters).

    Perhaps simply sitting back and doing a bit of reorganising and replication (produce some read-only mirrors using rsync, or use the network block device to do some network-raid or something) will solve all your problems.
    • Re:HA Linux (Score:3, Insightful)

      by sigwinch ( 115375 )
      In the end, what extra capabilities is Oracle going to give you that the right sort of filesystem wouldn't?

      1. Consistent backups are trivial. Are there any common filesystems that provide this?
      2. Applications can execute atomic transactions that involve multiple files and directories.
      3. It is easy to keep older versions of files around and do undeletes.
      4. If you keep the data and metadata in separate tables, it is easy to create totally different views of the same files. Dunno if this is useful...
      5. You can do access controls as appropriate to your problem.

      I'm not necessarily advocating RDBMS-as-filesystem, but the idea does have some merit.

      They're two very different ideas for data storage (hierarchical vs relational is just for starters).

      Hierarchical is a special case of relational: 1) Each item has a foreign key for its parent directory, or NULL if it's in the root. 2) There is a UNIQUE constraint on foreign key + item name.
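That encoding can be sketched directly. Below is a minimal illustration of the parent-foreign-key-plus-UNIQUE scheme just described (SQLite via Python for convenience; all names are invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE node (
        id     INTEGER PRIMARY KEY,
        parent INTEGER REFERENCES node(id),  -- NULL means "lives in the root"
        name   TEXT NOT NULL,
        UNIQUE (parent, name)                -- siblings must have distinct names
    );
    INSERT INTO node (id, parent, name) VALUES (1, NULL, 'usr');
    INSERT INTO node (id, parent, name) VALUES (2, 1, 'local');
    INSERT INTO node (id, parent, name) VALUES (3, 2, 'bin');
""")

def resolve(path):
    """Walk a /-separated path one component at a time, namei()-style."""
    node_id = None  # None stands for the root
    for part in path.strip("/").split("/"):
        row = con.execute(
            "SELECT id FROM node WHERE parent IS ? AND name = ?",
            (node_id, part)).fetchone()
        if row is None:
            return None  # no such entry
        node_id = row[0]
    return node_id
```

One wrinkle worth noting: many engines (SQLite included) treat NULLs as distinct in UNIQUE constraints, so enforcing name uniqueness at the root level takes a sentinel root row or an extra check.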

      • Consistent backups are trivial. Are there any common filesystems that provide this?

        VxFS does for a start. My first port of call would be to check whether the freevxfs filesystem in Linux supports the point-in-time copies of the real thing. Or wait for Veritas's official release (which is purely a marketing issue -- it's been running on Linux for a long time inside Veritas). Other than that, maybe check if XFS or JFS support similar features.

  • Or rather, too much effort in the wrong direction. I can't imagine this would work very well.

    You don't need to stuff your file system data into a database, you need to investigate high availability .. ya know, failover systems, redundant everything, etc.

    My biggest objection would be that you're violating the KISS rule, and making life a living hell for whoever follows you.

    Or, maybe I just have a limited imagination.

  • Oracle iFS (Score:3, Insightful)

    by sohp ( 22984 ) <snewton@NoSPAM.io.com> on Monday December 31, 2001 @02:09AM (#2766771) Homepage
    Why yes, this has been done, and you can get it from Oracle, under the name Oracle Internet File System [oracle.com], and I've played with it a bit. Interesting concept, not a very robust implementation, but perhaps it's gotten better since I tried it under 8.1.7? It's kind of neat to be able to mount a drive under windows that's really data in an Oracle table.
  • I think you may have a faulty assumption here or there--you don't like centralized filesystems, but want to use a relational datastore to back up a distributed filesystem?

    A relational database provides three things:

    • Relational access to data. This is what you pay Larry E. money for.
    • (Sometimes) transactional support. This is what you pay Larry E. more money for.
    • Persistence. This is handled by the datastore underneath the RDBMS. In 99% of cases, this is a (relatively) centralized filesystem. Everything eventually ends up as bits on a big disk somewhere, be it my hard drive or a networked RAID array.

      If you are using a database to back up a filesystem, wouldn't it be easier to just go with a centralized filesystem, and forego the Relational Layer and transaction support which you:

    • Don't seem to need, and
    • Are paying a lot per CPU for?

      Why don't you just use a hardware-redundant centralized filesystem in the first place?

  • Just store every file in a huge table with two columns: Key and File. The key could be a URL, like "file:///usr/local/blah" and the file would be the file. Of course you'll need to text-index the keys so that you can search for "file:///usr/*", but since you would only be making a single call (instead of call after call as you recurse down directory tables) it would make for a very simple system that would be fast as fast can be.
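A sketch of that two-column scheme (Python/SQLite as stand-ins; the comment names no real schema). One caveat the comment glosses over: whether a prefix search can actually use the key index depends on the engine and its collation rules, so "fast as fast can be" is not guaranteed.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE blob (key TEXT PRIMARY KEY, file BLOB)")

# Toy contents standing in for real files.
files = {
    "file:///usr/local/blah": b"contents of blah",
    "file:///usr/local/other": b"other contents",
    "file:///home/pogie/notes": b"remember the filers",
}
con.executemany("INSERT INTO blob VALUES (?, ?)", files.items())

# One prefix query replaces a recursive descent down directory tables.
hits = [k for (k,) in con.execute(
    "SELECT key FROM blob WHERE key LIKE ? ORDER BY key",
    ("file:///usr/local/%",))]
```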
  • It sounds an interesting idea, but it's not clear what you hope to gain from having the data in a relational database.

    A file and its associated details are atomic, with only one relation: its parent directory. A relational database doesn't provide much advantage over, say, an xBase implementation. There are some incredibly fast implementations (like CodeBase [sequiter.com] from Sequiter [sequiter.com]).
    This will provide lightning fast lookups of files whose names are indexed.

    On the other hand, if you were going to add some additional functionality to your 'filesystem database', such as the ability to make useful links and associations, like 'see also' or 'see related documents' or 'see documents by this author', then a relational database is just the trick, as you can make 'many to many' joins between the various files on your network. Add a simple Web interface and voila! you have a really neat way of navigating through the filesystem.

    Another cool thing would be some kind of 'system monitor' that notes changes to important files on servers and the like. While there are third-party tools that do this, you can really customise it if you build it yourself.

    And of course, there is the ability to add notes, web links, references etc in each file's db record. This can speed up decision making on whether to look at a document or not.

    And lastly, you can index the text inside the documents for really fast text searches that return document references. Again, there are third-party tools that already do this, but most of them are proprietary.

    The major problem you will face is 'currency'. That is, how often do the update scripts run? Once a day? Then you can't search for documents created this morning. Once every ten minutes? Then you chew up bandwidth and CPU, slowing down the servers and the network.

    And of course if one of the servers goes down just before you run the scripts, then you lose the info for that day.

    It would be good if you could build some kind of 'hook' in the 'writefile API' that triggers some kind of event or signal for your script/whatever to add the file details to the database 'realtime'. But I have no idea how that could be done.
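Short of a real write hook in the NFS layer, a cheap middle ground for the currency problem is an incremental pass that only rewrites rows for files whose mtime is newer than the last run. A sketch (assuming a `files` table like the one the question describes; the schema here is invented):

```python
import os

def incremental_scan(con, root, last_run):
    """Re-stat the tree, but only rewrite rows for files modified since
    last_run (a Unix timestamp). Cheaper on the database side, though
    the stat() traffic over NFS remains."""
    changed = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                st = os.lstat(full)
            except OSError:
                continue  # vanished mid-scan
            if st.st_mtime > last_run:
                con.execute(
                    "INSERT OR REPLACE INTO files (path, size, mtime) "
                    "VALUES (?, ?, ?)",
                    (full, st.st_size, st.st_mtime))
                changed += 1
    con.commit()
    return changed
```

This still misses deletions (a separate sweep over catalogued paths would be needed) and still walks the whole tree, but it lets the scripts run far more often without hammering the database.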

    Anyway, just my thoughts.
    • Wouldn't it be better to use a trigger to solve the currency issue?

      I can see a couple of advantages to doing this though, if you keep the user files in the database you could add a whole bunch of fancy features that let users search, (or have agents search on their behalf) for different items, soundex indexing, funky notification and monitoring systems, the list is endless and mostly YAGNI I suppose, but it sure sounds cool.

      I'm guessing that this makes more sense the more integrated your user's environment is; if they are using a smorgasbord of tools and they are installing their own, a fancy filesystem is likely just another opportunity for mayhem. If, on the other hand you're building an integrated work environment that they are going to be working inside of fulltime, then this starts to have advantages. Although for a solution of the second kind it might be a better payoff to use some sort of persistent object storage.

    • A> Why use a relational database to store hierarchical data? An HDB is *much* faster than a comparable RDBMS, because its architecture is much simpler.
      B> All of the things that you mention could be solved with a good, high-end file system. Forgive me for introducing a non-Unix concept, but I didn't have the chance to work with file systems there. But on NTFS, you can have pretty much anything you list here, including triggers (called reparse points) and a change journal; additional properties and notes can be handled via streams.

      Other systems, for example, provide a more complete solution, in that they journal data (as opposed to journaled metadata only on NTFS), which can help increase your stability.

      All in all, I don't see the reason to go with an RDB instead of a good, high-end file system or an HDB.
      • There are plenty of reasons to want to use an RDBMS.

        If you are working with financial data or health records, there are Federal reporting requirements relating to who accesses what data where & when. Using a central Oracle or other RDBMS makes it easier to keep track of what's up.

        Why Oracle? Maybe the organization has a bunch of PL/SQL gurus. Maybe having Java integrated into the DB is advantageous. Or maybe they have a giant Oracle server sitting around with extra cycles.
      • It isn't necessarily hierarchical. The relationship between a file/folder and its parent folder is obvious. But there is also a relationship between the current contents of the file and the contents as they existed last Friday at 1:00pm. There are relationships between the author and the document, the document and the way it was produced, etc, etc.

        You should read about Oracle's Internet File System [oracle.com] to see what is possible with their technology.

        In particular, you could implement nearly-transparent versioning, process control ("my boss must approve this document before I can publish it on the website"), integrated security, etc. In addition, iFS can do full-text-indexing and many other neat things automatically since those features are built into the Oracle database.

        On top of that, it is all done in a platform-independent way, and on some platforms (e.g. Windows) it is even completely integrated.

        But I imagine a $100,000+ filesystem must be a hard sell. (Of course, it could cost less or much more than that, depending on the server you install it on).

        • But I imagine a $100,000+ filesystem must be a hard sell. (Of course, it could cost less or much more than that, depending on the server you install it on).

          Have you priced this or are you just guessing? $100K isn't actually all that much compared to the amount of developer time to build something similar.
          The problem with Oracle is that it's hard to get any kind of pricing structure out of them, tiered or otherwise.
          • Well, I know we got a quote on the Oracle database that was $100,000 for one dual-PIII Windows NT server. iFS requires a license for the Oracle database, but I don't think it requires a separate license as well.

            It is true that $100,000 isn't that much if you are just going to implement the same thing anyway. My statement about it being a hard sell was more a reference to this line of thinking, which I don't particularly agree with: "every operating system comes with a file system for free, and most operating systems are (nearly) free, but you want me to pay $XXXXX for only part of an operating system? You are nuts!"
            • It is true that $100,000 isn't that much if you are just going to implement the same thing anyway. My statement about it being a hard sell was more a reference to this line of thinking, which I don't particularly agree with: "every operating system comes with a file system for free, and most operating systems are (nearly) free, but you want me to pay $XXXXX for only part of an operating system? You are nuts!"

              I believe you would need to sell it as a document/information management system. Of course, for the $100K you could develop your own really neat system in-house (assuming competent developers), and tailor it specifically to your organisation's needs/structure.
  • In one of my earlier projects, I developed a database-driven file "storage" system (I wouldn't call it a file system) to store content management data in a way filesystems usually don't provide (yet). Each file had different attributes like Creator, Created, Active from/to, Object Type, Language, Version, Title and Description, complex ACLs, and so on. I only did this because it is useful in that context, but I wouldn't use it as a replacement for a filesystem. The key->file idea looks like a nice solution at first, but you will quickly learn that it isn't a very performant one. As one of the previous comment posters wrote: use a filesystem for what it was designed for, and use a database for what it was designed for. You have Unix! Just glue the existing components together instead of writing a proprietary application to accomplish this.
  • Why relational? (Score:3, Insightful)

    by cperciva ( 102828 ) on Monday December 31, 2001 @03:42AM (#2766875) Homepage
    IANADE (... Database Expert), but...

    Why are you considering a *relational* database? Unless you're planning on completely changing filesystem semantics I don't see why you wouldn't just use a simple hierarchical database.

    I mean, seriously, you want to have a filesystem which acts like a distributed database; but you don't really need to be able to run RDBMS queries do you? You'll probably end up with a much better result if you work down a checklist and decide which database features you want and which will just add bloat.
  • by one-egg ( 67570 ) <geoff@cs.hmc.edu> on Monday December 31, 2001 @04:23AM (#2766928) Homepage
    Back in the 60's, there was the Michigan Terminal System, out of the University of Michigan. Their filesystem was DB-based. That was before relational became the "in" thing, so it was ISAM.

    It was an interesting idea. I think that the problem they had in MTS will be the same with your idea: not everything fits neatly into the DB model. In fact, some things really have to be shoehorned in.

    The insightful reader will be saying, "But wait! You also have to shoehorn stuff into the conventional FS model." True enough. The question is how much fits naturally and how much has to be shoehorned.

    My contention is that the conventional model is a better fit for most stuff. That's especially (perhaps sadly) true because of legacy software that expects the conventional model. Perhaps a ground-up OS and application implementation would be able to rethink some of those issues and find new insights. But I'm naturally skeptical.

    There is also the issue of performance. I know little about DBs (my loss), but it seems to me that if the FS is stored in an existing relational system, you're going to have to warp some stuff to make it fit. I'd suspect that either you're going to have to make every file be a different table, or you're going to have to store the contents of every file as a variable-length text field. Either option is going to have really nasty effects on the efficiency of the DB, which has been highly optimized under the assumption that each table contains tons of highly homogeneous records.

    I wouldn't want to dive into that kind of can of worms as an "I want to use it in production" project. It might make interesting research on a 5-year horizon, though.

  • I see nothing wrong with what you are doing, and it is one heck of a good idea. I really wish I had the time to go in and write some little drivers that would journal the addition/deletion of files and folders to the personal SQL server on my PC, so that when I do my (frequent) searches for files, they would be quick. Locate is nice, but it isn't real-time...

    You'll get good info, and the info is the most important part for doing a good job of reorganizing things. Ad-hoc can be fine when everybody is responsible for their own stuff, but when the whole system is supposed to work cohesively, nothing is as cool as a really well-engineered large system that is bulletproof, and you can't do that without really good planning.

    Maybe off topic...

    I think the whole "files and folders" system is artificial anyway. We had no concept of files at first, until the technology on the mainframes was good enough that we were able to finally put some structure around the data. Then we started categorizing the files, and eventually the device, folders/directories, etc. structure evolved. In addition, we have this server hierarchy, with mount points, etc. It is a lot more complex, and somewhat more capable of organizing our data, but it still isn't even close to how we really think. It was invented because it was a good way to organize that was efficient for the horsepower available at the time. Now we're somewhat entrenched by it.

    Database filesystems are a much more natural way to do things, though. How do you categorize your MP3s? By Hard rock/Soft Rock/Pop/Oldies? By Artist? By title of the song? With folders you have to pick one method. With a database, you can switch anytime.

    Now that the computers are capable of it, we're starting to move in the direction of database filesystems already. MP3 categorizers are coming in quickly, as are filesystem indexers (locate, and MS's indexing server). Handheld devices kind of go by a "database" filesystem as well.

    I envision a filesystem as follows: a flat set of files, each with a serial number (inode number basically). Also, a database that associates each inode with any number of attributes. There are certain pre-defined attributes with globally well-understood meanings, and namespace rules about defining new attributes and personal-use attributes. The attributes can themselves have attributes (more on this later).

    Attributes include file names (as opposed to "filenames," though similar), creation/modification/access dates, owner, comment, file type, keywords. And a file may or may not be assigned a "Default open with" attribute... Or maybe 2 or 3 (in which case a list box would pop up when you double clicked on the file!)

    If the file didn't have a "default open with," what then? Well, it probably has a file type attribute. Say, File Type text/plain. Well, the attribute text/plain has a "default open with" attribute of "GVim.EXE." Well, cool!

    This would be nice in a lot of places: source code control would be very simple. Just put a "version" attribute on, and move the "latestversion" tag around. Also, this eliminates the need for multiple copies of a file in a build environment (well, theoretically, you should be able to eliminate this by proper engineering, but we all know that sometimes you miss something and have to copy a file somewhere...) -- you just refer to the same file twice.

    That is my idea of how the database file system should work. Of course, it really messes with our current paradigm, and it introduces some problems (bye bye canonical pathnames for files!) but sometimes I really wish I had a database behind my filesystem, not a hierarchy.
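The inode-plus-attributes idea above, including attributes on attribute values (the "text/plain has its own default open with" trick), can be mocked up in a single self-referential table. A hypothetical sketch (names invented, SQLite for convenience):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE attr (
        subject TEXT NOT NULL,  -- 'inode:N' for a file, or an attribute value
        name    TEXT NOT NULL,
        value   TEXT NOT NULL,
        UNIQUE (subject, name)
    );
    -- file 101 is plain text and has no "default open with" of its own
    INSERT INTO attr VALUES ('inode:101', 'name', 'notes.txt');
    INSERT INTO attr VALUES ('inode:101', 'file type', 'text/plain');
    -- the attribute *value* text/plain carries its own attribute
    INSERT INTO attr VALUES ('text/plain', 'default open with', 'gvim');
""")

def open_with(inode):
    """Per-file 'default open with' first; otherwise fall back to the one
    attached to the file's type attribute, as the comment describes."""
    q = "SELECT value FROM attr WHERE subject = ? AND name = ?"
    row = con.execute(q, (f"inode:{inode}", "default open with")).fetchone()
    if row:
        return row[0]
    ftype = con.execute(q, (f"inode:{inode}", "file type")).fetchone()
    if ftype:
        row = con.execute(q, (ftype[0], "default open with")).fetchone()
        if row:
            return row[0]
    return None
```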
    • by Anonymous Coward
      I'm not sure, but isn't this in some ways how BFS (the Be File System) is organized? I mean, a BFS file could be associated with arbitrary amounts of metadata, one of which was the "default open with" mentioned in the post I'm responding to.

      In his article comparing Mac OS X to BeOS [birdhouse.org], Scot Hacker (author of the BeOS Bible) even used the MP3 example to demonstrate the flexibility of BeOS metadata. He demonstrated that, using only metadata, he could effectively mirror the ID3 tags in his MP3 collection and query by artist/genre/year/etc. directly from the Tracker (the BeOS desktop application). How BFS goes about handling folders I know nothing about.
      • Don't forget that NTFS is extensible with attributes. Even the file data is an attribute. It stores attributes nicely by packing smaller ones into blocks (a small file will fit in the same block as its metadata). Also, a file can have multiple streams all contained within the same file. As usual it's the tools that let us down. Windows doesn't give us access to much of that.
    • If you run Windows 2000 you can turn on the Indexing Service that does full-text and categorized indexing of your file systems. The index can even tell you what HTML files contain elements with certain text.

      If you are running NTFS the indexing service can intelligently update the index by reading the journal.

      Given the right pluggable filter you are able to index MP3 metadata too.
  • sounds crazy (Score:1, Insightful)

    by DrSkwid ( 118965 )
    i can understand why one would want to put the database into the filesystem but to put the filesystem into the database!!

    you hide the data from many of the day to day tools people are used to using

    cat, more, >, >>, awk, sed, grep, join, and a host of other cli tools.

    hierarchies work and people understand them. How are you going to fit the filesystem tree into RDBMS tables? It would keep Codd himself burning the midnight oil.

    You might be getting yourself some extra sleep at night secure that your Oracle won't fail you but other tools are around to help keep a filesystem available (as other posters have mentioned).

    Think about what you are throwing away and locking in to. All new employees will have to learn Oracle SQL to get to files, a delete with a badly phrased where clause could cost you plenty of time (and don't say you haven't done it).

    A simple example
    vi /home/data/somefile.txt
    or
    select data from home-data where filename='somefile.txt' > tmp
    vi tmp
    sed tmp -f escapescript.sed
    copy the output to the clipboard
    update home-data set data = 'paste here'

    Ok I realise that a few scripts can replace it but please, what's the gain?

    The expertise of your internal support staff will need to be at the advanced oracle user level.

    But, hey, go for it

    come back in a year and let us know how it went
    • I think he's talking about references to files, like a RDBMS filesystem Roster or something...not the actual files ;)
      • oh yeah :)

        It's just locate under another name and a smaller cron interval.

        The way to really do it would be hack up your own nfs clients that did the logging for you rather than polling.

        but now I get what he's getting at I'm not sure I know what the big deal is.

        Certainly nothing like a "Virtual Filesystem".

        It's just an expensive way of running ls on a cron every night/hour and using awk, sed, grep & join on the plain text output

        if you really are interested in virtual filesystems though, then Plan 9 is the place to look. Each process gets a file namespace. Using bind, attach & union one can build different views of the same namespace.

        Each namespace is a combination of files on disk and programs presenting files and directories.

        All file access is via a protocol, 9P, so you can write your own file servers to present data as files and directories.

        they say it better than me [bell-labs.com]
    • Re:sounds crazy (Score:2, Informative)

      by rfreynol ( 169522 )
      Actually, if he used Oracle iFS (Internet File System) the files could reside in the DB (security and the ability to get good, clean backups) while being mountable via many protocols (NFS, Samba, HTTP, FTP, etc.)

      I often hit files with grep/sed/awk from a Unix box that has an iFS filesystem mounted via NFS.

      The solution to his problem is iFS, and he probably already has a license for it.
  • Although you have an investment in Oracle, have you looked at MysqlFS [memes.net]? MySQL is perhaps not the most functional of RDBMSs, but it's certainly very fast and may provide all of what you require without wasting your expensive Oracle licenses.
  • Using a database as a filesystem is a bad idea if there is a lot of traffic. Database I/O is generally slower than filesystem access. My former internet company would agree with this, as we tried to keep DB access as small as possible. One solution was to cache tables in memory to make access faster, with timed updates back to the database.

    I'd recommend setting up failover servers instead of DB access.

  • The way you propose your idea - without any details - it's hard to tell if your data really does fit a DB model well. If your data is just "all the files", it probably doesn't map well.

    It's also not clear what problem you're trying to solve. Is it a problem in administering the large amounts of data? Is it a problem in managing the complex relationships between the data? Is it just a worry about the reliability of the storage media?

    If you've been using part of your filesystem *as* a database (i.e. very large numbers of files in a directory, where the filename is a key) then you may come out ahead by putting the data where it really belonged in the first place, a database!

    It may very well be time to closely scrutinize what all the files you have are and what they contain. Store files with complex interrelationships to each other on the same machine; store other groups with little relations somewhere else. Draw ER diagrams, do it in UML, whatever tool you like. But you won't solve a lack of understanding of all your files by hiding them all behind a database layer of abstraction, you'll just be brushing everything under the rug.

  • Managing terabytes of data in Oracle is not such an easy task. You're talking significant backup issues (e.g. each backup is going to be huge).

    Consider putting the metadata (name, categorisation, ACL, etc.) in the database and keeping your files on your NetApps. The database would contain a lookup to the NetApp file.

    You would gain better management, searching, and integration of your data.

    You would still be at risk if key NetApps filers went down. (But perhaps some kind of replication based on the database metadata could sort that out...?)
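A minimal illustration of that metadata-catalogue approach: the database holds names, categories, and a pointer to whichever filer actually stores the bytes (illustrative schema and names only, SQLite standing in for Oracle):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE doc (
        id       INTEGER PRIMARY KEY,
        name     TEXT NOT NULL,
        category TEXT,           -- free-form categorisation; ACLs etc. could follow
        filer    TEXT NOT NULL,  -- which NetApp filer holds the actual bytes
        path     TEXT NOT NULL   -- path on that filer
    )""")
con.execute(
    "INSERT INTO doc VALUES "
    "(1, 'q4-report', 'finance', 'filer03', '/vol/vol1/reports/q4.doc')")

# Search the metadata centrally, then fetch the file from whichever filer holds it.
filer, path = con.execute(
    "SELECT filer, path FROM doc WHERE name = ?", ("q4-report",)).fetchone()
```

The design choice here is that a filer outage costs you access to its files only, not to the catalogue: searches still work, and the rows tell you exactly which data is affected.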
  • CVS (Score:2, Interesting)

    by DrZaius ( 6588 )
    Why not use CVS?

    We don't use Filers where I work, but just the local file system. Our CVS root looks something like this:

    systems/$hostname
    systems/$hostname/home
    systems/$hostname/etc
    systems/$hostname/var

    and so forth. You can then add files as you want -- if you don't want to back up everything in /etc as it was created automagically, then don't.

    This setup uses a very mature piece of software. It has a lot of nice interfaces (cvsweb, WinCvs). It also gives your users the ability to pull their own backups and do branching of their home directories :)

    It works really well as you can back up the central repository at what ever frequency you want.

    It may take a lot of storage to back it all up, but it will probably be smaller than an oracle database and you will get diffs for your history.

    • by c0d1 ( 71507 )
      Having a transparent version control system behind my filesystem is an interesting idea that I've mused over a number of times in the past (of course, you might not want this for certain rapidly changing, unimportant data).

      However, there are certain flaws to CVS that speak against using it as the VCS. In particular, its biggest flaw, the inability to cope with directory structure modification, is even worse for this use than for software development.

      I'm curious how your system at work deals with this. I do a similar thing as well, but only for /etc and a few other areas which carry system configuration; these never change in structure. I can't imagine how you cope with user directories.

      It's an expensive commercial product, but, if I were going to go this route (and had the budget to justify it), I would want to use something like Rational's ClearCase.

      ClearCase manages structure changes quite robustly and is presented as a filesystem to begin with. The view concept is powerful. If it were combined with regular automatic version set labelling, you could quickly go check out the state of your filesystem from any particular day by simply using a different rule set.

      Actually, to wander slightly more off topic, does anyone know of a SCM/VCS that provides ClearCase's capabilities for a cheapskate's budget?
  • Reiser FS (Score:2, Interesting)

    by CaraCalla ( 219718 )
    The guy who wrote ReiserFS has quite some things to say on this very subject. What he says (in short) is:

    .) Databases shouldn't exist. They exist only because System-Programmers didn't listen to Application-Programmer's needs.

    .) Those Application-Programmers implemented their own solution to the problem: Databases.

    That's why he started ReiserFS. It is today the ideal filesystem for storing little bits of information (under 100 bytes) which you would normally store in databases, because you don't want millions of small files lying around.

    (I know this is a simplified view, read his own arguments for a broader discussion.)
  • There's a product I've been working with for 3 years which does something similar to what you are asking for:

    OpenText Livelink [opentext.com]

    They have a number of other products that probably have similar underpinnings.

    Basically, a database (can be Oracle, Sybase, or SQL Server) keeps meta-information on items stored in the database. Items can be documents, folders, URL links, tasks, discussion topics and replies, etc. Meta-information includes dates of creation, updating, and deletion (i.e. the makings of an audit trail), whether the document is checked out for editing and if so, by whom (the makings of a simple source-control system), who can see or edit the document (which implies that a table of users is also maintained, which it is), etc.

    Livelink provides a server (basically a big CGI app) that you can run through a web server, allowing this stuff to be navigated and maintained through the web. The web pages are customizable. Really, the whole thing is customizable, which means you can write all kinds of little apps and processes above and beyond what is supplied by Livelink (our most common examples are scanning apps, that scan and store new documents in one step, and document expiration processes, which force certain documents to be read and revised every 6 months or whatever). There's also an API for VB, C++, and Java, to allow access methods other than the web.

    Depending on the number and size of documents you're storing, documents can be kept in the database itself, or can be kept in a filesystem with pointers to those documents stored in the database. The second option is usually preferred, because the first option will cause trouble when it comes time to back up, restore from backup, migrate data, etc.

    The biggest disadvantage from a user's point of view is the need to log in, if you plan to keep any semblance of an ACL, source control, or auditing. You could provide one common login for read-only access to most of your files, which would ease the pain a bit. Or you could 'roll your own' solution, based on some of the premises used by this type of system.

    I'd like to add that I think the technology to use is the least of your issues. The biggest issue will be in finding and categorizing all of the content that's already out there. When you find 4 different variations of the same document, or 3 different builds of the same source code, or tables for 3 different apps in one database schema, and it's not clear what is what, how will you know who to contact? Who among your users has the extra headcount to spare to give you detailed info on all their files and databases, etc? How much stuff is out there, that was owned by people who have left or been laid-off, who no one else can provide info on? That's gonna be your real battle.
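    For a feel of what that kind of meta-information table looks like, here is a stripped-down sketch. The schema and column names are my own invention for illustration, not Livelink's actual tables; SQLite is used for brevity, but the same DDL works on Oracle or Sybase with minor changes:

    ```python
    import sqlite3

    # One row per stored item, with audit-trail and checkout columns
    # as described above. Hypothetical schema, not Livelink's.
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE items (
            id             INTEGER PRIMARY KEY,
            parent_id      INTEGER REFERENCES items(id),  -- folder hierarchy
            name           TEXT NOT NULL,
            kind           TEXT NOT NULL,    -- 'folder', 'document', 'url', ...
            created_at     TEXT NOT NULL,
            updated_at     TEXT,
            deleted_at     TEXT,             -- soft delete = audit trail
            checked_out_by TEXT              -- simple source control
        )
    """)
    conn.execute("INSERT INTO items VALUES "
                 "(1, NULL, 'specs', 'folder', '2001-06-01', NULL, NULL, NULL)")
    conn.execute("INSERT INTO items VALUES "
                 "(2, 1, 'design.doc', 'document', '2001-06-02', '2001-06-10', NULL, 'pogie')")

    # Who has documents checked out under the 'specs' folder?
    rows = conn.execute("""
        SELECT i.name, i.checked_out_by
        FROM items i JOIN items p ON i.parent_id = p.id
        WHERE p.name = 'specs' AND i.checked_out_by IS NOT NULL
    """).fetchall()
    print(rows)  # [('design.doc', 'pogie')]
    ```

    The point is that checkout state, ownership, and deletion history all become ordinary SQL queries instead of filesystem spelunking.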

  • One common theme that I read in the replies is the question of whether Oracle (or any RDBMS or DB in general) is appropriate for your situation. In light of this sort of response, it might be appropriate to reconsider what you are attempting to accomplish rather than focusing on how you've decided (perhaps prematurely) to solve it.

    This is a list of requirements I've read and inferred from your post (with definite emphasis on the inference part...check me if I'm incorrect in any of these). Whatever solution you put in place:

    1) must appear identical at application level to existing information topology

    2) must protect you from failure of any node in the underlying storage system

    3) is desired to provide you the ability to perform rollbacks

    4) must be of enterprise-level robustness from day one

    5) must have adequate performance at current level of (tens of?) terabytes of storage

    6) should be able to scale up in size and complexity of topology from current (you are unlikely to start using less space and suddenly have simpler systems, correct?)

    7) must be maintainable for years into the future by network administrators of average skill (you do not want to become irreplaceable in this role; it may provide job security, but it sure limits potential for advancement and personal growth)

    8) should be deployable with minimum downtime (perhaps zero?)
  • I suspect you've already got some sort of RAID in place. Redundant boxes are nice, but server-class processors and memory are mind-blowingly expensive.
    Implementing an object-oriented "filesystem" is something that is still being toyed with at MIT and who knows where else. There are supposed to be benefits to it, but you would be insane to trust this much data to anything but a bought-and-paid-for solution that comes with certain guarantees.
    Drop this idea. Drop it now. Maybe someday Oracle or someone will have a solution based around this, maybe someone already does but for the love of the great pumpkin don't home brew something this big. Have you considered clustering, do you have fibre to disk already, can your boxes do load balancing, does all that data have to be live or can you lighten the load by archiving some data? So many questions but most importantly: Have you exhausted every /Vendor Supported Solution/ to address the availability and performance considerations you have? I strongly suggest buying some solution over creating your own and risking the farm.
    All the same here's some linkage related:
    http://org.lcs.mit.edu/SOW/proceedings/2001/xyz.pdf Good luck.
  • If I had my choice of what to implement for the ultimate network distributed filesystem, I would concentrate on LDAP, GFS, and Kerberos. LDAP, by its very nature, was designed to be a distributed, redundant resource locator and data repository. It can be back-ended by any number of engines, including the more popular RDBMSs. It may seem a bit overwhelming, but it is well worth the investment in time and energy. Check out the OpenLDAP [openldap.org] site for more information.

    The second issue you're trying to address is data redundancy and failover. You want a high-availability solution. Look into using the Linux Global Filesystem (GFS) [sistina.com]. In a nutshell, it's a clustered journaling filesystem whose participants are equally responsible for the data on disc. If one of the servers in the cluster goes down, the first server to see it plays back the unfinished journal of the downed server, and the whole cluster continues on its merry old way.

    So, it would be one GFS+LDAP cluster with multiple 1U fibre-channel servers attached to a fibre-channel disc array. Tack on a gigabit ethernet backbone, and you've got a winner.
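    As a sketch of the "resource locator" half of that idea, an LDAP entry per exported tree could record which filer currently serves it, so clients (or failover scripts) look the location up rather than hard-coding mount paths. The attribute names and schema below are invented for illustration; a real deployment would define its own schema:

    ```
    # Hypothetical LDIF: one entry per exported filesystem.
    dn: cn=usertrees,ou=mounts,dc=example,dc=com
    objectClass: top
    cn: usertrees
    description: Home directory trees
    nfsServer: filer3.example.com
    nfsPath: /vol/users
    ```

    On failover, updating nfsServer in one place repoints every client that resolves the mount through the directory.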

  • There are two parts to this (most of which makes my post redundant):

    1) Can it be done/has it been done? Yes and yes. Mainframes used to do this. The OS/400 on the AS/400 uses basically this same idea. Apple tried to do this with Hypercard. Oracle does this with iFS. Microsoft wants [theregister.co.uk] to do this (and had plans to for a long time).

    2) Should you do this? No. You're quickly learning why NAS is so cheap. What if you or someone else leaves the business? There's nothing elegant about something so complex that no one can come in after you and figure it out (screw job security - there's plenty of other shit to do).

    Adding a database wrapper to an already complex web of file system objects is only going to make things worse. Get out a whiteboard and start drawing pictures until you find the end of the knot and start undoing it. I'd offer more helpful advice, but I'd need more information about how things are laid out.

    Consider a fibre channel SAN and sleep well at night knowing you have virtual disk abstraction of this mess - this morning I concatenated a 10GB volume segment onto a Netware server and moved a virtual disk from a dead Sun Netra to a new one.
  • ... I'm sure Oracle Consulting would be happy to design and implement a solution for you. :P

    But it sure sounds like you are using the wrong tool for the job. I mean, you can drive nails with a crescent wrench, but it's not efficient and not good for the wrench.

  • It's already been done for quite a few years: a filesystem with any number of extended attributes, queryable through SQL, with live queries on the FS as it changes...

    Check out this book, great details.

    http://www.mkp.com/books_catalog/catalog.asp?ISBN=1-55860-497-9
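    The core trick described there - per-file extended attributes that the filesystem indexes and lets you query - can be simulated in a few lines. This is only my toy sketch of the idea (BeFS does the indexing inside the filesystem itself; the paths and attribute names below are made up):

    ```python
    import sqlite3

    # Index arbitrary (file, attribute, value) triples, then answer
    # attribute queries with SQL instead of walking the tree.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE attrs (path TEXT, name TEXT, value TEXT)")

    def set_attr(path, name, value):
        db.execute("INSERT INTO attrs VALUES (?, ?, ?)", (path, name, value))

    set_attr("/mail/1", "MAIL:from", "alice@example.com")
    set_attr("/mail/2", "MAIL:from", "bob@example.com")
    set_attr("/mail/2", "MAIL:status", "unread")

    # "Which files have MAIL:from = bob@example.com?"
    hits = [r[0] for r in db.execute(
        "SELECT path FROM attrs WHERE name = 'MAIL:from' AND value = ?",
        ("bob@example.com",))]
    print(hits)  # ['/mail/2']
    ```

    A "live" query is then just re-running the SQL whenever the index is updated, which is essentially what the original poster's polling scripts would be doing by hand.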

