Dumping Lots of Data to Disk in Realtime?
AmiChris asks: "At work I need something that can dump sequential entries for several hundred thousand instruments in realtime. It also needs to be able to retrieve data for a single instrument relatively quickly. A standard relational database won't cut it. It has to keep up with 2000+ updates per second, mostly on a subset of a few hundred instruments active at a given time. I've got some ideas of how I would build such a beast, based on flat files and a system of caching entries in memory. I would like to know: has someone already built something like this? And if not, would someone want to use it if I build it? I'm not sure what other applications there might be. I could see recording massive amounts of network traffic or scientific data with such a library. I'm guessing someone out there has done something like this before. I'm currently working with C++ on Windows."
2-stage approach (Score:5, Informative)
Cost... Are you going to go for local storage or NAS? Need SCSI and RAID or a less expensive hardware setup? Do you think gigabit ethernet will be sufficient for the transfer from the data dump hardware to the processing/indexing/search machines?
Sounds like you might want to run a test case using commodity hardware first.
Ramdisk database (Score:5, Informative)
Either make a big ramdisk and put your database out there (see my Journal from a few months back; ramdisk throughput is pretty damn fast from the local machine, given certain constraints, and random-access writing is hella fast), or use a database that runs entirely in memory (think Derby, aka Cloudscape, which comes with WebSphere Application Developer.)
When you've got your data, save it out to the hard drive.
Granted it helps to have a box with a ton of memory in it, but they are out there now, almost affordable. If you are collecting more than 4G of data in one session, well YMMV - but 4G is a LOT of data, perhaps consider your approach.
Re:Ramdisk database (Score:2)
Re:Ramdisk database (Score:3, Insightful)
I don't really care what it pays if it has anything to do with real-time systems (guidance or delivery systems a plus), if the R&D budget has enough wiggle room for better hardware (toys) than I have at home, if you promise that I will
Re:Ramdisk database (Score:2)
My recent forays to Crucial show 4 sticks of 2GB reg/ecc PC2700 DDR memory will set one back a bit over $3k. For 8 to 16GB of data, the most economical route would be a dual Opteron box, things start getting expensive above 16GB.
Re:Ramdisk database (Score:2)
That said, as I understand it AM
4GB not that much for a session... (Score:2)
The application is vibration monitoring in industrial machinery. A reasonable "session" in such an environment would be a machine startup or stop - depending on the machine this could take several minutes to hours. For the "hours"
Re:2-stage approach (Score:2)
The front-end gets all the data, then passes it along using a file-based backing-store queueing system to the back-end that posts the data to your permanent store.
This also gives you the flexibility to let the front-end choose which back-end to send it to (usually on another machine).
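A minimal sketch of that hand-off, assuming a simple length-prefixed framing (the `spool_write`/`spool_read` names are invented): the front-end only ever appends to the spool, and the back-end drains it at its own pace, possibly on another machine.

```cpp
#include <cassert>
#include <cstdint>
#include <iostream>
#include <sstream>
#include <string>

// Append one length-prefixed record to the spool stream. The front-end only
// appends; the back-end reads strictly sequentially, so the two halves are
// fully decoupled and can even live on different machines.
void spool_write(std::ostream& out, const std::string& payload) {
    std::uint32_t len = static_cast<std::uint32_t>(payload.size());
    out.write(reinterpret_cast<const char*>(&len), sizeof(len));
    out.write(payload.data(), len);
}

// Read the next record; returns false at end of spool.
bool spool_read(std::istream& in, std::string& payload) {
    std::uint32_t len = 0;
    if (!in.read(reinterpret_cast<char*>(&len), sizeof(len))) return false;
    payload.resize(len);
    return static_cast<bool>(in.read(&payload[0], len));
}
```

In a real deployment the stream would be a file (or a rotating set of files) rather than an in-memory stream, and the back-end would tail it.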
Suuuure. (Score:4, Funny)
Suuuure.
Re:Suuuure. (Score:2)
of course we can see how well the Govt's spyware works... of course he could have a network of "volunteers" allowing him to "monitor" their computing habits... that would be a lot of info too...
Wonderware InSQL (Score:5, Informative)
Check out Wonderware InSQL. We update roughly 50k points every 30 seconds without loading the server much at all. Pretty nice product; it also has some custom extensions to SQL built in for querying the data (e.g. cyclic, resolution, delta storage, etc.).
http://www.wonderware.com/ [wonderware.com]
Of course, you'll need your data to come from an OPC/SuiteLink/other supported protocol, but it should work nicely for you.
- Joshua
Re:Wonderware InSQL (Score:3, Interesting)
Do you actually charge a license fee PER point?
We had a need for a smaller SCADA system in our company and Wonderware could not answer these questions (except for the fee per point, which they actually charge PER POINT). This department is going with a different product.
Sorry, but be very cautious of
Re:Wonderware InSQL (Score:1)
Sounds to me like you're either not throwing the hardware you ought to at this project or you are looking at the wrong software.
SCADA is very versatile and powerful. Are you feeding data in mostly from local or remote RTU's?
Re:Wonderware InSQL (Score:3, Informative)
SCADA is very versatile and powerful. Are you feeding data in mostly from local or remote RTU's?
You do understand that SCADA is a general term which describes a type of system, right? A SCADA system could be designed (and has been)
Anyway, we
Re:Wonderware InSQL (Score:1)
For those looking to find out more about SCADA and/or OPC, you might want to have a look at the SCADA Working Group webpage or primers such as this one [web.cern.ch].
Re:Wonderware InSQL (Score:2)
Re:Wonderware InSQL (Score:3, Informative)
InSQL works as an OLE DB provider for SQL Server. You can use pretty much any tool (ODBC/ADO/Excel/DAO/whatever) to query the database. Yes, I realize I mixed libraries/methods/applications in that tool list; I'm just trying to get across the basic idea.
Yes, per point licensing, I believe we licensed for 60k points, not sure on the cost. This is pretty typical in the SCADA world I believe.
Sample query I'd use to get all data for a specific rtu
select * from live where tagname like 'StationName%'
Two tables us
Re:Wonderware InSQL (Score:2)
Actually, I would refuse to pay a license per number of points. That is a completely arbitrary way to make more money. The only thing the number of points should affect is disk space and possibly CPU power.
Numerous companies do not charge a license based on how many points you have, and I find the practice of charging per point reprehensible. Similar to the concept of an ISP charging per packe
Re:Wonderware InSQL (Score:1)
Don't roll your own (Score:4, Informative)
In order to process so many points in realtime, it usually will have to be in RAM for performance reasons.
Cluster it (Score:4, Insightful)
I know you're working with Windows, but when I read this I said yes.
I'm guessing someone out there has done something like this before.
Google has a cluster of machines far larger than you need, but their approach was a Linux cluster. Plus, for the number of writes going on, you're going to want to avoid any burdens on the system that aren't needed.
Re:Cluster it (Score:2)
Re:Cluster it (Score:1)
Re:Cluster it (Score:2)
Just dump it (Score:1)
And remember - Keep It Simple, Stupid. Be sure you can ree
Just use the file system (Score:2)
Re:Just use the file system (Score:1)
Re:Just use the file system (Score:2, Funny)
Re:Just use the file system (Score:1)
Why do all the people who use UNIX and Linux for these things use UNIX and Linux and not Windows?
Re:Just use the file system (Score:2)
Re:Just use the file system - windows blows (Score:1)
When you click near the directory (the parent directory did it) in the "file explorer", the whole thing locks up for a few minutes: desktop, task bar, etc. I haven't tried making a tree of subdirectories to avoid this problem. I'm not too sure
Re:Just use the file system - windows blows (Score:2)
Anyway, the only alternative to using the OS file system is to implement your own file system in user space, or use a database as a file system. If you choose to
A commercial RDMS can cut it (Score:4, Informative)
Re:A commercial RDMS can cut it (Score:1)
Re:A commercial RDMS can cut it (Score:1)
Here is a SQL Server performance test that gets over 9,000 inserts per second.
http://www.sql-server-performance.com/jc_large_data_operations.asp [sql-server...rmance.com]
It took me two minutes to find these two examples. Now, I didn't find an Oracle one, but you do realize that 2000 inserts per second is not that many; OLTP database design is made for this.
Re:A commercial RDMS can cut it (Score:1)
And BTW, why do you think messing with database connections would be easier than doing it manually? I did these things [slashdot.org] in REXX (read: Perl); it takes about a hundred lines for it all.
Re:A commercial RDMS can cut it (Score:5, Interesting)
The short answer is "no."
A 10,000 RPM disk has a rotational period of 6 ms. That's 3 ms latency on average for random access (not counting seek time, or the fact that a read-modify-write will take at least three times this long: read, wait one full rotation, write).
So one disk can do, as a generous upper bound, 333 random accesses per second. I'll spare you the details of the Poisson distribution, but if you managed to spread these updates randomly over a disk farm, you'd need about 2000/333*e = 16 independent spindles.
The trick to high throughput is harnessing, and creating, non-randomness. You can do a much better job of this with a purpose-built solution.
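That arithmetic can be written out as a quick sanity check (the function names here are mine, not from any library):

```cpp
#include <cassert>
#include <cmath>

// A drive spinning at `rpm` completes rpm/60 revolutions per second; average
// rotational latency for a random access is half a revolution, so one disk
// sustains roughly rpm/30 random accesses per second.
double accesses_per_sec(double rpm) { return rpm / 30.0; }

// Independent spindles needed to absorb `updates` random writes per second,
// with the factor of e as Poisson headroom so bursty arrivals don't saturate
// any one disk's queue.
double spindles_needed(double updates, double rpm) {
    return updates / accesses_per_sec(rpm) * std::exp(1.0);
}
```

For 2000 updates/s on 10,000 RPM disks this lands at about 16 spindles, matching the estimate above.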
Re:A commercial RDMS can cut it (Score:2)
You seem to be presuming that there's no:
Re:A commercial RDMS can cut it (Score:2)
The whole idea behind caching and any other memory hierarchy is that it takes advantage of locality of reference, which is explicitly precluded by the stipulation in the
Re:A commercial RDMS can cut it (Score:2)
Yes.
which is explicitly precluded by the stipulation in the great-grandparent that the accesses are random.
Didn't notice that part. I was thinking more of the Original Asker, but in the context of an RDBMS.
If I'm trying to shove as much data as possible into a table, caching will definitely help. And since the table will have to be indexed, caching may help there, too, depending on the
Re:what about reordering requests? (Score:2)
Re:A commercial RDMS can cut it (Score:1)
I've built a system like this, only the amount of data is smaller. Our system is written in Java and has a MySQL backend. In stress tests it could perform about 1000 updates per second on single-processor x86 hardware. With better hardware and a few optimizations, even our system could perform at 2000 updates/sec.
-Kari
Have you tried a relational database? (Score:2)
With your specs, chances are you will either need a very beefy machine, or a distributed approach spreading the load across many machines, regardless of the software approach. But I wouldn't be surprised if a good RDBMS would outperform a flatfile approach. It is what they're designed for after all.
Re:Have you tried a relational database? (Score:3, Interesting)
Re:Have you tried a relational database? (Score:2)
What I meant is: SQLite (and presumably other RDBMS as well) is quite fast. Even with an index.
Yes, this sort of thing has been built before (Score:3, Informative)
If your data streams are continuous, and can be represented as audio data, then you are pretty much dealing with a solved problem, and your other problem of selecting from a large number of possible 'instruments' is solved by an audio patchbay.
If this isn't feasible, then a number of solutions might be appropriate (spreading the load over a number of machines/huge ram caches/buffering/looking at the problem and thinking of a less intensive sampling strategy/etc.) but without more information on the sort of data you are collecting, and exactly how quickly you need to access it, it's very hard to be specific.
horizontal scaling is good... (Score:3, Interesting)
Linux Virtual Server in front of several instances of your Windows box will do, with some proxying stuff for queries. Probably cheaper than spending months trying to tweak a single node to hit your scaling target, and it will trivially scale much farther out.
in-memory (Score:1)
Re:in-memory (Score:1)
This thing is going to run for months at a time. Eventually the stuff has to go to disk.
The Solution (Score:2, Funny)
The program is called Microsoft Access 97
Re:The Solution (Score:2)
Sequential files are your friend (Score:2)
If your data are already arriving on a single socket, just mark up the data and write it out. Then you can retrieve anything you like with linear search. And you can be reasonably certain that you have captured all the data and will never lose it due to having trusted it to some mysterious DB software.
If linear search isn't good enough, you have to
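As a rough sketch of that mark-up-and-scan idea (the one-line-per-entry format here is just an assumption; any self-describing framing works):

```cpp
#include <cassert>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Append a tagged entry as one text line: "<instrument> <value>".
// Appending is the only write path, so capture can never stall on an index.
void log_entry(std::ostream& out, const std::string& instrument, double value) {
    out << instrument << ' ' << value << '\n';
}

// Linear search: scan the whole log and collect values for one instrument.
std::vector<double> find_entries(std::istream& in, const std::string& instrument) {
    std::vector<double> hits;
    std::string name;
    double v;
    while (in >> name >> v)
        if (name == instrument) hits.push_back(v);
    return hits;
}
```

The stream would be a plain file in practice; the point is that the capture path is a pure append and retrieval is a full scan you can always trust.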
Did something like this some years ago (Score:3, Insightful)
2,000 items/sec means that you must do bulk updates. You cannot flush to disk 2,000 times per second. So your program will have to store the items temporarily in a buffer, which gets flushed by a secondary thread when a timer expires or when the buffer gets full. Use a two-buffer approach so you can still receive while committing to the database.
Depending on your application, it may be beneficial to keep a cache of the most recent items for all instruments.
You also have to consider the disk setup. If you have to store all the items, then any multi-disk setup will do. If you actually only store a few items per instrument and update them, then RAID-5 will kill you, because it performs poorly with tiny scattered updates.
Do you have to back up the items? How will you handle backups while your program is running? This affects your choice of flat-file or database implementation.
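The two-buffer approach might look something like this (a sketch; the class and method names are invented). The receive thread calls `add`, and a flush thread periodically calls `swap_out` and commits the returned batch:

```cpp
#include <cassert>
#include <mutex>
#include <string>
#include <vector>

// Two-buffer collector: the receive path appends to the active buffer; the
// flush path swaps buffers under the lock and commits the full one while new
// items keep landing in the now-empty active buffer. The lock is held only
// for the append or the swap, never for the slow disk write itself.
class DoubleBuffer {
public:
    void add(const std::string& item) {
        std::lock_guard<std::mutex> g(m_);
        active_.push_back(item);
    }
    // Called by a timer, or when the buffer is full; returns the batch to commit.
    std::vector<std::string> swap_out() {
        std::lock_guard<std::mutex> g(m_);
        spare_.clear();
        active_.swap(spare_);
        return spare_;
    }
private:
    std::mutex m_;
    std::vector<std::string> active_, spare_;
};
```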
Yup... (Score:3, Informative)
The design of a data acquisition system will of course differ, depending on how much data it records per sensor, how many sensors there are, how often the data is recorded, and whether the data is to be available for online or offline processing.
In most of the "hard" cases, you will use a pipelined architecture, where data is received on one or more realtime boxes and buffered for an appropriate (short) period. A second stage occurs when data is collected from these buffers and buffered/reordered/processed to make writing the desired format to a file or DBMS easier. The last stage is, of course, to write it. You might use zero or more computers at each stage, with a fast dedicated network in between. You might even decide to split up some of the stages even further. Depending on how much you care about your data, you may also add redundancy. And make sure it's fault-tolerant: it's generally better to lose some data, as long as it's tagged as missing, than to lose it all. To check this in real-time you can also add data monitoring anywhere it makes sense for your system.
In the simpler cases, you simply remove the things not needed: a soundcard instead of dedicated realtime boxes, no redundancy, monitoring, dedicated network, etc...
Some commercial off-the-shelf systems will surely do this. But the more advanced systems you still build yourself, either from scratch or by reusing code you find in other similar projects (I'm sure there is some scientific code available from people interested in medical science, biology, astrophysics, geophysics, meteorology, etc...).
Most of the "heavy" systems will not run on Windows, or even Intel, due to limitations of that platform for fast I/O. This has obviously changed a lot recently, so it's no longer the stupid choice it was; but don't expect too many projects of this kind to have noticed, as they have probably existed much longer.
Have you considered memory-mapped files? (Score:4, Interesting)
The standard file API architecture just didn't hold up, so we (the development team I was working with) had to rewrite some of the file management routines ourselves and work directly with the memory-mapped architecture. This gives you some advantages beyond speed as well: once you establish the file link and map it into a memory address range, you can treat the data in the file as if it were RAM within your program, having fun with pointers and everything else you can imagine. Copying data to the file is simply a matter of a memory move operation, or copying from one pointer to another.
The thing to remember is that Windows (this is undocumented) won't allow you to open a memory-mapped file that is larger than 1 GB, and under FAT32 file systems (Windows 95/98/ME/and some low-end XP systems) the total of all memory mapped files on the entire operating system must be below 1 GB (this requirement really sucks the breath out of some applications).
Remember that if you are putting pointers into the file directly, it works better if the pointers are relative offsets rather than direct memory pointers, even though direct memory pointers are in theory possible during a single session run.
Re:Have you considered memory-mapped files? (Score:1)
Good advice. These are "self-relative-pointers". Instead of this:
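Instead of storing an absolute pointer, which is only valid at the original mapping address, store the offset from the pointer's own location. A sketch (the `RelPtr` name and the struct layout are hypothetical):

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// A self-relative pointer stores the distance from its own address to the
// target, so the reference stays valid when the whole buffer is copied,
// remapped at a different address, or persisted in a memory-mapped file.
struct RelPtr {
    std::ptrdiff_t off;
    void set(const void* target) {
        off = static_cast<const char*>(target)
            - reinterpret_cast<const char*>(this);
    }
    void* get() { return reinterpret_cast<char*>(this) + off; }
};
```

The test of correctness is that after a bitwise copy of the containing buffer, the copied pointer resolves to the copied target, not the original one.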
Re:Have you considered memory-mapped files? (Score:2)
The thing to remember is that Windows (this is undocumented) won't allow you to open a memory-mapped file that is larger than 1 GB
OK... does it fail at CreateFileMapping or MapViewOfFile? If the latter, you can work with larger files, you'll just need to restrict yourself to a 1Gb window within them.
and under FAT32 file systems (Windows 95/98/ME/and some l
Re:Have you considered memory-mapped files? (Score:2)
It is indeed the "CreateFile" portion of what Windows NT deals with that causes the problems. I did experiment with different memory window ranges and various strategies to access the data. It didn't seem to have any effect at all on this absolute limit, as it appears Windows does some sort of alternate mapping beyond what is formally p
Re:Have you considered memory-mapped files? (Score:1)
I'm also not sure I see an advantage. I've considered trying Windows "overlapped I/O", but I'd like to steer clear of everything platform-specific, even if it means using lots of threads.
Re:Have you considered memory-mapped files? (Score:1)
When I did data flow experiments, I got up to a 3x to 5x data thro
Re:Have you considered memory-mapped files? (Score:1)
I'm curious about the 5GB files though... I was under the impression that NTFS in general won't allow you to create or open a file > 1GB.
Nahh, I've had files of ~4 GB. FAT32 has a limit around here at
Re:Have you considered memory-mapped files? (Score:1)
I may have been the reason why it got documented in the first place, and it seems like a really silly limitation.
Specialized Hardware (Score:2, Informative)
just dump to disk (Score:1, Insightful)
And don't forget the magic words: striping. You should interleave your data across many disks, and the index files should be on separate disks as well.
Do striping+mirroring for data protection. Do the striping at the app level for maximum throughput, do th
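App-level striping boils down to a simple address mapping. A sketch of round-robin placement (the names and the 64 KB stripe size below are arbitrary choices, not a standard):

```cpp
#include <cassert>
#include <cstddef>

struct StripeLoc {
    std::size_t disk;    // which spindle the chunk lands on
    std::size_t offset;  // byte offset within that disk's stripe file
};

// Map a logical byte offset to (disk, physical offset): chunk i of the
// stream goes to disk i % ndisks, so consecutive writes hit different
// spindles and can proceed in parallel.
StripeLoc locate(std::size_t logical, std::size_t stripe_size, std::size_t ndisks) {
    std::size_t chunk = logical / stripe_size;
    return { chunk % ndisks,
             (chunk / ndisks) * stripe_size + logical % stripe_size };
}
```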
Kdb+ (Score:3, Informative)
From how you described your needs, this would probably fit the bill..
Been there, done that (Score:1)
Re:Been there, done that (Score:2)
From another perspective, we used to post inventory transactions to our ERP system at the band-aid level, creating 100,000 inventory journals a day. That is a large load for a hospital's ERP and makes financial analysis a headache. I would do a use case to determine the 'real' granularity needed for the data. Remember: users ask for everything; we give them what they need.
More info (Score:2)
NetCDF or HDF5 (Score:3, Informative)
I'm more familiar with NetCDF (because I use it) so let me tell you some of the things it can do. (HDF5 can also do these things, I'm sure).
With NetCDF, you can store 2+ gigabyte files on a 32-bit machine (it has Large File Support). I've saved 12 gigabyte files with no problems. It supports both sequential and direct access, meaning you can read and write either starting from the beginning of the file or at any point in the middle of it.
The format is array-based. You define dimensions of arrays and variables consisting of zero, one, or more dimensions. You can also define attributes that are used as metadata, information describing the data inside your variables.
You can read or write slices of your data, including strides and hyperslabs. This allows you to read/write only the data you're interested in and makes disk access much faster.
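The strided-slice idea can be sketched against a plain in-memory array; NetCDF's hyperslab reads generalize this to N dimensions and do the disk seeks for you. The helper below is illustrative, not NetCDF API:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Read a strided slice of a flat array: `count` elements starting at `start`,
// stepping by `stride`. Only the selected elements are touched, which is why
// strided access to a file is so much cheaper than reading everything.
std::vector<float> read_slice(const std::vector<float>& data,
                              std::size_t start, std::size_t count,
                              std::size_t stride) {
    std::vector<float> out;
    out.reserve(count);
    for (std::size_t i = 0; i < count; ++i)
        out.push_back(data[start + i * stride]);
    return out;
}
```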
It's also easy to use with good APIs. They have APIs for C, Fortran95, C++, MATLAB, Python, Perl, Java, and Ruby.
Take a look at it. It might be what you're looking for.
-Howard Salis
Re:NetCDF or HDF5 - interesting (Score:1)
Re:NetCDF or HDF5 - interesting (Score:2)
You can set the buffer to whatever you want and it really depends on your computing architecture on how buffering is handled. Normally, the data is kept in memory unti
Thousands of instruments? (Score:1)
SQLite (Score:1)
An RDBMS won't cut it? (Score:1, Informative)
*Many* RDBMS systems can do this without breaking a sweat.
Do some googling on Interbase for example - one of the success stories for IB is a system that does 150,000 inserts per second - sustained. It's a data capture system that may well be similar to yours.
Oracle can definitely do it - but you'll probably need a good Oracle DBA to tune it up properly.
Informix can definitely do it as well - don't know about the latest version, never used it, but whatever was curren
HP-IB and ISAM (Score:3, Informative)
On the coding end, there are numerous (hell, hundreds of) commercial and F/OSS ISAM libraries for you to use for the actual storage and retrieval, and plenty of books on them. It may even be included in your existing libraries, given how old the technique is. I was doing this back in the '80s for the US Navy using a 24-bit, very slow minicomputer, so any normal box should be able to handle it today!
We use these techniques in electronic instrument monitoring, logistical systems, systems engineering, you get the idea. You may want to mosey over to the HP developer web site to see if there is a drop in solution, as I imagine there is (sorry, haven't looked).
I hope this helps.
Re:HP-IB and ISAM - Ahh now I know as name for it. (Score:1)
Thanks. I now know the name for it. It still looks like I might be better off writing something from scratch. Maybe I can slap it up on SourceForge afterwards.
Re:HP-IB and ISAM - Ahh now I know as name for it. (Score:1)
DBM Family: esp GDBM and Berkeley DB (Score:1)
The interface is a model of simplicity: pointers to arbitrary-length buffers for keys and data. All you need is a key scheme that provides the post-acquisition access that you require.
Berkeley offers hash and BTree style organization of the keys.
It may use memory-mapped file I/O under the hood and handles all transfer of multiple buffers.
It provides multiple files, or multiple tables in one file.
Re:DBM Family: esp GDBM and Berkeley DB (Score:1)
GDBM looks really simple. It also seems to have just one table per file. So all my instruments would have to go in that one. I'm wondering if something that looks so simple really has the performance I need.
How do you know? (Score:1)
Based on my (relatively basic) knowledge of how databases work these days, using large in-memory caches and fast commits, I wouldn't be surprised if a good enterprise database could handle this rate of commits.
You should remember that 2000 commits != 2000 random disk accesses!
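That's the group-commit idea: amortize one physical flush over many logical inserts. A toy model (the `BatchWriter` class is invented, and the flush is only counted, not performed) shows the arithmetic:

```cpp
#include <cassert>
#include <cstddef>

// Group commit: instead of one disk flush per insert, accumulate a batch and
// flush once per batch, so 2000 inserts/s at a batch size of 100 costs only
// 20 physical flushes/s.
class BatchWriter {
public:
    explicit BatchWriter(std::size_t batch_size) : batch_size_(batch_size) {}
    // Returns true when this insert triggered a physical flush.
    bool insert() {
        if (++pending_ < batch_size_) return false;
        pending_ = 0;
        ++flushes_;  // a real system would fsync the log exactly once here
        return true;
    }
    std::size_t flushes() const { return flushes_; }
private:
    std::size_t batch_size_;
    std::size_t pending_ = 0, flushes_ = 0;
};
```

Twenty flushes a second is comfortably inside a single spindle's random-access budget, which is why a well-tuned database can absorb this insert rate.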
This might be able to do it (Score:1)
Advantages:
- it's fast and it's not constrained by column length. If you want a table with 16,000 columns, go right ahead.
- it's very portable. Runs on just about every operating system that has more than 100 users.
The disadvantages:
- last time I looked (admittedly about 4 years ago), their SQL integration could have been better.
- it's not a high-level database. To work most effectively with it, you need to know about the way that your data is stored.
I'm sure it's improved a grea
sqlite @ 120,000 inserts per second (Score:1)
Re:sqlite @ 120,000 inserts per second (Score:1)
Re:sqlite @ 120,000 inserts per second (Score:1)
That, I want to know as well
>Is it all in memory or can I have several GB of data?
Definitely GB of data stored on the disk:
All I have now is an athlon 1.7 with 512 megs of ram (debian).
I don't care that much about fast inserts; more like, I have HUGE quantities of images (PNG + 32-bit timestamp) which I grab and store in an SQLite database (the insert rate could go up to gigabytes of data per hour at 5 images/sec, I need to check
Re:sqlite @ 120,000 inserts per second (Score:1)
I wanted to access the database with
Wrap up (Score:1)
* Database is 5 GB right now; after improving the thing's performance, it could be 10 times bigger: 50 GB.
* Yes, some people guessed it. It's financial data. I'm trying to dump all the trades of all stocks and futures in the US and EU. Right now we do a subset, but there's always something missing.
* Hardware. Yes I can get one or two monster machines for our server farms.
You should check out... (Score:2)
Another thing to look into is testing for dynamic loads on cars or aircraft. At least for aircraft, they'll put thousands of accelerometers all over the frame to measure the various accelerations.
Both of those are prime examples of a similar system to yours. And suc
Re:If you were running Linux OR BSD (Score:1)
RAID will improve disk access performance, but... there's always a but... you might want to take note that AGP is for graphics only; you'll have fun finding an AGP RAID card.
Re:If you were running Linux OR BSD (Score:2)
Re:If you were running Linux OR BSD (Score:1)
Re:If you were running Linux OR BSD (Score:1)
Re:Proprietary patented stuff - but yeah... (Score:1)
Hmm, where have I heard that number before...? Oh, right, that's just about exactly the current population of the US [census.gov]!
So, you say these are your "users" ?
[...] and have data associated with each user.
Ok, well, I don't think I'm going to sue you, and I really don't care who you work for, but I do think I'm going to go find my tinfoil hat RIGHT NOW...!
Re:Proprietary patented stuff - but yeah... (Score:1)
300 million users (Score:2)
"MasterCard member banks added an EMV chip to 40% of the 200 million MasterCard"
"In Asia Pacific, Visa has a greater market share than all other payment card brands combined with 59 percent of all card purchases at the point of sale being made using Visa cards. There are currently more than 365 million Visa cards in the region." (2003)
If Visa had 365 million cardholders just in Asia Pacific in 2003, I wonder how many they have worldwide nowadays...
Re:Proprietary patented stuff - but yeah... (Score:1)
Re:just writ the shit as it comes in. (Score:1)
Re:One word. (Score:2)
Anything above a trivial level of reporting gets nasty fast. Personally, I'd love to see an iSeries pulled into SCADA apps... the programming model fits the PLC world perfectly. But it's just not optom