Ask Slashdot: Which NoSQL Database For New Project?

DorianGre writes: "I'm working on a new independent project. It involves iPhones and Android phones talking to PHP (Symfony) or Ruby/Rails. Each incoming call will be a data element POST, and I would like to simply write that into the database for later use. I'll need to be able to pull by date or by a number of key fields, as well as do trend reporting over time on the totals of a few fields. I would like to start with a NoSQL solution for scaling, and ideally it would be dead simple if possible. I've been looking at MongoDB, Couchbase, Cassandra/Hadoop and others. What do you recommend? What problems have you run into with the ones you've tried?"
This discussion has been archived. No new comments can be posted.
  • by Richard_at_work ( 517087 ) on Wednesday April 09, 2014 @06:00AM (#46702959)

    There's probably an element of multithreaded access to consider here: writing to a single text file can get you into trouble if the receiving webserver is multithreaded, since the threads will either have to queue for write locks or each write to a different file.

    Database engines don't have this issue, so while one may be overkill, there may be reasons to have one regardless.
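
    The queue-for-write-locks option the comment describes can be sketched in a few lines. This is a hypothetical illustration (the function and file names are made up, and it only covers threads in one process, not multiple webserver processes): every thread funnels its append through a shared lock, so writes to the single log file never interleave.

    ```python
    # Hypothetical sketch: serializing appends to one shared log file across
    # threads, as the comment describes. Covers threads within a single
    # process only; separate processes would need file locking instead.
    import threading

    _write_lock = threading.Lock()

    def append_record(path, line):
        # Threads queue here for the write lock before appending,
        # so each record lands on the file as one intact line.
        with _write_lock:
            with open(path, "a") as f:
                f.write(line + "\n")
    ```

    A multi-process webserver would need `fcntl.flock` (or the other option the comment mentions: a separate file per worker), which is exactly the bookkeeping a database engine handles for you.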

  • Re:NoSQL? (Score:5, Interesting)

    by Sarten-X ( 1102295 ) on Wednesday April 09, 2014 @11:15AM (#46704925) Homepage

    "Why not" is because the cost/benefit analysis is not in NoSQL's favor. NoSQL's downsides are a steeper learning curve (to do it right), fewer support tools, and a more specialized skill set. Its primary benefits don't apply to you. You don't need ridiculously fast writes, you don't need schema flexibility, and you don't need to run complex queries on previously-unknown keys. Rather, you have input rates limited by an external connection, only a few entity types, and you know your query keys ahead of time.
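
    The "you know your query keys ahead of time" point is what makes the relational route cheap. As a hypothetical sketch (table and column names invented for illustration, using SQLite only because it is self-contained), known keys simply get indexes, and pulling by date is an ordinary range query:

    ```python
    # Hypothetical sketch of the relational alternative the comment argues
    # for: query keys known in advance get plain indexes. Schema names are
    # made up; SQLite stands in for whatever RDBMS you'd actually use.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("""CREATE TABLE events (
        id INTEGER PRIMARY KEY,
        device TEXT,
        recorded_at TEXT,   -- ISO-8601 timestamp
        value REAL)""")
    # Index the keys you already know you will query on.
    conn.execute("CREATE INDEX idx_events_recorded_at ON events(recorded_at)")
    conn.execute("CREATE INDEX idx_events_device ON events(device)")

    def events_for_day(day):
        # "Pull by date" is a range scan on the indexed timestamp column.
        return conn.execute(
            "SELECT device, value FROM events WHERE recorded_at LIKE ?",
            (day + "%",)).fetchall()
    ```

    Trend reporting over time on a few fields is then a `GROUP BY` over the same indexed columns, with no specialized tooling to learn.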

  • by Anonymous Coward on Wednesday April 09, 2014 @11:45AM (#46705215)

    If the goal really is just to amass data and then do offline reports on it (not completely clear from the question) then I can report that at my company we've been doing this at scale for over five years. Here's how:

    * A bunch of web servers accept data and append it to a local disk file.
    * Every hour, that "log" is pushed from each host into HDFS (the Hadoop Distributed Filesystem) and a new log file is started.
    * Querying is done later, using Hive with a custom deserializer that natively understands our on-disk format. (You could also just make sure your on-disk format is the delimited text format Hive natively understands, of course. We had some unique needs here.)
    * An hourly task runs a small set of Hive aggregation queries (Hive presents a SQL-like interface to defining and running MapReduce jobs) on the raw "table" to produce some smaller datasets that can return aggregate-based results faster than the raw data, including copying some of the smaller aggregates into a MySQL database for online access via some reporting applications.
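
    The first two steps above can be sketched roughly as follows. This is a hypothetical illustration, not the poster's actual code (the function names and `events-YYYYMMDD-HH.log` naming scheme are invented): each incoming record is appended to a local file named for the current hour, so rotation falls out of the file name and each closed hour is a self-contained "log" ready to be pushed into HDFS. The HDFS push itself is not shown.

    ```python
    # Hypothetical sketch of "append to a local disk file, rotate hourly":
    # the file name encodes the hour, so a new log starts automatically
    # when the clock rolls over. Names are made up for illustration.
    import datetime
    import os

    def hourly_log_path(base_dir, now=None):
        now = now or datetime.datetime.utcnow()
        return os.path.join(base_dir, now.strftime("events-%Y%m%d-%H.log"))

    def append_record(base_dir, line, now=None):
        path = hourly_log_path(base_dir, now)
        with open(path, "a") as f:
            f.write(line + "\n")
        return path
    ```

    An hourly cron job would then ship every closed `events-*.log` file into HDFS (e.g. with `hdfs dfs -put`) and delete the local copy, leaving only the file for the current hour on disk.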

    At this point our daily dataset is a few terabytes in size, when considering the sum of all of the collecting servers across all of the hours. (There are some peak hours due to the nature of our business, so the volume isn't even across the whole day.)

    The only thing we've ever disliked about this system is the delay between data arriving and it being available to query. For a little while we experimented with Apache Storm for realtime log streaming, and produced a working prototype that was shown to work on a one-tenth sample of the data, but ultimately we concluded that the need for faster data wasn't strong enough to justify the additional complexity and stuck with the above design. Therefore I can't speak to how far that solution would scale, but if real-time analysis isn't a requirement -- and scaling up in data size is -- then I can certainly recommend the above design.
