Want to read Slashdot from your mobile device? Point it at m.slashdot.org and keep reading!


Forgot your password?
Databases Data Storage Oracle Stats

Ask Slashdot: Choosing a Data Warehouse Server System? 147

New submitter puzzled_decoy writes The company I work has decided to get in on this "big data" thing. We are trying to find a good data warehouse system to host and run analytics on, you guessed it, a bunch of data. Right now we are looking into MSSQL, a company called Domo, and Oracle contacted us. Google BigQuery may be another option. At its core, we need to be able to query huge amounts of data in sometimes rather odd ways. We need a strong ETLlayer, and hopefully we can put some nice visual reporting service on top of wherever the data is stored. So, what is your experience with "big data" servers and services? What would you recommend, and what are the pitfalls you've encountered?
This discussion has been archived. No new comments can be posted.

Ask Slashdot: Choosing a Data Warehouse Server System?

Comments Filter:
  • by iamwhoiamtoday ( 1177507 ) on Saturday November 08, 2014 @02:53AM (#48339293)

    Oregon Resident here. After the recent issues with Oracle..... yup. Not gonna recommend 'em again. Not a big fan of my tax money being wasted.

    • Re:Skip Oracle. (Score:5, Informative)

      by RuffMasterD ( 3398975 ) on Saturday November 08, 2014 @08:00AM (#48339829)
      Just from a technical and financial point of view, I wouldn't recommend Oracle either. Oracle Advanced Analytics just seems to be a very expensive way to get R.

      Financially - R is open source and free (as in both free as a bird, and free beer), so you don't need to buy it from Oracle. No doubt Oracle will make you buy their DBMS as well to work with Advanced Analytics, and a big server to run it on, plus support to get it up and running.

      Technically - Oracle make a good DMBS for sure, but you don't need all the advanced features their DBMS is good at, such record level locking, three phase commit, redo logs, conflict resolution etc. You need that sort of stuff to maintain data integrity on transaction processing systems, but not for analysis. For analysis you just need a giant de-normalised table, and maybe indexes if you want to pick out specific subsets of records without full table scans.

      Personally I use SAS. It's not sexy, but I have never found a dataset too large to handle. It will thrash the harddrive all night if it has to to get a result, but it won't crash. SPSS will definitely crap itself with even moderate datasets. Stata does OK, but even that can't handle the larger datasets. I haven't pushed R hard enough to find it's limit.
      • by Anonymous Coward

        While your post is informative, it sort of misses the mark. Granted, TFS is clearly more on the clueless side, but you should have realized that 'analytics' here does not mean actual analytics, as much as simple BI reporting. The main requirement is ETL, reports are 'nice to have'.

        Back to you post, it's nice to know about SAS. For SPSS you want to try and push as much processing into the DB as possible, otherwise you need to get the server version and throw hardware at it, the local server that comes with t

      • I don't get it. Why are you denormalizing your tables?

        If you're talking about denormalizing, you're talking about a relatively complicated data set, else there would be nothing to denormalize. Almost nothing you'll do in SAS on any resonably complex data requires all the fields. So any DB on the back end (postgres, mysql) should be able to join up what you need from a well-normalized dataset quickly.

        Or do you mean you're just making a big text file (or SAS data blob) and using that in SAS? If that's the

        • by Anonymous Coward

          I don't get it. Why are you denormalizing your tables?

          Most likely to simplify a star schema with many dimensions, it's a standard approach to keep your query run times relatively sane.

      • SAS may be the best answer to "query huge amounts of data in sometimes rather odd ways". Using SQL Server for storage is fine, but not using anything else in front of it (SSAS is useless) is bringing a knife to a gun fight. Trying to do everything in a relational way means tying a hand and a foot behind your back. The real world doesn't neatly fit the model, hard as you might try to make it, so performance suffers greatly and doing unusual ad hoc things takes longer to figure out. Get SAS to send pure relat

  • First step (Score:5, Insightful)

    by Anonymous Coward on Saturday November 08, 2014 @03:00AM (#48339309)

    The first step is to ask Slashdot a really vague question to a highly technical and expensive undertaking.

  • by Anonymous Coward

    Define the goals. Don't mistake software for creativity and insight. If your company is going to crunch a lot of data find someone qualified to think analytically and recommend the correct tools for the job.

    I hear that R is very upcoming in statistical work. I also hear that any other 'big data' solution is going to cost you as much as a full time employee anyway.

    Also, yes, skip Oracle. If you put that much effort in to tuning a system/the way you're asking the question nearly anything could come up wit

  • by ArchieBunker ( 132337 ) on Saturday November 08, 2014 @03:20AM (#48339355) Homepage

    Help do my job for me.

    • Re:Dear Slashdot, (Score:5, Insightful)

      by Sesostris III ( 730910 ) on Saturday November 08, 2014 @03:54AM (#48339423)
      Maybe. However I would also be interested in any answer (especially any answer involving FLOSS software). Interested not because it's my job or my company is looking to use such software, but because I'm curious and like to expand my knowledge.

      In general I don't mind such questions on Slashdot, as they're usually interesting and informative to the rest of us. And if they're not, then I (we) don't read the article!
    • Are you one of those people that think developers should do everything themselves without asking for assistance? That shit leads to really, really bad code.

      It may not be fashionable in your circles, but human communication is, and will always be, a basic element of engineering.
  • by Anonymous Coward

    The way you're going at it you're basically burning money. "We must have this big data thing too!" is every hardware vendor's eyes going "ka-ching" and you'll be overpaying whatever you do. Even if you think you're getting a good price.

    The problem with big data as a thing (BDaaT) is that without a clear goal you'll be gathering too much data and storing it for too long. Thereby you "need" too much processing power to shoot through it, and the only way left is downhill. This creates myriads of problems, of w

  • Postgres-XL (Score:5, Informative)

    by bruce_the_loon ( 856617 ) on Saturday November 08, 2014 @03:45AM (#48339403) Homepage

    Open-source so you don't have to cough up millions of dollars to see if you can get business.

    Clusterable, scalable and standards-based so you're not locking down too far into one solution-space.

  • by Anonymous Coward

    do your job or go apply at mcdonalds.

  • by Anonymous Coward

    Pretty easy to try it out immediately... http://aws.amazon.com/redshift [amazon.com]

  • Don't waste your time and money, just go with Hadoop.
    Need ETL? Well for one there is PIG, but if you want to do stream processing Apache Storm / Kafka.
    Take a look at this, http://hortonworks.com/hdp/ [hortonworks.com]
    All completely Open Source.

    • And if you want visual drag and drop ETL development and orchestration, use Pentaho Data Integration (a.k.a. Kettle). Comes in open source with an Apache license or professionally supported. Supports visual Map/Reduce development, integration with pig, scoop, oozie, ...
      For SQL you can use Hive but try one of the alternative engines like Impala as well.

    • If you want your Hadoop cluster to be fast and easy to use, go with Spark https://spark.apache.org/ [apache.org].

    • This site 's good. I will use it for my business http://www.hangngoainhap.com.v... [hangngoainhap.com.vn]
  • by TyFoN ( 12980 ) on Saturday November 08, 2014 @04:16AM (#48339457)

    Whatever you do, don't go mssql as you will end up processing most of your data in the analytics tool.
    I've seen it lock tables even on only reads causing other processes to be terminated.
    The closest it has got to materialized views are clustered indexed views which suck and can barely do any processing.

    • by Anonymous Coward

      Yeah, I use only the sql side, and I really hate ssrs and I never use ssas.
      SQL Server reporting services is great, until you want to do something really cool and complex, and then it is a hellish wasteland of tears.

    • by lucm ( 889690 )

      I've seen it lock tables even on only reads causing other processes to be terminated.

      That's because someone who does not understand how the product works has configured a serializable transaction isolation level. I would suggest to RTFM but maybe you need to start with the basics: http://en.wikipedia.org/wiki/I... [wikipedia.org]

    • by Anonymous Coward

      Whatever you do, don't go mssql as you will end up processing most of your data in the analytics tool.


      I've seen it lock tables even on only reads causing other processes to be terminated.

      Try enabling snapshot isolation if you want MVCC

      The closest it has got to materialized views are clustered indexed views which suck and can barely do any processing.

      Try columnstore indexes if you want your mind blown.

  • by Anonymous Coward on Saturday November 08, 2014 @04:17AM (#48339463)

    Big data is an entire field of study, this is not "should I use vi or emacs or nano" and even that requires a shitload of context and the source of flame wars until the end of time.

    Think about your budget, your audience, and the value that you can add by spending time and money on this.

    MapReduce (hadoop) is awesome and open source, you can run it in house or in multiple cloud offerings and has a tremendous community. BUT it sucks at relationships (foreign keys) graph calculations and others.

    Graph databases can make connections between things that are impossible in other systems, but are only good for graph relationships.

    OLAP data stored in n-dimensional cubes allows reporting and analysis if familiar tools that many analysts (not programmers) think is the cat's pajamas.

    Your best be is to slow down and talk to your users, while reading Seven Databases in Seven Weeks
    And then realize that you probably need to hire a consultant so you have somebody to fire when the whole thing goes south.

    • I'm out of modpoints but I would like to stress that _this post_ is an example of why Ask Slashdot is so successful at answering questions that boil down to "I don't know what I need to know to get this job done". This is the type of answer that will put the OP on the right track to figuring out what he needs.

      • Plus the strategic element of bringing in a consultant. Outside expertise is valuable not only for the expertise, but also because of other less tangible benefits. The outside guy is always more trusted by the business units. It is just human nature. You can lecture everyone on the benefits of some new initiative until you are blue in the face and get nowhere, but bring in a consulting firm to say the same thing and everyone suddenly thinks it is a great idea.

        The same goes for having a scapegoat when th

  • by thogard ( 43403 ) on Saturday November 08, 2014 @04:19AM (#48339469) Homepage

    If the data fits in a database, it is not Big Data.

  • by Sesostris III ( 730910 ) on Saturday November 08, 2014 @04:32AM (#48339485)
    If I was tasked with coming up with ideas for a Data Warehouse Server System, and given that I know almost nothing about such systems, my first port of call would probably be Apache. What about Cassandra, Hadoop, Hive, Mahout or Pig (or combinations thereof)? All of these are downloadable and playable-with (and being Apache, FLOSS).

    As a previous poster pointed out, there is also PostgreSQL, again FLOSS. Again downloadable and playable-with.
  • by zmooc ( 33175 ) <zmooc&zmooc,net> on Saturday November 08, 2014 @04:54AM (#48339509) Homepage

    Sounds like you're very good in the buzzword-department but have no idea what you're doing at all.... What kind of data are we talking about? Lots of writes? Lots of reads? Is the data suitable for splitting up? What kind of queries will you need to run? Do you need uptime? Or consistency?

    Also if you're looking at MSSQL or Oracle, you obviously DO NOT HAVE Big Data. Big Data is data that cannot be dealt with using regular RDBMSes. Do you really have or plan to have multiple terabytes of data? If not, you don't have big data.

    Based on the information you've given us we cannot give you any advice at all apart from stopping what you're doing and hiring an expert.

    • My thoughts exactly. This question is stupid.

    • by leuk_he ( 194174 )


      Big data is a different thing from datawarehousing.

      In a big data scenario you have lots of data, that you process with a highly scalable solution.

      In a databasehouse you collect data from different sources and transform them in several steps to a datamodel you can create reports from it in a simple way. .

      And there is the other option you just have to process lots of records from a simular source (measuring data), where you carefully monitor and tune the processing of that data.

      The question does not e

    • The key problem is most business run into their own insecurities.
      They are afraid of picking the uncool system that in 5 years would be scoffed at.
      Such as creating a new web app in Perl, nothing technically wrong but it isn't cool anymore.
      It's the no one has gotten fired for choosing IBM. It is more about picking the name that suppose to impress your customers. Not what is best for the job.

    • by drolli ( 522659 )

      "Big Data is data that cannot be dealt with using regular RDBMSes"

      Let's say: you may need or use a "regular" (RDBMs) for some things in big data, but it's not going to be your "Data Warehouse".

  • I would use open source and my own servers, but since you're considering Oracle and Microsoft,

    You should look at IBM Bluemix. I've heard good things about it. Watson integration.
  • We are trying to find a good data warehouse system to host and run analytics on

    You're asking the wrong questions.You should start higher up the chain in business-value land - WHYdo you need a data warehouse system (to run analytics)... great WHY do you need to run analytics (to discover XXXXX from the data we generate/own/handle). OK now you're getting closer... now, armed with the knowledge about what data you will be storing, and what kind of insights you would like to generate, you need to approach a spe

    • I would definitely recommend to go with a reputable external consultant when it comes to getting started with queries and reports. They will be able to come up with good questions to get from your data, but more importantly they can help avoid bad answers. For instance, given the initials, height, eye color, age and other such data of presidential candidates, I can probably come up with a filter that will correctly indicate whether or not the candidate won the elections, based on the data. But how useful
    • Right now, if you are starting with "Data Warehouse" you probably are using the wrong answer key to score your wrong questions.

  • by Ceriel Nosforit ( 682174 ) on Saturday November 08, 2014 @05:38AM (#48339559)

    If your company buys 'big data', I have a bridge to sell you.

    Know your data. Don't build a castle in the sky; that's how SAP happened.

  • by codepunk ( 167897 ) on Saturday November 08, 2014 @05:44AM (#48339565)

    1. Hire some bonehead that is expendable and ask him to make the decision.
    2. Fire him when the project fails.
    3. Nobody will ever bring this up again.

  • I've recently had good experiences with running SQL queries on fairly large (# records: 200 mln. plus) databases on a Teradata machine in a corporate environment. I wasn't involved in any sysadmin work, just the statistical modeling / analysis side of things.

    The company I consulted for uses SAS (on the mainframe, AIX boxes, and PC's) for almost all of its dataprocessing needs, including ETL work. Now they're looking at "Big Data" and discovered they need parallel processing to make it cost-effective (outp

    • Re: (Score:2, Informative)

      by Anonymous Coward

      Former back-office Teradata employee here. Teradata makes a very powerful product, but if security and availability of your data is critical, then I would look elsewhere. I'm not going to divulge any company secrets, but I will copy some snippets from employee reviews on Glassdoor:

      "Security is nonexistent. LAN credentials are sent in plain text (unencrypted) everywhere... CUSTOMER credentials to CUSTOMER systems (IP addresses and credentials) are sent in plain text (unencrypted)"

      "IT outages are frequent,

  • .. a slashdot topic on which I might actually be qualified to comment. I’ve spent a lot of time analysing suitable databases for data warehouse. As other commenters have mentioned you don’t really give enough detail about the types of data and likely use cases however I can assumer your going to do similar things to most of our customers. We have used 2 products in our business, both are column stores which tend to have the characteristics of very fast read/join and query but should not be used
  • by operator_error ( 1363139 ) on Saturday November 08, 2014 @07:56AM (#48339819)

    The ELK Stack might be an option. In my field, (many) web servers can stream all their logs off-site in Real-Time using Logstash Forwarder (or instead they might use rysnc, or rsyslog, or...). A central server, in the secure private intranet perhaps reads and indexes this log data, (that's ElasticSearch, which is sort of like a personal Google for your logs, any logs of any kind, or other Big Data). Kibana is a user-friendly Angular.js application and presentation layer. If you're familiar with NewRelic for server monitoring, you can save views just like when using that tool.

    http://jakege.blogspot.nl/2014... [blogspot.nl]

    Okay, maybe this is sort of like 'when all you have is a hammer, everything looks like a nail', but this suggestion is the extent of my background in this area. Although I have had an itch to scratch, and so far, this is my best open-source result.

    There's a ton of citations you should search for yourself, but I'll provide one I found that might start to help. Using this tool, it is fairly easy to parse out the myriad of hacker efforts at attacking the servers for example; even when you're the NY Times.

  • I know a few. They are all looking at options to get rid of Oracle, and often of Solaris as well. On the other hand, MSSQL is still basically a toy. It really depends on you data model and the queries you run. Key-value stores ("no SQL"), for example, are really easy to distribute over many servers.

  • Hi, i would take a look at exadata if you really need great performance. The oracle rdbms is one of the most reliable i know of so paired with hardware specifially designed to perform is a nice thing. Otherwise you could try to work with flash storage (flash cards) for really high performance but you will still need a good database. I dont know how good all these SQLDB (open, ms) work but for sure i know that they dont play a vital role in the enterprise environment i work in and know of. So stick with IBM/
  • This is a MASSIVE undertaking, requiring deep and profound strategic decisions to be made at the highest levels of the company/organization.

    To go all in on what advice you might receive from slashdot is fool hearty at best.

    Do yourself and your company a favor, hire a world class consultant to come in and provide some advice.

  • If you need to ask this question on Slashdot then chances are you don't have the skills to build and run such a system properly.

  • I'm a little late to the party so this might get buried but here goes.

    I would strongly recommend looking into Microsoft's Analytics Platform System(APS), Formerly Parallel Data Warehouse(PDW). It's an MPP appliance that combines PDW and Hadoop. I got to spend a week on one of these appliances recently and I can't wait to get back on it. It supports combined queries usinf polybase across Hadoop and the Data warehouse(as well as the cloud).

    Typically data scientists will want to work in Hadoop and use R,
    • "Those are a little easier to set up because you don't need to get your organizations IT server team involved" Your IT bretheren might be a little non-plussed being skipped in the process.
      • I would only recommend that for the Proof of concept. You will definitely need your infrastructure team involved when it comes to installing the appliance. And you need an executive to remove roadblocks and help make things happen once you get moving.

        With that said, I do understand that IT infrastructure can get a little butt hurt about installing appliances. Like I said, I've been doing tech for 30 years now and one trend I have noticed is that a lot of IT departments have drifted away from the customer
    • by mikaere ( 748605 )
      +1 to this, I did some training on this stuff earlier this year, and I was impressed with the overall offering. I particularly like the fact that you buy a pre-set appliance, so no mucking around with config and installations etc Also, the SQL 2014 columnstore indexes should give massive query speed improvements.
  • You can find out a lot in a few hours just by going to a Big Data meetup. Traditional database vendors are trying to hijack big data and make it their buzzword. Real big data players are using tools like Hadoop, Spark, Solr, Elastic Search and other tools that allow you to use commodity hardware to get a much more performant platform for big data. The appliance vendors have some interesting off the shelf stuff... you should really take some time to see what is going on... it's wild west time.

  • We use http://en.wikipedia.org/wiki/G... [wikipedia.org] which is a clustered Postgres implementation. It has its problems (Postgres 8.2? seriously?) But it is very fast for ETL and batch queries on large data sets. We house 100+TB and get excellent performance. Its commercial and you pay by the TB.

    Then there is also AWS Redshift. We have found it to be quicker at some things and possibly cheaper but immature in its feature set (no UDF, etc). The thinking here is that if you have a separate system for ETL, Redshift would m

  • Don't confuse a regular data warehouse with Big Data. If Big Data is a "thing" your company wants to get into, it probably does not apply to you.

    As for your data warehouse, MS SQL Server and is a good enough base to start with. IBM's DB2 is another underrated platform. Don't feed Oracle please.

  • Use SSD based storage for the data, so you don't have to wait for spindles. Seems that Pure Storage does it best of late, whereas other vendors have optimized the spindle based storage. PS did it from ground up. Best part is the documentation, Its ALL written on a single 3x5 card. No matter what software you use, skip the spindles.
  • This is a wildly nontrivial question. Volumes are written about building data warehouses, and there's a lot to consider. In a large complicated environment, you could spend weeks doing comparisons (some people spend years, but that seems extreme); and some of the decisions are worth weighing.

    The first question is what capability are you looking for -- why are you sure one of these vendors is correct, and have you truly explored your options? If you want a place to capture and gather lots of near-real-tim

  • I see lots of buzz words, but they don't make much sense together. Big Data and a Data Warehouse are not the same thing. If you *only* care about big data, you don't need to care about ETL. All of these things require you to know your data (and to have a goal). Its one of those things were the execution is a lot more important than the product chosen. Your goal cannot be 'get into this "big data" thing'. I'd recommend finding some user groups for the tools you're interested in and asking a few other
  • No offense, but from the sound of it you have no clue about a BI infrastructure, which is what you're talking about. If your company is serious they'll hire a team of 10 people w/ an average salary well north of 100k and have a couple million dollar budget per year for IT systems, including an analytic data base, ETL system, and BI application.

    My guess is that you just want to start off by incrementally building a DW and want ad hoc analytic capabilities. My proof of concept solution would be to use Penta

  • When a piece of data come in, store it everywhere you need it. This might be aggregated tables (if you don't use indexed views) or whatever you may need. If you have background processes like ETL, you'll use a lot of your hardware for processing at the expense of queries.

    Avoid ETL. You've got one shot to store your data everywhere.

If I have seen farther than others, it is because I was standing on the shoulders of giants. -- Isaac Newton