Google Launches Cloud Dataproc, a Managed Spark and Hadoop Big Data Service

An anonymous reader writes: Google has a new cloud service for running Hadoop and Spark called Cloud Dataproc, which is being launched in beta today. The platform supports real-time streaming, batch processing, querying, and machine learning. TechCrunch reports: "Greg DeMichillie, director of product management for Google Cloud Platform, told me Dataproc users will be able to spin up a Hadoop cluster in under 90 seconds — significantly faster than other services — and Google will only charge 1 cent per virtual CPU/hour in the cluster. That's on top of the usual cost of running virtual machines and data storage, but as DeMichillie noted, you can add Google's cheaper preemptible instances to your cluster to save a bit on compute costs. Billing is per-minute, with a 10-minute minimum."
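
The quoted pricing is simple enough to sanity-check. Below is a minimal, illustrative Python sketch of the surcharge arithmetic described above; the function name and example figures are ours, not Google's, and VM and storage costs are billed separately.

```python
def dataproc_surcharge(vcpus, minutes, rate_per_vcpu_hour=0.01):
    """Dataproc premium in dollars: 1 cent per vCPU-hour,
    billed per minute with a 10-minute minimum."""
    billed_minutes = max(minutes, 10)
    return vcpus * rate_per_vcpu_hour * (billed_minutes / 60.0)

# e.g. a 16-vCPU cluster that runs for 45 minutes costs an extra $0.12
print("$%.2f" % dataproc_surcharge(16, 45))
```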

Is Big Data Leaving Hadoop Behind?

knightsirius writes: Big Data was seen as one of the next big drivers of the computing economy, and Hadoop was seen as a key component of those plans. However, Hadoop has had a less than stellar six months, beginning with the lackluster Hortonworks IPO last December and the security concerns raised by some analysts. Another survey finds only a quarter of big data decision makers actively considering Hadoop. With rival Apache Spark on the rise, is Hadoop being bypassed in big data solutions?

GCHQ Builds a Raspberry Pi Super Computer Cluster

mikejuk writes: GCHQ, the UK equivalent of the NSA, has created a 66-node Raspberry Pi cluster called the Bramble for "educational" purposes. What those educational purposes are isn't exactly clear, but you do tend to associate supercomputers with spooks and spies. It seems that there was an internal competition to invent something, and three unnamed GCHQ technologists decided that other Pi clusters were too ad hoc. They set themselves the target of creating a cluster that could be reproduced as a standard architecture to create a commodity cluster. The basic unit of the cluster is a set of eight networked Pis, called an "OctaPi". Each OctaPi can be used standalone or hooked up to make a bigger cluster. In the case of the Bramble, eight OctaPis make the cluster 64 processors strong. In addition, there are two head control nodes, which couple the cluster to the outside world. Each head node has one Pi, wired and WiFi connections, a realtime clock, a touch screen and a camera. This is where the story becomes really interesting. Rather than just adopt a standard cluster application like Hadoop, the OctaPi's creators decided to develop their own. After three iterations, the software that manages the cluster is now based on Node.js, Bootstrap and Angular. So what is it all for? The press release says: "The initial aim for the cluster was as a teaching tool for GCHQ's software engineering community....The ultimate aim is to use the OctaPi concept in schools to help teach efficient and effective programming."

Microsoft's First Azure Hosted Service Is Powered By Linux

jones_supa (887896) writes "Canonical, through John Zannos, VP of Cloud Alliances, has proudly announced that the first Microsoft Azure hosted service to be powered by Linux will run on Ubuntu. This piece of news comes from the Strata + Hadoop World Conference, which takes place this week in California. The news in fact came from Microsoft, which yesterday at the event announced the preview of Azure HDInsight (an Apache Hadoop-based hosted service) on Ubuntu clusters. This is definitely great news for Canonical, as its operating system is getting recognized as extremely reliable for handling Big Data. Ubuntu is now the leading cloud and scale-out Linux-based operating system."

Meet Flink, the Apache Software Foundation's Newest Top-Level Project

Open source data-processing engine Flink, after just nine months' incubation with the Apache Software Foundation, has been elevated to top-level status, joining other ASF projects like OpenOffice and CloudStack. An anonymous reader writes: The engine, which offers APIs in Java and Scala as well as specialized APIs for graph processing, is presented as an alternative to Hadoop's MapReduce component, with its own runtime. Yet the system still provides access to Hadoop's distributed file system and YARN resource manager. The open-source community around Flink has steadily grown since the project's inception at the Technical University of Berlin in 2009. Now at version 0.7.0, Flink lists more than 70 contributors and sponsors, including representatives from Hortonworks, Spotify and Data Artisans (a German startup devoted primarily to the development of Flink). (For more about ASF incubation, and what the Foundation's stewardship means, see our interview from last summer with ASF executive VP Rich Bowen.)
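
At version 0.7.0 Flink's APIs were Java and Scala only, but the dataflow style it popularized is easy to illustrate. Here is a hedged sketch of a keyed word count using the Python API (PyFlink) that much later Flink releases ship; it illustrates the programming model, not code that would run against the 0.7.0 release discussed above.

```python
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

# Count words by key: map to (word, 1) pairs, group by word, reduce by sum.
counts = (
    env.from_collection(["hadoop", "flink", "flink", "spark"])
       .map(lambda word: (word, 1))
       .key_by(lambda pair: pair[0])
       .reduce(lambda a, b: (a[0], a[1] + b[1]))
)
counts.print()
env.execute("word_count")
```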

The Joys and Hype of Hadoop

theodp writes "Investors have poured over $2 billion into businesses built on Hadoop," writes the WSJ's Elizabeth Dwoskin, "including Hortonworks Inc., which went public last week, its rivals Cloudera Inc. and MapR Technologies, and a growing list of tiny startups. Yet companies that have tried to use Hadoop have met with frustration." Dwoskin adds that Hadoop vendors are responding with improvements and additions, but for now, "It can take a lot of work to combine data stored in legacy repositories with the data that's stored in Hadoop. And while Hadoop can be much faster than traditional databases for some purposes, it often isn't fast enough to respond to queries immediately or to work on incoming information in real time. Satisfying requirements for data security and governance also poses a challenge."

The Great IT Hiring He-Said / She-Said

Nemo the Magnificent writes: Is there an IT talent shortage? Or is there a clue shortage on the hiring side? Hiring managers put on their perfection goggles and write elaborate job descriptions laying out mandatory experience and know-how that the "purple squirrel" candidate must have. They define job openings to be entry-level, automatically excluding those in mid-career. Candidates suspect that the only real shortage is one of willingness to pay what they are worth. Job seekers bend over backwards to make it through HR's keyword filters, only to be frustrated by phone screens seemingly administered by those who know only buzzwords.

Meanwhile, hiring managers feel the pressure to fill openings instantly with exactly the right person, and when they can't, the team and the company suffer. InformationWeek lays out a number of ways the two sides can start listening to each other. For example, some of the most successful companies find their talent through engagement with the technical community, participating in hackathons or offering seminars on hot topics such as Scala and Hadoop. These companies play a long game in order to lodge in the consciousness of the candidates they hope will apply next time they're ready to make a move.

The Apache Software Foundation Now Accepting Bitcoin For Donations

rbowen writes: The Apache Software Foundation is the latest not-for-profit organization to accept bitcoin donations, as pointed out by a user on the Bitcoin subreddit. The organization is well known for its catalog of open-source software, including the ubiquitous Apache web server, Hadoop, Tomcat, Cassandra, and about 150 other projects. Users in the community have been eager to support its efforts with digital currency for quite a while. The Foundation accepts donations in many different forms: Amazon, PayPal, and even donated cars. On its contribution page, the Apache Software Foundation has published a bitcoin address and QR code.

Book Review: Data-Driven Security: Analysis, Visualization and Dashboards

benrothke writes: There is a not-so-fine line between data dashboards and other information displays that provide pretty but otherwise useless, unactionable information, and those that provide effective answers to key questions. Data-Driven Security: Analysis, Visualization and Dashboards is all about the latter. In this extremely valuable book, authors Jay Jacobs and Bob Rudis show you how to find security patterns in your data logs and extract enough information from them to create effective information security countermeasures. By using data correctly and truly understanding what that data means, the authors show how you can achieve much greater levels of security. Keep reading for the rest of Ben's review.
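
As a flavor of the kind of log analysis the book teaches, here is a toy Python sketch (ours, not an excerpt from the book; it assumes a standard SSH auth log) that surfaces one simple security pattern: hosts with an unusual number of failed logins.

```python
import re
from collections import Counter

# Matches failed-password lines in a typical /var/log/auth.log
FAILED = re.compile(r"Failed password .* from (\d+\.\d+\.\d+\.\d+)")

def noisy_hosts(log_lines, threshold=5):
    """Return source IPs with at least `threshold` failed logins."""
    hits = Counter()
    for line in log_lines:
        match = FAILED.search(line)
        if match:
            hits[match.group(1)] += 1
    return {ip: n for ip, n in hits.items() if n >= threshold}

# Usage:
#   with open("/var/log/auth.log") as f:
#       print(noisy_hosts(f))
```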

Job Postings For Python, NoSQL, Apache Hadoop Way Up This Year

Nerval's Lobster writes: "Dice [note: our corporate overlord] collects a ton of data from job postings. Its latest findings? The number of jobs posted for NoSQL experts has risen 54 percent year-over-year, ahead of postings for professionals skilled in so-called 'Big Data' (up 46 percent), Apache Hadoop (43 percent), and Python (16 percent). Employers are also seeking those with expertise in Software-as-a-Service platforms, to the tune of 20 percent more job postings over the past twelve months; in a similar vein, postings for tech professionals with some cloud experience have leapt 27 percent in the same period. Nothing earth-shattering here, but it's perhaps interesting to note that, for all the hype surrounding some of these things, there's actually significant demand behind them."

Ask Slashdot: Which NoSQL Database For New Project?

DorianGre writes: "I'm working on a new independent project. It involves iPhones and Android phones talking to PHP (Symfony) or Ruby/Rails. Each incoming call will be a data element POST, and I would like to simply write that into the database for later use. I'll need to be able to pull by date or by a number of key fields, as well as do trend reporting over time on the totals of a few fields. I would like to start with a NoSQL solution for scaling, and ideally it would be dead simple if possible. I've been looking at MongoDB, Couchbase, Cassandra/Hadoop and others. What do you recommend? What problems have you run into with the ones you've tried?"
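
If MongoDB makes the shortlist, the write-then-report pattern described above maps onto it almost directly. Here is a minimal sketch with pymongo; the collection and field names are invented for illustration, and this is not a recommendation over the alternatives.

```python
from datetime import datetime, timedelta
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["myapp"]

# Each incoming POST becomes one document, written as-is.
db.events.insert_one({
    "device": "iphone",
    "created": datetime.utcnow(),
    "duration_ms": 1240,
})

# Pull by date...
last_week = datetime.utcnow() - timedelta(days=7)
recent = list(db.events.find({"created": {"$gte": last_week}}))

# ...and trend-report daily totals of a field.
daily = db.events.aggregate([
    {"$group": {
        "_id": {"$dateToString": {"format": "%Y-%m-%d", "date": "$created"}},
        "total_ms": {"$sum": "$duration_ms"},
        "calls": {"$sum": 1},
    }},
    {"$sort": {"_id": 1}},
])
```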

Spark Advances From Apache Incubator To Top-Level Project

rjmarvin writes "The Apache Software Foundation announced that Spark, the open-source cluster-computing framework for Big Data analysis, has graduated from the Apache Incubator to a top-level project. A project management committee will guide the project's day-to-day operations, and Databricks cofounder Matei Zaharia will be appointed VP of Apache Spark. Spark runs programs up to 100x faster than Apache Hadoop MapReduce in memory, and it provides APIs that enable developers to rapidly develop applications in Java, Python or Scala, according to the ASF."
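
For a taste of the API the ASF is describing, here is the canonical word count in PySpark (a minimal sketch, assuming a local Spark installation and an input.txt file in the working directory):

```python
from pyspark import SparkContext

sc = SparkContext("local", "wordcount")

counts = (
    sc.textFile("input.txt")
      .flatMap(lambda line: line.split())
      .map(lambda word: (word, 1))
      .reduceByKey(lambda a, b: a + b)
)
print(counts.collect())
sc.stop()
```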

Ask Slashdot: Do You Run a Copy-Cat Installation At Home?

Lab Rat Jason writes "During a discussion with my wife last night, I came to the realization that the primary reason I have a Hadoop cluster tucked under my desk at home (I work in an office) is because my drive for learning is too aggressive for my IT department's security policy, as well as their hardware budget. But on closer inspection the issue runs even deeper than that. Time spent working on the somewhat menial tasks of the day job prevent me from spending time learning new tech that could help me do the job better. So I do my learning on my own time. As I thought about it, I don't know a single developer who doesn't have a home setup that allows them to tinker in a more relaxed environment. Or, put another way, my home setup represents the place I wish my company was going. So my question to Slashdot is this: How many of you find yourselves investing personal time to learn things that will directly benefit your employer, and how many of you are able to 'separate church and state?'"

Facebook Testing Screen-Tracking Software For Users

cagraham writes "Facebook is currently testing software that would track users' cursor movements, as well as monitor how often a user's newsfeed is visible on their mobile phone, according to the Wall Street Journal. The additional data from such tracking could let Facebook raise its ad prices, as it could deliver even more information about users' on-site behavior to advertisers, such as how long users hovered over specific ads. To analyze the extra data, Facebook will use a custom version of Hadoop."
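
Facebook's pipeline is proprietary, but the shape of that aggregation is classic MapReduce. As a purely hypothetical illustration (the record format and field names are invented), here is a local, single-process Python simulation of a job totaling hover time per ad; in a real Hadoop Streaming run, the two phases would be separate mapper and reducer scripts.

```python
import sys
from collections import defaultdict

def map_phase(lines):
    """Map: "user_id<TAB>ad_id<TAB>hover_ms" lines -> (ad_id, hover_ms) pairs."""
    for line in lines:
        try:
            _user, ad_id, hover_ms = line.rstrip("\n").split("\t")
            yield ad_id, int(hover_ms)
        except ValueError:
            continue  # skip malformed records

def reduce_phase(pairs):
    """Reduce: total hover milliseconds per ad."""
    totals = defaultdict(int)
    for ad_id, hover_ms in pairs:
        totals[ad_id] += hover_ms
    return dict(totals)

if __name__ == "__main__":
    print(reduce_phase(map_phase(sys.stdin)))
```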

DEF CON Hackers Unveil a New Way of Visualizing Web Vulnerabilities

punk2176 writes "Hacker and security researcher Alejandro Caceres (developer of the PunkSPIDER project) and 3D UI developer Teal Rogers unveiled a new free and open source tool at DEF CON 21 that could change the way users view the web and its vulnerabilities. The project is a visualization system that combines the principles of offensive security, 3D data visualization, and 'big data' to let users understand the complex interconnections between websites. Using a highly distributed HBase back-end and a Hadoop-based vulnerability scanner and web crawler, the project is meant to improve the average user's understanding of the unseen and potentially vulnerable underbelly of the web applications they own or use. The makers are calling this new method of visualization web 3.0. A free demo can be found here, where users can play with and navigate an early version of the tool via a web interface. More details can be found here, and interested users can opt in to the mailing list and eventually the closed beta here."
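
The project's own code is what the links above point to; purely as a generic illustration of the HBase access pattern behind such a scanner (the table, column, and row-key names here are invented), the Python happybase client works like this:

```python
import happybase

# Connect through an HBase Thrift gateway.
conn = happybase.Connection("localhost")
table = conn.table("scan_results")

# Store one finding per row, keyed by domain.
table.put(b"example.com", {
    b"vuln:xss": b"3",
    b"vuln:sqli": b"1",
})

# Scan a slice of the keyspace to feed a visualization.
for key, data in table.scan(row_prefix=b"example."):
    print(key, data)
```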

Why Netflix Is One of the Most Important Cloud Computing Companies

Brandon Butler writes "Netflix, yes the video rental company Netflix, is changing the cloud game. During the past two years the company has pulled back the curtain through its Netflix OSS program to provide a behind-the-scenes look at how it runs one of the largest deployments of Amazon Web Services cloud-based resources. In doing so, the company is creating tools that can be used for full business-scale cloud deployments and smaller test environments alike. The Simian Army, for example, randomly kills off VMs or entire availability zones in Amazon's cloud to test fault tolerance; Asgard is a cloud resource dashboard; and Lipstick on (Apache) Pig is a data visualization tool for Hadoop jobs. There are dozens of other tools that help deploy, manage and monitor the tens of thousands of VM instances the company can be running at any one time. Netflix is also creating a cadre of developers who are experts in managing cloud deployments, and its former employees are already popping up at other companies to bring their expertise in running large-scale cloud resources. Meanwhile, Netflix does all this in AWS's cloud, which raises the question of how good a job it's actually doing when it can be massively impacted by cloud outages, such as the one on Christmas Eve last year that brought down Netflix's service but, interestingly, not Amazon's own video streaming system, a competitor to Netflix."
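
The Simian Army itself is open source Java; purely to illustrate the core idea (this is not Netflix's implementation), a chaos-monkey-style check can be sketched in a few lines with boto3, the modern AWS SDK for Python:

```python
import random
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Gather the currently running instances...
resp = ec2.describe_instances(
    Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
)
instances = [
    inst["InstanceId"]
    for reservation in resp["Reservations"]
    for inst in reservation["Instances"]
]

# ...and terminate one at random to verify the service survives the loss.
if instances:
    victim = random.choice(instances)
    print("terminating %s" % victim)
    ec2.terminate_instances(InstanceIds=[victim])
```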

Why the 'Star Trek Computer' Will Be Open Source and Apache Licensed

psykocrime writes "The crazy kids at Fogbeam Labs have a new blog post positing that there is a trend towards advanced projects in NLP, Information Retrieval, Big Data and the Semantic Web moving to the Apache Software Foundation. Considering that Apache UIMA is a key component of IBM Watson, is it wrong to believe that the organization behind Hadoop, OpenNLP, Jena, Stanbol, Mahout and Lucene will ultimately be the home of a real 'Star Trek Computer'? Quoting: 'When we talk about how the Star Trek computer had “access to all the data in the known Universe”, what we really mean is that it had access to something like the Semantic Web and the Linked Data cloud. Jena provides a programmatic environment for RDF, RDFS and OWL, SPARQL and includes a rule-based inference engine. ... In addition to supporting the natural language interface with the system, OpenNLP is a powerful library for extracting meaning (semantics) from unstructured data - specifically textual data in an unstructured (or semi structured) format. An example of unstructured data would be the blog post, an article in the New York Times, or a Wikipedia article. OpenNLP combined with Jena and other technologies, allows “The computer” to “read” the Web, extracting meaningful data and saving valid assertions for later use.'" Speaking of the Star Trek computer, I'm continually disappointed that neither Siri nor Google Now can talk to me in Majel Barrett's voice.
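
Jena itself is a Java library; to keep the sketches in this digest in one language, here is the same RDF-plus-SPARQL idea expressed with Python's rdflib (the triples are invented for illustration):

```python
from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/")
g = Graph()

# Assert a couple of facts ("triples").
g.add((EX.Enterprise, EX.commandedBy, EX.Kirk))
g.add((EX.Kirk, EX.rank, Literal("Captain")))

# Ask a SPARQL question against the graph.
query = """
    SELECT ?ship ?officer WHERE {
        ?ship <http://example.org/commandedBy> ?officer .
    }
"""
for ship, officer in g.query(query):
    print(ship, officer)
```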

Netflix: 'Arrested Development' Won't Crash Our Service

Nerval's Lobster writes "No, the latest season of 'Arrested Development' won't fatally crash Netflix, despite comedian David Cross's tongue-in-cheek comment that the series will melt down the company's servers on its first weekend of streaming availability. 'No one piece of content can have that kind of impact given the size of what we are serving up at any given time,' a spokesperson wrote in an email to Slashdot. Although 'Arrested Development' struggled to survive during its three seasons on Fox (from 2003 to 2006), the series has built a significant cult following in the years following its cancellation. Netflix commissioned a fourth season as part of a broader plan to augment its streaming service with exclusive content, and will release all 13 new episodes at once on May 26. Like Facebook, Google, and other Internet giants, Netflix has invested quite a bit in physical infrastructure and engineers. It stores its data on Amazon's Simple Storage Service (S3), which offers a significant degree of durability and scalability; it also relies on Amazon's Elastic MapReduce (EMR) distribution of Apache Hadoop, along with tools within the Hadoop ecosystem such as Hive and Pig. That sort of backend can allow the company to handle much more than 13 seasons' worth of Bluths binged over one weekend — but that doesn't mean its streaming service is immune from the occasional high-profile failure."

Google's BigQuery Vs. Hadoop: a Matchup

Nerval's Lobster writes "Ready to 'Analyze terabytes of data with just a click of a button?' That's the claim Google makes with its BigQuery platform. But is BigQuery really an analytics superstar? It was unveiled in Beta back in 2010, but recently gained some improvements such as the ability to do large joins. In the following piece, Jeff Cogswell compares BigQuery to some other analytics and OLAP tools, and hopefully that'll give some additional context to anyone who's thinking of using BigQuery or a similar platform for data. His conclusion? In the end, BigQuery is just another database. It can handle massive amounts of data, but so can Hadoop. It's not free, but neither is Hadoop once you factor in the cost of the hardware, support, and the paychecks of the people running it. The public version of BigQuery probably isn't even used by Google, which likely has something bigger and better that we'll see in five years or so."

Rackspace Goes On Rampage Against Patent Trolls

girlmad writes "Rackspace has come out fighting against one of the U.S.'s most notorious patent trolls, Parallel Iron. The cloud services firm says it's totally fed up with trolls of all kinds, which have caused a 500 percent rise in its legal bills. Rackspace was last week named among 12 firms accused of infringing Parallel Iron's Hadoop Distributed File System patents. Rackspace is now counter-suing the troll, saying it already has a deal in place with Parallel Iron from a previously signed patent settlement."