rjmarvin writes "The Apache Software Foundation announced that Spark, the open-source cluster-computing framework for Big Data analysis, has graduated from the Apache Incubator to a top-level project. A project management committee will guide the project's day-to-day operations, and Databricks cofounder Matei Zaharia will be appointed VP of Apache Spark. Spark runs programs up to 100x faster than Apache Hadoop MapReduce when working in memory, and it provides APIs that let developers rapidly build applications in Java, Python or Scala, according to the ASF."
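For readers who haven't seen Spark's programming model, the kind of pipeline its API expresses — map a function over records, then aggregate by key — can be sketched in plain Python. This is not Spark code (no cluster, no RDDs), just the shape of the canonical word-count example using stdlib equivalents:

```python
from collections import Counter

# The canonical Spark word count, mirrored with plain-Python stand-ins
# for flatMap -> map -> reduceByKey over an in-memory "dataset".
lines = ["spark graduated from the incubator", "spark runs in memory"]

# flatMap: split each line into individual words
words = [w for line in lines for w in line.split()]

# map + reduceByKey: pair each word with 1, then sum the counts per word
counts = Counter(words)

print(counts["spark"])  # "spark" appears in both lines -> 2
```

In actual Spark the same steps run in parallel across a cluster, which is where the claimed speedups over Hadoop MapReduce come from.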
Lab Rat Jason writes "During a discussion with my wife last night, I came to the realization that the primary reason I have a Hadoop cluster tucked under my desk at home (I work in an office) is because my drive for learning is too aggressive for my IT department's security policy, as well as their hardware budget. But on closer inspection the issue runs even deeper than that. Time spent working on the somewhat menial tasks of the day job prevents me from spending time learning new tech that could help me do the job better. So I do my learning on my own time. As I thought about it, I don't know a single developer who doesn't have a home setup that allows them to tinker in a more relaxed environment. Or, put another way, my home setup represents the place I wish my company was going. So my question to Slashdot is this: How many of you find yourselves investing personal time to learn things that will directly benefit your employer, and how many of you are able to 'separate church and state?'"
cagraham writes "Facebook is currently testing software that would track users' cursor movements, as well as monitor how often a user's newsfeed is visible on their mobile phone, according to the Wall Street Journal. The additional data from such tracking would potentially let Facebook raise its ad prices, as it could deliver even more information about users' on-site behavior to advertisers, such as how long users hovered over specific ads. In order to analyze the extra data, Facebook will utilize a custom version of Hadoop."
punk2176 writes "Hacker and security researcher Alejandro Caceres (developer of the PunkSPIDER project) and 3D UI developer Teal Rogers unveiled a new free and open source tool at DEF CON 21 that could change the way that users view the web and its vulnerabilities. The project is a visualization system that combines the principles of offensive security, 3D data visualization, and 'big data' to allow users to understand the complex interconnections between websites. Using a highly distributed HBase back-end and a Hadoop-based vulnerability scanner and web crawler, the project is meant to improve the average user's understanding of the unseen and potentially vulnerable underbelly of the web applications that they own or use. The makers are calling this new method of visualization web 3.0. A free demo can be found here, where users can play with and navigate an early version of the tool via a web interface. More details can be found here, and interested users can opt in to the mailing list and eventually the closed beta here."
Brandon Butler writes "Netflix, yes the video rental company Netflix, is changing the cloud game. During the past two years the company has pulled back the curtains through its Netflix OSS program to provide a behind-the-scenes look at how it runs one of the largest deployments of Amazon Web Services cloud-based resources. In doing so, the company is creating tools that can be used by full business-scale cloud deployments and smaller test environments alike. The Simian Army, for example, randomly kills off VMs or entire availability zones in Amazon's cloud to test fault tolerance; Asgard is a cloud resource dashboard; and Lipstick on (Apache) Pig is a data visualization tool for Hadoop jobs written in Pig. There are dozens of others that help deploy, manage and monitor the tens of thousands of VM instances the company can be running at any one time. Netflix is also creating a cadre of developers who are experts in managing cloud deployments, and already its former employees are popping up at other companies to bring their expertise in running large-scale cloud resources. Meanwhile, Netflix does all this in AWS's cloud, which raises some questions about how good a job it's actually doing when it can be massively impacted by cloud outages, such as the one on Christmas Eve last year that brought down Netflix's services but, interestingly, not Amazon's own video streaming service, which competes with Netflix."
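The Simian Army idea — deliberately terminating random instances to prove the rest of the system survives — is simple enough to sketch. This toy version (hypothetical names, no AWS API calls) just selects victims from a fleet the way Chaos Monkey selects instances:

```python
import random

def chaos_monkey(instances, kill_fraction=0.2, seed=None):
    """Randomly pick instances to terminate, Chaos Monkey style.

    The real Simian Army calls the AWS API to kill actual VMs; here we
    only return the chosen victims, so fault-tolerance logic can be
    exercised safely in a test environment.
    """
    rng = random.Random(seed)
    n_kill = max(1, int(len(instances) * kill_fraction))
    return rng.sample(instances, n_kill)

fleet = ["i-0001", "i-0002", "i-0003", "i-0004", "i-0005"]
victims = chaos_monkey(fleet, kill_fraction=0.4, seed=42)
survivors = [i for i in fleet if i not in victims]
print(len(victims), len(survivors))  # 2 killed, 3 left to serve traffic
```

The point of the exercise is that the surviving instances must keep serving traffic; if they can't, the failure was found on your schedule rather than the cloud provider's.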
psykocrime writes "The crazy kids at Fogbeam Labs have a new blog post positing that there is a trend toward advanced projects in NLP, Information Retrieval, Big Data and the Semantic Web moving to the Apache Software Foundation. Considering that Apache UIMA is a key component of IBM Watson, is it wrong to believe that the organization behind Hadoop, OpenNLP, Jena, Stanbol, Mahout and Lucene will ultimately be the home of a real 'Star Trek Computer'? Quoting: 'When we talk about how the Star Trek computer had “access to all the data in the known Universe”, what we really mean is that it had access to something like the Semantic Web and the Linked Data cloud. Jena provides a programmatic environment for RDF, RDFS and OWL, SPARQL and includes a rule-based inference engine. ... In addition to supporting the natural language interface with the system, OpenNLP is a powerful library for extracting meaning (semantics) from unstructured data - specifically textual data in an unstructured (or semi-structured) format. An example of unstructured data would be a blog post, an article in the New York Times, or a Wikipedia article. OpenNLP, combined with Jena and other technologies, allows “The computer” to “read” the Web, extracting meaningful data and saving valid assertions for later use.'" Speaking of the Star Trek computer, I'm continually disappointed that neither Siri nor Google Now can talk to me in Majel Barrett's voice.
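To make the Jena/RDF part concrete: RDF stores assertions as subject-predicate-object triples, and a rule engine derives new triples from existing ones. A toy plain-Python illustration (not Jena's API) of a single transitivity rule looks like this:

```python
# Tiny triple store plus one inference rule, illustrating what an
# RDF store with a rule engine (like Jena's) does at its core.
triples = {
    ("Enterprise", "isA", "Starship"),
    ("Starship", "isA", "Vehicle"),
}

def infer_transitive(triples, predicate="isA"):
    """Repeatedly apply: (a isA b) and (b isA c) => (a isA c)."""
    derived = set(triples)
    changed = True
    while changed:
        changed = False
        for (a, p1, b) in list(derived):
            for (b2, p2, c) in list(derived):
                if p1 == p2 == predicate and b == b2:
                    if (a, predicate, c) not in derived:
                        derived.add((a, predicate, c))
                        changed = True
    return derived

facts = infer_transitive(triples)
# The fact below was never stated; the rule engine derived it.
print(("Enterprise", "isA", "Vehicle") in facts)
```

Jena does the same thing at web scale with standardized vocabularies (RDFS, OWL) and SPARQL as the query language, which is what "saving valid assertions for later use" refers to.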
Nerval's Lobster writes "No, the latest season of 'Arrested Development' won't fatally crash Netflix, despite comedian David Cross's tongue-in-cheek comment that the series will melt down the company's servers on its first weekend of streaming availability. 'No one piece of content can have that kind of impact given the size of what we are serving up at any given time,' a spokesperson wrote in an email to Slashdot. Although 'Arrested Development' struggled to survive during its three seasons on Fox (from 2003 to 2006), the series has built a significant cult following in the years following its cancellation. Netflix commissioned a fourth season as part of a broader plan to augment its streaming service with exclusive content, and will release all 13 new episodes at once on May 26. Like Facebook, Google, and other Internet giants, Netflix has invested quite a bit in physical infrastructure and engineers. It stores its data on Amazon's Simple Storage Service (S3), which offers a significant degree of durability and scalability; it also relies on Amazon's Elastic MapReduce (EMR) distribution of Apache Hadoop, along with tools within the Hadoop ecosystem such as Hive and Pig. That sort of backend can allow the company to handle much more than 13 episodes' worth of Bluths binged over one weekend — but that doesn't mean its streaming service is immune from the occasional high-profile failure."
Nerval's Lobster writes "Ready to 'Analyze terabytes of data with just a click of a button?' That's the claim Google makes with its BigQuery platform. But is BigQuery really an analytics superstar? It was unveiled in Beta back in 2010, but recently gained some improvements such as the ability to do large joins. In the following piece, Jeff Cogswell compares BigQuery to some other analytics and OLAP tools, and hopefully that'll give some additional context to anyone who's thinking of using BigQuery or a similar platform for data. His conclusion? In the end, BigQuery is just another database. It can handle massive amounts of data, but so can Hadoop. It's not free, but neither is Hadoop once you factor in the cost of the hardware, support, and the paychecks of the people running it. The public version of BigQuery probably isn't even used by Google, which likely has something bigger and better that we'll see in five years or so."
girlmad writes "Rackspace has come out fighting against one of the U.S.'s most notorious patent trolls, Parallel Iron. The cloud services firm said it's totally fed up with trolls of all kinds, which have caused a 500 percent rise in its legal bills. Rackspace was last week named among 12 firms accused of infringing Parallel Iron's Hadoop Distributed File System patents. Rackspace is now counter-suing the troll, since the firm says it already has a deal in place with Parallel Iron from a previous patent settlement."
vu1986 writes "With the latest updates — announced in a blog post by BigQuery Product Manager Ju-kay Kwek on Thursday — users can now join large tables, import and query timestamped data, and aggregate large collections of distinct values. It's hardly the equivalent of Google launching Compute Engine last summer, but as (arguably) the inspiration for the SQL-on-Hadoop trend that's sweeping the big data world right now, every improvement to BigQuery is notable."
First time accepted submitter sagecreek writes "Hadoop is an open-source, Java-based framework for large-scale data processing. Typically, it runs on big clusters of computers working together to crunch large chunks of data. You also can run Hadoop in single-node mode on a Linux machine, Windows PC or Mac, to learn the technology or do testing and debugging. The Hadoop framework, however, is not quickly mastered. Apache's Hadoop wiki cautions: "If you do not know about classpaths, how to compile and debug Java code, step back from Hadoop and learn a bit more about Java before proceeding." But if you are reasonably comfortable with Java, the well-written Hadoop Beginner's Guide by Garry Turkington can help you start mastering this rising star in the Big Data constellation." Read below for the rest of Si's review.
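For beginners put off by the wiki's classpath warning, it helps to see how small the MapReduce programming model itself is. Hadoop Streaming even lets you supply the mapper and reducer as ordinary scripts; the pair below sketches word count as Python functions over in-memory lines rather than stdin, but the logic is the same:

```python
from itertools import groupby

def mapper(lines):
    """Emit (word, 1) for every word, like a streaming mapper
    writing 'word<TAB>1' lines to stdout."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reducer(pairs):
    """Sum the counts per word. Hadoop sorts mapper output by key
    before the reducer sees it, so we sort here to mimic the shuffle."""
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield (word, sum(count for _, count in group))

result = dict(reducer(mapper(["hadoop runs jobs", "hadoop scales"])))
print(result["hadoop"])  # 2
```

Once the map/shuffle/reduce pattern is clear in miniature like this, the Java API and cluster configuration that the book covers are much less intimidating.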
dp619 writes "In an interview, Microsoft Regional Director Patrick Hynds says that avoidance of open source components by a large part of the .NET developer population is abating. '...While some may still steer clear of the GPL, there are dozens of FOSS licenses that are compatible with Windows developers and their customers,' he said. Hynds cites NuGet, an open source package management system originally built by Microsoft and now an Outercurve Foundation project, as an example of the FOSS libraries that .NET developers are adopting for their applications. Microsoft itself has embraced open source — to a point. It has partnered with Hortonworks for a Windows port of Hadoop, allowed Linux to run on Windows Azure, and is itself a Hadoop user."
Nerval's Lobster writes "The Apache Hadoop open-source framework specializes in running data applications on large hardware clusters, making it a particular favorite among firms such as Facebook and IBM with a lot of backend infrastructure (and a whole ton of data) to manage. So it'd be hard to blame Intel for jumping into this particular arena. The chipmaker has produced its own distribution for Apache Hadoop, apparently built 'from the silicon up' to efficiently access and crunch massive datasets. The distribution takes advantage of Intel's work in hardware, backed by the Intel Advanced Encryption Standard (AES) Instructions (Intel AES-NI) in the Intel Xeon processor. Intel also claims that a specialized Hadoop distribution riding on its hardware can analyze data at superior speeds—namely, one terabyte of data can be processed in seven minutes, versus hours for some other systems. The company faces a lot of competition in an arena crowded with other Hadoop players, but that won't stop it from trying to throw its muscle around."
First time accepted submitter punk2176 writes "Recently I started a free and open source project known as the PunkSPIDER project and presented it at ShmooCon 2013. If you haven't heard of it, it is, at heart, a project with the goal of pushing for improved global website security. In order to do this we built a Hadoop distributed computing cluster along with a website vulnerability scanner that can use the cluster. Once we finished that, we open sourced the code to our scanner and unleashed it on the Internet. The results of our scans are provided to the public for free in an easy-to-use search engine. The results so far aren't pretty." The Register has an informative article, too.
Billly Gates writes "The Apache Foundation released version 1.2 of Cassandra today, which is becoming quite popular with those wanting more performance than a traditional RDBMS. You can grab a copy from this list of mirrors. This release includes virtual nodes, which simplify rebalancing and speed up recovery. Another added feature is 'atomic batches,' in which a batch of statements is logged first so the whole batch can be replayed if part of it fails. They've also added support for integrating with Hadoop; although Cassandra does not directly support MapReduce, this release makes it easier to integrate with systems that do."
A couple of weeks ago you had a chance to ask Canonical Ltd. and the Ubuntu Foundation founder, Mark Shuttleworth, anything about software and vacationing in space. Below you'll find his answers to your questions. Make sure to look for our live discussion tomorrow with free software advocate and CTO of Rhombus Tech, Luke Leighton. The interview will start at 1:30 EST.
Nerval's Lobster writes "Facebook's engineers face a considerable challenge when it comes to managing the tidal wave of data flowing through the company's infrastructure. Its data warehouse, which handles over half a petabyte of information each day, has expanded some 2500x in the past four years — and that growth isn't going to end anytime soon. Until early 2011, those engineers relied on a MapReduce implementation from Apache Hadoop as the foundation of Facebook's data infrastructure. Still, despite Hadoop MapReduce's ability to handle large datasets, Facebook's scheduling framework — in which a single job tracker assigns duties to a large number of task trackers — began to reach its limits. So Facebook's engineers went to the whiteboard and designed a new scheduling framework named Corona." Facebook is continuing development on Corona, but they've also open-sourced the version they currently use.
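The bottleneck being described — one central job tracker handing work to every task tracker in the cluster — can be sketched as a toy round-robin scheduler (hypothetical names; this is an illustration of the pattern, not Hadoop's or Corona's actual code):

```python
from collections import defaultdict
from itertools import cycle

def job_tracker(tasks, task_trackers):
    """Assign every task to a tracker round-robin.

    In classic Hadoop MapReduce a single job tracker does this for the
    entire cluster, so its throughput caps how many tasks the cluster
    can run -- the choke point Corona was designed to remove by
    splitting cluster management from per-job scheduling.
    """
    assignments = defaultdict(list)
    trackers = cycle(task_trackers)
    for task in tasks:
        assignments[next(trackers)].append(task)
    return assignments

plan = job_tracker(range(10), ["tt-a", "tt-b", "tt-c"])
print(len(plan["tt-a"]))  # round-robin gives tt-a tasks 0, 3, 6, 9 -> 4
```

At ten tasks the central loop is trivial; at Facebook's scale, the same single loop has to make every placement decision for tens of thousands of concurrent tasks, which is why it eventually saturates.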
snydeq writes "Facebook has said that it will soon open source Prism, an internal project that supports geographically distributed Hadoop data stores, thereby removing the limits on Hadoop's capacity to crunch data. 'The problem is that Hadoop must confine data to one physical data center location. Although Hadoop is a batch processing system, it's tightly coupled, and it will not tolerate more than a few milliseconds delay among servers in a Hadoop cluster. With Prism, a logical abstraction layer is added so that a Hadoop cluster can run across multiple data centers, effectively removing limits on capacity.'"
Nerval's Lobster writes "Facebook recently invited a handful of employers into its headquarters for a more in-depth look at how it handles its flood of data. Part of that involves the social network's upcoming 'Project Prism,' which will allow Facebook to maintain data in multiple data centers around the globe while allowing company engineers to maintain a holistic view of it, thanks to tools such as automatic replication. That added flexibility could help Facebook as it attempts to wrangle an ever-increasing amount of data. 'It allows us to physically separate this massive warehouse of data but still maintain a single logical view of all of it,' is how Wired quotes Jay Parikh, Facebook's vice president of engineering, as explaining the system to reporters. 'We can move the warehouses around, depending on cost or performance or technology.' Facebook has another project, known as Corona, which makes its Apache Hadoop clusters less crash-prone while increasing the number of tasks that can be run on the infrastructure."
pmdubs writes "The U.C. Berkeley AMPLab research group will be hosting a free 'Big Data Bootcamp' on-campus and online, August 21 and 22. The AMP Camp will feature hands-on tutorials on big data analysis using the AMPLab software stack, including Spark, Shark, and Mesos. These tools work hand-in-hand with technologies like Hadoop to provide high-performance, low-latency data analysis. AMP Camp will also include high-level overviews of warehouse-scale computing, presentations on several big data use cases, and talks on related projects."