Cleversafe 3.0, due out in the fourth quarter, will combine Hadoop MapReduce with the Cleversafe Dispersed Storage Network (dsNet) system. Cleversafe will replace the Hadoop Distributed File System (HDFS) with the Cleversafe information dispersal algorithm, which stores a single instance of the data, dispersed across a network of storage nodes.
The ability to replace HDFS comes from an API that makes it seem like the storage system is talking directly to HDFS, and allows the storage software to assign jobs to stores for local data access.
The integration is designed to eliminate the scalability and reliability limitations of Hadoop that make it a poor fit for networked storage. These limitations include the single point of failure in its NameNode and JobTracker. HDFS uses one server for all metadata operations and keeps three copies of a file for data protection. That results in a single point of failure for the metadata node and can require three times as much storage to protect data.
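To put the storage cost of three-way replication in numbers, the short sketch below compares it with a generic dispersal scheme in which data is cut into a fixed number of slices and any subset of a given size is enough to rebuild it. The function names and the 16-slice, 10-needed figures are illustrative assumptions, not a configuration Cleversafe has published.

```python
def replicated_raw_bytes(logical_bytes, copies=3):
    """Raw capacity consumed when every byte is stored `copies` times."""
    return logical_bytes * copies


def dispersed_raw_bytes(logical_bytes, width, threshold):
    """Raw capacity consumed by a generic dispersal scheme.

    `width` is the number of slices written; `threshold` is how many of
    them are needed to rebuild the data. Both values are hypothetical.
    """
    if not 0 < threshold <= width:
        raise ValueError("threshold must be between 1 and width")
    return logical_bytes * width / threshold


logical_tb = 1000  # 1 PB of user data, expressed in TB

# Three full copies, as in a standard HDFS deployment
print(replicated_raw_bytes(logical_tb))

# Assumed example: 16 slices written, any 10 sufficient to rebuild
print(dispersed_raw_bytes(logical_tb, width=16, threshold=10))
```

Under these assumed parameters, the dispersed layout needs 1.6 times the logical capacity rather than three times, while still tolerating the loss of several slices.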
Cleversafe 3.0 will include a new protocol the vendor calls Slicestream, which organizes data in chunks on the dsNet system. If a drive fails or data can't be read, Slicestream will use the dispersal algorithm to find data from slices on other nodes and regenerate the data in real time, according to Russ Kennedy, vice president of product strategy at Chicago, Ill.-based Cleversafe.
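The idea of regenerating unreadable data from slices held on other nodes can be illustrated with a deliberately simple stand-in. The sketch below uses single-parity XOR, which is not Cleversafe's information dispersal algorithm and survives the loss of only one slice; the names disperse and regenerate are hypothetical and chosen only for illustration.

```python
from functools import reduce


def xor_bytes(a, b):
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def disperse(data, k):
    """Split data into k equal slices plus one XOR parity slice.

    Zero-padding is used to even out slice lengths, so this toy version
    does not preserve trailing zero bytes in the original data.
    """
    slice_len = -(-len(data) // k)  # ceiling division
    padded = data.ljust(slice_len * k, b"\0")
    slices = [padded[i * slice_len:(i + 1) * slice_len] for i in range(k)]
    parity = reduce(xor_bytes, slices)
    return slices + [parity]


def regenerate(slices, missing):
    """Rebuild the slice at index `missing` from all the other slices."""
    survivors = [s for i, s in enumerate(slices) if i != missing]
    return reduce(xor_bytes, survivors)


slices = disperse(b"hello dsNet world", k=4)

# Pretend the node holding slice 2 has failed, then rebuild its contents
rebuilt = regenerate(slices, missing=2)
print(rebuilt == slices[2])
```

A production dispersal scheme would use an erasure code that tolerates multiple simultaneous failures; the XOR version only shows why a lost slice does not require a full second copy of the data.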
"Hadoop implementations work on a small scale, but not in tens or hundreds of terabytes or petabyte scale," Kennedy said. "We bring limitless scale and the ability to run analytics on large-scale storage systems. The idea is to move computation to where the data resides."
Kennedy said Cleversafe customers are split between commercial companies and federal government agencies, and most run private storage clouds and do "big data" analytics. Cleversafe will distribute Cloudera Hadoop on its storage nodes. Kennedy said there's no reason Cleversafe cloud storage software won't work with other Hadoop distributions, but he said customers who are already running Cloudera Hadoop have requested the integration.
"I would say most of our customers are at the scale where this kind of solution makes sense," he said. "They're doing some Hadoop, but they're not able to get to this kind of scale."
Contractor Lockheed Martin is working with Cleversafe to develop a version of the cloud-based storage software with Hadoop for the federal government. Kennedy said Cleversafe is also talking to Cloudera about doing greater integration around management tools, but that would not be part of the 3.0 release.
Other storage vendors, including NetApp and EMC, are working with Hadoop distributors to make their storage work with HDFS. But John Webster, senior partner at Boulder, Colo.-based Evaluator Group Inc., said Cleversafe has a deeper integration with Hadoop than other storage vendors so far, although he expects to see other similar implementations.
"They're running Hadoop on a distributed clustered storage platform as opposed to running [it] on a rack full of off-the-shelf servers with little disks spinning around inside," Webster said of Cleversafe's Hadoop integration.
"What this does is run Hadoop on the storage cluster, and the idea is to bring computation to the data as opposed to bringing the data to the computation, which is typically done on a standard Hadoop cluster," he explained. "Typically, people make three copies of data in a standard Hadoop implementation. Those copies are kept internal to the cluster and they're full copies of data, so all data is replicated three times. In this implementation, a single copy of data is maintained across nodes of a cluster. If you need to re-create a copy, you can simply generate a new copy using the dispersal algorithms in the storage device."
Webster said not all the other storage vendors working on Hadoop integration will replace HDFS, and not all use object storage systems. He said these types of integrations are driven by the Hadoop community's realization that storage "is an intelligent data layer as opposed to just some device you put things into and take things out of. The transformation is taking place slowly, but it's happening."
Hadoop support will be a separately licensed feature, and will be the main addition to Cleversafe 3.0 software.