Primary data storage options for the cloud
Read and listen to part 1 of the podcast
Cloud storage remains a suitable option for nearline storage, since latency isn't a huge problem with passive data that hasn't been accessed for months if not years. But there are some new twists to help with the cloud-based storage of petabyte and exabyte levels of data.
Two of the latest trends in cloud storage for nearline data are making file-based data Hadoop-ready and using erasure coding to avoid the need to store extra copies of the data to protect against disk failures, noted Marc Staimer, president of Dragon Slayer Consulting in Beaverton, Ore.
In the second of a two-part interview with TechTarget senior writer Carol Sliwa, Staimer discusses the distinction between nearline and primary data, decision points associated with choosing public, private or hybrid cloud storage for nearline scenarios, and data migration and security issues with large quantities of data.
The term nearline originated in the days when mainframes ruled the data center. How do you differentiate the terms nearline and primary storage today?
Marc Staimer: Response time. You're not going to use nearline storage for short or fast response times. You're not going to use it for active applications where you need a rapid response from the storage. Nearline by definition means you're going to have higher latency between the application and its data. Not a good thing. So, if it's active data, you're probably not putting it on nearline [storage] … There's a lot of passive data. In fact, more than 90% of most people's data is passive. That's your backup data, your archive data, data that's aged out, hasn't been accessed in months or years. That data is perfectly fine for nearline storage or nearline in the cloud, whereas active data, data you're accessing daily, or [data that's] part of a database, you probably don't want to have that on nearline storage.
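The active-versus-passive distinction Staimer draws can be sketched as a simple age check on last-access time. This is a minimal illustration, not a method described in the interview: the function name `nearline_candidates` and the 180-day threshold are assumptions, and access times are only a rough signal because many file systems are mounted with options such as `noatime` or `relatime` that limit how often they are updated.

```python
import time
from pathlib import Path

SECONDS_PER_DAY = 86_400


def nearline_candidates(root, idle_days=180, now=None):
    """Yield files under `root` not accessed in the last `idle_days` days.

    `idle_days=180` is an illustrative threshold, not a figure from the interview.
    """
    now = time.time() if now is None else now
    cutoff = now - idle_days * SECONDS_PER_DAY
    for path in Path(root).rglob("*"):
        # st_atime is the last-access timestamp reported by the file system.
        if path.is_file() and path.stat().st_atime < cutoff:
            yield path
```

Anything the function yields is a candidate for nearline or cloud storage; files touched more recently than the cutoff would stay on primary storage.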
What are the latest developments in cloud used for nearline storage?
Staimer: Primarily when you look at nearline storage, you're looking at the bulk of most data. What you have in that storage is a lot of assets that don't get much use, but there might be a lot of data in there. So, being able to present that data, without migrating it, as something that a Hadoop cluster can read directly is a major trend right now. You'll see a variety of different storage systems that will be able to represent NFS data or CIFS data or file data generally as HDFS [Hadoop Distributed File System], so it can plug right into a Hadoop [cluster].
Another major trend is erasure codes. One of the things that users are becoming very cognizant of with this whole move to lower cost, big data storage or exascale storage -- in other words, storage that scales not just into petabytes but into exabytes and, not too far down the road, zettabytes and yottabytes -- is that you're going to have lots of disk failures. You want to use cheap component parts, which typically means desktop drives versus even server drives or versus array drives. You're going to have a higher failure rate than you would even on a typical array. And you're going to have millions, not thousands, but millions of drives, which means you're going to have lots of drive failures all the time. RAID doesn't work well in that situation.
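A rough calculation shows why drive count matters so much here. The sketch below assumes failures are spread evenly over the year and uses a 2% annual failure rate purely as an illustrative figure; neither the rate nor the function name `expected_failures_per_day` comes from the interview.

```python
def expected_failures_per_day(drive_count, annual_failure_rate):
    """Average number of drive failures per day for a fleet of drives.

    Assumes failures are spread evenly across a 365-day year.
    """
    return drive_count * annual_failure_rate / 365


# Illustrative assumption: 2% of drives fail per year.
print(f"{expected_failures_per_day(1_000, 0.02):.2f}")      # 0.05
print(f"{expected_failures_per_day(1_000_000, 0.02):.1f}")  # 54.8
```

Under that assumption, a thousand-drive array sees a failure every few weeks, while a million-drive system is replacing dozens of drives every day, which is the situation Staimer describes.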
What a lot of these technologies will do is multicopy mirroring, typically three copies. Three copies just means that you're protecting against [two] failures with that data. So if [all three copies fail], you lose the data. So, now you're looking at quad copies and five copies and six copies. That means six times the storage, too. And that becomes unsustainable at exabyte scale. The cost is just too high. Even if you got the drives for free, the infrastructure cost to support it becomes outrageous.
This is where erasure coding is coming into play. Erasure coding turns a 600% [storage] increase, to protect against six failures of where the data resides, into a 60% increase because of the way that it distributes or chunks the data. That's becoming a huge trend. It's becoming table stakes in the exascale world.
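The 600% versus 60% comparison can be reproduced with simple arithmetic. The sketch below assumes an erasure code that splits each object into data fragments plus parity fragments and survives the loss of any parity-count of them, as Reed-Solomon style codes do. The 10 + 6 layout is an assumption chosen because it matches the 60% figure; the interview does not name a specific scheme, and the function names are hypothetical.

```python
def replication_overhead(failures_tolerated):
    """Extra storage, as a fraction of the original data, when each
    tolerated failure requires one additional full copy."""
    return float(failures_tolerated)


def erasure_overhead(data_fragments, parity_fragments):
    """Extra storage, as a fraction of the original data, for a code that
    stores `parity_fragments` extra fragments per `data_fragments` and
    survives the loss of any `parity_fragments` of them."""
    return parity_fragments / data_fragments


# Tolerating six failures with full copies versus an assumed 10 + 6 layout.
print(f"{replication_overhead(6):.0%}")   # 600%
print(f"{erasure_overhead(10, 6):.0%}")   # 60%
```

Both approaches survive six lost locations in this example; the difference is that the erasure-coded layout spreads 16 fragments across 16 locations instead of keeping seven full copies.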
For nearline storage, what's your advice on when to choose a public cloud, private cloud or hybrid cloud, which is also known as cloud-integrated storage?
Staimer: It depends. There are a number of factors in whether you go with public, private or hybrid, one of which is how much data you want to put in the cloud. How are you going to get it into the cloud? It's like breathing through a straw. If you want to move a petabyte of data, you're not going to do that across a WAN very rapidly. That's a lot of data. If you're going to move 10 PB or 100 PB, it's going to take a very long time. That kind of lends itself better to a hybrid cloud situation, where you're moving it into a cloud. It could be local. But at the same time, that data over time is going to be copied and replicated or migrated to either a public cloud or a distant cloud. It can still be private.
But, in a hybrid of that nature, you're going to need on-ramps. And this is the biggest issue with cloud storage. An on-ramp is how you move the data from where you have it now to where you want it to be, and that will determine whether or not you go with a public, private or a hybrid cloud more than anything else you can do.
Cloud-integrated storage is an on-ramp, for the most part. Cloud-integrated storage looks like local storage, so it gives you local, primary performance, but will move old data to the cloud and enable you to access it as if it's local. Or, it'll move backup data or archived data. But, it gives you an on-ramp to a public cloud or even another private cloud. So, those are the kinds of things that will determine it.
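To put the "breathing through a straw" remark in numbers, the sketch below estimates how long a bulk transfer takes over a WAN link. The link speeds, the decimal definition of a petabyte and the function name `transfer_days` are illustrative assumptions rather than figures from the interview, and the default assumes the link is fully dedicated to the transfer, which real networks rarely allow.

```python
PETABYTE = 10**15  # bytes, decimal definition


def transfer_days(data_bytes, link_bits_per_second, efficiency=1.0):
    """Days needed to move `data_bytes` over a link.

    `efficiency` is the fraction of the nominal link rate actually
    achieved (1.0 means a perfectly utilized, dedicated link).
    """
    seconds = data_bytes * 8 / (link_bits_per_second * efficiency)
    return seconds / 86_400


print(f"{transfer_days(PETABYTE, 1e9):.1f}")         # 92.6  (1 PB at 1 Gbps)
print(f"{transfer_days(100 * PETABYTE, 10e9):.1f}")  # 925.9 (100 PB at 10 Gbps)
```

Even under these best-case assumptions, a single petabyte occupies a 1 Gbps link for about three months, which is why the choice of on-ramp weighs so heavily in the decision.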
[Another] issue that comes into play here is security. What level of security do you have? What level of security do you need? Whether you are using a public or a hybrid public cloud, it's got to meet your security needs, and you cannot ever outsource your security. This is especially true in a lot of the verticals, whether you're talking health care or financial services or insurance or manufacturing; there are compliance rules you've got to follow. And even though [the provider] can handle that for you, you are still responsible. So, at the end of the day, if you've got to make sure that the data you have is encrypted, let's say, and not just encrypted, but it meets National Institute of Standards and Technology [Federal Information Processing Standard] FIPS 140-2 certification and compliance, then you'd better make sure that if you're using a public, or a hybrid with a public, that it meets those requirements. Those are the kinds of things that will affect whether you can use a private, public or hybrid cloud.