Michael Factor, IBM Fellow
Editor’s note: This article is by Michael Factor,
IBM Fellow, Storage and Systems
Information. Data. It is clichéd to talk about its rate of growth. Not
long ago, a terabyte was a lot of information. Now we talk freely about petabytes
and even exabytes. While not many organizations are at an exabyte yet, having
an exabyte of data will become more and more common.
But what is an exabyte? The simple answer is 10^18 bytes. But
this is still tough to conceptualize. So let’s put it in some less conventional
units. Assuming we had an exabyte of data stored on 6TB disks (the
most commonly available large disks), if we piled these disks one on top of the
other, the pile would be almost 6 times the height of the Burj Khalifa, the world’s tallest
building. And these drives would weigh almost as much as 16 full-grown
elephants.
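The disk-pile comparison above can be roughly reproduced with a few lines of arithmetic. The sketch below is only illustrative: the drive height (26.1 mm, a common 3.5-inch form factor) and the height of the Burj Khalifa (828 m) are assumptions that do not appear in the article, so the result lands in the same ballpark as the figure quoted above rather than matching it exactly.

```python
EXABYTE_BYTES = 10**18            # one exabyte, decimal units
DISK_BYTES = 6 * 10**12           # one 6TB disk, decimal units

# Assumed values (not from the article):
DISK_HEIGHT_M = 0.0261            # height of a typical 3.5-inch drive
BURJ_KHALIFA_HEIGHT_M = 828.0     # commonly cited building height


def disks_needed(total_bytes: int, disk_bytes: int) -> int:
    """Number of whole disks required to hold total_bytes (rounded up)."""
    return (total_bytes + disk_bytes - 1) // disk_bytes


def stack_height_m(disk_count: int, disk_height_m: float) -> float:
    """Height of a pile of disks stacked flat, one on top of the other."""
    return disk_count * disk_height_m


disk_count = disks_needed(EXABYTE_BYTES, DISK_BYTES)
pile_height = stack_height_m(disk_count, DISK_HEIGHT_M)
towers = pile_height / BURJ_KHALIFA_HEIGHT_M

print(f"disks needed: {disk_count}")
print(f"pile height: {pile_height:.0f} m, or {towers:.1f}x the Burj Khalifa")
```

Changing the assumed drive height shifts the multiple, which is why the exact figure depends on the disk model chosen.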
No need for tall buildings or elephants, just object storage
This is a lot of data – and of course, actually
storing and managing this data requires more resources than just an exabyte of
disk capacity. So, we work with object stores, a technology that stores data as
objects (versus, say, files) – which is cost-effective for massive
amounts of data. And it’s why IBM recently acquired Cleversafe, to
efficiently, securely, and scalably store and manage petabytes and exabytes of
data in an object store. Other mechanisms for storing data, such as block
storage systems or relational databases, will not handle exabytes. They don’t
even handle petabytes; and if they could, they would be prohibitively
expensive.
Simply storing the data is necessary, but not
sufficient. We need to be able to use the data. One element of ensuring the
data can be used is to have an open interface to access the data. This helps
ensure that businesses can use the cloud services they want and are not limited
to a specific proprietary provider. It also means there will be broad
investment in ensuring various tools work efficiently with the data. One such
interface is the one provided by OpenStack Swift, which IBM is committed to
supporting both in our products and in our cloud services, such as the Bluemix object store.
It is not possible to manually process even terabytes of information, let alone
petabytes or exabytes. The only way one can make sense out of these quantities
of data is through analytics, machine learning, and cognitive computing. Apache
Spark is rapidly growing as the bulk data analytics platform of choice
and is particularly well suited to machine learning and cognitive
algorithms.
Given the importance of Apache Spark, we needed a fast connector that would
allow Apache Spark to efficiently process data in an object store. And now
scientists at IBM Research – Haifa have built just that. Called Stocator, the
connector allows Spark to efficiently ‘talk’ to an object store.
Building a fast bridge to the cloud
The developers of Apache Spark very wisely built on
existing code to interact with external data stores. Specifically, they made it
possible to connect through the HDFS layer of Hadoop, the open source framework, which
provides access to stored data. This gave them immediate access to a very large
number of data stores. But it also meant the data access path carries with it a
lot of extra code that object stores simply don’t need.
For example, when Hadoop creates the file
/John/Photos/elephant.jpg, it needs to create three separate objects: one for
John, one for Photos, and one for elephant.jpg. But in an object store, we can
use just a single object to represent /John/Photos/elephant.jpg.
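The difference can be illustrated with a toy model. The sketch below is not Hadoop's or Stocator's actual code; the function names are hypothetical and the model simply assumes that the file-system-style approach writes one object per path component, while the object-store-style approach writes one object named by the full path.

```python
def filesystem_style_objects(path: str) -> list[str]:
    """Objects written when every path component gets its own object.

    Hypothetical model of the directory-emulating behavior described above.
    """
    parts = [p for p in path.split("/") if p]
    # One object per prefix: John, John/Photos, John/Photos/elephant.jpg
    return ["/".join(parts[:i]) for i in range(1, len(parts) + 1)]


def object_store_style_objects(path: str) -> list[str]:
    """Objects written when the full path is used as a single object name."""
    return [path.lstrip("/")]


path = "/John/Photos/elephant.jpg"
print(filesystem_style_objects(path))
print(object_store_style_objects(path))
```

In this model the deeper the path, the more objects (and therefore calls to the object store) the file-system-style approach needs, while the object-store-style approach always needs one.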
By using Spark with Stocator, we can reduce the
number of calls Spark makes to the object store by an order of magnitude! The
result is faster performance and the need for fewer resources.
Stocator is object-store aware, and thus has a very different architecture from
the existing Hadoop drivers for object stores. Hadoop was originally designed to
store data in a file system. In contrast, our new connector leverages
object-store-specific features, like creating only one object for
/John/Photos/elephant.jpg. The impact of these changes is significant. For
instance, on one application, the number of calls to the object store was
reduced from 190 to 22.
Although we developed and tested this connector
using the Swift API as the interface to the object store, the code can be
extended to support other object stores. Even though this new connector cannot
replace the Hadoop-derived code in all cases, it can easily be applied to other
open source projects such as Apache Flume or Secor.
We have released our code as open source and encourage you to use it and send us
feedback. The code is now available online. There is also a technical blog that goes into more detail on the new connector and describes how to use it with Spark.