Is it a data lake, an oasis or just a mirage?
I recently read a blog post by one of the large infrastructure vendors about “data lakes” and how badly customers need them. I happen to be picking on this one post, but I have read so many like it lately that I wanted to write something from the perspective of the customers we talk to about this subject on a daily basis. I find the whole topic slightly ironic, and an example of vendors trying to lead the customer rather than the other way around. Don’t get me wrong, there are customers using data lakes successfully and for good reason, but for most customers a data lake isn’t the solution to the problem they are actually trying to solve.
This specific blog made a couple of points about why data lakes are so critical to a business’s future success, and argued that if you want a decent data analytics strategy it is imperative that you have a data lake. The points covered were:
Data Lakes Simplify Data Storage – Well yes, that is a fairly obvious point. If all data is in one place, the data storage strategy is much simpler. The question I would ask is how easy it is for any large customer to actually do this. In previous iterations of storage, around block and file, it has been nigh on impossible for most large customers, so why would it be any different now? Oh, I know, it’s because we are using a new buzz term, “data lake”, and all customers should want one of those.
Data Lakes Eliminate Silos – Yep, another fairly obvious point. If you can centralise all data, all silos go away. Wouldn’t it be a wonderful world if that happened! In practice it is a ridiculous claim and almost impossible for any company of any scale. Unfortunately, there are a number of very valid reasons why silos are a part of your business: legal requirements, where in the world a site is located, business unit separation, different data types (structured, unstructured etc.) and where the data was created (in a public cloud, for example), to name just a few.
Make it easier to access – Their specific point was that big data is broken when data is not easily accessible. I completely agree, but that doesn’t mean the data needs to be in one place, just that the analytics engine needs to get better at accepting and combining data from multiple sources.
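To make that last point a little more concrete, here is a minimal sketch in Python of combining two sources at query time rather than moving them into one place first. Everything in it is made up for illustration: the `orders` table, the customer CSV, the sample values and the `revenue_by_region` helper are all assumptions of mine, and a real analytics engine would do far more than this.

```python
import csv
import io
import sqlite3

# Silo 1 (hypothetical): a relational database owned by the sales team.
sales_db = sqlite3.connect(":memory:")
sales_db.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
sales_db.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 120.0), (2, 75.5), (1, 30.0)],
)

# Silo 2 (hypothetical): a CSV export that lives somewhere else entirely.
CUSTOMERS_CSV = "customer_id,region\n1,EMEA\n2,APAC\n"


def revenue_by_region(db, customers_csv):
    """Join the two sources at query time; neither is copied into a central store."""
    # Build a lookup from the CSV source: customer_id -> region.
    region_of = {
        int(row["customer_id"]): row["region"]
        for row in csv.DictReader(io.StringIO(customers_csv))
    }
    totals = {}
    # Ask the database for per-customer totals, then attribute them to regions.
    query = "SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id"
    for customer_id, amount in db.execute(query):
        region = region_of.get(customer_id, "unknown")
        totals[region] = totals.get(region, 0.0) + amount
    return totals


result = revenue_by_region(sales_db, CUSTOMERS_CSV)
print(result)
```

Each source stays where it is, and only the join happens in the analytics layer. That is the shape of the problem customers describe, whatever tooling ends up doing the work.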
Now, if this hadn’t actually been a blog about data lakes, but rather a blog about the benefits of centralised object storage, I would have completely agreed with every point they made. Data is a lot easier to manage, cheaper to store and easier to access if it is in one central, lower-cost object store. That’s just a fact, but how many customers can actually do that with all their data? Even if you consolidate as much as possible and build this utopia called a data lake, I guarantee that any customer will still have a number of different locations for key data sets. That is just the way a business is run.
I’ve been to several of the Big Data Summits recently and talked to a number of customers looking into these areas, and do you know the problem they are trying to solve?
How do I build an analytics environment where I can interrogate multiple data sources?
As most large infrastructure vendors don’t have a solution for this, the actual problem, their answer is a “data lake”! Put it all on the same infrastructure in a centralised storage solution and everything is so much easier, right?
This is the polar opposite of the position that most of the big data software vendors and the entire open source arena are taking, and they have been at this a while longer than most of the infrastructure vendors. I would argue that big data, or data analytics, is an arena led by software vendors rather than hardware vendors, with the software becoming more enterprise friendly and gaining more of the features that an average enterprise requires. I think the software in this space now covers more and more of the wide range of requirements, and has had a lot more money invested in it than the hardware it sits on.
Now I’m not here to say which strategy is right or wrong for any customer, and I’m sure there will be a number of customers that need both. Just be cautious about the Emperor’s new clothes, which are now called data lakes, and look at what your software requirements are before looking at the hardware.