Monday 10 October 2016

Hadoop, huh! What is it good for? Absolutely... everything!

You may come across Hadoop in your everyday tech-world job, and you might even understand what it is. But one thing's for sure: people in our business tend to get very excited about the next new thing.

A few years ago, Hadoop was the next new thing. Google published its paper on the Google File System (GFS) in 2003, and Apache went on to develop the open-source HDFS and MapReduce ecosystem around those ideas. The promise of handling petabytes of data was an intoxicating elixir to anyone sitting on a lot of data.

By 'a lot of data', we don't mean some big financial record files or customer databases; we mean BIG. Start from 4 terabytes, because below that a normal file system, or a Microsoft or Oracle relational database, can handle it on a server with a bunch of disks attached.

4 terabytes is a lot of data. It is equivalent to the data held in 132,000 regular 500-page fiction books. Stack those books up and your pile would be 8 miles high. Dig down, and 8,000 miles gets you to New Zealand. And that's where Hadoop gets going.

In 2010 Facebook was using 2,300 Hadoop clusters, which can all work together, to store 40 petabytes. Now a petabyte is 1,000 terabytes, so that's 40,000 terabytes. At 8 miles of books for every 4 terabytes, that's a pile 80,000 miles high, which gets you about a third of the way to the Moon. So you can see that this really is an astounding amount of data.
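If you want to check the size of that pile yourself, here is a quick back-of-the-envelope sketch in Python. It simply scales the 8-miles-per-4-terabytes figure from above in a straight line; the Moon distance is my own addition for comparison and is only approximate.

```python
# Scale the post's own yardstick: 4 TB of books stacks 8 miles high.
MILES_PER_4_TB = 8        # figure quoted in the post
TB_PER_PB = 1_000         # decimal petabyte, as used in the post
MOON_MILES = 238_900      # average Earth-Moon distance (approximate, my assumption)


def pile_height_miles(terabytes: float) -> float:
    """Height of the book pile for a given data volume, scaling linearly."""
    return terabytes / 4 * MILES_PER_4_TB


facebook_tb = 40 * TB_PER_PB
height = pile_height_miles(facebook_tb)
print(f"{height:,.0f} miles, {height / MOON_MILES:.0%} of the way to the Moon")
# prints "80,000 miles, 33% of the way to the Moon"
```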

But it's not just words on a page, or lots of numbers, that are being stored in Hadoop. The Hadoop Distributed File System (HDFS) is great at holding data of all sorts. Whereas with a regular relational database you need to know what you want to put into the database before you put it there, Hadoop is like a magic dumpster. You just throw anything you like in there and worry about getting it out later.
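The "throw it in now, make sense of it later" idea is sometimes called schema-on-read. Here is a toy sketch of it in plain Python. It is not Hadoop code: the `raw_store` list, the record fields and the `read_with_schema` helper are all made up for illustration.

```python
import json

# The "dumpster": raw lines kept exactly as they arrived, with no schema enforced.
raw_store = [
    '{"kind": "order", "customer": "alice", "total": 12.5}',
    '{"kind": "order", "customer": "bob", "total": 7.0}',
    '{"kind": "login", "customer": "alice"}',
    'some free text that is not even JSON',
]


def read_with_schema(lines, wanted_kind):
    """Apply structure at read time: yield only records that parse and match."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # this question can't use the line, so skip it
        if isinstance(record, dict) and record.get("kind") == wanted_kind:
            yield record


orders = list(read_with_schema(raw_store, "order"))
print(sum(order["total"] for order in orders))  # prints 19.5
```

Nothing was rejected when the data went in; the decision about what counts as an "order" was only made when the question was asked.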

I say it's magic because, unlike the contents of a regular dumpster, the data doesn't decompose over time, and it can be sorted and 'mapped' to help you find what you're looking for down the line.
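That sorting and mapping is, loosely, the MapReduce idea. Below is a toy word count in plain Python that walks through the three steps (map, sort, reduce). It runs on one machine and does not use Hadoop's actual API; the sample `documents` are invented for the example.

```python
from itertools import groupby
from operator import itemgetter

documents = ["big data big promise", "data lake"]

# Map: emit a (word, 1) pair for every word in every document.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Sort (the "shuffle"): bring identical words next to each other.
mapped.sort(key=itemgetter(0))

# Reduce: add up the counts for each word.
counts = {
    word: sum(n for _, n in group)
    for word, group in groupby(mapped, key=itemgetter(0))
}

print(counts)  # prints {'big': 2, 'data': 2, 'lake': 1, 'promise': 1}
```

On a real cluster the map and reduce steps would be spread across many machines, which is where the petabyte-scale promise comes from.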

But don't think of Hadoop as either a dumpster or a library of books. Think of it more like a shopping mall. It's somewhere people can go to find what they are looking for. Sure, you need an idea of what's there before you start looking, and you might need the help of a mall map. But everything about that mall can be uncovered, whether it's a comparison between price tags on similar t-shirts, how many lattes Starbucks sold yesterday, or what the teenagers hanging around in the parking lot are saying to each other. Any and all kinds of information can be stored and retrieved.

This is why it is so compelling for companies to create a data lake, with as much data in it as possible. Where previously your departments held the data and didn't share it, now everyone is sharing all their data, and the correlations, relationships, inaccuracies, trends and insights are ready to be discovered. 

Imagine if a retailer in the mall knew who was buying their products, who was buying competitors' products and at what price, what those shoppers were saying to their friends about it, which shops they visited before they made the purchase, when they were going to come back, and what would make them choose that store.

It would be an unfair advantage. And that's what companies get from Hadoop and their data lake. It's an unfair advantage over their competitors who don't have it. The treasure trove of Big Data is hard to fathom, and even harder to implement, but will prove itself over time to be the best way to understand and run your business.
