Every time I get into a discussion with someone on the topic of “Big Data” it seems to diverge into one of a handful of subtopics: what size counts as big data, what technology you must be using to be considered big data, or what problem you are trying to solve with your data. These are all great buzzworthy topics, but does it really matter what technology you are using or how big your data truly is?
When it comes to the size of your data, everyone will (and should) have a different definition of big. Really, what makes data big depends on the resources you have available to manage that data. With this in mind, Walmart or Facebook and their data make what I deal with look tiny. Does that mean I don’t experience similar challenges to them? No. While their task is far larger and greater in scope than mine, they also have deeper wallets, more personnel and greater technology resources. So just because you aren’t dealing in petabytes doesn’t mean you are not dealing with big data.
NoSQL and Hadoop are a requirement for being in the big data space, right? I mean, after all, that is why those tools were invented, and if you aren’t using them to solve your problems then you clearly haven’t entered the big data space. While those tools can be useful (or even detrimental), they are not required to be in the big data space. MySQL, Postgres and the others have been around forever, and people have been using them to solve their data problems without the other tools. Hell, I know of one company that deals with what I would call Gigantic Data and does it all in flat files (although I guess those are the original NoSQL solution). It is not the tools you are using; it is how you are using them and what you are trying to accomplish.
Which brings us to what I think the real problem is: what are you trying to solve? Or better yet, what question are you trying to answer? If you don’t really know, then you are data warehousing and most likely archiving big data sets, but you are not really in what I would call the big data space. This is where I think things get muddy for most. To me, all the new tools at our disposal and the fact that storage is so damn inexpensive are causing us to archive everything without knowing why; we just have a gut instinct that it will be worth something someday or that it holds the answer to some unknown question. Both of those may be true, but the data isn’t the piece that is truly valuable; it is the unknown question that will bring value.
“Big Data” is asking your data set for an answer to a question and getting that answer as quickly as possible.