Recently I wrote my first article for a publication. What an experience that was. Writing on a blog almost no one reads carries much less pressure than writing for a magazine that actually has a readership. The article itself was relatively easy, but I learned my fair share from the editing process, which I need to write up later. In the meantime, here is the link to my article.
This was a great book for those looking to get their feet wet with MongoDB. PHP and MongoDB covered more topics than many of the MongoDB books I have read recently, and while I am not a PHP developer, it gave me a few more ways to leverage MongoDB in my day-to-day work.
I especially enjoyed the chapters on geospatial functionality and the GridFS system. Both of these topics were handled thoroughly, and they are typically glossed over in other books.
The one place I felt this book was light was the operations and administration side of things. Nowadays, many developers are handling operations as well and I feel that those topics could be explored further in books like this.
All in all, I would recommend that anyone in web development looking for somewhere to start with MongoDB pick up this book and give it a read.
Seven Databases in Seven Weeks is a great book for giving you an overview of the latest databases in the different segments out there. It is definitely an entry-level chapter on each system that will let you know whether or not to pursue it further with more in-depth material.
Anyone curious about what is available besides the de facto SQL standard offerings should give this book a read.
My Twitter stream and usual haunts on the Internet have recently seen an increase in NoSQL bashing. The one common thread seems to be that “pick your NoSQL” solution is not as good as “pick your SQL” solution at “pick your topic”. I am not here to try to debunk these statements or prove one or the other wrong. I would just like us to compare apples to apples and have a real conversation about when and where to use the right solutions, regardless of the camps they fall in.
First, let’s be realistic: NoSQL is not going away and will be more and more a part of our lives every day. So before taking the fanboy comments on Twitter to heart, do yourself a favor: read up on the pros and cons of any solution you are going to use, and run some tests on your laptop. Most of the time there is more than one solution that will work for your needs, and better understanding the focus and future direction of the technology can help make that decision.
OK, now for the part of the conversation I think is missing:
- My NoSQL is more performant than your SQL!
This statement is not only bold, but very vague. What do you mean by more performant? Are we talking about server resources, reads per second, writes per second? Come on, this is just going to start an argument where everyone is comparing metrics and benchmarks that are not relevant to each other. Also, you can configure most SQL systems to perform on the level of their NoSQL counterparts, but doing so will degrade their performance in other areas. Doing this may be beneficial for your team/company by not making them learn a new technology, but it also hampers you when you later want to leverage some other feature of your SQL system that is no longer configured correctly.
- NoSQL is immature and not ready for production
This will vary by solution. I would argue that the file system is more mature than any SQL solution (yes, it is a NoSQL solution), but I would also say that many of the new kids on the block should be tried and tested before moving them to production, and you should expect to have problems and find bugs that have already been worked out in the older, more stable SQL systems. This, however, is not a reason to dismiss the solution; it is a reason to spend more time reading up on it and talking to the few that are running it so you don’t make the same mistakes.
- NoSQL can’t do everything SQL can
Of course it can’t; it isn’t meant to. Most NoSQL systems are built to target a very specific pain point, and they accomplish this by abandoning features and overhead that most SQL systems implement. This doesn’t mean go implement every NoSQL solution known to man to gain a few milliseconds in your system, but if you find a solution that can make a significant impact on the performance of your application or save you a tremendous amount of time, then it may be time to think about moving that functionality into a NoSQL solution.
- NoSQL is not secure
This is true for a lot of NoSQL solutions. I am not sure why security has been handled this way, but there is good news: you can compensate at the operating system and/or firewall level. This is a valid concern, and you really need to be aware of how it affects you and your data when implementing any solution.
That is a short list of the statements being flung around, but I think you get the idea.
I don’t know of any NoSQL solution that claims to be a drop-in replacement for all things SQL. The performance gains many NoSQL solutions are able to claim come at the expense of not being able to do many of the things SQL can, pushing those concerns out of the database system and back up to the developer. This can be both a blessing and a curse; with frameworks, ORMs and the like these concerns can be mitigated, but that is a whole other issue that could use some discussion and actually muddies the water even more.
Next time you want to bash or defend NoSQL, think about your reasons, the context and the real world implementations then take the conversation somewhere that allows you more than 140 characters.
Every time I get into a discussion with someone on the topic of “Big Data” it seems to diverge into one of a handful of subtopics: what size is considered big data, what technology you must be using to be considered big data, and what problem you are trying to solve with your data. These are all great buzzworthy topics, but does it really matter what technology you are using or how big your data is?
When it comes to the size of your data, everyone will (and should) have a different definition of big. Really, what makes data big depends on the resources you have available to manage that data. With this in mind, Walmart or Facebook and their data make what I deal with look tiny. Does that mean I don’t experience similar challenges to them? No. While their task is far larger and greater in scope than mine, they also have deeper wallets, more personnel and greater technology resources. So just because you aren’t dealing in petabytes doesn’t mean you are not dealing with big data.
NoSQL and Hadoop are a requirement for being in the big data space, right? After all, that is why those tools were invented, and if you aren’t using them to solve your problems then you clearly haven’t entered the big data space. While those tools can be useful (or even detrimental), they are not required to be in the big data space. MySQL, Postgres and the others have been around forever, and people have been using them to solve their data problems without the newer tools. Hell, I know of one company that deals with what I would call Gigantic Data and does it all in flat files (although I guess those are the original NoSQL solution). It is not the tools you are using; it is how you are using them and what you are trying to accomplish.
Which brings us to what I think the real problem is: what are you trying to solve? Or better yet, what question are you trying to answer? If you don’t really know, then you are data warehousing and most likely dealing with archiving big data sets, but not really in what I would call the big data space. This is where I think things get muddy for most. To me, all the new tools at our disposal and the fact that storage is so damn inexpensive are causing us to archive everything without knowing why; we just have a gut instinct that it will be worth something someday or that it holds the answer to some unknown question. Both of those may be true, but the data isn’t the piece that is truly valuable; it is the unknown question that will bring value.
“Big Data” is asking your data set for an answer to a question and getting that answer as quickly as possible.
Quick read, but super informative. Anyone looking to set up shards and clusters using MongoDB should read this book. What is really nice is the author’s first-hand knowledge of the system (she is a core contributor) and her explanation of the priorities and milestones of future releases. That helped keep this book from falling behind a technology that is evolving quickly.
Last Wednesday night I whipped together a prototype of an application to test some architectural changes and make delivering large amounts of data to a client easier for both of us. The end result was to be a centralized datastore for all of our apps that could be accessed via a very simple API.
Testing went well and then the flood gates opened. Several hundred thousand requests in the opening hour had generated almost 60GB of data in Mongo. While every aspect of the system functioned better than was expected (especially for a late night prototype) the amount of data being generated so quickly was alarming.
When I tried to implement zlib compression the night before, Mongo threw fits and I did not have time to deal with it. But now was the time. Finding the answer to the errors I was getting was tough. The app itself uses the Mongoid ORM for Rails, and the background workers use the Mongo driver for Ruby. The deflate happens on the background workers and the inflate in the Rails app.
Let’s start with the easy change, Mongoid:
field :large_data, :type => Binary
That simply tells Mongoid to expect a binary object and insert as such.
The more difficult issue I ran into was actually doing the insert with the Mongo Driver in the Ruby scripts. I simply wasn’t looking for the right thing. What I needed to do was convert my zlib binary into a new BSON Binary to be stored in Mongo.
Then simply call the standard Mongo Insert command.
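The worker-side steps above can be sketched roughly as follows. This is a minimal sketch, not the original code: the payload and variable names are illustrative, and the Mongo driver calls are shown as comments since they require a live connection and the bson gem.

```ruby
require 'zlib'

# Worker side (sketch): deflate the payload before storing it.
# The payload here is a repetitive stand-in for the real response data.
payload = '{"status":"ok","items":' + ('[1,2,3],' * 500) + '[]}'
compressed = Zlib::Deflate.deflate(payload)

# With the Ruby Mongo driver, the compressed bytes would then be wrapped
# in a BSON Binary and inserted, roughly like this:
#   doc = { 'large_data' => BSON::Binary.new(compressed) }
#   collection.insert(doc)

puts "#{payload.bytesize} bytes down to #{compressed.bytesize}"
```

Repetitive JSON like this compresses very well, which is why the savings reported below were so large.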
The last piece to this was back on the Rails App, I needed to inflate and return this data. The caveat that I found was needing to turn the BSON Binary to a string before trying to inflate it.
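The read path with that string conversion looks roughly like this; it is a sketch simulated with a local string, since in the real app the stored value comes back from Mongo as a BSON Binary object rather than a plain String.

```ruby
require 'zlib'

# Simulate what was stored. In the real app this value is a BSON::Binary
# read back from Mongo, which is why the to_s call below matters.
stored = Zlib::Deflate.deflate('<html><body>hello</body></html>')

# Convert to a plain String before inflating; Zlib expects raw bytes,
# not the BSON wrapper object.
html = Zlib::Inflate.inflate(stored.to_s)
```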
The end result was a system that was still keeping up with demand, but was now putting away far less data. My results were a 50K JSON string down to 12.5K and a 300K HTML file down to 72K.
Hope this helps anyone looking to squeeze a little more out of their storage solution.