Big data has revolutionized the way we look at information. Unfortunately, access to the collective set of tools that define this craze is not always easy to come by. Below, I lay out three ways a small or medium-sized business (SMB) can take advantage of the big data revolution.
Rent a Cluster
Resource Cost: Potentially High, Labour Cost: So-So, Nerd Props: So-So
If you’re very familiar with the scope of your data crunching project, trust your team’s ability to write and deploy solid code, and don’t mind spending the extra money on occasion, then maybe a third-party computing cluster is for you. Services such as Amazon’s Elastic MapReduce (EMR), Microsoft Azure’s HPC offering, or Qubole give you an elastic, on-demand environment in which to run your code. The benefits are obvious: easy-to-manage infrastructure, the ability to grow and shrink with your data set and with the complexity of your code, and rock-solid performance. The problem with cloud-based computing clusters is that they can (very) easily become expensive. Even moving data in and out can cost you a few dollars, and every buggy run over a large dataset is money spent, so make sure your team is able to produce quality code. With that said, we run big data clusters in an Amazon VPC running Spark, and it works very well for our needs.
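To make the "rent a cluster" path concrete, here is a sketch of spinning up a small on-demand Spark cluster with the AWS CLI. The cluster name, release label, and instance sizing are illustrative, not a recommendation, and charges start the moment the cluster launches:

```shell
# Sketch only: provision a small on-demand Spark cluster on EMR.
# Name, release label, and instance sizing are illustrative assumptions.
aws emr create-cluster \
  --name "smb-crunch" \
  --release-label emr-6.15.0 \
  --applications Name=Spark \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --use-default-roles

# Terminate when the job is done, or the meter keeps running:
# aws emr terminate-clusters --cluster-ids <cluster-id>
```

That last line is the important one for an SMB budget: with rented clusters, forgetting to terminate is the easiest way to turn "Potentially High" resource cost into "Definitely High."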
You Don’t Need a Stinking Cluster
Resource Cost: Low, Labour Cost: Low, Nerd Props: Low
The fact of the matter is, big data is not for everyone. Properly mining data requires a talented team, patience, and a deep understanding of your data; incomplete data analysis leads to incorrect conclusions. Fortunately, you don’t need access to expensive resources in order to take advantage of the big data revolution. A number of larger enterprises have already done the heavy lifting for you, and the results of their analysis are all over the web. Websites like Google Trends offer a plethora of information that has been mapped, reduced, analyzed, and made available in lovely chart and graph form. Want to learn more about your particular market segment? A simple Bing or Google search can be your gateway to a world of knowledge. Want to know more about user behavior on Facebook? Just search the web. Chances are, the best work has already been done by researchers at universities, think tanks, and global corporations. Just because you’re not mining the data yourself doesn’t mean it isn’t relevant and valuable.
Set Up Your Own Cluster
Resource Cost: Low, Labour Cost: High, Nerd Props: High
Modern, open source data crunching platforms are purpose-built to run on all sorts of hardware. Better yet, they’re easy to install and set up. Whereas 15 years ago the majority of your time would be spent building the actual computing cluster (e.g., a Beowulf cluster), now you can focus your efforts on collecting and analyzing your data. Although we’re now a Spark shop, in the past we have implemented Hadoop for crunching data, and both are easy(ish) to get set up. The problem: any distributed computing platform is only as good as the resources you throw at it. The name of the game here is “distribution,” so the more nodes (computers) you have, the better. Fortunately, most SMBs have access to a large pool of computing nodes right under their noses. With some basic hardware – a gigabit switch, dedicated gigabit Ethernet cards, and cables – your unused employee workstations can run as cluster nodes in the evening. We recently set up a Spark cluster on three such machines and a good time was had by all. The one caveat here is that you will probably want to enable dual boot on these unused machines.
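Whichever platform you pick, the core pattern is the same: map a function over independent chunks of your data in parallel, then reduce the partial results into one answer. Here is a minimal sketch in plain Python using a thread pool to stand in for cluster nodes; the corpus and chunking are made up for illustration:

```python
from collections import Counter
from functools import reduce
from multiprocessing.pool import ThreadPool

def map_count(chunk):
    """Map step: count words in one chunk of the corpus."""
    return Counter(chunk.split())

def reduce_counts(a, b):
    """Reduce step: merge two partial word counts."""
    return a + b

# Hypothetical corpus, pre-split into chunks -- on a real cluster,
# each chunk would live on (and be processed by) a different node.
chunks = ["big data big wins", "small data small wins", "big wins"]

# The thread pool stands in for the cluster: one worker per chunk.
with ThreadPool(processes=3) as pool:
    partials = pool.map(map_count, chunks)

totals = reduce(reduce_counts, partials)
print(totals["big"])  # partial counts merged across all workers
```

This is exactly why "the more nodes, the better": the map step scales out linearly with the number of chunks you can process at once, and only the (cheap) merge runs serially.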