Volume, velocity and variety. These three Vs have traditionally been used to define what constitutes Big Data. By that measure, problems that involve analysing only a few gigabytes of data should technically not count as Big Data. But that is not the case.
"Not all big data problems pertain to large volumes of data. But the data is so diverse that we have to use those techniques to analyse it," explains Girish Juneja, CTO and GM of Intel Corporation's Big Data Software and Services.
"In fact, some of this data is so unstructured that it won't fit into existing systems," adds Moty Fania, Principal Engineer of Big Data Analytics. "It goes back to what you are trying to achieve. Some of the data is on a gigabyte scale. But you need to be able to react to the data fast enough and without buying a supercomputer."
Speaking at the APAC Big Data & Cloud Summit 2013 in Ho Chi Minh City, Vietnam, Juneja added that the solution depends on the kind of insights being sought from the data. "For instance, if you are trying to find retail sales for the past week, the key is how quickly you can analyse. But if we combine that with the footfalls for five years and try to find trends, you don't need it that fast."
This is where Apache Hadoop, thanks to its inherent flexibility, is so effective. "Essentially we are moving the computer to the data and not the data to the computer. So it doesn't matter if the data is large or small," he adds. Hadoop is an open-source framework for storing and processing large volumes of diverse data on a scalable cluster of servers. This February, Intel announced its own distribution of the software.
Ron Kasabian, GM of Big Data Solutions, says it is not just about the size of the data. "What you do with this data is more important... it is all about the correlation." But Kasabian underlines that delivering a proprietary solution might not be in Intel's best interest, even though the company might be in a position to do so. "The biggest Hadoop solution providers are still very small. So we felt the need for a large industry backer that will let the ecosystem evolve." Intel says it is trying to understand workload types so that it can enhance the software or the silicon as needed.
These days there are at least 19 billion connected devices, of which just five billion are consumer devices. The rest, the so-called Internet of Things, consists of everything from cameras to air conditioners. That is where the other end of the spectrum comes into play. Juneja cites the example of Axis Communications, which has installed 30,000 cameras in a Chinese city. "By 2017 they will be generating 500 petabytes of raw video of which 80 petabytes will be just metadata. The 500 petabytes is more than all the data created so far."
As Fania says, the term Big Data is really misleading. "Volume is just one aspect."
The author is attending the APAC Big Data & Cloud Summit 2013 in Vietnam on the invitation of Intel Corp.