Do you feel overwhelmed by the amount of information coming at you these days? You are not alone. And it is not just because we live in the information age: the problem of too much information has plagued people for centuries. Today, it takes the form of “huge, overwhelming, and uncontrollable amounts of information.”
How do we make it easier to process large amounts of information? The solution is what has become known as Big Data.
History of Big Data
Big data as it applies to computing technology and software has been around only since 2003, but the problem of having to deal with too much information goes back centuries. In the late 19th century, data overload became a serious problem for the U.S. Census Bureau. The census in the United States is taken every 10 years. After collecting the 1880 census data, the U.S. Government estimated it would take about eight years to process it all, and predicted that the 1890 census would take more than 10 years – longer than the interval between surveys. It was simply more information than they could handle.

The amount of information people consume today is astonishing. U.S. citizens alone consume roughly 4.4 petabytes (about 4.4 quadrillion bytes) of internet data every minute. That data is stored and organized by programs and services such as Outlook, Twitter, YouTube, Netflix, Venmo, and countless other online apps and messaging tools.
Definition of Big Data
How much data is big data? What counts as big data – “huge, overwhelming, and uncontrollable amounts of information” – is subjective. What a huge amount of data means to a small accounting firm differs significantly from what it means to an online video streaming service such as Netflix. A better way to think about big data is the definition offered by Gartner, the leading research and advisory company for business. Although still subjective, it frames the problem more precisely.
Gartner describes the problem of big data in terms of the three Vs:
Dealing with a greater Variety of data (video, audio, email, sensor data)
Arriving in increasing Volumes (datasets measured in terabytes and petabytes rather than gigabytes)
And with ever-increasing Velocity (the rate at which data is received and must be acted on, often as live streams captured in real time).
In other words, big data means we must store many new types of data, in much larger sizes, arriving at much higher speeds – and we have to keep up. Fortunately, new software can help us get a grip on huge datasets; the industry refers to these solutions as big data as well. To see how this software evolved over time, here is a brief history of the development of big data solutions.
History of Big Data Solutions
From a software perspective, big data took off in 2003, when Google began publishing a series of technical papers detailing how to handle large-scale data by building a distributed database. A distributed database splits the data across multiple computers: each computer stores, processes, and queries its own portion independently, while the combined results are returned as if they came from a single system.
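To make that idea concrete, here is a minimal single-machine sketch in Python, where each list stands in for one node’s slice of the data. The records and the query are made up purely for illustration; a real distributed database handles the partitioning, networking, and merging for you.

```python
# Toy sketch of the distributed-database idea: data is split into
# partitions, each "node" answers the query over only its local data,
# and the partial results are merged into one answer.
from concurrent.futures import ThreadPoolExecutor

# Pretend each inner list lives on a separate machine (hypothetical data).
partitions = [
    [{"user": "ana", "purchases": 3}, {"user": "bo", "purchases": 7}],
    [{"user": "cy", "purchases": 1}, {"user": "dee", "purchases": 9}],
]

def query_partition(records):
    """Each node runs the same query independently on its partition."""
    return sum(r["purchases"] for r in records)

# Run the per-node queries in parallel, then combine the partial results.
with ThreadPoolExecutor() as pool:
    partial_results = list(pool.map(query_partition, partitions))

print(sum(partial_results))  # single combined answer: 20
```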
These papers evolved into a new programming model promoted by Google called MapReduce. The model showed how to process very large datasets in parallel across distributed clusters. Google used MapReduce to analyze the results of its internet search queries quickly.
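Here is a rough single-process sketch of the model’s two phases, using the classic word-count example. On a real cluster, many mappers and reducers run in parallel on different machines; this only illustrates the shape of the computation.

```python
# Minimal MapReduce sketch: map, shuffle (group by key), reduce.
from collections import defaultdict

documents = ["big data is big", "data about data"]  # stand-in input

# Map phase: turn each input record into (key, value) pairs.
mapped = []
for doc in documents:
    for word in doc.split():
        mapped.append((word, 1))

# Shuffle phase: group all values by their key.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: combine each key's values into a final result.
counts = {word: sum(values) for word, values in grouped.items()}
print(counts)  # {'big': 2, 'data': 3, 'is': 1, 'about': 1}
```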
MapReduce, in its first iteration, was not very user-friendly. Open-source software enthusiasts wrote software implementing the MapReduce model that could run on distributed clusters built from low-cost, off-the-shelf hardware. That open-source software was called Hadoop, named by one of its authors, Doug Cutting, after his son’s favorite toy elephant.
The original Hadoop was operated much like old DOS programs, through a command line. Programmers working with Hadoop, however, were more familiar with the Java programming language, which had taken off by then, and craved an easier way to set up and work with Hadoop clusters.
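For a flavor of what that era felt like, here is a hedged sketch of a word count written for Hadoop Streaming, the interface that lets scripts serve as the map and reduce steps by reading stdin and writing tab-separated key/value pairs to stdout. The file names are placeholders, and the exact job-submission command varies by Hadoop installation.

```python
#!/usr/bin/env python3
# mapper.py -- Hadoop Streaming pipes input splits to this script on
# stdin and expects "key<TAB>value" lines on stdout.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- Hadoop Streaming delivers mapper output sorted by key,
# so identical words arrive as consecutive lines and can be summed
# with a running total.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rsplit("\t", 1)
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, 0
    current_count += int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")
```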
In 2009, Matei Zaharia created Apache Spark – an analytics engine for large-scale data processing. It provided a much easier-to-use API for working with data stored across Hadoop clusters. Like Hadoop, Apache Spark is open source and requires no paid user license.
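As a sense of how much friendlier Spark’s API is, here is the same word count written with PySpark, Spark’s Python interface. The input file name is a placeholder, and this assumes a local pyspark installation.

```python
# Word count in a few chained PySpark operations.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").getOrCreate()

counts = (
    spark.sparkContext.textFile("input.txt")  # read lines (HDFS or local)
    .flatMap(lambda line: line.split())       # map: line -> words
    .map(lambda word: (word, 1))              # map: word -> (word, 1)
    .reduceByKey(lambda a, b: a + b)          # reduce: sum counts per word
)

print(counts.collect())
spark.stop()
```

Spark distributes the same map and reduce work across the cluster automatically, which is why this reads like ordinary application code rather than job plumbing.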
Next, the creators of Apache Spark founded a software company, Databricks. Databricks built a web-based platform for working with Spark, extending the data analytics capabilities of the Hadoop and Spark ecosystem.
Microsoft became an important financial investor in Databricks in 2019 and offers Azure Databricks, which helps big data operators run data analytics within the Azure cloud services environment.
Tech Learning
As you can see, there is a long history behind the development of big data solutions, and we hope you now have a greater appreciation of the pieces that allow Hadoop and Spark to manage big data across distributed databases. At TechStar, we are always researching and learning about technology. This community site is for customers and potential future customers of TechStar (basically everyone) to share in that learning. We hope you find the information useful and will consider calling TechStar with any technology work you may need done.
Join the TechStar community and become active on our site!