If technical terms such as big data, data lake, or SQL Server Integration Services make you gulp because they sound like Elvish to you, don’t worry. This post will clarify these and other big data terms for you. Even if you are an industry pro, you might find this post helpful. In fact, I challenge you to read through it to see if you are indeed using the terms correctly.
Big data is a term used to describe the growing amount of data we receive, the speed at which we receive it, and the way we store and analyze it. In short, big data means both the size of the data itself and the tools used to handle it.
We now have to store many new types of data, in much larger sizes, arriving at a much higher rate, and we have to keep up. The types of data we now receive, store, and process include unstructured data such as video, audio, email, and security information. The data is also arriving in larger file sizes, such as 4K video, large Internet of Things feeds, and technical dumps. I discuss the difference between structured and unstructured data in a separate post.
That’s the problem of the modern world. The amount of data has grown exponentially, and it needs to be acted on a whole lot quicker. That is why we’ve been developing tools that can manage both structured and unstructured data. The type, quality, and size of the data being stored today mean we need a way to scale up our database services.
Because many of these new types of data grow significantly each year, we need the ability to store more unstructured data. That means massive databases, and processing the data in them becomes a challenge. That is how Hadoop came into being.
Hadoop uses simple programming models to distribute the processing of large datasets across clusters of computers. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Each machine works on its own share of the data, and the cluster reports the results back as if it were operating as one single system.
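To make the idea of a "simple programming model" more concrete, here is a minimal sketch of the map, shuffle, and reduce steps as a word count in plain Python. It does not use Hadoop at all. The function names and the sample text are made up for illustration, and the "machines" are simulated with an ordinary loop.

```python
from collections import defaultdict
from itertools import chain


def map_phase(chunk):
    """Map step: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce step: sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(chunks):
    # On a real cluster each chunk would be mapped on a different machine.
    # Here that is only simulated by looping over the chunks in one process.
    mapped = chain.from_iterable(map_phase(chunk) for chunk in chunks)
    return reduce_phase(shuffle(mapped))


if __name__ == "__main__":
    sample_chunks = ["big data big tools", "data lake data warehouse"]
    print(word_count(sample_chunks))
```

The point of the split is that the map step only ever looks at one chunk, so the chunks can be handed to as many machines as you have.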
Now that we've got the database, we need tools and storage options allowing us to move the data into and around our system. We have a couple of options, depending on the data type we are managing.
Often customers store data without knowing when they will use it. The footage data from security cameras is a great example. Until a crime is suspected, the data may never be accessed, and 99% of the footage may never be viewed. But it still needs to be stored just in case there is a need.
A data lake is a vast pool of raw data that can hold data such as security footage. There is no data model that describes all aspects of the data, and most of the data in the lake is unprocessed. This means data lakes often take up more space on disk than managed data stored in relational databases. A data lake offers an effective solution that can load and store large amounts of data very rapidly, without applying data transformations or rules.
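As a rough illustration of storing data without transformations or rules, the sketch below writes incoming bytes to a folder exactly as they arrive and only interprets them when somebody reads them, an approach often called schema-on-read. The folder stands in for a data lake, and the helper names land_raw and read_as_json are hypothetical, not part of any product.

```python
import json
import tempfile
from pathlib import Path


def land_raw(lake_dir, name, payload):
    """Write bytes into the lake exactly as received: no parsing, no rules."""
    path = Path(lake_dir) / name
    path.write_bytes(payload)
    return path


def read_as_json(path):
    """Impose structure only at read time, when someone needs the data."""
    return json.loads(Path(path).read_bytes().decode("utf-8"))


if __name__ == "__main__":
    # A temporary folder plays the role of the lake in this sketch.
    with tempfile.TemporaryDirectory() as lake:
        event = land_raw(lake, "event.json", b'{"camera": 7, "motion": true}')
        clip = land_raw(lake, "clip.bin", b"\x00\x01\x02")  # opaque footage bytes
        print(read_as_json(event))
        print(clip.stat().st_size, "bytes stored untouched")
```

Writing is fast because nothing is checked on the way in. The cost shows up later, when a reader has to work out what each file actually contains.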
Data lakes are ideal for housing examples used in machine learning. Machine learning needs as many examples as you can give it in order to understand and process the various parts of your document. Form recognition is one example of that. This works well for data scientists and data analysts, who are the typical users of data stored in a data lake.
A counterpart to a data lake is a data warehouse. A data warehouse is a repository of structured, filtered data that has already been processed for a specific purpose. Examples include sales data, inventory, and time sheets. In this case, the data fields are clearly defined, and the tables of data typically relate to one another in efficient ways through SQL databases.
The data follows rules concerning data types and properties. For example, we know an employee's social security number has three digits followed by two digits followed by four digits. That is a clearly defined set of rules.
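A rule like the social security number format can be checked in a few lines. The sketch below assumes the digits are separated by hyphens, as in 123-45-6789, which the text above does not actually state, and the function name is made up for illustration. It checks the shape of the value only, not whether the number was ever issued.

```python
import re

# Three digits, two digits, four digits. The hyphen separators are an assumption.
SSN_PATTERN = re.compile(r"[0-9]{3}-[0-9]{2}-[0-9]{4}")


def is_valid_ssn_format(value):
    """Return True when the whole string matches the 3-2-4 digit layout."""
    return SSN_PATTERN.fullmatch(value) is not None


if __name__ == "__main__":
    for candidate in ["123-45-6789", "1234-5-6789", "123456789"]:
        print(candidate, is_valid_ssn_format(candidate))
```

A data warehouse applies checks of this kind before data is stored, which is why what ends up in it is predictable.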
By storing only processed data and not maintaining data that will never be used, data warehouses also save on pricey storage space. Additionally, processed data can be easily understood by larger audiences, including typical business users. Now that we know how to store the data, how do we move it into and around the storage? Here are some options.
SSIS is a component of Microsoft SQL Server. It stands for SQL Server Integration Services, and it helps with various data migration tasks. SSIS transfers on-premises data from place to place automatically. It uses memory to optimize these moves, which allows you to manipulate the data while it is still in memory. This makes SSIS one of the faster tools on the market, and it is great for on-premises solutions. The main issue with SSIS, however, is that it is part of SQL Server, so you need a SQL Server installation to use it. But what if you want functionality similar to SSIS but prefer to drive it through a web browser instead of SQL Server?
If you want SSIS-like functionality through a web browser, you need Azure Data Factory (ADF). Azure Data Factory is a fully managed, cloud-based data integration service that can be used to automate data movement or schedule the refresh of data through an ETL pipeline. ETL stands for extract, transform, and load. It is about getting the data, adding to or changing parts of it, and then moving it to another destination.
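The three ETL steps can be sketched in a few lines of plain Python. This is not Azure Data Factory code and does not use any ADF API. It is only an illustration of extract, transform, and load, with made-up sample data, an in-memory SQLite database as the destination, and a hypothetical table named sales.

```python
import csv
import io
import sqlite3

# Made-up source data with untidy spacing and capitalization.
RAW_CSV = "name,amount\n alice ,10.50\nBob,3\n"


def extract(text):
    """Extract: read the raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: tidy the names and turn the amounts into numbers."""
    return [(row["name"].strip().title(), float(row["amount"])) for row in rows]


def load(conn, records):
    """Load: write the cleaned records to the destination table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (name TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales (name, amount) VALUES (?, ?)", records)
    conn.commit()


def run_pipeline(text, conn):
    """Run the three steps in order."""
    load(conn, transform(extract(text)))


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    run_pipeline(RAW_CSV, connection)
    print(connection.execute("SELECT name, amount FROM sales ORDER BY name").fetchall())
```

A managed service adds scheduling, monitoring, and connectors to many sources on top of this basic pattern.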
ADF allows developers to integrate disparate data sources by using a graphical interface or by writing code. If you are already using Azure, Data Factory is likely the best solution available to you.
Another solution for handling data is Databricks. It is an automated system for running ETL pipelines that move and manage data. It is based on Apache Spark, and Azure Databricks is a commercial offering built on Apache Spark. It can pull data in, transform it using Python, Java, or a number of other languages, and drop it somewhere else, such as a data lake.
Databricks does require knowledge of Spark and of a language such as Scala, Java, R, or Python. It is built for data engineers and data scientists to do their work, not for the general business user.
Now you have a better understanding of the vocabulary of data, databases, and tools you can use to store, move, and analyze your big data. At TechStar, we are always researching and learning about technology. This community site is for customers and potential future customers of TechStar (basically everyone) to share in that learning. We hope you find the information useful and will consider calling TechStar in the future with any technology work you may need done.