
Spot the Difference: Databases, Data Lakes and Data Warehouses

Data lakes, data warehouses and databases are all designed to store data. Some do it better than others, more efficiently and, on occasion, at a lower cost. There are also many data buzzwords floating around at the moment, and it’s becoming increasingly difficult to keep up with all the terminology and definitions in such a fast-paced and ever-changing industry. As Triangle continues to position itself as a data thought leader, we’ve set about clarifying the differences between a data lake, a data warehouse and a database.

Databases

Databases, in the broad sense of organised record keeping, date back as far as ancient times, when such systems were developed and used by libraries, hospitals and business organisations. In the 1960s, computerised databases were introduced as computer use grew in private organisations. Fast-forward to 1970, when E.F. Codd published his influential paper on the relational model of data, and we saw a significant shift in the way people thought about databases.

Over the next 30 years, databases continued to change and evolve with technology and business demands. Now, in 2018, the likes of IBM, Microsoft and Oracle largely dominate the database market.

In summary, databases are fundamentally designed to monitor and update structured data in real time, and commonly only the most recent data is made available.
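As a minimal sketch of that behaviour, the example below uses Python's built-in sqlite3 module with an in-memory database. The stock table, its columns and the current_stock helper are hypothetical names chosen for illustration. It shows an update overwriting the previous value, so that a query sees only the most recent state.

```python
import sqlite3


def current_stock(conn, item):
    """Return the quantity currently held for an item, or None if unknown."""
    row = conn.execute(
        "SELECT quantity FROM stock WHERE item = ?", (item,)
    ).fetchone()
    return row[0] if row else None


# Hypothetical operational table: one row per item, holding its current state.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE stock (item TEXT PRIMARY KEY, quantity INTEGER NOT NULL)"
)
conn.execute("INSERT INTO stock VALUES (?, ?)", ("widget", 10))
before = current_stock(conn, "widget")

# A sale updates the row in place; the earlier quantity is not kept.
conn.execute(
    "UPDATE stock SET quantity = quantity - 3 WHERE item = ?", ("widget",)
)
after = current_stock(conn, "widget")

# Still a single row for the item: only the latest value is available.
row_count = conn.execute("SELECT COUNT(*) FROM stock").fetchone()[0]
```

Keeping a history of changes would need an extra table or a separate system, which is one of the gaps a data warehouse is meant to fill.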

Data Warehouses

In its simplest definition, a data warehouse collects data, both structured and unstructured, from a myriad of different sources, both internal and external. Once the data has been collected, the data warehouse sets about optimising it for retrieval.

Data warehouses were first introduced as a result of the sudden surge in internet use during the 1990s. As the internet grew in prevalence, along with globalisation, networking and computerisation, businesses recognised the need for business intelligence and, as a result, the need for data warehousing became apparent.

By 2000, businesses had hit a problem. As databases and application systems expanded, businesses realised that their systems had been badly integrated and that data was inconsistent across many organisations. They also recognised that they were storing vast quantities of fragmented data.

In summary, if businesses wanted critical information and business intelligence to make more informed decisions, there needed to be a solution. Data warehouses were introduced as a way to consolidate the data sourced from multiple systems into one reliable source to support strategic business decisions.
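The consolidation step can be sketched in a few lines of Python. Everything here is assumed for illustration: the two source extracts (crm_rows and billing_rows), the sales_fact table and the normalise_name rule are hypothetical, and a real warehouse load would involve far more cleansing. The point is simply that inconsistent records from separate systems are brought into one table that can be queried as a single source.

```python
import sqlite3

# Hypothetical extracts from two badly integrated systems. The same customer
# is spelt differently and the amounts arrive in different types.
crm_rows = [{"customer": "ACME Ltd", "spend_gbp": "1200.50"}]
billing_rows = [("acme ltd", 300), ("Globex", 450)]


def normalise_name(name):
    """Apply one naming rule so records from both systems line up."""
    return name.strip().lower()


def load_warehouse(conn, crm, billing):
    """Consolidate both extracts into a single table with one consistent shape."""
    conn.execute(
        "CREATE TABLE sales_fact (customer TEXT, source TEXT, amount REAL)"
    )
    for record in crm:
        conn.execute(
            "INSERT INTO sales_fact VALUES (?, ?, ?)",
            (normalise_name(record["customer"]), "crm", float(record["spend_gbp"])),
        )
    for name, amount in billing:
        conn.execute(
            "INSERT INTO sales_fact VALUES (?, ?, ?)",
            (normalise_name(name), "billing", float(amount)),
        )


def total_by_customer(conn):
    """Answer a business question from the consolidated source."""
    query = "SELECT customer, SUM(amount) FROM sales_fact GROUP BY customer"
    return dict(conn.execute(query))


conn = sqlite3.connect(":memory:")
load_warehouse(conn, crm_rows, billing_rows)
totals = total_by_customer(conn)
```

Because the names are normalised on the way in, the two "ACME" records are reported as one customer rather than two.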

Data Lakes

Since the early 2000s, data lakes have continued to grow in popularity, with many organisations typically requiring both a data warehouse and a data lake. A data lake differs from a data warehouse because it stores both relational and non-relational data and, furthermore, the structure of the data, or schema, is not defined when the data is captured. This approach is often called schema-on-read.
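To illustrate what it means for the schema not to be defined at capture time, here is a small Python sketch. The raw_lake list is a stand-in for real lake storage, and the record contents and function names are invented for the example. Records of different shapes are stored exactly as they arrive, and a structure is applied only when someone reads them for a particular purpose.

```python
import json

# Stand-in for lake storage: raw records are kept exactly as they arrive.
raw_lake = []


def capture(record_text):
    """Store a record without validating it against any schema."""
    raw_lake.append(record_text)


# Hypothetical records from an IoT device and a mobile app, with different shapes.
capture('{"device": "sensor-1", "temp_c": 21.5}')
capture('{"user": "alice", "action": "login", "app": "mobile"}')
capture('{"device": "sensor-2", "temp_c": 19.0, "humidity": 40}')


def read_temperatures(lake):
    """Apply a schema at read time: keep only records with a device and a temperature."""
    readings = []
    for text in lake:
        record = json.loads(text)
        if "device" in record and "temp_c" in record:
            readings.append((record["device"], float(record["temp_c"])))
    return readings


temperatures = read_temperatures(raw_lake)
```

A different reader could apply a different schema to the same raw records, for example to pull out only the login events.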

Organisations that employ a data lake stand to benefit from improved customer interactions and operational efficiencies.

When we explore the key differences between a data warehouse and a data lake, the first is the type of data collected. A data warehouse will collect and store relational data from operational databases and line-of-business applications, whereas a data lake will store both relational and non-relational data from IoT devices, mobile apps and corporate applications. Secondly, there is a major difference in the types of users. Data warehouses are largely used by business analysts, whereas data lakes are used by a much broader range of users, made up of data scientists, data developers and business analysts.

However, businesses that use data lakes must be aware of the dangers lurking beneath the water. As the name suggests, a data lake will hold large amounts of data, much as a large lake holds lots of water. The danger is that a data lake does not in itself provide governance or oversight capabilities; the business must therefore provide descriptive metadata and have a mechanism in place to manage the data lake effectively.
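One very simple form such a mechanism could take is a catalogue that records descriptive metadata for each dataset in the lake. The Python sketch below is an assumption-laden illustration, not a description of any particular product: the CatalogEntry fields, the Catalog class and the example paths are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Descriptive metadata for one dataset in the lake (hypothetical fields)."""
    path: str
    owner: str
    description: str
    tags: set = field(default_factory=set)


class Catalog:
    """Minimal registry that makes lake contents findable and accountable."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        # Refuse entries that nobody owns or that nobody has described.
        if not entry.owner or not entry.description:
            raise ValueError("every dataset needs an owner and a description")
        self._entries[entry.path] = entry

    def find_by_tag(self, tag):
        """List the paths of registered datasets carrying a given tag."""
        return sorted(e.path for e in self._entries.values() if tag in e.tags)

    def undocumented(self, paths_in_lake):
        """List the paths present in the lake that have no catalogue entry."""
        return sorted(p for p in paths_in_lake if p not in self._entries)


catalog = Catalog()
catalog.register(
    CatalogEntry("iot/readings", "ops-team", "Raw sensor readings", {"iot", "raw"})
)
catalog.register(
    CatalogEntry("apps/events", "mobile-team", "Mobile app event log", {"raw"})
)

raw_datasets = catalog.find_by_tag("raw")
orphans = catalog.undocumented(["iot/readings", "apps/events", "tmp/export_old"])
```

The undocumented check is the useful part in practice: anything sitting in the lake without an entry is a candidate for review.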

To Conclude

A database provides the means to monitor structured data in real time, often with only the most recent data being made available.

A data warehouse will collect and store structured and unstructured data from an array of different internal and external sources and will optimise the data for retrieval.

A data lake will store structured and unstructured data and will provide a method for organising large volumes of data from very diverse sources.

July 16, 2018