Key Considerations for Data Warehouse Design
Data warehouse design forms an important part in determining the accuracy of your business reporting. The style or design of your data mart or data warehouse will have a major influence on the cost, speed, maintainability, flexibility and success of your entire data warehousing and reporting environment.
Many companies follow a similar path on this data warehouse design journey. Having been in the industry for many years we recognise that companies are still making the same data warehouse design mistakes other companies made in the 1990’s.
Given the advances made in the understanding of these problems and techniques devised to solve them, the wise company can learn from competitor’s mistakes with data warehouse design and jump straight to the best solution.
This data warehouse journey always starts with an application such as an ERP, GL, or CRM system from which there is a need to create reports. During the initial project to choose and implement these applications, the reporting requirements are invariably overlooked or greatly underestimated.
The business soon realises they need additional and more complex reports in order to understand and manage their business and these are not available in the application. Getting the data warehouse design correct in the first place will allow your business to avoid this problem.
When undertaking data warehouse design, it is important to consider the following:
Reporting over Source Applications
As constant application development for new requirements is very expensive, departments will classically overlook data warehouse design and just purchase a reporting tool and start to report directly over the source system database. They will soon hit some huge problems – very complex SQL due to the normalised or proprietary nature of application databases in turn causing very slow performance of summary reports.
The effect of numerous complex reports running directly over an application can also cause the application itself to slow down, lock internal tables or in the worst-case crash. The typical response to this is for the application support team or DBA’s to attempt to fix this by creating new summary tables in the application database, often referred to as reporting or summary tables.
Overtime numerous inflexible reporting tables get created and the time required to populate them grows as more and more data needs to be loaded daily. Integrating data from outside the original application remains difficult and may be handled in Excel.
Realising that a database purely for reporting is required, and the ability to combine data from disparate systems and to remove the load on source applications, the next logical step is the creation of a data warehouse.
Inmon Enterprise Data Warehouse
The early data warehouse design approach is best described by Bill Inmon in Corporate Information Factory (or Enterprise Data Warehouse – EDW) methodology devised in 1998. In short, this approach aims to collect all the data in an organisation into a single, integrated database on the assumption it may be required in the future.
This is often referred to as the top-down approach as it assumes all data might be useful and should, therefore, be in the data warehouse. Often IT, driven by RDBMS database administrators are keen to draw on their existing modelling and SQL skills and toolkits. These can be very large, long-winded and expensive projects which if successful mainly solve the integration issues.
As a normalised or 3rd normal form database they remain complex to report on. Typically, these would consist of a staging area, an enterprise or integration area modelled in 3rd normal form, and often a 3rd area of departmental data marts to simplify reporting. Initially, these would have been simplified subsets of the EDW.
Ralph Kimball Dimensional Modelling
Realising the problems with the top down or Corporate Information Factory approach in terms of scale and cost, recognising the need to simplify the database schema to make reporting easier, Ralph Kimball developed the Dimensional Modelling approach.
Based upon the conviction that data warehouses must be designed to be easy to understand and fast to report from, it essentially de-normalises the database schema to the level of a fact table for each subject area containing measures directly linked with dimension tables containing the attributes.
The benefits of de-normalising a schema are many including easier understanding, simpler code and faster query time. Typically, a Kimball data warehouse will contain a staging area and a presentation area only, with any data cleansing or integration carried out in the staging area. Often described as a bottom-up approach the Kimball methodology advocates only bringing in useful and needed data into the data warehouse to save time, cost and keep the model simple and efficient.
In later years Bill Inmon has modified the departmental data marts area of the Corporate Information Factory to use Ralph Kimball’s Star Schemas. It is now industry best practise to use Star Schemas in some form for reporting.
Inmon vs Kimball
The key differences in data warehouse design approach remains the scope of the information to be pulled in (everything vs required only), the longevity of the data (forever vs disposable) and the requirement to have a 3rd normal form schema prior to the final area or not. In most situations, the Kimball Dimensional Methodology is the clear winner in terms of performance, speed to implement, ease of maintenance and ability to change with new requirements.
There are still many Inmon style warehousing projects out there, often as they are a way for consultancies to sell a larger and longer project. Choosing the wrong approach can have expensive implications for your projects.
Sign up and never miss a blog post from Triangle