A Clarity Data Warehouse is design-led. A well designed data warehouse is the cornerstone of a successful Information Management solution. There are a number of ETL and Reporting tools that propose a warehouse solution requiring little analysis or design input. This is fine for demonstrations or prototypes but will not support an organisations information evolution, nor will they support the information needs of distributed analysts.
The case for a Data Warehouse
Typically in any organisation the data that is combined to form “information” will be spread across multiple systems and departments which in turn will use its own workflow or business applications to store data. In order to collect all this disparate data from various sources and create standard and ad hoc reports, and dashboards a central data repository is required.
With the information collected in a central place, there are many benefits to integrating these disparate sources and re-modelling the data. Integrated data sources allow for a 'single view of the customer', while re-modeled data can vastly improve query performance and simplify the presentation of the data.
Enhanced query performance must not be underestimated, and a correctly modeled and tuned Data Warehouse can support the type of information access people have come to expect in the internet era thanks to the likes of Google.
Data Warehouse Design
Clarity Data Warehouse professionals all subscribe to the Kimball multidimensional design philosophy for most presentation layer data structures. While the presentation data structures are designed to be fit for purpose, we find that even the denormalised structures still make use of common conformed dimensions - e.g Date!
This does not mean we are textbook-followers, in fact most Clarity data warehouses tend to be a hybrid Inmon/Kimball implementation. This is often the case when one or more of the major source systems is non 3NF (there are more of these than you think!).
Understanding a Data Warehouse
Data warehouse can be defined as a large repository that contains historical data, which enables decision-making on an enterprise level. To support decision-making, a data warehouse is associated with a set of tools that enable quick query and analysis from historical data for the organisation’s top management to make informed business decisions.
It is common for people to mistake this with an operational database but there are some major differences between a data warehouse and an Online Transaction Processing (OLTP) DB. Relational databases are designed with a specific function or purpose in mind. The purpose of a data warehouse totally differs from that of a relational database resulting in totally different design characteristics for each of them. In order to design a data warehouse it is essential to understand the difference between the two.
OLTP is a relational database that is used for day-today processing and storing data. Data warehouse contains only historical data.
In an OLTP database the data tends to change many times in a single day. This is not the case in a data warehouse. The data being historical in nature rarely changes and receives the changed data at regular intervals and hence will grow over a period of time.
A data warehouse is optimised to handle bulk data loads and process large and complex queries that might access multiple rows in a table. An OLTP on the other hand is optimised to perform small transactions like insert or update a single column or row in a table.
An OLTP database is designed for real-time business operations while a data warehouse is designed for analytical functions.
A data warehouse is huge and usually will be hundreds of gigabytes in size sometimes this can grow up terabytes. The OLTP database is usually limited to small sizes.
A data warehouse allows only a few concurrent users to access the system while the OLTP database allows hundreds or even thousands of concurrent users to access the database.
From the above points it becomes clear that the key driver for a distinct data warehouse design is the need to be able to fetch the results quickly when queried against huge volumes of data.
A typical flow of data in a Data Warehouse
Data is sourced from a number of transaction processing systems within the organisation, these systems will often exist in defferent data storage platforms.
This disparate data is then collected into a temporary storage area that is referred to as staging area. The data in the staging area is then cleansed, checked for duplicates; names and addresses are standardised and then loaded into the data warehouse. This process is known as Extract, Transform and Load or ETL in a BI parlance.
A data warehouse can contain individual data marts for each of the department, like finance, purchase, sales etc and will also contain cubes. Using these data marts and cubes meaningful reports, aggregation and analysis can be performed by generating various reports, query and analysis using standard reporting or ad hoc reporting and dashboards.
The various users in an organisation will be split and grouped together by their business functions or roles and depending on this role or function they will be allowed to access to view standard reports and create ad hoc reports. Dashboards are normally presented to the top management enabling them to make proper business decisions.
Things To Consider While Designing a Data Warehouse
Data modeling and definition of metadata constitute as single, important activity of designing the enterprise data warehouse. A well thought out data model that is flexible, scalable and usable will surely lead up to a successful data warehouse implementation.
There are some core principles that have to be considered before embarking on goal of designing a data warehouse. Let us look at these principles in brief.
It is imperative that the project and architectural goals are clearly defined and understood in order to create a good data warehouse design. A good architecture should deliver ease of use to users, handle complex queries, focus on single subjects and should be based on a dimensional model.
Subject areas form the base of a data warehouse. Identifying the subject areas is key while modeling a data warehouse. Each subject area will be independent of transactions that involve data creation and/or manipulation.
A good data warehouse design should have the capability to integrate individual representations by mapping them to it.
The data warehouse should be flexible enough to handle any changes to the requirements in a rapid fashion.
The data warehouse designed should be capable of handling data in a distributed network environment.
These days a typical BI/DW solution is expected to go beyond the normal reporting capabilities and perfume trend analysis, forecasting and “what if” analysis. Hence the data warehouse should be capable of allowing the any user with proper privileges to update historical data and present the “what if” scenarios. Once the analysis is complete, the data must be rolled back to what it was earlier. For this to work data should be at the proper granularity level.
Provisions to load data in batches and huge volumes by automated ETL scripts should be available. Mechanism and controls should be in place to load only “clean” data into the data warehouse. An effective data cleansing and data quality mechanism and software tools/interfaces should be identified as part of the architectural goals in order to achieve this.
Dimensional Models for a Data Warehouse
A dimensional model can be defined as a group of business facts that are detailed and surrounded by multiple dimensions describing those facts. Typically a dimensional model’s schema will contain a fact table that forms the centrepiece with multiple dimension tables. Normally a dimensional model will produce a snowflake schema or a star schema.
In a Star Schema all dimension tables can directly be joined to the fact table, while a Snowflake Schema is one where one or many dimension tables can only join to the fact table with the help of other dimension tables and not directly.
Some of Kimball’s essential dimension modeling rules Ralph Kimball is widely considered to be the father of data warehousing architecture. His design methodology is popularly known as dimensional modeling and also sometimes referred to as the Kimball methodology and has become a standard designing a data warehouse. Kimball's methodology takes up a bottom-up approach to design and model a data warehouse.
In an organisation each business process typically generates or captures key performance metrics that are unique to each business department. Hence it is best to structure the dimension models around these business processes.
Fundamentally data grain can be categorised as transactional, periodical or accumulating snapshot in a single fact table. It is imperative that the level of details must be the exact same level within a fact. Avoiding multiple levels will eliminate any confusion that may be caused for a business user to understand and avoid error prone results.
Many to many relationships in a fact table and many to one relationship in a dimension tables should be resolved.
Ensure the use of surrogate keys in a dimension table to track any dimension attribute changes.
Enable data integration by creating conformed dimensions, which are essential to Extract, Transform and Load data once and then reuse the same arcoss, many fact tables in order to support data drill down in an enterprise data warehouse. This also eliminates redundancy in design and development and reduces the time-to-market.
Following the above rules as advocated by Kimball will ensure data granularity and provide a flexible data warehouse and will help you from running into data warehousing stonewalls.