Clarity data integration professionals combine knowledge of best practice data integration techniques with experience in the latest technologies to develop effective and efficient solutions.
Data Integration and ETL
The process of structuring all the data available in an organisation in such a way that it supports the Data Warehouse and Business Intelligence products or solutions is known as Data Integration. There are three major components in the Data Integration process, collectively known as Extract, Transform and Load (ETL), or ELT when the data is loaded before it is transformed.
The data integration and ETL process can be described as the data acquisition stage in a typical BI/DW implementation. As the term suggests, this is the stage where data from all of the enterprise's workflow or business applications is pulled together into a staging area and then loaded into the data warehouse. It is very common to see numerous business applications, each with its own database structure and method of storing data; some applications will even store data in Excel or flat files. The reason is that each individual department in a large enterprise will have its own business applications and its own methods of data storage and retrieval.

ETL developers then write scripts that pull data from the various sources and bring them all into a common database, usually referred to as the staging area. The data in the staging area is cleansed and then loaded into the data marts or presentation area of the warehouse.
A successful implementation should ideally follow an iterative development approach that adapts easily to project drivers, irrespective of the scope, size or complexity involved, and gradually breaks through the technology and data barriers.
A successful data integration implementation starts with a data integration architecture. A good architecture should ideally address four key aspects of the data integration process: Applications, Technologies, Techniques and Management.
Applications and Technologies
The data integration architecture should define all the applications and the relevant technologies to be used in the solution implementation. These choices must be based on the technical and functional requirements of the BI project, but the architecture should lay the foundation on which those decisions are taken. The designed architecture should be scalable enough to process all data sources and meet the future reporting needs of the organisation.
The applications chosen for data integration should be capable of connecting to all industry-standard databases and data sources. Some of the leading data integration tools are Oracle Warehouse Builder (OWB) / Data Integrator (ODI), SAS Data Integration Studio (DIS), SAP BusinessObjects Data Integrator, IBM DataStage, Informatica and Microsoft's SQL Server Integration Services. All of these tools are reasonably easy to understand and can connect to any standard database to push and pull data. Although they will rarely generate the most efficient processing code, they encourage consistent development methods being applied, which should make future maintenance simpler. When choosing the best application for your organisation, pay special attention to record-processing performance and the availability of experienced resources in your market.
Best practice data integration techniques are well publicised, and are perhaps best defined in Ralph Kimball's ETL subsystems. As these are all documented elsewhere we won't go into detail here, other than to say that most of them are required in a data integration system of medium to high complexity.
One area that warrants discussion is data cleansing.
Cleansing is often applied to names and addresses, which need to be standardised and to have duplicates identified and removed. Data cleansing should not extend to derivations that change the original source data. Changing data removes traceability to the source system and can actually reduce confidence in the data, for instance where front-line staff see different detail in reports from what exists on the application screen they entered the data into! As with the integration and ETL tools, there are specialised data cleansing tools, such as SAS DataFlux and the Firstlogic suite of products, which provide address standardisation and duplicate identification services. These tools can be customised to suit each organisation's data cleansing and data quality needs.
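To make the standardisation and duplicate identification ideas concrete, here is a deliberately simple rule-based sketch. Dedicated tools such as DataFlux use far richer matching logic; the abbreviation table and matching key below are illustrative assumptions only.

```python
# Hedged sketch of rule-based cleansing: standardise address strings,
# then flag likely duplicates that share a normalised (name, address) key.
import re

# Illustrative abbreviation table; a real tool ships far larger reference data.
ABBREVIATIONS = {"st": "street", "rd": "road", "ave": "avenue"}

def standardise_address(address):
    """Lower-case, strip punctuation and expand common abbreviations."""
    words = re.sub(r"[^\w\s]", "", address.lower()).split()
    return " ".join(ABBREVIATIONS.get(w, w) for w in words)

def find_duplicates(records):
    """Return pairs of records that share the same normalised key."""
    seen, duplicates = {}, []
    for rec in records:
        key = (rec["name"].strip().lower(), standardise_address(rec["address"]))
        if key in seen:
            duplicates.append((seen[key], rec))
        else:
            seen[key] = rec
    return duplicates
```

Note that both functions only derive a matching key; the original source values are left untouched, consistent with the traceability principle above.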
The data integration architecture should also ensure that data is managed and stored in a way that can be easily accessed by all enterprise users. This discipline is known as metadata management, and it facilitates a smooth data flow and data integrity across the organisation. Storing metadata in the ETL process supports interoperability and makes it easy to analyse the impact of any change on the solution and the related business applications. Logging all of these activities is generally considered basic best practice in an ETL process.
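The logging practice mentioned above often takes the form of a run-log table populated by every ETL job. The sketch below shows one possible shape; the table and column names are assumptions, not a standard.

```python
# Illustrative sketch: record basic run metadata (job name, row count,
# timestamp, status) so ETL activity is logged and auditable.
import sqlite3
from datetime import datetime, timezone

def log_etl_run(conn, job_name, rows_loaded, status="SUCCESS"):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS etl_run_log "
        "(job_name TEXT, rows_loaded INTEGER, run_at TEXT, status TEXT)"
    )
    conn.execute(
        "INSERT INTO etl_run_log VALUES (?, ?, ?, ?)",
        (job_name, rows_loaded,
         datetime.now(timezone.utc).isoformat(), status),
    )
    conn.commit()
```

A log like this is what makes impact analysis practical: when a change is proposed, the run log shows which jobs touch which data and how often.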
The ETL design should be capable of tracking all data changes throughout the application's lifetime. Changes should be tracked for each and every transaction made during an ETL run. It is always wiser to schedule bulk ETL jobs during periods of low database activity.
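Per-run change tracking can be sketched as a comparison of each incoming batch against the current warehouse state, emitting a change record for every insert or update. The record shape and `run_id` scheme below are illustrative assumptions; production systems typically persist these records as audit rows or slowly changing dimension history.

```python
# Hedged sketch of per-run change tracking: compare an incoming batch
# against current rows (both keyed by business key) and record what
# changed in this ETL run.
def track_changes(current, incoming, run_id):
    """Return a list of change records for one ETL run."""
    changes = []
    for key, row in incoming.items():
        if key not in current:
            changes.append({"run_id": run_id, "key": key, "change": "INSERT"})
        elif current[key] != row:
            changes.append({"run_id": run_id, "key": key, "change": "UPDATE"})
    return changes
```

Stamping every change record with the run identifier is what lets an auditor answer "which ETL run altered this row, and when?" long after the fact.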
Keys to a successful implementation
The key to any successful data integration and ETL implementation is to understand the organisation's requirements and infrastructure capabilities, and to architect a solution that addresses all of its reporting needs. It is also very important that the tools selected for the data integration and ETL processes are studied in detail and matched to the business and technical requirements.