Tuesday, January 29, 2013

ETL Best Practices

One of the critical phases in any ETL project is the design phase. An ETL architect should understand the best practices before closing out the initial design of the ETL. Implementing a project in the traditional Waterfall model usually requires each phase, such as requirement gathering, high-level design, detailed design, and coding, to be finished completely before proceeding to the next; since a frozen design requires a lot of managerial effort to alter when an enhancement is suggested at a later stage, projects implementing the Scrum Agile methodology have an advantage over this kind of issue.

Whether a project follows the traditional or the Scrum Agile approach, building an optimal, scalable DW model is very important for the success of any ETL project. With this in view, it is often good to follow certain proven methodologies when dealing with these projects. Given below are some of the ETL best practices that one can consider at the start of any ETL project. I invite readers to suggest more best practices that they have gained over years of experience.

ETL Best Practices

1. Better Analysis of the Source Systems
    
    During the extraction phase of an ETL project, one usually connects to the client's legacy RDBMS systems and extracts the required data. Spending a considerable amount of time analyzing these source systems, in terms of data quality, availability windows, presence of auditing columns in the source tables, permissions required to extract the data, and up and down times of the source servers and individual tables, helps avoid many extraction errors in the first part of the data flow itself. One needs guidance from the source system experts to make this process easy, and a simple profiling pass like the sketch below can surface problems early.
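
    As a minimal sketch of such a profiling pass, the Python below counts rows, checks for nulls, and verifies that an expected auditing column exists. The table, column, and data values are hypothetical, and the in-memory SQLite database merely stands in for the client's real RDBMS driver.

        import sqlite3

        # In-memory stand-in for the client's legacy RDBMS; in practice
        # this would be a connection to the real source server.
        conn = sqlite3.connect(":memory:")
        conn.executescript("""
            CREATE TABLE customers (id INTEGER, name TEXT, last_updated TEXT);
            INSERT INTO customers VALUES (1, 'Acme', '2013-01-28'),
                                         (2, NULL,  '2013-01-29');
        """)

        # Hypothetical auditing column we hope to find on each source table.
        TABLES = {"customers": "last_updated"}

        for table, audit_col in TABLES.items():
            rows = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
            nulls = conn.execute(
                f"SELECT COUNT(*) FROM {table} WHERE name IS NULL").fetchone()[0]
            cols = [c[1] for c in conn.execute(f"PRAGMA table_info({table})")]
            print(f"{table}: {rows} rows, {nulls} null names, "
                  f"auditing column present: {audit_col in cols}")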

2. Analysis of the Source Data

    The usefulness of the final BI reports depends on having clean data in the system. So, one should avoid pulling unnecessary, redundant data from the source systems by applying sufficient filtering constraints in the source queries themselves, which minimizes the cleansing work that would otherwise have to be repeated in the BI platform, as in the sketch below.
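
    As an illustration, the sketch below pushes the filter into the extraction query itself, so closed and test records never enter the pipeline. The table and status values are invented, with SQLite standing in for the source system.

        import sqlite3

        conn = sqlite3.connect(":memory:")   # stand-in for the source RDBMS
        conn.executescript("""
            CREATE TABLE accounts (id INTEGER, status TEXT, region TEXT);
            INSERT INTO accounts VALUES (1, 'ACTIVE', 'EU'),
                                        (2, 'CLOSED', 'EU'),
                                        (3, 'TEST',   'US');
        """)

        # Filter at the source: only rows the warehouse needs are extracted,
        # so nothing downstream has to cleanse them away again.
        rows = conn.execute(
            "SELECT id, status, region FROM accounts WHERE status = 'ACTIVE'"
        ).fetchall()
        print(rows)   # [(1, 'ACTIVE', 'EU')]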

3. Staging the Data

    Though it appears that data can be extracted, cleansed, and transformed on the fly in one go, in practice it is advisable to stage the data after extraction from the source systems before processing it. This is usually because various data sources become available at various times, and also for reconciliation and auditing purposes. The transformation procedures might require data from different sources, including flat files, spreadsheets, etc., so once all the data is staged one can start processing it. It is also a good practice to have a Pre-Staging area where the data is an exact replica of the source systems, and a Staging area where the data is loaded after performing all the required transformations. This additional staging area gives us enough assurance of the correctness of the transformations, and an extra layer of auditing, before the data is put into the enterprise data warehouse or datamart.
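
    A minimal sketch of this flow follows, with hypothetical table names and an in-memory SQLite database standing in for the staging server: the pre-staging table is an untouched replica, the staging table holds the transformed rows, and a row-count reconciliation guards the load.

        import sqlite3

        conn = sqlite3.connect(":memory:")
        conn.executescript("""
            CREATE TABLE src_orders     (id INTEGER, amount TEXT);  -- source
            CREATE TABLE pre_stg_orders (id INTEGER, amount TEXT);  -- replica
            CREATE TABLE stg_orders     (id INTEGER, amount REAL);  -- transformed
            INSERT INTO src_orders VALUES (1, ' 10.50'), (2, '7');
        """)

        # Pre-staging: copy of the source with no transformations, so we
        # can always reconcile back to exactly what was extracted.
        conn.execute("INSERT INTO pre_stg_orders SELECT * FROM src_orders")

        # Staging: apply the transformations (trim and cast the amount).
        conn.execute("""
            INSERT INTO stg_orders
            SELECT id, CAST(TRIM(amount) AS REAL) FROM pre_stg_orders
        """)

        # Reconciliation: row counts must match before loading the DW.
        pre = conn.execute("SELECT COUNT(*) FROM pre_stg_orders").fetchone()[0]
        stg = conn.execute("SELECT COUNT(*) FROM stg_orders").fetchone()[0]
        assert pre == stg, "staging lost or duplicated rows"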

4. A Scalable Data Warehouse Design

    One should build the data warehouse design keeping in view the future business plans of the organization, and choose a data modelling technique suited to the users' trade-off between fast query response and normalized data.
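
    For instance, when fast query response is the priority, a dimensional star schema is a common choice over a fully normalized model. The DDL below is a generic illustration with invented table names, not a design for any particular business.

        import sqlite3

        conn = sqlite3.connect(":memory:")

        # A minimal star schema: one fact table keyed to two dimensions.
        # The denormalized category column on dim_product trades storage
        # for fewer joins; a 3NF model would split it into its own table.
        conn.executescript("""
            CREATE TABLE dim_date (
                date_key   INTEGER PRIMARY KEY,
                full_date  TEXT,
                month_name TEXT,
                year       INTEGER
            );
            CREATE TABLE dim_product (
                product_key INTEGER PRIMARY KEY,
                name        TEXT,
                category    TEXT    -- denormalized onto the dimension
            );
            CREATE TABLE fact_sales (
                date_key    INTEGER REFERENCES dim_date(date_key),
                product_key INTEGER REFERENCES dim_product(product_key),
                quantity    INTEGER,
                amount      REAL
            );
        """)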

5. A Good Audit Logging Mechanism

    A must-have for any ETL project is a sufficient logging mechanism to report errors, processing times, number of records processed, etc. Either custom logging or the logging functionalities of the ETL tool itself can be used for this purpose; a custom variant is sketched below.
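
    In the custom approach, every step writes its timings, record counts, and any error into an audit table, as in this sketch. The table layout and job names are hypothetical, with an in-memory SQLite database standing in for the logging database.

        import sqlite3, time

        conn = sqlite3.connect(":memory:")
        conn.execute("""
            CREATE TABLE etl_audit_log (
                job_name     TEXT,
                started_at   REAL,
                ended_at     REAL,
                rows_read    INTEGER,
                rows_written INTEGER,
                status       TEXT,
                error_msg    TEXT
            )
        """)

        def run_step(job_name, step):
            # Run one ETL step and record timings, counts, and errors.
            started = time.time()
            try:
                rows_read, rows_written = step()
                status, err = "SUCCESS", None
            except Exception as exc:
                rows_read = rows_written = 0
                status, err = "FAILED", str(exc)
            conn.execute(
                "INSERT INTO etl_audit_log VALUES (?,?,?,?,?,?,?)",
                (job_name, started, time.time(), rows_read, rows_written,
                 status, err))

        run_step("load_customers", lambda: (100, 100))  # stand-in step
        print(conn.execute("SELECT job_name, status FROM etl_audit_log").fetchall())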

6. Optimal Data Loading and Processing Times

    One should also create sufficient indexes on the tables, make use of the parallel execution support of the ETL tool, and minimize page-outs/page-ins by making use of available memory through buffers, all of which help reduce the down time of the ETL system. One can also make use of the parallel processing or Change Data Capture (CDC) functionalities available in the ETL tools and database servers. In an enterprise DW scenario, it is not possible to refresh all the data every day, so a good approach is to capture only the changed records from the source systems, inserting or updating those records which are new or modified, as in the sketch below.
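
    A minimal watermark-based incremental load is sketched here; the modified column and dates are invented, and SQLite stands in for both the source and the warehouse. Real CDC facilities in the database would replace the manual watermark comparison.

        import sqlite3

        conn = sqlite3.connect(":memory:")
        conn.executescript("""
            CREATE TABLE src (id INTEGER, val TEXT, modified TEXT);
            CREATE TABLE dw  (id INTEGER PRIMARY KEY, val TEXT);
            INSERT INTO src VALUES (1, 'old', '2013-01-01'),
                                   (2, 'new', '2013-01-29');
        """)

        # Watermark from the previous run: only rows changed after it
        # are extracted, instead of refreshing the whole table.
        last_run = "2013-01-15"
        changed = conn.execute(
            "SELECT id, val FROM src WHERE modified > ?", (last_run,)).fetchall()

        # Upsert: insert new rows, update modified ones.
        for id_, val in changed:
            conn.execute(
                "INSERT INTO dw (id, val) VALUES (?, ?) "
                "ON CONFLICT(id) DO UPDATE SET val = excluded.val",
                (id_, val))

        print(conn.execute("SELECT * FROM dw").fetchall())  # [(2, 'new')]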

7. Security

    Ensure sufficient access privileges on the different servers for the personnel required to run jobs or update configuration values which affect the daily ETL outcomes. Sufficient configuration files should be in place to ease moving the project through the different phases: development, QA, and production. A per-environment configuration sketch follows.
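
    As a minimal sketch, with invented server names, the job below picks up one configuration per phase rather than hard-coding connection details in the code.

        import configparser

        # One configuration per phase; a real project would keep these in
        # separate files (e.g. dev.ini, prod.ini) outside the code.
        CONFIGS = {
            "dev":  "[database]\nhost = dev-dw-server\nuser = etl_dev\n",
            "prod": "[database]\nhost = prod-dw-server\nuser = etl_prod\n",
        }

        def load_config(env):
            cfg = configparser.ConfigParser()
            cfg.read_string(CONFIGS[env])
            return cfg

        cfg = load_config("dev")
        print(cfg["database"]["host"])  # dev-dw-server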




