Wednesday, May 27, 2015

B-Trees in SQL Server

We work with data in SQL Server every day, writing SELECT, INSERT, UPDATE, DELETE, and similar commands to fetch and manipulate it. However, if we understand how the SQL Server database engine actually stores this data in its underlying files, and what methods it uses to perform various operations on that data, we can store the data more efficiently.

Well, one may ask: doesn't the DBMS already know how to store the data efficiently? It does, in its own way, but it doesn't know our query requirements: which columns we use in SELECT statements, which fields we update frequently, and which conditions we place in the WHERE clause to fetch the required data.

We use indexes to give this information to the database, so that it is well prepared to organize the data around our workload, which in turn helps it perform DDL and DML operations efficiently. While there are many kinds of indexes, the two basic types are clustered indexes and non-clustered indexes. Once an index of either kind is built, SQL Server internally maintains a balanced tree structure known as a B-Tree (the "B" stands for balanced rather than binary, since each node can have many children). SQL Server uses this tree to locate any row requested by a command.
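
As a rough illustration (not from the original post), here is a minimal T-SQL sketch that creates both kinds of index; the dbo.Orders table and the index names are made up for the example:

-- Hypothetical table, used only for illustration
CREATE TABLE dbo.Orders
(
    OrderID   INT           NOT NULL,
    OrderDate DATETIME      NOT NULL,
    Amount    DECIMAL(10,2) NOT NULL
);

-- Clustered index: the leaf level of its B-Tree is the data rows themselves
CREATE CLUSTERED INDEX CIX_Orders_OrderID
    ON dbo.Orders (OrderID);

-- Non-clustered index: its leaf level stores the clustering key (or a row
-- locator on a heap) that points back to the actual row
CREATE NONCLUSTERED INDEX IX_Orders_OrderDate
    ON dbo.Orders (OrderDate);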

A B-Tree can be visualized as an inverted tree, with its root at the top and its leaves at the bottom. The picture below shows what a typical B-Tree structure looks like.



The tree shown above models the index created on a column containing the values 1 to 200. If we query a row where that column equals 111, SQL Server first checks at the root level which range the value falls in. It then follows the path through the intermediate levels: into the node covering 101-200, then 101-150, and finally narrows down to 101-125, which leads to the required row. If the index is a clustered index, the leaf-level nodes are the data rows themselves, so the engine has the data as soon as it reaches the leaf. If the index is non-clustered, however, the leaf nodes hold a pointer to the actual data: either a row locator into the underlying heap or the corresponding clustered index key (if the table has one). SQL Server then follows that pointer to where the row with the value 111 resides and fetches the data from there.
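
To make the traversal concrete, assuming the hypothetical dbo.Orders table sketched earlier holds OrderID values 1 to 200, a point query like the one below lets SQL Server seek through the B-Tree instead of scanning the whole table:

-- The equality predicate on the indexed column allows a B-Tree seek:
-- root -> intermediate levels -> leaf, rather than a full scan
SELECT OrderID, OrderDate, Amount
FROM   dbo.Orders
WHERE  OrderID = 111;

-- The graphical execution plan (or SET STATISTICS PROFILE ON) will show a
-- Clustered Index Seek here, confirming the root-to-leaf navigation.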

Hence it is very important to choose the columns included in clustered and non-clustered indexes based on the frequency of INSERT/UPDATE operations versus the SELECT queries that run against those columns.
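
One way to gather that read-versus-write information is the index usage statistics DMV; the query below is a common pattern (the DMV and its columns are standard, though it requires VIEW SERVER STATE permission and its counters reset when the instance restarts):

-- How often each index is read (seeks/scans) vs. maintained (updates)
SELECT  OBJECT_NAME(s.object_id) AS table_name,
        i.name                   AS index_name,
        s.user_seeks,
        s.user_scans,
        s.user_updates           -- the write cost the index adds
FROM    sys.dm_db_index_usage_stats AS s
JOIN    sys.indexes AS i
        ON  i.object_id = s.object_id
        AND i.index_id  = s.index_id
WHERE   s.database_id = DB_ID();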

Hope this gives a basic idea of the B-Tree structure in SQL Server. I will publish more on the various kinds of indexes and their impact on data retrieval soon.

Thank you,
Raja

Monday, February 4, 2013

Some Quick Tips

Katie and Emil Research Output - 


SSIS: 6 Million Rows in 30 Seconds Using a Standard Laptop


Settings used while loading data into a heap

1) Enterprise Database Engine is faster than BI Database Engine on my laptop (I really don't know why).
Time: 52.8s 
Rows Per Second: 113k

2) Commit Size = 0
Time: 40.6s 
Rows Per Second: 147k

3) Packet size = 32767
Time: 38.2s 
Rows Per Second: 155k

4) Compression = Row (see the T-SQL sketch at the end of this post)
Time: 34.2s 
Rows Per Second: 175k

5) Balanced Data Distributor = 2 outputs (more than 2 is slower)
Time: 29.3s 
Rows Per Second: 204k

Some of you might be thinking it is fast only because it was not the first run and the data was cached. Well, after a complete machine restart, the best configuration still took 32.1 seconds.
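
For setting 4 above, row compression is applied to the destination table before the load; a minimal sketch, with dbo.StagingFacts as a made-up destination heap (note that at the time of writing, data compression required the Enterprise edition):

-- Enable row compression on the destination heap before loading
ALTER TABLE dbo.StagingFacts
    REBUILD WITH (DATA_COMPRESSION = ROW);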

Tuesday, January 29, 2013

ETL Best Practices

One of the critical phases in any ETL project is the design phase. An ETL architect should understand the best practices before finishing the initial design. Implementing a project in a traditional Waterfall model usually requires each phase (requirements gathering, high-level design, detailed design, coding, and so on) to be completed before proceeding to the next, and a frozen design takes a lot of managerial effort to alter if a design enhancement is suggested at a later stage of the project. Projects implementing the Scrum Agile methodology have an advantage over these kinds of issues.

Either way, whether the project is traditional or Scrum Agile, building an optimal, scalable DW model is very important for the success of any ETL project. With this in view, it is good to practice certain proven methodologies on these projects. Given below are some ETL best practices one can consider during the start-up of any ETL project. I invite readers to suggest more best practices gained over their years of experience.

ETL Best Practices

1. Better Analysis of the Source Systems
    
    During the extraction phase of an ETL project, one usually connects to the client's legacy RDBMS systems and extracts the required data. Spending a considerable amount of time analyzing these source systems (their data quality, the availability of auditing columns in the source tables, the permissions required to extract the data, the up and down times of the source servers and individual tables, and so on) helps avoid many extraction errors in the first part of the data flow itself. Guidance from the source system experts makes this process much easier.

2. Analysis of the Source Data

    The usefulness of the final BI reports depends on having clean data in the system. One should therefore avoid pulling unnecessary, redundant data from the source systems by applying sufficient filtering constraints in the source queries themselves, which minimizes the cleansing work later in the BI platform.
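
    As a hedged example of filtering at the source, the extract query below pulls only rows changed since the last successful load; the table, the columns, and the control value are hypothetical:

-- Normally read from an ETL control/watermark table
DECLARE @LastLoadDate DATETIME = '20130101';

SELECT  CustomerID, CustomerName, ModifiedDate
FROM    dbo.SourceCustomers                -- hypothetical source table
WHERE   ModifiedDate > @LastLoadDate       -- auditing-column filter
  AND   IsDeleted = 0;                     -- skip logically deleted rows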

3. Staging the Data

    Though it may appear that data can be extracted, cleansed, and transformed on the fly in one go, in practice it is advisable to stage the data after extracting it from the source systems before processing it further. This is usually because different data sources become available at different times, and also for reconciliation and auditing purposes. The transformation procedures may need data from several kinds of sources, including flat files and spreadsheets, so processing can start once all the data is staged. It is also good practice to have a pre-staging area, where the data is a replica of the source systems, and a staging area, where the data is loaded after all the required transformations are performed. This additional layer gives us assurance of the correctness of the transformations and an extra layer of auditing before the data is put into the enterprise data warehouse or data mart.
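
    A minimal sketch of the two layers, assuming prestage and stage schemas already exist and using made-up customer tables:

-- Pre-staging: a 1:1 replica of the source extract, no cleansing yet
CREATE TABLE prestage.Customers
(
    CustomerID   INT           NOT NULL,
    CustomerName NVARCHAR(100) NULL,       -- nullable: raw, uncleansed data
    ModifiedDate DATETIME      NULL
);

-- Staging: cleansed and conformed, loaded from pre-staging
CREATE TABLE stage.Customers
(
    CustomerID   INT           NOT NULL,
    CustomerName NVARCHAR(100) NOT NULL,   -- constraints enforced after cleansing
    ModifiedDate DATETIME      NOT NULL,
    LoadBatchID  INT           NOT NULL    -- reconciliation/auditing handle
);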

4. A Scalable DataWarehouse Design

    One should design the data warehouse keeping in view the future business plans of the organization, and choose a data modelling technique that suits the user requirements, balancing fast query response against normalized data.
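
    For instance, a dimensional (star schema) model denormalizes attributes into the dimension to favour query speed; a tiny illustrative sketch with made-up tables:

-- Denormalized dimension: Category is stored inline for fast querying
CREATE TABLE dbo.DimProduct
(
    ProductKey  INT IDENTITY(1,1) PRIMARY KEY,
    ProductName NVARCHAR(100) NOT NULL,
    Category    NVARCHAR(50)  NOT NULL
);

-- Fact table referencing the dimension by its surrogate key
CREATE TABLE dbo.FactSales
(
    ProductKey INT           NOT NULL REFERENCES dbo.DimProduct (ProductKey),
    OrderDate  DATE          NOT NULL,
    Amount     DECIMAL(10,2) NOT NULL
);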

5. Good Audit Logging mechanism

    A must-have for any ETL project is a sufficient logging mechanism that records errors, processing times, the number of records processed, and so on. Either custom logging or the built-in logging features of the ETL tool itself can be used for this purpose.
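
    A simple custom-logging sketch (the table, schema, and package names are made up):

-- Hypothetical custom ETL audit log
CREATE TABLE etl.AuditLog
(
    AuditID       INT IDENTITY(1,1) PRIMARY KEY,
    PackageName   NVARCHAR(200)  NOT NULL,
    StartTime     DATETIME       NOT NULL,
    EndTime       DATETIME       NULL,
    RowsProcessed INT            NULL,
    ErrorMessage  NVARCHAR(4000) NULL
);

-- Record the start of a run...
DECLARE @AuditID INT;
INSERT INTO etl.AuditLog (PackageName, StartTime)
VALUES (N'LoadCustomers', GETDATE());
SET @AuditID = SCOPE_IDENTITY();

-- ...and close it out when the load finishes
UPDATE etl.AuditLog
SET    EndTime = GETDATE(),
       RowsProcessed = 6000000             -- hypothetical row count
WHERE  AuditID = @AuditID;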

6. Optimal Data Loading and Processing times

   One should also use appropriate indexes on the tables, make use of the parallel execution support of the ETL tool, and minimize page-outs and page-ins by using the available memory through buffers, all of which help reduce the load window of the ETL system. One can also use the parallel processing or Change Data Capture (CDC) functionality available in the ETL tools and database servers. In an enterprise DW scenario, it is not feasible to refresh all the data every day, so it is a good approach to capture only the changed records from the source systems and update or insert the records that were modified or newly created.
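
   As a sketch of that update-or-insert step, a MERGE statement can upsert only the modified or new records; the dimension and staging tables here are hypothetical:

-- Upsert changed/new rows from staging into the target dimension
MERGE dbo.DimCustomer AS tgt
USING stage.Customers AS src
    ON tgt.CustomerID = src.CustomerID
WHEN MATCHED AND src.ModifiedDate > tgt.ModifiedDate THEN
    UPDATE SET tgt.CustomerName = src.CustomerName,
               tgt.ModifiedDate = src.ModifiedDate
WHEN NOT MATCHED BY TARGET THEN
    INSERT (CustomerID, CustomerName, ModifiedDate)
    VALUES (src.CustomerID, src.CustomerName, src.ModifiedDate);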

7. Security

    Ensure that the required personnel have sufficient access privileges on the various servers to run the ETL or update configuration values that affect its daily outcomes. Sufficient configuration files should also be in place to ease moving the project through the different environments: development, QA, and production.





Saturday, January 26, 2013

New blog for MSBI Tips and Tricks

In our day-to-day work on the MSBI suite of applications (SSIS, SSAS, SSRS, PPS, and so on), we come across various learnings and best practices. This blog is intended to share tips, code samples, best practices, article links, and anything else related to MSBI and SQL Server.