Monday, July 16, 2012

ETL and E-LT? Please Help!!

If you are going to face an interview for a job related to Data Integration, get ready for this question. People are going to ask it even if they themselves don't know the answer. :D
Data integration primarily involves pulling data out of a source, doing some clean-up activities on it, and then putting it somewhere else.
In DI jargon those three steps are Extraction, Transformation and Loading. So that's ETL for you. You can read a lot about this on Wikipedia.
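
To make those three steps concrete, here is a minimal sketch in Python. Everything in it is invented for illustration: the table names (CUSTOMER, DW_CUSTOMER) are hypothetical, and SQLite stands in for the real source and target databases. The point to notice is that the transformation runs in the tool's own process, between the two databases.

    import sqlite3

    # Hypothetical source and target databases (SQLite as a stand-in).
    source = sqlite3.connect(":memory:")
    target = sqlite3.connect(":memory:")
    source.execute("CREATE TABLE CUSTOMER (cust_id INTEGER, cust_name TEXT)")
    source.executemany("INSERT INTO CUSTOMER VALUES (?, ?)",
                       [(1, "  smith"), (2, "jones  ")])
    target.execute("CREATE TABLE DW_CUSTOMER (cust_id INTEGER, cust_name TEXT)")

    # 1. Extraction: pull the raw rows out of the source system.
    rows = source.execute("SELECT cust_id, cust_name FROM CUSTOMER").fetchall()

    # 2. Transformation: clean every row up before it goes anywhere near the target.
    #    Here the tool's own process does the work, like a mini staging engine.
    cleaned = [(cust_id, name.strip().upper()) for cust_id, name in rows]

    # 3. Loading: put the transformed rows into the target system.
    target.executemany("INSERT INTO DW_CUSTOMER VALUES (?, ?)", cleaned)
    target.commit()
    print(target.execute("SELECT * FROM DW_CUSTOMER").fetchall())
    # [(1, 'SMITH'), (2, 'JONES')]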

The question to ask is: where do you do all the clean-up activities, or in other words, where does this 'Transformation' take place?
Once you have pulled the data from the source, there has to be a temporary area to stage the information, right?
As an example, you can't change the format of a date column in your source directly, because then you won't be left with the original data. It always makes sense to do all the transformations on a 'copy' of the source and fire all the SQL queries on that 'copy' before putting it in the target.
Now, where this 'copy' or temporary 'staging' area is created defines the approach: ETL or E-LT.
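
Here is a toy version of that date-column example, as a hedged sketch: SQLite stands in for whatever schema hosts the staging copy, and the SRC_ORDERS / STG_ORDERS / TGT_ORDERS names are made up. The transformation fires on the copy; the original source column is never touched.

    import sqlite3

    db = sqlite3.connect(":memory:")  # stand-in for a schema hosting the staging copy
    db.execute("CREATE TABLE SRC_ORDERS (order_id INTEGER, order_date TEXT)")
    db.execute("INSERT INTO SRC_ORDERS VALUES (42, '16/07/2012')")  # DD/MM/YYYY

    # Stage a 'copy' of the source; SRC_ORDERS keeps the original data intact.
    db.execute("CREATE TABLE STG_ORDERS AS SELECT * FROM SRC_ORDERS")

    # Fire the transformation on the copy: reshuffle DD/MM/YYYY into YYYY-MM-DD.
    db.execute("""
        UPDATE STG_ORDERS
        SET order_date = substr(order_date, 7, 4) || '-' ||
                         substr(order_date, 4, 2) || '-' ||
                         substr(order_date, 1, 2)
    """)

    # Only the transformed copy ever reaches the target.
    db.execute("CREATE TABLE TGT_ORDERS AS SELECT * FROM STG_ORDERS")
    print(db.execute("SELECT * FROM TGT_ORDERS").fetchall())  # [(42, '2012-07-16')]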

The Neanderthal approach to this staging area, taken by most DI tools, was to ship their own proprietary staging area with huge storage capacity and high processing speed that could handle millions of rows of data efficiently. But the catch was, you had to learn the syntax (query) for pulling data from the source into the staging area, learn the syntax for transforming the data in the staging area, and then learn the syntax for putting the transformed data into your target. And with so many database technologies emerging, each with minor tweaks to the syntax, it became difficult to learn everything together. Besides that, these proprietary staging areas were costly and didn't quite live up to expectations as far as bulk-loading capabilities were concerned.

So the geeks at Sunopsis thought, 'Let's get rid of this staging area and use something else that is cost-effective and efficient.' Then they thought, 'What if we use the source database/schema or the target database/schema itself for the transformation?' For example, if I have a table called EMPLOYEE, I can create a temporary table called T$EMPLOYEE in the same schema, do the transformation on it, and load the result into the target database; that way I save millions of dollars by not buying a proprietary staging area.
So, using this idea, they created a new data integrator called Sunopsis Data Integrator and marketed it with the word 'E-LT', claiming that it didn't use a proprietary staging schema (I will come to that later). Sunopsis DI allowed users to choose the staging area: the source db or the target db. Another major factor that separated this tool from the others was its declarative approach using a graphical UI (more on that later). This was a major hit, and Oracle saw it as an opportunity for growth. The result: Sunopsis was bought by Oracle, and soon it became Oracle Data Integrator.
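
To make the T$EMPLOYEE idea concrete, here is a rough sketch of the pattern, not ODI's actual generated code: SQLite stands in for the target database, and the table and column names are invented. Contrast this with the earlier sketch: the rows are landed untransformed, and the target engine itself does the transformation, set-based, with one SQL statement.

    import sqlite3

    tgt = sqlite3.connect(":memory:")  # stand-in for the *target* database
    tgt.execute("CREATE TABLE DW_EMPLOYEE (emp_id INTEGER, emp_name TEXT)")

    # E-L: extract from the source and land the rows, untransformed,
    # into a T$ work table that lives in the target schema itself.
    tgt.execute('CREATE TABLE "T$EMPLOYEE" (emp_id INTEGER, emp_name TEXT)')
    source_rows = [(1, "  smith"), (2, " jones ")]  # pretend these came from the source
    tgt.executemany('INSERT INTO "T$EMPLOYEE" VALUES (?, ?)', source_rows)

    # -T: the target engine does the transformation itself, set-based,
    # instead of a proprietary staging server chewing through rows.
    tgt.execute("""
        INSERT INTO DW_EMPLOYEE
        SELECT emp_id, upper(trim(emp_name)) FROM "T$EMPLOYEE"
    """)
    tgt.execute('DROP TABLE "T$EMPLOYEE"')  # the work table is disposable
    print(tgt.execute("SELECT * FROM DW_EMPLOYEE").fetchall())
    # [(1, 'SMITH'), (2, 'JONES')]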

Can the Staging Area be completely eliminated?
The answer to this is no. Though Oracle/Sunopsis Data Integrator claims to use the E-LT approach, there are some bitter times when you have to fall back on a separate staging area: when neither your source nor your target has the ability to perform SQL activities. Example: if you want to read data from a flat (text) file and put it into another text file, you will have to use a staging schema that can do SQL. The Sunopsis Memory Engine, or the newer In-Memory Engine, is their own staging technology and can be used in such cases, along with other technologies. This is basically your JVM in action, and it is known that this staging area cannot handle large volumes of data. Personally, this is the last staging-area technology I would use for my integration projects.
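
The Sunopsis Memory Engine itself is a Java-based engine, so the following is only a loose illustration of the pattern, not of that engine: SQLite's in-memory database stands in as the staging area that can do SQL, the file contents and the doubling transformation are invented, and io.StringIO stands in for real files so the sketch runs on its own.

    import csv, io, sqlite3

    # Pretend input flat file (in real life this would be open('in.csv')).
    src_file = io.StringIO("id,amount\n1,100\n2,250\n")

    mem = sqlite3.connect(":memory:")  # in-memory SQL engine as the staging area
    mem.execute("CREATE TABLE STG (id INTEGER, amount INTEGER)")
    reader = csv.DictReader(src_file)
    mem.executemany("INSERT INTO STG VALUES (?, ?)",
                    ((row["id"], row["amount"]) for row in reader))

    # Neither text file can run SQL, but the staging engine can.
    transformed = mem.execute("SELECT id, amount * 2 FROM STG ORDER BY id").fetchall()

    out_file = io.StringIO()  # stands in for open('out.csv', 'w')
    writer = csv.writer(out_file)
    writer.writerow(["id", "doubled_amount"])
    writer.writerows(transformed)
    print(out_file.getvalue())

    # Everything above lives in process memory, which is exactly why this
    # kind of staging struggles once the volumes get large.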
