What is ETL?
ETL stands for Extract, Transform, and Load, a process used to collect data from various input sources, transform the data according to business rules and needs, and load it into a destination data store. The need for this process comes from the fact that in modern computing, business data lives in many distributed locations and in multiple formats. For example, organizations save data in formats such as Word documents, PDFs, spreadsheets (XLS), and plain text, or keep it in commercial database servers like MS SQL Server, Oracle, and MySQL. Managing this business information efficiently is a great challenge, and ETL plays an important role in solving this problem.
The ETL process has three main steps: Extract, Transform, and Load.
Extract – The first step in the ETL process is extracting the data from the various sources. The data in each source can be in any format, such as flat files or database tables.
Transform – Once the data has been extracted, various filters, validations, aggregate functions, or other business logic can be applied to it to produce output in the desired format.
Load – This is the final step, where the ‘transformed’ data is loaded into the target destination, which may again be a flat file or a table in a predefined RDBMS.
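The three steps above can be sketched in a few lines of code. This is a minimal, hypothetical example (the field names, sample data, and in-memory SQLite target are illustrative, not tied to any particular ETL tool): extract rows from a CSV source, transform them with simple business rules, and load them into a database table.

```python
import csv
import io
import sqlite3

# Hypothetical raw input: employee records as CSV text, standing in for
# flat files, spreadsheets, or extracts from a source database.
RAW_CSV = """name,salary
alice,25000
bob,18000
carol,30000
"""

def extract(source: str) -> list:
    """Extract: read rows from the source into plain dictionaries."""
    return list(csv.DictReader(io.StringIO(source)))

def transform(rows: list) -> list:
    """Transform: apply business rules -- normalize names, cast salaries."""
    return [
        {"name": row["name"].title(), "salary": int(row["salary"])}
        for row in rows
    ]

def load(rows: list, conn: sqlite3.Connection) -> None:
    """Load: write the transformed rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS employees (name TEXT, salary INTEGER)"
    )
    conn.executemany(
        "INSERT INTO employees (name, salary) VALUES (:name, :salary)", rows
    )

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
print(conn.execute("SELECT name, salary FROM employees").fetchall())
# prints [('Alice', 25000), ('Bob', 18000), ('Carol', 30000)]
```

Real pipelines replace each function with connectors, validation rules, and a warehouse loader, but the Extract → Transform → Load shape stays the same.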
Why and Where is ETL Required?
Companies or organizations with years of history and/or a global presence inevitably go through technological changes at some point, moving from manual systems to in-house applications, and from data stores as simple as flat files to full RDBMSs. This can create subprocesses within the larger business process, each running completely different applications on its own hardware and architectural platform.
In such scenarios, the organization’s unit in location “X” might be using mainframes while another unit in location “Y” uses an SAP system to manage its operations data. In this type of setup, if the organization’s top management needs a consolidated report of all the company’s assets, gathering all the data and reports can be a challenge. Collecting the right data from disparate systems and then consolidating it manually is a cumbersome process that can take days to deliver a final report to management. A more efficient approach is a system that fetches data from these disparate sources, stores it in a data warehouse environment, and generates a report whenever needed.
So how do you fetch the data from these different systems, make it coherent, and load it into a data warehouse?
To do this, we need a methodology or a tool that can extract the data, cleanse it, and load it into a data warehouse application. To consolidate the historical information from all the disparate sources, we set up an ETL system, which transforms the data from the smaller operational databases into a more meaningful long-term store.
ETL is useful when:
- Companies need a way to analyze their data for critical business decisions.
- The transactional database cannot always answer complex business queries.
- You need to capture the flow of transactional data.
- There is a need to adjust data from multiple sources to be used together.
- Data needs to be structured for use by the various Business Intelligence (BI) tools.
- Subsequent business/analytical data processing needs to be enabled.
There are a variety of ETL tools available in the market. Some of the prominent ones are:
List of ETL Tools:
- Oracle Data Integrator
- SAP BusinessObjects Data Integrator (BODI)
- SAS Data Integration
- Pentaho Data Integration
- Pervasive Data Integrator (Actian / Pervasive Software)
Advantages of ETL Tools
- ETL tools normally provide better performance, even for large datasets.
- They have built-in connectors for all the major RDBMS systems.
- They make it easy to reuse complex logic for validations and similar tasks.
- They offer an intuitive, visual integrated development environment (IDE).
- They also offer performance-optimization options such as parallel processing and load balancing.
At Sofbang, I have worked with Talend Open Studio, an open-source project for managing the various facets of the ETL (Extract, Transform, Load) process for BI and data warehousing. It is one of the most innovative data integration solutions on the market today.
It’s open source, free to use, and community-supported. It encapsulates every operation that loads, retrieves, transforms, and shapes data, and provides easy-to-use drag-and-drop UI components to enable intuitive, faster job development, as shown below:
Fig: Talend IDE Screen
For example, let’s try this with an Excel sheet as raw input, to which some validations and filters need to be applied. Based on those rules, we should get our desired data in the output Excel file.
Step 1: The sample input Excel is shown below which contains some invalid names and other details of employees.
Step 2: Drag the respective components (in this case, for processing Excel files) from the component palette on the right-hand side, drop them onto the canvas, and draw the output connections as shown below:
Step 3: Now define the validations and filters to be applied to the input data by clicking on the ‘map’ component, as shown below. In this case, we define our filters and validations as:
- Names should be valid
- Date of birth should be later than ’01-JAN-2012’.
- All employees drawing a salary greater than 20000 should be filtered out and stored separately.
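In Talend these rules are configured visually in the tMap component, but the underlying logic is straightforward. Here is a hedged sketch of the same filters in plain code: the employee records, field names, and the name-validity rule (alphabetic characters only) are all illustrative assumptions, not Talend’s actual settings.

```python
from datetime import date

# Hypothetical employee rows mirroring the sample spreadsheet; the field
# names and values are illustrative only.
employees = [
    {"name": "Alice", "dob": date(2013, 5, 1), "salary": 25000},
    {"name": "123",   "dob": date(2014, 2, 9), "salary": 30000},
    {"name": "Bob",   "dob": date(2011, 7, 3), "salary": 15000},
]

CUTOFF = date(2012, 1, 1)  # the '01-JAN-2012' threshold from the rules

def name_is_valid(name: str) -> bool:
    # "Valid" here simply means alphabetic; real rules would be richer.
    return name.replace(" ", "").isalpha()

# Rule 1: keep only rows with valid names.
valid_names = [e for e in employees if name_is_valid(e["name"])]

# Rule 2: of those, keep rows with a date of birth after the cutoff.
dob_ok = [e for e in valid_names if e["dob"] > CUTOFF]

# Rule 3: route high earners (> 20000) to a separate output.
high_earners = [e for e in valid_names if e["salary"] > 20000]
```

The two result lists, `dob_ok` and `high_earners`, correspond to the two filtered output files produced by the Talend job.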
Step 4: Click on the ‘Run’ button to execute the job and get the results.
Step 5: Upon clicking the ‘Run’ button, we will see the following execution screen:
Step 6: The resulting ‘filtered’ and ‘validated’ Excel is shown below:
Fig: Excel with Valid names and Salary > 20000
Fig: Excel with Valid names and DOB > ’01-JAN-2012’