Sofbang Tech Team Tips Series: Data Extraction ETL – An effective method to manage data

What is ETL?

ETL stands for Extract, Transform and Load, a process used to collect data from various input sources, transform the data according to business rules/needs, and load it into a destination system. The need for this process comes from the fact that in modern computing, business data lives in many distributed locations and in multiple formats. For example, organizations save data in formats such as Word documents, PDFs, XLS spreadsheets, plain text, etc., or keep it in commercial database servers such as MS SQL Server, Oracle, and MySQL. Managing this business information efficiently is a great challenge, and ETL plays an important role in solving this problem.

The ETL process has three main steps: Extract, Transform, and Load.


Extract – The first step in the ETL process is extracting the data from various sources. The data in each source can be in any format, such as flat files or database tables.

Transform – Once the data has been extracted, various filters, validations, aggregate functions, or other business logic can be applied to it to produce output in the desired format.

Load – This is the final step, where the ‘transformed’ data is loaded into the target destination, which may again be a flat file or a table in a predefined RDBMS.
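
To make the three steps concrete, here is a minimal sketch of the process in plain Java. It is purely illustrative: the file names, the "name,salary" column layout, and the salary threshold are assumptions, not the behavior of any particular ETL tool.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.List;
    import java.util.stream.Collectors;

    public class MiniEtl {
        public static void main(String[] args) throws IOException {
            // Extract: read every row from the flat-file source.
            List<String> rows = Files.readAllLines(Paths.get("employees.csv"));

            // Transform: drop the header, keep rows whose salary exceeds
            // 20000, and normalize the name column to upper case.
            // (The sketch assumes well-formed "name,salary" rows.)
            List<String> transformed = rows.stream()
                    .skip(1)
                    .map(line -> line.split(","))
                    .filter(cols -> cols.length == 2)
                    .filter(cols -> Double.parseDouble(cols[1].trim()) > 20000)
                    .map(cols -> cols[0].trim().toUpperCase() + "," + cols[1].trim())
                    .collect(Collectors.toList());

            // Load: write the cleansed rows to the target flat file.
            Files.write(Paths.get("high_earners.csv"), transformed);
        }
    }

Real ETL tools wrap exactly this extract-filter-load pattern in reusable, configurable components.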


Why and Where is ETL Required

Companies or organizations with years of history and/or a global presence inevitably go through technological changes at some point, moving from manual systems to in-house applications, and from data storage in flat files to RDBMSs. This can create subprocesses within the larger business process, each running completely different applications on its own hardware and architectural platform.

In such scenarios, the organization’s unit in location “X” might be using mainframes while another unit at location “Y” uses an SAP system to manage operations data. In this type of setup, if an organization’s top management needs a consolidated report of all the assets of the company, it can be a challenge to gather all the data and reports. Collecting the right data from disparate systems and then consolidating it manually is a cumbersome process that can take days to deliver a final report to management. A more efficient way is to have a system that fetches data from these disparate sources, stores it in a data warehouse environment, and generates a report whenever needed.

So how do you fetch the data from these different systems, make it coherent, and load it into a data warehouse?

To do this, we need a methodology or a tool that can extract the data, cleanse it, and load it into a data warehouse application. To consolidate the historical information from all the disparate sources, we set up an ETL system, which transforms the data from the smaller operational databases into more meaningful long-term databases.

ETL is useful when:

  • Companies need a way to analyze their data for critical business decisions.
  • The transactional database cannot always answer complex business queries.
  • You need to capture the flow of transactional data.
  • Data from multiple sources needs to be adjusted so that it can be used together.
  • Data needs to be structured for use by the various Business Intelligence (BI) tools.
  • Subsequent business/analytical data processing needs to be enabled.


Fig: ETL graphic


There are a variety of ETL tools available in the market. Some of the prominent ones are:

No.  ETL Tool                     Vendor
1    Informatica PowerCenter      Informatica Corporation
2    InfoSphere DataStage         IBM
3    Oracle Data Integrator       Oracle
4    Decision Stream              IBM Cognos
5    Data Integrator (BODI)       SAP BusinessObjects
6    SAS Data Integration         SAS
7    Talend Open Studio           Talend
8    Pentaho Data Integration     Pentaho
9    Pervasive Data Integrator    Actian / Pervasive Software
10   CloverETL                    Javlin


Advantages of ETL tools

  • ETL tools normally provide better performance, even for large datasets.
  • They have built-in connectors for all the major RDBMS systems.
  • They make it easy to reuse complex programs for validations, etc.
  • They offer an intuitive visual integrated development environment.
  • They also offer performance optimization options such as parallel processing and load balancing.

At Sofbang, I have worked with Talend Open Studio, an open-source project for managing the various facets of the ETL process for BI and data warehousing. It is one of the most innovative data integration solutions on the market today.

It’s open source, free to use, and community-supported. It encapsulates every operation that loads, retrieves, transforms, and shapes data, and provides easy-to-use drag-and-drop components that make job development intuitive and fast, as shown below:

Fig: Talend IDE Screen

For example, let’s try this with an Excel sheet as raw input, to which some validations and filters need to be applied. Based on those rules, we should get our desired data in the output Excel file.

Step 1: The sample input Excel file is shown below; it contains employee details, including some invalid names.

Step 2: Drag the relevant components (in this case, those for processing Excel) from the component palette on the right-hand side, drop them onto the canvas, and draw the output connections as shown below:

Step 3: Now define the validations and filters to be applied to the input data by clicking on the ‘map’ component, as shown below. In this case, we define our filters and validations as follows (a plain-Java sketch of the equivalent logic appears after the list):

  • Names should be valid.
  • Date of birth should be greater than ‘01-JAN-2012’.
  • All employees drawing a salary greater than 20000 should be filtered out and stored separately.
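
The tMap component expresses these rules graphically, but conceptually they reduce to ordinary predicates. Here is a hedged plain-Java sketch of the equivalent row logic; the field names and the exact name-validity rule are assumptions, and Talend generates its own Java from the tMap configuration rather than using code like this:

    import java.time.LocalDate;

    public class EmployeeFilter {
        // Cut-off date matching the '01-JAN-2012' rule configured in the tMap.
        private static final LocalDate CUTOFF = LocalDate.of(2012, 1, 1);

        // "Valid name" here means non-empty, letters and spaces only;
        // the real rule is whatever expression the tMap defines.
        static boolean isValidName(String name) {
            return name != null && !name.trim().isEmpty()
                    && name.matches("[A-Za-z ]+");
        }

        static boolean passesDobFilter(LocalDate dateOfBirth) {
            return dateOfBirth.isAfter(CUTOFF);
        }

        // Rows passing this predicate are routed to the separate output.
        static boolean passesSalaryFilter(double salary) {
            return salary > 20000;
        }

        public static void main(String[] args) {
            // A row with a valid name and a 25000 salary would be kept.
            System.out.println(isValidName("John Smith")
                    && passesSalaryFilter(25000)); // prints: true
        }
    }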

Step 4: Click on the ‘Run’ button to execute the job and get the results.

Step 5: On clicking the ‘Run’ button, we get the following screen:

Step 6: The resulting ‘filtered’ and ‘validated’ Excel is shown below:

Fig: Excel with Valid names and Salary > 20000


Fig: Excel with Valid names and DOB > ’01-JAN-2012’

Sofbang Tech Team Tips Series: How Oracle POS Communicates with a Centralized Balance System

What is POS?

A point-of-sale (POS) system is a computerized replacement for a cash register. Much more complex than the cash registers of even just a few years ago, the POS system can include the ability to record and track customer orders, process credit and debit cards, connect to other systems in a network, and manage inventory.

Oracle Retail POS Suite

An important part of the in-store experience, for both the customer and the store associate, is the delivery of the retailer’s brand. Designed with the highest degree of flexibility on the market, Oracle Retail Point-of-Service provides retailers with a user interface that can be easily modified to reflect an individual brand’s look and feel. Oracle Retail Point-of-Service can be configured to present a retailer’s branding, from the colour schemes, icons, and text of the global and local navigation buttons, to the orientation of the prompt, response, and status regions, and the use of images, logos, and colour schemes in the work area, through the support of technologies such as touch screens. This same flexibility allows the overall solution to support multiple brands.

Today I am going to share my experience of how I integrated Oracle POS with a client’s outdated Legacy Balance System.

Legacy Balance System

It was the centralized system that calculated data about the items sold per brand/store in a day. It was a file-based system using the XML format (Extensible Markup Language), and it maintained a physical directory for each store, keyed by a dedicated Store ID, to consume the incoming/transformed files from the middleware side.

Three Oracle products were leveraged in the solution. I have listed below what was used to integrate Oracle POS with the Legacy Balance System:

  • Oracle SOA Suite
  • Oracle Service Bus
  • Oracle Data Integrator

Let’s begin with the roles of the Oracle products that were used to deliver the solution, and then take a deeper dive into the technical architecture.

Oracle Service Bus

This product was used primarily for the pickup and delivery of files to third-party/remote locations over FTP (File Transfer Protocol). It was also used for data transformation and the merging of small files using XQuery.

Oracle SOA Suite

The reason for using this Oracle product was to deal with complex transformation rules, which can be implemented in XSLT (Extensible Stylesheet Language Transformations). XSLT is a language for transforming XML documents into other XML documents or other formats, such as plain text.
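
As a rough illustration, here is a minimal sketch of applying an XSLT stylesheet with the standard javax.xml.transform API. The stylesheet and file names are placeholders; in the actual solution, the transformation ran inside Oracle SOA Suite rather than in hand-written Java:

    import java.io.File;
    import javax.xml.transform.Transformer;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.stream.StreamResult;
    import javax.xml.transform.stream.StreamSource;

    public class XsltDemo {
        public static void main(String[] args) throws Exception {
            // Compile the stylesheet once (file names are placeholders).
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new File("rtlog-to-balance.xsl")));

            // Apply it: source XML in, transformed document out.
            transformer.transform(
                    new StreamSource(new File("rtlog.xml")),
                    new StreamResult(new File("balance.xml")));
        }
    }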

Oracle Data Integrator

This tool covers the full range of data integration requirements: high-volume data, high-performance batch loads, event-driven integration, and communication with SOA services. The reason for using this tool was to handle the transformation of high-volume data.


POS Integration Architecture

Fig: POS integration architecture

Business Use Case

Stores open and sell their goods throughout the day using the Oracle POS system on each register. At the end of the day, the stores close and perform the end-of-day activity by closing each register. The data about the sold items is collected and wrapped up in the form of an RTLog (Retail Transaction Log) and pushed to a JMS Topic configured on the middleware’s WebLogic Server.


Fig: End-of-day RTLog pushed from Oracle POS to the JMS Topic

The Oracle SOA service consumes the message from the JMS Topic, extracts the jar file, and writes the data to a file on the local box (the WebLogic system) according to the Store ID sent in the JMS message’s header. Once the RTLog file has been written by the SOA service, an ODI service is called to transform the RTLog’s data and prepare the file in the format required by the target system (the Legacy Balance System).
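
In the actual solution, the SOA composite’s JMS adapter performed this step declaratively. Purely as a plain-Java illustration of the same consume-and-write logic, here is a sketch in which the JNDI names, the header property name, and the output path are all assumptions:

    import javax.jms.Connection;
    import javax.jms.ConnectionFactory;
    import javax.jms.MessageConsumer;
    import javax.jms.Session;
    import javax.jms.TextMessage;
    import javax.jms.Topic;
    import javax.naming.InitialContext;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class RtLogConsumer {
        public static void main(String[] args) throws Exception {
            // Look up the factory and topic on WebLogic via JNDI; the JNDI
            // names below are assumptions, not the real configuration.
            InitialContext ctx = new InitialContext();
            ConnectionFactory cf =
                    (ConnectionFactory) ctx.lookup("jms/RtLogConnectionFactory");
            Topic topic = (Topic) ctx.lookup("jms/RtLogTopic");

            Connection conn = cf.createConnection();
            try {
                Session session = conn.createSession(false, Session.AUTO_ACKNOWLEDGE);
                MessageConsumer consumer = session.createConsumer(topic);
                conn.start();

                // Block until an end-of-day RTLog message arrives. The real
                // payload is an archived RTLog; a text body keeps this short.
                TextMessage msg = (TextMessage) consumer.receive();

                // The Store ID travels in the message header, so the file
                // can be written under that store's directory.
                String storeId = msg.getStringProperty("StoreId");
                Files.createDirectories(Paths.get("/rtlog/" + storeId));
                Files.write(Paths.get("/rtlog/" + storeId + "/rtlog.dat"),
                        msg.getText().getBytes());
            } finally {
                conn.close();
            }
        }
    }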

Fig: SOA service consuming the RTLog and invoking the ODI transformation

Once the transformation is done in ODI, the ODI service sends a message to the OSB service with the required parameters, saying that the transformation is complete and the file is ready to move to the client’s box (the FTP location). In OSB, we receive parameters carrying the name and location of the written file. OSB then performs the FTP operation and moves the file to the Legacy System’s directory for that Store ID.
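
OSB performs this delivery through a declaratively configured FTP business service. As a plain-Java illustration of the underlying operation only, here is a sketch using the Apache Commons Net FTPClient; the host, credentials, and paths are placeholders:

    import java.io.FileInputStream;
    import java.io.InputStream;
    import org.apache.commons.net.ftp.FTP;
    import org.apache.commons.net.ftp.FTPClient;

    public class BalanceFileFtp {
        public static void main(String[] args) throws Exception {
            // Host, credentials, and paths are illustrative placeholders.
            String storeId = "1001";
            FTPClient ftp = new FTPClient();
            ftp.connect("legacy-balance-host");
            ftp.login("ftpuser", "ftppassword");
            ftp.setFileType(FTP.BINARY_FILE_TYPE);
            try (InputStream in =
                         new FileInputStream("/out/" + storeId + "/balance.xml")) {
                // Deliver the transformed file into the store's directory
                // on the Legacy Balance System, keyed by Store ID.
                ftp.changeWorkingDirectory("/stores/" + storeId);
                ftp.storeFile("balance.xml", in);
            } finally {
                ftp.logout();
                ftp.disconnect();
            }
        }
    }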

Fig: OSB delivering the transformed file to the Legacy Balance System via FTP

The integration took place for a popular clothing brand with more than 1800 stores in the United States and Canada.

This was just one cycle of the integration that was performed from Oracle POS to the Legacy Balance System. There are several others that I look forward to discussing in future blog posts.

Sofbang Tech Team Tips Series: Configuring Database Cloud (DbaaS) with Java Cloud for Application Deployment

In this post of the Tech Team Tips Series, I am going to show how to migrate an existing ADF web application, deployed on an on-premise WebLogic 12c server, to Oracle Java Cloud Service.

I will use an existing WebLogic 12c ADF sample web application for the migration. Before deploying the application, it is essential to complete two configuration steps:

  • Create a schema in Database Cloud (DbaaS)
  • Configure a data source in JCS

I encountered a few slips while configuring these prerequisites the first time. The following is what I encountered and how you can avoid and/or troubleshoot these slips:

1. Schema Creation through SQL Developer:

a. The “ORA-65096” error:

SQL Error: ORA-65096: invalid common user or role name

The Cause: An attempt was made to create a common user or role with a name that was not valid for common users or roles. In addition to the usual rules for user and role names, common user and role names must start with C## or c## and consist only of ASCII characters.

b. The “ORA-65048” error:

 SQL Error: ORA-65048: error encountered when processing the current DDL statement in pluggable database PDB1
 ORA-01955: DEFAULT ROLE ‘CONNECT’ not granted to user
 65048. 00000 - “error encountered when processing the current DDL statement in pluggable database %s”

The Cause: An error was encountered when executing a statement in one of the pluggable databases.

2. JNDI Creation Error from the JCS WebLogic console:

a. Connection test failed. Error ORA-01045:

Error ORA-01045: User C##TEST lacks CREATE SESSION privilege; logon denied
 oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:466)
 oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:399)
 oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:391)

b. Connection test failed. Error ORA-01017:

Error ORA-01017: invalid username/password; logon denied
 oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:466)
 oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:399)
 oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:391)

Don’t worry about these errors; just follow the steps below to configure your cloud environment properly for smooth running.
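
Until then, here is a minimal JDBC sketch of the user-creation pattern that sidesteps the naming and privilege errors above. It assumes you connect as an administrative user to the container root; the connection URL, user names, and passwords are all placeholders:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class CreateCloudSchema {
        public static void main(String[] args) throws Exception {
            // Connect as an administrative user to the container root
            // (requires the Oracle JDBC driver on the classpath; the URL,
            // user, and passwords below are placeholders).
            try (Connection conn = DriverManager.getConnection(
                         "jdbc:oracle:thin:@//dbaas-host:1521/ORCL",
                         "system", "adminPassword");
                 Statement stmt = conn.createStatement()) {

                // The C## prefix satisfies the common-user naming rule,
                // avoiding ORA-65096.
                stmt.execute("CREATE USER C##TEST IDENTIFIED BY Welcome1 "
                        + "CONTAINER=ALL");

                // CREATE SESSION lets the JCS data source log on,
                // avoiding ORA-01045.
                stmt.execute("GRANT CREATE SESSION TO C##TEST CONTAINER=ALL");
            }
        }
    }

Using the same C##-prefixed user name and password in the JCS data source definition then avoids ORA-01017 (invalid username/password) as well.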

Continue reading…