The advent of Hadoop has taken enterprises by storm. Most enterprises today have one or more Hadoop clusters at various stages of maturity within their organization. Enterprises are trying to cut infrastructure and licensing costs by offloading storage and processing to Hadoop. In almost all these cases, the data warehouse is the first candidate for Hadoop adoption, both because it hosts the largest volume of data in the enterprise and because it runs some of the most processor-intensive workloads.
Until fairly recently, the data warehouse space was dominated by RDBMSes and traditional ETL tools. ETL processes form the backbone of data warehousing; they have long been the way to process large volumes of data and prepare them for reporting and analysis. That notion, however, has been challenged of late by the rise of Hadoop. Traditional ETL tools are constrained by scalability limits and cost overruns, problems that Hadoop addresses well. And while ETL processes have traditionally served data warehouse needs, the three Vs of big data (volume, variety and velocity) make a compelling case for moving to ELT on Hadoop.
Let us take a high-level comparative look at the traditional ETL process versus ELT on Hadoop.
ETL stands for Extract, Transform and Load. The ETL process typically extracts data from the source/transactional systems, transforms it to fit the data warehouse model and finally loads it into the data warehouse.
The transformation step involves cleansing, enriching and applying transformations to produce the desired output. Data is usually dumped to a staging area after extraction; in some cases, the transformations are applied on the fly and the results loaded to the target system without an intermediate staging area. The diagram below illustrates a typical ETL process.
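To make the flow concrete, here is a minimal sketch of an ETL job in Python. It is purely illustrative: the SQLite databases, the orders table and the fixed currency rates are all assumptions standing in for a real source system, ETL engine and warehouse.

```python
import sqlite3

# Hypothetical source and warehouse databases; in practice these would be
# separate systems reached over JDBC/ODBC or a vendor connector.
src = sqlite3.connect("source.db")
dwh = sqlite3.connect("warehouse.db")

dwh.execute("""CREATE TABLE IF NOT EXISTS fact_sales (
    order_id INTEGER, customer_id INTEGER, amount_usd REAL, order_date TEXT)""")

# Extract: pull only the columns the predefined warehouse model needs.
rows = src.execute(
    "SELECT order_id, customer_id, amount, currency, order_date FROM orders")

# Transform: cleanse and standardize *before* loading -- the defining trait
# of ETL. The fixed conversion rates are illustrative only.
RATES = {"USD": 1.0, "EUR": 1.1}
clean = [
    (oid, cid, amt * RATES[cur], date)
    for oid, cid, amt, cur, date in rows
    if amt is not None and cur in RATES
]

# Load: only the transformed, model-conforming rows reach the warehouse.
dwh.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)", clean)
dwh.commit()
```

Note that anything the transformation drops here is gone for good; the warehouse never sees the raw extract.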
The development process usually works backward from the output, since the data model for the target system (i.e. the data warehouse) is predefined. Because the warehouse model is fixed up front, only the relevant and important data is pulled from the source system and loaded into the data warehouse.
ELT stands for Extract, Load and Transform. Instead of loading only the transformed data into the target system, the ELT process loads the entire raw data set into the data lake, which results in faster load times. Optionally, the load step can also apply some basic validation and data cleansing rules. The data is then transformed for analytical reporting on demand. Though ELT has been practiced for some time, it has only gained popularity with the rise of Hadoop. The diagram below illustrates a typical ELT process on Hadoop.
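A minimal PySpark sketch of the same idea follows, assuming hypothetical HDFS paths and an orders CSV extract: the raw data lands in the lake unchanged, and the transformation runs later on the cluster itself, only when reporting demands it.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("elt-sketch").getOrCreate()

# Load: land the raw source extract in the data lake as-is.
# (The hdfs:// paths are assumptions; any lake storage works the same way.)
raw = spark.read.option("header", True).csv("hdfs:///landing/orders/")
raw.write.mode("append").parquet("hdfs:///lake/raw/orders/")

# Transform: later, on demand, reshape the raw data with the cluster's own
# compute. Because the raw copy is retained, different consumers can derive
# different views from it (schema-on-read).
orders = spark.read.parquet("hdfs:///lake/raw/orders/")
daily_revenue = (
    orders.where(F.col("amount").isNotNull())
          .groupBy("order_date")
          .agg(F.sum(F.col("amount").cast("double")).alias("revenue"))
)
daily_revenue.write.mode("overwrite").parquet(
    "hdfs:///lake/curated/daily_revenue/")
```

The contrast with the ETL sketch above is the ordering: nothing is filtered or reshaped before the load, so no information is lost on the way into the lake.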
Though the ETL process and traditional ETL tools have served data warehouse needs well, the changing nature of data and its rapidly growing volume have made the case for moving to Hadoop. Beyond the obvious cost effectiveness and scalability of Hadoop, ELT on Hadoop provides a more flexible data processing environment: because the raw data is retained in the lake, new transformations can be derived from it at any time.
Transitioning from traditional ETL tools and data warehouse environments to ELT on Hadoop is a big challenge, one that almost all enterprises are currently facing. Beyond the change in environment and technical skillset, it requires a change in mindset and approach; ELT is not as simple as rearranging the letters. On one hand you have developers with years of ETL tool experience and business knowledge; on the other, the long-term benefit of moving to ELT on Hadoop. Training an existing workforce that is conversant with drag-and-drop GUI-based tools to program in Java is a time-consuming challenge. To bridge this technology gap, Bitwise contributed to the development of Hydrograph, an open source ELT tool on Hadoop.
Hydrograph is a desktop-based ELT tool with drag-and-drop functionality for creating data processing pipelines, much like any legacy ETL tool. Its biggest differentiator, however, is that it is built solely for ELT on the Hadoop ecosystem (including engines such as Spark and Flink). Hydrograph has a short learning curve for existing ETL developers, which enables enterprises to quickly migrate to ELT processing on Hadoop or Spark. Hydrograph's plug-and-play architecture keeps the data processing pipelines independent of the underlying execution engine, making them obsolescence-proof.
Prabodh has wide-ranging experience spanning web application development, warehouse development, big data development and governance, giving him a unique blend of application and warehouse development expertise. Drawing on this experience, he successfully led the development and open sourcing of Hydrograph, Bitwise's open source ELT tool. As a Lead Consultant at Bitwise, Prabodh is responsible for delivering mission-critical big data processing projects on the Hadoop technology stack to our Fortune 500 customers.