The majority of the Fortune 50 companies have made the leap to Hadoop. However, a significant number of other companies still stay away from it or are still evaluating it. This comes 10 years after Google first came up with a way to track and analyze the search queries that you and I fired at it.
The fact is that everybody should be, and needs to be, leveraging Big Data processing to derive the most relevant business intelligence their data can provide. But change is something that does not come naturally to most. Through this post, we will try to scrape together some compelling reasons to stay with ETL, or to look for developer aids for Big Data development frameworks like MapReduce/Spark/Tez/Hive/Pig.
Hadoop means Big Data
The perception abounds that Hadoop is ideally suited for terabytes, if not petabytes, of data. If you are dealing with a couple of million records that change in your warehouse, you are better off using a traditional ETL or ELT model.
The volume of data changing on a day-to-day basis may not be enough to justify the move to Hadoop from the current combination of warehouse implementation, reporting and ETL tool.
The catch: what exactly is the tipping point for your data volume to go from medium data to big?
ETL is best handled by a tool specifically designed to Extract, Transform and Load
Big Data development frameworks, like MapReduce, have only been around for 10 years or so, incubating in the open source community. They come with good backing from the likes of Google and Yahoo; however, when compared to the extensive support and research done by proprietary ETL tool vendors, they are still at a nascent stage.
ETL Tool vendors across the board have always been at the cutting edge of BI needs. They are also constantly looking for feedback or “demands” from their customer base. Regular updates / upgrades incorporating these suggestions, along with backward compatibility, ensure continuity for business processes without the need for extensive retooling or relearning for technical and subject matter experts.
ETL tools have generally recognized the need for businesses to connect to HDFS and process Big Data. They are now providing the necessary components to do this within the development environment that developers are familiar with.
The self-documenting nature of tool-based ETL, compared with script-based Big Data ELT, leads to shorter ramp-up times for new developers. This fits in well with staffing needs, since the Hadoop framework has yet to reach the maturity level at which it is supported by a large pool of developers. As far as learning curves go, developers respond better to visual aids than to extensive scripting. Experience in training warehouse solution specialists has indicated significantly better throughput with tool-based ETL.
Let me take a leap of faith here and say that metadata management is better with certain tools. Sure, the NameNode does know where the data lies, but for a business analyst looking for data lineage, or a batch support analyst looking for CDC, the GUI nature of tools creates a distinct edge in favor of tool-based ETL.
Business rules management
A few ETL tools provide value-added products on top of the base ETL suite that enable business analysts, data modelers or even end business users to build the rules that govern how raw data is processed. The immediate and measurable effect is seen primarily in the availability and quality of data. The ability to define and apply rules using the same tools that an ETL developer uses provides an effective collaborative medium for a project team. This becomes even more important when a BI requirement is run in an agile war room scenario.
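To make the idea of analyst-owned rules a little more concrete, here is a minimal sketch in Python. It does not reflect any particular vendor's product; the `Rule` class, the two sample rules and the field names `customer_id` and `amount` are all hypothetical, chosen only for illustration. The point is that rules are declared as named data, separately from the code that applies them, so a rejected record can be traced back to the rule it broke.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

Record = Dict[str, Any]


@dataclass(frozen=True)
class Rule:
    """A named business rule: a description plus a check on one record."""
    name: str
    check: Callable[[Record], bool]


# Hypothetical rules; the field names are assumptions made for this example.
RULES: List[Rule] = [
    Rule("customer_id is present", lambda r: bool(r.get("customer_id"))),
    Rule(
        "amount is non-negative",
        lambda r: isinstance(r.get("amount"), (int, float)) and r["amount"] >= 0,
    ),
]


def apply_rules(
    records: List[Record], rules: List[Rule]
) -> Tuple[List[Record], List[Tuple[Record, List[str]]]]:
    """Split records into accepted ones and rejected ones with the failed rule names."""
    accepted: List[Record] = []
    rejected: List[Tuple[Record, List[str]]] = []
    for record in records:
        failed = [rule.name for rule in rules if not rule.check(record)]
        if failed:
            rejected.append((record, failed))
        else:
            accepted.append(record)
    return accepted, rejected
```

Because each rule carries a readable name, the rejected list doubles as a simple data quality report that an analyst can review without reading the transformation code.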
The genesis of Hadoop lies in the time and effort of Google (or its developers). As far as bottom-line-driven corporations are concerned, why make the investment? Simply because there was nothing in the market at the time to tackle the volume of data that they were dealing with.
The ETL biggies are gearing up to handle big data and are offering services that Hadoop does not (like metadata management). This convert to ETL from distributed applications still thinks the future is challenging enough for ETL tools to continue to excite the IT specialist and business analyst. The key to recognizing whether or not your ETL tool is relevant is to focus on the value it brings to the data.
BitWise has built a development platform on top of Hadoop named BHS. It is a developer-friendly and maintenance-friendly ELT solution that addresses some of the developer concerns related to maintainability and the ability to ramp up new teams. It provides an abstraction layer on Hadoop and helps a typical ETL developer build ETLs on Hadoop without any understanding of Big Data technologies.
Talk to us about how BHS can add value to your Big Data ETL.