Today, the Hadoop ecosystem has become a must-have enterprise technology stack for organizations seeking to process and understand large-scale data in real time. Hadoop has many enterprise applications, such as data lakes, analytics, ELT, and ad hoc processing, and more are being discovered at an increasingly fast pace.
The first step in any Hadoop data processing pipeline is to ingest data into Hadoop, which makes data ingestion the first hurdle in utilizing the power of Hadoop.
What to Explore
Hadoop data ingestion comes with challenges such as:
- There can be many different source types, such as OLTP systems generating events, batch systems generating files, RDBMS systems, web-based APIs, and more
- Data may arrive in different formats, such as ASCII text, EBCDIC and COMP fields from mainframes, JSON, and Avro
- Data often needs to be transformed before it is persisted on Hadoop. Common transformations include data masking, conversion to a standard format, application of data quality rules, and encryption
- As more and more data is ingested into Hadoop, metadata plays an increasingly important role. There is no point in having large volumes of data without knowing what is available. Discovering data sets and their key attributes, such as format, schema, owner, refresh rate, source, and security policy, should be simple and easy. Features like custom tagging, a data set registry, and a searchable repository can make life much easier. What is needed is a data set registry and data governance tool that can communicate with the data ingestion tool to pass and use this metadata
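To make the transformation and metadata challenges above more concrete, here is a minimal Python sketch of what "transform before persisting" and "searchable metadata" could look like. It is an illustration only, not how any particular ingestion tool works: the names `mask_value`, `is_valid`, `ingest`, `DatasetEntry`, and `DatasetRegistry`, the salt, and the quality rule itself are all hypothetical.

```python
import hashlib
from dataclasses import dataclass, field


def mask_value(value: str, salt: str = "demo-salt") -> str:
    """Mask a sensitive value with a truncated, salted SHA-256 digest (hypothetical scheme)."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:16]


def is_valid(record: dict) -> bool:
    """Hypothetical data quality rule: 'id' must be non-empty and 'amount' non-negative."""
    amount = record.get("amount")
    return bool(record.get("id")) and isinstance(amount, (int, float)) and amount >= 0


def ingest(records, sensitive_fields):
    """Split records into accepted (masked) and rejected (failed the quality rule)."""
    accepted, rejected = [], []
    for record in records:
        if not is_valid(record):
            rejected.append(record)
            continue
        masked = dict(record)  # copy so the input record is left untouched
        for name in sensitive_fields:
            if name in masked:
                masked[name] = mask_value(str(masked[name]))
        accepted.append(masked)
    return accepted, rejected


@dataclass
class DatasetEntry:
    """Metadata kept alongside an ingested data set (fields taken from the list above)."""
    name: str
    fmt: str
    owner: str
    source: str
    refresh_rate: str
    tags: set = field(default_factory=set)


class DatasetRegistry:
    """In-memory stand-in for a searchable data set registry."""

    def __init__(self):
        self._entries = {}

    def register(self, entry: DatasetEntry) -> None:
        self._entries[entry.name] = entry

    def search_by_tag(self, tag: str) -> list:
        """Return the names of all data sets carrying the given custom tag, sorted."""
        return sorted(n for n, e in self._entries.items() if tag in e.tags)


if __name__ == "__main__":
    rows = [
        {"id": "1", "amount": 10, "ssn": "123-45-6789"},
        {"id": "", "amount": 5, "ssn": "987-65-4321"},  # fails the quality rule
    ]
    good, bad = ingest(rows, sensitive_fields=["ssn"])
    registry = DatasetRegistry()
    registry.register(DatasetEntry("payments", "json", "finance", "oltp", "hourly", {"pii"}))
    print(len(good), len(bad), registry.search_by_tag("pii"))
```

In a real deployment the registry would be a shared service rather than a dictionary, and masking would follow the organization's security policy; the point here is only that quality checks, masking, and metadata capture can sit in the same ingestion step.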
At present, many tools are available for ingesting data into Hadoop. Some are suited to specific use cases: Apache Sqoop is a great tool for importing and exporting data from RDBMS systems, Apache Falcon is a good option for a data set registry, and Apache Flume is preferred for ingesting real-time event streams. There are many commercial alternatives as well. A few tools, such as Spring XD (now Spring Cloud Data Flow) and Gobblin, are general purpose. The selection of options can be overwhelming, and you certainly need the right tool for your job.
But none of these tools can solve all of the challenges, so enterprises have to use multiple tools for data ingestion. Over time, they also create custom tools or wrappers on top of existing tools to meet their needs. Furthermore, all these tools rely on text-based configuration files (mostly XML), which are not convenient or user friendly to work with. All of this results in a lot of complexity and overhead in maintaining data ingestion applications.
Looking at these gaps, and to enable our clients to streamline Hadoop adoption, Bitwise has developed a GUI-based tool for data ingestion and transformation on Hadoop. With a convenient drag-and-drop GUI, it enables developers to quickly build end-to-end data pipelines from a single tool. Apart from multiple source and target options, it also has many pre-built transformations that range from the usual data warehousing operations to machine learning and sentiment analysis. The tool is loaded with the following data ingestion features:
- Pluggable Sources and Targets – As new source and target systems emerge, it’s convenient to integrate them with the ingestion framework
- Scalability – The tool scales to ingest large volumes of data at high velocity
- Masking and Transforming On The Fly – Transformations like masking and encryption can be applied on the fly as data moves through the pipeline, so ingestion stays swift
- Data Quality – Data quality checkpoints can be enforced before data is published
- Data Lineage and Provenance – Detailed data lineage and provenance can be tracked
- Searchable Metadata – Datasets and their metadata are searchable, with the option to apply custom tags
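As a rough illustration of the "pluggable sources and targets" idea in the list above, the Python sketch below shows one common way such a design can be structured: sources and targets implement small interfaces, and new source types register themselves under a name. This is not the Bitwise tool's actual API; `Source`, `Target`, `register_source`, `CsvLinesSource`, `InMemoryTarget`, and `run_pipeline` are hypothetical names chosen for the example.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, Iterable, Iterator, List

Record = Dict[str, object]


class Source(ABC):
    """Anything that can produce records for ingestion."""

    @abstractmethod
    def read(self) -> Iterator[Record]:
        ...


class Target(ABC):
    """Anything that can persist records; returns the number written."""

    @abstractmethod
    def write(self, records: Iterable[Record]) -> int:
        ...


# Registry mapping a source type name to the class that implements it.
SOURCES: Dict[str, Callable[..., Source]] = {}


def register_source(kind: str):
    """Class decorator that plugs a new source type into the registry."""
    def decorator(cls):
        SOURCES[kind] = cls
        return cls
    return decorator


@register_source("csv_lines")
class CsvLinesSource(Source):
    """Toy source: turns comma-separated lines into records (no quoting support)."""

    def __init__(self, lines: List[str], columns: List[str]):
        self.lines = lines
        self.columns = columns

    def read(self) -> Iterator[Record]:
        for line in self.lines:
            yield dict(zip(self.columns, line.split(",")))


class InMemoryTarget(Target):
    """Toy target standing in for HDFS, Hive, or any other real sink."""

    def __init__(self):
        self.rows: List[Record] = []

    def write(self, records: Iterable[Record]) -> int:
        before = len(self.rows)
        self.rows.extend(records)
        return len(self.rows) - before


def run_pipeline(kind: str, target: Target, **source_args) -> int:
    """Look up the source by name, stream its records into the target."""
    source = SOURCES[kind](**source_args)
    return target.write(source.read())


if __name__ == "__main__":
    sink = InMemoryTarget()
    written = run_pipeline("csv_lines", sink, lines=["1,alice", "2,bob"], columns=["id", "name"])
    print(written, sink.rows)
```

With this kind of structure, supporting a new system means adding one class and one registration line, while the pipeline code itself stays unchanged.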
Bitwise’s Hadoop data ingestion and transformation tool can save enormous effort in developing and maintaining data pipelines. Stay tuned for subsequent features that explore the other phases of the data value chain.
Editor's Note: This blog was originally posted in May 2016 and updated in March 2023 for accuracy.