Ab Initio and AWS Glue Fundamentals for Data Integration
To get started, let’s walk through the fundamentals of Ab Initio and AWS Glue to provide background and an overview of each tool.
Ab Initio has over 25 years of experience with advanced distributed computing systems. You can run your operations on the cloud, on-premise, or in any combination. Whether you want to run across containers with Kubernetes, Unix/Linux boxes, Windows, or mainframes, Ab Initio does it all. You develop your code once and deploy it wherever you need it.
Ab Initio is a general-purpose data processing platform for enterprise-class, mission-critical applications such as data warehousing, batch processing, clickstream processing, data movement, data transformation, and analytics. It supports the integration of arbitrary data sources and programs and provides complete metadata management across the enterprise.
Ab Initio solves some of the most challenging data processing problems for leading organizations in telecommunications, finance, insurance, healthcare, e-commerce, retail, transport and other industries, whether integrating disparate systems, managing big data, or supporting business-critical operations. Ab Initio solutions can be built and deployed quickly, deliver strong performance and scalability, and are designed from the ground up as a single, cohesive technology platform for scalable, high-performance data processing, integration, and governance.
Below we have listed out the main features of Ab Initio.
- Application specification, design and implementation can be done using Ab Initio graphs/psets in the Graphical Development Environment.
- Business rules specification and implementation can be specified in the BRE (Business Rules Engine).
- A single engine for all aspects of application execution in the Co>Operating System.
- Application orchestration can be performed in Conduct>It.
- Operational management can be achieved with Ab Initio Control>Center.
- Metadata capture, analysis and display can be done in Metadata Hub.
- Federated queries across virtually any data source can be performed through Query>It, a high-performance, scalable SQL engine.
- Data management, including very large data storage (hundreds of terabytes to petabytes), data discovery, analysis, quality and masking can be done using the Testing Framework.
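Ab Initio applications are built as graphs: components (read, reformat, filter, write) wired together by data flows and executed by the Co>Operating System. Ab Initio's own graph language is proprietary, so as a purely conceptual sketch, the same dataflow idea can be modeled in plain Python with generator "components" chained into a pipeline (the component and field names below are hypothetical):

```python
# Toy dataflow-graph model (plain Python, NOT Ab Initio code): each
# "component" consumes an upstream flow of records and yields records
# downstream, mimicking how a graph wires components with data flows.

def read_records(rows):
    """Input component: emit raw records one at a time."""
    for row in rows:
        yield row

def reformat(flow):
    """Transform component: normalize each record's fields."""
    for rec in flow:
        yield {"name": rec["name"].strip().title(),
               "amount": float(rec["amount"])}

def filter_by_amount(flow, minimum):
    """Filter component: keep records at or above a threshold."""
    for rec in flow:
        if rec["amount"] >= minimum:
            yield rec

def run_graph(rows, minimum):
    """Stand-in for the execution engine: run the wired-up graph end to end."""
    return list(filter_by_amount(reformat(read_records(rows)), minimum))

raw = [
    {"name": "  alice ", "amount": "120.50"},
    {"name": "BOB", "amount": "15.00"},
]
result = run_graph(raw, minimum=100.0)
```

Because each component only touches one record at a time, the pipeline streams data rather than materializing it, which is the same property that lets dataflow engines scale to large volumes.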
AWS Glue offers a serverless approach to data-driven workloads such as analytics and machine learning, and to tasks such as discovering, preparing, and combining data. It provides data integration services to transform data in a cost-efficient manner. It is a cloud-based ETL service powered by a big data engine built for data-intensive computation.
AWS Glue architecture consists of three major parts.
- Data Catalog: The Data Catalog contains the schema layout for the heterogeneous data used as the source/target of ETL jobs. It is a central repository that acts as an index to different data sources, and it also stores location and runtime metrics.
- Scheduling: AWS Glue Studio comes with a scheduler that lets users automatically kick off a series of jobs according to their dependencies or on a trigger-based event.
- ETL Engine: AWS Glue is built on top of Apache Spark and leverages Spark's big data architecture for efficient, data-intensive computing. It automatically generates the PySpark/Scala code for every ETL job built with the drag-and-drop GUI. The GUI also provides Workflows, which orchestrate a number of ETL jobs in the required order.
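To make the Data Catalog's role concrete, here is a minimal conceptual sketch in plain Python (this is not the AWS Glue API): the catalog acts as a central index mapping each table to the schema, storage location, and runtime metrics that ETL jobs need in order to read or write it. All names and the S3 path below are hypothetical:

```python
# Conceptual model of a data catalog as a central index (plain Python,
# NOT the AWS Glue API): jobs look up schema and location by table name
# instead of hard-coding them.
from dataclasses import dataclass, field

@dataclass
class CatalogTable:
    schema: dict                                  # column name -> type
    location: str                                 # e.g. an S3 path
    metrics: dict = field(default_factory=dict)   # e.g. counts from a crawl

class DataCatalog:
    def __init__(self):
        self._tables = {}

    def register(self, database, table, entry):
        """Record (or update) a table's metadata, as a crawler would."""
        self._tables[(database, table)] = entry

    def lookup(self, database, table):
        """Resolve a table reference, as an ETL job would at runtime."""
        return self._tables[(database, table)]

catalog = DataCatalog()
catalog.register(
    "sales_db", "orders",
    CatalogTable(
        schema={"order_id": "bigint", "total": "double"},
        location="s3://example-bucket/orders/",   # hypothetical path
        metrics={"record_count": 2},
    ),
)
entry = catalog.lookup("sales_db", "orders")
```

The design point this illustrates: because jobs resolve tables through the catalog at runtime, the physical location or schema can change without editing every job that consumes the table.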
Below we have listed out the main features of AWS Glue.
- Glue Studio offers a GUI-based IDE.
- The job script code editor gives you the flexibility to edit and write custom ETL logic in the generated job scripts. The IDE supports syntax and keyword highlighting, plus auto-completion for local words, Python keywords, and code snippets.
- A Crawler is a program that connects to a data store (source or target), progresses through a prioritized list of classifiers to determine the schema of your data, and then creates metadata tables in the AWS Glue Data Catalog.
- Workflows are created to visualize complex ETL activities involving multiple crawlers, jobs, and triggers. A Workflow can be scheduled daily, weekly, or monthly, or started manually from the AWS Glue console as required. With the help of Triggers within a workflow, we can create a large chain of interdependent jobs and crawlers.
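The trigger-chaining behavior described above amounts to dependency ordering: each step (job or crawler) runs only after its upstream steps finish. As a hedged sketch in plain Python (not the Glue Workflows API, and with hypothetical step names), a topological sort over the dependency graph produces the order in which a workflow would fire its steps:

```python
# Dependency-ordered execution of workflow steps (plain Python sketch,
# NOT the AWS Glue API). graphlib is in the standard library (3.9+).
from graphlib import TopologicalSorter

# step -> set of upstream steps whose completion triggers it
workflow = {
    "raw_crawler": set(),               # starts on a schedule
    "clean_job": {"raw_crawler"},       # fired by the crawler finishing
    "clean_crawler": {"clean_job"},
    "report_job": {"clean_crawler"},
}
run_order = list(TopologicalSorter(workflow).static_order())
```

`TopologicalSorter` also raises an error on circular dependencies, mirroring the fact that a workflow's trigger graph must be acyclic to ever complete.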
Pros and Cons of Ab Initio and AWS Glue
Now that we have taken a detailed look at the features of Ab Initio and AWS Glue, let’s review some of the pros and cons of each.
Pros of Ab Initio
- Feature-rich ETL tool for on-premise execution of ETL workflows.
- The Ab Initio product suite provides a wide range of features through different products, such as the Ab Initio GDE for building data pipelines and Conduct>It for orchestration. These tools integrate seamlessly with each other.
- If you have existing Ab Initio applications, migration from on-premises execution to cloud execution is straightforward, because the business logic and programming model stays the same no matter where these applications run.
Cons of Ab Initio
- Historically, Ab Initio was developed for on-premise implementation and is still catching up with cloud-based implementations.
- Each Ab Initio product in the suite needs separate licensing, which also builds a heavy dependency on a single vendor for your implementation needs.
- Ab Initio's documentation and help content are not accessible without a license; they are confidential and classified as trade secrets. This makes it difficult to find information on public forums, which in turn makes training and resourcing difficult.
- Ab Initio has a custom pricing model, which makes it difficult for an organization to estimate costs up front.
- For cloud-based implementations of Ab Initio, the cloud infrastructure cost is over and above the Ab Initio licensing cost, increasing the total cost of ownership.
Pros of AWS Glue
- AWS Glue is offered as a pay-as-you-go service, with pricing that is inclusive of the cloud infrastructure. This reduces the total cost of ownership.
- AWS Glue was developed with the cloud as its execution platform, so advantages like scalability, serverless design, and high availability come implicitly, without needing to be addressed separately.
- Glue Studio has an easy-to-use drag-and-drop development environment that generates open source code (PySpark/Scala), which can be further enhanced and executed independently.
- Organizations can use the various cost estimator services provided by AWS to get a good estimate of the cost upfront, before getting into actual implementation.
Cons of AWS Glue
- AWS Glue is built for cloud implementation, so organizations with an existing on-premise implementation face a one-time effort to migrate/rewrite legacy ETL tool code into Glue.
- AWS Glue was introduced to the market in August 2017, which means some of its features are still evolving compared to more established tools. However, the Glue team proactively enhances the product and supports its customers to ensure the tool meets their data integration needs.
Choosing the Right Modern Data Integration Tool
As we have seen, Ab Initio and AWS Glue both offer advanced data integration capabilities. The choice between the two depends on your business requirements, resources, and modernization strategy.
Fit for Modern Data Architecture
Ab Initio has long been a trusted workhorse for organizations with data-intensive analytical requirements. Indeed, Ab Initio can cover all your data needs, such as data profiling, business rules, data quality, data lineage, orchestration, and streaming, with seamless integration.
As organizations focus on digital transformation to stay competitive in a rapidly changing marketplace, legacy tools like Ab Initio can be a major holdup to your cloud modernization strategy. This is what makes AWS Glue a compelling option for your modern data requirements. Since it is cloud native and built to fit cohesively in the AWS analytics ecosystem, Glue gives you the flexibility to scale to meet growing demand and seamlessly take advantage of the advanced analytics and AI/ML capabilities that are critical to your modernization objectives.
AWS Glue is a core component of lake house architecture in AWS. As a modern data architecture, the lake house approach is not just about integrating data lake and warehouse, but it’s about connecting your data lake, your data warehouse, and all your other purpose-built services into a coherent whole. So with the right solution design your AWS Glue data pipelines can work in concert with other key services like EMR for big data processing, Redshift for analytics consumption, and SageMaker for machine learning to enable prediction-based actions – all within an optimized cloud data ecosystem.
Where do we go from here?
For enterprises with a complex mix of data systems spanning on-premise, cloud, and even hybrid environments, setting the strategy and selecting the right tools to meet current needs and future growth is only the first piece of the puzzle.
Understanding the impact of modernizing legacy workflows in terms of total cost of ownership, time-to-market, performance and usability can be a major challenge. To help organizations determine the best tools to meet analytics objectives and evaluate the optimal path for modernizing data integration, Bitwise offers expert consulting services based on 25 years of enterprise data management experience.