Today, Hadoop is one of the most sought-after technologies in the “Big Data” world. It has come a long way since its inception, thanks to its robust community. Its significance has grown rapidly, and it has become an essential tool for organizations that want to unlock business value by collecting, crunching and analyzing their data at scale. But while Hadoop holds massive potential for your business, it’s not without challenges.
Most organizations start Hadoop adoption with a Proof of Concept (POC) or a small setup. These setups are mostly carried out by in-house Linux or database administrators who have learnt the basics of Hadoop administration, which mostly involves installing a small cluster with experimental data. This approach serves the purpose of learning the basics; however, it is not sufficient for managing a production Hadoop cluster. Eventually, as data grows, the POC setup is converted to a production cluster and Hadoop’s true capabilities come into play. That transition also exposes significant challenges in scaling, data management and cluster administration under a production enterprise workload.
What to Explore
A Hadoop administration team’s responsibilities start when a company kicks off its Hadoop POC. An experienced team like Bitwise comes up with a roadmap right at the beginning to help scale from POC to production with minimal wastage of the initial investment, and offers effective guidance on investment decisions, whether for in-house infrastructure, the POC itself or a choice of PaaS options.
For any organization, understanding the estimated investment is mandatory in the initial phases. Capacity planning and estimation is the next step after successful completion of the POC. Choosing the right combination of storage and computing hardware, interconnecting network, operating system, storage configuration and disk performance, network setup, etc. plays an important role in overall cluster performance. Similarly, special consideration is required for the hardware configuration of the master and slave nodes. The right balance of needs vs. greed can be achieved only after years of implementation experience.
Once you have the hardware defined and in place, the next stage is the planning and deployment of the Hadoop cluster. This involves configuring the OS with the recommended changes to suit the Hadoop stack, configuring SSH and disks, choosing and installing a Hadoop distribution (Cloudera, Hortonworks, MapR or Apache Hadoop) as per the requirements, and meeting the configuration requirements of the Hadoop daemons for optimized performance. All of these steps vary with the size of your cluster, so it’s imperative that you configure and deploy only after covering all the aspects and prerequisites.
Another important aspect is designing the cluster from a development perspective, i.e. the various environments (Dev, QA, Prod etc.), and from a usage perspective, i.e. access security and data security.
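As a rough illustration of the capacity estimation step, the sketch below sizes the data node count from the expected data volume. The replication factor of 3 is the HDFS default; the 25% allowance for temporary and intermediate data, the usable disk per node and the growth figures are illustrative assumptions rather than recommendations, and the function name is hypothetical.

```python
import math


def estimate_data_nodes(data_tb, replication=3, temp_overhead=0.25,
                        usable_tb_per_node=36.0, annual_growth=0.0, years=0):
    """Estimate how many data nodes are needed to hold `data_tb` of data.

    All defaults except the replication factor are illustrative assumptions.
    """
    # Project the data volume forward with compound annual growth.
    projected_tb = data_tb * (1 + annual_growth) ** years
    # Every block is stored `replication` times; add headroom for
    # temporary and intermediate job output.
    raw_tb = projected_tb * replication * (1 + temp_overhead)
    return math.ceil(raw_tb / usable_tb_per_node)


if __name__ == "__main__":
    # 100 TB today, growing 50% a year, planned two years out.
    print(estimate_data_nodes(100, annual_growth=0.5, years=2))
```

A real sizing exercise would also account for compute, memory and network requirements, which this storage-only estimate leaves out.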
Managing a Hadoop Cluster
After implementation of the Hadoop cluster, the Hadoop admin team needs to maintain the health and availability of the cluster round the clock. Common tasks include management of the name node, data nodes, HDFS and MapReduce jobs, which form the core of the Hadoop ecosystem. An impact on any of these components can negatively affect cluster performance. For example, if a data node becomes unavailable, say due to a network issue, HDFS will re-replicate the under-replicated blocks. This adds considerable overhead and can slow the cluster down, or even make it inaccessible if multiple data nodes disconnect.
The name node is another important component of a Hadoop cluster and acts as a single point of failure. Consequently, it is important that backups of the fsimage and edit logs are taken periodically using the secondary name node, so that the cluster can recover from a name node failure. Other administrative tasks include:
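To make the data node example concrete, here is a toy model of what happens to block replicas when nodes go offline. It is a simplified sketch, not how the name node is implemented: the block-to-node map, node names and function name are hypothetical, and a replication factor of 3 is assumed.

```python
def replicas_to_restore(block_locations, failed_nodes, replication=3):
    """Return (under_replicated, lost) after the given data nodes go offline.

    under_replicated maps block id -> number of replicas to re-create;
    lost lists blocks with no surviving replica.
    """
    failed = set(failed_nodes)
    under_replicated, lost = {}, []
    for block_id, nodes in block_locations.items():
        live = set(nodes) - failed
        if not live:
            # No surviving copy: the block cannot be re-replicated.
            lost.append(block_id)
        elif len(live) < replication:
            under_replicated[block_id] = replication - len(live)
    return under_replicated, lost


if __name__ == "__main__":
    blocks = {
        "blk_1": ["dn1", "dn2", "dn3"],
        "blk_2": ["dn2", "dn3", "dn4"],
        "blk_3": ["dn1", "dn4", "dn5"],
    }
    print(replicas_to_restore(blocks, ["dn1"]))
    print(replicas_to_restore(blocks, ["dn1", "dn4", "dn5"]))
```

Losing one node only creates copy work for the cluster, while losing several at once can leave some blocks with no replica at all, which is the inaccessible case described above.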
- Managing HDFS quota at application or user level
- Configuring the scheduler (FIFO, Fair or Capacity) and resource allocation to different services like YARN, Hive, HBase, HDFS etc.
- Upgrading and applying patches
- Configuring logging for effective debugging in case of failures or performance issues
- Commissioning and decommissioning nodes
- User management
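As an example of the quota task in the list above, the following sketch builds the `hdfs dfsadmin` commands that set a name quota and a space quota on a directory. The helper name and the `/user/etl` path are hypothetical, and the limits are illustrative. The helper only builds argument lists; running them is left to the caller.

```python
def quota_commands(directory, max_names=None, max_bytes=None):
    """Build `hdfs dfsadmin` argument lists that set quotas on a directory.

    Nothing is executed here; each list can be passed to subprocess.run
    on a host with an HDFS client and administrator rights.
    """
    commands = []
    if max_names is not None:
        # Name quota: limit on the number of files and directories.
        commands.append(
            ["hdfs", "dfsadmin", "-setQuota", str(max_names), directory])
    if max_bytes is not None:
        # Space quota in bytes; each replica counts against it.
        commands.append(
            ["hdfs", "dfsadmin", "-setSpaceQuota", str(max_bytes), directory])
    return commands


if __name__ == "__main__":
    # Hypothetical per-application directory: 1M names, 30 TiB of raw space.
    for cmd in quota_commands("/user/etl", 1_000_000, 30 * 2**40):
        print(" ".join(cmd))
```

Because replicas count against the space quota, 30 TiB of quota holds roughly 10 TiB of user data at a replication factor of 3.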
Hardening your Hadoop Cluster
Productionization of a Hadoop cluster mandates implementation of hardening measures. Hardening of Hadoop typically covers:
- Configuring Security: This is one of the most crucial and necessary configurations for making your cluster enterprise ready, and it can be classified at user and data level.
  - User Level: User security addresses the authentication (who am I) and authorization (what can I do) parts of the security implementation, along with configuring access control over resources. Kerberos provides the authentication protocol between client and server applications and is commonly integrated with LDAP for easier management. Different distributions recommend different authorization mechanisms. For example, Cloudera has good integration with Sentry, which provides fine-grained, role-based security for Hive and Impala. Further integration with HDFS ACLs extends the same access rules to other services like Pig, HBase, etc.
  - Data Level: For data security, HDFS transparent encryption provides another level of protection for data at rest. This is a mandatory requirement for organizations that must comply with government and financial regulatory bodies. Having transparent encryption built into HDFS makes it easier for organizations to comply with these regulations.
- High Availability: As mentioned earlier, the name node is a single point of failure, and its unavailability makes the whole cluster unavailable, which is not acceptable for a production cluster. Name node HA helps to mitigate this risk with a standby name node that automatically takes over from the primary name node in case of failure.
- Name Node Scaling: This is mostly applicable to large clusters. Because the name node keeps the file system metadata in memory, a large number of files can make name node memory a bottleneck. HDFS federation helps resolve this by allowing multiple name nodes, each managing a part of the HDFS namespace.
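To see why name node memory becomes a bottleneck, the back-of-the-envelope sketch below estimates heap usage from the number of namespace objects. The figure of about 150 bytes per file, directory or block is a commonly quoted rule of thumb and is treated here as an assumption, as are the blocks-per-file ratio and the function name. Real usage varies by version and workload.

```python
def namenode_heap_gib(files, directories, blocks_per_file=1.5,
                      bytes_per_object=150, namespaces=1):
    """Rough name node heap estimate in GiB, per name node.

    Assumes every file, directory and block costs `bytes_per_object`
    bytes of heap, and that federation splits the namespace evenly
    across `namespaces` name nodes.
    """
    objects = files + directories + files * blocks_per_file
    total_bytes = objects * bytes_per_object
    return total_bytes / namespaces / 2**30


if __name__ == "__main__":
    single = namenode_heap_gib(100_000_000, 1_000_000)
    federated = namenode_heap_gib(100_000_000, 1_000_000, namespaces=4)
    print(f"single name node: {single:.1f} GiB")
    print(f"four federated name nodes: {federated:.1f} GiB each")
```

The estimate ignores JVM overhead and uneven namespace splits, so it is best read as a lower bound for planning discussions.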
Monitoring
Proactive monitoring is essential to maintain the health and availability of the cluster. General monitoring tasks include watching cluster nodes and networks for CPU, memory and network bottlenecks, and more. The Hadoop administrator should be able to track the health of the system, monitor workloads and work with the development team to implement new functionality. Failure to do so can have a severe impact on the health of the system and the quality of data, and will ultimately affect business users’ ease of access and decision-making capability.
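A minimal sketch of such a check is shown below. It assumes per-node metrics have already been collected by some monitoring agent; the metric names, threshold values and function name are all hypothetical.

```python
# Illustrative alert thresholds, in percent; tune for your own cluster.
DEFAULT_THRESHOLDS = {
    "cpu_pct": 90.0,
    "memory_pct": 85.0,
    "network_util_pct": 80.0,
}


def find_alerts(node_metrics, thresholds=DEFAULT_THRESHOLDS):
    """Return (node, metric, value) tuples for every breached threshold.

    `node_metrics` maps a node name to a dict of metric name -> value.
    Metrics missing for a node are skipped rather than treated as errors.
    """
    alerts = []
    for node, metrics in sorted(node_metrics.items()):
        for name, limit in thresholds.items():
            value = metrics.get(name)
            if value is not None and value > limit:
                alerts.append((node, name, value))
    return alerts


if __name__ == "__main__":
    sample = {
        "dn1": {"cpu_pct": 42.0, "memory_pct": 91.5},
        "dn2": {"cpu_pct": 97.0, "network_util_pct": 35.0},
    }
    for alert in find_alerts(sample):
        print(alert)
```

In practice the same idea is usually delegated to the monitoring stack that ships with the chosen distribution, with alerts routed to the admin team.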
Performance Optimization and Tuning
Identifying bottlenecks and tuning performance are among the most vital tasks for a Hadoop administrator. Given the distributed nature of the system and the many configuration files and parameters, it can take hours or even days to identify and resolve a bottleneck if the investigation does not start in the right direction. Often the root cause lies in a different part of the system than the one the application points to. An expert with a detailed understanding of both the Hadoop ecosystem and the application can help here. Moreover, well-tuned resource allocation (CPU, memory) is essential for effective utilization of the cluster and helps distribute resources between Hadoop components like HDFS, YARN and HBase. To overcome such challenges, it’s important to have statistics in place in the form of benchmarks, to tune the configuration parameters for best performance, and to have strategies and tools ready for rapid resolution.
This blog is part of the Hadoop administration blog series and aims to provide a high-level overview of Hadoop administration and the associated roles, responsibilities and challenges a Hadoop admin faces. In future editions, we will delve further into the points above and the various aspects of Hadoop infrastructure management, and look at how each phase plays an important role in administering an enterprise Hadoop cluster. For more on how we can help, visit
Editor's Note: This blog was originally posted in March 2017 and updated in March 2023 for accuracy.
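One simple way to put benchmark statistics to work is to compare current job runtimes against a recorded baseline and flag jobs that have slowed down noticeably. The sketch below assumes runtimes in seconds; the 20% tolerance, the job names and the function name are illustrative assumptions.

```python
def find_regressions(baseline_secs, current_secs, tolerance=0.20):
    """Return {job: fractional slowdown} for jobs slower than the baseline.

    A job is flagged when its runtime exceeds the baseline by more than
    `tolerance` (0.20 means 20%). Jobs without a current run are skipped.
    """
    regressions = {}
    for job, base in baseline_secs.items():
        current = current_secs.get(job)
        if current is None or base <= 0:
            continue
        slowdown = (current - base) / base
        if slowdown > tolerance:
            regressions[job] = round(slowdown, 2)
    return regressions


if __name__ == "__main__":
    baseline = {"terasort": 600, "wordcount": 120, "join": 300}
    current = {"terasort": 780, "wordcount": 125}
    print(find_regressions(baseline, current))
```

A flagged job only shows where to start looking; as noted above, the root cause may sit in a different component than the one that slowed down.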