Today Hadoop is among the most sought-after technologies in the "Big Data" world. It has come a long way since its inception, thanks to its robust community. Its significance has grown rapidly, and it has become an essential tool for organizations that want to unlock business value by collecting, crunching and analyzing their data at scale. But while Hadoop holds massive potential for your business, it’s not without challenges.
Most organizations start their Hadoop adoption with a Proof of Concept (POC) or a small setup. These setups are usually carried out by in-house Linux or database administrators who have learnt the basics of Hadoop administration, which mostly involves installing a small cluster with experimental data. This approach serves the purpose of learning the basics, but it is not sufficient for managing a production Hadoop cluster. Eventually, as data grows, the POC setup is converted to a production cluster and the true capabilities of Hadoop come into play. This transition also exposes significant challenges in scaling, data management and cluster administration under a production enterprise workload.
A Hadoop administration team’s responsibilities start when a company kicks off its Hadoop POC. An experienced team like Bitwise draws up a roadmap right at the beginning to help scale from POC to production with minimal waste of the initial investment, and provides effective guidance on investment decisions, be it in-house infrastructure, the POC itself or the choice of a PaaS option.
For any organization, understanding the estimated investment is mandatory in the initial phases. Capacity planning and estimation is the next step after successful completion of the POC. Choosing the right combination of storage and computing hardware, network interconnect, operating system, storage configuration and disk performance, network setup and so on plays an important role in overall cluster performance. Similarly, special consideration is required for the hardware configuration of the master and slave nodes. The right balance between need and greed can be achieved only after years of implementation experience.
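To make the capacity estimation step above more concrete, here is a minimal sketch of a first-pass storage estimate. The function name and every default (replication factor of 3, a 25% allowance for temporary and intermediate data, 48 TB of disk per slave node, a 70% fill limit) are illustrative assumptions, not recommendations from this post; a real estimate would also account for data growth, compression and compute needs.

```python
import math


def estimate_data_nodes(logical_tb, replication=3, temp_overhead=0.25,
                        disk_tb_per_node=48.0, max_fill=0.7):
    """Rough first-pass sizing of slave (data) nodes.

    All defaults are hypothetical values chosen for illustration.
    logical_tb       -- size of the data before replication, in TB
    replication      -- HDFS replication factor
    temp_overhead    -- extra fraction reserved for temporary/intermediate data
    disk_tb_per_node -- raw disk capacity of one slave node, in TB
    max_fill         -- fraction of each node's disk we allow to fill up
    """
    # Raw storage needed once replication and scratch space are included.
    raw_tb = logical_tb * replication * (1 + temp_overhead)
    # Capacity we are willing to use on each node.
    usable_tb_per_node = disk_tb_per_node * max_fill
    nodes = math.ceil(raw_tb / usable_tb_per_node)
    return raw_tb, nodes


raw_tb, nodes = estimate_data_nodes(100)
print(f"raw storage: {raw_tb} TB, slave nodes: {nodes}")
```

With the assumed defaults, 100 TB of logical data needs 375 TB of raw storage. That translates to 12 slave nodes, before any allowance for growth.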
Once you have the hardware defined and in place, the next stage is the planning and deployment of the Hadoop cluster. This involves configuring the OS with the recommended changes to suit the Hadoop stack, configuring SSH and disks, choosing and installing a Hadoop distribution (Cloudera, Hortonworks, MapR or Apache Hadoop) as per the requirements, and meeting the configuration requirements of the Hadoop daemons for optimized performance. All of these setups vary with the size of your cluster, so it’s imperative that you configure and deploy only after covering all the aspects and prerequisites.
Another important aspect is designing the cluster from two perspectives: the development perspective, i.e. the various environments (Dev, QA, Prod etc.), and the usage perspective, i.e. access security and data security.
After the Hadoop cluster is implemented, the Hadoop admin team needs to maintain the health and availability of the cluster round the clock. Common tasks include management of the name node, the data nodes, HDFS and MapReduce jobs, which form the core of the Hadoop ecosystem. An impact on any of these components can degrade cluster performance. For example, if a data node becomes unavailable, say due to a network issue, HDFS will re-replicate the under-replicated blocks. This adds considerable overhead and slows the cluster down, and if several data nodes disconnect at once it can even make the cluster inaccessible.
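The overhead of losing a data node can be approximated with simple arithmetic. The sketch below is a rough model under stated assumptions, not a description of how HDFS schedules its work: it assumes every block on the lost node must be copied once, and that each surviving node contributes a fixed, throttled copy rate. The function name and the figures in the example are hypothetical.

```python
def rereplication_hours(used_tb_on_lost_node, surviving_nodes, mb_per_s_per_node):
    """Estimate the hours needed to restore full replication after a node loss.

    Assumptions (illustrative only): the data on the lost node is copied
    exactly once, and the copy work is spread evenly over the survivors.
    """
    if surviving_nodes <= 0 or mb_per_s_per_node <= 0:
        raise ValueError("need at least one surviving node and a positive rate")
    data_mb = used_tb_on_lost_node * 1024 * 1024          # TB -> MB
    aggregate_mb_per_s = surviving_nodes * mb_per_s_per_node
    return data_mb / aggregate_mb_per_s / 3600            # seconds -> hours


hours = rereplication_hours(used_tb_on_lost_node=20,
                            surviving_nodes=10,
                            mb_per_s_per_node=50)
print(f"approx. {hours:.2f} hours of extra replication traffic")
```

Under these assumed numbers, a node holding 20 TB keeps ten surviving nodes busy for roughly 11.65 hours. During that time the extra traffic competes with regular jobs for disk and network bandwidth.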
The name node is another important component of a Hadoop cluster and acts as a single point of failure. Consequently, it is important that backups of the fsimage and edit logs are taken periodically using the secondary name node, so that the cluster can recover from a name node failure. The other administrative tasks include:
Productionization of a Hadoop cluster mandates implementation of hardening measures. Hardening of Hadoop typically covers:
Performance tuning and bottleneck identification are among the most vital tasks for a Hadoop administrator. Given the distributed nature of the system and the multitude of configuration files and parameters, it may take hours or even days to identify and resolve a bottleneck if the investigation does not start in the right direction. Often the root cause turns out to be at a different end of the system than the one the application points to. An expert with a detailed understanding of both the Hadoop ecosystem and the application can help here. Moreover, optimized resource allocation (CPU, memory) is essential for effective utilization of the cluster and helps balance resources between Hadoop components such as HDFS, YARN and HBase. To overcome such challenges, it’s important to have statistics in place in the form of benchmarks, to tune the configuration parameters for best performance, and to have strategies and tools ready for rapid resolution.
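As an illustration of the resource distribution mentioned above, the following sketch splits one slave node's memory between the operating system, HBase and YARN. The function name and all figures are assumptions made for the example; the resulting YARN figure is the kind of value an administrator might then set, in MB, through `yarn.nodemanager.resource.memory-mb`.

```python
def split_node_memory(total_gb, os_reserve_gb, hbase_gb, container_gb):
    """Split a node's RAM between the OS, HBase and YARN containers.

    Every input here is a hypothetical planning figure, not a tuned value.
    Returns (memory left for YARN in GB, number of containers that fit).
    """
    yarn_gb = total_gb - os_reserve_gb - hbase_gb
    if yarn_gb < container_gb:
        raise ValueError("not enough memory left for a single YARN container")
    # Whole containers only; any remainder stays unused by YARN.
    containers = yarn_gb // container_gb
    return yarn_gb, containers


yarn_gb, containers = split_node_memory(total_gb=128, os_reserve_gb=8,
                                        hbase_gb=16, container_gb=4)
print(f"YARN memory: {yarn_gb} GB ({yarn_gb * 1024} MB), containers: {containers}")
```

For the assumed 128 GB node, this leaves 104 GB for YARN, enough for 26 containers of 4 GB each. Benchmarks on the actual workload should drive the final numbers.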
This blog is part of the Hadoop administration blog series and aims to provide a high-level overview of Hadoop administration, the associated roles and responsibilities, and the challenges a Hadoop admin faces. In future editions, we will delve further into the above-mentioned points and the various aspects of Hadoop infrastructure management, and examine how each phase plays an important role in administering an enterprise Hadoop cluster. For more on how we can help, visit
Vijay has extensive experience and wide-ranging knowledge of database setup, administration and development for various database stacks such as Oracle, SQL Server, Netezza and more. At Bitwise, he has been actively contributing to and developing leading-edge solutions on Hadoop. Currently, he is involved in the setup and administration of Hadoop clusters for enterprise clients.