In HDFS, the blocks of a file are distributed among the datanodes according to the replication factor. Whenever you add a new datanode, that node starts receiving and storing blocks of newly written files. Though this sounds alright, from an administrative point of view the cluster is no longer balanced. HDFS provides a balancer utility […]
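The full post presumably walks through the utility in detail; as a minimal sketch, the balancer can be run from the command line (the threshold and bandwidth values below are illustrative):

    # Run the balancer as the hdfs superuser; -threshold is the allowed
    # deviation (in percent) of each datanode's usage from the cluster average
    sudo -u hdfs hdfs balancer -threshold 10

    # Optionally cap the bandwidth (bytes/sec) each datanode may spend on
    # balancing so it does not starve regular read/write traffic
    sudo -u hdfs hdfs dfsadmin -setBalancerBandwidth 104857600   # ~100 MB/s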
Configure proxy for Hiveserver2/Impala
A proxy server is a server or application that acts as an intermediary for requests from clients seeking resources from other servers/applications. A client connects to the proxy server and requests some service or resource available from a different server; the proxy server evaluates the request and sends it to the intended server/service/application. The server’s […]
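A common way to front HiveServer2 and Impala on CDH is HAProxy. A hedged sketch of an /etc/haproxy/haproxy.cfg fragment follows; the host names and the front-end ports 10001/21001 are assumptions, while 10000 and 21000 are the default HiveServer2 and impala-shell ports:

    # Balance HiveServer2 clients across two instances; 'balance source'
    # keeps a given client on the same backend for its session
    listen hiveserver2
        bind 0.0.0.0:10001
        mode tcp
        balance source
        server hs2_1 host1.example.com:10000 check
        server hs2_2 host2.example.com:10000 check

    # Balance impala-shell connections across the impalad daemons
    listen impala
        bind 0.0.0.0:21001
        mode tcp
        balance leastconn
        server impalad_1 host1.example.com:21000 check
        server impalad_2 host2.example.com:21000 check

Clients then point beeline or impala-shell at the proxy host and port rather than at any individual server.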
Configure ResourceManager HA
The YARN ResourceManager is responsible for tracking the resources in a cluster and scheduling applications. Before CDH 5, the ResourceManager was a single point of failure in a YARN cluster: if the ResourceManager went down, no jobs could run in the cluster. The ResourceManager high availability (HA) feature adds redundancy in the […]
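In Cloudera Manager this is done with the Enable High Availability wizard on the YARN service; on a plain Apache Hadoop install, the equivalent yarn-site.xml settings look roughly like the sketch below (the IDs, host names, and ZooKeeper quorum are illustrative):

    <!-- yarn-site.xml: two ResourceManagers, rm1 and rm2 -->
    <property><name>yarn.resourcemanager.ha.enabled</name><value>true</value></property>
    <property><name>yarn.resourcemanager.cluster-id</name><value>yarn-cluster</value></property>
    <property><name>yarn.resourcemanager.ha.rm-ids</name><value>rm1,rm2</value></property>
    <property><name>yarn.resourcemanager.hostname.rm1</name><value>rm1.example.com</value></property>
    <property><name>yarn.resourcemanager.hostname.rm2</name><value>rm2.example.com</value></property>
    <property><name>yarn.resourcemanager.zk-address</name><value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value></property>

Afterwards, "yarn rmadmin -getServiceState rm1" reports whether rm1 is currently active or standby.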
Configure NameNode HA
Prior to Hadoop 2.0.0, the NameNode was a single point of failure (SPOF) in an HDFS cluster. Each cluster had a single NameNode, and if that machine or process became unavailable, the cluster as a whole would be unavailable until the NameNode was either restarted or brought up on a separate machine (the secondary namenode, despite its name, is not a hot standby). The secondary namenode […]
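Cloudera Manager likewise offers an Enable High Availability wizard on the HDFS service; the underlying hdfs-site.xml settings are sketched below, where the nameservice ID "mycluster" and the host names are assumptions (a quorum journal setup via dfs.namenode.shared.edits.dir is also required but omitted here):

    <!-- hdfs-site.xml: one logical nameservice backed by two NameNodes -->
    <property><name>dfs.nameservices</name><value>mycluster</value></property>
    <property><name>dfs.ha.namenodes.mycluster</name><value>nn1,nn2</value></property>
    <property><name>dfs.namenode.rpc-address.mycluster.nn1</name><value>nn1.example.com:8020</value></property>
    <property><name>dfs.namenode.rpc-address.mycluster.nn2</name><value>nn2.example.com:8020</value></property>
    <property><name>dfs.ha.automatic-failover.enabled</name><value>true</value></property>

"hdfs haadmin -getServiceState nn1" then shows which NameNode is active and which is standby.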
Enable/configure log and query redaction
Data redaction is the suppression of sensitive data, such as personally identifiable information (PII): credit card numbers, email addresses, social security numbers. Cloudera has a data redaction feature that masks credit card numbers, email addresses, and the like with random or custom strings (which we specify), so that in queries and log files those random strings will […]
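In Cloudera Manager, redaction rules are defined under the service's Log and Query Redaction Policy setting; each rule is a search regex, a replacement string, and an optional trigger. A sketch of a credit-card rule (the field layout mirrors the CM policy editor and should be treated as an approximation):

    Description : Mask anything that looks like a credit card number
    Trigger     : (empty, so the rule applies to every line)
    Search      : \d{4}[^\w]\d{4}[^\w]\d{4}[^\w]\d{4}
    Replace     : XXXX-XXXX-XXXX-XXXX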
Efficiently copy data within a cluster/between clusters
DistCp (distributed copy) is a tool used for large inter/intra-cluster copying of HDFS data. It uses MapReduce to effect its distribution, error handling and recovery, and reporting. It expands a list of files and directories into input to map tasks, each of which will copy a partition of the files specified in the source list. […]
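A minimal sketch of both the intra- and inter-cluster cases (host names and paths are illustrative):

    # Intra-cluster copy: the work is spread across map tasks
    hadoop distcp /data/source /data/backup

    # Inter-cluster copy: address both namenodes explicitly
    hadoop distcp hdfs://nn1.example.com:8020/data/source \
                  hdfs://nn2.example.com:8020/data/dest

    # -update skips files that already match the target; -p preserves
    # attributes such as permissions, ownership and block size
    hadoop distcp -update -p /data/source /data/backup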
Perform OS-level configuration for Hadoop installation
Before installing CDH on our servers, we have to make the OS-level configuration changes below for a successful installation. Disable SELinux: “Security-Enhanced Linux (SELinux) is a Linux kernel security module that provides a mechanism for supporting access control security policies.” If SELinux is enabled, the Cloudera server installation will fail. To disable […]
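The excerpt covers SELinux; a sketch of that step together with the other pre-install settings Cloudera recommends (swappiness, transparent huge pages) might look like:

    # Disable SELinux permanently (takes effect after a reboot) ...
    sed -i 's/^SELINUX=enforcing/SELINUX=disabled/' /etc/selinux/config
    # ... and drop to permissive mode immediately
    setenforce 0

    # Lower swappiness so the kernel avoids swapping Hadoop JVMs
    sysctl -w vm.swappiness=1
    echo 'vm.swappiness=1' >> /etc/sysctl.conf

    # Disable transparent huge pages (on RHEL 6 the path is
    # /sys/kernel/mm/redhat_transparent_hugepage instead)
    echo never > /sys/kernel/mm/transparent_hugepage/defrag
    echo never > /sys/kernel/mm/transparent_hugepage/enabled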
Define and install a rack topology script
In networking terms, all the physical servers live in racks in the data center. In Hadoop, rack assignment is significant, as it plays a vital role in data locality, bandwidth usage, and so on. We can assign the rack of the hosts in the cluster in two ways. One […]
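One of those ways, on a plain Apache Hadoop install, is a topology script registered through net.topology.script.file.name in core-site.xml: HDFS invokes it with one or more host names or IPs and expects one rack path per argument on stdout. A hedged bash sketch, assuming a lookup file /etc/hadoop/conf/topology.data with one "<host> <rack>" pair per line:

    #!/bin/bash
    # Map each host/IP argument to its rack; fall back to a default rack
    # for hosts missing from the lookup file.
    MAP_FILE=/etc/hadoop/conf/topology.data
    DEFAULT_RACK=/default-rack

    while [ $# -gt 0 ]; do
      rack=$(awk -v h="$1" '$1 == h {print $2}' "$MAP_FILE")
      echo "${rack:-$DEFAULT_RACK}"
      shift
    done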
Configure HDFS ACLs
Every file/folder in Linux has an owner and a group. If a user needs to access the file (read, write, modify), either the user has to be part of the group or the file must have the appropriate “others” permissions. In this model, we can’t set different permissions per user or per group to cater to our requirements. ACLs […]
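ACLs must first be switched on with dfs.namenode.acls.enabled=true in hdfs-site.xml; after that, per-user and per-group entries are managed from the shell. The paths, user, and group below are illustrative:

    # Give user 'alice' read/write on one file without touching its group
    hdfs dfs -setfacl -m user:alice:rw- /data/reports/sales.csv

    # Give group 'analysts' read/execute on a whole directory tree
    hdfs dfs -setfacl -R -m group:analysts:r-x /data/reports

    # Default ACL: newly created children inherit the entry
    hdfs dfs -setfacl -m default:group:analysts:r-x /data/reports

    # Inspect the resulting ACLs
    hdfs dfs -getfacl /data/reports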
Set up a local CDH repository
This post explains how to set up a local YUM/CDH repository for your network. In Linux, /etc/yum.repos.d is the directory that holds the yum repo definitions on a server. Every repo file has a baseurl value that contains the link to the repository path. When you execute “yum install packagename”, the […]
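A hedged outline of the setup, assuming the packages are served over HTTP from a host called repohost.example.com:

    # On the repo host: place the CDH RPMs under the web root and
    # generate the yum metadata
    mkdir -p /var/www/html/cdh
    # (download or copy the CDH RPMs into /var/www/html/cdh first)
    createrepo /var/www/html/cdh

    # On every cluster node: define a repo pointing at the local server
    cat > /etc/yum.repos.d/cdh-local.repo <<'EOF'
    [cdh-local]
    name=Local CDH repository
    baseurl=http://repohost.example.com/cdh/
    enabled=1
    gpgcheck=0
    EOF

    yum clean all
    yum install hadoop   # now resolved from the local repository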