I took the Cloudera CCA131 Administrator exam last week and passed the certification. In this post, I’ll explain the exam pattern as described by Cloudera, how to prepare for the exam, the topics you should be aware of, study materials, tips, feedback, etc. A little background about me: I have 3+ […]
Author: Kannan AK
Create/restore a snapshot of an HDFS directory
HDFS snapshots are read-only point-in-time copies of the file system. Snapshots can be taken on a directory of the file system or on the entire file system. To enable snapshots on a specific directory, go to CM – HDFS – File Browser, select the directory in the file browser, and select ‘Enable Snapshots’ in the right […]
Configure Hue user authorization and authentication
When you access Hue after installation, by default the first user that logs into Hue becomes the first admin user. Go to CM – Hue – Web UI. After you have logged in, go to the admin tab – Manage Users. Now provide the username of the user you want to give access to and add them to the appropriate profile/groups. […]
Install new type of I/O compression library in cluster
File/data compression brings two major benefits: it reduces the space needed to store files and it speeds up data transfer across the network or to or from disk. When dealing with large volumes of data, both of these savings can be significant. Hadoop supports the following compression types and codecs: gzip – org.apache.hadoop.io.compress.GzipCodec bzip2 – […]
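As a rough illustration of the space saving mentioned above, the sketch below compresses some made-up, repetitive sample data with Python's standard `gzip` and `bz2` modules. These are stand-ins for the same two formats, not the Hadoop codec classes, and the sample data is an assumption chosen only to show the effect.

```python
import bz2
import gzip

# Hypothetical sample data: repetitive, log-like text compresses well.
sample = b"2019-01-01 INFO datanode heartbeat received\n" * 2000

gzip_bytes = gzip.compress(sample)
bzip2_bytes = bz2.compress(sample)


def ratio(compressed: bytes) -> float:
    """Return the compressed size as a fraction of the original size."""
    return len(compressed) / len(sample)


print(f"original: {len(sample)} bytes")
print(f"gzip:     {len(gzip_bytes)} bytes ({ratio(gzip_bytes):.1%} of original)")
print(f"bzip2:    {len(bzip2_bytes)} bytes ({ratio(bzip2_bytes):.1%} of original)")

# Both formats are lossless: decompressing returns the original bytes.
assert gzip.decompress(gzip_bytes) == sample
assert bz2.decompress(bzip2_bytes) == sample
```

The actual ratios depend heavily on the data; highly repetitive text like this compresses far better than already-compressed or random data would.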
Commission/decommission a node
When you want to remove a node from the cluster, you shouldn’t just delete the installed Cloudera agents and services, as that will impact the whole cluster. You should decommission the node first. Decommissioning a host decommissions and stops all roles on the host without requiring you to individually decommission the roles on each service. After […]
Rebalance the cluster
In HDFS, the blocks of the files are distributed among the datanodes as per the replication factor. Whenever you add a new datanode, the node will start receiving and storing the blocks of new files. Though this sounds alright, the cluster is not balanced from an administrative point of view. HDFS provides a balancer utility […]
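To make "balanced" concrete, here is a minimal sketch of the check the balancer is built around: a datanode counts as balanced when its utilization is within a threshold (in percentage points) of the overall cluster utilization. The function name, node names, and figures below are hypothetical; the real utility is run as `hdfs balancer`, and its `-threshold` option defaults to 10.

```python
from typing import Dict, List, Tuple


def unbalanced_nodes(
    usage: Dict[str, Tuple[float, float]], threshold_pct: float = 10.0
) -> List[str]:
    """Return the nodes whose utilization differs from the cluster
    utilization by more than threshold_pct percentage points.

    usage maps a node name to (used, capacity) in the same unit.
    """
    total_used = sum(used for used, _ in usage.values())
    total_capacity = sum(capacity for _, capacity in usage.values())
    cluster_pct = 100.0 * total_used / total_capacity

    return sorted(
        name
        for name, (used, capacity) in usage.items()
        if abs(100.0 * used / capacity - cluster_pct) > threshold_pct
    )


# Hypothetical cluster: three older datanodes plus a newly added, nearly empty dn4.
cluster = {
    "dn1": (600.0, 1000.0),
    "dn2": (550.0, 1000.0),
    "dn3": (500.0, 1000.0),
    "dn4": (50.0, 1000.0),
}

print(unbalanced_nodes(cluster))        # cluster utilization is 42.5%
print(unbalanced_nodes(cluster, 40.0))  # a looser threshold flags fewer nodes
```

With these figures, adding the empty node pulls the cluster average down, so the fuller old nodes are flagged as over-utilized as well as the new node being under-utilized.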
Configure proxy for Hiveserver2/Impala
A proxy server is a server or application that acts as an intermediary for requests from clients seeking resources from other servers/applications. A client connects to the proxy server, requesting some service or resource available from a different server, and the proxy server evaluates the request and sends it to the intended server/service/application. The server’s […]
Configure ResourceManager HA
The YARN ResourceManager is responsible for tracking the resources in a cluster and scheduling applications. Before CDH 5, the ResourceManager was a single point of failure in a YARN cluster. If the ResourceManager is down, none of the jobs in the cluster will run. The ResourceManager high availability (HA) feature adds redundancy in the […]
Configure NameNode HA
Prior to Hadoop 2.0.0, the NameNode was a single point of failure (SPOF) in an HDFS cluster. Each cluster had a single NameNode, and if that machine or process became unavailable, the cluster as a whole would be unavailable until the NameNode was either restarted or brought up on a secondary namenode. The secondary namenode […]
Enable/configure log and query redaction
Data redaction is the suppression of sensitive data, such as personally identifiable information (PII): credit card numbers, email addresses, social security numbers. Cloudera has a data redaction feature that masks credit card numbers and email addresses with random or custom strings (which we specify), so that in queries and log files those random strings will […]
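The masking idea can be sketched with a couple of search-and-replace rules. This is only an illustration in plain Python, not Cloudera's implementation: the regular expressions, the replacement strings, and the sample query are all assumptions made up for the example.

```python
import re

# Hypothetical redaction rules: (search pattern, replacement string).
RULES = [
    # Sixteen digits, optionally grouped in fours by spaces or hyphens.
    (re.compile(r"\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b"), "XXXX-XXXX-XXXX-XXXX"),
    # A simple (not RFC-complete) email address pattern.
    (re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"), "user@redacted.example"),
]


def redact(text: str) -> str:
    """Apply every rule in order and return the masked text."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text


# Made-up query text containing a card number and an email address.
query = (
    "SELECT * FROM orders WHERE card = '4111-1111-1111-1111' "
    "AND email = 'jane.doe@example.com'"
)
print(redact(query))
```

Anything that matches a rule is replaced before the text is stored or displayed, so the original values never reach the log or query history.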