Amazon EMR (often called AWS EMR) is a managed cluster platform that simplifies running big data frameworks, such as Apache Hadoop and Apache Spark, on AWS to process and analyze vast amounts of data.
An EMR cluster is a group of Amazon EC2 instances launched from an Amazon Machine Image (AMI). Each instance in the cluster is called a node, and each node has a role within the cluster, referred to as the node type.
There are three types of nodes in an EMR cluster.
- Master node: A node that manages the cluster by coordinating the distribution of data (HDFS) and tasks (YARN) among the other nodes for processing. The master node tracks the status of tasks and monitors the health of the cluster. All the master services, such as the NameNode, ResourceManager, and ZooKeeper server, run on the master node. There is no HA setup for the master node in EMR.
- Core node: A node with software components that run tasks and store data in the Hadoop Distributed File System (HDFS) on your cluster. Multi-node clusters have at least one core node.
- Task node: A node with software components that only run tasks and do not store data in HDFS. Task nodes are optional.
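The three node roles above map directly onto EMR's instance-group configuration. Here is a minimal sketch in the shape that boto3's `run_job_flow` call expects; the instance types and counts are illustrative assumptions, not recommendations.

```python
def build_instance_groups(core_count=2, task_count=2):
    """Return instance-group definitions for the three EMR node types."""
    return [
        {"Name": "Master", "InstanceRole": "MASTER",   # coordinates HDFS and YARN
         "InstanceType": "m5.xlarge", "InstanceCount": 1},
        {"Name": "Core", "InstanceRole": "CORE",       # runs tasks AND stores HDFS data
         "InstanceType": "m5.xlarge", "InstanceCount": core_count},
        {"Name": "Task", "InstanceRole": "TASK",       # runs tasks only, no HDFS
         "InstanceType": "m5.xlarge", "InstanceCount": task_count},
    ]

groups = build_instance_groups()
print([g["InstanceRole"] for g in groups])  # ['MASTER', 'CORE', 'TASK']
```

A list like this would be passed under `Instances.InstanceGroups` when creating the cluster; note there is always exactly one master group with an instance count of 1.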
There are several storage options available for an EMR cluster.
Hadoop Distributed File System (HDFS)
Hadoop Distributed File System (HDFS) is a distributed, scalable file system for Hadoop. HDFS distributes the data it stores across instances in the cluster, storing multiple copies of data on different instances to ensure that no data is lost if an individual instance fails. HDFS is ephemeral storage: the data is lost when you terminate the cluster.
Note that HDFS storage is available only on core nodes.
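To make the "multiple copies" point concrete: on EMR the default HDFS replication factor depends on the number of core nodes (per the EMR documentation, 1 for clusters with fewer than four core nodes, 2 for 4-9, and 3 for 10 or more). A small sketch of that rule:

```python
def default_hdfs_replication(core_nodes):
    """EMR's default dfs.replication value, based on core node count
    (as described in the EMR HDFS configuration docs)."""
    if core_nodes < 4:
        return 1
    if core_nodes < 10:
        return 2
    return 3

print(default_hdfs_replication(3))   # 1 -> a small cluster keeps a single copy
print(default_hdfs_replication(12))  # 3 -> a larger cluster keeps three copies
```

You can override this by setting `dfs.replication` in the `hdfs-site` configuration classification when launching the cluster.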
EMR File System (EMRFS)
Using the EMR File System (EMRFS), Amazon EMR extends Hadoop to directly access data stored in Amazon S3 as if it were a file system like HDFS. You can use either HDFS or Amazon S3 as the file system in your cluster. Most often, Amazon S3 is used to store input and output data and intermediate results are stored in HDFS.
One advantage of EMRFS is the virtually unlimited storage capacity and high availability that S3 offers, so no replication is required for your data. That said, you'll take a performance hit of roughly 5-10% compared to HDFS.
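The usual division of labor, input and final output on S3 via EMRFS, intermediate results on ephemeral HDFS, can be sketched as a path layout. The bucket and job names below are hypothetical.

```python
def job_paths(bucket="my-emr-bucket", job="wordcount"):
    """Hypothetical path layout for an EMR job: S3 for durable data,
    HDFS for intermediate results that may die with the cluster."""
    return {
        "input":        f"s3://{bucket}/{job}/input/",    # read via EMRFS
        "intermediate": f"hdfs:///tmp/{job}/stage1/",     # ephemeral, lost on termination
        "output":       f"s3://{bucket}/{job}/output/",   # survives cluster termination
    }

paths = job_paths()
print(paths["output"])  # s3://my-emr-bucket/wordcount/output/
```

Hadoop and Spark on EMR accept `s3://` URIs anywhere they accept `hdfs://` URIs, which is what makes this split transparent to the job code.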
Local File System
The local file system refers to locally attached disks (instance store volumes) on the EC2 instances. When you create a Hadoop cluster, each node is created from an Amazon EC2 instance that comes with a preconfigured block of pre-attached disk storage called an instance store. Data on instance store volumes persists only during the lifecycle of its Amazon EC2 instance.
How is an EMR cluster different from a general Hadoop cluster?
In a typical Hadoop cluster, you can have a high-availability (HA) setup for the NameNodes, but in EMR there is no HA. That is, if the master node goes down, your cluster is terminated and you have to spin up a new cluster.
In EMR, task nodes are optional and you can run jobs on the core nodes themselves. Core nodes perform both the DataNode and NodeManager roles.
Task nodes don’t store HDFS data; they act as NodeManagers to run your tasks. You can use Spot Instances for task nodes with autoscaling, which will run your jobs efficiently and save you a lot of money.
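Since losing a Spot task node costs no HDFS data, it is the natural place for Spot capacity. A hedged sketch of a Spot task instance group, again in the shape boto3's `run_job_flow` accepts (instance type and bid are illustrative):

```python
def spot_task_group(count=4, bid="0.10"):
    """Hypothetical TASK instance group running on Spot Instances."""
    return {
        "Name": "Task-Spot",
        "InstanceRole": "TASK",       # no HDFS, so Spot interruption loses no data
        "InstanceType": "m5.xlarge",  # illustrative
        "InstanceCount": count,
        "Market": "SPOT",
        "BidPrice": bid,              # max hourly price in USD
    }

group = spot_task_group()
print(group["Market"])  # SPOT
```

The master and core groups would typically stay on-demand (`"Market": "ON_DEMAND"`), since losing them terminates the cluster or loses HDFS blocks.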
Short-lived or long-running cluster
One interesting use case of Amazon EMR is that you can spin up a cluster anytime you want to run a job.
Run your cluster as a transient process: one that launches the cluster, loads the input data, processes the data, stores the output results, and then, BOOM, the cluster automatically shuts down.
This is the standard model for a cluster that performs a periodic processing task. Shutting down the cluster automatically ensures that you are billed only for the time required to process your data. This saves you money, and no ongoing cluster management is required.
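A transient cluster is expressed by submitting steps at launch and letting the cluster die when they finish. Below is a minimal sketch of such a request in the shape `boto3`'s `emr.run_job_flow` accepts; the cluster name, release label, bucket, and script path are all hypothetical, and a real call would also need IAM roles.

```python
def transient_cluster_request(log_uri="s3://my-emr-bucket/logs/"):
    """Hypothetical request body for a run-once (transient) EMR cluster."""
    return {
        "Name": "nightly-batch",
        "ReleaseLabel": "emr-6.9.0",       # illustrative EMR release
        "LogUri": log_uri,
        "Instances": {
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
            ],
            # False => the cluster auto-terminates once all steps finish
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [{
            "Name": "process-data",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "s3://my-emr-bucket/jobs/etl.py"],
            },
        }],
    }
```

The key line is `KeepJobFlowAliveWhenNoSteps: False`, which is what turns an ordinary cluster into the launch, process, shut-down pattern described above.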
Long-running cluster: In this model, you launch a cluster with a master node and as many core nodes as your workload requires. This model is helpful when you have multiple jobs to run throughout the day.
Then you load the data from S3, run or schedule your jobs, and let the cluster persist. You can set up autoscaling for task nodes to scale up or down based on resource requirements whenever you run jobs.
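For the long-running model, the autoscaling just mentioned can be attached to a task instance group as a scaling policy. A hedged sketch in the shape `emr.put_auto_scaling_policy` accepts; the metric name is EMR's real `YARNMemoryAvailablePercentage` CloudWatch metric, but the thresholds and capacities are illustrative.

```python
def task_autoscaling_policy(min_nodes=0, max_nodes=10):
    """Hypothetical scaling policy: add a task node when available
    YARN memory drops below 15% for one 5-minute period."""
    return {
        "Constraints": {"MinCapacity": min_nodes, "MaxCapacity": max_nodes},
        "Rules": [{
            "Name": "ScaleOutOnLowMemory",
            "Action": {"SimpleScalingPolicyConfiguration": {
                "AdjustmentType": "CHANGE_IN_CAPACITY",
                "ScalingAdjustment": 1,      # add one node per trigger
                "CoolDown": 300,             # seconds between scaling actions
            }},
            "Trigger": {"CloudWatchAlarmDefinition": {
                "ComparisonOperator": "LESS_THAN",
                "MetricName": "YARNMemoryAvailablePercentage",
                "Period": 300,
                "Threshold": 15.0,
            }},
        }],
    }

policy = task_autoscaling_policy()
print(policy["Constraints"])  # {'MinCapacity': 0, 'MaxCapacity': 10}
```

With `MinCapacity` at 0, the task fleet can shrink to nothing between jobs while the master and core nodes keep the cluster, and its HDFS data, alive.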