How to increase the HDFS capacity of AWS Elastic Mapreduce EMR cluster

In this tutorial, we’re going to see how to increase the HDFS capacity of a running EMR cluster. Some time back, we received an alert that HDFS utilization was high on one of our clusters. Upon checking, the usage was expected, but we had under-provisioned the storage capacity when creating the cluster and … Continue reading How to increase the HDFS capacity of AWS Elastic Mapreduce EMR cluster

AWS EMR Uniform Instance groups

In this post, I give an overview of AWS EMR uniform instance groups, along with the advantages and caveats of using them. The AWS EMR architecture consists of a master node, core node(s), and task nodes. If you’re new to EMR, refer to https://www.hadoopandcloud.com/aws/amazon-emr/ for a quick introduction. While creating the cluster, you have two configuration options for the nodes - instance … Continue reading AWS EMR Uniform Instance groups

Script to delete thousands of delete marker in a S3 bucket

Last week, we had an incident in which some data was reported missing from a versioning-enabled S3 bucket. The bucket in question has a lifecycle policy enabled, which expires objects after 90 days (adding a delete marker) and permanently deletes them 30 days after they become previous versions. The data reported missing belonged to the … Continue reading Script to delete thousands of delete marker in a S3 bucket

AWS EMR (Elastic MapReduce) – Introduction

Amazon EMR, or AWS EMR, is a managed cluster platform that simplifies running big data frameworks, such as Apache Hadoop and Apache Spark, on AWS to process and analyze vast amounts of data. Architecture: An EMR cluster is a group of AWS EC2 instances built on an AWS AMI. Each instance in the cluster is called a node. Each node … Continue reading AWS EMR (Elastic MapReduce) – Introduction