Script to delete thousands of delete markers in an S3 bucket


Last week, we had an incident in which some data went missing from a versioning-enabled S3 bucket. The bucket in question has a lifecycle policy that expires objects after 90 days (which adds a delete marker) and permanently deletes them 30 days after they become previous versions.

The missing data belonged to the month of September, and the product team needed it to generate an important report. Since we had enabled the lifecycle policy only last month, the September data had merely been expired, not permanently deleted, because the bucket is versioned. If you delete an object in a versioned bucket, S3 simply adds a delete marker on top of the object. Expiration works the same way: S3 adds a delete marker on top of the object, and once the older versions have been previous versions for 30 days, they are permanently deleted as per the lifecycle policy.

Read https://www.hadoopandcloud.com/aws/script-to-get-the-versioning-status-of-s3-buckets/ on how to check the versioning status of bucket(s).

[Figure: Delete markers in a versioned bucket]

So we thought it would be a simple matter of deleting a couple of delete markers to retrieve the data. Little did we know that the bucket had 30,000+ objects under separate folders, and that this had to be done for 3 buckets. Deleting 30,000+ delete markers in the console is a herculean task that would take at least 2-3 days, and there is no way to do it without going mad. As the report had to be generated by the end of the day, we had no option but to find a solution via a script.

Also, the tricky part was that we had to retrieve the data only in specific subfolders of the S3 bucket, not the whole bucket.

Stack Overflow to the rescue

After some furious googling, we put together the script below using inputs from Stack Overflow, and it did the job seamlessly for us.

$ bash delete_deletemarker.sh "kannan-blog" "cloud/aws/yr=2018/mon=09/"

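#!/bin/bash
# Usage: delete_deletemarker.sh <bucket> <prefix>
BUCKET=$1
PREFIX=$2
# List every version under the prefix as text and keep only the delete marker rows.
aws s3api list-object-versions --bucket "$BUCKET" --prefix "$PREFIX" --output text |
 grep "^DELETEMARKERS" | while IFS= read -r OBJECTS
   do
        # Columns are tab-separated: field 3 is the key, field 5 is the version ID.
        KEY=$(printf '%s\n' "$OBJECTS" | awk -F'\t' '{print $3}')
        VERSION_ID=$(printf '%s\n' "$OBJECTS" | awk -F'\t' '{print $5}')
        echo "$KEY"
        echo "$VERSION_ID"
        # Deleting the delete marker by its version ID un-hides the object.
        aws s3api delete-object --bucket "$BUCKET" --key "$KEY" --version-id "$VERSION_ID"
    done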

[Figure: list-object-versions output showing deletemarkers metadata]

How the script works

In the script, we pass the bucket name and the prefix (subfolder) name as command line arguments. In S3, folders are nothing but a shared key prefix for a group of objects.

s3api list-object-versions gives you metadata about every version of the objects, including delete markers. A single S3 list API call returns at most 1,000 entries, which means you can’t list more than 1,000 objects in one call. But the AWS CLI treats list-object-versions as a paginated operation, so it issues multiple API calls to retrieve the entire result set.
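Before deleting anything, it can help to check how many delete markers actually sit under a prefix. The sketch below is one way we might do that. The function name count_delete_markers is made up for illustration and is not part of the original script, and it assumes the AWS CLI is installed and configured. Since the CLI follows the pagination on its own, the count should cover every page and not just the first 1,000 entries.

```shell
#!/bin/bash
# Hypothetical helper (not part of the original script): count the delete
# markers under a prefix. Relies on the AWS CLI paginating
# list-object-versions automatically, so all pages end up in the count.
count_delete_markers() {
    local bucket=$1 prefix=$2
    aws s3api list-object-versions \
        --bucket "$bucket" --prefix "$prefix" --output text |
        grep -c '^DELETEMARKERS'
}
```

Calling count_delete_markers "kannan-blog" "cloud/aws/yr=2018/mon=09/" before and after the cleanup gives a quick sanity check that the markers are really gone.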

When you choose text as the output format, each delete marker’s information comes on a single line, from which you can grep the object name (key) and version ID. You can then run the delete-object call with that version ID, which deletes only the delete marker and not the object itself. VOILA!
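To make the parsing step concrete, here is a minimal sketch that runs on canned input instead of a live bucket. The sample lines are made up (the keys, timestamps and version IDs are invented), but they follow the column layout the script relies on: the key in column 3 and the version ID in column 5. Both function names are hypothetical. Splitting on tabs rather than on any whitespace is our assumption about the text output, and it keeps keys that contain spaces in one piece.

```shell
#!/bin/bash
# Hypothetical helper: read list-object-versions text output on stdin and
# print "key<TAB>version-id" for every delete marker row.
extract_delete_markers() {
    awk -F'\t' '$1 == "DELETEMARKERS" { print $3 "\t" $5 }'
}

# Made-up sample resembling `--output text`; fields are tab-separated.
sample_listing() {
    printf 'DELETEMARKERS\tTrue\tcloud/aws/yr=2018/mon=09/part 1.csv\t2018-12-03T10:15:00.000Z\tEXAMPLEVERSIONID1\n'
    printf 'OWNER\texample-owner\tEXAMPLEOWNERID\n'
    printf 'VERSIONS\t"exampleetag"\tFalse\tcloud/aws/yr=2018/mon=09/part 1.csv\t2018-09-04T08:00:00.000Z\t1024\tSTANDARD\tEXAMPLEVERSIONID0\n'
}

# Only the delete marker row survives; the OWNER and VERSIONS rows are dropped.
sample_listing | extract_delete_markers
```

Only the first sample row is printed, as the key and version ID separated by a tab. The VERSIONS row is deliberately ignored, because passing a real version ID to delete-object would permanently remove that version.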

The script deleted one or two delete markers per second. At two per second, it retrieves (i.e. deletes the delete marker, which effectively restores the object) 120 objects per minute or 7,200 objects per hour, so retrieving 30,000 objects takes a little over 4 hours. I think that is not so bad considering you’d have to spend days doing it manually! :)
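The estimate is easy to redo for other bucket sizes or rates. Here is a rough sketch, with a made-up function name, that assumes a steady deletion rate for the whole run:

```shell
#!/bin/bash
# Hypothetical helper: estimate the hours needed to remove a number of
# delete markers at a steady rate (markers per second).
estimate_hours() {
    local total=$1 per_second=$2
    # awk does the floating-point division; 3600 seconds per hour.
    LC_ALL=C awk -v n="$total" -v r="$per_second" \
        'BEGIN { printf "%.1f\n", n / (r * 3600) }'
}

estimate_hours 30000 2   # prints 4.2 (two markers per second)
estimate_hours 30000 1   # prints 8.3 (one marker per second)
```

So the 4-hour figure is the optimistic end of the range. At one marker per second, the same bucket would take a bit over 8 hours.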
