RequestLimitExceeded: an error message almost every AWS user is familiar with. We often run into it in operations that interact with AWS services. AWS imposes a hard limit on the number of API calls allowed for its services, and that limit can’t be increased.
Last month, we hit this error for EC2 operations in many of our EMR and Qubole clusters. As a result, clusters were not scaling up and multiple production jobs failed.
As per the AWS documentation, all calls to the EC2 API are throttled per AWS account on a per-region basis. Irrespective of the origin of the calls, be it the CLI, an application, or the EC2 console, AWS ensures the calls don’t exceed the maximum allowed API request rate. This rate varies across regions, and AWS doesn’t share the actual number. It is a hard limit that cannot be increased and is the same for all customers. The only solution is to implement error retries with exponential backoff in the operations that interact with the EC2 APIs.
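To make the retry advice concrete, here is a minimal sketch of exponential backoff with jitter. It is not taken from AWS or from our scripts: the helper names `error_code` and `call_with_backoff`, the delay values, and the attempt count are all illustrative assumptions. The only thing it relies on is that botocore's `ClientError` carries the error code under `response["Error"]["Code"]`.

```python
import random
import time

# Error codes treated as "slow down and try again".
THROTTLE_CODES = {"RequestLimitExceeded"}


def error_code(exc):
    """Return the AWS error code of a botocore ClientError-style exception, if any."""
    response = getattr(exc, "response", None) or {}
    return response.get("Error", {}).get("Code")


def call_with_backoff(fn, max_attempts=6, base_delay=0.5, max_delay=30.0,
                      sleep=time.sleep, rng=random.random):
    """Call fn(), retrying throttled calls with capped exponential backoff.

    The delay values are illustrative assumptions, not AWS recommendations.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            is_last_attempt = attempt == max_attempts - 1
            if error_code(exc) not in THROTTLE_CODES or is_last_attempt:
                raise
            # Jitter: wait a random fraction of the capped exponential delay,
            # so many hosts retrying at once do not retry in lockstep.
            delay = min(max_delay, base_delay * 2 ** attempt) * rng()
            sleep(delay)
```

With boto3 this could be used as `call_with_backoff(lambda: ec2.describe_volumes())`, where `ec2` is a `boto3.client("ec2")`. boto3 also ships its own retry modes, so a hand-rolled wrapper like this is mostly useful for seeing what the retries actually do.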
Though this sounds reasonable, it didn’t help in our case, as we didn’t know how many API calls we were making, what type of calls they were, or which application was making most of them.
When we contacted support again for a way to identify and track the EC2 API calls being made from our account, they told us that we can monitor EC2 API requests with Amazon CloudWatch.
Monitoring EC2 API requests via CloudWatch
AWS provides an ‘opt-in’ feature for monitoring EC2 API requests via CloudWatch. These metrics provide a simple way to track the usage and outcomes of the Amazon EC2 API operations over time.
The metric we focused on is RequestLimitExceeded:
The number of times the maximum request rate permitted by the Amazon EC2 APIs has been exceeded for your account.
Amazon EC2 API requests are throttled to help maintain the performance of the service. If your requests have been throttled, you get the
When we graphed the number of API calls (SuccessfulCalls and RequestLimitExceeded) made in the EU region using CloudWatch, the numbers were extremely high.
Close to 7,000 API calls were made every 5 minutes, and about 4,500 of them exceeded the API limits. That is an average of 1,400 API calls per minute. We were sure that none of our applications required such an enormous number of API calls and that something, somewhere, was abusing the system.
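For readers who prefer pulling these numbers with code rather than from the console graph, the following is a rough sketch using the CloudWatch `get_metric_statistics` call. It rests on assumptions: that the opt-in metrics live in the `AWS/EC2/API` namespace and are published without dimensions (worth checking against the metric list in your own account), and the helper names are made up for illustration. The `cloudwatch` argument is expected to be a `boto3.client("cloudwatch")` for the region of interest.

```python
from datetime import datetime, timedelta, timezone

# Assumption: namespace used by the opt-in EC2 API metrics.
EC2_API_NAMESPACE = "AWS/EC2/API"


def metric_request(metric_name, end, hours=3, period_seconds=300):
    """Build get_metric_statistics arguments for one EC2 API metric."""
    return {
        "Namespace": EC2_API_NAMESPACE,
        "MetricName": metric_name,  # e.g. "SuccessfulCalls" or "RequestLimitExceeded"
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
        "Period": period_seconds,   # 300 s matches the 5-minute buckets in the text
        "Statistics": ["Sum"],
    }


def sums_per_period(cloudwatch, metric_name, end, hours=3, period_seconds=300):
    """Return (timestamp, sum) pairs sorted by time; datapoints arrive unordered."""
    response = cloudwatch.get_metric_statistics(
        **metric_request(metric_name, end, hours, period_seconds)
    )
    points = sorted(response["Datapoints"], key=lambda p: p["Timestamp"])
    return [(p["Timestamp"], p["Sum"]) for p in points]


def per_minute(total, period_seconds=300):
    """Convert a per-period total into an average per-minute rate."""
    return total / (period_seconds / 60)
```

A call such as `sums_per_period(cw, "RequestLimitExceeded", datetime.now(timezone.utc))` would return the throttled-call count for each 5-minute bucket, and `per_minute(7000)` reproduces the 1,400 calls per minute quoted above.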
So we decided to dig deeper and figure out the types of API calls (describe instances, create/delete, and so on) being made from our account.
CloudTrail logs and Athena to the rescue
CloudTrail stores the details of all the API calls made from an AWS account in an S3 bucket. As our use case required more granularity and analysis, we created an Athena table on top of the CloudTrail logs stored in the S3 bucket.
Refer to cloudonaut.io for an amazing article on exporting CloudTrail metrics to Athena.
When we narrowed the API call event source to EC2, we found that the ‘DescribeVolumes’ event was the highest, with a count of 116,837,724 in the EU region in 2019.
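The kind of Athena query behind this count can be sketched as below. This is not the exact query we ran: the table name `cloudtrail_logs` and the region value are hypothetical, and the column names (`eventsource`, `eventname`, `awsregion`, `eventtime`) assume the table was created with the usual CloudTrail schema for Athena, where `eventtime` is an ISO 8601 string. The helper only builds the SQL text; running it through Athena is left out.

```python
def top_events_query(table, region, year, group_by=("eventname",), limit=20):
    """Build an Athena query counting EC2 API calls per group for one year.

    table and region are supplied by the caller; the column names assume the
    standard CloudTrail table definition for Athena.
    """
    columns = ", ".join(group_by)
    return (
        f"SELECT {columns}, count(*) AS calls\n"
        f"FROM {table}\n"
        f"WHERE eventsource = 'ec2.amazonaws.com'\n"
        f"  AND awsregion = '{region}'\n"
        # eventtime is an ISO 8601 string, so string comparison orders correctly.
        f"  AND eventtime >= '{year}-01-01T00:00:00Z'\n"
        f"  AND eventtime < '{year + 1}-01-01T00:00:00Z'\n"
        f"GROUP BY {columns}\n"
        f"ORDER BY calls DESC\n"
        f"LIMIT {limit}"
    )


# Hypothetical table name and region, for illustration only.
print(top_events_query("cloudtrail_logs", "eu-west-1", 2019))
```

Passing `group_by=("useridentity.accesskeyid", "useragent")` instead groups the same calls by who made them, which is one way to narrow down the script or tool responsible.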
This is an insanely high number for this event, and we have no reason to call DescribeVolumes except from our ‘EBS Volumes snapshot’ script. So we narrowed the scope to the list of scripts, applications, and tools that could possibly be making ‘DescribeVolumes’ calls.
A Blast from the Past
For our Prometheus monitoring setup, we use the Telegraf agent as a collector. This agent is installed on all of the instances to gather host metrics such as CPU, memory, disk, and network, and expose them to Prometheus. Telegraf gathers these metrics every 30 seconds.
The agent only collects host-level metrics, so one of our colleagues added a custom script that calls ‘aws ec2 describe-volumes’ 4 times to get the size, type, and status of the volumes attached to the instance. This was done during the initial Prometheus POC, and over time we forgot about it.
As the agent’s refresh interval is set to 30 seconds, this script made 4 ‘describe-volumes’ calls every 30 seconds. We had at least 150 instances configured with this script, so approximately 600 EC2 DescribeVolumes API calls were made every 30 seconds, which works out to 1,200+ API calls per minute and 6,000+ calls every 5 minutes.
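The arithmetic above is easy to re-run for other fleet sizes or intervals. A small sketch, using the figures from this incident as inputs (the function name is made up for illustration):

```python
def api_call_rate(instances, calls_per_run, interval_seconds):
    """Estimate fleet-wide API call volume for a script run on a fixed interval."""
    per_interval = instances * calls_per_run
    per_minute = per_interval * 60 / interval_seconds
    return {
        "per_interval": per_interval,
        "per_minute": per_minute,
        "per_5_minutes": per_minute * 5,
    }


# 150 instances, 4 describe-volumes calls per run, one run every 30 seconds.
rate = api_call_rate(instances=150, calls_per_run=4, interval_seconds=30)
print(rate)
```

With these inputs the estimate is 600 calls per interval, 1,200 per minute, and 6,000 per 5 minutes, which lines up with the roughly 7,000 calls per 5 minutes seen in CloudWatch.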
We immediately deactivated the AWS access keys used in the script to stop the API calls being made from all the instances. (Requests fail when the access keys are not valid. Don’t ask why access keys were used in the script. 😀)
After this change, the number of API calls dropped drastically, from 7,000 to 150 per 5 minutes on average, and we didn’t encounter EC2 RequestLimitExceeded errors anymore.
So a small script that we added a long time ago brought our production workloads to a standstill. As they say, in every failure there is an opportunity to learn, and we did learn many important lessons.
- Always be mindful of the scripts and applications that interact with AWS (or any cloud provider, for that matter).
- Ensure that your script or application is not making too many API calls in a short time. Once you start hitting the maximum limit, there is no way forward, only backward.
- Always ask yourself what could possibly go wrong with the script and how it could possibly abuse the system. This way you can foresee the issues it could cause in the future.
- Document, document, document everything, so that you don’t have to break your head for hours thinking about how, why, and where.
- Create a CloudWatch alarm to notify you when RequestLimitExceeded errors exceed a defined threshold (say, 1,000 RequestLimitExceeded errors in 5 minutes).
- Using CloudTrail logs, identify which user, access key, or resource is making most of the API calls from your AWS account, and validate whether it’s justified.
- As AWS always says, implement error retries with exponential backoff in your operations that interact with the EC2 APIs.
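The CloudWatch alarm suggested in the list above could be created along these lines. Treat it as a sketch under assumptions: the alarm name is invented, the `AWS/EC2/API` namespace is assumed for the opt-in metrics, and `sns_topic_arn` stands for an SNS topic you already have. The `cloudwatch` argument is expected to be a `boto3.client("cloudwatch")`; `put_metric_alarm` and the parameters used here are part of the CloudWatch API.

```python
def create_throttle_alarm(cloudwatch, sns_topic_arn, threshold=1000, period_seconds=300):
    """Create an alarm that fires when throttled EC2 API calls exceed a threshold.

    The alarm name and namespace are assumptions for illustration; the default
    threshold follows the "1,000 errors in 5 minutes" example from the text.
    """
    params = {
        "AlarmName": "ec2-api-request-limit-exceeded",  # hypothetical name
        "Namespace": "AWS/EC2/API",                     # assumed metric namespace
        "MetricName": "RequestLimitExceeded",
        "Statistic": "Sum",
        "Period": period_seconds,
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        # No datapoints means no throttling was reported, so do not alarm.
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],
    }
    cloudwatch.put_metric_alarm(**params)
    return params
```

Summing over a single 5-minute period keeps the alarm aligned with the way the numbers are quoted in this post; a lower threshold would have caught our runaway script much earlier.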