Token Bucket Rate Limiter
Standard Rate Limiting
A common requirement for many APIs is to limit the number of requests that can be made per unit of time. This can be to prevent abuse, to ensure fair usage, to differentiate between different tiers of users, or to simply prevent overloading the system.
Rate limiting in itself is a simple approach: you set a limit of X requests per Y time period, and if a user exceeds that limit, you block them until the time period has passed. This works well when all your users make requests at a consistent frequency; in most cases, however, users need to make requests in short bursts. A fixed limit means they hit the cap exactly when they need to get work done, while the rest of the time they waste available requests they could be using.
Token Buckets
Token buckets are a more flexible approach. They still set an average rate limit, but they allow for limits to be exceeded for a time, as long as the average rate is maintained. This is done by using a “bucket” that fills with “tokens” at a certain rate. Each time a request is made, a token is removed from the bucket. If there are no tokens left, the request is blocked until a token becomes available.
For example, if you have a token bucket that fills at a rate of 1 token per second, and has a capacity of 10 tokens, then you can make up to 10 requests in quick succession, but after that, you will need to wait for the bucket to refill before you can make more requests. This allows for bursts of activity while still enforcing an average rate limit.
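The mechanics above can be sketched as a minimal in-memory bucket (the TokenBucket class and its method names are illustrative, not from any library; time is passed in explicitly so the refill behaviour is easy to follow):

```typescript
// Minimal in-memory token bucket sketch.
class TokenBucket {
  private tokens: number;
  private lastRefill: number; // seconds

  constructor(
    private capacity: number, // maximum tokens the bucket can hold
    private refillRate: number, // tokens added per second
    now: number = Date.now() / 1000,
  ) {
    this.tokens = capacity; // start with a full bucket
    this.lastRefill = now;
  }

  // Returns true and spends a token if one is available at time `now`.
  tryRemoveToken(now: number = Date.now() / 1000): boolean {
    // Refill based on elapsed time, capped at capacity
    this.tokens = Math.min(
      this.capacity,
      this.tokens + (now - this.lastRefill) * this.refillRate,
    );
    this.lastRefill = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}
```

With a capacity of 10 and a refill rate of 1 token per second, ten calls in the same instant succeed, the eleventh fails, and one more succeeds a second later.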
Combination Token Bucket Rate Limiting
The problem with a pure token bucket approach is that users can burst, but once their tokens are exhausted, they are limited to an extremely slow rate, e.g. 1 request per second.
A middle ground is to always ensure an absolute minimum number of tokens are available, e.g. 10 tokens, but then also have a token bucket that fills at a certain rate, e.g. 1 token per second with a maximum of e.g. 100 tokens. This allows for bursts of up to 100 requests, but ensures that even if the bucket is empty, users can still make 10 requests per second.
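The combined check can be sketched as follows (the function name, constant, and counters here are hypothetical, purely to illustrate the rule): a request is allowed if either the burst bucket has a token, or the user is still within the guaranteed per-second floor.

```typescript
const GUARANTEED_PER_SECOND = 10; // absolute minimum requests per second

// Hypothetical combined check: `bucketTokens` is the current burst bucket
// balance, `usedThisSecond` counts requests already made this second.
function isAllowed(bucketTokens: number, usedThisSecond: number): boolean {
  return bucketTokens >= 1 || usedThisSecond < GUARANTEED_PER_SECOND;
}
```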
Implementation
To implement this we don't need to keep track of every request; we only need the number of tokens in the bucket and the last time it was updated. We're going to need a database record.
We have to consider that a new user won't have a record, and we don't want a table full of records for users that either never made a request, or made a request a long time ago and are no longer active. To solve this, we can use a "lazy" approach to creating records. When a user makes a request, we check if they have a record. If they don't, we create one with the default number of tokens and the current time as the last updated time. If they do have a record, we calculate how many tokens should have been added since the last update, add those to the bucket (up to the maximum), and then check if there are enough tokens to allow the request.
I am going to show this implementation using AWS DynamoDB, which has a TTL feature that automatically deletes records after a period we specify. This costs nothing and requires no effort on our part.
import { DynamoDBClient } from '@aws-sdk/client-dynamodb';
import { DynamoDBDocumentClient, GetCommand, UpdateCommand } from '@aws-sdk/lib-dynamodb';

const client = new DynamoDBClient({});
const doc = DynamoDBDocumentClient.from(client);

const REFILL_RATE = 10; // tokens added per second
const MAX_BURST = 100; // maximum tokens in the bucket
const ONE_DAY_IN_SEC = 24 * 60 * 60;

async function checkRateLimit(userId: string) {
  const nowSec = Date.now() / 1000;
  const ttlValue = Math.floor(nowSec + ONE_DAY_IN_SEC);

  // Read the current bucket state; a new or expired user has no record
  const { Item } = await doc.send(
    new GetCommand({ TableName: 'UserRateLimits', Key: { userId } }),
  );

  // Refill based on the time elapsed since the last update, capped at the
  // maximum burst. A missing record defaults to a full bucket.
  const lastRefill = Item?.lastRefill ?? nowSec;
  const currentTokens = Item?.tokens ?? MAX_BURST;
  const refilled = Math.min(MAX_BURST, currentTokens + (nowSec - lastRefill) * REFILL_RATE);

  if (refilled < 1) {
    return { allowed: false, message: 'Rate limit exceeded' };
  }

  try {
    const result = await doc.send(
      new UpdateCommand({
        TableName: 'UserRateLimits',
        Key: { userId },
        // 'ttl' is a DynamoDB reserved word, so it needs a name placeholder
        UpdateExpression: 'SET lastRefill = :now, #ttl = :ttl, tokens = :tokens',
        // Only write if nobody else has touched the record since we read it;
        // a brand-new user has no lastRefill attribute yet
        ConditionExpression: 'attribute_not_exists(lastRefill) OR lastRefill = :prev',
        ExpressionAttributeNames: { '#ttl': 'ttl' },
        ExpressionAttributeValues: {
          ':now': nowSec,
          ':ttl': ttlValue,
          ':tokens': refilled - 1, // spend one token for this request
          ':prev': lastRefill,
        },
        ReturnValues: 'UPDATED_NEW',
      }),
    );
    return { allowed: true, remaining: result.Attributes?.tokens };
  } catch (err: any) {
    if (err.name === 'ConditionalCheckFailedException') {
      // A concurrent request won the race; deny rather than over-spend
      return { allowed: false, message: 'Rate limit exceeded' };
    }
    throw err;
  }
}
When a request comes in from a user we can call checkRateLimit(userId) to check if they are allowed to make the request. If they are allowed, we can proceed with processing the request. If they are not allowed, we can return a response indicating that they have exceeded their rate limit.
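As a sketch of that wiring (the handler shape and the 429 response are illustrative, not tied to any particular framework; the limiter is passed in as a parameter so the example stands alone):

```typescript
type RateLimitResult = { allowed: boolean; message?: string; remaining?: number };

// Illustrative request handler: deny with HTTP 429 when the limit is hit.
async function handleRequest(
  userId: string,
  limiter: (id: string) => Promise<RateLimitResult>, // e.g. checkRateLimit
): Promise<{ statusCode: number; body: string }> {
  const result = await limiter(userId);
  if (!result.allowed) {
    return { statusCode: 429, body: result.message ?? 'Rate limit exceeded' };
  }
  // ...process the request normally...
  return { statusCode: 200, body: 'OK' };
}
```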
Performing the write as an UpdateCommand guarded by a ConditionExpression is what prevents race conditions: if two requests from the same user race, only one write succeeds and the other is rejected, so concurrent requests can never allow a user to exceed their rate limit. Note that DynamoDB update expressions only support simple addition and subtraction, so the refill multiplication has to happen in application code before the conditional write.
A user who does not yet have a record is handled by defaulting to a full bucket: their first request creates the record with the default number of tokens and the current time as the last updated time.
Active User Example
An active user will already have a record in the database. When they make a request, we will calculate how many tokens to add based on the time since the last update, add those to the bucket (up to the maximum), and then check if there are enough tokens to allow the request. If there are enough tokens, we will extract one token and update the record with the new number of tokens, the current time as the last updated time, and set the TTL to be 24 hours from now.
The record will look something like this:
{
"userId": "user123",
"tokens": 95,
"lastRefill": 1620000000,
"ttl": 1620086400
}
If a new request comes in for user123 and the timestamp is 1620000005, we calculate the number of tokens to add as (nowSec - lastRefill) * REFILL_RATE, which with these values is (1620000005 - 1620000000) * 10 = 50. Given that we already had 95 tokens, we would now have 145; however, the maximum burst is 100 tokens, so the bucket is capped at 100. We then extract one token, giving us 99 tokens remaining. We also update lastRefill to the current time and set the ttl to be 24 hours from now.
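That arithmetic, pulled out into a standalone helper (refillThenSpend is an illustrative name, not part of the implementation above):

```typescript
// Refill the bucket for the elapsed time, cap at the maximum, then spend
// one token for the current request. Returns the new balance.
function refillThenSpend(
  tokens: number,
  lastRefill: number, // seconds
  now: number, // seconds
  rate: number, // tokens per second
  max: number, // bucket capacity
): number {
  const refilled = Math.min(max, tokens + (now - lastRefill) * rate);
  return refilled - 1;
}
```

With the values above, refillThenSpend(95, 1620000000, 1620000005, 10, 100) caps 95 + 50 at 100, spends one token, and leaves 99.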
We will essentially attempt to create a new record that looks like this:
{
"userId": "user123",
"tokens": 99,
"lastRefill": 1620000005,
"ttl": 1620086405
}
New or Inactive User Example
An inactive or new user will not have a record in the database. When they make a request, we create a new record, defaulting them to the maximum burst of tokens, which in this case is 100. We then extract one token for the request, giving them 99 tokens remaining, set lastRefill to the current time, and set the ttl to be 24 hours from now.
The record will look something like this:
{
"userId": "user123",
"tokens": 99,
"lastRefill": 1620000005,
"ttl": 1620086405
}
Alternative Options
In our example we automatically give a new user the maximum burstable amount of tokens. This may not be desirable in all cases; we may want new users to start from a smaller initial balance instead. We can solve this with this change:
...
const INITIAL_BALANCE = 50; // Initial tokens for new users
...
// Use the initial balance, rather than the full burst capacity, as the
// default token count for a user with no existing record
const currentTokens = Item?.tokens ?? INITIAL_BALANCE;
...
This way a new record gets created with an initial balance of 50 tokens, and then refills at the same rate as before. This allows us to control the initial burst that a new user can have, while still allowing them to refill their tokens over time.
Real World Examples
This algorithm is used by AWS EC2 for burstable instances. These instances have a baseline performance level, but can burst above it for short periods; the token bucket algorithm determines how much burst capacity an instance has available at any given time and ensures it does not exceed its limits. If you launch a T-series instance, you receive CPU credits that allow you to burst above the baseline. As you burst, those credits are consumed, and if you run out, your instance is limited to the baseline performance level until it earns more credits over time. The remaining balance is exposed in a metric called CPUCreditBalance: sustained high CPU usage combined with a low credit balance means your instance is about to be throttled back to the baseline.