Adding Custom Rate Limiting to Your AWS Lambda API Using a Middleware

January 24, 2025

Rate limiting an API or service is a common use case. But how can you do this for your AWS Lambda? API Gateway offers usage plans and quotas, so you can put an API in front of the Lambda, but quotas have some downsides there:

Usage plan throttling and quotas are not hard limits, and are applied on a best-effort basis. In some cases, clients can exceed the quotas that you set. Don’t rely on usage plan quotas or throttling to control costs or block access to an API.

Another option is to pair API Gateway, your Lambda and AWS WAF. But if you don’t want to find yourself in the AWS WAF configuration jungle, a simpler option could do the trick for you. Let’s explore how you can easily set up rate limits for your Lambda.

The above image shows the architecture of the rate-limiter. As I use Middy in almost all of my projects for shared middlewares, I built the rate-limiter as a middleware as well. The code is split into two parts:

Checking the user’s current quota against the allowed quota in the before phase
Increasing the used quota after the actual invocation of my API in the after phase
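To make the two phases more concrete, here is a minimal sketch that does not use Middy or a real cache. The in-memory `Map`, the fixed `LIMIT` and the `invoke` helper are assumptions made for illustration only; the actual middleware is built step by step later in this post.

```typescript
// In-memory stand-in for the cache; a real setup would use Momento, Redis, etc.
const usage = new Map<string, number>();

// Hypothetical fixed limit, for illustration only
const LIMIT = 2;

// "before" phase: reject the request if the user has used up the quota
function before(userId: string): void {
  const used = usage.get(userId) ?? 0;
  if (used >= LIMIT) {
    throw new Error(`Rate limit of ${LIMIT} reached for ${userId}`);
  }
}

// "after" phase: count the request once the handler has succeeded
function after(userId: string): void {
  usage.set(userId, (usage.get(userId) ?? 0) + 1);
}

// Runs a handler between the two phases, similar to what a middleware engine does
function invoke<T>(userId: string, handler: () => T): T {
  before(userId);
  const result = handler(); // if this throws, after() never runs
  after(userId);
  return result;
}
```

Because `after` only runs when the handler returns, failed invocations do not consume quota in this sketch.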

For storage, I decided to go with something that has fast read and write access and, in the best case, supports TTLs. So a cache won the race and I went with Momento. You can of course use other caches like Redis or even a database to save the current quotas of a user.

Those are the basics we need to implement rate limiting. Of course, the allowed quotas have to be defined somewhere and the current quotas have to be stored somewhere. Before we touch on these topics, let’s explore why rate limiting is even needed.

ℹ️

I used the rate limits introduced here for an AppSync API, where I needed to have fixed quotas for certain mutations based on a user’s subscription to the service. Thus, there is no API Gateway involved. You can of course put the middleware code into API Gateway Lambda authorizers to check for quotas and increase them accordingly. Be aware, though, that authorizers need some cache tweaks to make this work. An additional requirement for my implementation was that quotas are only increased on successful invocations of the mutation.

What Is Rate Limiting and Why Implement It?

Rate limiting controls traffic to a network, app, or API. It does this by limiting the number of requests a client can make in a set time, like 100 requests per minute. It ensures efficient use of system resources. It also protects the infrastructure from misuse or overload.
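One common way to implement "N requests per time window" is a fixed-window counter. The sketch below is a generic in-memory illustration of the idea, not the cache-based implementation built later in this post; the function name and the injectable clock are assumptions for the example.

```typescript
type WindowState = { windowStart: number; count: number };

// Fixed-window limiter: at most `limit` requests per `windowMs` for each client
function createFixedWindowLimiter(limit: number, windowMs: number) {
  const windows = new Map<string, WindowState>();

  // Returns true if the request is allowed; `nowMs` can be injected for testing
  return (clientId: string, nowMs: number = Date.now()): boolean => {
    const state = windows.get(clientId);

    // Start a fresh window if none exists or the previous one has expired
    if (!state || nowMs - state.windowStart >= windowMs) {
      windows.set(clientId, { windowStart: nowMs, count: 1 });
      return true;
    }

    // The window is still open and the limit is used up
    if (state.count >= limit) {
      return false;
    }

    state.count += 1;
    return true;
  };
}

// 100 requests per minute, as in the example above
const allow = createFixedWindowLimiter(100, 60 * 1000);
```

A cache entry with a TTL behaves much like one of these windows: the entry holds the count and its expiry ends the window.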

Why implement rate limiting?

Rate limiting safeguards against malicious activities. These include Denial of Service (DoS) attacks, brute force login attempts, and excessive API scraping. It does this by capping the request frequency.
It prevents servers from becoming overwhelmed by limiting request volume. This ensures a consistent, reliable experience for all users.
In shared environments, rate limiting ensures fair access to resources. It prevents certain users from monopolizing bandwidth or server capacity.
Limiting traffic cuts costs by reducing server load. This is key for APIs that do resource-heavy tasks. It ensures they are used wisely.
Rate limiting can help with monetization. For example, you can offer a free tier with a limit on requests to attract new users and charge for any extra usage. This matters most for resource-intensive APIs, such as those for complex data processing, machine learning, or real-time analytics, where heavy usage incurs high costs. Higher-tier plans can include elevated rate limits, providing premium access and encouraging upgrades.
For usage-based or subscription-based APIs, rate limiting ensures users stay within their plans. It helps avoid unexpected costs and encourages users to scale up their usage.

Rate limiting can act as a gateway to premium services. It offers a baseline of free requests. This lets users test an API's value before committing to higher prices.

Implementing the Middleware

In this post we explore an option to rate limit users based on their available quota. This information can come from some hard limit or be based on the current subscription of a user within our product.

Now, let’s get started with the implementation. First, we only want to rate limit some operations of our API. So we can define them for later use.

export const QUOTA_OPERATION = ["generate", "analyze"] as const;

export type QuotaOperation = (typeof QUOTA_OPERATION)[number];
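Since operation names usually arrive as plain strings (a path or a field name), a small type guard can help narrow them to `QuotaOperation`. This is a sketch; `isQuotaOperation` is a hypothetical helper that is not part of the original code.

```typescript
const QUOTA_OPERATION = ["generate", "analyze"] as const;

type QuotaOperation = (typeof QUOTA_OPERATION)[number];

// Type guard: narrows an arbitrary string to QuotaOperation
function isQuotaOperation(value: string): value is QuotaOperation {
  // Widen the readonly tuple so that includes() accepts any string
  return (QUOTA_OPERATION as readonly string[]).includes(value);
}

const incoming: string = "generate";

if (isQuotaOperation(incoming)) {
  // Inside this branch, `incoming` is typed as "generate" | "analyze"
  const operation: QuotaOperation = incoming;
  console.log(`rate-limited operation: ${operation}`);
}
```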

As a first step, we define what rate limits a user can have based on the operation and entitlement. Let’s assume there are free, advanced and enterprise. We can use a simple mapping between an entitlement a user can have within our app and the resulting number of requests they can make. Additionally, we want to store the period after which these rate limits reset (e.g. seconds, minutes, days or even weeks).

export const QUOTA_PERIOD = ["second", "minute", "hour", "day"] as const;

export type QuotaPeriod = (typeof QUOTA_PERIOD)[number];

export type QuotaEntitlement = "free" | "advanced" | "enterprise";

export const RATE_LIMITING_QUOTAS: Record<
  QuotaEntitlement,
  Record<
    QuotaOperation,
    {
      limit: number;
      period: QuotaPeriod;
    }
  >
> = {
  free: {
    analyze: {
      limit: 100,
      period: "day",
    },
    generate: {
      limit: 10,
      period: "minute",
    },
  },
  advanced: {
    analyze: {
      limit: 1000,
      period: "day",
    },
    generate: {
      limit: 100,
      period: "minute",
    },
  },
  enterprise: {
    analyze: {
      limit: 1000,
      period: "hour",
    },
    generate: {
      limit: 100,
      period: "second",
    },
  },
};

The code above defines the following:

Every user on the free plan can call analyze 100 times a day and generate 10 times per minute
Every user on the advanced plan can call analyze 1000 times a day and generate 100 times per minute
Every user on the enterprise plan can call analyze 1000 times an hour and generate 100 times per second

A user can hold more than one entitlement, so what matters for the rate limit is the user’s highest entitlement.

export const ENTITLEMENT_ORDER: Record<QuotaEntitlement, number> = {
  free: 1,
  advanced: 2,
  enterprise: 3,
} as const;

export const getHighestEntitlement = (
  entitlements: QuotaEntitlement[]
): QuotaEntitlement => {
  return entitlements.reduce(
    (highest, current) =>
      ENTITLEMENT_ORDER[current] > ENTITLEMENT_ORDER[highest]
        ? current
        : highest,
    "free"
  );
};
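A quick usage sketch shows how the ordering resolves different inputs. The definitions are repeated in compact form so the snippet runs on its own; the variable names are made up for the example.

```typescript
type QuotaEntitlement = "free" | "advanced" | "enterprise";

const ENTITLEMENT_ORDER: Record<QuotaEntitlement, number> = {
  free: 1,
  advanced: 2,
  enterprise: 3,
};

// Same logic as above: keep whichever entitlement ranks higher
const getHighestEntitlement = (
  entitlements: QuotaEntitlement[]
): QuotaEntitlement =>
  entitlements.reduce<QuotaEntitlement>(
    (highest, current) =>
      ENTITLEMENT_ORDER[current] > ENTITLEMENT_ORDER[highest]
        ? current
        : highest,
    "free"
  );

const upgraded = getHighestEntitlement(["advanced", "free"]); // "advanced"
const mixed = getHighestEntitlement(["free", "enterprise", "advanced"]); // "enterprise"
const none = getHighestEntitlement([]); // "free"

console.log({ upgraded, mixed, none });
```

Note that a user without any entitlement falls back to the free plan, because "free" is the initial value of the reduce.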

We need to store everything in a format that tells us which rate limit to check, how much of the quota is currently used and when the quota will reset. With Momento, this is rather easy. We can come up with a key structure that captures the most important things, userId and operation, rely on automatic TTLs for resets and store the currently used quota as the value:

const generateCacheKey = (args: {
  userId: string;
  operation: QuotaOperation;
}) => {
  return `user:${args.userId}:operation:${args.operation}`;
};

⚠️

Caution: Momento (and some other caches) have an initial limit of 24 hours for any TTL. If you need higher TTLs for your use case, consider increasing the limit, using another cache or another store altogether.
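If you want to catch such a mismatch early, a small guard could compare each configured period with the maximum TTL of your store. This is only a sketch: the "week" period is a hypothetical addition to show a failing case, and the 24 hour default is taken from the caution above, so check the actual limits of your account.

```typescript
// "week" is a hypothetical extra period, added only to show a failing case
type Period = "second" | "minute" | "hour" | "day" | "week";

const PERIOD_SECONDS: Record<Period, number> = {
  second: 1,
  minute: 60,
  hour: 60 * 60,
  day: 24 * 60 * 60,
  week: 7 * 24 * 60 * 60,
};

// Assumption: 24 hours, as mentioned in the caution above
const MAX_TTL_SECONDS = 24 * 60 * 60;

// Returns true if a quota period can be expressed as a single cache TTL
function ttlFitsCache(
  period: Period,
  maxTtlSeconds: number = MAX_TTL_SECONDS
): boolean {
  return PERIOD_SECONDS[period] <= maxTtlSeconds;
}

console.log(ttlFitsCache("day")); // true
console.log(ttlFitsCache("week")); // false
```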

Now we can create a helper function that returns the following:

currentCount: Currently used quota, will be 0 if no entry is found on the cache
limit: How many invokes the user can currently make to this API
period: When the rate limit will reset as a unit
ttl: Remaining time until the quota will reset
cache: Cache related information.

const parseGetResponseValue = (response: CacheGet.Response) => {
  switch (response.type) {
    case CacheGetResponse.Hit:
      return { currentCount: parseInt(response.valueString(), 10), exists: true };
    case CacheGetResponse.Miss:
      return { currentCount: 0, exists: false };
    case CacheGetResponse.Error:
      // In case of an error, we throw and deny access
      throw new Error("Unable to get the user's quota");
  }
};

export const periodInSeconds = (period: QuotaPeriod): number => {
  switch (period) {
    case "second":
      return 1; // 1 second in seconds
    case "minute":
      return 60; // 1 minute in seconds
    case "hour":
      return 60 * 60; // 1 hour in seconds
    case "day":
      return 24 * 60 * 60; // 1 day in seconds
  }
};

export async function getCurrentQuotaOfUser(args: {
  operation: QuotaOperation;
  cacheClient: CacheClient;
  entitlement: QuotaEntitlement[];
  userId: string;
}) {
  // Get the current highest entitlement for the user
  const userEntitlement = getHighestEntitlement(args.entitlement);

  // Determine the max rate limit based on the highest entitlement
  const operationLimit = RATE_LIMITING_QUOTAS[userEntitlement][args.operation];

  // Generate the cache key
  const cacheKey = generateCacheKey({
    userId: args.userId,
    operation: args.operation,
  });

  // Get the current count from Momento
  const response = await args.cacheClient.get("rate-limit-cache", cacheKey);

  // Parse the result
  const { currentCount, exists } = parseGetResponseValue(response);

  // Parse the current period to seconds for TTL usage
  let ttl: number = periodInSeconds(operationLimit.period);

  if (exists) {
    // Check for the current TTL if we found an entry in the cache
    const ttlResponse = await args.cacheClient.itemGetTtl(
      "rate-limit-cache",
      cacheKey
    );

    const ttlMillis = ttlResponse.remainingTtlMillis();

    // Convert the remaining TTL from milliseconds to seconds
    ttl = ttlMillis ? Math.ceil(ttlMillis / 1000) : ttl;
  }

  return {
    currentCount,
    limit: operationLimit.limit,
    period: operationLimit.period,
    ttl,
    cache: {
      key: cacheKey,
      exists,
      ttl,
    },
  };
}
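The ttl returned by this helper is a remaining duration, not a point in time. Assuming it is expressed in seconds, the following sketch shows how it could be turned into a reset timestamp for an error message or a response header. Both helper names are hypothetical.

```typescript
// Converts a cache TTL reported in milliseconds into whole seconds (rounded up)
function millisToSeconds(millis: number): number {
  return Math.ceil(millis / 1000);
}

// Converts a remaining TTL in seconds into an ISO 8601 reset timestamp
function resetsAt(ttlSeconds: number, now: Date = new Date()): string {
  return new Date(now.getTime() + ttlSeconds * 1000).toISOString();
}

const remaining = millisToSeconds(59_200); // 60
const reset = resetsAt(remaining, new Date("2025-01-24T12:00:00.000Z"));

console.log(reset); // 2025-01-24T12:01:00.000Z
```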

These helper functions are the base for our rate limiting middleware. Before we put the middleware together, we create a reusable Momento client. Let’s write a small middleware that injects a client into the execution context of our Lambda function.

import {
  CacheClient,
  Configurations,
  CredentialProvider,
} from "@gomomento/sdk";
import { logger } from "@instameal/lambda-logger";
import middy, { MiddlewareObj } from "@middy/core";

async function createCacheClient(secret: string) {
  return CacheClient.create({
    configuration: Configurations.Laptop.v1(),
    credentialProvider: CredentialProvider.fromString(secret),
    defaultTtlSeconds: 600,
  });
}

let cachedClient: CacheClient | null = null;

export type MomentoClientContext = {
  cacheClient: CacheClient;
};

export type WithMomentoClientCache<T> = T & {
  momento: MomentoClientContext;
};

type MomentoOptions = {
  secret: string;
};

const assignMomentoClientToContext = (
  options: MomentoOptions
): middy.MiddlewareObj => {
  const before: middy.MiddlewareFn = async (request) => {
    if (!cachedClient) {
      cachedClient = await createCacheClient(options.secret);
    } else {
      logger.debug("using cached MomentoClient client");
    }
    Object.assign(request.context, {
      momento: {
        // Access to the whole cache abstraction
        cacheClient: cachedClient,
      },
    });
  };

  return {
    before,
  };
};

/**
 * Middleware assigning an instance of the MomentoClientSDK on the context object.
 * The client instance is cached and reused for the same execution environment.
 */
export const momento: (options: MomentoOptions) => MiddlewareObj[] = (
  options
) => [assignMomentoClientToContext(options)];

The rate limiting middleware is now easily put together. We need to:

Get the operation from the incoming event
Check if the operation is even rate limited. If not, we can execute the request directly
Use the getCurrentQuotaOfUser to get the required information
Check if the user has exceeded the limit. If so, we will throw a LimitExceededException
Assign some information we have gathered to the context, so we can use it in the after phase
In the after phase, create or increment the current entry in our cache

type RateLimitingStash = {
  cacheKey: string;
  currentCount: number;
  exists: boolean;
  ttl: number;
};

const rateLimitingMiddleware = (): MiddlewareObj<
  AppSyncResolverEvent<unknown, unknown>,
  APIGatewayProxyResult,
  Error,
  WithMomentoClientCache<LambdaContext>
> => {
  return {
    before: async (handler) => {
      const { context, event } = handler;
      const {
        momento: { cacheClient },
      } = context;

      // Get the operation name somewhere from the request
      // This could be either the path or a mutation name from AppSync
      const operation = mapFieldNameToOperation({
        fieldName: event.info.fieldName,
      });

      if (!operation || !QUOTA_OPERATION.includes(operation)) {
        logger.debug("No operation or operation is not rate-limited", {
          operation,
        });
        // If there's no operation, or the operation is not rate-limited, proceed
        return;
      }

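      const { currentCount, limit, ttl, cache } =
        await getCurrentQuotaOfUser({
          cacheClient,
          operation,
          // Assumption: getUserId and getEntitlements are placeholder helpers
          // that resolve both values from event.identity for your auth setup
          userId: getUserId(event.identity),
          entitlement: getEntitlements(event.identity),
        });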

      // Check if the current request exceeds the allowed limit
      if (currentCount >= limit) {
        throw new LimitExceededException(
          "You have reached the limit for the operation",
          {
            operation,
            limit,
            invokes: currentCount,
            // ttl is the remaining time in seconds, so add it to the current time
            resetsAt: DateTime.now().plus({ seconds: ttl }).toISO(),
          }
        );
      }

      // Save the information to the stash so we can use it after the API execution
      const rateLimitingStash: RateLimitingStash = {
        cacheKey: cache.key,
        currentCount,
        exists: cache.exists,
        ttl,
      };

      Object.assign(handler.event.stash, {
        rateLimit: rateLimitingStash,
      });
    },

    after: async (handler) => {
      const { context, event, response } = handler;
      const {
        momento: { cacheClient },
      } = context;

      // Only increase the used quota if the operation was executed successfully
      if (response) {
        // Check if there's a valid response
        const rateLimitingStash = event.stash.rateLimit as
          | RateLimitingStash
          | undefined;

        // No rate limiting stash, so we don't need to update the cache
        if (!rateLimitingStash) {
          return;
        }

        const { cacheKey, currentCount, ttl, exists } = rateLimitingStash;

        // Entry already exists, we can use the atomic increment
        if (exists) {
          await cacheClient.increment(
            generateCacheName(environment, "rate-limit-cache"),
            cacheKey,
            1,
            { ttl }
          );
        } else {
          // Set a new value with a TTL for the period
          await cacheClient.set(
            generateCacheName(environment, "rate-limit-cache"),
            cacheKey,
            (currentCount + 1).toString(),
            { ttl }
          );
        }
      }
    },
  };
};

export default rateLimitingMiddleware;
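The LimitExceededException thrown in the before phase is not shown in this post. A minimal sketch of such an error class could look like the following; the shape of the details object is an assumption derived from the arguments passed in the middleware above.

```typescript
// Assumed shape, derived from the arguments used in the middleware above
type LimitExceededDetails = {
  operation: string;
  limit: number;
  invokes: number;
  resetsAt: string | null;
};

class LimitExceededException extends Error {
  readonly details: LimitExceededDetails;

  constructor(message: string, details: LimitExceededDetails) {
    super(message);
    this.name = "LimitExceededException";
    this.details = details;
    // Keeps instanceof checks working when compiling to older targets
    Object.setPrototypeOf(this, LimitExceededException.prototype);
  }
}

const error = new LimitExceededException(
  "You have reached the limit for the operation",
  {
    operation: "generate",
    limit: 10,
    invokes: 10,
    resetsAt: "2025-01-24T12:01:00.000Z",
  }
);

console.log(error.name, error.details.limit); // LimitExceededException 10
```

Carrying the details on the error lets an error-handling middleware or resolver map them to a structured response for the client.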

Finally, we can enhance our Lambda handler with our two newly created middlewares. Now, any request that uses the analyze or generate mutations is rate limited based on our defined quotas.

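const handler = async () => {
  // ... your implementation
};

export const main = middy(handler)
  // Assumption: the Momento API key is provided via an environment variable
  .use(momento({ secret: process.env.MOMENTO_API_KEY ?? "" }))
  .use(rateLimitingMiddleware());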

Final Thoughts

We have implemented a simple rate limiting middleware. It allows us to limit any execution of the operations we have defined in the timeframes we have defined. Also, this is not on a best-effort basis. It will block any request that exceeds the pre-defined limit.

We can enhance our implementation in many ways. We can make it more dynamic by storing more information, so that existing accounts are not affected if we change our quotas in the future. We can make the entitlements and operation mapping more modular. This would help apps with dynamic, not strict, plans and entitlements. Or, and this is a big decision, we can use a different storage engine, which can have benefits of its own.

Choosing the correct storage can be tough. The advantage of Momento and other caches such as Redis is that they offer atomic writes. This comes in handy, as it prevents race conditions between concurrent requests. You may also want to check the concurrency settings of your Lambda in that regard.
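To illustrate why atomic writes matter, the following sketch simulates two requests against an in-memory store. The interleaving is written out by hand and the increment function only stands in for a cache's atomic increment; in a real system, the interleaving happens across concurrent Lambda invocations.

```typescript
// In-memory stand-in for a cache
const store = new Map<string, number>();

// Non-atomic access: reading and writing are two separate steps
const read = (key: string): number => store.get(key) ?? 0;
const write = (key: string, value: number): void => {
  store.set(key, value);
};

// Stand-in for an atomic increment: read-modify-write happens as one step
const increment = (key: string, by: number = 1): number => {
  const next = read(key) + by;
  store.set(key, next);
  return next;
};

// Two requests interleave: both read before either one writes
const seenByA = read("naive");
const seenByB = read("naive");
write("naive", seenByA + 1);
write("naive", seenByB + 1);

// Two requests using the atomic increment
increment("atomic");
increment("atomic");

console.log(store.get("naive")); // 1, one invocation was lost
console.log(store.get("atomic")); // 2
```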

The decision should also depend on the throughput you are expecting for your API. Caches are fast and cheap, but have some other limitations. Databases can get expensive if you have many reads and writes to manage your API's rate limits. However, they offer other benefits.

As in so many cases, it really depends on your use case and what you want to achieve.

See you next time. 👋

@codingfuchs

Freelance Fullstack Engineer | AWS Community Builder | Serverless & Frontend = ❤️