I’m writing this article because I recently realized how many misconceptions there are about idempotency keys and “exactly one execution.” I’m also writing it for myself, so I can refine my understanding and update the article over time.

In distributed systems, the guarantee of “exactly-once delivery” is not possible. The network is unreliable. Packets get dropped. Connections time out. Acknowledgments get lost. We cannot guarantee exactly one delivery, but we often must guarantee exactly one effect.

In this context, even a simple client <> server setup is a distributed system.

1. The Reality of the Network

To understand why idempotency is necessary, we must look at the two layers involved in a request.

The Transport Layer (TCP/HTTP)

The network is inherently unreliable and operates on a “best effort” principle. A request can reach the server and be processed while the acknowledgment (ACK) or the response is lost on the way back. To combat loss, clients retry requests. TCP retransmits lost segments within a connection, but it cannot help once the connection itself times out or is reset. As a consequence, the server may receive the same instruction multiple times. We must accept “at-least-once delivery” as our baseline reality.

The Application Layer

The application layer (Handler + Database) has a different goal. If a user clicks “Pay 100” and the server receives that request three times due to retries, a naive application will charge the user three times. The application must recognize that requests R_1, R_2, and R_3 are actually the same command.


2. Implementation

The core mechanism is the Idempotency Key - a unique token representing a specific business intent. It is sent along with the request (for example, in an HTTP header).

Client-Side Responsibilities

The client acts as the source of truth for the operation’s uniqueness.

  1. Generation: The client generates a unique key. UUID v7 is recommended over v4 because it is time-ordered, which improves database index locality and insert performance. Beware, however, that a UUID v7 exposed publicly leaks its creation time, so in some industries, such as healthcare, it might be a big no.

  2. Lifecycle: The key is associated with a specific user action (e.g., “Checkout”).

  3. Retries: If the client receives a network error (timeout) or a 5xx server error, it must retry using the same Idempotency Key.

  4. New Operation: If the user changes the cart and tries again, this is a new intent. The client must generate a new key.
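The client rules above can be sketched in a few lines. This is a minimal illustration, not a full HTTP client: `send` is a hypothetical transport callable and `TransientError` is a hypothetical exception standing in for timeouts and 5xx responses. The sketch uses `uuid4` only because it ships with the standard library; a UUID v7 generator can be swapped in where one is available.

```python
import uuid


class TransientError(Exception):
    """Hypothetical marker for network timeouts and 5xx responses."""


def submit_with_retries(send, payload, max_attempts=3):
    """Send one business intent, reusing a single key across all retries.

    `send(key, payload)` is a hypothetical transport callable that returns
    a response or raises TransientError.
    """
    # One key per user intent, generated once, before the first attempt.
    key = str(uuid.uuid4())
    last_error = None
    for _ in range(max_attempts):
        try:
            return key, send(key, payload)
        except TransientError as exc:
            last_error = exc  # retry with the SAME key
    raise last_error
```

A changed cart means a new call to `submit_with_retries`, and therefore a new key.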

Server-Side Data Model

The server tracks these keys to manage state.

Table Schema: idempotency_entries

| Column | Type | Description |
| --- | --- | --- |
| idempotency_key | UUID | Primary or Unique identifier. |
| tenant_id | UUID | Scope for the key (usually User ID or Org ID). |
| request_hash | String | SHA-256 of (Method + Path + Normalized Body). |
| status | Enum | IN_PROGRESS, COMPLETED, FAILED_FATAL. |
| response_code | Integer | The HTTP status code returned (e.g., 201, 400). |
| response_body | JSON | The cached payload to return on retries. |
| created_at | Timestamp | For TTL and audit. |
| last_heartbeat_at | Timestamp | (Optional) Used to detect “zombie” processes. |

Critical Constraints:

You must enforce a Unique Index on (tenant_id, idempotency_key). This delegates the “locking” mechanism to the database engine.
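To make the schema and its constraint concrete, here is one possible rendering using Python's built-in `sqlite3` module. The column types are SQLite approximations (TEXT for UUID and JSON, REAL for timestamps), and `open_db` is a helper name made up for this sketch; a production database would use its native types.

```python
import sqlite3

SCHEMA = """
CREATE TABLE idempotency_entries (
    tenant_id         TEXT NOT NULL,
    idempotency_key   TEXT NOT NULL,
    request_hash      TEXT NOT NULL,
    status            TEXT NOT NULL
        CHECK (status IN ('IN_PROGRESS', 'COMPLETED', 'FAILED_FATAL')),
    response_code     INTEGER,
    response_body     TEXT,
    created_at        REAL NOT NULL,
    last_heartbeat_at REAL,
    -- The constraint that does the "locking": one row per (tenant, key).
    UNIQUE (tenant_id, idempotency_key)
);
"""


def open_db():
    """Create an in-memory database with the idempotency table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn
```

Inserting the same `(tenant_id, idempotency_key)` pair twice raises `sqlite3.IntegrityError`, while the same key under a different tenant is accepted.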


3. The Execution Flow

The process relies on “Optimistic Locking” via the database constraint at the very start of the request.

Step 1: The Atomic Lock

When a request arrives, attempt to insert a row with status = IN_PROGRESS.

  • Success: No other thread is processing this key. Proceed.

  • Failure (Unique Violation): Another thread is working on this, or it has finished. Stop processing immediately.

      - _Action:_ Query the table to see the current state. If `COMPLETED`, return the saved result. If `IN_PROGRESS`, wait (spin-lock) or return a `409 Conflict` telling the client to retry later.
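The insert-first claim can be sketched as follows, again with `sqlite3` standing in for the real database. `make_db` and `try_claim` are illustrative names, and the table is trimmed to the columns this step touches.

```python
import sqlite3
import time


def make_db():
    # Minimal in-memory table for the demo.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE idempotency_entries ("
        " tenant_id TEXT NOT NULL, idempotency_key TEXT NOT NULL,"
        " request_hash TEXT NOT NULL, status TEXT NOT NULL,"
        " response_code INTEGER, response_body TEXT, created_at REAL NOT NULL,"
        " UNIQUE (tenant_id, idempotency_key))"
    )
    return conn


def try_claim(conn, tenant_id, key, request_hash):
    """Return ("CLAIMED", None), or (existing_status, stored_row) on a duplicate."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO idempotency_entries "
                "(tenant_id, idempotency_key, request_hash, status, created_at) "
                "VALUES (?, ?, ?, 'IN_PROGRESS', ?)",
                (tenant_id, key, request_hash, time.time()),
            )
        return "CLAIMED", None
    except sqlite3.IntegrityError:
        # Someone else holds or held the key: look at what they left behind.
        row = conn.execute(
            "SELECT status, request_hash, response_code, response_body "
            "FROM idempotency_entries "
            "WHERE tenant_id = ? AND idempotency_key = ?",
            (tenant_id, key),
        ).fetchone()
        if row is None:  # row vanished between insert and select
            return "RETRY", None
        return row[0], row
```

The caller maps the returned status to a response: replay the stored result for `COMPLETED`, answer `409 Conflict` (or wait) for `IN_PROGRESS`.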
      
    

Step 2: Request Validation (Parameter Tampering)

If the key exists, we must ensure the client isn’t reusing a key for a different action.

  • Compare the incoming request’s hash against the stored request_hash.

  • Mismatch: If Key A was used for “Pay 50” and now arrives with a different payload, this is likely a bug or an attack.

  • Action: Return 422 Unprocessable Entity or 409 Conflict. Never return the cached result for a mismatched request.
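One way to compute and compare the request hash is shown below. The normalization used here (sorted JSON keys, no whitespace, upper-cased method) is an assumption of this sketch; what matters is that client retries of the same intent always normalize to the same bytes. `request_hash` and `check_reuse` are illustrative names.

```python
import hashlib
import json


def request_hash(method, path, body):
    """SHA-256 over method, path, and a canonical form of the JSON body."""
    # Canonical JSON: sorted keys and no whitespace, so that equivalent
    # bodies produce the same hash regardless of key order.
    normalized = json.dumps(body, sort_keys=True, separators=(",", ":"))
    material = "\n".join([method.upper(), path, normalized])
    return hashlib.sha256(material.encode("utf-8")).hexdigest()


def check_reuse(stored_hash, method, path, body):
    """Return None if the request matches the stored one, else a status to reject with."""
    if request_hash(method, path, body) == stored_hash:
        return None
    return 422  # never replay the cached result for a mismatched request
```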

Step 3: Processing & Updates

Perform the business logic. Once finished, update the idempotency_entries row.

  • On Success: Update status = COMPLETED, save response_body, and commit.

  • On Business Failure (e.g., Validation Error): Update status = FAILED_FATAL, save the error response.


4. Semantic Nuances: Handling Errors Correctly

This is where most implementations fail. It is vital to distinguish between Result Caching and Error Caching.

The “Strict Cache” Fallacy

A common mistake is: “Always return the stored result, regardless of what it is.”

  • The Scenario: The DB connection drops. The handler catches the exception and saves a 500 Internal Server Error to the idempotency table.

  • The Consequence: The client retries 5 seconds later. The server is healthy, but the idempotency check sees the saved 500 and returns it. The user is now permanently blocked from completing that action with that key.

Rule: Never cache transient system errors.

If an error is retryable (DB timeout, network blip), roll back the transaction or delete the idempotency row. Let the next retry attempt a fresh execution.

Retryable vs. Final Semantics

We can categorize outcomes into three buckets:

  1. Success (COMPLETED):

    • The work is done.

    • Behavior: Always return the cached payload.

  2. Fatal Business Failure (FAILED_FATAL):

    • Examples: “Insufficient Funds,” “Invalid Address.”

    • Even if we retry, the result will not change (deterministic failure).

    • Behavior: Store this error. Return the cached error payload to prevent the client from hammering the system.

  3. Transient Failure (No State / FAILED_RETRYABLE):

    • Examples: “Deadlock,” “Timeout,” “3rd Party API 503.”

    • Behavior: Do not commit a failure state to the idempotency table. Ideally, ensure the IN_PROGRESS row is removed or rolled back so the next request can claim the lock.
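The three buckets translate into a small piece of control flow. In this sketch, `BusinessError` and `TransientError` are hypothetical exception types, `work` is a hypothetical callable holding the business logic, and the `201` and `503` status codes are placeholders. A real handler would ideally write the business data and the idempotency row in the same transaction.

```python
import json
import sqlite3


class BusinessError(Exception):
    """Hypothetical deterministic failure, e.g. insufficient funds."""
    def __init__(self, code, body):
        super().__init__(body)
        self.code, self.body = code, body


class TransientError(Exception):
    """Hypothetical retryable failure, e.g. a timeout or deadlock."""


def make_db():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE idempotency_entries ("
        " tenant_id TEXT NOT NULL, idempotency_key TEXT NOT NULL,"
        " status TEXT NOT NULL, response_code INTEGER, response_body TEXT,"
        " UNIQUE (tenant_id, idempotency_key))"
    )
    return conn


def run_claimed(conn, tenant_id, key, work):
    """Run `work()` for an already-claimed key and record the outcome."""
    where = "WHERE tenant_id = ? AND idempotency_key = ?"
    try:
        code, body, status = 201, work(), "COMPLETED"
    except BusinessError as exc:
        # Bucket 2: deterministic failure, safe to cache and replay.
        code, body, status = exc.code, exc.body, "FAILED_FATAL"
    except TransientError:
        # Bucket 3: release the claim so the next retry starts fresh.
        with conn:
            conn.execute(f"DELETE FROM idempotency_entries {where}", (tenant_id, key))
        return 503, {"error": "temporarily unavailable"}
    with conn:
        conn.execute(
            "UPDATE idempotency_entries "
            f"SET status = ?, response_code = ?, response_body = ? {where}",
            (status, code, json.dumps(body), tenant_id, key),
        )
    return code, body
```

Only the first two buckets leave a row behind; a transient failure leaves the key free to be claimed again.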


5. The Zombie Problem: Handling Crashes

What happens if the server crashes (power loss, OOM kill) after inserting IN_PROGRESS but before finishing?

The row remains stuck in IN_PROGRESS forever. Subsequent retries will see IN_PROGRESS and wait indefinitely or fail.

Solution: The Recovery Window

We cannot rely on the server to update the row (it’s dead). We must rely on the next request.

Modify the “Step 1” logic:

When a request encounters an existing IN_PROGRESS row:

  1. Check created_at (or last_heartbeat_at).

  2. If NOW() - created_at > MAX_PROCESSING_TIME (e.g., 60 seconds), assume the previous worker died.

  3. Steal the lock: Perform an atomic update:

    SQL

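    -- worker_id is an extra column, not part of the schema shown earlier.
    -- The last parameter is the cutoff: NOW() - MAX_PROCESSING_TIME.
    UPDATE idempotency_entries
    SET created_at = NOW(), worker_id = 'me'
    WHERE tenant_id = ? AND idempotency_key = ?
      AND status = 'IN_PROGRESS' AND created_at < ?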
    
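  4. If the update affects 1 row, proceed with processing. If it affects 0 rows, another request stole the lock first (or the original worker finished), so treat the key as taken and fall back to the normal Step 1 handling.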
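The lock-stealing step can be exercised end to end with `sqlite3`. The sketch below assumes a 60-second `MAX_PROCESSING_SECONDS`, stores timestamps as epoch seconds, and takes `now` as a parameter so the behavior is easy to test; `make_db` and `try_steal` are illustrative names.

```python
import sqlite3
import time

MAX_PROCESSING_SECONDS = 60  # assumed; tune to your slowest legitimate request


def make_db():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE idempotency_entries ("
        " tenant_id TEXT NOT NULL, idempotency_key TEXT NOT NULL,"
        " status TEXT NOT NULL, created_at REAL NOT NULL,"
        " UNIQUE (tenant_id, idempotency_key))"
    )
    return conn


def try_steal(conn, tenant_id, key, now=None):
    """Take over a stale IN_PROGRESS row. Returns True only for the single winner."""
    now = time.time() if now is None else now
    with conn:
        cur = conn.execute(
            "UPDATE idempotency_entries SET created_at = ? "
            "WHERE tenant_id = ? AND idempotency_key = ? "
            "AND status = 'IN_PROGRESS' AND created_at < ?",
            (now, tenant_id, key, now - MAX_PROCESSING_SECONDS),
        )
    # Resetting created_at makes the row "fresh" again, so a second
    # concurrent stealer matches zero rows and backs off.
    return cur.rowcount == 1
```

Because the staleness check and the timestamp reset happen in one statement, two requests racing to recover the same zombie cannot both win.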


6. External Systems (The Outbox Pattern)

Idempotency inside your database is “easy” (ACID transactions). It gets hard when calling Stripe/Twilio/AWS.

If you charge a card via Stripe and then your database fails to commit, you have a “Ghost Record” (money taken, no order record).

To solve this, use the Outbox Pattern or Idempotent Chains:

  1. Chaining: Pass your internal Idempotency Key (or a derivative of it) to the external provider.

    • Stripe API supports Idempotency-Key headers. Send your key to them. If you retry, Stripe handles the deduplication.
  2. The Outbox:

    • Instead of calling the API immediately, insert a row into an outbox_table in the same transaction as your business logic.

    • A background worker reads the outbox and calls the API (with retries).
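Here is a minimal outbox sketch with `sqlite3`. The `orders` table, the `sent` flag, and the `charge` callable are assumptions made for illustration; `charge` stands in for the external provider call and is expected to deduplicate on the key it receives, since the worker may call it more than once for the same row.

```python
import json
import sqlite3


def make_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE orders (id TEXT PRIMARY KEY, amount INTEGER NOT NULL);
        CREATE TABLE outbox_table (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            idempotency_key TEXT NOT NULL,
            payload TEXT NOT NULL,
            sent INTEGER NOT NULL DEFAULT 0
        );
    """)
    return conn


def place_order(conn, order_id, amount, idempotency_key):
    # One transaction: the order and the intent to charge commit together,
    # or neither does.
    with conn:
        conn.execute("INSERT INTO orders VALUES (?, ?)", (order_id, amount))
        conn.execute(
            "INSERT INTO outbox_table (idempotency_key, payload) VALUES (?, ?)",
            (idempotency_key, json.dumps({"order_id": order_id, "amount": amount})),
        )


def drain_outbox(conn, charge):
    """Background worker pass. `charge(key, payload)` is a hypothetical provider call."""
    rows = conn.execute(
        "SELECT id, idempotency_key, payload FROM outbox_table "
        "WHERE sent = 0 ORDER BY id"
    ).fetchall()
    for row_id, key, payload in rows:
        # If this raises, the row stays unsent and the next pass retries it
        # with the same key.
        charge(key, json.loads(payload))
        with conn:
            conn.execute("UPDATE outbox_table SET sent = 1 WHERE id = ?", (row_id,))
```

If the worker crashes after `charge` succeeds but before the row is marked as sent, the next pass calls the provider again with the same key, which is exactly the case the chained idempotency key is meant to absorb.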

7. Data Retention

Idempotency keys are not forever. UUID indices are heavy.

  • Window: Keep keys for 24–48 hours. This covers 99.9% of retry loops.

  • Cleanup: Use a TTL index (in Mongo/Dynamo) or a scheduled job (SQL) to delete old rows.

  • Constraint: If a client retries a key after it has been deleted, it is treated as a new request. This is an acceptable trade-off for system performance.
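For SQL databases, the scheduled cleanup might look like the following. The 24-hour `RETENTION_SECONDS`, the batch size, and the function names are assumptions of this sketch; deleting in batches keeps each transaction short on a large table. The example uses SQLite's implicit `rowid` to select a batch, which other databases would replace with their own primary key.

```python
import sqlite3
import time

RETENTION_SECONDS = 24 * 60 * 60  # assumed 24-hour window


def make_db():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE idempotency_entries ("
        " tenant_id TEXT NOT NULL, idempotency_key TEXT NOT NULL,"
        " created_at REAL NOT NULL,"
        " UNIQUE (tenant_id, idempotency_key))"
    )
    return conn


def purge_expired(conn, now=None, batch_size=1000):
    """Delete expired keys in batches; returns the total number of rows removed."""
    now = time.time() if now is None else now
    cutoff = now - RETENTION_SECONDS
    total = 0
    while True:
        with conn:  # one short transaction per batch
            cur = conn.execute(
                "DELETE FROM idempotency_entries WHERE rowid IN ("
                " SELECT rowid FROM idempotency_entries"
                " WHERE created_at < ? LIMIT ?)",
                (cutoff, batch_size),
            )
        total += cur.rowcount
        if cur.rowcount < batch_size:
            return total
```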


Summary Checklist

  1. Client: Uses UUID v7, retries on network/5xx errors, uses same key.

  2. Database: Unique constraint on (tenant_id, idempotency_key).

  3. Locking: Insert IN_PROGRESS first. Fail fast on duplicate.

  4. Validation: Hash parameters to prevent key reuse on different data.

  5. Errors: Never cache transient (5xx) errors. Only cache Success or Final (4xx business) failures.

  6. Recovery: Implement a timeout strategy for stuck IN_PROGRESS rows.

