In distributed systems, there is a fundamental axiom often derived from the Two Generals’ Problem: it is mathematically impossible to guarantee exactly-once delivery of messages over an unreliable network. Acknowledgments get lost, connections time out, and retries are inevitable.
Because we cannot prevent duplicate message delivery, we must design systems that can withstand it. The goal, therefore, is not exactly-once delivery, but exactly-once processing.
1. The Core Mechanism: Idempotency
To achieve exactly-once processing, operations must be idempotent - meaning the result of performing the operation once is the same as performing it multiple times.
The standard pattern for this is the Idempotency Key.
- Tag: The producer assigns a unique key to every message.
- Check: Upon receipt, the consumer checks if this key has already been processed.
- Act:
  - If seen: Discard the message (or return the previous result).
  - If new: Process the message.
The Atomicity Requirement
A critical failure mode discussed in system design is the “check-then-act” race condition.
For this pattern to work, the processing of the business logic and the recording of the idempotency key must happen atomically.
- Correct: Wrap the state change (e.g., `INSERT INTO orders`) and the key storage (`INSERT INTO processed_keys`) in a single ACID database transaction (see the sketch below).
- Failure Mode: If you process the order, commit, and then try to save the key, a crash in between results in a duplicate order upon retry.
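A minimal sketch of that transaction, using Python's built-in sqlite3 and illustrative table names (`processed_keys`, `orders`): the key insert and the business insert commit together or not at all.

```python
import sqlite3

def process_order(conn: sqlite3.Connection, idempotency_key: str, payload: str) -> None:
    """Exactly-once processing: key check, business write, and key recording in one transaction."""
    try:
        with conn:  # BEGIN ... COMMIT, or ROLLBACK if anything below raises
            # The primary-key constraint on processed_keys makes the duplicate check
            # atomic with the key insert itself (no check-then-act window).
            conn.execute("INSERT INTO processed_keys (key) VALUES (?)", (idempotency_key,))
            conn.execute("INSERT INTO orders (payload) VALUES (?)", (payload,))
    except sqlite3.IntegrityError:
        pass  # key already recorded: duplicate delivery, safely discarded

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE processed_keys (key TEXT PRIMARY KEY);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, payload TEXT);
""")
process_order(conn, "key-123", "2x coffee beans")
process_order(conn, "key-123", "2x coffee beans")  # retry of the same message
assert conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0] == 1
```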
2. Choosing the Right Key Strategy
The “best” key depends entirely on your throughput, storage indexes, and producer architecture.
A. Random Identifiers (UUIDv4)
The producer generates a standard random UUID for every message.
- Pros: Stateless; producers don’t need to coordinate; trivial to implement.
- Cons: Unbounded Storage Growth. To guarantee uniqueness, the consumer must store every UUID ever received.
- Mitigation: Use UUIDv7 or ULID. These embed a timestamp in the identifier. The consumer can then enforce a “retention window” (e.g., “reject any key older than 7 days”). While this technically breaks strict exactly-once guarantees for very old duplicates, it is a pragmatic tradeoff for most systems.
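UUIDv7 places a 48-bit Unix-millisecond timestamp in its first six bytes, so a consumer can parse it and enforce the retention window. A rough sketch (the 7-day window and the helper name are illustrative):

```python
import time
import uuid

RETENTION_MS = 7 * 24 * 60 * 60 * 1000  # "reject any key older than 7 days"

def within_retention_window(key: uuid.UUID) -> bool:
    # The first 48 bits of a UUIDv7 are the Unix timestamp in milliseconds.
    created_ms = int.from_bytes(key.bytes[:6], "big")
    return (time.time() * 1000) - created_ms <= RETENTION_MS
```

Keys that fail the check can be rejected outright, and rows older than the window pruned from the key index on the same schedule.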
B. Deterministic Content Hashing (UUIDv5)
Instead of a random ID, the key is derived from the message content itself (a hash of the namespace + payload).
- Pros: If a producer unknowingly sends the same logical request twice, the key is identical. It enables “stateless” deduplication.
- Cons: False Positives. If a user legitimately wants to buy the same item twice in a row, a content-hash key might incorrectly reject the second purchase.
- Best Practice: Hash the intent (e.g., `hash(cart_id + timestamp)`), not just the payload.
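A sketch using Python's `uuid.uuid5`; the namespace UUID and the field names (`cart_id`, a client-supplied timestamp) are assumptions for illustration:

```python
import uuid

# Fixed, application-specific namespace; any constant UUID will do.
ORDERS_NS = uuid.uuid5(uuid.NAMESPACE_DNS, "orders.example.com")

def content_key(cart_id: str, client_timestamp: str) -> uuid.UUID:
    # Hash the *intent* (which cart, and when the user acted), not the raw payload,
    # so a genuine repeat purchase still gets a fresh key.
    return uuid.uuid5(ORDERS_NS, f"{cart_id}:{client_timestamp}")

# Duplicate deliveries of the same logical request collapse to one key:
assert content_key("cart-42", "2024-05-01T10:00:00Z") == content_key("cart-42", "2024-05-01T10:00:00Z")
# A later, intentional second purchase produces a different key:
assert content_key("cart-42", "2024-05-01T10:05:00Z") != content_key("cart-42", "2024-05-01T10:00:00Z")
```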
C. Monotonic Sequences (The “High Watermark”)
The producer uses strictly increasing integers (1, 2, 3…).
- Pros: O(1) Storage. The consumer only needs to store the highest ID seen (the “High Watermark”). If an incoming ID is less than or equal to the watermark, it’s a duplicate.
- Cons: Hard to generate. Producing strictly monotonic numbers in a distributed, multi-threaded environment creates a bottleneck.
3. Solving the Producer Concurrency Problem
If you choose Monotonic Sequences for their consumer efficiency, you must solve the producer bottleneck. If Thread A takes ID 100 and Thread B takes ID 101, but B finishes first, the consumer will see 101 and set the watermark. When 100 arrives later, it is incorrectly dropped.
Here are three ways to solve this:
The Hi/Lo Algorithm
To avoid hitting the database for every sequence number, use the Hi/Lo Algorithm:
- Hi: The database provides a “block” of IDs (e.g., 1000 at a time) to a producer instance.
- Lo: The producer increments IDs within that block in memory.
This reduces database contention significantly (1 request per 1000 messages) while maintaining uniqueness, though strict monotonicity across multiple producers requires careful partition management.
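A sketch of a Hi/Lo generator. The `fetch_next_hi` callback stands in for a database sequence (one round-trip per block), and a lock keeps the in-memory `lo` counter safe across threads:

```python
import threading

BLOCK_SIZE = 1000  # "Lo" IDs handed out per database round-trip

class HiLoGenerator:
    def __init__(self, fetch_next_hi):
        self._fetch_next_hi = fetch_next_hi  # e.g., increments and returns a DB sequence value
        self._lock = threading.Lock()
        self._hi = 0
        self._lo = BLOCK_SIZE  # forces a block fetch on first use

    def next_id(self) -> int:
        with self._lock:
            if self._lo >= BLOCK_SIZE:
                self._hi = self._fetch_next_hi()  # 1 DB call per 1000 messages
                self._lo = 0
            self._lo += 1
            return self._hi * BLOCK_SIZE + self._lo

# In-memory stand-in for the database sequence, for demonstration only.
_seq = iter(range(1, 1_000))
gen = HiLoGenerator(fetch_next_hi=lambda: next(_seq))
print(gen.next_id(), gen.next_id())  # 1001, 1002 (both from block hi=1)
```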
Log-Based Change Data Capture (CDC)
Instead of generating IDs in the application layer, use the database’s Transaction Log.
- Outbox Pattern: The producer writes the message intent to an `outbox` table in the same transaction as the business logic.
- Derive Key: A CDC tool (like Debezium) reads the database Write-Ahead Log (WAL).
- Composite Key: In PostgreSQL, the Log Sequence Number (LSN) is monotonic. The idempotency key becomes {Commit LSN, Event LSN}.
This allows you to “have your cake and eat it too”—high throughput production with monotonic keys for the consumer.
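A sketch of the outbox write (sqlite3 again, schema illustrative): the business row and the outbox row share one transaction, and a CDC tool such as Debezium later publishes the outbox rows, keyed by their position in the log.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, payload TEXT);
    CREATE TABLE outbox (id INTEGER PRIMARY KEY, aggregate TEXT, aggregate_id INTEGER,
                         event_type TEXT, body TEXT);
""")

def place_order(customer_id: int, payload: dict) -> None:
    # Business state and message intent commit (or roll back) together.
    with conn:
        cur = conn.execute(
            "INSERT INTO orders (customer_id, payload) VALUES (?, ?)",
            (customer_id, json.dumps(payload)),
        )
        conn.execute(
            "INSERT INTO outbox (aggregate, aggregate_id, event_type, body) VALUES (?, ?, ?, ?)",
            ("order", cur.lastrowid, "OrderPlaced", json.dumps(payload)),
        )

place_order(7, {"item": "coffee beans", "qty": 2})
```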
Single-Threaded Partitioning
Tools like Kafka handle this by serializing messages per partition. The “Offset” acts as a naturally monotonic idempotency key. This shifts the complexity from the database to the message broker infrastructure.
4. The “Side Effect” Trap
The atomic transaction model works for database updates. But what if your message processing involves calling an external API (e.g., Stripe, Salesforce)? You cannot roll back a REST call inside a database transaction.
If the DB transaction rolls back but the API call succeeded, you have created a phantom state.
- Solution 1: Idempotency Propagation. Pass your idempotency key to the downstream service (e.g., Stripe accepts an `Idempotency-Key` header); see the sketch below.
- Solution 2: The Saga Pattern. Break the transaction into steps. If the local DB commit fails, trigger a “compensating transaction” (e.g., a refund) to undo the external side effect.
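A sketch of Solution 1 with the third-party `requests` library; the endpoint and form fields follow Stripe's public charges API, but treat the details as illustrative.

```python
import uuid
import requests

def create_charge(api_key: str, amount_cents: int, currency: str, source: str,
                  idempotency_key: str) -> requests.Response:
    # Propagating our key downstream means a retried call cannot double-charge.
    return requests.post(
        "https://api.stripe.com/v1/charges",
        headers={"Authorization": f"Bearer {api_key}",
                 "Idempotency-Key": idempotency_key},
        data={"amount": amount_cents, "currency": currency, "source": source},
    )

key = str(uuid.uuid4())                                              # generated once per logical operation
create_charge("sk_test_placeholder", 1000, "usd", "tok_visa", key)
create_charge("sk_test_placeholder", 1000, "usd", "tok_visa", key)   # retry: same key, one charge
```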
Summary Comparison
| Strategy | Storage Cost (Consumer) | Implementation Complexity | Best Use Case |
|---|---|---|---|
| UUIDv4 | High (Index everything) | Low | Low-to-medium volume; simple setups. |
| UUIDv7/ULID | Medium (Prunable index) | Low | High volume where “retention windows” are acceptable. |
| UUIDv5 | Zero (Deterministic) | Medium | Content-based deduplication; watch for false positives. |
| Monotonic / CDC | Very Low (High Watermark) | High (Requires CDC or complex producing) | Massive scale; systems requiring strict ordering; Kafka consumers. |
Note
While TCP guarantees packet ordering at the transport layer, it cannot solve application-level duplicates caused by crash-recovery cycles. Whether you use UUIDs with a TTL or complex CDC pipelines, the principle remains: assume the network will lie to you, and trust only your persisted state.
Implementation gotchas
1. The Danger of “Natural” or Business Idempotency Keys
A common mistake is to derive the idempotency key from business data (e.g., user_id + product_id) instead of requiring a random UUID. While this feels “cleaner,” it introduces a major risk: Semantic Drift.
The Collision Problem
Imagine a subscription service where the key is generated as membership_id + month.
- The Intent: Prevent charging a user twice for the same month.
- The Change: The business introduces “Add-on Packs” that can be purchased multiple times a month.
- The Failure: The old idempotency logic sees the same key and rejects the second purchase as a duplicate, even though it’s a valid, separate transaction.
Strategy: Versioning and Intent Scoping
If you cannot use opaque UUIDs and must rely on business data, you must include a Version or Intent prefix in your hashing logic.
Rule of Thumb: If the definition of a “unique action” changes, the key generation algorithm must change too.
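One way to apply the rule, sketched with hypothetical field names: prefix the hash input with an intent label and a scheme version, so that redefining what counts as a unique action changes every generated key.

```python
import hashlib

KEY_SCHEME_VERSION = "v2"  # bump whenever the definition of a "unique action" changes

def business_key(intent: str, membership_id: str, month: str) -> str:
    # "intent" separates e.g. the monthly subscription charge from an add-on purchase,
    # so the new add-on packs never collide with the subscription's key.
    raw = f"{KEY_SCHEME_VERSION}|{intent}|{membership_id}|{month}"
    return hashlib.sha256(raw.encode()).hexdigest()

assert business_key("monthly-subscription", "m-17", "2024-06") != business_key("addon-purchase", "m-17", "2024-06")
```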
2. The “Payload Mismatch” Trap
This is the most frequent implementation bug. It occurs when a client reuses an Idempotency Key but changes the request parameters.
- The Scenario:
  - Client sends `POST /transfer { amount: 100 }` with Key: `uuid-1`.
  - Server processes it and saves the result.
  - Client (due to a bug or malice) sends `POST /transfer { amount: 9000 }` with Key: `uuid-1`.
- The Gotcha: A naive implementation just checks “Does `uuid-1` exist?” It sees “Yes, status: COMPLETED” and returns the saved success response from the $100 transfer.
  - Result: The client thinks they successfully transferred $9,000, but only $100 actually moved.
- The Fix: You must store a hash (checksum) of the request body alongside the key. If the key exists but the hash doesn’t match the current request, throw a 422 Unprocessable Entity or 409 Conflict.
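A sketch of the key-plus-checksum check; an in-memory dict stands in for the idempotency table, and `process` is whatever handler performs the transfer.

```python
import hashlib
import json

_records: dict[str, tuple[str, dict]] = {}  # key -> (request body hash, saved response)

def handle(idempotency_key: str, body: dict, process) -> tuple[int, dict]:
    body_hash = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
    if idempotency_key in _records:
        saved_hash, saved_response = _records[idempotency_key]
        if saved_hash != body_hash:
            # Same key, different payload: refuse instead of replaying the old result.
            return 422, {"error": "idempotency key reused with a different request body"}
        return 200, saved_response  # genuine retry: replay the original response
    response = process(body)
    _records[idempotency_key] = (body_hash, response)
    return 201, response

handle("uuid-1", {"amount": 100}, lambda b: {"transferred": b["amount"]})
print(handle("uuid-1", {"amount": 9000}, lambda b: {"transferred": b["amount"]}))  # -> (422, ...)
```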
3. The “Burned Key” on Failure
Deciding what to do when the first attempt fails is tricky.
- The Scenario: Client sends a request. The database is temporarily down. The server catches the exception and records the key status as `FAILED`.
- The Gotcha: The client retries the request (as they should for a 500 error). The server sees the key exists with status `FAILED` and returns the error again—forever. You have effectively “burned” the key on a transient error.
- The Fix:
  - Transient Errors (Network/DB Connection): Do not save the key, or roll back the transaction entirely so the key is never persisted. Allow the retry to proceed as a fresh request.
  - Terminal Errors (Validation, Business Logic): Save the key as `FAILED`. If the client retries, they should get the same validation error.
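A sketch of that branching; the exception classes and the `store` interface are illustrative, not a specific framework.

```python
class TransientError(Exception):
    """Infrastructure hiccup (DB connection, network timeout): retrying may succeed."""

class TerminalError(Exception):
    """Validation or business-rule failure: retrying will never succeed."""

def handle(key: str, request, store, process):
    try:
        result = process(request)
    except TransientError:
        # Do NOT persist the key; return a 500 so the retry arrives as a fresh request.
        return 500, {"error": "temporary failure, please retry"}
    except TerminalError as exc:
        # Persist the key as FAILED so every retry gets the same deterministic answer.
        store.save(key, status="FAILED", response={"error": str(exc)})
        return 400, {"error": str(exc)}
    store.save(key, status="COMPLETED", response=result)
    return 201, result
```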
4. Namespace Collisions (Data Leaks)
This is a critical security vulnerability.
- The Scenario: You rely solely on the `Idempotency-Key` header for uniqueness.
  - User A sends Key: `order-1`.
  - User B (malicious or accidental) sends Key: `order-1`.
- The Gotcha: The server sees `order-1` is completed and returns the cached response. User B just received User A’s order confirmation details, including potential PII.
- The Fix: Never use the Idempotency Key as the global primary key. The composite primary key must be `{user_id, idempotency_key}`. User B cannot access User A’s keys.
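A sketch of the scoping in SQL, run here through sqlite3 (table and column names illustrative): the key is only unique within a user.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE idempotency_records (
        user_id         TEXT NOT NULL,
        idempotency_key TEXT NOT NULL,
        response_body   TEXT,
        PRIMARY KEY (user_id, idempotency_key)  -- scoped per user, never global
    );
""")

def cached_response(user_id: str, key: str):
    row = conn.execute(
        "SELECT response_body FROM idempotency_records WHERE user_id = ? AND idempotency_key = ?",
        (user_id, key),
    ).fetchone()
    return row[0] if row else None

conn.execute("INSERT INTO idempotency_records VALUES (?, ?, ?)",
             ("user-a", "order-1", "user A's confirmation"))
assert cached_response("user-b", "order-1") is None  # User B cannot see User A's response
```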
5. The “Zombie Worker” (TTL Exhaustion)
This applies if you use Redis/Memcached locks instead of database transactions.
- The Scenario:
  - Worker A locks Key `X` with a 30-second TTL (Time-To-Live).
  - Worker A gets stuck in a Garbage Collection pause or slow network call for 35 seconds.
  - Redis expires the lock.
  - Worker B picks up the retry, locks Key `X`, and starts processing.
  - Worker A wakes up and finishes processing.
- The Gotcha: Both workers process the transaction. You have double-charged the customer despite having an “idempotency” lock.
- The Fix: Use a Fencing Token.
  - When acquiring a lock, increment a token (e.g., version 1, version 2).
  - When performing the final write/side-effect, check that the token hasn’t been superseded. If the database sees a write from Token 1 but Token 2 already exists, reject the write.
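A sketch of a fencing check at the final write, with in-memory stand-ins for the lock service and the data store: each lock acquisition hands out a larger token, and writes carrying a superseded token are rejected.

```python
import itertools

_tokens = itertools.count(1)        # monotonic fencing tokens (the lock service)
_highest_seen: dict[str, int] = {}  # highest token accepted per key (the data store)

def acquire_lock(key: str) -> int:
    # In Redis this would be a SET ... NX EX 30 plus a counter for the token.
    return next(_tokens)

def fenced_write(key: str, token: int, do_write) -> bool:
    if token < _highest_seen.get(key, 0):
        return False                # zombie worker: its lock was superseded
    _highest_seen[key] = token
    do_write()
    return True

token_a = acquire_lock("X")         # Worker A, token 1
token_b = acquire_lock("X")         # Worker B takes over after A's TTL expires, token 2
fenced_write("X", token_b, lambda: print("worker B commits"))
assert fenced_write("X", token_a, lambda: print("worker A commits")) is False
```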
6. Client-Side Key Rotation
Idempotency relies on the client behaving correctly.
- The Scenario: The client sends a request. The server processes it but the response times out (network cut). The client library sees a timeout.
- The Gotcha: The client code catches the timeout, generates a new UUID, and retries.
- Result: Since the key is new, the server treats it as a new request. Exactly-once processing is broken because the client failed to hold onto the original key.
- The Fix: This is a documentation and client-library issue. You must educate consumers that retries must reuse the same key.
7. The “Resource Deleted” Race
Returning a cached response blindly can be confusing if the world changed in the meantime.
- The Scenario:
  - User creates `Order-1` (idempotent). Success.
  - User deletes `Order-1`.
  - User retries the creation of `Order-1` (maybe an old browser tab refreshed).
- The Gotcha: The idempotency system sees the key `Order-1` was “successfully created” in the past and returns 200 OK with the order details. The user thinks the order is back, but it doesn’t actually exist in the `orders` table anymore.
- The Fix: This is a philosophical design choice.
  - Strict Idempotency: Return the original success (technically correct: “At time T, this succeeded”).
  - State-Aware Idempotency: Check if the resulting resource still exists. If not, return 404 or 410 Gone. (This is harder to implement, as it breaks the separation of concerns.)
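A sketch of the state-aware variant; in-memory dicts stand in for the idempotency store and the orders table.

```python
def replay_or_fall_through(key: str, idempotency_store: dict, orders: dict):
    """Only replay the cached success if the resource it created still exists."""
    record = idempotency_store.get(key)
    if record is None:
        return None                                  # unknown key: run normal creation logic
    if record["order_id"] in orders:
        return 200, record["response"]               # cached success is still valid
    return 410, {"error": "the resource created by this key was later deleted"}
```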
Resource Deleted Race
Below are some other ideas for how to handle the deletion problem.
- The “State Validation” Fix (Recommended)
Instead of blindly returning the cached response, the idempotency layer should perform a lightweight check against the primary database to ensure the resource still exists.
  - The Logic: If the idempotency key exists and points to a “Success” state, verify that the record still exists in the `Orders` table.
  - The Outcome:
    - If the order exists: Return the cached 200 OK.
    - If the order is missing: Treat the request as a brand new request. Re-run the creation logic, or return a 409 Conflict explaining that the ID was previously used and deleted.
- The “Cascading Deletion” Fix
When a user deletes a resource, the system must also invalidate or delete the associated idempotency key.
  - The Logic: Treat the idempotency record and the resource as a single unit. In your `DeleteOrder` service, wrap the database deletion and the idempotency cache eviction in a single transaction (or a reliable distributed routine), as sketched below.
  - The Outcome: When the user retries the creation, the idempotency store has no record of it. The system sees it as a fresh request and successfully recreates the order.
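A sketch of that cascading delete (sqlite3, illustrative schema): the order row and its idempotency record are removed in one transaction, so a later retry of the creation is treated as fresh.

```python
import sqlite3

def delete_order(conn: sqlite3.Connection, order_id: int, user_id: str, idempotency_key: str) -> None:
    with conn:  # both deletes commit together, or neither does
        conn.execute("DELETE FROM orders WHERE id = ?", (order_id,))
        conn.execute(
            "DELETE FROM idempotency_records WHERE user_id = ? AND idempotency_key = ?",
            (user_id, idempotency_key),
        )
```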
- The “Soft Delete” Strategy
Instead of physically removing the row from the orders table, we use a `deleted_at` timestamp.
  - The Logic: The idempotency system returns the cached response. However, because the record still exists (just marked as deleted), you can design your API to:
    - Automatically “un-delete” it.
    - Return a 410 Gone or a 409 Conflict specifically stating that this resource was previously deleted and the ID cannot be reused.