Distributed Transactions

The ACID transactions familiar from a monolith are not natural in distributed environments. “Wrap payment and inventory deduction in one transaction” is a single line inside one DB but loses guarantees the moment it crosses two services. The A — Atomicity — of a single transaction disappears at the service boundary.

Distributed transactions are about how a single ACID transaction decomposes and how its pieces are reassembled. Two branches: bind the distributed commits in one synchronous round, or let each commit locally and recover from failures through compensation.

flowchart LR
  A[Monolith Single Transaction
ACID] --> B[Decomposed at Service Boundary]
  B --> C{Reassembly Strategy}
  C -->|Synchronous consensus| D[2PC]
  C -->|Local commits + compensation| E[Saga / TCC]
  E --> F[Outbox
DB ↔ Event Consistency]

2PC

2PC (Two-Phase Commit) attempts to commit distributed pieces at once. A coordinator asks every participant whether they are ready (prepare), and if all agree, sends “commit” (commit phase).

sequenceDiagram
    participant C as Coordinator
    participant P1 as Participant 1
    participant P2 as Participant 2
    participant P3 as Participant 3

    Note over C,P3: Phase 1 — Prepare
    C->>P1: prepare
    C->>P2: prepare
    C->>P3: prepare
    P1-->>C: vote yes
    P2-->>C: vote yes
    P3-->>C: vote yes

    Note over C,P3: Phase 2 — Commit
    C->>P1: commit
    C->>P2: commit
    C->>P3: commit
    P1-->>C: ack
    P2-->>C: ack
    P3-->>C: ack

The mechanism is clean but the cost is steep. After the prepare phase, each participant locks resources until commit or abort arrives — long lock waits. If the coordinator fails between phases, participants can wait indefinitely — a single point of failure. And two round trips of latency add up on every transaction.

Failure modes have a clear structure. If a participant votes no in prepare, or simply does not respond, the coordinator broadcasts abort and the protocol terminates cleanly. Voting yes binds the participant: it records “prepared” in its WAL and undertakes to commit when told to. The commit phase is light work, and a participant that crashes in the prepared state recovers by reading its WAL and querying the coordinator for the decision.

Two cases actually break consistency. If the coordinator dies after sending commit to only some participants, blocking occurs. Or the participant fails to actually commit after voting yes, due to disk failure or hardware fault — that case is outside the protocol’s guarantees, a data-corruption scenario requiring manual recovery at the application or operations layer.

2PC achieves strong consistency, but the trade-off is explicit: availability and performance paid for consistency. It is rarely used in modern MSA, and when it is, only on narrow paths where strong consistency is essential.

Saga / TCC

Saga splits a business transaction into multiple local transactions across services and runs them sequentially. If a step fails, compensating transactions roll back the previous steps.

Take an order flow: create order → charge payment → reserve inventory → register shipment. If inventory reservation fails, payment is refunded (compensation), the order is canceled (compensation). The single-transaction rollback becomes explicit compensation logic spread across services.

Saga embraces eventual consistency. Brief gaps exist between steps, and during those gaps the system is in a temporarily inconsistent state. Whether that inconsistency is acceptable to the business is the core condition for adopting Saga.

Compensation failure is the hard part of Saga. Once a forward step has committed, its compensation must complete — Saga has no “give up” state. Compensations must be idempotent and retryable. If indefinite retry is not enough, recovery either pushes forward to another valid endpoint (forward recovery — credit a wallet when a card refund fails), or escalates to manual operations.

Choreography

Each service publishes an event, and other services subscribe to advance their step. There is no central coordinator. When “OrderCreated” is published, the payment service consumes it, charges, and publishes “PaymentCompleted.” The inventory service consumes that, reserves stock, and publishes “InventoryReserved.” Each service knows its own trigger and compensation logic.

sequenceDiagram
    participant O as Order
    participant P as Payment
    participant I as Inventory
    participant S as Shipment

    O->>P: OrderCreated
    P->>I: PaymentCompleted
    I->>S: InventoryReserved

    Note over O,S: On failure — compensation events in reverse
    S-->>I: ShipmentFailed
    I-->>P: InventoryReleased
    P-->>O: PaymentRefunded

Orchestration

A central coordinator — a Saga state machine or orchestrator — controls the flow explicitly. It sends a command to the payment service, receives the response, then commands the inventory service. On failure, it issues compensation commands in reverse order.

sequenceDiagram
    participant Or as Orchestrator
    participant P as Payment
    participant I as Inventory
    participant S as Shipment

    Or->>P: charge
    P-->>Or: ok
    Or->>I: reserve
    I-->>Or: ok
    Or->>S: register shipment
    S-->>Or: failed

    Note over Or,S: On failure response — compensation commands in reverse
    Or->>I: release inventory
    Or->>P: refund payment

Trade-offs Between the Two

Coupling. Choreography has no direct service-to-service dependencies, but the event sequence is distributed across services. Orchestration centralizes flow knowledge in the coordinator while services remain unaware of each other.
Visibility. To see how far a transaction has progressed: in Choreography, you trace logs across multiple services; in Orchestration, the coordinator’s state is enough.
Debugging. Following compensation flow on failure is far more direct in Orchestration. Choreography becomes hard to trace as the event graph grows.
Cohesion of business logic. Concentrating the flow of one business transaction in one place is the strength of Orchestration. Choreography splits the flow across services.

In smaller systems Choreography is a light starting point, and migrating to Orchestration as flows grow complex tends to fit well. The orchestrator is another service, though, so its added complexity has to be acknowledged.

TCC

TCC (Try-Confirm-Cancel) adapts the same compensation principle into a business-level reservation. The Try phase reserves resources; if all Try calls succeed, Confirm runs; if any fails, Cancel — the compensation — is invoked.

sequenceDiagram
    participant C as Coordinator
    participant P1 as Participant 1
    participant P2 as Participant 2
    participant P3 as Participant 3

    Note over C,P3: Phase 1 — Try
    C->>P1: try
    C->>P2: try
    C->>P3: try
    P1-->>C: reserved
    P2-->>C: reserved
    P3-->>C: reserved

    Note over C,P3: Phase 2 — Confirm
    C->>P1: confirm
    C->>P2: confirm
    C->>P3: confirm
    P1-->>C: ack
    P2-->>C: ack
    P3-->>C: ack

The difference from Saga is lock duration. Saga commits each step locally and immediately, so other transactions freely see the intermediate state. TCC keeps a reservation in place — a “reserved seat” or a “balance held for processing” — that semantically locks part of the resource during the Try-to-Confirm window. Not a real DB lock, but a brief semantic one.

The cost is explicit: every participating service must expose a consistent reserve/confirm/cancel trio of APIs. It gives up some of Saga’s simplicity to shorten the lock window.

Outbox

Saga and other event-driven patterns share one core limitation: DB write and event publication are not in the same transaction.

Suppose state is saved to the DB and that fact is then published as an event. The two operations are separate. If the service fails after the DB write but before publishing, the event is lost. The reverse — publishing first, then writing — leaves an event announcing something that did not happen. Either order can break consistency.

The Outbox Pattern solves this by binding the two operations inside the same DB transaction. Alongside the business data write, the event to publish is recorded in a separate outbox table within the same transaction. When the transaction commits, the event is safely persisted in outbox. A separate publisher process then polls outbox or detects changes via CDC (Change Data Capture) and publishes to a message broker (commonly Kafka).

This setup naturally accepts at-least-once delivery. If the publisher fails after publishing but before deleting the outbox row, the same event can be published again. Consumers must therefore be idempotent, designed so that processing the same event twice yields the same result.

Outbox guarantees consistency between DB and broker.

Pattern Selection Criteria

The decomposition criteria — domain boundary, data ownership, scale pattern, failure boundary — connect to distributed transaction pattern selection.

When the domain split forces strong consistency between two domains that is a sign the boundary was drawn wrong. If strong consistency is genuinely required, merging into one domain is preferable; if separation is unavoidable, restrict 2PC to a narrow path.
When data ownership requires one service’s change to update another’s view or cache Saga (Choreography) plus Outbox fits well. Events propagate the update; Outbox guarantees consistency.
When the scale pattern demands burst absorption Saga (Choreography) depends on a queue to flatten load.
When critical and non-critical paths are separated by failure boundary non-critical paths accept eventual consistency via Saga; critical paths use strong consistency, or are redesigned into a boundary that allows a single transaction.
When multiple transactions contend for the same resource and over-allocation is not acceptable TCC’s reservation fits. A shorter semantic lock than Saga ensures correctness, without going to 2PC’s DB lock — a middle point.

Treating pattern choice as a natural consequence of boundary decisions rather than a technology preference turns the question “should we use Saga here” into “does the decomposition criterion demand this pattern.”

The Cost of Simplicity

Distributed transactions are not the design of the happy path; they are the design of failure recovery. The happy path looks simple under any pattern. Differences surface in failure cases — partial failures, lost messages, duplicate processing, data inconsistency. Evaluating a pattern by mentally running its failure paths, not its success paths, is the honest measure.

In a monolith, a single ACID transaction provided that recovery by default. In a distributed environment, the same simplicity becomes the result of explicit cost paid. Compensation logic, idempotency design, outbox infrastructure, orchestrator operations, debugging tools. Reach for a pattern without acknowledging that cost and the system runs fine on the happy path until the first partial failure breaks consistency.

So pattern choice is another name for failure recovery design. Sketch what failures can occur and what state the system should recover to, then choose the pattern that fits the sketch.

References

Microservices Architecture — Decomposition criteria (domain boundary, data ownership, scale, failure) and inter-service communication. The premise behind distributed transaction pattern selection.
Kafka Fundamentals and KRaft Mode — Kafka producer/consumer mechanics and partition/offset semantics. Background for where Outbox sits between DB and broker for consistency.

2PC#

Saga / TCC#

Choreography#

Orchestration#

Trade-offs Between the Two#

TCC#

Outbox#

Pattern Selection Criteria#

The Cost of Simplicity#

References#

2PC