When your SaaS calls third-party APIs on behalf of customers, you store their OAuth access and refresh tokens. You are a credential custodian: if your token store is breached, attackers reach not just your data but your customers' data on every connected platform. This post documents the token vault pattern, a centralized token service that owns all credential storage, decryption, refresh, and outbound mediation, so that application code never touches raw third-party tokens.

System description

A centralized token service encrypts and stores OAuth credentials per tenant, refreshes them under lock, and makes outbound API calls on behalf of application code. Application code sends requests in authenticated tenant context, specifying integration_id; it never receives, caches, or persists credentials itself.

After the initial OAuth consent exchange, the API passes tokens to the token service for encrypted storage. At runtime, every outbound call using customer credentials flows through the token service; application code has no direct path to the third-party API.

Golden path

Build this first. Then relax constraints only if you have a specific reason:

Authenticate caller → derive tenant from auth context → resolve credential in token service → refresh under lock → make outbound call via token service → return response without credential

Each step is a gate. If the credential is missing, the lock cannot be acquired, or the provider rejects a refresh, the outbound call fails and the tenant is notified.

If you outsource token custody to an external broker, you are delegating this pattern rather than building it. Evaluate the broker's encryption, tenant isolation, and incident response posture the way you would evaluate your own. The trade-off shifts from "you own the implementation" to "you depend on a third party for every outbound call".

Minimal system context

  • API / control plane (authorization): authenticates tenants, handles OAuth consent flow, dispatches outbound requests to the token service

  • Token service (credential management, outbound mediation): stores, decrypts, refreshes, and uses third-party tokens

  • Encrypted token store (data plane): database table holding encrypted token records, scoped by tenant_id

  • KMS / HSM (key management): manages key encryption keys; the token service calls KMS to unwrap per-tenant data encryption keys

  • Distributed lock backend (refresh serialization): Redis, DB advisory locks, or equivalent. Prevents concurrent refresh of the same token

  • Cleanup worker (revocation): periodic job that catches zombie tokens for disabled or churned tenants

Core design

Token service (credential management and outbound mediation)

Application code calls two functions:

  • execute_request(auth_context, integration_id, request_spec): The primary interface. The token service extracts tenant_id from the caller's verified auth context, resolves the credential for this tenant and integration, refreshes under lock if expired, injects the Authorization header, makes the outbound HTTP call, and returns the response. Application code sends a request description (method, path, body); it never sees the token

  • store_token(auth_context, integration_id, token_response): Called after a successful OAuth exchange. Extracts tenant_id from auth context, encrypts the access and refresh tokens with the tenant's DEK, and persists the record
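The shape of this two-function interface can be sketched as follows. This is a minimal illustration, not a reference implementation: `AuthContext`, `RequestSpec`, `IntegrationNotFound`, and the injected `transport` callable are hypothetical names, and tokens are held in a plain dict only to keep the sketch short (encryption and refresh are omitted here).

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


@dataclass(frozen=True)
class AuthContext:
    tenant_id: str  # set by the authentication layer, never by request input


@dataclass(frozen=True)
class RequestSpec:
    method: str
    path: str
    body: Optional[dict] = None


class IntegrationNotFound(Exception):
    """Raised for any failed lookup, including cross-tenant ones."""


# Hypothetical transport: (method, path, headers, body) -> (status, body)
Transport = Callable[[str, str, Dict[str, str], Optional[dict]], Tuple[int, dict]]


class TokenService:
    def __init__(self, transport: Transport):
        self._transport = transport
        # Plaintext dict for brevity only; a real store holds ciphertext.
        self._tokens: Dict[Tuple[str, str], str] = {}

    def store_token(self, auth: AuthContext, integration_id: str,
                    token_response: dict) -> None:
        # The key always includes the tenant from the verified auth context.
        key = (auth.tenant_id, integration_id)
        self._tokens[key] = token_response["access_token"]

    def execute_request(self, auth: AuthContext, integration_id: str,
                        spec: RequestSpec) -> dict:
        token = self._tokens.get((auth.tenant_id, integration_id))
        if token is None:
            raise IntegrationNotFound("integration not found")
        headers = {"Authorization": f"Bearer {token}"}
        status, body = self._transport(spec.method, spec.path, headers, spec.body)
        # Only the provider's response goes back to the caller.
        return {"status": status, "body": body}
```

Injecting the transport keeps the sketch testable without network access; in a real service it would wrap an HTTP client pinned to the provider's base URL.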

Large payloads: The execute_request proxy pattern is designed for standard API calls (JSON request / response). For large file uploads or bulk data exports, stream request and response bodies end-to-end without buffering, or use a dedicated network proxy (Envoy, nginx) where the token service supplies credentials at the edge rather than acting as the data plane itself.

Encrypted token store (data plane)

A database table holding encrypted token records. Minimum fields:

  • integration_id: Unique per tenant-provider connection

  • tenant_id: The isolation boundary (immutable after creation, included in every query)

  • provider: The OAuth provider (e.g., salesforce, google, slack)

  • encrypted_access_token, encrypted_refresh_token: Ciphertext (AES-256-GCM)

  • encrypted_dek: The data encryption key, wrapped by the tenant's KEK

  • scopes: The granted scope string (stored to detect drift)

  • expires_at, created_at, revoked_at
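As an illustration, the fields above could map to a table like the one below, with every read going through a tenant-scoped query. This is a sketch using SQLite from the Python standard library; the column types, the composite primary key, and the `load_token_row` helper are assumptions, not part of the pattern's specification.

```python
import sqlite3

SCHEMA = """
CREATE TABLE tokens (
    integration_id          TEXT NOT NULL,
    tenant_id               TEXT NOT NULL,
    provider                TEXT NOT NULL,
    encrypted_access_token  BLOB NOT NULL,
    encrypted_refresh_token BLOB,
    encrypted_dek           BLOB NOT NULL,
    scopes                  TEXT NOT NULL,
    expires_at              TEXT NOT NULL,
    created_at              TEXT NOT NULL,
    revoked_at              TEXT,
    PRIMARY KEY (tenant_id, integration_id)  -- assumption: key includes tenant
)
"""


class IntegrationNotFound(Exception):
    """Same error whether the row is missing or belongs to another tenant."""


def load_token_row(conn: sqlite3.Connection, tenant_id: str, integration_id: str):
    # tenant_id must come from verified auth context, never from request input.
    row = conn.execute(
        "SELECT provider, encrypted_access_token, encrypted_dek, expires_at "
        "FROM tokens "
        "WHERE tenant_id = ? AND integration_id = ? AND revoked_at IS NULL",
        (tenant_id, integration_id),
    ).fetchone()
    if row is None:
        raise IntegrationNotFound("integration not found")
    return row
```

Because the tenant filter is part of the only query path, a valid `integration_id` presented under the wrong tenant is indistinguishable from one that does not exist.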

Envelope encryption keeps raw tokens out of the database:

  1. Each tenant gets a data encryption key (DEK), a 256-bit symmetric key used with AES-256-GCM, generated at tenant provisioning

  2. The DEK is encrypted by a key encryption key (KEK) in cloud KMS or HSM. The wrapped DEK is stored alongside the token record

  3. To decrypt a token, the token service calls KMS to unwrap the DEK, then decrypts the token ciphertext

KMS handles KEK rotation automatically: old ciphertexts stay decryptable, new encryptions use the latest key version. The token service never sees the KEK in plaintext. On integration disconnect, any cached plaintext DEK for that tenant-integration pair must be evicted before the next request cycle.
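The three envelope encryption steps can be sketched as below. This assumes the third-party `cryptography` package is installed; `LocalKms` is a hypothetical in-process stand-in for a cloud KMS or HSM, and both the 12-byte nonce prefix and the use of `tenant_id` as associated data are illustrative choices rather than requirements of the pattern.

```python
import os

from cryptography.hazmat.primitives.ciphers.aead import AESGCM

NONCE_LEN = 12  # standard GCM nonce size, stored as a prefix of each ciphertext


class LocalKms:
    """Hypothetical stand-in for a cloud KMS: the KEK never leaves this object."""

    def __init__(self) -> None:
        self._kek = AESGCM(AESGCM.generate_key(bit_length=256))

    def wrap(self, dek: bytes) -> bytes:
        nonce = os.urandom(NONCE_LEN)
        return nonce + self._kek.encrypt(nonce, dek, None)

    def unwrap(self, wrapped: bytes) -> bytes:
        return self._kek.decrypt(wrapped[:NONCE_LEN], wrapped[NONCE_LEN:], None)


def new_wrapped_dek(kms: LocalKms) -> bytes:
    # Step 1 and 2: generate a DEK and keep only its wrapped form.
    return kms.wrap(AESGCM.generate_key(bit_length=256))


def encrypt_token(kms: LocalKms, wrapped_dek: bytes, tenant_id: str, token: str) -> bytes:
    dek = kms.unwrap(wrapped_dek)
    nonce = os.urandom(NONCE_LEN)
    # Assumption: tenant_id as associated data binds the ciphertext to its tenant.
    return nonce + AESGCM(dek).encrypt(nonce, token.encode(), tenant_id.encode())


def decrypt_token(kms: LocalKms, wrapped_dek: bytes, tenant_id: str, blob: bytes) -> str:
    # Step 3: unwrap the DEK via KMS, then decrypt the token ciphertext.
    dek = kms.unwrap(wrapped_dek)
    plaintext = AESGCM(dek).decrypt(blob[:NONCE_LEN], blob[NONCE_LEN:], tenant_id.encode())
    return plaintext.decode()
```

Decrypting with another tenant's wrapped DEK, or with a mismatched `tenant_id`, fails GCM authentication and raises instead of returning garbage.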

Threat model

Baseline assumptions

  • Your SaaS authenticates tenants and derives tenant_id from verified credentials, not request parameters

  • OAuth providers implement RFC 6749 correctly (authorization code flow, token endpoint) and expose a revocation endpoint per RFC 7009

  • Tokens are bearer credentials: anyone holding a valid access token can use it. Sender-constraining mechanisms (DPoP, mTLS) are not yet widely supported by third-party providers, so the architecture does not assume them

  • Standard OAuth flow hardening (exact-match redirect URIs, PKCE, CSRF-bound state parameter) is in place. This model focuses on what happens after tokens are acquired

  • Standard infra controls (TLS, WAF, database AuthN, SQLi prevention) are in place

A note on risk: you won’t fix everything

The tables below aren’t a checklist where every risk must be fully eliminated. Focus on preventing the worst failures and limiting blast radius. In practice: ship prevention for the High rows first, then add monitoring and response for what you can’t realistically prevent.

Phase 1: Token storage and lifecycle

Focus: Preventing token exposure, enforcing tenant isolation, and managing credential lifecycle

| Asset | Threat | Baseline Controls | Mitigation Options | Risk |
| --- | --- | --- | --- | --- |
| Token store | Bulk exposure: Database breach, backup leak, or snapshot copy exposes all tokens | Database access controls | 1. Envelope encryption: per-tenant DEK wrapped by KEK in KMS / HSM 2. Per-tenant keys: a single DEK compromise limits blast radius to one tenant 3. No plaintext path: verify no token value appears in logs, error messages, monitoring, or backups | High |
| Tenant isolation | Cross-tenant token access: Application bug or IDOR allows one tenant's code path to use another tenant's credential through the token service | Auth context | 1. Tenant filter: every token service lookup includes WHERE tenant_id = ? with tenant_id from verified auth context, never from request parameters 2. Opaque errors: cross-tenant lookups return "integration not found", never "access denied" | High |
| Stored scopes | Scope overreach: Stored tokens carry broader scopes than the integration uses, so a vault compromise gives attackers more access than the feature requires | OAuth consent screen | 1. Scope inventory: document required scopes per integration and compare against stored grants 2. Scope drift detection: compare stored scopes against provider response on each refresh, alert on unexpected expansion | Medium |
| Refresh token | Rotation race: Two concurrent requests trigger refresh simultaneously. The provider rotates the refresh token on first use; the second caller sends a stale token, and the provider revokes the entire token family | Centralized refresh | 1. Distributed lock: acquire a per-integration lock before refreshing 2. Wait and re-read: if the lock is held, back off and read the (likely already refreshed) token 3. Atomic store: persist the new refresh token before releasing the lock 4. Fail closed: if the lock backend is unavailable, reject the outbound request rather than refreshing without coordination | Low |
| Encryption keys | KEK compromise: Attacker gains KMS access, making all DEK encryption ineffective | Cloud KMS access policies | 1. Least privilege: restrict KMS decrypt to the token service's IAM role only 2. Audit: alert on decrypt calls from unexpected principals or unusual volume 3. Automatic KEK rotation in KMS | Medium |
| Token lifecycle | Zombie tokens: Tenant disconnects an integration or churns, but tokens persist and remain usable at the provider | Manual cleanup | 1. Revoke at provider: call the revocation endpoint (RFC 7009) on disconnect 2. Delete local: purge the encrypted record and evict any cached plaintext 3. Sweep: periodic job catches tokens the disconnect flow missed | Low |
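The rotation-race mitigations (lock, wait and re-read, store before release, fail closed) can be sketched as below. This is a single-process illustration: `InMemoryLocks` is a hypothetical stand-in for Redis or DB advisory locks, `store` is a plain dict, and `refresh_at_provider` is a caller-supplied function representing the provider's token endpoint.

```python
import threading
import time
from dataclasses import dataclass


class LockUnavailable(Exception):
    """Lock backend unreachable, or the lock was not acquired in time."""


@dataclass
class StoredToken:
    access_token: str
    refresh_token: str
    expires_at: float  # Unix timestamp


class InMemoryLocks:
    """Hypothetical stand-in for a distributed lock backend."""

    def __init__(self) -> None:
        self._locks = {}
        self._guard = threading.Lock()
        self.healthy = True  # flip to False to simulate a backend outage

    def acquire(self, key: str, timeout: float) -> threading.Lock:
        if not self.healthy:
            raise LockUnavailable(key)
        with self._guard:
            lock = self._locks.setdefault(key, threading.Lock())
        if not lock.acquire(timeout=timeout):
            raise LockUnavailable(key)
        return lock


def get_fresh_token(store, locks, integration_id, refresh_at_provider, now=time.time):
    token = store[integration_id]
    if token.expires_at > now():
        return token  # still valid, no lock needed
    # Fail closed: LockUnavailable propagates and no refresh is attempted.
    lock = locks.acquire(integration_id, timeout=5.0)
    try:
        token = store[integration_id]  # re-read: another caller may have refreshed
        if token.expires_at > now():
            return token
        new_token = refresh_at_provider(token.refresh_token)
        store[integration_id] = new_token  # persist before releasing the lock
        return new_token
    finally:
        lock.release()
```

With a real distributed lock, the lock itself also needs an expiry so that a crashed holder cannot block refreshes indefinitely; that detail is omitted here.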

Phase 2: Outbound token use

Focus: Preventing credential misrouting and leakage when the token service makes API calls on behalf of tenants

| Asset | Threat | Baseline Controls | Mitigation Options | Risk |
| --- | --- | --- | --- | --- |
| Outbound request | Confused deputy: Bug in the token service resolves the wrong tenant's credential for an outbound call, making an API request using another customer's token | Tenant context binding | 1. Strict lookup: execute_request resolves credentials by tenant_id + integration_id; both must match the same record 2. No fallback: if the lookup returns no match, fail the request, never try a broader search 3. Audit: log tenant_id, integration_id, and target URL for every outbound call | High |
| Token service | Outage: The token service is unavailable, blocking all outbound integrations for all tenants | Redundant deployment | 1. High availability: deploy the token service with replica count and health checks matching your SLA 2. Graceful degradation: application code receives a clear "integration unavailable" error, not a timeout 3. No bypass: application code must not cache or store tokens as a fallback when the service is down | Medium |
| Access tokens | Credential leakage: Access token appears in request logs, error stack traces, monitoring dashboards, or crash dumps from the token service process | Standard logging, short-lived cache | 1. Header redaction: strip Authorization headers from all log outputs 2. Structured logging: field-level redaction rules in the logging framework 3. Short TTL: cache entries expire in minutes, evict on disconnect 4. Process isolation: restricted core dump settings for the token service | Medium |
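Header redaction for the credential leakage row might look like the sketch below, using only the standard `logging` module. The filter name, the header list, and the bearer-token regex are assumptions; a production setup would likely combine this with field-level rules in a structured logger.

```python
import logging
import re

# Matches "Bearer <token>" and keeps only the scheme.
_BEARER = re.compile(r"(Bearer\s+)[A-Za-z0-9._~+/=-]+", re.IGNORECASE)
_SENSITIVE_HEADERS = {"authorization", "proxy-authorization", "cookie"}


def redact_headers(headers: dict) -> dict:
    """Return a copy of the headers that is safe to log."""
    return {
        name: "[REDACTED]" if name.lower() in _SENSITIVE_HEADERS else value
        for name, value in headers.items()
    }


class RedactBearerFilter(logging.Filter):
    """Last line of defence: scrub bearer tokens from any formatted message."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Format first so tokens passed as log arguments are also caught.
        record.msg = _BEARER.sub(r"\1[REDACTED]", record.getMessage())
        record.args = ()
        return True
```

Attaching the filter to each handler (rather than to a single logger) means records from child loggers and third-party libraries pass through it as well.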

Verification checklist

  • Token encryption

    • Each tenant's tokens are encrypted with a distinct DEK

    • KMS decrypt permissions are restricted to the token service's IAM role

    • Decrypting a token record with a different tenant's DEK fails

  • Tenant isolation

    • Querying for a token with a valid integration_id but wrong tenant_id returns "not found"

    • Every query path includes WHERE tenant_id = ? with tenant_id from verified auth context

    • Cross-tenant lookups return identical responses whether the integration exists or not

  • Token lifecycle

    • Two concurrent requests for the same expired token result in exactly one provider refresh call

    • If the lock backend is unavailable, refresh fails closed and the outbound request is rejected

    • The new refresh token is persisted atomically before the lock is released

    • A failed refresh marks the integration as degraded and alerts the tenant

    • Stored scopes are compared on each refresh; scope changes trigger an alert

    • Disconnecting an integration revokes at the provider and deletes the local record

    • A periodic sweep catches zombie tokens for disabled or churned tenants

  • Outbound safety

    • The execute_request response contains the third-party API's response, not the credential used

    • Token values do not appear in application logs, error messages, or monitoring dashboards

    • Every outbound call is logged with tenant_id, integration_id, target host, and HTTP status

  • Detection

    • KMS throttling or timeout causes a controlled failure and alert, not unbounded retries

    • Alerts fire on: refresh failure rate exceeding threshold, KMS decrypt calls from unexpected principals, lock backend health degradation

    • Token usage logs support investigating "which tenant's credentials were used to call which API at what time"
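For the checklist item on comparing stored scopes at refresh time, a comparison along these lines may be enough. It relies on OAuth scope strings being space-delimited (RFC 6749) and on providers being allowed to omit `scope` from a refresh response when it is unchanged; the function name and return shape are illustrative.

```python
from typing import Optional


def scope_drift(stored: str, returned: Optional[str]) -> dict:
    """Compare the stored scope string with the one in a refresh response."""
    if returned is None:
        # Provider omitted scope: treat the grant as unchanged.
        return {"added": [], "removed": []}
    before, after = set(stored.split()), set(returned.split())
    return {"added": sorted(after - before), "removed": sorted(before - after)}


drift = scope_drift("contacts.read", "contacts.read contacts.write")
if drift["added"]:
    # Unexpected expansion is the case worth alerting on.
    print(f"scope expansion detected: {drift['added']}")
```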

Implementation & Review

The full threat model matrix, architectural diagrams, and a printable verification checklist for this pattern are available in the Secure Patterns repository. Use these artifacts to guide your design reviews and internal audits.
