Components and ownership
Your services and Stigg Sidecar
The Sidecar runs alongside your app or as a small service, proxying entitlement checks and handling caching with either in-memory storage or Redis. It can be scaled horizontally to match your application’s needs. The Sidecar is designed for internal use and should not be internet-exposed. The design goal is to keep the Sidecar operational during network issues or outages while continuing to return valid responses to the main application: the Sidecar reads from the local Redis cache and runs in fail-safe modes. If the upstream API becomes unavailable, the Sidecar serves responses directly from cache. If Redis goes down, the Sidecar attempts to return a live upstream API response (if accessible), and if not, it falls back to configurable static values as a final safeguard. For write paths (e.g., ReportUsage and ReportEvents), errors are returned explicitly and should be surfaced through your own observability and alerting layer using try/catch handling and structured logging.
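A minimal sketch of that write-path pattern is shown below. The import path, initialize call, and reportUsage signature are modeled on the Stigg Node SDK but should be treated as assumptions; the feature ID and environment variable name are illustrative.

```typescript
// Sketch only: confirm the import path and method signatures against your Stigg SDK version.
import Stigg from "@stigg/node-server-sdk";

const stigg = Stigg.initialize({ apiKey: process.env.STIGG_SERVER_API_KEY! });

async function reportSeatUsage(customerId: string, value: number): Promise<void> {
  try {
    // Write path: errors are returned explicitly rather than silently swallowed.
    await stigg.reportUsage({ customerId, featureId: "feature-seats", value });
  } catch (err) {
    // Structured log line your observability/alerting layer can match on.
    console.error(
      JSON.stringify({
        level: "error",
        source: "stigg",
        operation: "reportUsage",
        customerId,
        message: (err as Error).message,
      })
    );
    throw err; // or buffer/enqueue for retry, depending on your durability needs
  }
}
```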
The Sidecar service is designed to operate synchronously. Read requests do not run asynchronously or in the background — the main application can call the sidecar and wait for a response in the same request path without risk of deferred execution. By default, reads are resolved from the local Redis cache and returned immediately. If a cache miss occurs, the sidecar attempts to retrieve entitlements from the Edge API, where typical latency is around 100 ms. To protect against latency spikes from upstream services, the sidecar uses a 10-second network timeout (configurable), after which it falls back to default values instead of hanging or retrying indefinitely.
Short Stigg API latency spikes do not impact cached reads. Only uncached or Edge-fetched data paths are exposed to network conditions. If the Stigg API experiences sustained latency or connectivity issues, the sidecar continues serving reads from cache while enforcing the timeout guardrails on external calls to avoid cascading delays into the main application.
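On the read side, the fallback behavior described above can be made explicit per check. The sketch below is illustrative: the method name, options shape, and feature ID are assumptions modeled on the Stigg Node SDK, and stigg refers to the client initialized in the previous sketch.

```typescript
declare const stigg: any; // the client initialized in the previous sketch (illustrative)

async function canUseAdvancedReports(customerId: string): Promise<boolean> {
  const entitlement = await stigg.getBooleanEntitlement({
    customerId,
    featureId: "feature-advanced-reports",
    options: {
      // Returned when the check cannot be resolved live or from cache.
      fallback: { hasAccess: false },
    },
  });

  if (entitlement.isFallback) {
    // The response came from the configured fallback, not live or cached data.
    console.warn("Entitlement served from fallback value", { customerId });
  }
  return entitlement.hasAccess;
}
```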
Upstream Stigg APIs
Upstream Stigg APIs are the central Stigg services that power your entitlement system. Monitor overall health and subscribe to incident updates from the Stigg Status Page via email, RSS feed, or Slack. The status page does not provide an official public API; however, the underlying metrics shown on the page can be retrieved by querying an internal endpoint, which returns data including component uptimes. Example request (last 3 months, UTC range calculated dynamically):
| Component | ID |
|---|---|
| API (api.stigg.io) | 01JQV6ECFPKWSMXBQ81NHKKWQP |
| Edge API (edge.api.stigg.io) | 01JQV6ECFP7YKP9DC886WCCD6B |
| Infrastructure | 01JWWMF1XCHBJA90R2GBFDCPGC |
| App (stigg.io) | 01JQV6ECFPD3JFJ29DX6E31K62 |
Durable delivery via SQS
As an alternative to webhooks, Stigg provisions multiple AWS SQS queues for scenarios that require higher durability and volume handling. This approach provides more reliable delivery guarantees. Additionally, the persistent cache consumes a Stigg-managed SQS queue to keep Redis fresh and synchronized with the latest entitlement data.
Health, readiness and metrics
Sidecar
The Sidecar exposes standard health and readiness endpoints for monitoring and orchestration. Check liveness with GET /livez and readiness with GET /readyz; both return {"status":"UP"} when operational.
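A quick way to verify both endpoints from a script or smoke test is sketched below; the base URL is an assumption, so substitute the address and port your Sidecar actually listens on.

```typescript
// Requires Node 18+ for the built-in fetch API.
const SIDECAR_URL = process.env.SIDECAR_URL ?? "http://localhost:8080"; // assumed address

async function checkSidecar(): Promise<boolean> {
  for (const path of ["/livez", "/readyz"]) {
    const res = await fetch(`${SIDECAR_URL}${path}`);
    const body = await res.json().catch(() => ({}));
    if (!res.ok || body.status !== "UP") {
      console.error(`Sidecar check failed: ${path} -> HTTP ${res.status}`, body);
      return false;
    }
  }
  return true;
}
```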
Metrics are available in Prometheus format at GET /metrics. Key metrics to monitor include:
- sidecar_initialization_errors_total - errors during Sidecar initialization, such as an invalid API key or misconfiguration. It should always remain at 0; if it rises above 0 in production, immediately alert the on-call engineer or roll back to the last working configuration.
- sidecar_invalid_api_key_errors_total - authentication failures.
- sidecar_network_request_errors_total - connectivity problems; increments on API errors and is another indication that the Stigg API may be unreachable. Trigger a warning if any errors (>0) occur within a 5-minute window, and escalate if the error rate remains elevated for 10 minutes.
- sidecar_redis_client_errors_total - Redis connection issues; treat these as a signal of connectivity or stability problems with Redis. A reasonable threshold is a warning if any errors (>0) occur within a 5-minute window, escalating if the error rate remains elevated for 10 minutes.
- sidecar_cache_hits_total and sidecar_cache_misses_total - cache performance.
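In most setups Prometheus scrapes /metrics directly and alert rules are evaluated there, but for completeness, a rough sketch of deriving the cache hit ratio from the raw exposition format is shown below; the URL is an assumption and the parsing is intentionally simplistic.

```typescript
// Rough sketch: reads the Prometheus text exposition from the Sidecar and
// computes hits / (hits + misses). A real setup would use recording rules.
async function sidecarCacheHitRatio(baseUrl = "http://localhost:8080"): Promise<number> {
  const text = await (await fetch(`${baseUrl}/metrics`)).text();
  const read = (name: string): number => {
    const line = text.split("\n").find((l) => l.startsWith(name));
    return line ? Number(line.trim().split(/\s+/).pop()) : 0;
  };
  const hits = read("sidecar_cache_hits_total");
  const misses = read("sidecar_cache_misses_total");
  return hits + misses === 0 ? 1 : hits / (hits + misses);
}
```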
For the Sidecar, entitlement check responses containing isFallback: true indicate that fallback values were used.
A Sidecar liveness probe should not trigger automatic Kubernetes pod restarts for network connectivity issues, Stigg API unreachability, or Edge failures. This is an intentional design choice to prevent the Sidecar from entering a restart loop, which would disrupt the main application whenever the upstream Stigg API is temporarily unreachable. The recommended approach is to keep the Sidecar pod alive and rely on cached reads and fail-safe fallback modes, while surfacing connectivity or write errors into your own observability and alerting stack rather than coupling them to pod restarts.
Persistent cache service
When Redis is enabled, the persistent cache service provides its own monitoring endpoints: GET /livez, GET /readyz, and GET /metrics. Important metrics include:
- persistent_cache_write_duration_seconds - write performance; high duration can indicate delayed propagation of changes to Redis. A write duration above 5 seconds sustained over a 5-minute window should trigger a warning, escalating if it remains elevated for 15 minutes.
- persistent_cache_write_errors_total - write failures; treat these as a signal of connectivity or stability issues with Redis, same as above. Trigger a warning if any errors (>0) occur within a 5-minute window, and escalate if the error rate remains elevated for 10 minutes.
- persistent_cache_hit_ratio - overall cluster hit rate and typically a good indication of how well Redis is being utilized. A ratio below 0.8 sustained for 15 minutes should trigger a warning; a drop below 0.6 sustained over 60 minutes should be escalated.
- persistent_cache_memory_usage_bytes - memory consumption.
- persistent_cache_hits_total and persistent_cache_misses_total - cache effectiveness.
- persistent_cache_messages_processed_total - throughput tracking.
Recommended metrics and alert thresholds
These thresholds can be adjusted based on your production utilization patterns. Warnings are typically routed to the application team via Slack for investigation, while critical alerts should trigger PagerDuty notifications to the on-call engineer.
| Metric | Purpose | Warning (Slack) | Error (Slack) | Critical (PagerDuty) | Remediation |
|---|---|---|---|---|---|
| sidecar_initialization_errors_total or sidecar_invalid_api_key_errors_total | Sidecar initialization failures | > 0 | > 0 | > 0 | Note: often caused by a bad deployment. Roll back if possible. Check service error logs. Validate Stigg API key is correct and active. Review recent configuration changes. Roll back to last stable settings or image version if needed. |
| sidecar_network_request_errors_total | Stigg API unreachable | > 0 over 5 min | Stays elevated > 10 min | Stays elevated > 15 min | Check error logs. Verify API reachability via status page. Notify the Stigg team if the issue persists. |
| sidecar_redis_client_errors_total | Redis unreachable from Sidecar | > 0 over 5 min | Stays elevated > 10 min | Stays elevated > 10 min | Check Redis health, memory usage, and instance reachability. |
| persistent_cache_write_errors_total | Redis write failures | > 0 over 5 min | Stays elevated > 10 min | Stays elevated > 10 min | Check Redis health, memory usage, and instance reachability. |
| persistent_cache_write_duration_seconds | Redis write latency | > 5 seconds over 5 min | Stays elevated > 15 min | Stays elevated > 30 min | Check Redis health and reachability. Monitor CPU/memory. If consistently high, consider scaling out. |
| persistent_cache_hit_ratio | Redis cache efficiency | < 80% over 15 min | < 60% sustained > 60 min | Not applicable | Check Redis health and reachability. Note: low ratio after restarts or cache clearing is expected and should recover as the cache repopulates. |
In addition, monitor the standard process-level resource metrics:
| Metric | Purpose | Trigger |
|---|---|---|
| process_cpu_seconds_total | CPU usage over time | > 60% avg over 5m |
| process_resident_memory_bytes | Memory used by the process | > 80% avg over 5m |
Wrap all SDK/API calls in try/catch blocks and log the errors, as in the sketch below.
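A minimal sketch of such a wrapper follows. The fields pulled off the error object (traceId, status) are assumptions; adapt them to whatever your Stigg client actually exposes.

```typescript
// Generic wrapper for Stigg SDK/API calls with structured error logging.
async function withStiggErrorLogging<T>(
  operation: string,
  call: () => Promise<T>
): Promise<T> {
  try {
    return await call();
  } catch (err: any) {
    console.error(
      JSON.stringify({
        level: "error",
        source: "stigg",
        operation,
        message: err?.message,
        traceId: err?.traceId,   // assumption: surfaced by the client, if available
        httpStatus: err?.status, // assumption
      })
    );
    throw err;
  }
}

// Usage:
// const entitlement = await withStiggErrorLogging("getBooleanEntitlement", () =>
//   stigg.getBooleanEntitlement({ customerId, featureId })
// );
```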
Logging and observability
Sidecar logs
The Sidecar supports configurable log levels via the LOG_LEVEL environment variable (error|warn|info|debug). Forward these logs to your logging stack and configure alerts for error-level messages, as they typically indicate issues requiring immediate attention.
Application Logs
Instrument your application code to log Stigg API call failures at the call sites. Include the operation or endpoint name, HTTP status code when available, and the complete error object exposed by your Stigg client. This logging is essential for fast triage and allows you to correlate application-level issues with Sidecar metrics.
Webhooks
Webhook endpoints must return a 2xx status code within 30 seconds. If delivery fails, Stigg retries 3 times immediately, then 3 additional times at 30-second intervals. Use the messageId field for idempotency checks and deduplication in your webhook handler.
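A minimal sketch of a handler that acknowledges quickly and deduplicates on messageId is shown below, using Express; everything in the payload beyond messageId is an assumption.

```typescript
// Minimal sketch using Express; requires `npm install express` and Node 18+.
import express from "express";

const app = express();
app.use(express.json());

// In production, back this with Redis or a database table instead of memory.
const seenMessageIds = new Set<string>();

app.post("/stigg/webhooks", (req, res) => {
  const { messageId } = req.body as { messageId?: string };

  // Acknowledge duplicates quickly so Stigg's retries don't cause double-processing.
  if (messageId && seenMessageIds.has(messageId)) {
    res.sendStatus(200);
    return;
  }
  if (messageId) seenMessageIds.add(messageId);

  // Return 2xx well within the 30-second window; defer heavy work.
  res.sendStatus(200);
  processWebhook(req.body).catch((err) => console.error("webhook processing failed", err));
});

async function processWebhook(payload: unknown): Promise<void> {
  // ... your business logic (e.g. update local subscription state)
}

app.listen(3000);
```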
SQS
When using SQS for event delivery, monitor queue backlog, message latency, message age, and consumer health.
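A minimal sketch of a backlog check against the Stigg-provisioned queue is shown below, using the AWS SDK v3; the queue URL environment variable and the threshold are assumptions.

```typescript
// Checks approximate queue depth; run on a schedule and feed into your alerting.
import { SQSClient, GetQueueAttributesCommand } from "@aws-sdk/client-sqs";

const sqs = new SQSClient({});
const QUEUE_URL = process.env.STIGG_SQS_QUEUE_URL!; // assumed env var

async function checkQueueBacklog(): Promise<void> {
  const { Attributes } = await sqs.send(
    new GetQueueAttributesCommand({
      QueueUrl: QUEUE_URL,
      AttributeNames: ["ApproximateNumberOfMessages", "ApproximateNumberOfMessagesNotVisible"],
    })
  );
  const backlog = Number(Attributes?.ApproximateNumberOfMessages ?? 0);
  const inFlight = Number(Attributes?.ApproximateNumberOfMessagesNotVisible ?? 0);
  if (backlog > 1000) {
    // Illustrative threshold: scale consumers or investigate slow processing.
    console.warn(`SQS backlog high: ${backlog} queued, ${inFlight} in flight`);
  }
}
```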
Scaling and high availability
Sidecar Scaling
Deploy the Sidecar either in the classic sidecar pattern (one per service or pod) or as a small shared service, scaling horizontally based on load. When using only the in-memory cache, adding more replicas may increase cache misses since each instance maintains its own independent cache. For improved hit ratios and resilience, use Redis with the persistent cache service.
Persistent Cache
Run the persistent-cache-service to maintain Redis synchronization. This service consumes a Stigg-managed SQS queue and keeps Redis up to date with the latest entitlement data. The service is stateless and horizontally scalable. Prerequisites: a Stigg-provisioned SQS queue and your own Redis instance.
Offline and fallback behavior
When the Stigg API is unreachable, SDKs and the Sidecar return results from their local cache or apply fallback strategies to keep your application operational. Usage data can be buffered and optionally persisted to disk, preventing data loss during extended outages.
Failure patterns and triage
Sidecar startup problems
You’ll typically spot Sidecar startup issues through a rising sidecar_initialization_errors_total metric and logs that mention invalid API keys or network connectivity problems.
To diagnose, check the health and readiness endpoints (GET /livez and GET /readyz). Validate that your SERVER_API_KEY environment variable is correct and verify network egress to Stigg API/Edge endpoints. Also confirm you’re running a supported Sidecar image version.
When the Sidecar encounters startup or network errors, it continues serving from the persistent cache if configured, or falls back to global fallback values to keep your application running.
Elevated API failures
Signs of API problems include increased non-2xx responses from Stigg APIs, a climbing sidecar_network_request_errors_total metric, and incidents appearing on the status page.
Start by correlating the timing with the Stigg Status Page to see if there’s a platform-wide issue. Verify network connectivity from your services to Stigg, and ensure you have global and per-check fallback values configured for critical entitlement paths so your app can continue functioning during outages.
Redis and Persistent Cache issues
Redis problems manifest as a rising sidecar_redis_client_errors_total metric, along with persistent-cache metrics showing write errors or a declining hit ratio.
Check the persistent-cache service health with GET /readyz to verify consumer and Redis state. Review GET /metrics for persistent_cache_* metrics to understand what’s failing. Double-check your Redis connection parameters including host, port, authentication, TLS settings, and database selection. Make sure the environment prefix is aligned across your SDK, Sidecar, and persistent-cache service, as mismatches here are a common source of issues.
Webhook / SQS delivery problems
For webhooks, remember that endpoints must return a 2xx status within 30 seconds, after which the retry schedule described above applies. Consider setting up alerts for consecutive webhook failures or timeouts so you can respond quickly. For SQS-based delivery, monitor your SQS dashboard for queue backlog, message age, and throughput metrics. Scale your consumers as needed to handle load spikes. If you need durable queues for your use case, they can be provisioned by Stigg upon request.
API rate limits
| Operation | Limit | Notes |
|---|---|---|
| getEntitlements, get*Entitlement | Unlimited | Edge API |
| getPaywall | Unlimited | Edge API |
| getActiveSubscriptionsList | Unlimited | Edge API |
| reportUsage | 3000 / min | |
| reportEvent(s) | 1000 bulk / sec | |
Sidecar entitlement checks are served from local or persistent cache. On cache miss, the Sidecar queries the Edge API. The Entitlements endpoint itself is unlimited, so entitlement lookups are effectively not rate-limited in practice. If you require higher limits for other operations, contact Stigg Support.
Error handling, trace IDs and correlation
Every event in the Stigg Activity Log includes a trace ID that uniquely identifies the request that produced the event. This is invaluable for audits and troubleshooting. You can configure webhooks to receive notifications about changes and certain failure conditions, with documented delivery behavior, retries, and idempotency requirements. Note that you must set up the webhook endpoint on your side. Error responses from the API include both a trace ID and an error code (such as “plan not found”), which you can log and share for deeper investigation. The same trace ID propagates across related events and notifications, enabling end-to-end correlation.
Three reliable ways to investigate with trace IDs
1. Activity log
Navigate to the customer or entity in question, go to the Activity tab, locate the relevant event, and copy the trace ID or event link. Use this trace ID to correlate with your application logs and any webhook deliveries.
2. API error response
When your client surfaces an API error, extract and log at minimum: traceId, error_code, the human-readable error_message, the operation or endpoint name, and the HTTP status if available. Share the trace ID with Stigg Support if deeper server-side investigation is needed.
Example error response:
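The exact shape depends on the SDK and operation you are calling; the object below is an illustrative reconstruction from the fields listed above, not a verbatim API response.

```typescript
// Illustrative only: confirm field names against the error objects your Stigg client returns.
const exampleErrorResponse = {
  traceId: "3f9c2a1e-...",              // share with Stigg Support if needed
  error_code: "PlanNotFound",           // see the common error codes table below
  error_message: "Plan 'plan-enterprise' was not found",
  http_status: 404,
};

console.error(
  JSON.stringify({
    level: "error",
    operation: "provisionSubscription", // hypothetical operation name
    ...exampleErrorResponse,
  })
);
```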
3. Webhook notifications
Configure webhooks for relevant failure notifications in your environment, such as subscription-related failures, billing-integration failures, or generic sync failures. Use the trace ID included in those notifications to correlate the webhook with the originating request and the activity log entry. Delivery and retry behavior is documented, and the same trace IDs appear consistently across these signals. Exact event names and payload fields vary by the events you enable in the webhooks settings; you control which events are sent to your endpoint.
Common API error codes
| Code | Meaning |
|---|---|
| Unauthenticated | Invalid API key |
| InvalidArgumentError | Invalid payload |
| CustomerNotFound | Customer doesn’t exist |
| SubscriptionNotFound | Subscription doesn’t exist |
| PlanNotFound | Plan doesn’t exist |
What to log on your side
To enable effective troubleshooting, always log these fields when errors occur:
- traceId
- error_code
- error_message
- operation/endpoint
- http_status
Suggested triage flow
When your application reports an error, follow this investigation sequence:
- Capture the traceId, error_code, error_message, operation, and http_status from the error response.
- Check the activity log for the same trace ID to see what happened and when the event occurred.
- Check your webhook deliveries around that time, using the trace ID to match notifications to the originating request.
Alerting ideas
These recommendations are policy-neutral; choose thresholds that align with your specific SLOs and operational requirements.
Sidecar health monitoring
Set up alerts if /readyz returns a non-UP status for more than N minutes, especially when combined with /livez failures. This indicates the Sidecar may not be able to serve traffic reliably.
Sidecar error tracking
Alert on growth in sidecar_*_errors_total metrics and sustained drops in cache hit ratio. These patterns often indicate underlying issues with connectivity, configuration, or cache effectiveness that need attention.
Persistent cache monitoring
Monitor persistent_cache_write_errors_total and alert when it begins climbing. Also watch for falling persistent_cache_hit_ratio values and drops in consumer count reported by /readyz. These signals can indicate Redis connectivity problems or insufficient consumer capacity.
Upstream service status
Subscribe to the Stigg Status Page and configure alerts for status changes. This ensures you’re notified immediately about platform-wide incidents that might affect your integration.
Delivery reliability
Alert on repeated webhook failures based on your retry policy, or when SQS backlog and message age breach your defined thresholds. Use the SQS dashboard mentioned earlier to track these metrics.
Entitlement latency (p95)
Measure latency at your application layer or in your APM around get*Entitlement and getEntitlements calls, and alert according to your SLO. The architecture is designed for low-latency reads through pre-computed, distributed caching and edge delivery.
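One way to capture this is a client-side histogram. The sketch below uses prom-client; the metric name, buckets, and the stigg call are illustrative assumptions rather than a prescribed setup.

```typescript
// Measures entitlement-check latency as seen by the application.
import { Histogram, register } from "prom-client";

// `stigg` stands in for the client from the earlier sketches (illustrative).
declare const stigg: {
  getBooleanEntitlement(args: { customerId: string; featureId: string }): Promise<unknown>;
};

const entitlementLatency = new Histogram({
  name: "app_stigg_entitlement_duration_seconds",
  help: "Latency of entitlement checks as observed by the application",
  labelNames: ["operation"],
  buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5],
});

async function timedEntitlementCheck(customerId: string, featureId: string) {
  const end = entitlementLatency.startTimer({ operation: "getBooleanEntitlement" });
  try {
    return await stigg.getBooleanEntitlement({ customerId, featureId });
  } finally {
    end();
  }
}

// Expose register.metrics() on your own /metrics endpoint and alert on the
// p95 of this histogram according to your SLO.
```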
Example alert configurations
To get started, consider these baseline alert policies:
- Cache effectiveness: Alert if the Sidecar cache hit ratio falls below 70% for 15 minutes. This suggests cache configuration issues or unusual access patterns.
- API reliability: Alert if the error rate exceeds 1% sustained for 15 minutes. This indicates potential problems with API connectivity or request validity that warrant investigation.
Versioning and compatibility
Since Sidecar SDK v3.0.0, the TLS connection method has been deprecated in favor of non-TLS connections. You should use Sidecar image version 2.494.0 or later. Note that TLS self-signed certificates expire on January 26, 2026, so upgrading is strongly recommended to avoid service interruptions.
Platform team considerations
Monitoring
For both the Sidecar and Persistent Cache Service (PCS), implement standard Prometheus monitoring by scraping the /metrics endpoint. Additionally, set up regular health checks against the /livez and /readyz endpoints to ensure services are operational and ready to handle traffic.
