Components and ownership
Your services and Stigg Sidecar
The Sidecar runs alongside your app or as a small service, proxying entitlement checks and handling caching with either in-memory storage or Redis. It can be scaled horizontally to match your application’s needs. The Sidecar is designed for internal use and should not be internet-exposed. The design goal is to keep the Sidecar operational during network issues or outages while continuing to return valid responses to the main application: the Sidecar reads from the local Redis cache and runs in fail-safe modes. If the upstream API becomes unavailable, the Sidecar serves responses directly from cache. If Redis goes down, the Sidecar attempts to return a live upstream API response (if accessible), and if not, it falls back to configurable static values as a final safeguard. For write paths (e.g., ReportUsage and ReportEvents), errors are returned explicitly and should be surfaced through your own observability and alerting layer using try/catch handling and structured logging.
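A minimal sketch of that write-path pattern is shown below. The import path, initialize call, and reportUsage signature are modeled on the Stigg Node SDK but should be treated as assumptions; the feature ID and environment variable name are illustrative.

```typescript
// Sketch only: confirm the import path and method signatures against your Stigg SDK version.
import Stigg from "@stigg/node-server-sdk";

const stigg = Stigg.initialize({ apiKey: process.env.STIGG_SERVER_API_KEY! });

async function reportSeatUsage(customerId: string, value: number): Promise<void> {
  try {
    // Write path: errors are returned explicitly rather than silently swallowed.
    await stigg.reportUsage({ customerId, featureId: "feature-seats", value });
  } catch (err) {
    // Structured log line your observability/alerting layer can match on.
    console.error(
      JSON.stringify({
        level: "error",
        source: "stigg",
        operation: "reportUsage",
        customerId,
        message: (err as Error).message,
      })
    );
    throw err; // or buffer/enqueue for retry, depending on your durability needs
  }
}
```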
The Sidecar service is designed to operate synchronously. Read requests do not run asynchronously or in the background — the main application can call the sidecar and wait for a response in the same request path without risk of deferred execution. By default, reads are resolved from the local Redis cache and returned immediately. If a cache miss occurs, the sidecar attempts to retrieve entitlements from the Edge API, where typical latency is around 100 ms. To protect against latency spikes from upstream services, the sidecar uses a 10-second network timeout (configurable), after which it falls back to default values instead of hanging or retrying indefinitely.
Short Stigg API latency spikes do not impact cached reads. Only uncached or Edge-fetched data paths are exposed to network conditions. If the Stigg API experiences sustained latency or connectivity issues, the sidecar continues serving reads from cache while enforcing the timeout guardrails on external calls to avoid cascading delays into the main application.
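On the read side, the fallback behavior described above can be made explicit per check. The sketch below is illustrative: the method name, options shape, and feature ID are assumptions modeled on the Stigg Node SDK, and stigg refers to the client initialized in the previous sketch.

```typescript
declare const stigg: any; // the client initialized in the previous sketch (illustrative)

async function canUseAdvancedReports(customerId: string): Promise<boolean> {
  const entitlement = await stigg.getBooleanEntitlement({
    customerId,
    featureId: "feature-advanced-reports",
    options: {
      // Returned when the check cannot be resolved live or from cache.
      fallback: { hasAccess: false },
    },
  });

  if (entitlement.isFallback) {
    // The response came from the configured fallback, not live or cached data.
    console.warn("Entitlement served from fallback value", { customerId });
  }
  return entitlement.hasAccess;
}
```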
Upstream Stigg APIs
Upstream Stigg APIs are the central Stigg services that power your entitlement system. Monitor overall health and subscribe to incident updates from the Stigg Status Page via email, RSS feed, or Slack. The status page does not provide an official public API; however, the underlying metrics shown on the page can be retrieved by querying an internal endpoint, which returns data including component uptimes. Example request (last 3 months, UTC range calculated dynamically):
| Component | ID |
|---|---|
| API (api.stigg.io) | 01JQV6ECFPKWSMXBQ81NHKKWQP |
| Edge API (edge.api.stigg.io) | 01JQV6ECFP7YKP9DC886WCCD6B |
| Infrastructure | 01JWWMF1XCHBJA90R2GBFDCPGC |
| App (stigg.io) | 01JQV6ECFPD3JFJ29DX6E31K62 |
Durable delivery via SQS
As an alternative to webhooks, Stigg provisions multiple AWS SQS queues for scenarios that require higher durability and volume handling. This approach provides more reliable delivery guarantees. Additionally, the persistent cache consumes a Stigg-managed SQS queue to keep Redis fresh and synchronized with the latest entitlement data.
Health, readiness and metrics
Sidecar
The Sidecar exposes standard health and readiness endpoints for monitoring and orchestration. Check liveness with GET /livez and readiness with GET /readyz; both return {"status":"UP"} when operational.
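A quick way to verify both endpoints from a script or smoke test is sketched below; the base URL is an assumption, so substitute the address and port your Sidecar actually listens on.

```typescript
// Requires Node 18+ for the built-in fetch API.
const SIDECAR_URL = process.env.SIDECAR_URL ?? "http://localhost:8080"; // assumed address

async function checkSidecar(): Promise<boolean> {
  for (const path of ["/livez", "/readyz"]) {
    const res = await fetch(`${SIDECAR_URL}${path}`);
    const body = await res.json().catch(() => ({}));
    if (!res.ok || body.status !== "UP") {
      console.error(`Sidecar check failed: ${path} -> HTTP ${res.status}`, body);
      return false;
    }
  }
  return true;
}
```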
Metrics are available in Prometheus format at GET /metrics. Key metrics to monitor include:
- sidecar_initialization_errors_total - errors during Sidecar initialization, such as an invalid API key or misconfiguration. It should always remain at 0; if it rises above 0 in production, immediately alert the on-call engineer or roll back to the last working configuration.
- sidecar_invalid_api_key_errors_total - authentication failures.
- sidecar_network_request_errors_total - connectivity problems; increments on API errors and is another indication that the Stigg API may be unreachable. Trigger a warning if any errors (>0) occur within a 5-minute window, and escalate if the error rate remains elevated for 10 minutes.
- sidecar_redis_client_errors_total - Redis connection issues; treat these as a signal of connectivity or stability problems with Redis. A reasonable threshold is a warning if any errors (>0) occur within a 5-minute window, escalating if the error rate remains elevated for 10 minutes.
- sidecar_cache_hits_total and sidecar_cache_misses_total - cache performance.
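In most setups Prometheus scrapes /metrics directly and alert rules are evaluated there, but for completeness, a rough sketch of deriving the cache hit ratio from the raw exposition format is shown below; the URL is an assumption and the parsing is intentionally simplistic.

```typescript
// Rough sketch: reads the Prometheus text exposition from the Sidecar and
// computes hits / (hits + misses). A real setup would use recording rules.
async function sidecarCacheHitRatio(baseUrl = "http://localhost:8080"): Promise<number> {
  const text = await (await fetch(`${baseUrl}/metrics`)).text();
  const read = (name: string): number => {
    const line = text.split("\n").find((l) => l.startsWith(name));
    return line ? Number(line.trim().split(/\s+/).pop()) : 0;
  };
  const hits = read("sidecar_cache_hits_total");
  const misses = read("sidecar_cache_misses_total");
  return hits + misses === 0 ? 1 : hits / (hits + misses);
}
```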
For the Sidecar, entitlement check responses containing isFallback: true indicate that fallback values were used.
A Sidecar liveness probe should not trigger automatic Kubernetes pod restarts for network connectivity issues, Stigg API unreachability, or Edge failures. This is an intentional design choice to prevent the Sidecar from entering a restart loop, which would disrupt the main application whenever the upstream Stigg API is temporarily unreachable. The recommended approach is to keep the Sidecar pod alive and rely on cached reads and fail-safe fallback modes, while surfacing connectivity or write errors into your own observability and alerting stack rather than coupling them to pod restarts.
Persistent cache service
When Redis is enabled, the persistent cache service provides its own monitoring endpoints: GET /livez, GET /readyz, and GET /metrics. Important metrics include:
- persistent_cache_write_duration_seconds - write performance; high duration can indicate delayed propagation of changes to Redis. A write duration above 5 seconds sustained over a 5-minute window should trigger a warning, escalating if it remains elevated for 15 minutes.
- persistent_cache_write_errors_total - write failures; treat these as a signal of connectivity or stability issues with Redis, same as above. Trigger a warning if any errors (>0) occur within a 5-minute window, and escalate if the error rate remains elevated for 10 minutes.
- persistent_cache_hit_ratio - overall cluster hit rate and typically a good indication of how well Redis is being utilized. A ratio below 0.8 sustained for 15 minutes should trigger a warning; a drop below 0.6 sustained over 60 minutes should be escalated.
- persistent_cache_memory_usage_bytes - memory consumption.
- persistent_cache_hits_total and persistent_cache_misses_total - cache effectiveness.
- persistent_cache_messages_processed_total - throughput tracking.
Recommended metrics and alert thresholds
These thresholds can be adjusted based on your production utilization patterns. Warnings are typically routed to the application team via Slack for investigation, while critical alerts should trigger PagerDuty notifications to the on-call engineer.
| Metric | Purpose | Warning (Slack) | Error (Slack) | Critical (PagerDuty) | Remediation |
|---|---|---|---|---|---|
| sidecar_initialization_errors_total or sidecar_invalid_api_key_errors_total | Sidecar initialization failures | > 0 | > 0 | > 0 | Note: often caused by a bad deployment. Roll back if possible. Check service error logs. Validate Stigg API key is correct and active. Review recent configuration changes. Roll back to last stable settings or image version if needed. |
| sidecar_network_request_errors_total | Stigg API unreachable | > 0 over 5 min | Stays elevated > 10 min | Stays elevated > 15 min | Check error logs. Verify API reachability via status page. Notify the Stigg team if the issue persists. |
| sidecar_redis_client_errors_total | Redis unreachable from Sidecar | > 0 over 5 min | Stays elevated > 10 min | Stays elevated > 10 min | Check Redis health, memory usage, and instance reachability. |
| persistent_cache_write_errors_total | Redis write failures | > 0 over 5 min | Stays elevated > 10 min | Stays elevated > 10 min | Check Redis health, memory usage, and instance reachability. |
| persistent_cache_write_duration_seconds | Redis write latency | > 5 seconds over 5 min | Stays elevated > 15 min | Stays elevated > 30 min | Check Redis health and reachability. Monitor CPU/memory. If consistently high, consider scaling out. |
| persistent_cache_hit_ratio | Redis cache efficiency | < 80% over 15 min | < 60% sustained > 60 min | Not applicable | Check Redis health and reachability. Note: low ratio after restarts or cache clearing is expected and should recover as the cache repopulates. |
In addition, monitor the standard process-level resource metrics:
| Metric | Purpose | Trigger |
|---|---|---|
| process_cpu_seconds_total | CPU usage over time | > 60% avg over 5m |
| process_resident_memory_bytes | Memory used by the process | > 80% avg over 5m |
Wrap all SDK/API calls in try/catch blocks and log the errors, as in the sketch below.
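A minimal sketch of such a wrapper follows. The fields pulled off the error object (traceId, status) are assumptions; adapt them to whatever your Stigg client actually exposes.

```typescript
// Generic wrapper for Stigg SDK/API calls with structured error logging.
async function withStiggErrorLogging<T>(
  operation: string,
  call: () => Promise<T>
): Promise<T> {
  try {
    return await call();
  } catch (err: any) {
    console.error(
      JSON.stringify({
        level: "error",
        source: "stigg",
        operation,
        message: err?.message,
        traceId: err?.traceId,   // assumption: surfaced by the client, if available
        httpStatus: err?.status, // assumption
      })
    );
    throw err;
  }
}

// Usage:
// const entitlement = await withStiggErrorLogging("getBooleanEntitlement", () =>
//   stigg.getBooleanEntitlement({ customerId, featureId })
// );
```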
Logging and observability
Sidecar logs
The Sidecar supports configurable log levels via the LOG_LEVEL environment variable (error|warn|info|debug). Forward these logs to your logging stack and configure alerts for error-level messages, as they typically indicate issues requiring immediate attention.
Application Logs
Instrument your application code to log Stigg API call failures at the call sites. Include the operation or endpoint name, HTTP status code when available, and the complete error object exposed by your Stigg client. This logging is essential for fast triage and allows you to correlate application-level issues with Sidecar metrics.
Webhooks
Webhook endpoints must return a 2xx status code within 30 seconds. If delivery fails, Stigg retries 3 times immediately, then 3 additional times at 30-second intervals. Use the messageId field for idempotency checks and deduplication in your webhook handler.
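A minimal sketch of a handler that acknowledges quickly and deduplicates on messageId is shown below, using Express; everything in the payload beyond messageId is an assumption.

```typescript
// Minimal sketch using Express; requires `npm install express` and Node 18+.
import express from "express";

const app = express();
app.use(express.json());

// In production, back this with Redis or a database table instead of memory.
const seenMessageIds = new Set<string>();

app.post("/stigg/webhooks", (req, res) => {
  const { messageId } = req.body as { messageId?: string };

  // Acknowledge duplicates quickly so Stigg's retries don't cause double-processing.
  if (messageId && seenMessageIds.has(messageId)) {
    res.sendStatus(200);
    return;
  }
  if (messageId) seenMessageIds.add(messageId);

  // Return 2xx well within the 30-second window; defer heavy work.
  res.sendStatus(200);
  processWebhook(req.body).catch((err) => console.error("webhook processing failed", err));
});

async function processWebhook(payload: unknown): Promise<void> {
  // ... your business logic (e.g. update local subscription state)
}

app.listen(3000);
```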
SQS
When using SQS for event delivery, monitor queue backlog, message latency, message age, and consumer health.
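A minimal sketch of a backlog check against the Stigg-provisioned queue is shown below, using the AWS SDK v3; the queue URL environment variable and the threshold are assumptions.

```typescript
// Checks approximate queue depth; run on a schedule and feed into your alerting.
import { SQSClient, GetQueueAttributesCommand } from "@aws-sdk/client-sqs";

const sqs = new SQSClient({});
const QUEUE_URL = process.env.STIGG_SQS_QUEUE_URL!; // assumed env var

async function checkQueueBacklog(): Promise<void> {
  const { Attributes } = await sqs.send(
    new GetQueueAttributesCommand({
      QueueUrl: QUEUE_URL,
      AttributeNames: ["ApproximateNumberOfMessages", "ApproximateNumberOfMessagesNotVisible"],
    })
  );
  const backlog = Number(Attributes?.ApproximateNumberOfMessages ?? 0);
  const inFlight = Number(Attributes?.ApproximateNumberOfMessagesNotVisible ?? 0);
  if (backlog > 1000) {
    // Illustrative threshold: scale consumers or investigate slow processing.
    console.warn(`SQS backlog high: ${backlog} queued, ${inFlight} in flight`);
  }
}
```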
Scaling and high availability
Sidecar Scaling
Deploy the Sidecar either in the classic sidecar pattern (one per service or pod) or as a small shared service, scaling horizontally based on load. When using only the in-memory cache, adding more replicas may increase cache misses since each instance maintains its own independent cache. For improved hit ratios and resilience, use Redis with the persistent cache service.
Persistent Cache
Run the persistent-cache-service to maintain Redis synchronization. This service consumes a Stigg-managed SQS queue and keeps Redis up to date with the latest entitlement data. The service is stateless and horizontally scalable. Prerequisites: a Stigg-provisioned SQS queue and your own Redis instance.
Offline and fallback behavior
When the Stigg API is unreachable, SDKs and the Sidecar return results from their local cache or apply fallback strategies to keep your application operational. Usage data can be buffered and optionally persisted to disk, preventing data loss during extended outages.
Failure patterns and triage
Sidecar startup problems
You’ll typically spot Sidecar startup issues through a rising sidecar_initialization_errors_total metric and logs that mention invalid API keys or network connectivity problems.
To diagnose, check the health and readiness endpoints (GET /livez and GET /readyz). Validate that your SERVER_API_KEY environment variable is correct and verify network egress to Stigg API/Edge endpoints. Also confirm you’re running a supported Sidecar image version.
When the Sidecar encounters startup or network errors, it continues serving from the persistent cache if configured, or falls back to global fallback values to keep your application running.
Elevated API failures
Signs of API problems include increased non-2xx responses from Stigg APIs, a climbing sidecar_network_request_errors_total metric, and incidents appearing on the status page.
Start by correlating the timing with the Stigg Status Page to see if there’s a platform-wide issue. Verify network connectivity from your services to Stigg, and ensure you have global and per-check fallback values configured for critical entitlement paths so your app can continue functioning during outages.
Redis and Persistent Cache issues
Redis problems manifest as a rising sidecar_redis_client_errors_total metric, along with persistent-cache metrics showing write errors or a declining hit ratio.
Check the persistent-cache service health with GET /readyz to verify consumer and Redis state. Review GET /metrics for persistent_cache_* metrics to understand what’s failing. Double-check your Redis connection parameters including host, port, authentication, TLS settings, and database selection. Make sure the environment prefix is aligned across your SDK, Sidecar, and persistent-cache service, as mismatches here are a common source of issues.
Webhook / SQS delivery problems
For webhooks, remember that endpoints must return a 2xx status within 30 seconds, after which the retry schedule described above applies. Consider setting up alerts for consecutive webhook failures or timeouts so you can respond quickly. For SQS-based delivery, monitor your SQS dashboard for queue backlog, message age, and throughput metrics. Scale your consumers as needed to handle load spikes. If you need durable queues for your use case, they can be provisioned by Stigg upon request.
API rate limits
| Operation | Limit | Notes |
|---|---|---|
| getEntitlements, get*Entitlement | Unlimited | Edge API |
| getPaywall | Unlimited | Edge API |
| getActiveSubscriptionsList | Unlimited | Edge API |
| reportUsage | 3000 / min | |
| reportEvent(s) | 1000 bulk / sec | |
Sidecar entitlement checks are served from local or persistent cache. On cache miss, the Sidecar queries the Edge API. The Entitlements endpoint itself is unlimited, so entitlement lookups are effectively not rate-limited in practice. If you require higher limits for other operations, contact Stigg Support.
Error handling, trace IDs and correlation
Every event in the Stigg Activity Log includes a trace ID that uniquely identifies the request that produced the event. This is invaluable for audits and troubleshooting. You can configure webhooks to receive notifications about changes and certain failure conditions, with documented delivery behavior, retries, and idempotency requirements. Note that you must set up the webhook endpoint on your side. Error responses from the API include both a trace ID and an error code (such as “plan not found”), which you can log and share for deeper investigation. The same trace ID propagates across related events and notifications, enabling end-to-end correlation.
Three reliable ways to investigate with trace IDs
1. Activity log
Navigate to the customer or entity in question, go to the Activity tab, locate the relevant event, and copy the trace ID or event link. Use this trace ID to correlate with your application logs and any webhook deliveries.
2. API error response
When your client surfaces an API error, extract and log at minimum: traceId, error_code, the human-readable error_message, the operation or endpoint name, and the HTTP status if available. Share the trace ID with Stigg Support if deeper server-side investigation is needed.
Example error response:
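The exact shape depends on the SDK and operation you are calling; the object below is an illustrative reconstruction from the fields listed above, not a verbatim API response.

```typescript
// Illustrative only: confirm field names against the error objects your Stigg client returns.
const exampleErrorResponse = {
  traceId: "3f9c2a1e-...",              // share with Stigg Support if needed
  error_code: "PlanNotFound",           // see the common error codes table below
  error_message: "Plan 'plan-enterprise' was not found",
  http_status: 404,
};

console.error(
  JSON.stringify({
    level: "error",
    operation: "provisionSubscription", // hypothetical operation name
    ...exampleErrorResponse,
  })
);
```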
3. Webhook notifications
Configure webhooks for relevant failure notifications in your environment, such as subscription-related failures, billing-integration failures, or generic sync failures. Use the trace ID included in those notifications to correlate the webhook with the originating request and the activity log entry. Delivery and retry behavior is documented, and the same trace IDs appear consistently across these signals. Exact event names and payload fields vary by the events you enable in the webhooks settings; you control which events are sent to your endpoint.
Common API error codes
| Code | Meaning |
|---|---|
| Unauthenticated | Invalid API key |
| InvalidArgumentError | Invalid payload |
| CustomerNotFound | Customer doesn’t exist |
| SubscriptionNotFound | Subscription doesn’t exist |
| PlanNotFound | Plan doesn’t exist |
What to log on your side
To enable effective troubleshooting, always log these fields when errors occur:
- traceId
- error_code
- error_message
- operation/endpoint
- http_status
Suggested triage flow
When your application reports an error, follow this investigation sequence:
- Capture the traceId, error_code, error_message, operation, and http_status from the error response.
- Check the activity log for the same trace ID to see what happened and when the event occurred.
- Check your webhook deliveries around that time, using the trace ID to match notifications to the originating request.
Alerting ideas
These recommendations are policy-neutral; choose thresholds that align with your specific SLOs and operational requirements.
Sidecar health monitoring
Set up alerts if /readyz returns a non-UP status for more than N minutes, especially when combined with /livez failures. This indicates the Sidecar may not be able to serve traffic reliably.
Sidecar error tracking
Alert on growth in sidecar_*_errors_total metrics and sustained drops in cache hit ratio. These patterns often indicate underlying issues with connectivity, configuration, or cache effectiveness that need attention.
Persistent cache monitoring
Monitor persistent_cache_write_errors_total and alert when it begins climbing. Also watch for falling persistent_cache_hit_ratio values and drops in consumer count reported by /readyz. These signals can indicate Redis connectivity problems or insufficient consumer capacity.
Upstream service status
Subscribe to the Stigg Status Page and configure alerts for status changes. This ensures you’re notified immediately about platform-wide incidents that might affect your integration.
Delivery reliability
Alert on repeated webhook failures based on your retry policy, or when SQS backlog and message age breach your defined thresholds. Use the SQS dashboard mentioned earlier to track these metrics.
Entitlement latency (p95)
Measure latency at your application layer or in your APM around get*Entitlement and getEntitlements calls, and alert according to your SLO. The architecture is designed for low-latency reads through pre-computed, distributed caching and edge delivery.
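One way to capture this is a client-side histogram. The sketch below uses prom-client; the metric name, buckets, and the stigg call are illustrative assumptions rather than a prescribed setup.

```typescript
// Measures entitlement-check latency as seen by the application.
import { Histogram, register } from "prom-client";

// `stigg` stands in for the client from the earlier sketches (illustrative).
declare const stigg: {
  getBooleanEntitlement(args: { customerId: string; featureId: string }): Promise<unknown>;
};

const entitlementLatency = new Histogram({
  name: "app_stigg_entitlement_duration_seconds",
  help: "Latency of entitlement checks as observed by the application",
  labelNames: ["operation"],
  buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5],
});

async function timedEntitlementCheck(customerId: string, featureId: string) {
  const end = entitlementLatency.startTimer({ operation: "getBooleanEntitlement" });
  try {
    return await stigg.getBooleanEntitlement({ customerId, featureId });
  } finally {
    end();
  }
}

// Expose register.metrics() on your own /metrics endpoint and alert on the
// p95 of this histogram according to your SLO.
```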
Example alert configurations
To get started, consider these baseline alert policies:
- Cache effectiveness: Alert if the Sidecar cache hit ratio falls below 70% for 15 minutes. This suggests cache configuration issues or unusual access patterns.
- API reliability: Alert if the error rate exceeds 1% sustained for 15 minutes. This indicates potential problems with API connectivity or request validity that warrant investigation.
Versioning and compatibility
Since Sidecar SDK v3.0.0, the TLS connection method has been deprecated in favor of non-TLS connections. You should use Sidecar image version 2.494.0 or later. Note that TLS self-signed certificates expire on January 26, 2026, so upgrading is strongly recommended to avoid service interruptions.
Platform team considerations
Monitoring
For both the Sidecar and Persistent Cache Service (PCS), implement standard Prometheus monitoring by scraping the /metrics endpoint. Additionally, set up regular health checks against the /livez and /readyz endpoints to ensure services are operational and ready to handle traffic.
