
Sidecar startup problems

You’ll typically spot Sidecar startup issues through a rising sidecar_initialization_errors_total metric and logs that mention invalid API keys or network connectivity problems. To diagnose, check the health and readiness endpoints (GET /livez and GET /readyz). Validate that your SERVER_API_KEY environment variable is correct and verify network egress to Stigg API/Edge endpoints. Also confirm you’re running a supported Sidecar image version. When the Sidecar encounters startup or network errors, it continues serving from the persistent cache if configured, or falls back to global fallback values to keep your application running.
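The endpoint checks above can be scripted. The following is a minimal sketch, assuming the Sidecar's HTTP endpoints are reachable at a base URL you supply and that a 200 response means healthy; the response body format is not covered here, so only status codes are inspected. The function names `fetch_status` and `diagnose` are illustrative, not part of any Stigg tooling.

```python
import urllib.error
import urllib.request


def fetch_status(url: str, timeout: float = 2.0) -> int:
    """Return the HTTP status code for a GET request, or 0 if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return 0  # connection refused, DNS failure, timeout, ...


def diagnose(base_url: str, fetch=fetch_status) -> str:
    """Classify Sidecar state from the /livez and /readyz status codes.

    Assumption: 200 means healthy; anything else is treated as a failure.
    """
    live = fetch(f"{base_url}/livez")
    ready = fetch(f"{base_url}/readyz")
    if live != 200:
        return "not-live: process is down or unreachable"
    if ready != 200:
        return "live-but-not-ready: check SERVER_API_KEY and network egress"
    return "ready"
```

Call `diagnose()` with the address your Sidecar listens on. The `fetch` parameter exists so the classification logic can be exercised without a running Sidecar.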

Elevated API failures

Signs of API problems include increased non-2xx responses from Stigg APIs, a climbing sidecar_network_request_errors_total metric, and incidents appearing on the status page. Start by correlating the timing with the Stigg Status Page to see if there’s a platform-wide issue. Verify network connectivity from your services to Stigg, and ensure you have global and per-check fallback values configured for critical entitlement paths so your app can continue functioning during outages.

Redis and Persistent Cache issues

Redis problems manifest as a rising sidecar_redis_client_errors_total metric, along with persistent-cache metrics showing write errors or a declining hit ratio. Check the persistent-cache service health with GET /readyz to verify consumer and Redis state. Review GET /metrics for persistent_cache_* metrics to understand what’s failing. Double-check your Redis connection parameters including host, port, authentication, TLS settings, and database selection. Make sure the environment prefix is aligned across your SDK, Sidecar, and persistent-cache service, as mismatches here are a common source of issues.

Webhook / SQS delivery problems

For webhooks, remember that endpoints must return a 2xx status within 30 seconds; deliveries that fail or time out are retried according to the retry schedule. Consider setting up alerts for consecutive webhook failures or timeouts so you can respond quickly. For SQS-based delivery, monitor your SQS dashboard for queue backlog, message age, and throughput metrics. Scale your consumers as needed to handle load spikes. If you need durable queues for your use case, they can be provisioned by Stigg upon request.

Alerting ideas

These recommendations are policy-neutral. Choose thresholds that align with your specific SLOs and operational requirements.

Sidecar health monitoring

Set up alerts if /readyz returns a non-UP status for more than N minutes, especially when combined with /livez failures. This indicates the Sidecar may not be able to serve traffic reliably.
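One way to express the "non-UP for more than N minutes" rule is a small tracker fed by your probe loop. This is a sketch only: the class name, the severity labels, and the choice to escalate when /livez also fails are assumptions for illustration, and `threshold_s` stands in for your own value of N.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReadinessAlert:
    """Fires once /readyz has been failing continuously for threshold_s seconds."""

    threshold_s: float
    failing_since: Optional[float] = None

    def observe(self, ready: bool, live: bool, now: float) -> Optional[str]:
        """Record one probe result; return a severity string if an alert is due."""
        if ready:
            self.failing_since = None  # recovery resets the timer
            return None
        if self.failing_since is None:
            self.failing_since = now  # first failure in this streak
        if now - self.failing_since < self.threshold_s:
            return None  # failing, but not yet for long enough
        # A failing /livez alongside a failing /readyz is the stronger signal.
        return "critical" if not live else "warning"
```

Feed it one observation per probe, for example `alert.observe(ready=False, live=True, now=time.time())`, and route any non-None result to your paging system.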

Sidecar error tracking

Alert on growth in sidecar_*_errors_total metrics and sustained drops in cache hit ratio. These patterns often indicate underlying issues with connectivity, configuration, or cache effectiveness that need attention.
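If you scrape the metrics yourself rather than through a monitoring stack, growth in the error counters can be computed between two scrapes. The sketch below assumes the metrics are exposed in the Prometheus text exposition format, which the `_total` naming suggests but which you should confirm for your deployment. The helper names are illustrative.

```python
import re
from typing import Dict

# Matches lines such as: sidecar_redis_client_errors_total{label="x"} 12
_LINE = re.compile(r"^(sidecar_\w+_errors_total)(\{[^}]*\})?\s+([0-9.eE+-]+)")


def error_counters(metrics_text: str) -> Dict[str, float]:
    """Sum every sidecar_*_errors_total series by metric name, ignoring labels."""
    totals: Dict[str, float] = {}
    for line in metrics_text.splitlines():
        match = _LINE.match(line.strip())
        if match:
            name, value = match.group(1), float(match.group(3))
            totals[name] = totals.get(name, 0.0) + value
    return totals


def increases(prev: Dict[str, float], curr: Dict[str, float]) -> Dict[str, float]:
    """Per-metric growth between two scrapes.

    A value lower than the previous scrape is treated as a counter reset
    (for example after a Sidecar restart), so the new value is the growth.
    """
    out: Dict[str, float] = {}
    for name, value in curr.items():
        before = prev.get(name, 0.0)
        delta = value - before if value >= before else value
        if delta > 0:
            out[name] = delta
    return out
```

Any non-empty result from `increases()` means at least one error counter grew since the previous scrape; how much growth warrants an alert depends on your traffic and SLOs.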

Persistent cache monitoring

Monitor persistent_cache_write_errors_total and alert when it begins climbing. Also watch for falling persistent_cache_hit_ratio values and drops in consumer count reported by /readyz. These signals can indicate Redis connectivity problems or insufficient consumer capacity.

Upstream service status

Subscribe to the Stigg Status Page and configure alerts for status changes. This ensures you’re notified immediately about platform-wide incidents that might affect your integration.

Delivery reliability

Alert on repeated webhook failures based on your retry policy, or when SQS backlog and message age breach your defined thresholds. Use the SQS dashboard mentioned earlier to track these metrics.

Entitlement latency (p95)

Measure latency at your application layer or APM around get*Entitlement and getEntitlements calls, and alert according to your SLO. The architecture is designed for low-latency reads through pre-computed, distributed caching and edge delivery.
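If your APM does not already provide percentiles, application-layer timing can be sketched as follows. `LatencyRecorder` is a hypothetical helper, not part of the Stigg SDK, and it uses the nearest-rank definition of a percentile; in production you would more likely export a histogram to your metrics system.

```python
import math
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


class LatencyRecorder:
    """Collects call durations in milliseconds and reports a nearest-rank percentile."""

    def __init__(self) -> None:
        self.samples_ms: List[float] = []

    def timed(self, fn: Callable[[], T]) -> T:
        """Run fn, record how long it took (even if it raises), and return its result."""
        start = time.perf_counter()
        try:
            return fn()
        finally:
            self.samples_ms.append((time.perf_counter() - start) * 1000.0)

    def percentile(self, p: float) -> float:
        """Nearest-rank percentile, e.g. p=95 for p95. Requires at least one sample."""
        if not self.samples_ms:
            raise ValueError("no samples recorded")
        ordered = sorted(self.samples_ms)
        rank = max(1, math.ceil(p * len(ordered) / 100.0))
        return ordered[rank - 1]
```

Wrap each entitlement call in `recorder.timed(lambda: ...)`, then compare `recorder.percentile(95)` against your SLO on a schedule that suits your alerting pipeline.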

Example alert configurations

To get started, consider these baseline alert policies:
  • Cache effectiveness: Alert if Sidecar cache hit ratio falls below 70% for 15 minutes. This suggests cache configuration issues or unusual access patterns.
  • API reliability: Alert if the error rate exceeds 1% sustained for 15 minutes. This indicates potential problems with API connectivity or request validity that warrant investigation.
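The two baseline policies can be written down as data plus a single evaluation rule. This sketch assumes you already collect timestamped samples of the cache hit ratio and the API error rate as fractions between 0 and 1; the policy keys and function names are illustrative. The thresholds and the 15-minute window mirror the bullets above.

```python
from typing import Dict, List, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, observed value)

WINDOW_S = 15 * 60  # both baseline policies use a 15-minute window

POLICIES = {
    "cache_hit_ratio": {"threshold": 0.70, "breach_when_below": True},
    "api_error_rate": {"threshold": 0.01, "breach_when_below": False},
}


def sustained_breach(samples: List[Sample], now: float, window_s: float,
                     threshold: float, breach_when_below: bool) -> bool:
    """True if the current unbroken run of breaching samples began at least
    window_s seconds before now. Any healthy sample resets the run."""
    breach_start = None
    for t, v in sorted(samples):
        breaching = v < threshold if breach_when_below else v > threshold
        if not breaching:
            breach_start = None
        elif breach_start is None:
            breach_start = t
    return breach_start is not None and now - breach_start >= window_s


def evaluate(series: Dict[str, List[Sample]], now: float) -> List[str]:
    """Return the names of the policies that are currently breached."""
    return [
        name for name, policy in POLICIES.items()
        if sustained_breach(series.get(name, []), now, WINDOW_S,
                            policy["threshold"], policy["breach_when_below"])
    ]
```

Keeping thresholds in a table like `POLICIES` makes it straightforward to tune them per environment without touching the evaluation logic.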

Sidecar entitlement checks are served from the local or persistent cache. On a cache miss, the Sidecar queries the Edge API. The Entitlements endpoint itself has no rate limit, so entitlement lookups are not throttled in practice. If you require higher limits for other operations, contact Stigg Support.