
Health

The service exposes two endpoints accessible via HTTP:
  • GET /livez - returns 200 if the service is alive. Healthy response: { "status": "UP" }
  • GET /readyz - returns 200 if the service is ready. Healthy response: { "status": "UP" }

Metrics

The Sidecar exposes a GET /metrics endpoint that returns metrics in Prometheus format, covering both system-level and Sidecar-specific metrics. These are useful for monitoring the health and performance of your Sidecar service.
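As a quick sanity check outside of Prometheus, you can fetch the endpoint directly and sum the counters you care about. A minimal sketch, assuming Node 18+ and a placeholder metrics URL:

```typescript
// Fetch the Sidecar's Prometheus metrics and print selected counters (sketch).
const METRICS_URL = process.env.SIDECAR_METRICS_URL ?? "http://localhost:8080/metrics"; // assumption

// Prometheus text format: "<metric_name>{<labels>} <value>" per line; "#" lines are comments.
function readCounter(text: string, name: string): number {
  let total = 0;
  for (const line of text.split("\n")) {
    if (line.startsWith(name)) {
      const value = Number(line.trim().split(/\s+/).pop());
      if (!Number.isNaN(value)) total += value; // sum across label sets
    }
  }
  return total;
}

async function main() {
  const res = await fetch(METRICS_URL);
  const text = await res.text();
  for (const name of [
    "sidecar_initialization_errors_total",
    "sidecar_network_request_errors_total",
    "sidecar_cache_hits_total",
    "sidecar_cache_misses_total",
  ]) {
    console.log(`${name} = ${readCounter(text, name)}`);
  }
}

main().catch((err) => console.error("Failed to scrape metrics:", err));
```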

Sidecar

The Sidecar exposes standard health and readiness endpoints for monitoring and orchestration. Check health with GET /livez and readiness with GET /readyz, both returning {"status":"UP"} when operational. Metrics are available in Prometheus format at GET /metrics. Key metrics to monitor include:
  • sidecar_initialization_errors_total - errors during Sidecar initialization, such as an invalid API key or a misconfiguration. This counter should always remain at 0; if it rises above 0 in production, it should immediately alert the on-call engineer, and rolling back to the last working configuration is usually the right response
  • sidecar_invalid_api_key_errors_total - authentication failures
  • sidecar_network_request_errors_total - connectivity problems; increments on Stigg API errors. Trigger a warning if any errors (> 0) occur within a 5-minute window, and escalate if the error rate stays elevated for 10 minutes (see the alert-check sketch after this list)
  • sidecar_redis_client_errors_total - Redis connection issues; treat any increase as a signal of connectivity or stability problems between the Sidecar and Redis. A reasonable threshold is to trigger a warning if any errors (> 0) occur within a 5-minute window, and escalate if the error rate stays elevated for 10 minutes
  • sidecar_cache_hits_total and sidecar_cache_misses_total - cache performance
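One way to express the 5-minute windows above is with increase() over a 5m range and the Prometheus HTTP query API. The sketch below is illustrative only: the Prometheus URL is a placeholder, and in practice these conditions would normally live in Prometheus alerting rules rather than a script.

```typescript
// Evaluate the warning conditions described above against Prometheus (sketch).
const PROMETHEUS_URL = process.env.PROMETHEUS_URL ?? "http://localhost:9090"; // assumption

const WARNING_EXPRESSIONS: Record<string, string> = {
  // Any initialization or API-key error is immediately actionable.
  initializationErrors: "increase(sidecar_initialization_errors_total[5m]) > 0",
  invalidApiKeyErrors: "increase(sidecar_invalid_api_key_errors_total[5m]) > 0",
  // Any network or Redis client error within a 5-minute window warrants a warning.
  networkErrors: "increase(sidecar_network_request_errors_total[5m]) > 0",
  redisClientErrors: "increase(sidecar_redis_client_errors_total[5m]) > 0",
};

async function firing(expr: string): Promise<boolean> {
  const url = `${PROMETHEUS_URL}/api/v1/query?query=${encodeURIComponent(expr)}`;
  const res = await fetch(url);
  const body = (await res.json()) as { data?: { result?: unknown[] } };
  // A comparison expression returns a non-empty result set only when it holds.
  return (body.data?.result?.length ?? 0) > 0;
}

async function main() {
  for (const [name, expr] of Object.entries(WARNING_EXPRESSIONS)) {
    if (await firing(expr)) {
      console.warn(`WARNING: ${name} condition is firing (${expr})`);
    }
  }
}

main().catch((err) => console.error("Prometheus query failed:", err));
```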
The sidecar_network_request_errors_total metric is another indication that the Stigg API may be unreachable. For the Sidecar, entitlement check responses containing isFallback: true mean that fallback values are being used, as shown in the sketch below.
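A consuming application can detect such fallback responses and surface them to its own monitoring. The sketch below makes no assumption about which Stigg SDK or API you use beyond the documented isFallback flag; the lookup callback and the hasAccess field are illustrative placeholders.

```typescript
// Sketch: count and log fallback entitlement responses in the consuming application.
interface EntitlementResult {
  hasAccess: boolean;  // placeholder field name for illustration
  isFallback: boolean; // documented flag: true when fallback values are used
}

let fallbackResponses = 0; // expose this as a counter in your own metrics

// Wrap any entitlement lookup (Stigg SDK or Sidecar API call) so that
// fallback responses and errors are logged and counted.
async function checkedEntitlement(
  featureId: string,
  lookup: () => Promise<EntitlementResult>,
): Promise<boolean> {
  try {
    const result = await lookup();
    if (result.isFallback) {
      // Fallback values are in use, which usually means the Stigg API was unreachable.
      fallbackResponses += 1;
      console.warn(`Entitlement for ${featureId} served from fallback values`);
    }
    return result.hasAccess;
  } catch (err) {
    // Wrap calls in try/catch, log, and fail safe (see the guidance at the end of this page).
    console.error(`Entitlement check for ${featureId} failed:`, err);
    return false; // fail closed; choose the behavior that suits your product
  }
}
```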
A Sidecar liveness probe should not trigger automatic Kubernetes pod restarts for network connectivity issues, Stigg API unreachability, or Edge failures. This is an intentional design choice to prevent the Sidecar from entering a restart loop, which would disrupt the main application whenever the upstream Stigg API is temporarily unreachable. The recommended approach is to keep the sidecar pod alive and rely on cached reads and fail-safe fallback modes while surfacing connectivity or write errors into your own observability and alerting stack, rather than coupling them to pod restarts.

Persistent cache service

When Redis is enabled, the persistent cache service provides its own monitoring endpoints: GET /livez, GET /readyz, and GET /metrics. Important metrics include:
  • persistent_cache_write_duration_seconds - write performance; a high duration can indicate delayed propagation of changes to Redis. A write duration > 5 seconds sustained over a 5-minute window should trigger a warning, escalating if the duration remains elevated for 15 minutes (see the threshold sketch after this list)
  • persistent_cache_write_errors_total - write failures; like the Sidecar's Redis client errors, treat these as a signal of connectivity or stability problems with Redis. Trigger a warning if any errors (> 0) occur within a 5-minute window, and escalate if the error rate remains elevated for 10 minutes
  • persistent_cache_hit_ratio - overall cluster hit rate and a good indicator of how well Redis is being utilized. A ratio below 0.8 sustained for 15 minutes should trigger a warning, and a drop below 0.6 sustained for 60 minutes should be escalated
  • persistent_cache_memory_usage_bytes - memory consumption
  • persistent_cache_hits_total and persistent_cache_misses_total - cache effectiveness
  • persistent_cache_messages_processed_total - throughput tracking
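The hit-ratio and write-duration thresholds above (and in the alert table below) can be captured in a small evaluation function. This is only a sketch of the threshold logic; how you sample the metrics and track how long a condition has held is up to your monitoring stack.

```typescript
// Apply the persistent-cache thresholds described above (sketch).
type Severity = "ok" | "warning" | "error" | "critical";

// hitRatio: current value of persistent_cache_hit_ratio (0..1)
// minutesBelow: how long the ratio has stayed below the threshold
function evaluateHitRatio(hitRatio: number, minutesBelow: number): Severity {
  if (hitRatio < 0.6 && minutesBelow >= 60) return "error";
  if (hitRatio < 0.8 && minutesBelow >= 15) return "warning";
  return "ok";
}

// writeDurationSeconds: recent persistent_cache_write_duration_seconds observation
// minutesElevated: how long writes have stayed above 5 seconds
function evaluateWriteDuration(writeDurationSeconds: number, minutesElevated: number): Severity {
  if (writeDurationSeconds <= 5) return "ok";
  if (minutesElevated >= 30) return "critical";
  if (minutesElevated >= 15) return "error";
  if (minutesElevated >= 5) return "warning";
  return "ok";
}

console.log(evaluateHitRatio(0.72, 20));     // "warning": below 0.8 for 15+ minutes
console.log(evaluateWriteDuration(7.5, 35)); // "critical": above 5s for 30+ minutes
```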
These thresholds can be adjusted based on your production utilization patterns. Warnings are typically routed to the application team via Slack for investigation, while critical alerts should trigger PagerDuty notifications to the on-call engineer.
| Metric | Purpose | Warning (Slack) | Error (Slack) | Critical (PagerDuty) | Remediation |
| --- | --- | --- | --- | --- | --- |
| sidecar_initialization_errors_total or sidecar_invalid_api_key_errors_total | Sidecar initialization failures | > 0 | > 0 | > 0 | Note: often caused by a bad deployment; roll back if possible. Check service error logs. Validate that the Stigg API key is correct and active. Review recent configuration changes. Roll back to the last stable settings or image version if needed. |
| sidecar_network_request_errors_total | Stigg API unreachable | > 0 over 5 min | Stays elevated > 10 min | Stays elevated > 15 min | Check error logs. Verify API reachability via the status page. Notify the Stigg team if the issue persists. |
| sidecar_redis_client_errors_total | Redis unreachable from Sidecar | > 0 over 5 min | Stays elevated > 10 min | Stays elevated > 10 min | Check Redis health, memory usage, and instance reachability. |
| persistent_cache_write_errors_total | Redis write failures | > 0 over 5 min | Stays elevated > 10 min | Stays elevated > 10 min | Check Redis health, memory usage, and instance reachability. |
| persistent_cache_write_duration_seconds | Redis write latency | > 5 seconds over 5 min | Stays elevated > 15 min | Stays elevated > 30 min | Check Redis health and reachability. Monitor CPU/memory. If consistently high, consider scaling out. |
| persistent_cache_hit_ratio | Redis cache efficiency | < 80% over 15 min | < 60% sustained > 60 min | Not applicable | Check Redis health and reachability. Note: a low ratio after restarts or cache clearing is expected and should recover as the cache repopulates. |
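If you do route notifications from your own code rather than Alertmanager, the severity split might look roughly like this sketch. The Slack webhook URL and PagerDuty routing key are placeholders for your own integration settings.

```typescript
// Sketch: route alerts by severity, warnings/errors to Slack and criticals to PagerDuty.
const SLACK_WEBHOOK_URL = process.env.SLACK_WEBHOOK_URL ?? "";         // placeholder
const PAGERDUTY_ROUTING_KEY = process.env.PAGERDUTY_ROUTING_KEY ?? ""; // placeholder

async function notify(severity: "warning" | "error" | "critical", summary: string): Promise<void> {
  if (severity === "critical") {
    // PagerDuty Events API v2: pages the on-call engineer.
    await fetch("https://events.pagerduty.com/v2/enqueue", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        routing_key: PAGERDUTY_ROUTING_KEY,
        event_action: "trigger",
        payload: { summary, source: "stigg-sidecar-monitoring", severity: "critical" },
      }),
    });
  } else {
    // Slack incoming webhook: posts to the application team's channel.
    await fetch(SLACK_WEBHOOK_URL, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ text: `[${severity.toUpperCase()}] ${summary}` }),
    });
  }
}

// Example: surface a sustained hit-ratio drop as a warning.
notify("warning", "persistent_cache_hit_ratio below 0.8 for 15 minutes").catch(console.error);
```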
For auto-scaling, monitor the service’s CPU and memory metrics:
| Metric | Purpose | Trigger |
| --- | --- | --- |
| process_cpu_seconds_total | CPU usage over time | > 60% avg over 5m |
| process_resident_memory_bytes | Memory used by the process | > 80% avg over 5m |
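These triggers can be phrased as PromQL over the process metrics. The sketch below only evaluates and logs them; the Prometheus URL, the 1-core CPU limit, and the 512 MiB memory limit are assumptions, and real autoscaling would normally be driven by a Kubernetes HPA rather than a script.

```typescript
// Sketch: evaluate the auto-scaling triggers above via PromQL.
const PROMETHEUS_URL = process.env.PROMETHEUS_URL ?? "http://localhost:9090"; // assumption
const CPU_LIMIT_CORES = 1;                      // assumption: 1-core CPU limit
const MEMORY_LIMIT_BYTES = 512 * 1024 * 1024;   // assumption: 512 MiB memory limit

const TRIGGERS = {
  cpu: `avg(rate(process_cpu_seconds_total[5m])) > ${0.6 * CPU_LIMIT_CORES}`,
  memory: `avg_over_time(process_resident_memory_bytes[5m]) > ${0.8 * MEMORY_LIMIT_BYTES}`,
};

async function triggerMet(expr: string): Promise<boolean> {
  const url = `${PROMETHEUS_URL}/api/v1/query?query=${encodeURIComponent(expr)}`;
  const res = await fetch(url);
  const body = (await res.json()) as { data?: { result?: unknown[] } };
  return (body.data?.result?.length ?? 0) > 0; // non-empty only when the comparison holds
}

async function main() {
  for (const [name, expr] of Object.entries(TRIGGERS)) {
    console.log(`${name} scale-out trigger: ${(await triggerMet(expr)) ? "MET" : "not met"}`);
  }
}

main().catch((err) => console.error("Prometheus query failed:", err));
```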
Wrap all SDK and API calls in try/catch blocks and log any errors, as shown in the sketch below.
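A minimal pattern for this, assuming nothing about the specific SDK beyond its calls returning promises:

```typescript
// Generic wrapper: every SDK/API call goes through try/catch with logging (sketch).
// `fallback` lets callers fail safe when the Sidecar or Stigg API is unavailable.
async function withErrorLogging<T>(
  operation: string,
  call: () => Promise<T>,
  fallback: T,
): Promise<T> {
  try {
    return await call();
  } catch (err) {
    // Log enough context to correlate with the sidecar_* error metrics above.
    console.error(`${operation} failed:`, err);
    return fallback;
  }
}

// Usage (illustrative; `sdkCall` stands in for your actual SDK/API call):
// const canUse = await withErrorLogging("entitlement check", () => sdkCall(), false);
```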