Health
The service exposes two endpoints accessible via HTTP:

GET /livez
Returns 200 if the service is alive.
Healthy response:
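```json
{"status":"UP"}
```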
GET /readyz
Returns 200 if the service is ready.
Healthy response:
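```json
{"status":"UP"}
```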
Metrics
The Sidecar exposes a GET /metrics endpoint that returns service metrics in Prometheus format.
This endpoint includes both system-level and Sidecar-specific metrics. These metrics are helpful for monitoring the health and performance of your Sidecar service.
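In production this endpoint is usually scraped by Prometheus, but it can also be queried directly. The sketch below reads a single counter from the metrics output; the localhost:8090 base URL is an assumption and should be replaced with your Sidecar's actual address and metrics port.

```typescript
// Sketch: scrape the Sidecar's Prometheus metrics endpoint and read one counter.
// The base URL below is an assumption - point it at your own Sidecar deployment.
const SIDECAR_METRICS_URL = "http://localhost:8090/metrics";

async function readCounter(metricName: string): Promise<number | undefined> {
  const res = await fetch(SIDECAR_METRICS_URL);
  if (!res.ok) {
    throw new Error(`Failed to scrape metrics: HTTP ${res.status}`);
  }
  const body = await res.text();
  // Prometheus text format: "<name>{labels} <value>" per line; comment lines start with '#'.
  for (const line of body.split("\n")) {
    if (!line.startsWith("#") && line.startsWith(metricName)) {
      const value = Number(line.trim().split(/\s+/).pop());
      if (!Number.isNaN(value)) return value;
    }
  }
  return undefined; // metric not reported yet
}

readCounter("sidecar_network_request_errors_total")
  .then((errors) => console.log("network request errors:", errors ?? "not reported"))
  .catch((err) => console.error("metrics scrape failed:", err));
```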
Sidecar
The Sidecar exposes standard health and readiness endpoints for monitoring and orchestration. Check health with GET /livez and readiness with GET /readyz, both returning {"status":"UP"} when operational.
Metrics are available in Prometheus format at GET /metrics. Key metrics to monitor include:
- sidecar_initialization_errors_total - indicates errors during sidecar initialization, such as an invalid API key or misconfiguration. It should always remain at 0; if the count rises above 0 in production, it should immediately trigger an alert to the on-call engineer or a rollback to the last working configuration.
- sidecar_invalid_api_key_errors_total - authentication failures.
- sidecar_network_request_errors_total - connectivity problems; increments on API errors. It should trigger a warning if any errors (>0) occur within a 5-minute window, and escalate if the error rate remains elevated for 10 minutes.
- sidecar_redis_client_errors_total - Redis connection issues; should be treated as a signal of connectivity or stability problems with Redis for the service. A reasonable threshold is to trigger a warning if any errors (>0) occur within a 5-minute window, and escalate if the error rate remains elevated for 10 minutes.
- sidecar_cache_hits_total and sidecar_cache_misses_total - cache performance.
The sidecar_network_request_errors_total metric is another indication that the Stigg API may be unreachable.
For Sidecar, entitlement check responses containing isFallback: true mean that fallback values are being used (see the sketch below).

A Sidecar liveness probe should not trigger automatic Kubernetes pod restarts for network connectivity issues, Stigg API unreachability, or Edge failures. This is an intentional design choice: it prevents the Sidecar from entering a restart loop, which would disrupt the main application whenever the upstream Stigg API is temporarily unreachable. The recommended approach is to keep the sidecar pod alive and rely on cached reads and fail-safe fallback modes, surfacing connectivity or write errors into your own observability and alerting stack rather than coupling them to pod restarts.
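As an illustration, the sketch below surfaces fallback results to your own logging or metrics instead of failing requests; the sidecarClient object and its getBooleanEntitlement method are placeholders for your own client, and only the isFallback flag comes from the behavior described above.

```typescript
// Sketch: detect entitlement results served from fallback values and surface them to monitoring.
// `sidecarClient` and `getBooleanEntitlement` are placeholders for your own Stigg client setup.
interface EntitlementResult {
  hasAccess: boolean;
  isFallback: boolean;
}

interface SidecarClient {
  getBooleanEntitlement(req: { customerId: string; featureId: string }): Promise<EntitlementResult>;
}

async function checkEntitlement(
  sidecarClient: SidecarClient,
  customerId: string,
  featureId: string
): Promise<boolean> {
  const result = await sidecarClient.getBooleanEntitlement({ customerId, featureId });
  if (result.isFallback) {
    // Fallback values are in use - likely the Stigg API is currently unreachable.
    // Log or emit a metric; do not restart the pod or fail the request.
    console.warn(`Entitlement for ${featureId} served from fallback values`);
  }
  return result.hasAccess;
}
```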
Persistent cache service
When Redis is enabled, the persistent cache service provides its own monitoring endpoints: GET /livez, GET /readyz, and GET /metrics. Important metrics include:
- persistent_cache_write_duration_seconds - write performance; a high duration can indicate delayed propagation of changes to Redis. Write duration >5 seconds sustained over a 5-minute window should trigger a warning, and escalate if the duration remains elevated for 15 minutes.
- persistent_cache_write_errors_total - write failures; should be treated as a signal of connectivity or stability issues with Redis for the service, same as above. Trigger a warning if any errors (>0) occur within a 5-minute window, and escalate if the error rate remains elevated for 10 minutes.
- persistent_cache_hit_ratio - overall cluster hit rate and a good indicator of Redis utilization. A ratio below 0.8 sustained for 15 minutes should trigger a warning, and a drop below 0.6 over 60 minutes should be escalated.
- persistent_cache_memory_usage_bytes - memory consumption.
- persistent_cache_hits_total and persistent_cache_misses_total - cache effectiveness.
- persistent_cache_messages_processed_total - throughput tracking.
The hit ratio metric reflects the overall performance of the cluster and is typically a good indication of how well Redis is being utilized.
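If you need to derive a comparable ratio yourself, for example from the sidecar_cache_hits_total and sidecar_cache_misses_total counters, a simple sketch is hits / (hits + misses). The helper below classifies the result against the 0.8 and 0.6 thresholds mentioned above; as noted there, these should be evaluated over sustained windows rather than at a single point in time.

```typescript
// Sketch: derive a cache hit ratio from hit/miss counters and classify it
// against the thresholds discussed above (warning below 0.8, escalation below 0.6).
type HitRatioStatus = "ok" | "warning" | "critical";

function hitRatio(hits: number, misses: number): number {
  const total = hits + misses;
  return total === 0 ? 1 : hits / total; // treat an empty cache as healthy
}

function classifyHitRatio(hits: number, misses: number): HitRatioStatus {
  const ratio = hitRatio(hits, misses);
  if (ratio < 0.6) return "critical";
  if (ratio < 0.8) return "warning";
  return "ok";
}

// Example: 750 hits and 250 misses -> ratio 0.75 -> "warning"
console.log(classifyHitRatio(750, 250));
```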
Recommended metrics and alert thresholds
These thresholds can be adjusted based on your production utilization patterns. Warnings are typically routed to the application team via Slack for investigation, while critical alerts should trigger PagerDuty notifications to the on-call engineer.

| Metric | Purpose | Warning (Slack) | Error (Slack) | Critical (PagerDuty) | Remediation |
|---|---|---|---|---|---|
| sidecar_initialization_errors_total or sidecar_invalid_api_key_errors_total | Sidecar initialization failures | > 0 | > 0 | > 0 | Note: often caused by a bad deployment. Roll back if possible. Check service error logs. Validate Stigg API key is correct and active. Review recent configuration changes. Roll back to last stable settings or image version if needed. |
| sidecar_network_request_errors_total | Stigg API unreachable | > 0 over 5 min | Stays elevated > 10 min | Stays elevated > 15 min | Check error logs. Verify API reachability via status page. Notify the Stigg team if the issue persists. |
| sidecar_redis_client_errors_total | Redis unreachable from Sidecar | > 0 over 5 min | Stays elevated > 10 min | Stays elevated > 10 min | Check Redis health, memory usage, and instance reachability. |
| persistent_cache_write_errors_total | Redis write failures | > 0 over 5 min | Stays elevated > 10 min | Stays elevated > 10 min | Check Redis health, memory usage, and instance reachability. |
| persistent_cache_write_duration_seconds | Redis write latency | > 5 seconds over 5 min | Stays elevated > 15 min | Stays elevated > 30 min | Check Redis health and reachability. Monitor CPU/memory. If consistently high, consider scaling out. |
| persistent_cache_hit_ratio | Redis cache efficiency | < 80% over 15 min | < 60% sustained > 60 min | Not applicable | Check Redis health and reachability. Note: low ratio after restarts or cache clearing is expected and should recover as the cache repopulates. |
System-level process metrics are also worth monitoring:

| Metric | Purpose | Trigger |
|---|---|---|
| process_cpu_seconds_total | CPU usage over time | > 60% avg over 5m |
| process_resident_memory_bytes | Memory used by the process | > 80% avg over 5m |
Wrap all SDK/API calls in try/catch blocks and log the errors, as in the sketch below.
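A minimal sketch, assuming a hypothetical sidecarClient with a getEntitlements call and your own logger; the point is only that errors are caught and logged rather than allowed to crash the caller.

```typescript
// Sketch: wrap a sidecar/SDK call in try/catch and log failures.
// `sidecarClient.getEntitlements` and `logger` are placeholders for your own client and logging setup.
interface Logger {
  error(msg: string, err: unknown): void;
}

async function getEntitlementsSafely(
  sidecarClient: { getEntitlements(customerId: string): Promise<unknown> },
  logger: Logger,
  customerId: string
): Promise<unknown | null> {
  try {
    return await sidecarClient.getEntitlements(customerId);
  } catch (err) {
    // Log and surface to your observability stack; do not let the error propagate to the main application.
    logger.error(`Entitlement lookup failed for customer ${customerId}`, err);
    return null; // or fall back to safe defaults appropriate for your application
  }
}
```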
