# Service monitoring

## Health

The service exposes two endpoints accessible via **HTTP**:

**`GET /livez`**

Returns `200` if the service is alive.

Healthy response:

<CodeGroup>
  ```json JSON
  { "status": "UP" }
  ```
</CodeGroup>

**`GET /readyz`**

Returns `200` if the service is ready.

Healthy response:

<CodeGroup>
  ```json JSON
  { "status": "UP" }
  ```
</CodeGroup>
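
As an illustration, the two health endpoints can be polled with a few lines of Python. This is a minimal sketch, not an official client: `BASE_URL` is a placeholder for wherever your deployment exposes the HTTP port, and `is_up` and `probe` are hypothetical helper names.

```python
import json
import urllib.error
import urllib.request

# Assumption: placeholder address; replace with the host and port of your deployment.
BASE_URL = "http://localhost:8080"


def is_up(status_code: int, body: str) -> bool:
    """Return True only for a 200 response whose JSON body has status "UP"."""
    if status_code != 200:
        return False
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and payload.get("status") == "UP"


def probe(path: str, timeout: float = 2.0) -> bool:
    """Call a health endpoint (for example "/livez") and report whether it is healthy."""
    try:
        with urllib.request.urlopen(BASE_URL + path, timeout=timeout) as resp:
            return is_up(resp.status, resp.read().decode("utf-8"))
    except (urllib.error.URLError, OSError):
        # Connection errors, timeouts, and non-2xx statuses all count as unhealthy.
        return False
```

Calling `probe("/livez")` and `probe("/readyz")` from a scheduled job gives a simple external check that does not depend on the orchestrator.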

## Metrics

The Sidecar exposes a **`GET /metrics`** endpoint that returns service metrics in [Prometheus](https://prometheus.io/) format.

This endpoint includes both system-level and Sidecar-specific metrics. These metrics are helpful for monitoring the health and performance of your Sidecar service.

### Sidecar

The Sidecar exposes standard liveness and readiness endpoints for monitoring and orchestration: `GET /livez` for liveness and `GET /readyz` for readiness. Both return `{"status":"UP"}` when the service is operational.

Metrics are available in Prometheus format at `GET /metrics`. Key metrics to monitor include:

* `sidecar_initialization_errors_total` - errors during Sidecar initialization, such as an invalid API key or misconfiguration. It should always remain at 0. If it rises above 0 in production, immediately alert the on-call engineer or roll back to the last working configuration.
* `sidecar_invalid_api_key_errors_total` - authentication failures.
* `sidecar_network_request_errors_total` - connectivity problems; increments on API errors. Trigger a warning if any errors (>0) occur within a 5-minute window, and escalate if the error rate remains elevated for 10 minutes.
* `sidecar_redis_client_errors_total` - Redis connection issues. Treat these as a signal of connectivity or stability problems between the Sidecar and Redis. A reasonable threshold is to trigger a warning if any errors (>0) occur within a 5-minute window, and escalate if the error rate remains elevated for 10 minutes.
* `sidecar_cache_hits_total` and `sidecar_cache_misses_total` - cache performance.
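
To make the error counters above concrete, the following sketch parses the text returned by `GET /metrics` and reports which of the watched counters are above zero. It assumes the standard Prometheus text exposition format and sums all label combinations per metric name; `parse_samples`, `nonzero_watched`, and the `WATCHED` tuple are illustrative names, not part of the Sidecar.

```python
# Counters from the list above that are expected to stay at 0.
WATCHED = (
    "sidecar_initialization_errors_total",
    "sidecar_invalid_api_key_errors_total",
    "sidecar_network_request_errors_total",
    "sidecar_redis_client_errors_total",
)


def parse_samples(metrics_text: str) -> dict[str, float]:
    """Sum sample values per metric name from Prometheus text output."""
    totals: dict[str, float] = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            # Sample with labels: name{label="value"} value [timestamp]
            name = line[: line.index("{")]
            rest = line[line.rindex("}") + 1:]
        else:
            name, _, rest = line.partition(" ")
        fields = rest.split()
        if not fields:
            continue
        try:
            value = float(fields[0])
        except ValueError:
            continue  # ignore lines that do not carry a numeric value
        totals[name] = totals.get(name, 0.0) + value
    return totals


def nonzero_watched(metrics_text: str) -> dict[str, float]:
    """Return the watched error counters whose summed value is above zero."""
    totals = parse_samples(metrics_text)
    return {name: totals[name] for name in WATCHED if totals.get(name, 0.0) > 0}
```

In practice a Prometheus server would scrape and evaluate these counters; a script like this is mainly useful for ad hoc checks during an incident.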

<Note>
  An increasing `sidecar_network_request_errors_total` value indicates that the Stigg API may be unreachable.
  Entitlement check responses from the Sidecar that contain `isFallback: true` indicate that fallback values are being used.
</Note>

<Note>
  A Sidecar liveness probe should not trigger automatic Kubernetes pod restarts for network connectivity issues, Stigg API unreachability, or Edge failures. This is an intentional design choice to prevent the Sidecar from entering a restart loop, which would disrupt the main application whenever the upstream Stigg API is temporarily unreachable. The recommended approach is to keep the sidecar pod alive and rely on cached reads and fail-safe fallback modes while surfacing connectivity or write errors into your own observability and alerting stack, rather than coupling them to pod restarts.
</Note>

### Persistent cache service

When Redis is enabled, the persistent cache service provides its own monitoring endpoints: `GET /livez`, `GET /readyz`, and `GET /metrics`. Important metrics include:

* `persistent_cache_write_duration_seconds` - write performance. High durations can indicate a delay in propagating changes to Redis. A write duration above 5 seconds sustained over a 5-minute window should trigger a warning; escalate if it remains elevated for 15 minutes
* `persistent_cache_write_errors_total` - write failures. As with `sidecar_redis_client_errors_total`, treat these as a signal of connectivity or stability issues with Redis. Trigger a warning if any errors (>0) occur within a 5-minute window, and escalate if the error rate remains elevated for 10 minutes
* `persistent_cache_hit_ratio` - overall cluster hit rate and a good indicator of Redis utilization. A ratio below 0.8 sustained for 15 minutes should trigger a warning; a ratio below 0.6 sustained for 60 minutes should be escalated
* `persistent_cache_memory_usage_bytes` - memory consumption
* `persistent_cache_hits_total` and `persistent_cache_misses_total` - cache effectiveness
* `persistent_cache_messages_processed_total` - throughput tracking

<Note>
  The hit ratio metric reflects the overall performance of the cluster and is typically a good indication of how well Redis is being utilized.
</Note>
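
The hit-ratio thresholds described above (below 0.8 for 15 minutes, below 0.6 for 60 minutes) can be expressed as a small check. This sketch assumes you already collect one `persistent_cache_hit_ratio` sample per minute, oldest first; the function names and the severity labels are illustrative.

```python
from typing import Sequence


def sustained_below(per_minute: Sequence[float], threshold: float, minutes: int) -> bool:
    """True if there are at least `minutes` samples and the latest `minutes` are all below threshold."""
    if len(per_minute) < minutes:
        return False  # not enough history to call the condition sustained
    return all(value < threshold for value in per_minute[-minutes:])


def hit_ratio_severity(per_minute: Sequence[float]) -> str:
    """Map per-minute hit ratio samples to "ok", "warning", or "error"."""
    if sustained_below(per_minute, 0.6, 60):
        return "error"
    if sustained_below(per_minute, 0.8, 15):
        return "warning"
    return "ok"
```

Requiring a full window of samples keeps a freshly restarted cache, which starts with a low ratio, from alerting before it has had time to repopulate.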

### Recommended metrics and alert thresholds

These thresholds can be adjusted based on your production utilization patterns. Warnings are typically routed to the application team via Slack for investigation, while critical alerts should trigger **PagerDuty** notifications to the on-call engineer.

| Metric                                                                                                                      | Purpose                         |          Warning (Slack) |              Error (Slack) |    Critical (PagerDuty) | Remediation                                                                                                                                                                                                                               |
| --------------------------------------------------------------------------------------------------------------------------- | ------------------------------- | -----------------------: | -------------------------: | ----------------------: | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| <span style={{whiteSpace: "nowrap"}}>`sidecar_initialization_errors_total` or `sidecar_invalid_api_key_errors_total`</span> | Sidecar initialization failures |                    `> 0` |                      `> 0` |                   `> 0` | Note: often caused by a bad deployment. Roll back if possible. Check service error logs. Validate Stigg API key is correct and active. Review recent configuration changes. Roll back to last stable settings or image version if needed. |
| <span style={{whiteSpace: "nowrap"}}>`sidecar_network_request_errors_total`</span>                                          | Stigg API unreachable           |         `> 0` over 5 min |    Stays elevated > 10 min | Stays elevated > 15 min | Check error logs. Verify API reachability via status page. Notify the Stigg team if the issue persists.                                                                                                                                   |
| <span style={{whiteSpace: "nowrap"}}>`sidecar_redis_client_errors_total`</span>                                             | Redis unreachable from Sidecar  |         `> 0` over 5 min |    Stays elevated > 10 min | Stays elevated > 10 min | Check Redis health, memory usage, and instance reachability.                                                                                                                                                                              |
| <span style={{whiteSpace: "nowrap"}}>`persistent_cache_write_errors_total`</span>                                           | Redis write failures            |         `> 0` over 5 min |    Stays elevated > 10 min | Stays elevated > 10 min | Check Redis health, memory usage, and instance reachability.                                                                                                                                                                              |
| <span style={{whiteSpace: "nowrap"}}>`persistent_cache_write_duration_seconds`</span>                                       | Redis write latency             | `> 5 seconds` over 5 min |    Stays elevated > 15 min | Stays elevated > 30 min | Check Redis health and reachability. Monitor CPU/memory. If consistently high, consider scaling out.                                                                                                                                      |
| <span style={{whiteSpace: "nowrap"}}>`persistent_cache_hit_ratio`</span>                                                    | Redis cache efficiency          |      `< 80%` over 15 min | `< 60%` sustained > 60 min |          Not applicable | Check Redis health and reachability. Note: low ratio after restarts or cache clearing is expected and should recover as the cache repopulates.                                                                                            |
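
One possible reading of the "`> 0` over 5 min" and "stays elevated" columns is sketched below: the condition is evaluated once per minute against the counter's increase over the trailing 5 minutes, and severity depends on how many consecutive evaluations were elevated. This interpretation, the one-minute sampling, and the names `increase`, `elevated_streak`, and `classify` are assumptions for illustration. The defaults follow the `sidecar_network_request_errors_total` row.

```python
from typing import Sequence


def increase(samples: Sequence[float]) -> float:
    """Total increase across cumulative counter samples (oldest first).

    A drop between two samples is treated as a counter reset, for example
    after a process restart, so counting resumes from zero.
    """
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        total += (cur - prev) if cur >= prev else cur
    return total


def elevated_streak(samples: Sequence[float], window: int = 5) -> int:
    """Count the latest consecutive per-minute evaluations with a nonzero increase.

    Each evaluation looks at the counter's increase over the trailing
    `window` minutes, ending at that sample.
    """
    streak = 0
    for end in range(len(samples) - 1, 0, -1):
        start = max(0, end - window)
        if increase(samples[start:end + 1]) > 0:
            streak += 1
        else:
            break
    return streak


def classify(streak: int, error_after: int = 10, critical_after: int = 15) -> str:
    """Map an elevated streak (in minutes) to a severity label."""
    if streak == 0:
        return "ok"
    if streak > critical_after:
        return "critical"
    if streak > error_after:
        return "error"
    return "warning"
```

With this reading, a single isolated error keeps the 5-minute window elevated for at most five evaluations, so it stays at warning level, while errors that continue every minute escalate.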

For auto-scaling, monitor the service’s CPU and memory metrics:

| Metric                           | Purpose                    |           Trigger |
| -------------------------------- | -------------------------- | ----------------: |
| `process_cpu_seconds_total`      | CPU usage over time        | > 60% avg over 5m |
| `process_resident_memory_bytes`  | Memory used by the process | > 80% avg over 5m |
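
Because `process_cpu_seconds_total` is a cumulative counter, the percentage trigger applies to its rate of change rather than its raw value. The sketch below shows that arithmetic under two assumptions that the table does not state: CPU percentage is measured relative to the CPU limit in cores, and memory percentage relative to the container's memory limit. All function and parameter names are illustrative.

```python
def cpu_percent(cpu_start: float, cpu_end: float, elapsed_seconds: float,
                cpu_limit_cores: float = 1.0) -> float:
    """Average CPU utilization between two counter samples, as a percentage of the limit."""
    if elapsed_seconds <= 0 or cpu_limit_cores <= 0:
        raise ValueError("elapsed_seconds and cpu_limit_cores must be positive")
    return (cpu_end - cpu_start) / (elapsed_seconds * cpu_limit_cores) * 100.0


def memory_percent(resident_bytes: float, memory_limit_bytes: float) -> float:
    """Resident memory as a percentage of the memory limit."""
    if memory_limit_bytes <= 0:
        raise ValueError("memory_limit_bytes must be positive")
    return resident_bytes / memory_limit_bytes * 100.0


def should_scale(cpu_pct: float, mem_pct: float) -> bool:
    """Apply the triggers from the table: CPU above 60% or memory above 80%."""
    return cpu_pct > 60.0 or mem_pct > 80.0
```

For example, a counter that grows by 180 CPU seconds over a 300-second window on a one-core limit corresponds to 60% average utilization.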

<Note>
  Wrap all SDK and API calls in try/catch blocks and log any errors.
</Note>
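
In Python, the equivalent of this advice is a `try`/`except` wrapper. The sketch below is a generic helper, not part of any Stigg SDK: `call_with_fallback` is a hypothetical name, and returning a caller-supplied fallback value is an added design assumption on top of the logging that the note recommends.

```python
import logging
from typing import Callable, TypeVar

T = TypeVar("T")
logger = logging.getLogger("stigg")


def call_with_fallback(operation: Callable[[], T], fallback: T, description: str) -> T:
    """Run an SDK or API call, log any exception with its traceback, and return a fallback."""
    try:
        return operation()
    except Exception:
        # logger.exception records the message together with the active traceback.
        logger.exception("Stigg call failed: %s", description)
        return fallback
```

A call site would pass the real SDK call as a zero-argument callable, for example `call_with_fallback(lambda: check_access(), fallback=False, description="entitlement check")`, where `check_access` stands in for whatever call your application makes.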
