Monitoring and Health Checks
This document provides a comprehensive overview of the monitoring and health check capabilities of the IT Ticketing Service. It is intended for developers and operations personnel responsible for maintaining the application's stability and performance. The application leverages the Spring Boot Actuator module to expose critical operational information.
For instructions on running the application to observe these endpoints, please refer to the Running Locally guide.
Spring Boot Actuator
The core of our monitoring strategy is built upon Spring Boot Actuator. This module adds several production-ready features to the application, primarily by exposing a set of HTTP endpoints for monitoring and management.
Actuator Endpoints Overview
Actuator provides numerous built-in endpoints. In this application, we have explicitly configured which endpoints are accessible over the web for security and simplicity. The configuration is managed in src/main/resources/application.properties:
```properties
# Actuator Configuration
management.endpoints.web.exposure.include=health,info
management.endpoint.health.show-details=always
```

- `management.endpoints.web.exposure.include`: A comma-separated list of Actuator endpoint IDs to expose over HTTP. We have enabled `health` and `info`.
- `management.endpoint.health.show-details`: Set to `always`, meaning the `/actuator/health` endpoint includes detailed information from all configured health indicators, even for unauthenticated requests.
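Since `info` is exposed, note that `/actuator/info` returns an empty JSON object until info contributors are configured. As an illustrative sketch (the `info.*` keys below are arbitrary examples, not settings from this project), properties such as the following would populate it:

```properties
# Illustrative: populate /actuator/info (keys under info.* are arbitrary examples)
# On recent Spring Boot versions the env info contributor must be enabled explicitly:
management.info.env.enabled=true
info.app.name=IT Ticketing Service
info.app.version=@project.version@
```

With Maven resource filtering (the default under `spring-boot-starter-parent`), the `@project.version@` placeholder is replaced at build time.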
Enabling Actuator Features
The Actuator dependency is included in the pom.xml, which enables these features automatically.
```xml
<!-- pom.xml -->
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
```
Security Considerations
By default, Spring Boot Actuator exposes only the /health and /info endpoints. Exposing other endpoints, such as /env, /configprops, or /heapdump, can leak sensitive configuration details and pose a significant security risk. Our current configuration adheres to this best practice. If there is a future need to expose more sensitive endpoints, they must be secured using Spring Security or placed behind a firewall accessible only to administrative users.
Health Check Endpoint
The health check endpoint is the primary mechanism for determining the application's operational status. It is crucial for automated monitoring, load balancer integration, and container orchestration.
GET /actuator/health
This endpoint aggregates the status of all registered HealthIndicator beans and provides a consolidated health status.
Request:
```bash
curl http://localhost:8080/actuator/health
```
Example Response (UP):
When all components are healthy, the service reports an overall status of UP. Due to management.endpoint.health.show-details=always, the response includes the status of individual components.
```json
{
  "status": "UP",
  "components": {
    "db": {
      "status": "UP",
      "details": {
        "database": "MySQL",
        "validationQuery": "isValid()"
      }
    },
    "diskSpace": {
      "status": "UP",
      "details": {
        "total": 499963174912,
        "free": 108813365248,
        "threshold": 10485760,
        "exists": true
      }
    },
    "ping": {
      "status": "UP"
    },
    "rabbit": {
      "status": "UP",
      "details": {
        "version": "3.12.12"
      }
    }
  }
}
```
If a critical dependency like the database or message broker is down, the component status will be DOWN, and the overall status will also be DOWN.
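For illustration, a response with the database unreachable might look like the following (the values are representative, not captured from a running instance). Note that Actuator also maps an overall `DOWN` status to an HTTP 503 response by default, which is what load balancers and orchestrators typically key on:

```json
{
  "status": "DOWN",
  "components": {
    "db": {
      "status": "DOWN",
      "details": {
        "error": "java.sql.SQLException: Connection refused"
      }
    },
    "diskSpace": { "status": "UP" },
    "ping": { "status": "UP" },
    "rabbit": { "status": "UP" }
  }
}
```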
Health Indicators
Spring Boot automatically configures health indicators for technologies found on the classpath. For this application, the key auto-configured indicators are:
| Indicator | Dependency Trigger | Description |
|---|---|---|
| `db` | `spring-boot-starter-data-jpa` | Checks the connection to the MySQL database by executing a validation query. |
| `rabbit` | `spring-boot-starter-amqp` | Checks the connection to the RabbitMQ broker. |
| `diskSpace` | (Default) | Checks for adequate free disk space on the volume where the app is running. |
| `ping` | (Default) | A trivial indicator that always returns UP, confirming the app is responsive. |
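The `diskSpace` indicator's default 10 MB threshold is quite low for a production workload and can be raised via configuration. A minimal sketch (the `100MB` value is an arbitrary example, not the project's current setting):

```properties
# Illustrative: raise the free-space threshold for the diskSpace indicator
management.health.diskspace.threshold=100MB
```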
Custom Health Checks
While auto-configured indicators cover our main dependencies, custom health checks can be created to monitor other external services or internal application states. This is achieved by creating a Spring bean that implements the HealthIndicator interface.
Example (Hypothetical):
```java
// This is an example and not present in the current codebase.
// It demonstrates how to add a check for a fictional external API.
import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.stereotype.Component;

@Component
public class ExternalApiServiceHealthIndicator implements HealthIndicator {

    @Override
    public Health health() {
        try {
            // Logic to check the status of the external service,
            // e.g., make a HEAD request to a status endpoint
            int statusCode = checkExternalService();
            if (statusCode == 200) {
                return Health.up().withDetail("service", "Available").build();
            }
            return Health.down().withDetail("service", "Status code: " + statusCode).build();
        } catch (Exception ex) {
            return Health.down(ex).build();
        }
    }

    private int checkExternalService() {
        // ... implementation ...
        return 200;
    }
}
```
Readiness and Liveness Probes
For containerized environments like Kubernetes, Actuator provides distinct liveness (/actuator/health/liveness) and readiness (/actuator/health/readiness) probes.
- Liveness Probe: Indicates if the application is running. A failure suggests a fatal, unrecoverable error, and the container should be restarted.
- Readiness Probe: Indicates if the application is ready to accept new traffic. A failure might occur during startup while dependencies are being initialized, or if the application is temporarily overloaded. A failing readiness probe will cause the container to be removed from the load balancer's pool without being restarted.
These probes can be enabled and configured in application.properties if needed (e.g., management.health.probes.enabled=true).
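As an illustration, once the probes are enabled, a Kubernetes deployment could wire them to the Actuator paths as follows (the port and timing values are placeholders, not settings from this project):

```yaml
# Illustrative Kubernetes probe configuration for the application container
livenessProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /actuator/health/readiness
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 5
```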
Metrics and Monitoring
Actuator uses Micrometer to collect application metrics. While the /actuator/metrics endpoint is not exposed by default in our configuration, the underlying metrics are still collected.
Available Metrics
Micrometer provides a rich set of metrics out-of-the-box, including:
- JVM Metrics: Memory usage, garbage collection, thread utilization.
- System Metrics: CPU usage.
- Web Server Metrics: Request latency, error rates for the embedded Tomcat server.
- Data Source Metrics: Active, idle, and pending connections for the Hikari connection pool.
- RabbitMQ Metrics: Connection and channel metrics.
Metrics Endpoints
To view available metrics, you would first need to expose the metrics endpoint:
```properties
# WARNING: Exposing metrics can reveal internal operational data.
# Ensure this is only accessible to trusted monitoring systems.
management.endpoints.web.exposure.include=health,info,metrics
```
Once exposed, you can list all available metric names: GET /actuator/metrics
And query a specific metric: GET /actuator/metrics/http.server.requests
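A response for a specific metric follows Micrometer's standard shape of measurements plus available tags for drill-down; for example (the values and URIs below are illustrative):

```json
{
  "name": "http.server.requests",
  "baseUnit": "seconds",
  "measurements": [
    { "statistic": "COUNT", "value": 1024 },
    { "statistic": "TOTAL_TIME", "value": 52.7 },
    { "statistic": "MAX", "value": 0.45 }
  ],
  "availableTags": [
    { "tag": "status", "values": ["200", "404", "500"] },
    { "tag": "uri", "values": ["/api/tickets", "/actuator/health"] }
  ]
}
```

Appending `?tag=status:500` to the URL drills down to only the requests matching that tag.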
Integration with Monitoring Tools
The most effective way to use these metrics is to integrate them with a dedicated monitoring system like Prometheus and visualize them with Grafana. For the planned migration to Google Cloud, these metrics can be seamlessly exported to Google Cloud Monitoring.
To integrate with Prometheus, you would add the following dependency and expose the prometheus endpoint:
```xml
<!-- pom.xml: Add for Prometheus integration -->
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>
```
```properties
# application.properties: Expose the Prometheus-formatted endpoint
management.endpoints.web.exposure.include=health,info,prometheus
```
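On the Prometheus side, a scrape job pointed at the Actuator endpoint might look like the following sketch (the job name, target host, and interval are placeholders to be adapted to the actual deployment):

```yaml
# prometheus.yml (illustrative scrape configuration)
scrape_configs:
  - job_name: it-ticketing-service
    metrics_path: /actuator/prometheus
    scrape_interval: 15s
    static_configs:
      - targets: ["localhost:8080"]
```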
Logging
Effective logging is critical for debugging and troubleshooting. For detailed troubleshooting steps, see the Troubleshooting Guide.
Application Logging Configuration
Log levels are configured in application.properties, allowing for granular control over log verbosity for different parts of the application.
```properties
# src/main/resources/application.properties
# Logging Configuration
logging.level.com.slalom.demo.ticketing=INFO
logging.level.org.springframework.amqp=INFO
logging.level.org.springframework.web=INFO
```

- `logging.level.com.slalom.demo.ticketing=INFO`: Sets the log level for our application's base package.
- `logging.level.org.springframework.amqp=INFO`: Reduces verbosity from the RabbitMQ client libraries.
- `logging.level.org.springframework.web=INFO`: Reduces verbosity from the Spring Web framework.
Log Levels
The standard log levels are TRACE, DEBUG, INFO, WARN, and ERROR. The current INFO level is suitable for production. For debugging, you can change the level to DEBUG either in the properties file or, if the /loggers endpoint is exposed, at runtime.
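If runtime changes are desired, the `loggers` endpoint must first be added to the exposure list (it is not exposed in our current configuration):

```properties
# Illustrative: expose the loggers endpoint for runtime log-level changes
management.endpoints.web.exposure.include=health,info,loggers
```

A level can then be changed without a restart by sending a POST to `/actuator/loggers/com.slalom.demo.ticketing` with the JSON body `{"configuredLevel": "DEBUG"}`. Like any management endpoint beyond `health` and `info`, this one should be secured before exposure.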
Log Aggregation Strategies
In a distributed environment, logs should be aggregated into a central location.
- Current/On-Premises: The application logs to standard output. A log forwarder like Fluentd or Logstash should be configured on the host VM to ship logs to a central store like Elasticsearch.
- Future/GCP: When deployed to Google Cloud Run or GKE, logs written to stdout and stderr are automatically collected by Google Cloud Logging, providing centralized viewing, searching, and alerting without any application changes.
Alerts and Notifications
Alerting is the responsibility of an external monitoring system, which consumes the health and metrics data exposed by the application.
Setting Up Alerts
An alerting system (e.g., Prometheus Alertmanager, Google Cloud Monitoring) should be configured to:
- Periodically poll the /actuator/health endpoint.
- Scrape metrics from the /actuator/prometheus endpoint.
- Define rules with specific thresholds against this data.
- Trigger notifications (e.g., email, Slack, PagerDuty) when a rule is breached.
Key Metrics to Monitor
The following metrics are critical for ensuring the health of the IT Ticketing Service:
| Metric Category | Key Metrics | Potential Issue |
|---|---|---|
| Application Health | `health` endpoint status | Application or dependency is down. |
| HTTP Performance | `http.server.requests` (count by status 5xx) | High server-side error rate. |
| HTTP Performance | `http.server.requests.seconds.max` / p95 | High request latency. |
| Database | `hikaricp.connections.active` (vs. max pool size) | Database connection pool exhaustion. |
| Database | `hikaricp.connections.pending` | High contention for database connections. |
| Messaging | `rabbitmq.messages.ready` (for `ticket.queue`) | Messages are not being consumed. |
| JVM | `jvm.memory.used.bytes` (vs. `jvm.memory.max.bytes`) | Potential memory leak. |
| JVM | `process.cpu.usage` | High CPU utilization, performance bottleneck. |
Performance Thresholds
Thresholds should be refined based on production baselines, but here are some recommended starting points for alerts:
- Health: Alert immediately if /actuator/health status is DOWN.
- Error Rate: Alert if the 5xx error rate exceeds 1% over a 5-minute window.
- Latency: Alert if the 95th percentile (p95) API response time exceeds 800ms.
- DB Connections: Alert if active database connections are at > 80% of the maximum pool size for more than 2 minutes.
- Queue Depth: Alert if the ticket.queue contains more than 500 unacknowledged messages for more than 5 minutes.
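For teams adopting Prometheus, the error-rate threshold above could be expressed as an alerting rule along these lines (metric and job names must match the actual scrape configuration; this is a sketch, not a tested rule from this project):

```yaml
# prometheus-rules.yml (illustrative alerting rule)
groups:
  - name: it-ticketing-service
    rules:
      - alert: HighErrorRate
        # Ratio of 5xx responses to all responses over a 5-minute window
        expr: |
          sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m]))
            / sum(rate(http_server_requests_seconds_count[5m])) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "5xx error rate above 1% for 5 minutes"
```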