Monitoring and Health Checks
This document provides a comprehensive overview of the monitoring and health check capabilities of the IT Ticketing Service. It is intended for developers and operations personnel responsible for maintaining the application's stability and performance. The application leverages the Spring Boot Actuator module to expose critical operational information.
For instructions on running the application to observe these endpoints, please refer to the Running Locally guide.
Spring Boot Actuator
The core of our monitoring strategy is built upon Spring Boot Actuator. This module adds several production-ready features to the application, primarily by exposing a set of HTTP endpoints for monitoring and management.
Actuator Endpoints Overview
Actuator provides numerous built-in endpoints. In this application, we have explicitly configured which endpoints are accessible over the web for security and simplicity. The configuration is managed in src/main/resources/application.properties:
```properties
# Actuator Configuration
management.endpoints.web.exposure.include=health,info
management.endpoint.health.show-details=always
```

- `management.endpoints.web.exposure.include`: A comma-separated list of Actuator endpoint IDs to expose over HTTP. We have enabled `health` and `info`.
- `management.endpoint.health.show-details`: Set to `always`, meaning the `/actuator/health` endpoint includes detailed information from all configured health indicators, even for unauthenticated requests.
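Since `info` is exposed, note that `/actuator/info` returns an empty JSON object until info contributors are configured. As an illustrative sketch (the `info.*` keys below are arbitrary examples, not settings from this project), properties such as the following would populate it:

```properties
# Illustrative: populate /actuator/info (keys under info.* are arbitrary examples)
# On recent Spring Boot versions the env info contributor must be enabled explicitly:
management.info.env.enabled=true
info.app.name=IT Ticketing Service
info.app.version=@project.version@
```

With Maven resource filtering (the default under `spring-boot-starter-parent`), the `@project.version@` placeholder is replaced at build time.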
Enabling Actuator Features
The Actuator dependency is included in the pom.xml, which enables these features automatically.
```xml
<!-- pom.xml -->
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
```
Security Considerations
By default, Spring Boot Actuator exposes only the /health and /info endpoints. Exposing other endpoints, such as /env, /configprops, or /heapdump, can leak sensitive configuration details and pose a significant security risk. Our current configuration adheres to this best practice. If there is a future need to expose more sensitive endpoints, they must be secured using Spring Security or placed behind a firewall accessible only to administrative users.
Health Check Endpoint
The health check endpoint is the primary mechanism for determining the application's operational status. It is crucial for automated monitoring, load balancer integration, and container orchestration.
GET /actuator/health
This endpoint aggregates the status of all registered HealthIndicator beans and provides a consolidated health status.
Request:
```bash
curl http://localhost:8080/actuator/health
```
Example Response (UP):
When all components are healthy, the service reports an overall status of UP. Due to management.endpoint.health.show-details=always, the response includes the status of individual components.
```json
{
  "status": "UP",
  "components": {
    "db": {
      "status": "UP",
      "details": {
        "database": "MySQL",
        "validationQuery": "isValid()"
      }
    },
    "diskSpace": {
      "status": "UP",
      "details": {
        "total": 499963174912,
        "free": 108813365248,
        "threshold": 10485760,
        "exists": true
      }
    },
    "ping": {
      "status": "UP"
    },
    "rabbit": {
      "status": "UP",
      "details": {
        "version": "3.12.12"
      }
    }
  }
}
```
If a critical dependency like the database or message broker is down, the component status will be DOWN, and the overall status will also be DOWN.
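For illustration, a response with the database unreachable might look like the following (the values are representative, not captured from a running instance). Note that Actuator also maps an overall `DOWN` status to an HTTP 503 response by default, which is what load balancers and orchestrators typically key on:

```json
{
  "status": "DOWN",
  "components": {
    "db": {
      "status": "DOWN",
      "details": {
        "error": "java.sql.SQLException: Connection refused"
      }
    },
    "diskSpace": { "status": "UP" },
    "ping": { "status": "UP" },
    "rabbit": { "status": "UP" }
  }
}
```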
Health Indicators
Spring Boot automatically configures health indicators for technologies found on the classpath. For this application, the key auto-configured indicators are:
| Indicator | Dependency Trigger | Description |
|---|---|---|
| `db` | `spring-boot-starter-data-jpa` | Checks the connection to the MySQL database by executing a validation query. |
| `rabbit` | `spring-boot-starter-amqp` | Checks the connection to the RabbitMQ broker. |
| `diskSpace` | (Default) | Checks for adequate free disk space on the volume where the app is running. |
| `ping` | (Default) | A trivial indicator that always returns UP, confirming the app is responsive. |
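The `diskSpace` indicator's default 10 MB threshold is quite low for a production workload and can be raised via configuration. A minimal sketch (the `100MB` value is an arbitrary example, not the project's current setting):

```properties
# Illustrative: raise the free-space threshold for the diskSpace indicator
management.health.diskspace.threshold=100MB
```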
Custom Health Checks
While auto-configured indicators cover our main dependencies, custom health checks can be created to monitor other external services or internal application states. This is achieved by creating a Spring bean that implements the HealthIndicator interface.
Example (Hypothetical):
```java
// This is an example and not present in the current codebase.
// It demonstrates how to add a check for a fictional external API.
import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.stereotype.Component;

@Component
public class ExternalApiServiceHealthIndicator implements HealthIndicator {

    @Override
    public Health health() {
        try {
            // Logic to check the status of the external service,
            // e.g., make a HEAD request to a status endpoint
            int statusCode = checkExternalService();
            if (statusCode == 200) {
                return Health.up().withDetail("service", "Available").build();
            }
            return Health.down().withDetail("service", "Status code: " + statusCode).build();
        } catch (Exception ex) {
            return Health.down(ex).build();
        }
    }

    private int checkExternalService() {
        // ... implementation ...
        return 200;
    }
}
```
Readiness and Liveness Probes
For containerized environments like Kubernetes, Actuator provides distinct liveness (/actuator/health/liveness) and readiness (/actuator/health/readiness) probes.
- Liveness Probe: Indicates if the application is running. A failure suggests a fatal, unrecoverable error, and the container should be restarted.
- Readiness Probe: Indicates if the application is ready to accept new traffic. A failure might occur during startup while dependencies are being initialized, or if the application is temporarily overloaded. A failing readiness probe will cause the container to be removed from the load balancer's pool without being restarted.
These probes can be enabled and configured in application.properties if needed (e.g., management.health.probes.enabled=true).
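As an illustration, once the probes are enabled, a Kubernetes deployment could wire them to the Actuator paths as follows (the port and timing values are placeholders, not settings from this project):

```yaml
# Illustrative Kubernetes probe configuration for the application container
livenessProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /actuator/health/readiness
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 5
```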
Metrics and Monitoring
Actuator uses Micrometer to collect application metrics. While the /actuator/metrics endpoint is not exposed by default in our configuration, the underlying metrics are still collected.
Available Metrics
Micrometer provides a rich set of metrics out-of-the-box, including:
- JVM Metrics: Memory usage, garbage collection, thread utilization.
- System Metrics: CPU usage.
- Web Server Metrics: Request latency, error rates for the embedded Tomcat server.
- Data Source Metrics: Active, idle, and pending connections for the Hikari connection pool.
- RabbitMQ Metrics: Connection and channel metrics.
Metrics Endpoints
To view available metrics, you would first need to expose the metrics endpoint:
```properties
# WARNING: Exposing metrics can reveal internal operational data.
# Ensure this is only accessible to trusted monitoring systems.
management.endpoints.web.exposure.include=health,info,metrics
```
Once exposed, you can list all available metric names: GET /actuator/metrics
And query a specific metric: GET /actuator/metrics/http.server.requests
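A response for a specific metric follows Micrometer's standard shape of measurements plus available tags for drill-down; for example (the values and URIs below are illustrative):

```json
{
  "name": "http.server.requests",
  "baseUnit": "seconds",
  "measurements": [
    { "statistic": "COUNT", "value": 1024 },
    { "statistic": "TOTAL_TIME", "value": 52.7 },
    { "statistic": "MAX", "value": 0.45 }
  ],
  "availableTags": [
    { "tag": "status", "values": ["200", "404", "500"] },
    { "tag": "uri", "values": ["/api/tickets", "/actuator/health"] }
  ]
}
```

Appending `?tag=status:500` to the URL drills down to only the requests matching that tag.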
Integration with Monitoring Tools
The most effective way to use these metrics is to integrate them with a dedicated monitoring system like Prometheus and visualize them with Grafana. For the planned migration to Google Cloud, these metrics can be seamlessly exported to Google Cloud Monitoring.
To integrate with Prometheus, you would add the following dependency and expose the prometheus endpoint:
```xml
<!-- pom.xml: Add for Prometheus integration -->
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>
```
```properties
# application.properties: Expose the Prometheus-formatted endpoint
management.endpoints.web.exposure.include=health,info,prometheus
```
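On the Prometheus side, a scrape job pointed at the Actuator endpoint might look like the following sketch (the job name, target host, and interval are placeholders to be adapted to the actual deployment):

```yaml
# prometheus.yml (illustrative scrape configuration)
scrape_configs:
  - job_name: it-ticketing-service
    metrics_path: /actuator/prometheus
    scrape_interval: 15s
    static_configs:
      - targets: ["localhost:8080"]
```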
Logging
Effective logging is critical for debugging and troubleshooting. For detailed troubleshooting steps, see the Troubleshooting Guide.
Application Logging Configuration
Log levels are configured in application.properties, allowing for granular control over log verbosity for different parts of the application.
```properties
# src/main/resources/application.properties
# Logging Configuration
logging.level.com.slalom.demo.ticketing=INFO
logging.level.org.springframework.amqp=INFO
logging.level.org.springframework.web=INFO
```

- `logging.level.com.slalom.demo.ticketing=INFO`: Sets the log level for our application's base package.
- `logging.level.org.springframework.amqp=INFO`: Reduces verbosity from the RabbitMQ client libraries.
- `logging.level.org.springframework.web=INFO`: Reduces verbosity from the Spring Web framework.
Log Levels
The standard log levels are TRACE, DEBUG, INFO, WARN, and ERROR. The current INFO level is suitable for production. For debugging, you can change the level to DEBUG either in the properties file or, if the /loggers endpoint is exposed, at runtime.
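If runtime changes are desired, the `loggers` endpoint must first be added to the exposure list (it is not exposed in our current configuration):

```properties
# Illustrative: expose the loggers endpoint for runtime log-level changes
management.endpoints.web.exposure.include=health,info,loggers
```

A level can then be changed without a restart by sending a POST to `/actuator/loggers/com.slalom.demo.ticketing` with the JSON body `{"configuredLevel": "DEBUG"}`. Like any management endpoint beyond `health` and `info`, this one should be secured before exposure.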
Log Aggregation Strategies
In a distributed environment, logs should be aggregated into a central location.
- Current/On-Premises: The application logs to standard output. A log forwarder like Fluentd or Logstash should be configured on the host VM to ship logs to a central store like Elasticsearch.
- Future/GCP: When deployed to Google Cloud Run or GKE, logs written to stdout and stderr are automatically collected by Google Cloud Logging, providing centralized viewing, searching, and alerting without any application changes.
Alerts and Notifications
Alerting is the responsibility of an external monitoring system, which consumes the health and metrics data exposed by the application.
Setting Up Alerts
An alerting system (e.g., Prometheus Alertmanager, Google Cloud Monitoring) should be configured to:
- Periodically poll the /actuator/health endpoint.
- Scrape metrics from the /actuator/prometheus endpoint.
- Define rules with specific thresholds against this data.
- Trigger notifications (e.g., email, Slack, PagerDuty) when a rule is breached.
Key Metrics to Monitor
The following metrics are critical for ensuring the health of the IT Ticketing Service:
| Metric Category | Key Metrics | Potential Issue |
|---|---|---|
| Application Health | `health` endpoint status | Application or dependency is down. |
| HTTP Performance | `http.server.requests` (count by status 5xx) | High server-side error rate. |
| HTTP Performance | `http.server.requests.seconds.max` / p95 | High request latency. |
| Database | `hikaricp.connections.active` (vs. max pool size) | Database connection pool exhaustion. |
| Database | `hikaricp.connections.pending` | High contention for database connections. |
| Messaging | `rabbitmq.messages.ready` (for `ticket.queue`) | Messages are not being consumed. |
| JVM | `jvm.memory.used.bytes` (vs. `jvm.memory.max.bytes`) | Potential memory leak. |
| JVM | `process.cpu.usage` | High CPU utilization, performance bottleneck. |
Performance Thresholds
Thresholds should be refined based on production baselines, but here are some recommended starting points for alerts:
- Health: Alert immediately if /actuator/health status is DOWN.
- Error Rate: Alert if the 5xx error rate exceeds 1% over a 5-minute window.
- Latency: Alert if the 95th percentile (p95) API response time exceeds 800ms.
- DB Connections: Alert if active database connections are at > 80% of the maximum pool size for more than 2 minutes.
- Queue Depth: Alert if the ticket.queue contains more than 500 unacknowledged messages for more than 5 minutes.
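For teams adopting Prometheus, the error-rate threshold above could be expressed as an alerting rule along these lines (metric and job names must match the actual scrape configuration; this is a sketch, not a tested rule from this project):

```yaml
# prometheus-rules.yml (illustrative alerting rule)
groups:
  - name: it-ticketing-service
    rules:
      - alert: HighErrorRate
        # Ratio of 5xx responses to all responses over a 5-minute window
        expr: |
          sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m]))
            / sum(rate(http_server_requests_seconds_count[5m])) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "5xx error rate above 1% for 5 minutes"
```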