Monitoring

1. Introduction

Monitoring is a crucial aspect of maintaining the health and performance of software systems. It involves observing the behavior of a system and its components over time to ensure they are functioning as expected. Monitoring provides insights into the system’s performance, helping developers identify and resolve issues before they escalate into critical problems.

In the context of software development, monitoring often involves the collection of three types of data: metrics, logs, and traces. These are collectively known as the three pillars of observability, providing a comprehensive view of a software system’s behavior.

Metrics are numerical values that represent some aspect of a system at a particular point in time. They are typically used to track resource usage, request rates, error rates, and other quantifiable aspects of system performance.
Traces provide a detailed view of the path of a request as it is processed by a system, particularly in a distributed or microservices architecture. They allow developers to see how a request is processed through multiple services, and how much time is spent in each service, helping to identify bottlenecks and performance issues.
Logs are records of events that a system produces while it is running. They can provide valuable insights into the behavior of the system and can be used to diagnose and troubleshoot issues.

OpenTelemetry is a collection of APIs, SDKs, agents, and instrumentation libraries that provides a single, vendor-neutral way to collect telemetry data from applications, regardless of the language or platform they run on. It includes components for collecting and exporting metrics, logs, and traces, and it can export this data to a wide range of backends for storage and visualization.

2. Monitoring Infrastructure

Sharemind HI’s monitoring infrastructure consists of several services built on OpenTelemetry and other tools. Here’s a brief overview of each service:

OTel-Collector: This is the OpenTelemetry Collector, which receives, processes, and exports telemetry data. It uses the configuration file otel-collector-config.yaml to determine how to process and export the data.
Jaeger: This is a distributed tracing system. It receives trace data from the OpenTelemetry Collector and provides a user interface for viewing and analyzing the traces.
Prometheus: This is a monitoring system that collects and stores metrics data. It uses the configuration file prometheus.yaml to determine which metrics to collect and how to store them.
Grafana: This is a visualization tool that can display data from various sources, including Prometheus and Jaeger. It uses the configuration files in the provisioning directory to determine what data to display and how to visualize it.
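
To make the collector's role more concrete, the following is a minimal sketch of what an otel-collector-config.yaml for this setup might look like. It is an assumption based on the ports and service names used in this section, not the file shipped with the stack: it receives OTLP over HTTP on port 4318, forwards traces to Jaeger over OTLP gRPC on port 4317, and exposes metrics for Prometheus on port 8889. The exporter name otlp/jaeger and the use of the batch processor are illustrative choices.

```yaml
# Hypothetical sketch; the actual otel-collector-config.yaml may differ.
receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318   # OTLP HTTP receiver

processors:
  batch:                         # group telemetry into batches before export

exporters:
  otlp/jaeger:
    endpoint: jaeger:4317        # Jaeger's OTLP gRPC endpoint
    tls:
      insecure: true             # plain text inside the Docker network
  prometheus:
    endpoint: 0.0.0.0:8889       # scraped by Prometheus

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
```

With a layout like this, traces and metrics travel through separate pipelines, so either backend can be swapped out without touching the other.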

By defining these services in a Docker Compose file, one can easily start, stop, and manage the monitoring infrastructure for the system. This is particularly useful in a development environment, where services are started and stopped frequently, and in a production environment, where the monitoring infrastructure must always be available and up to date.

docker-compose.yaml:

services:

  # Collector
  otel-collector:
    container_name: otel-collector
    image: otel/opentelemetry-collector:0.83.0
    restart: unless-stopped
    ports:
      - "4318:4318"   # OTLP HTTP receiver
      - "8889:8889"   # Prometheus exporter metrics
    volumes:
      - ./config/otel-collector-config.yaml:/etc/otel-collector-config.yaml
    command: ["--config=/etc/otel-collector-config.yaml", "${OTELCOL_ARGS}"]

  # Jaeger - Traces
  jaeger:
    container_name: jaeger
    image: jaegertracing/all-in-one:1.48
    restart: unless-stopped
    ports:
      - "4317" # OTLP gRPC receiver
      - "16686:16686" # API and Frontend
    environment:
      - COLLECTOR_OTLP_ENABLED=true

  # Prometheus - Metrics
  prometheus:
    container_name: prometheus
    image: prom/prometheus:v2.46.0
    restart: unless-stopped
    ports:
      - "8889" # Exporter metrics
      - "9090:9090" # API and Frontend
    volumes:
      - ./config/prometheus.yaml:/etc/prometheus/prometheus.yml

  # Grafana - Dashboards
  grafana:
    container_name: grafana
    image: grafana/grafana:10.0.4
    restart: unless-stopped
    ports:
      - "3000:3000" # Frontend
    volumes:
      - grafana-storage:/var/lib/grafana
      - ./provisioning:/etc/grafana/provisioning

volumes:
  grafana-storage:

To run the above stack:

docker-compose up -d

Once the monitoring infrastructure is running, Sharemind HI can be pointed at it with, among others, the following configuration options:

MeterFactoryType: OTLP_FACTORY
MeterFactoryConfiguration.ExportAddress: http://YOUR_IP_ADDRESS:4318/v1/metrics

TracerFactoryType: OTLP_FACTORY
TracerFactoryConfiguration.ExportAddress: http://YOUR_IP_ADDRESS:4318/v1/traces

The whole stack configuration, including the docker-compose file, services' configuration files, and additional documentation can be downloaded from: monitoring-stack.tar.xz.
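
For orientation, a prometheus.yaml for this stack could be as small as the sketch below. This is an assumption, not the file from the archive: it only tells Prometheus to scrape the collector's Prometheus exporter on port 8889, using the otel-collector service name from the compose file. The job name and scrape interval are illustrative.

```yaml
# Hypothetical sketch; the actual prometheus.yaml may differ.
global:
  scrape_interval: 15s             # how often targets are scraped

scrape_configs:
  - job_name: otel-collector       # illustrative job name
    static_configs:
      - targets: ["otel-collector:8889"]   # collector's Prometheus exporter
```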

3. Grafana Dashboard

The Grafana frontend can be accessed via: http://YOUR_IP_ADDRESS:3000.

An example Grafana dashboard is provided in the monitoring-stack.tar.xz/provisioning/dashboards/test-dashboard.json file. It can be imported into Grafana manually if it has not already been imported automatically.

Also remember to set up the data sources under "Connections > Data Sources".

Jaeger should be connected via http://jaeger:16686, and Prometheus via http://prometheus:9090.
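
If the data sources are to be provisioned automatically rather than added by hand, Grafana can read them from a file in its provisioning directory. The following is a minimal sketch under the assumption that the file is placed at provisioning/datasources/datasources.yaml (a hypothetical name and location; the archive may organize this differently). It uses the two URLs given above.

```yaml
# Hypothetical sketch of a Grafana data source provisioning file.
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy                  # Grafana's backend performs the queries
    url: http://prometheus:9090
    isDefault: true
  - name: Jaeger
    type: jaeger
    access: proxy
    url: http://jaeger:16686
```

The host names resolve only inside the Docker Compose network, which is why they use the service names rather than YOUR_IP_ADDRESS.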