
Observability and Monitoring

Module Structure

The observability module lives in src/observability/, matching Chive's layout:

| File | Responsibility |
| --- | --- |
| logger.ts | PinoLogger class with structured JSON output and PII redaction |
| telemetry.ts | OpenTelemetry SDK initialization (initTelemetry) |
| tracer.ts | Trace context helpers (get active span, inject context) |
| metrics-exporter.ts | Prometheus metric definitions and registration |
| prometheus-registry.ts | Centralized metric registry |
| freshness-metrics.ts | Dedicated staleness/freshness tracking metrics |
| index.ts | Barrel exports |

Logging

Pino Configuration

The appview uses Pino for structured JSON logging. Every log entry includes a request ID, timestamp, and OpenTelemetry trace/span IDs for correlation. The redaction list covers common sensitive field patterns, following Chive's comprehensive approach:

import pino from 'pino';
import { trace } from '@opentelemetry/api';

const logger = pino({
  level: config.logLevel, // default: 'info'
  formatters: {
    level: (label) => ({ level: label }),
  },
  redact: [
    'req.headers.authorization', 'req.headers.cookie',
    '*.password', '*.token', '*.apiKey', '*.apikey',
    '*.secret', '*.credential', '*.accessToken',
    '*.refreshToken', '*.privateKey',
  ],
  // Attach the active trace/span IDs to every log line for correlation.
  mixin() {
    const span = trace.getActiveSpan();
    if (span) {
      const ctx = span.spanContext();
      return { traceId: ctx.traceId, spanId: ctx.spanId };
    }
    return {};
  },
});

Log Levels

| Level | Usage |
| --- | --- |
| error | Unrecoverable failures (database connection lost, DLQ entry created) |
| warn | Recoverable issues (validation failure, retry triggered, cache miss on expected key) |
| info | Operational events (server started, firehose connected, job completed) |
| debug | Detailed tracing (record processing steps, query plans, cache operations) |
| trace | Per-record field-level detail (only in development) |

Request Context

Every HTTP request gets a unique requestId (UUID v4) injected by the request context middleware. This ID appears in all log entries, error responses, and distributed traces for the request's lifetime. Child loggers inherit the request context automatically.
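A minimal sketch of the ID-minting part (framework-agnostic; the helper name and the choice to honor an upstream ID are illustrative assumptions, and the real middleware also binds the request-scoped child logger):

```typescript
import { randomUUID } from 'node:crypto';

// Sketch: request-context creation. In the real middleware this runs
// per HTTP request and the resulting context is attached to the request
// and to a child logger; the upstream-ID parameter is an assumption.
interface RequestContext {
  requestId: string;
}

function createRequestContext(incomingId?: string): RequestContext {
  // Reuse an upstream-provided ID if present, otherwise mint a UUID v4.
  return { requestId: incomingId ?? randomUUID() };
}
```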

Distributed Tracing

OpenTelemetry Setup

The appview uses the stable OpenTelemetry 1.x SDK to instrument all I/O boundaries. The OTel Logs bridge can additionally route Pino logs through the OTel Collector for unified observability; Grafana Alloy is the recommended collector, superseding the legacy Grafana Agent. The following instrumentations are in place:

| Instrumentation | Library | Traces |
| --- | --- | --- |
| HTTP server | @opentelemetry/instrumentation-http | Incoming request spans |
| PostgreSQL | @opentelemetry/instrumentation-pg | Query spans with SQL text |
| Redis | @opentelemetry/instrumentation-redis | Command spans |
| Elasticsearch | Custom instrumentation | Search/index spans |
| Neo4j | Custom instrumentation | Cypher query spans |
| BullMQ | Custom instrumentation | Job processing spans |

Traces are exported via OTLP over HTTP to a collector (Jaeger in development, Grafana Tempo in production). Business metrics (view counts, search clicks, annotation activity) are recorded through a dedicated MetricsService class in src/services/metrics/, which provides a clean recording API.

Span Hierarchy

A typical annotation search request produces the following span tree:

HTTP GET /api/v1/annotations?kind=span&subkind=ner
├── redis.get (cache check)
├── elasticsearch.search (faceted query)
│ └── elasticsearch.msearch (nested aggregation)
├── postgresql.query (cross-reference lookup)
└── redis.set (cache result)

Sampling Strategy

| Environment | Sampling Rate | Rationale |
| --- | --- | --- |
| Development | 100% | Full visibility during debugging |
| Staging | 50% | High visibility with moderate overhead |
| Production | 10% | Sufficient for debugging, minimal overhead |

Error traces are always captured regardless of the sampling rate, via tail-based sampling in the collector.
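To make the ratio-based part concrete, here is a toy deterministic sampler keyed on the trace ID, in the spirit of OpenTelemetry's TraceIdRatioBasedSampler (the error-retention branch is a simplification; real error retention happens tail-based in the collector, not at the SDK):

```typescript
// Toy head sampler: hash-free, deterministic ratio sampling by trace ID.
// The first 8 hex chars of the trace ID are read as a fraction of 2^32;
// sample iff that fraction falls below the configured rate. The isError
// shortcut stands in for the collector's tail-based error retention.
function shouldSample(traceId: string, rate: number, isError = false): boolean {
  if (isError) return true; // error traces are always captured
  const fraction = parseInt(traceId.slice(0, 8), 16) / 0x100000000;
  return fraction < rate;
}
```

Because the decision is a pure function of the trace ID, every service in a distributed trace reaches the same verdict without coordination.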

Metrics

Prometheus Exposition

The appview exposes Prometheus metrics at GET /metrics using prom-client.

Application Metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| layers_http_request_duration_seconds | Histogram | method, route, status | Request latency distribution |
| layers_http_requests_total | Counter | method, route, status | Total request count |
| layers_db_query_duration_seconds | Histogram | database, operation | Database query latency |
| layers_db_connections_active | Gauge | database | Active connection count per pool |
| layers_cache_hits_total | Counter | cache | Redis cache hits |
| layers_cache_misses_total | Counter | cache | Redis cache misses |
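To illustrate how a latency histogram like layers_http_request_duration_seconds is accumulated, here is a toy cumulative-bucket recorder (prom-client does this for real; the bucket boundaries below are illustrative, not the module's actual configuration):

```typescript
// Toy Prometheus-style histogram: buckets are cumulative, and every
// observation also feeds the _sum and _count series. prom-client
// implements this for real; boundaries here are illustrative only.
class ToyHistogram {
  readonly counts: number[];
  sum = 0;
  count = 0;

  constructor(readonly buckets: number[]) {
    this.counts = buckets.map(() => 0);
  }

  observe(value: number): void {
    this.sum += value;
    this.count += 1;
    // An observation increments every bucket whose upper bound covers it.
    this.buckets.forEach((le, i) => {
      if (value <= le) this.counts[i] += 1;
    });
  }
}
```

The cumulative counts are what make server-side quantile estimation (p50/p95/p99 in the dashboards) possible via histogram_quantile.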

Firehose Metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| layers_firehose_cursor_lag_seconds | Gauge | — | Time behind the relay |
| layers_firehose_events_processed_total | Counter | collection | Events indexed per collection |
| layers_firehose_queue_depth | Gauge | queue | Pending jobs per queue |
| layers_firehose_dlq_entries | Gauge | — | Current DLQ size |
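The cursor-lag gauge reduces to comparing the clock against the last processed event's time. A sketch, assuming the event timestamp is available in milliseconds since the epoch (the function name and signature are illustrative):

```typescript
// Sketch: cursor lag in seconds, from the timestamp (ms since epoch) of
// the last firehose event processed. An exporter would periodically set
// layers_firehose_cursor_lag_seconds from a value computed like this.
function cursorLagSeconds(lastEventTimeMs: number, nowMs: number = Date.now()): number {
  // Clamp at zero so clock skew never reports negative lag.
  return Math.max(0, (nowMs - lastEventTimeMs) / 1000);
}
```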

Business Metrics

These metrics are specific to Layers' annotation platform and have no equivalent in Chive:

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| layers_records_indexed_total | Counter | type | Total records indexed per record type |
| layers_annotation_layers_per_expression | Histogram | — | Distribution of annotation layers per expression |
| layers_cross_references_total | Counter | ref_type | Cross-references indexed per type |
| layers_knowledge_refs_total | Counter | source | Knowledge base references per source (Wikidata, WordNet, etc.) |
| layers_corpora_total | Gauge | — | Number of indexed corpora |
| layers_active_personas_total | Gauge | — | Number of distinct annotator personas |

Dashboards

Grafana Dashboard Suite

The appview ships with provisioned Grafana dashboards (JSON models stored in the repository).

System Overview Dashboard

  • Request rate and latency (p50, p95, p99)
  • Error rate by endpoint
  • Database connection pool utilization (PG, ES, Neo4j, Redis)
  • Memory and CPU usage

Firehose Ingestion Dashboard

  • Cursor lag (real-time gauge)
  • Events per second (stacked by collection)
  • Queue depth per queue
  • DLQ inflow rate
  • Validation failure rate by stage

Database Performance Dashboard

  • PostgreSQL query latency by operation (SELECT, INSERT, UPDATE, DELETE)
  • Elasticsearch search latency and indexing throughput
  • Neo4j query latency and active transactions
  • Redis command latency and memory usage

Business Metrics Dashboard

  • Records indexed over time (stacked by type)
  • Annotation density (layers per expression, histogram)
  • Cross-reference density (refs per record, histogram)
  • Knowledge grounding coverage (percentage of annotations with knowledgeRefs)
  • Corpus growth over time

Alerting

Alert Rules

| Alert | Condition | Severity | Action |
| --- | --- | --- | --- |
| Firehose lag high | cursor_lag_seconds > 60 for 5 min | Warning | Check queue depth and worker health |
| Firehose lag critical | cursor_lag_seconds > 300 for 5 min | Critical | Immediate investigation |
| DLQ growing | dlq_entries > 100 | Warning | Review DLQ entries for systematic failures |
| Error rate spike | 5xx rate > 1% for 5 min | Warning | Check logs and traces |
| Database connection exhaustion | connections_active > 80% of pool | Warning | Scale pool or investigate slow queries |
| Disk usage high | disk_usage > 80% | Warning | Plan storage expansion |
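The "for 5 min" clauses mean an alert fires only once the condition has held continuously for the whole window. A toy evaluator of that pending-to-firing behavior (Grafana Alerting/Prometheus implement this for real; the sample shape is illustrative):

```typescript
// Toy "for"-duration alert evaluation: track when a breach began, reset
// on any healthy sample, and fire only after the breach has persisted
// for the full window. Samples are (timestamp-seconds, value), ascending.
function alertFires(
  samples: { t: number; value: number }[],
  threshold: number,
  forSeconds: number,
): boolean {
  let breachStart: number | null = null;
  for (const { t, value } of samples) {
    if (value > threshold) {
      breachStart = breachStart ?? t; // first sample of this breach
      if (t - breachStart >= forSeconds) return true;
    } else {
      breachStart = null; // a single healthy sample resets the clock
    }
  }
  return false;
}
```

This is why a brief lag spike does not page anyone: one healthy scrape resets the pending timer.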

Escalation

Alerts are routed through Grafana Alerting to the on-call channel (Slack, PagerDuty, or email, configurable per environment).

Health Checks

Liveness Probe

GET /health returns 200 OK if the process is running. Does not check downstream dependencies.

Readiness Probe

GET /ready returns 200 OK only if all four databases are reachable and the firehose consumer is connected. Returns 503 Service Unavailable otherwise, with a JSON body listing which dependencies are unhealthy.

{
  "status": "unhealthy",
  "dependencies": {
    "postgresql": "ok",
    "elasticsearch": "ok",
    "neo4j": "timeout",
    "redis": "ok",
    "firehose": "ok"
  }
}
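A minimal sketch of how such a response can be aggregated: run each dependency check under a timeout and collect per-dependency status (checker names and signatures are illustrative; the real probe pings PostgreSQL, Elasticsearch, Neo4j, Redis, and the firehose consumer):

```typescript
// Sketch: readiness aggregation. Each dependency check is raced against
// a timeout; failures report the error message ("timeout" etc.) so the
// 503 body can say exactly which dependency is unhealthy.
type Check = () => Promise<void>;

async function checkWithTimeout(check: Check, timeoutMs: number): Promise<string> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error('timeout')), timeoutMs);
  });
  try {
    await Promise.race([check(), timeout]);
    return 'ok';
  } catch (err) {
    return err instanceof Error ? err.message : 'error';
  } finally {
    clearTimeout(timer);
  }
}

async function readiness(checks: Record<string, Check>, timeoutMs = 1000) {
  const entries = await Promise.all(
    Object.entries(checks).map(
      async ([name, check]) => [name, await checkWithTimeout(check, timeoutMs)] as const,
    ),
  );
  const dependencies = Object.fromEntries(entries);
  const healthy = Object.values(dependencies).every((s) => s === 'ok');
  return { status: healthy ? 'healthy' : 'unhealthy', dependencies };
}
```

The handler would return 200 when status is "healthy" and 503 with this body otherwise.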

See Also