mirror of https://github.com/langgenius/dify.git synced 2026-03-06 08:06:37 +08:00

Files

GareArc ff877ee39c fix(telemetry): add resolved_trace_id property to eliminate trace_id inconsistencies

Add computed property to BaseTraceInfo that provides intelligent fallback:
1. External trace_id (from X-Trace-Id header)
2. workflow_run_id (for workflow-related traces)
3. message_id (as final fallback)

This ensures attribute dify.trace_id always matches log-level trace_id,
eliminating inconsistencies where attribute was null but log-level had value.

Changes:
- Add resolved_trace_id property to BaseTraceInfo (trace_entity.py)
- Replace 4 direct trace_id attribute assignments with resolved_trace_id
- Add trace_id_source parameter to 5 emit_metric_only_event calls

Fixes trace_id inconsistency found in MESSAGE_RUN, TOOL_EXECUTION,
MODERATION_CHECK, SUGGESTED_QUESTION_GENERATION, GENERATE_NAME_EXECUTION,
DATASET_RETRIEVAL, and PROMPT_GENERATION_EXECUTION events.

All 78 telemetry tests passing.

2026-02-28 20:32:15 -08:00

entities

feat(telemetry): unify token metric label structure with Pydantic enforcement

2026-02-06 03:10:20 -08:00

__init__.py

feat(telemetry): add enterprise OTEL telemetry with gateway, traces, metrics, and logs

2026-02-05 23:10:30 -08:00

contracts.py

feat(telemetry): add enterprise OTEL telemetry with gateway, traces, metrics, and logs

2026-02-05 23:10:30 -08:00

DATA_DICTIONARY.md

feat(telemetry): add model provider and name tags to all trace metrics

2026-02-28 00:06:44 -08:00

draft_trace.py

feat(telemetry): add enterprise OTEL telemetry with gateway, traces, metrics, and logs

2026-02-05 23:10:30 -08:00

enterprise_trace.py

fix(telemetry): add resolved_trace_id property to eliminate trace_id inconsistencies

2026-02-28 20:32:15 -08:00

event_handlers.py

refactor(telemetry): move gateway to core as stateless module-level functions

2026-02-28 19:27:24 -08:00

exporter.py

feat(enterprise-telemetry): wire bearer token auth and configurable insecure flag into OTEL exporter

2026-02-09 01:44:21 -08:00

id_generator.py

feat(telemetry): add enterprise OTEL telemetry with gateway, traces, metrics, and logs

2026-02-05 23:10:30 -08:00

metric_handler.py

feat(telemetry): add missing ID fields for name attributes

2026-02-10 00:09:41 -08:00

README.md

docs(enterprise): split telemetry docs into README and data dictionary

2026-02-27 12:32:48 -08:00

telemetry_log.py

feat: add dedicated app event counters and convert event names to StrEnum

2026-02-06 02:38:19 -08:00

README.md

Dify Enterprise Telemetry

This document provides an overview of the Dify Enterprise OpenTelemetry (OTEL) exporter and how to configure it for integration with observability stacks like Prometheus, Grafana, Jaeger, or Honeycomb.

Overview

Dify Enterprise uses a "slim span + rich companion log" architecture to provide high-fidelity observability without overwhelming trace storage.

Traces (Spans): Capture the structure, identity, and timing of high-level operations (Workflows and Nodes).
Structured Logs: Provide deep context (inputs, outputs, metadata) for every event, correlated to spans via trace_id and span_id.
Metrics: Provide unsampled counters and histograms for usage, performance, and error tracking. Unlike traces, metrics are not subject to the sampling rate.

Signal Architecture

graph TD
    A[Workflow Run] -->|Span| B(dify.workflow.run)
    A -->|Log| C(dify.workflow.run detail)
    B ---|trace_id| C
    
    D[Node Execution] -->|Span| E(dify.node.execution)
    D -->|Log| F(dify.node.execution detail)
    E ---|span_id| F
    
    G[Message/Tool/etc] -->|Log| H(dify.* event)
    G -->|Metric| I(dify.* counter/histogram)

Configuration

The Enterprise OTEL exporter is configured via environment variables.

Variable	Description	Default
`ENTERPRISE_ENABLED`	Master switch for all enterprise features.	`false`
`ENTERPRISE_TELEMETRY_ENABLED`	Master switch for enterprise telemetry.	`false`
`ENTERPRISE_OTLP_ENDPOINT`	OTLP collector endpoint (e.g., `http://otel-collector:4318`).	-
`ENTERPRISE_OTLP_HEADERS`	Custom headers for OTLP requests (e.g., `x-scope-orgid=tenant1`).	-
`ENTERPRISE_OTLP_PROTOCOL`	OTLP transport protocol (`http` or `grpc`).	`http`
`ENTERPRISE_OTLP_API_KEY`	Bearer token for authentication.	-
`ENTERPRISE_INCLUDE_CONTENT`	Whether to include sensitive content (inputs/outputs) in logs.	`true`
`ENTERPRISE_SERVICE_NAME`	Service name reported to OTEL.	`dify`
`ENTERPRISE_OTEL_SAMPLING_RATE`	Sampling rate for traces (0.0 to 1.0). Metrics are never sampled.	`1.0`
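
As a rough illustration of how these variables and their defaults fit together, the sketch below reads them into a plain settings object. It is not Dify's actual configuration code: `TelemetrySettings`, `load_settings`, and the helper functions are hypothetical names. It assumes that telemetry is active only when both switches are on, that boolean values are spelled like `true`/`false`, and that `ENTERPRISE_OTLP_HEADERS` holds comma-separated `key=value` pairs.

```python
import os
from dataclasses import dataclass
from typing import Dict, Mapping, Optional


@dataclass(frozen=True)
class TelemetrySettings:
    """Hypothetical container for the variables in the table above."""
    enabled: bool
    endpoint: Optional[str]
    headers: Dict[str, str]
    protocol: str
    api_key: Optional[str]
    include_content: bool
    service_name: str
    sampling_rate: float


def _as_bool(value: Optional[str], default: bool) -> bool:
    # Assumption: common truthy spellings; unset means "use the default".
    if value is None:
        return default
    return value.strip().lower() in {"1", "true", "yes", "on"}


def _parse_headers(raw: str) -> Dict[str, str]:
    # Assumption: comma-separated key=value pairs, e.g. "x-scope-orgid=tenant1".
    headers: Dict[str, str] = {}
    for pair in raw.split(","):
        key, sep, value = pair.partition("=")
        if sep and key.strip():
            headers[key.strip()] = value.strip()
    return headers


def load_settings(env: Optional[Mapping[str, str]] = None) -> TelemetrySettings:
    env = os.environ if env is None else env
    rate = float(env.get("ENTERPRISE_OTEL_SAMPLING_RATE", "1.0"))
    if not 0.0 <= rate <= 1.0:
        raise ValueError("ENTERPRISE_OTEL_SAMPLING_RATE must be between 0.0 and 1.0")
    protocol = env.get("ENTERPRISE_OTLP_PROTOCOL", "http")
    if protocol not in {"http", "grpc"}:
        raise ValueError("ENTERPRISE_OTLP_PROTOCOL must be 'http' or 'grpc'")
    return TelemetrySettings(
        # Assumption: both switches must be on for telemetry to run.
        enabled=_as_bool(env.get("ENTERPRISE_ENABLED"), False)
        and _as_bool(env.get("ENTERPRISE_TELEMETRY_ENABLED"), False),
        endpoint=env.get("ENTERPRISE_OTLP_ENDPOINT"),
        headers=_parse_headers(env.get("ENTERPRISE_OTLP_HEADERS", "")),
        protocol=protocol,
        api_key=env.get("ENTERPRISE_OTLP_API_KEY"),
        include_content=_as_bool(env.get("ENTERPRISE_INCLUDE_CONTENT"), True),
        service_name=env.get("ENTERPRISE_SERVICE_NAME", "dify"),
        sampling_rate=rate,
    )
```

Calling `load_settings({})` yields the defaults listed in the table, with telemetry disabled.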

Correlation Model

Dify uses deterministic ID generation to ensure signals are correlated across different services and asynchronous tasks.

ID Generation Rules

trace_id: Derived from the correlation ID (workflow_run_id or node_execution_id for drafts) using int(UUID(correlation_id))
span_id: Derived from the source ID using SHA256(source_id)[:8]
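
The two rules above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the contents of `id_generator.py`: the function names are hypothetical, `[:8]` is read as the first 8 bytes of the SHA-256 digest (matching the 64-bit width of an OTEL span ID), and the big-endian byte order is a guess.

```python
import hashlib
from uuid import UUID


def trace_id_from_correlation_id(correlation_id: str) -> int:
    """Return a 128-bit trace ID: the integer value of the correlation UUID."""
    return int(UUID(correlation_id))


def span_id_from_source_id(source_id: str) -> int:
    """Return a 64-bit span ID taken from the start of the SHA-256 digest."""
    digest = hashlib.sha256(source_id.encode("utf-8")).digest()
    # Assumption: first 8 bytes, interpreted as a big-endian integer.
    return int.from_bytes(digest[:8], "big")


run_id = "550e8400-e29b-41d4-a716-446655440000"
# Formatted as 32 hex digits, the trace ID is the UUID without its dashes.
print(format(trace_id_from_correlation_id(run_id), "032x"))
# Span IDs are conventionally shown as 16 hex digits.
print(format(span_id_from_source_id(run_id), "016x"))
```

Because both functions are pure, any service or asynchronous task that knows the same IDs computes the same `trace_id` and `span_id` without coordination.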

Scenario A: Simple Workflow

A single workflow run with multiple nodes. All spans and logs share the same trace_id (derived from workflow_run_id).

trace_id = UUID(workflow_run_id)
├── [root span] dify.workflow.run (span_id = hash(workflow_run_id))
│   ├── [child] dify.node.execution - "Start" (span_id = hash(node_exec_id_1))
│   ├── [child] dify.node.execution - "LLM" (span_id = hash(node_exec_id_2))
│   └── [child] dify.node.execution - "End" (span_id = hash(node_exec_id_3))

Scenario B: Nested Sub-Workflow

A workflow calling another workflow via a Tool or Sub-workflow node. The child workflow's spans are linked to the parent via parent_span_id. Both workflows share the same trace_id.

trace_id = UUID(outer_workflow_run_id)     ← shared across both workflows
├── [root] dify.workflow.run (outer) (span_id = hash(outer_workflow_run_id))
│   ├── dify.node.execution - "Start Node"
│   ├── dify.node.execution - "Tool Node" (triggers sub-workflow)
│   │   └── [child] dify.workflow.run (inner) (span_id = hash(inner_workflow_run_id))
│   │       ├── dify.node.execution - "Inner Start"
│   │       └── dify.node.execution - "Inner End"
│   └── dify.node.execution - "End Node"

Key attributes for nested workflows:

Inner workflow's dify.parent.trace_id = outer workflow_run_id
Inner workflow's dify.parent.node.execution_id = tool node's execution_id
Inner workflow's dify.parent.workflow.run_id = outer workflow_run_id
Inner workflow's dify.parent.app.id = outer app_id

Scenario C: Draft Node Execution

A single node run in isolation (debugger/preview mode). It creates its own trace where the node span is the root.

trace_id = UUID(node_execution_id)   ← own trace, NOT part of any workflow
└── dify.node.execution.draft (span_id = hash(node_execution_id))

Key difference: Draft executions use node_execution_id as the correlation_id, so they are NOT children of any workflow trace.
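
The choice of correlation ID can be sketched as a single branch. The `is_draft` flag and the function names are hypothetical; the source only states which ID is used in each case.

```python
from uuid import UUID


def pick_correlation_id(workflow_run_id: str, node_execution_id: str, is_draft: bool) -> str:
    """Draft runs correlate on the node execution; everything else on the workflow run."""
    return node_execution_id if is_draft else workflow_run_id


def trace_id_for(workflow_run_id: str, node_execution_id: str, is_draft: bool) -> int:
    return int(UUID(pick_correlation_id(workflow_run_id, node_execution_id, is_draft)))
```

With the same pair of IDs, the draft and non-draft calls return different trace IDs, which is why a draft node never appears under a workflow's root span.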

Content Gating

When ENTERPRISE_INCLUDE_CONTENT is set to false, sensitive content attributes (inputs, outputs, queries) are replaced with reference strings (e.g., ref:workflow_run_id=...) to prevent data leakage to the OTEL collector.

Reference String Format:

ref:{id_type}={uuid}

Examples:

ref:workflow_run_id=550e8400-e29b-41d4-a716-446655440000
ref:node_execution_id=660e8400-e29b-41d4-a716-446655440001
ref:message_id=770e8400-e29b-41d4-a716-446655440002

To retrieve actual content when gating is enabled, query the Dify database using the provided UUID.
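
The gating rule and the reference format can be sketched as a pair of helpers. `gate_content` and `parse_ref` are hypothetical names used for illustration, not functions from the Dify codebase. The first replaces a value with its reference string when content is disabled, and the second recovers the ID type and UUID needed for a database lookup.

```python
from typing import Any, Tuple
from uuid import UUID


def gate_content(value: Any, *, include_content: bool, id_type: str, ref_id: str) -> Any:
    """Return the value unchanged, or a ref:{id_type}={uuid} string when gated."""
    if include_content:
        return value
    return f"ref:{id_type}={ref_id}"


def parse_ref(ref: str) -> Tuple[str, UUID]:
    """Split a reference string into (id_type, uuid); raise ValueError if malformed."""
    if not ref.startswith("ref:"):
        raise ValueError(f"not a reference string: {ref!r}")
    id_type, sep, raw_uuid = ref[len("ref:"):].partition("=")
    if not sep or not id_type:
        raise ValueError(f"not a reference string: {ref!r}")
    return id_type, UUID(raw_uuid)


gated = gate_content(
    {"query": "sensitive text"},
    include_content=False,
    id_type="workflow_run_id",
    ref_id="550e8400-e29b-41d4-a716-446655440000",
)
print(gated)  # ref:workflow_run_id=550e8400-e29b-41d4-a716-446655440000
```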

Reference

For a complete list of telemetry signals, attributes, and data structures, see DATA_DICTIONARY.md.