Commit Graph

9028 Commits

Author SHA1 Message Date
eaea4ad6dd fix: use payload.id instead of undefined args in set_default_provider 2026-03-04 22:28:35 -08:00
7007aa3c61 Merge branch 'fix/enterprise-api-error-handling' into deploy/enterprise 2026-03-04 19:54:13 -08:00
2b739b9544 fix: handle enterprise API errors properly to prevent KeyError crashes
When the enterprise API returns 403/404, the response contains error JSON
instead of the expected data structure. The code accessed fields directly,
which raised a KeyError and surfaced as a 500 Internal Server Error.

Changes:
- Add enterprise-specific error classes (EnterpriseAPIError, etc.)
- Implement centralized error validation in EnterpriseRequest.send_request()
- Extract error messages from API responses (message/error/detail fields)
- Raise domain-specific errors based on HTTP status codes
- Preserve backward compatibility with raise_for_status parameter

This prevents KeyError crashes and returns proper HTTP error codes
(403/404) instead of 500 errors.
2026-03-04 19:53:43 -08:00
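
A minimal sketch of the validation approach described in 2b739b9544, not the actual EnterpriseRequest.send_request() code. Only EnterpriseAPIError, the message/error/detail fields, and the raise_for_status parameter come from the commit message. The subclass names, the helper functions, and the default and semantics of raise_for_status are assumptions made for illustration.

```python
from typing import Any


class EnterpriseAPIError(Exception):
    """Base error for enterprise API failures (name taken from the commit message)."""

    def __init__(self, message: str, status_code: int) -> None:
        super().__init__(message)
        self.status_code = status_code


# Hypothetical subclasses: the commit only says "EnterpriseAPIError, etc."
class EnterpriseAPIForbiddenError(EnterpriseAPIError):
    pass


class EnterpriseAPINotFoundError(EnterpriseAPIError):
    pass


_STATUS_TO_ERROR = {403: EnterpriseAPIForbiddenError, 404: EnterpriseAPINotFoundError}


def extract_error_message(body: Any, default: str) -> str:
    """Return the first non-empty message/error/detail field, else the default."""
    if isinstance(body, dict):
        for key in ("message", "error", "detail"):
            value = body.get(key)
            if value:
                return str(value)
    return default


def validate_response(status_code: int, body: Any, raise_for_status: bool = True) -> Any:
    """Raise a domain-specific error for 4xx/5xx responses instead of returning error JSON.

    The raise_for_status default and semantics here are assumed, not taken from the source.
    """
    if status_code >= 400 and raise_for_status:
        error_cls = _STATUS_TO_ERROR.get(status_code, EnterpriseAPIError)
        raise error_cls(extract_error_message(body, f"HTTP {status_code}"), status_code)
    return body
```

With this shape, a caller can map the exception's status_code onto the HTTP response rather than reading fields from an error payload and hitting a KeyError.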
22e82297c5 fix(api): restore reg(ModelConfig) for Swagger schema generation 2026-03-04 17:34:08 -08:00
8049c90a38 Merge remote-tracking branch 'origin/release/e-1.12.1' into deploy/enterprise 2026-03-04 17:32:33 -08:00
3f771544b1 Merge branch '1.12.1-otel-ee' into deploy/enterprise 2026-03-04 17:31:51 -08:00
ee13650e3d fix(api): restore missing reg(ModelConfig) from 1.12.1 refactor 2026-03-04 17:31:19 -08:00
7ef139cadd Squash merge 1.12.1-otel-ee into release/e-1.12.1 2026-03-04 16:59:37 -08:00
9fa8f6235e Merge branch 'release/e-1.12.1' into 1.12.1-otel-ee 2026-03-04 16:59:21 -08:00
bf5a327156 fix(api): ensure enterprise workspace join occurs on account registration failure 2026-03-04 14:56:21 +08:00
d94af41f07 fix(api): ensure default workspace join occurs even if personal workspace creation fails 2026-03-04 14:56:21 +08:00
5d54c198c0 Merge branch '1.12.1-otel-ee' into deploy/enterprise 2026-03-02 20:01:15 -08:00
6536489195 fix(telemetry): restore TRACE_TASK_TO_CASE lookup broken by CE safety refactor
The CE safety commit (8a3485454a) converted module-level dicts to lazy
functions but did not update __init__.py, which still imported the
now-deleted TRACE_TASK_TO_CASE constant, causing an ImportError at startup.

Add get_trace_task_to_case() to gateway.py as a lazy public wrapper
(inverse of _get_case_to_trace_task) and update __init__.py to call it.
2026-03-02 19:59:20 -08:00
8f1d2455f4 Merge branch '1.12.1-otel-ee' into deploy/enterprise 2026-03-02 18:50:39 -08:00
8a3485454a fix(telemetry): ensure CE safety for enterprise-only imports and DB lookups
- Move enqueue_draft_node_execution_trace import inside call site in workflow_service.py
- Make gateway.py enterprise type imports lazy (routing dicts built on first call)
- Restore typed ModelConfig in llm_generator method signatures (revert dict regression)
- Fix generate_structured_output using wrong key model_parameters -> completion_params
- Replace unsafe cast(str, msg.content) with get_text_content() across llm_generator
- Remove duplicated payload classes from generator.py, import from core.llm_generator.entities
- Gate _lookup_app_and_workspace_names and credential lookups in ops_trace_manager behind is_enterprise_telemetry_enabled()
2026-03-02 18:45:33 -08:00
8d8552cbb9 Merge branch 'fix/otel-upgrade-e-1.12.1' into release/e-1.12.1 2026-03-02 17:21:39 -08:00
cf15f0d681 Merge branch '1.12.1-otel-ee' into deploy/enterprise 2026-03-02 15:56:52 -08:00
d6de27a25a feat(telemetry): promote gen_ai scalar fields from log-only to span attributes
Move gen_ai.usage.*, gen_ai.request.model, gen_ai.provider.name, and
gen_ai.user.id from companion-log-only to span attributes on workflow
and node execution spans.

These are small scalars with no size risk. Having them on spans enables
filtering and grouping in trace UIs (Tempo, Jaeger, Datadog) without
requiring a cross-signal join to companion logs.

Data dictionary updated: span tables gain the new fields; companion log
'additional attributes' tables trimmed to only list fields not already
covered by 'All span attributes'.
2026-03-02 15:55:10 -08:00
11ab67c8cb Merge branch '1.12.1-otel-ee' into deploy/enterprise 2026-03-02 04:20:06 -08:00
fe741140d5 fix(telemetry): fix zero-value message and workflow duration histograms
Workflow RT: replace float(info.workflow_run_elapsed_time) with
(end_time - start_time).total_seconds() using workflow_run.created_at and
workflow_run.finished_at. The elapsed_time DB field defaults to 0 and can
be stale if the workflow_storage Celery task has not committed yet when the
trace fires. Wall-clock timestamps are more reliable; elapsed_time is kept
as fallback.

Message RT: change end_time from created_at + provider_response_latency to
message.updated_at when updated_at > created_at. The pipeline explicitly
sets message.updated_at = naive_utc_now() at the moment the LLM response
is complete, making it the canonical response-complete timestamp.
Falls back to the latency-based calculation for error/aborted messages.
2026-03-02 04:14:57 -08:00
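
The two duration rules in fe741140d5 could be sketched as below. This is an illustration under stated assumptions, not the project's code: the function names are hypothetical, the arguments are passed in explicitly rather than read from model objects, and provider_response_latency is assumed to be in seconds.

```python
from datetime import datetime, timedelta
from typing import Optional


def workflow_duration_seconds(
    created_at: Optional[datetime],
    finished_at: Optional[datetime],
    elapsed_time: float,
) -> float:
    """Prefer wall-clock timestamps; fall back to the stored elapsed_time."""
    if created_at is not None and finished_at is not None:
        return (finished_at - created_at).total_seconds()
    return float(elapsed_time)


def message_end_time(
    created_at: datetime,
    updated_at: Optional[datetime],
    provider_response_latency: float,
) -> datetime:
    """Use updated_at when it is later than created_at, else the latency-based estimate."""
    if updated_at is not None and updated_at > created_at:
        return updated_at
    # Assumption: latency is expressed in seconds.
    return created_at + timedelta(seconds=provider_response_latency)
```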
9b5b355a4e fix(telemetry): gate ObservabilityLayer content attrs behind ENTERPRISE_INCLUDE_CONTENT
Add should_include_content() helper to extensions/otel/parser/base.py that
returns True in CE (no behaviour change) and respects ENTERPRISE_INCLUDE_CONTENT
in EE. Gate all content-bearing span attributes in LLM, retrieval, tool, and
default node parsers so that gen_ai.completion, gen_ai.prompt, retrieval.document,
tool call arguments/results, and node input/output values are suppressed when
ENTERPRISE_ENABLED=True and ENTERPRISE_INCLUDE_CONTENT=False.
2026-03-02 04:04:26 -08:00
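
A small sketch of the content gate described in 9b5b355a4e. The helper name should_include_content and the attribute names come from the commit message. Passing the two flags as arguments (the real helper presumably reads them from configuration) and the build_llm_span_attributes function are assumptions made to keep the example self-contained.

```python
def should_include_content(enterprise_enabled: bool, enterprise_include_content: bool) -> bool:
    """Return True in CE (no behaviour change); in EE, follow ENTERPRISE_INCLUDE_CONTENT."""
    if not enterprise_enabled:
        return True
    return enterprise_include_content


def build_llm_span_attributes(
    model: str,
    prompt: str,
    completion: str,
    *,
    enterprise_enabled: bool,
    enterprise_include_content: bool,
) -> dict[str, str]:
    """Hypothetical parser step: scalar attributes always, content only when allowed."""
    attrs = {"gen_ai.request.model": model}
    if should_include_content(enterprise_enabled, enterprise_include_content):
        attrs["gen_ai.prompt"] = prompt
        attrs["gen_ai.completion"] = completion
    return attrs
```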
ff35f1bfaa Merge branch '1.12.1-otel-ee' into deploy/enterprise 2026-03-02 02:28:30 -08:00
3364003f90 fix(telemetry): add credential_name lookup with async-safe fallback 2026-03-02 02:27:31 -08:00
e387d0205b Merge branch '1.12.1-otel-ee' into deploy/enterprise 2026-03-02 01:54:55 -08:00
6df00c83ae fix(telemetry): populate LLM credential info in node execution traces
- Add _lookup_llm_credential_info() to query Provider/ProviderModel tables
- Lookup LLM credentials when tool credential_id is null
- Fall back to provider-level credential if no model-specific credential
2026-03-02 01:47:39 -08:00
05cf2336ac docs(telemetry): add token consumption query patterns to data dictionary
Add token hierarchy diagram, common PromQL queries (totals, drill-down,
rates), and app name lookup via trace query.
2026-03-02 01:19:00 -08:00
b710c9ad59 fix(telemetry): populate missing fields in node execution trace
- Extract model_provider/model_name from process_data (LLM nodes store
  model info there, not in execution_metadata)
- Add invoke_from to node execution trace metadata dict
- Add credential_id to node execution trace metadata dict
- Add conversation_id to metadata after message_id lookup
- Add tool_name to tool_info dict in tool node
2026-03-02 01:18:59 -08:00
a2a5b02a53 docs(telemetry): add token consumption query patterns to data dictionary
Add token hierarchy diagram, common PromQL queries (totals, drill-down,
rates), and app name lookup via trace query.
2026-03-02 01:07:18 -08:00
1fcb05432d fix(telemetry): populate missing fields in node execution trace
- Extract model_provider/model_name from process_data (LLM nodes store
  model info there, not in execution_metadata)
- Add invoke_from to node execution trace metadata dict
- Add credential_id to node execution trace metadata dict
- Add conversation_id to metadata after message_id lookup
- Add tool_name to tool_info dict in tool node
2026-03-02 01:07:10 -08:00
9c148218fc Merge branch 'deploy/enterprise' of https://github.com/langgenius/dify into deploy/enterprise 2026-03-02 16:58:01 +08:00
02ab3a34b4 Merge branch 'release/e-1.12.1' into deploy/enterprise 2026-03-02 16:57:31 +08:00
58524fd7fd feat(enterprise): auto-join newly registered accounts to the default workspace (#32308)
Co-authored-by: Yunlu Wen <yunlu.wen@dify.ai>
2026-03-02 16:38:43 +08:00
aa7f648712 Merge branch '1.12.1-otel-ee' into deploy/enterprise 2026-03-01 22:30:09 -08:00
9d4b2715e8 fix(celery): register enterprise_telemetry_task in worker imports
Fixes Celery worker error where process_enterprise_telemetry task
was unregistered despite being dispatched from the app.

Added conditional import when ENTERPRISE_TELEMETRY_ENABLED=true
to ensure the task is available in the worker process.

Resolves: KeyError 'tasks.enterprise_telemetry_task.process_enterprise_telemetry'
2026-03-01 22:27:44 -08:00
e2fc3417be Merge branch 'fix/otel-upgrade-e-1.12.1' into deploy/enterprise 2026-03-01 21:48:37 -08:00
2d7bffcc11 fix: upgrade OpenTelemetry packages from 0.48b0 to 0.49b0
Fixes "Failed to detach context" error in production by upgrading to OTEL 0.49b0,
which includes None token guards in the Celery instrumentor (PR opentelemetry-python-contrib#2927).

Package Updates:
- OTEL instrumentation: 0.48b0 → 0.49b0
- OTEL SDK/API: 1.27.0 → 1.28.0
- protobuf: 4.25.8 → 5.29.6 (required by opentelemetry-proto 1.28.0)
- Google Cloud packages upgraded for protobuf 5.x compatibility:
  - google-api-core: 2.18.0 → 2.19.1+
  - google-auth: 2.29.0 → 2.47.0+
  - google-cloud-aiplatform: 1.49.0 → 1.123.0+
  - googleapis-common-protos: 1.63.0 → 1.65.0+
  - google-cloud-storage: 2.16.0 → 3.0.0+
- httpx: 0.27.0 → 0.28.0 (required by google-genai 1.37+)

Also removed duplicate opentelemetry-instrumentation-httpx entry in pyproject.toml.
2026-03-01 21:47:51 -08:00
eb1b1eb09c Merge 1.12.1-otel-ee into deploy/enterprise 2026-03-01 19:37:06 -08:00
83f5850d0a refactor(telemetry): add resolved_parent_context property and fix edge cases
- Add resolved_parent_context property to BaseTraceInfo for reusable parent context extraction
- Refactor enterprise_trace.py to use property instead of duplicated dict plucking (~19 lines eliminated)
- Fix UUID validation in exporter.py with specific error logging for invalid trace correlation IDs
- Add error isolation in event_handlers.py to prevent telemetry failures from breaking user operations
- Replace pickle-based payload_fallback with JSON storage rehydration for security
- Update TelemetryEnvelope to use Pydantic v2 ConfigDict with extra='forbid'
- Update tests to reflect contract changes and new error handling behavior
2026-03-01 19:33:59 -08:00
3368d4cf02 Merge branch '1.12.1-otel-ee' into deploy/enterprise 2026-03-02 10:10:28 +08:00
7a92c1764f fix token label 2026-03-02 10:10:01 +08:00
5617d69ca7 try to fix exception logging 2026-03-02 09:53:11 +08:00
1a6aded8e0 Merge branch '1.12.1-otel-ee' into deploy/enterprise 2026-03-01 02:25:23 -08:00
9952a17fed fix(telemetry): use URL scheme instead of API key for gRPC TLS detection
- Change insecure parameter from API key-based to URL scheme-based detection
- https:// endpoints now correctly use TLS (insecure=False)
- All other endpoints (http://, no scheme) use insecure=True
- Update tests to reflect URL scheme-based logic
- Remove incorrect documentation claiming API key controls TLS
2026-03-01 02:24:25 -08:00
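
The scheme-based rule in 9952a17fed reduces to a single check, sketched here. The function name is hypothetical, and the whitespace stripping and case-insensitive comparison are assumptions that the commit message does not state.

```python
def is_insecure_grpc_endpoint(endpoint: str) -> bool:
    """Return the value for the exporter's insecure flag.

    https:// endpoints use TLS (insecure=False); everything else,
    including http:// and scheme-less host:port, is treated as insecure.
    """
    return not endpoint.strip().lower().startswith("https://")
```

The result would be passed as the insecure argument when constructing the gRPC exporter, independent of whether an API key is configured.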
36ff9b447d Merge origin/release/e-1.12.1 into 1.12.1-otel-ee
Sync enterprise 1.12.1 changes:
- feat: implement heartbeat mechanism for database migration lock
- refactor: replace AutoRenewRedisLock with DbMigrationAutoRenewLock
- fix: improve logging for database migration lock release
- fix: make flask upgrade-db fail on error
- fix: include sso_verified in access_mode validation
- fix: inherit web app permission from original app
- fix: make e-1.12.1 enterprise migrations database-agnostic
- fix: get_message_event_type return wrong message type
- refactor: document_indexing_sync_task split db session
- fix: trigger output schema miss
- test: remove unrelated enterprise service test

Conflict resolution:
- Combined OTEL telemetry imports with tool signature import in easy_ui_based_generate_task_pipeline.py
2026-03-01 00:18:46 -08:00
1fa1960201 Merge branch '1.12.1-otel-ee' into deploy/enterprise 2026-02-28 20:34:15 -08:00
ff877ee39c fix(telemetry): add resolved_trace_id property to eliminate trace_id inconsistencies
Add computed property to BaseTraceInfo that provides intelligent fallback:
1. External trace_id (from X-Trace-Id header)
2. workflow_run_id (for workflow-related traces)
3. message_id (as final fallback)

This ensures the dify.trace_id attribute always matches the log-level trace_id,
eliminating cases where the attribute was null but the log-level field had a value.

Changes:
- Add resolved_trace_id property to BaseTraceInfo (trace_entity.py)
- Replace 4 direct trace_id attribute assignments with resolved_trace_id
- Add trace_id_source parameter to 5 emit_metric_only_event calls

Fixes trace_id inconsistency found in MESSAGE_RUN, TOOL_EXECUTION,
MODERATION_CHECK, SUGGESTED_QUESTION_GENERATION, GENERATE_NAME_EXECUTION,
DATASET_RETRIEVAL, and PROMPT_GENERATION_EXECUTION events.

All 78 telemetry tests passing.
2026-02-28 20:32:15 -08:00
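
The fallback order in ff877ee39c can be illustrated with a small stand-in class. TraceInfoSketch is hypothetical and is not the real BaseTraceInfo. Only the property name resolved_trace_id and the three-step order come from the commit message. Treating empty strings the same as None is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TraceInfoSketch:
    """Stand-in for BaseTraceInfo with only the fields needed for the fallback."""

    trace_id: Optional[str] = None  # external ID, e.g. from the X-Trace-Id header
    workflow_run_id: Optional[str] = None
    message_id: Optional[str] = None

    @property
    def resolved_trace_id(self) -> Optional[str]:
        # 1. external trace_id, 2. workflow_run_id, 3. message_id
        return self.trace_id or self.workflow_run_id or self.message_id
```

Reading resolved_trace_id at every emit site, instead of trace_id directly, is what keeps the span attribute and the log-level field in agreement.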
370e1fa5e2 Merge branch '1.12.1-otel-ee' into deploy/enterprise 2026-02-28 19:30:49 -08:00
abcf14a571 refactor(telemetry): move gateway to core as stateless module-level functions
Move routing table, emit(), and is_enterprise_telemetry_enabled() from
enterprise/telemetry/gateway.py into core/telemetry/gateway.py so both
CE and EE share one code path. The ce_eligible flag in CASE_ROUTING
controls which events flow in CE — flipping it is the only change needed
to enable an event in community edition.

- Delete enterprise/telemetry/gateway.py (class-based singleton)
- Create core/telemetry/gateway.py (stateless functions, no shared state)
- Simplify core/telemetry/__init__.py to thin facade over gateway
- Remove TelemetryGateway class and get_gateway() from ext_enterprise_telemetry
- Single-source is_enterprise_telemetry_enabled in core.telemetry.gateway
- Fix pre-existing test bugs (missing dify.event.id in metric handler tests)
- Update all imports and mock paths across 7 test files
2026-02-28 19:27:24 -08:00
9bd938b4e1 Merge branch '1.12.1-otel-ee' into deploy/enterprise 2026-02-28 17:41:17 -08:00
5e57f73598 feat(telemetry): add model provider and name tags to all trace metrics
Add comprehensive model tracking across all OTEL metrics and logs:
- Node execution metrics now include model_name for LLM operations
- Suggested question metrics include model_provider and model_name
- Dataset retrieval captures both embedding and rerank model info
- Updated DATA_DICTIONARY.md with complete metric label documentation

This enables granular cost tracking, performance analysis, and usage monitoring per model across all operation types.
2026-02-28 00:06:44 -08:00