Observability for the OpenClaw Agent — sessions, skills, token usage & alerts, latency (P50/P95/P99), agent & harness (Project Context, OpenClaw Structure, …), model pricing, live logs, and real-time IM push (Feishu/DingTalk). Ships as a standalone NestJS + React app with an English/Chinese UI, deployable via PM2 or CLI.
Languages: English (this file) · 简体中文
| Capability | OpenClaw default management console | TraceFlow |
|---|---|---|
| Bundled with Gateway | Yes | No (separate app) |
| Skill call tracing (inferred from read paths) | — | Yes |
| Per-user skill stats | — | Yes |
| Token thresholds & rankings | Basic | Stronger |
| Agent & harness (OpenClaw-aligned labels) | — | Yes |
| Latency P50/P95/P99 | — | Yes |
| Gateway connection behavior | Long-lived WS | Long-lived WS (reused for status, usage, logs.tail, skills.status, etc.) |
| Deployment | With Gateway | PM2, own port |
| UI language | Mainly one language | English + Chinese |
| Automation friendliness | Basic | JSON HTTP APIs + log WebSocket streaming |
| IM push (Agent sessions to Feishu/DingTalk) | — | Yes — thread-aggregated, rate-limited, debounced (v1.1.0+; China-focused, extensible) |
| Statistical scope spelled out in-product (ℹ) | Rare | Yes — major blocks explain what is included/excluded (e.g. live *.jsonl vs *.jsonl.reset.*, active vs archived tokens, totalTokensFresh caveats) |
| Operator-safe Gateway overview without `operator.read` | N/A | Yes: path checks use the `connect` snapshot; dashboard health/overview uses the `health` RPC (scope-exempt) when the backend WS has cleared scopes (see Gateway scopes below) |
Dashboard overview — Gateway health, session distribution, token summary, latency, top skills/tools, recent sessions, and live logs.
Dashboard screenshot (update when ready with latest capture)
Session list — Paged list per agent, with recorded vs estimated tokens, participant identities, status filters, and sort.
Session list screenshot (update when ready)
Session detail — Single transcript view with head/tail loading for large files, messages/tools/events/skills tabs.
Session detail screenshot (update when ready)
Skills analysis — Call frequency top 10, user distribution, skill × tool breakdown, zombie/duplicate detection.
Skills screenshot (update when ready)
System Prompt & Harness — Workspace bootstrap files, Project Context, Skills snapshot, token breakdown, evaluation results.
System Prompt screenshot (update when ready)
Token monitor & Pricing — Threshold distribution, dual-track (recorded vs estimate) token metrics, model pricing configuration.
Token monitor / pricing screenshots (update when ready)
Screenshot assets live in `docs/traceFlowSnapshots/` (currently: `dashboard-1.png`, `sessionList.png`, `sessionDetail.png`, `skills.png`, `systemPrompt.png`, `tokenMonitor.png`, `models.png`). These are referenced above and will be wired into the README once aligned with the current UI build.
| Requirement | Notes |
|---|---|
| Node.js | >= 20.11.0 (20 LTS recommended) |
| pnpm | >= 9.0.0 |
| PM2 | Optional but recommended for production (deploy:pm2) |
From the openclaw-traceflow directory (after cloning this repository):
```bash
pnpm run deploy:pm2
```

This runs `pnpm install`, builds backend + frontend, then starts or reloads the process `openclaw-traceflow` under PM2. Open http://localhost:3001 (or your `HOST`/`PORT`).
Ensure the OpenClaw Gateway is reachable at OPENCLAW_GATEWAY_URL (default http://localhost:18789). Set token/password in Settings in the UI if your Gateway requires auth.
TraceFlow supports multiple deployment modes depending on your environment.
```bash
pnpm run deploy:pm2
```

The deploy script (`scripts/deploy-pm2.sh`) handles install → build → PM2 start/reload in one step. The process is registered as `openclaw-traceflow` under PM2.
Common PM2 commands:
```bash
pm2 logs openclaw-traceflow --lines 100   # view logs
pm2 restart openclaw-traceflow            # restart
pm2 stop openclaw-traceflow               # stop
pm2 delete openclaw-traceflow             # remove from PM2
```

Build, then restart in production mode:

```bash
pnpm run build:all
pnpm run restart:prod
```

`restart:prod` starts the process under PM2 if it is not yet running, or restarts it with auto-restart (up to 10 retries, 3s delay).
```bash
# Backend + frontend hot-reload (two processes)
pnpm run dev

# Backend only
pnpm run start:dev

# Backend + frontend separately
pnpm run dev:backend    # NestJS watch
pnpm run dev:frontend   # Vite dev server
```

TraceFlow ships a CLI binary (`bin/cli.js`) registered as both `openclaw-traceflow` and `openclaw-monitor`:
```bash
openclaw-traceflow   # start the service
openclaw-monitor     # alias, same binary
pnpm run monitor     # via package.json
```

On first launch, TraceFlow shows a Setup Wizard in the browser to configure OpenClaw data paths. You can skip this if Gateway auto-discovers paths or if you set `OPENCLAW_STATE_DIR` / `OPENCLAW_WORKSPACE_DIR` beforehand.
For production exposure, place TraceFlow behind Nginx / Caddy with auth. Only /api/setup/* is protected by OPENCLAW_ACCESS_MODE; other read APIs are not Bearer-protected by default.
- Backend: NestJS 11 + TypeScript
- Frontend: React 18 + Vite 5 + React Router 6 + Ant Design 5 + Pro Layout + react-intl (Chinese/English)
- Charts: Recharts 3
- Realtime: Socket.IO (log streaming; the dashboard uses HTTP polling)
- Storage: sql.js (SQLite), `data/metrics.db`
- Gateway: `GatewayConnectionService` + `TraceflowGatewayPersistentClient` (long-lived WS, rebuilt when the configuration changes)
- Logging: Winston + `winston-daily-rotate-file` (log rotation + auto cleanup)
- IM Push: Feishu (`@larksuiteoapi/node-sdk`) + EventEmitter2 event-driven architecture
TraceFlow is a separate web service that talks to your running OpenClaw Gateway (default http://localhost:18789). It does not replace the Gateway or OpenClaw’s default management console; it complements them with operator-focused dashboards you can deploy on another host or port (default http://0.0.0.0:3001).
Data scope & honesty. Many agent consoles show numbers without saying where they come from or what they exclude. TraceFlow treats that as a product risk: the UI documents statistical scope on major panels (ℹ tooltips)—for example, Dashboard Skills / Tools Top 5 aggregate live transcripts (*.jsonl) only, not archived turns (*.jsonl.reset.*); Token views separate active vs archived usage; session/token copy calls out totalTokensFresh and index lag when relevant. The goal is fewer silent mismatches between what operators assume and what the pipeline actually measures.
Performance stance. Observability should not mean “re-read everything on every click.” TraceFlow already ships incremental session directory scans, fingerprint-based caching for per-session tool/skill aggregation when transcripts haven’t changed, head/tail parsing for very large JSONL transcripts, a single long-lived Gateway WebSocket for RPC, and a batched dashboard overview endpoint. Remaining hot paths (e.g. worst-case O(n) scans when many sessions exist) are tracked honestly in ROADMAP.md.
Product design: The harness-visible vision, platform vs user system prompt layering, and TraceFlow UX roadmap are in docs/agent-harness-and-system-prompt.md.
When TraceFlow connects to the Gateway as mode: backend without a paired device identity, OpenClaw may clear scopes on that connection after connect. RPCs that require operator.read (for example some skills.status / usage paths) can then fail with missing scope: operator.read.
TraceFlow’s approach (keep this behavior when changing code):
- Runtime path discovery should rely on the `connect` snapshot (`stateDir` / `configPath`), not on `operator.read`-gated probes alone.
- Dashboard health / overview should prefer the Gateway `health` RPC (treated as scope-exempt in practice) and map its payload into UI shapes.
Code anchors: src/openclaw/gateway-overview-health.ts, gateway-persistent-client.ts, gateway-ws-paths.ts.
TraceFlow is designed to work out of the box. In most local setups, you can run pnpm run deploy:pm2 and open http://localhost:3001 without setting anything.
| Variable | When to set it | Default |
|---|---|---|
| `OPENCLAW_GATEWAY_URL` | Your Gateway is not reachable at localhost/default port | `http://localhost:18789` |
Set Gateway auth (token/password) in the Settings page when required.
| Variable | Purpose | Default |
|---|---|---|
| `HOST` | Bind address | `0.0.0.0` |
| `PORT` | HTTP port | `3001` |
| `DATA_DIR` | Local data (metrics DB, etc.) | `./data` |
| `OPENCLAW_GATEWAY_TOKEN` / `OPENCLAW_GATEWAY_PASSWORD` | Gateway auth for WS/RPC | unset |
| `OPENCLAW_STATE_DIR` / `OPENCLAW_WORKSPACE_DIR` | Path overrides | auto |
| `OPENCLAW_LOG_PATH` | Fallback log file if Gateway logs are unavailable | unset |
| `OPENCLAW_ACCESS_MODE` | Protect `/api/setup/*` (`local-only` · `token` · `none`) | `none` |
| `OPENCLAW_RUNTIME_ACCESS_TOKEN` | Bearer token used with `OPENCLAW_ACCESS_MODE=token` | unset |
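As a rough illustration of how the defaults in the table could be applied, the sketch below resolves a few of these variables from an environment map. `resolveConfig`, `TraceflowEnvConfig`, and the validation rules are hypothetical names and choices made for this example; they are not TraceFlow's actual config loader.

```typescript
// Hypothetical sketch: resolve documented env vars using the README defaults.
// None of these names come from the TraceFlow codebase.
const ACCESS_MODES = ["local-only", "token", "none"] as const;
type AccessMode = (typeof ACCESS_MODES)[number];

interface TraceflowEnvConfig {
  host: string;
  port: number;
  dataDir: string;
  gatewayUrl: string;
  accessMode: AccessMode;
}

type Env = Record<string, string | undefined>;

function isAccessMode(value: string): value is AccessMode {
  return (ACCESS_MODES as readonly string[]).includes(value);
}

function resolveConfig(env: Env): TraceflowEnvConfig {
  // PORT must be a valid TCP port; fall back to the documented default.
  const port = Number(env.PORT ?? "3001");
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error(`Invalid PORT: ${env.PORT}`);
  }
  // Assumption: unknown access modes are rejected rather than ignored.
  const mode = env.OPENCLAW_ACCESS_MODE ?? "none";
  if (!isAccessMode(mode)) {
    throw new Error(`Invalid OPENCLAW_ACCESS_MODE: ${mode}`);
  }
  return {
    host: env.HOST ?? "0.0.0.0",
    port,
    dataDir: env.DATA_DIR ?? "./data",
    gatewayUrl: env.OPENCLAW_GATEWAY_URL ?? "http://localhost:18789",
    accessMode: mode,
  };
}
```

Calling `resolveConfig(process.env)` in a Node.js process would yield the documented defaults when nothing is set.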
More detail: config/README.md and optional config/openclaw.runtime.json.
Pricing: Token cost estimates use built-in defaults; override with config/model-pricing.json (see config/model-pricing.example.json).
Regional note: This feature currently supports Feishu (飞书, the Chinese version of Lark) and DingTalk (钉钉), which are the dominant IM platforms in China's enterprise market. The architecture is designed to be channel-agnostic—contributions adding Slack, Microsoft Teams, Discord, or other global IM platforms are welcome. See the Extending channels section below.
TraceFlow can push real-time Agent session records to IM platforms (currently Feishu, with DingTalk scaffolded), organizing messages by conversation thread for easy search and review.
```text
OpenClaw Gateway (agents/*/sessions/*.jsonl)
        │
        ▼  (fs.watch on sessions/*.jsonl)
SessionManager (direct file system listener)
        │
        ▼  (emit audit.session.* events)
ImPushService (push coordination + in-memory queue)
        │
        ▼
FeishuChannel (Feishu API + rate limiting + debounce)
        │
        ▼
Feishu audit bot (thread-aggregated messages)
```
Key design decisions:
- File system only: no dependency on OpenClaw WebSocket, HTTP API, or event system. Watches `agents/*/sessions/*.jsonl` directly.
- In-memory queue: messages are serialized per session to avoid race conditions and ensure chronological order. No SQLite persistence (simpler, no migration needed).
- No history replay on restart: after a restart, only new messages are pushed. Historical messages are not backfilled to avoid message storms.
- Debounce: JSONL streaming writes are debounced to prevent Feishu API flooding.
- Rate limiting: token bucket algorithm (10 msg/s, burst capacity 20).
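To make the rate-limiting decision concrete, here is a minimal token-bucket sketch using the parameters quoted above (10 msg/s, burst capacity 20). The `TokenBucket` class and its `tryRemove` method are hypothetical and written for this example; the real `FeishuChannel` limiter may be structured differently.

```typescript
// Minimal token-bucket sketch (illustrative only, not TraceFlow's code).
class TokenBucket {
  private readonly ratePerSec: number; // steady refill rate, e.g. 10 msg/s
  private readonly capacity: number;   // maximum burst size, e.g. 20
  private tokens: number;
  private lastRefillMs: number;

  constructor(ratePerSec: number, capacity: number, nowMs: number = Date.now()) {
    this.ratePerSec = ratePerSec;
    this.capacity = capacity;
    this.tokens = capacity; // start full so an initial burst is allowed
    this.lastRefillMs = nowMs;
  }

  /** Try to consume one token; returns true when a message may be sent now. */
  tryRemove(nowMs: number = Date.now()): boolean {
    // Refill in proportion to elapsed time, capped at the burst capacity.
    const elapsedSec = Math.max(0, nowMs - this.lastRefillMs) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.ratePerSec);
    this.lastRefillMs = nowMs;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}
```

With `new TokenBucket(10, 20)`, a burst of 20 messages passes immediately, after which sends are limited to roughly 10 per second. A caller that receives `false` would typically leave the message in the queue and retry later.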
- Configure Feishu credentials in `config/openclaw.runtime.json`:

  ```json
  {
    "im": {
      "enabled": true,
      "channels": {
        "feishu": {
          "enabled": true,
          "appId": "cli_xxx",
          "appSecret": "xxx",
          "targetUserId": "ou_xxx",
          "pushStrategy": {
            "sessionStart": false,
            "sessionMessages": true,
            "sessionEnd": true,
            "errorLogs": true,
            "warnLogs": false
          }
        }
      }
    }
  }
  ```

- Get Feishu credentials: visit the Feishu Open Platform, create an enterprise app, obtain the App ID/Secret, and configure bot messaging permissions.
- Restart TraceFlow and verify push in the Feishu audit bot.
| Config | Description | Default |
|---|---|---|
| `sessionStart` | Push session start notification | `false` |
| `sessionMessages` | Push session messages (user/AI/skill) | `true` |
| `sessionEnd` | Push session end summary | `true` |
| `errorLogs` | Push ERROR log alerts | `true` |
| `warnLogs` | Push WARN logs | `false` |
| Endpoint | Method | Description |
|---|---|---|
| `/api/im/channels` | GET | List enabled channels |
| `/api/im/channels/health` | GET | Channel health status |
| `/api/im/channels/:type/enabled` | GET | Check if channel enabled |
| `/api/im/channels/:type/test` | POST | Send test message |
| `/api/im/broadcast/test` | POST | Broadcast test message |
New IM channels implement the ImChannel interface (initialize, send, healthCheck, destroy) and register in ImModule. The architecture is designed for global IM platforms—contributions for Slack, Microsoft Teams, Discord, WeCom (企业微信), or others are welcome. See docs/IM_CHANNELS_GUIDE.md for the full guide.
- IM_PUSH.md — feature overview and troubleshooting
- IM_CHANNELS_GUIDE.md — channel plugin guide
- IM_PUSH_STRATEGY.md — push strategy implementation details
- IM_OPENCLAW_INTEGRATION.md — OpenClaw integration architecture
TraceFlow uses Winston with daily rotating log files:
- Log file: `data/traceflow.log` (current day)
- Rotation: daily, with automatic cleanup of old files
- Timezone: Asia/Shanghai (Beijing Time) for all log timestamps
- View logs: `pm2 logs openclaw-traceflow --lines 100` or `tail -f data/traceflow.log`
- HTTP health: `GET /api/health` returns Gateway connection status and runtime health
- Dashboard polling: the frontend polls `GET /api/dashboard/overview` every ~10s when visible
- Background metrics: token usage snapshots every ~30s (configurable via code constants)
- IM channel health: `GET /api/im/channels/health` returns channel health status
- IM push config: changes to `config/openclaw.runtime.json` are picked up on the next config read
- Path configuration: changes saved in the Settings UI take effect immediately (in-memory config sync)
- Gateway connection: rebuilt automatically when the Gateway URL/token/password changes
- Session watch: IM push monitors `agents/*/sessions/*.jsonl` files via `fs.watch`
- Session end detection: 5 minutes of inactivity triggers session completion
- Restart behavior: on restart, historical sessions are not backfilled; only new messages are pushed from the current position
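The inactivity rule above can be sketched as a small tracker that records the last transcript change per session and reports sessions that have been idle for five minutes. `SessionActivityTracker`, `touch`, and `collectEnded` are hypothetical names for this illustration; how TraceFlow's `SessionManager` actually schedules this check is not specified here.

```typescript
// Hypothetical sketch of inactivity-based session end detection.
const SESSION_IDLE_MS = 5 * 60 * 1000; // 5 minutes, as stated in the list above

class SessionActivityTracker {
  private readonly lastActivity = new Map<string, number>();

  /** Record that a session's transcript changed at `nowMs`. */
  touch(sessionId: string, nowMs: number): void {
    this.lastActivity.set(sessionId, nowMs);
  }

  /** Return, and stop tracking, sessions idle for at least SESSION_IDLE_MS. */
  collectEnded(nowMs: number): string[] {
    const ended: string[] = [];
    this.lastActivity.forEach((last, id) => {
      if (nowMs - last >= SESSION_IDLE_MS) {
        ended.push(id);
      }
    });
    // Forget ended sessions so each one is reported only once.
    for (const id of ended) {
      this.lastActivity.delete(id);
    }
    return ended;
  }
}
```

In this sketch a file-watch callback would call `touch`, and a periodic timer would call `collectEnded` and emit a session-end event for each returned id.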
| Data type | Source | Notes |
|---|---|---|
| Session transcripts | `agents/*/sessions/*.jsonl` | Live + archived (`*.jsonl.reset.*`) |
| Token metrics | Local `data/metrics.db` (~30s snap) | Active + archived dual-track |
| Gateway health | Gateway `health` RPC (WS) | Scope-exempt, no `operator.read` required |
| IM push events | File system watch | Only OpenClaw data, not TraceFlow logs |
| TraceFlow application logs | Winston → `data/traceflow.log` | Rotating, timezone Asia/Shanghai |
```bash
# Check process status
pm2 list

# View recent logs
pm2 logs openclaw-traceflow --lines 50

# View live log stream
tail -f data/traceflow.log

# Restart after config change
pm2 restart openclaw-traceflow

# Full redeploy (install + build + restart)
pnpm run deploy:pm2

# Clean build artifacts
pnpm run clean
```

| Path | Purpose |
|---|---|
| `/` · `/dashboard` | Overview: Gateway health, tokens, latency, tools, etc. |
| `/sessions` · `/sessions/:id` · `/sessions/:id/archives` | Session list, detail, and archived epochs |
| `/system-prompt` (`/agent-harness` redirects here) | Agent & harness: Project Context, OpenClaw Structure, skills snapshot, etc. |
| `/workspace` | Workspace bootstrap files (`AGENTS.md` / `SOUL.md` / `IDENTITY.md` / `USER.md`) |
| `/markdown-preview` | Rendered markdown preview for workspace bootstrap docs |
| `/pricing` | Model pricing |
| `/logs` | Live logs (Socket.IO) |
| `/settings` | Gateway URL, paths, access |
- One row is one conversation thread in OpenClaw (one `sessionId` / one transcript). In group chats, many people usually share the same session row.
- `sessionKey` encodes routing/shape (provider, group/channel/DM, etc.); it is not the same thing as "who" appears in the participant column.
- `agent:<agentId>:main` is OpenClaw's default "main" DM bucket when direct chats use the `main` session scope; TraceFlow labels it Main session (Chinese UI: 主会话), not "heartbeat." Scheduled heartbeat traffic may still land in the same transcript; the key shape alone does not mean "heartbeat session."
- Participant (list): TraceFlow scans each transcript JSONL for distinct sender identities (`Sender` / `Conversation info` metadata blocks, `senderLabel`, `message.sender`, etc.). If there are multiple distinct human senders, the column shows `firstIdentity (+N)` where `N` is the count of additional identities (not the total headcount).
- Participant (detail): When multiple identities exist, the detail page shows the first plus +N; click +N for a popover with the full deduped list (same source as the list scan). Group rosters may be larger than what appears in the transcript; only observed senders are listed.
- Session detail · Messages: single-column list. Each message is one line by default; click the row to expand the full body; use the arrow button to collapse (so selecting text in the expanded body does not collapse the row).
- `unknown` usually means the index had no id or the first transcript lines could not infer one; see the session detail help text.
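The `firstIdentity (+N)` rule can be illustrated with a short helper. `participantLabel` is a hypothetical function written for this example, and returning `unknown` for an empty sender list is an assumption based on the note above, not confirmed behavior.

```typescript
// Hypothetical helper mirroring the `firstIdentity (+N)` display rule.
function participantLabel(senders: string[]): string {
  // Dedupe while preserving first-seen order; skip blank identities.
  const distinct: string[] = [];
  const seen = new Set<string>();
  for (const raw of senders) {
    const id = raw.trim();
    if (id !== "" && !seen.has(id)) {
      seen.add(id);
      distinct.push(id);
    }
  }
  // Assumption: no observed sender is rendered as "unknown".
  if (distinct.length === 0) return "unknown";
  // N counts the additional identities, not the total headcount.
  const extra = distinct.length - 1;
  return extra > 0 ? `${distinct[0]} (+${extra})` : distinct[0];
}
```

For a transcript where alice, bob, and carol each sent messages (in that first-seen order), the label would read `alice (+2)`.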
TraceFlow targets single-host, small-to-medium session counts. In practice we optimize the steady state:
- Session list / storage: `FileSystemSessionStorage` incrementally rescans changed transcript files and keeps a short-lived cache so `listSessions` stays mostly in-memory work.
- Dashboard tool/skill Top 5: `MetricsService.refreshToolStatsSnapshot()` keeps a per-session fingerprint (`lastActiveAt` + transcript size + status). If unchanged, it reuses cached tool/skill counts instead of re-parsing JSONL, so idle or completed sessions that don't churn stop paying full parse cost on every refresh.
- Session detail: large transcripts use a head/tail window instead of loading the entire file (see server constants / session detail UI).
- Gateway: one reused WebSocket client per configured URL+auth, which avoids repeated handshakes for `health`, `status`, `logs.tail`, etc.
- Overview API: `GET /api/dashboard/overview` bundles health, sessions, logs, and metrics in one round trip for the React dashboard.
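The fingerprint check described in the list above can be sketched as a small cache keyed by session id. `ToolStatsCache`, `SessionMeta`, and `fingerprintOf` are hypothetical names for illustration; the actual `MetricsService.refreshToolStatsSnapshot()` implementation may differ in structure and fields.

```typescript
// Hypothetical sketch of fingerprint-based reuse of per-session tool counts.
interface SessionMeta {
  id: string;
  lastActiveAt: number;   // epoch milliseconds
  transcriptSize: number; // bytes
  status: string;
}

type ToolCounts = Record<string, number>;

// The fingerprint combines the three fields named in the list above.
function fingerprintOf(s: SessionMeta): string {
  return `${s.lastActiveAt}:${s.transcriptSize}:${s.status}`;
}

class ToolStatsCache {
  private readonly entries = new Map<string, { fingerprint: string; counts: ToolCounts }>();
  private readonly parse: (s: SessionMeta) => ToolCounts;

  // `parse` stands in for the expensive JSONL parse of one transcript.
  constructor(parse: (s: SessionMeta) => ToolCounts) {
    this.parse = parse;
  }

  get(session: SessionMeta): ToolCounts {
    const fingerprint = fingerprintOf(session);
    const hit = this.entries.get(session.id);
    // Unchanged fingerprint: reuse cached counts and skip the parse.
    if (hit !== undefined && hit.fingerprint === fingerprint) {
      return hit.counts;
    }
    const counts = this.parse(session);
    this.entries.set(session.id, { fingerprint, counts });
    return counts;
  }
}
```

The trade-off in this sketch is that a transcript rewritten in place with the same size, timestamp, and status would be served from cache; a content hash would catch that but would require reading the file, which is the cost the fingerprint is meant to avoid.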
With very large session counts, worst-case work can still grow (notably full scans when many sessions are new or churning)—see ROADMAP.md for known bottlenecks and planned work.
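For the head/tail window mentioned in the list above, the core decision is which byte ranges of a large transcript to read. The sketch below computes those ranges; `headTailRanges` and its byte-based parameters are assumptions made for this example, and the real window sizes live in server constants that are not reproduced here.

```typescript
// Hypothetical sketch: choose byte ranges for a head/tail read of a large file.
interface ByteRange {
  start: number; // inclusive
  end: number;   // exclusive
}

function headTailRanges(fileSize: number, headBytes: number, tailBytes: number): ByteRange[] {
  if (fileSize <= 0) return [];
  // Small files: the windows would touch or overlap, so read everything once.
  if (headBytes + tailBytes >= fileSize) {
    return [{ start: 0, end: fileSize }];
  }
  // Large files: read only the first and last windows and skip the middle.
  return [
    { start: 0, end: headBytes },
    { start: fileSize - tailBytes, end: fileSize },
  ];
}
```

A reader built on these ranges would still need to drop the partial JSONL line at the end of the head window and at the start of the tail window before parsing.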
Only /api/setup/* (first-time config, test connection, saved settings) is gated by OPENCLAW_ACCESS_MODE. Other read-style APIs are not uniformly Bearer-protected; do not expose TraceFlow to the public internet without network controls or a reverse proxy with auth.
| Mode | Behavior |
|---|---|
| `local-only` | Only local IPs may change settings |
| `token` | Changes require `Authorization: Bearer <OPENCLAW_RUNTIME_ACCESS_TOKEN>` |
| `none` | No check (trusted networks only) |
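The three modes in the table can be read as a single allow/deny decision for `/api/setup/*` requests. The sketch below is a simplified illustration: `isSetupRequestAllowed` and `SetupRequest` are hypothetical, and treating "local IPs" as loopback addresses only is an assumption, since the exact address check is not documented here.

```typescript
// Hypothetical sketch of the access-mode decision for /api/setup/* requests.
type SetupAccessMode = "local-only" | "token" | "none";

interface SetupRequest {
  remoteIp: string;
  authorization?: string; // raw Authorization header, if present
}

// Assumption: "local" means loopback (IPv4, IPv6, and IPv4-mapped IPv6).
const LOOPBACK_IPS = ["127.0.0.1", "::1", "::ffff:127.0.0.1"];

function isSetupRequestAllowed(
  mode: SetupAccessMode,
  req: SetupRequest,
  runtimeToken?: string,
): boolean {
  if (mode === "none") {
    return true; // no check; trusted networks only
  }
  if (mode === "local-only") {
    return LOOPBACK_IPS.includes(req.remoteIp);
  }
  // mode === "token": fail closed when no token is configured.
  if (runtimeToken === undefined || runtimeToken === "") {
    return false;
  }
  return req.authorization === `Bearer ${runtimeToken}`;
}
```

Production code would normally compare tokens in constant time (for example with Node's `crypto.timingSafeEqual`) rather than with `===`.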
Useful for scripts and monitoring. Full list lives in src/**/*controller.ts.
| Path | Method | Description |
|---|---|---|
| `/api/health` | GET | Health + Gateway connection summary |
| `/api/status` | GET | Gateway status / usage JSON |
| `/api/dashboard/overview` | GET | Aggregated dashboard payload; optional `?timeRangeMs=` |
| `/api/sessions` | GET | Session list |
| `/api/sessions/:id` | GET | Session detail |
| `/api/sessions/:id/kill` | POST | Kill session |
| `/api/sessions/:id/evaluations*` | GET/POST/DELETE | Session evaluations (latest, history, detail, create) |
| `/api/metrics/*` | GET | Latency, tools/skills, token summaries |
| `/api/prompts/:promptId/evaluations*` | GET/POST/DELETE | Prompt evaluations (latest, history, detail, create) |
| `/api/evaluation-prompt` | GET/PUT/DELETE | Session evaluation template |
| `/api/workspace-bootstrap-evaluation-prompt` | GET/PUT/DELETE | Workspace bootstrap evaluation template |
| `/api/workspace/*` | GET/PUT | Workspace file read/write APIs |
| `/api/logs` | GET | Recent log lines |
| `/api/setup/*` | GET/POST | Setup (protected by access mode) |
| `/api/im/channels` | GET | List enabled IM channels |
| `/api/im/channels/health` | GET | IM channel health status |
| `/api/im/channels/:type/test` | POST | Send IM test message |
| `/api/audit/snapshot` | GET | Contribution audit snapshot |
Socket.IO namespace logs: logs:subscribe · logs:unsubscribe · server push logs:new (timestamp, level, content).
| Issue | What to check |
|---|---|
| Gateway unreachable | OPENCLAW_GATEWAY_URL, firewall; set token/password in Settings |
| Test connection fails with `missing scope: operator.read` | TraceFlow uses a device-less backend WebSocket; Gateway clears scopes, so older builds that called `skills.status` after connect failed. Current code uses the `connect` snapshot for path checks and the `health` RPC for overview (scope-exempt). See repo root `AGENTS.md`. |
| Empty logs | Gateway logs.tail may be unavailable without operator scope (falls back to empty); or set OPENCLAW_LOG_PATH |
| Token metrics show zero; archived bucket empty on Dashboard | Confirm sessions produce usage; check /api/metrics/token-summary and /api/sessions/token-usage. Archived often stays zero (no /new, reset files without usage, etc.); see docs/token-metrics-dual-track-example.md for field traceability and sample JSON |
| IM push not working | Check im.enabled and channel enabled flags in config; verify Feishu credentials; check data/traceflow.log for Feishu API error; send a test message via POST /api/im/channels/feishu/test |
| IM push message storms / flooding | Debounce is enabled by default (v1.1.1+). If you still see flooding, check rateLimit in config (default 10 msg/s). See docs/IM_PUSH.md |
| Sessions not detected after restart | Session watch starts from the current file position on restart; historical sessions are not backfilled. New messages after restart will be detected. If a session is not in sessions.json, it may still be watched if its jsonl file exists |
See ROADMAP.md in this repository.
- v1.1.x — IM push to Feishu with thread aggregation, debounce, in-memory queue, circuit breaker, rate limiting
- v1.1.x — Winston logging with daily rotation and auto cleanup (Beijing Time)
- v1.1.x — Path configuration hot-reload; settings saved in UI take effect immediately
- v1.1.x — Session evaluation templates (eval-prompt-v1) + workspace bootstrap evaluation
- v1.1.x — Contribution audit integration with agent-audit companion skill
- v1.1.x — Setup wizard simplified to single-page configuration
- v1.1.x — Performance: fingerprint-based caching for tool/skill aggregation, head/tail window for large transcripts
Issues and PRs are welcome (bugs, features, docs, UI, tests).
MIT © slashhuang