From 1b9780e13a6ba0228f6b0e7260fa572a45ae88de Mon Sep 17 00:00:00 2001
From: vansin <smartflowaiteam@gmail.com>
Date: Sun, 7 Jun 2026 06:39:30 +0800
Subject: [PATCH 1/2] =?UTF-8?q?feat(demos/grok-news-fetcher):=20A=E7=AB=99?=
 =?UTF-8?q?=20ai-insight=20=E5=85=9C=E5=BA=95=E6=96=B0=E9=97=BB=E6=8A=93?=
 =?UTF-8?q?=E5=8F=96=E8=84=9A=E6=9C=AC?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Since 6-03, A站's twitterapi.io calls have returned 402 quota exceeded and
the pipeline has stalled. Vincent decided to route the grok-build CLI 0.2.x
headless mode in as the fallback source.

Deliverables:
- fetch_news_via_grok.js: calls grok -p ... --output-format json serially,
  extracts the JSON array the LLM returns in envelope.text, and normalizes
  it to the exact 12-field schema of A站's auto_update_news.js --fetch-only
  (account/name/user/tweetId/text/date/likes/views/url/source/
  category/imageUrl). Zero npm dependencies, Node.js stdlib only.
- README.md: usage, cron integration, measured output, gotchas, red lines

Measured (2026-06-07):
- 3 accounts (@sama/@OpenAI/@karpathy), run serially, 3m26s
- 6/6 URLs open as real https://x.com/<user>/status/<id> links
- all 12 schema fields aligned, exit 0
- failure path (nonexistent account): empty tweets + exit 1, with a stderr
  hint that downstream should skip the write
- stderr noise (PermissionDenied/serde error) is discarded and does not
  affect parsing

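Red lines:
- tested locally / in Docker only; the A站 production host 47.77.216.1 is
  not touched
- never insert fake data: every failure yields an empty array + exit≠0
- url must be a real link (regex-checked), otherwise that tweet is
  silently dropped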

Author-Agent: 通信demo马
Dispatched-By: 通信龙 (commhub task 58afee64-899b-468b-90e6-3b8fb379258d)
Helpers: 通信SDK马 (grok-x-search demo reference implementation)
---
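The commit message describes pulling the JSON array out of envelope.text and silently dropping any tweet whose url fails the regex check. Below is a minimal sketch of those two steps, simplified from parseEnvelope() and normalize() in this patch. The helper names extractArray and keepRealLinks are illustrative only, and unlike the script this sketch returns [] on a parse error instead of recording a failure reason.

```javascript
// Sketch only: simplified from parseEnvelope()/normalize(); names are illustrative.
const STATUS_URL_RE = /^https:\/\/x\.com\/([^/]+)\/status\/(\d+)/;

// Take the span from the first "[" to the last "]" in the LLM text and parse it.
function extractArray(envelopeText) {
  const first = envelopeText.indexOf("[");
  const last = envelopeText.lastIndexOf("]");
  if (first === -1 || last < first) return [];
  try {
    const parsed = JSON.parse(envelopeText.slice(first, last + 1));
    return Array.isArray(parsed) ? parsed : [];
  } catch {
    return [];
  }
}

// Keep only items whose url is a status link belonging to the requested handle.
function keepRealLinks(items, handle) {
  const want = handle.replace(/^@/, "").toLowerCase();
  return items.filter((t) => {
    const m = typeof t?.url === "string" ? t.url.trim().match(STATUS_URL_RE) : null;
    return m !== null && m[1].toLowerCase() === want;
  });
}
```

The handle comparison is case-insensitive, as in the script, so a link under `x.com/Sama/...` still counts for `@sama`, while a link on any other host or account is discarded.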
 demos/grok-news-fetcher/README.md             | 199 +++++++++
 .../grok-news-fetcher/fetch_news_via_grok.js  | 385 ++++++++++++++++++
 2 files changed, 584 insertions(+)
 create mode 100644 demos/grok-news-fetcher/README.md
 create mode 100755 demos/grok-news-fetcher/fetch_news_via_grok.js
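
The failure path (empty tweets plus exit 1) only helps if the downstream step actually gates on it. Here is a hedged sketch of such a gate. shouldIngest is a hypothetical helper, not something this patch or ai-insight ships, and it assumes the caller holds the fetcher's exit code and the text of the output file.

```javascript
// Hypothetical downstream gate; not part of this patch.
// Returns true only when the fetcher exited 0 and the file holds at least one tweet.
function shouldIngest(exitCode, jsonText) {
  if (exitCode !== 0) return false;
  let payload;
  try {
    payload = JSON.parse(jsonText);
  } catch {
    return false; // unreadable output is treated like an empty run
  }
  return Array.isArray(payload?.tweets) && payload.tweets.length > 0;
}
```

Checking both signals is deliberate: the script writes a schema-valid file even on full failure, so the presence of the file alone says nothing about whether there is anything to translate.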
diff --git a/demos/grok-news-fetcher/README.md b/demos/grok-news-fetcher/README.md
new file mode 100644
index 00000000..3dd62ea4
--- /dev/null
+++ b/demos/grok-news-fetcher/README.md
@@ -0,0 +1,199 @@
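+# demos/grok-news-fetcher: fallback news fetch script for A站 ai-insight
+
+> **Pitch**: use the headless mode of grok-build CLI 0.2.x (`grok -p ... --output-format json`) as a *fallback source* for X (Twitter) news, covering A站 [ai-insight](https://ai-insight.org) once twitterapi.io returns 402 for exhausted quota.
+>
+> The output schema matches ai-insight's existing `auto_update_news.js --fetch-only` **field for field**, so it plugs straight into the downstream translation / database-write pipeline.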
+
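+## Background
+
+- A站 ai-insight uses `twitterapi.io` to fetch the latest tweets from 10 anchor accounts, then translates, stores, and publishes them.
+- Since 2026-06-03 twitterapi.io has returned **HTTP 402 quota exceeded**; the pipeline stalled and the daily fetch has produced 0 items since then.
+- Vincent decided to use grok-build (native X search already verified, see [`demos/grok-x-search/`](../grok-x-search/)) as a **fallback** rather than the primary source: it runs at low frequency and tolerates slight jitter in LLM output.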
+
+## Deliverables
+
+| File | Purpose |
+|------|---------|
+| [`fetch_news_via_grok.js`](./fetch_news_via_grok.js) | Main script (Node.js, no npm dependencies) |
+| `README.md` | This document |
+
+Zero npm dependencies, Node.js stdlib only (`child_process`, `fs`, `path`), so it can be dropped onto any machine or cron host.
+
+## Prerequisites
+
+- Node.js ≥ 18 (uses built-in modules such as `node:child_process`)
+- `grok` 0.2.29+ on `$PATH`, already logged in via `grok login` (`~/.grok/auth.json` exists)
+- Network access to the `grok-build` LLM endpoint
+
+```bash
+grok --version       # should print grok 0.2.x (alpha builds are fine)
+grok login           # one-time OAuth
+```
+
+## Quick start
+
+```bash
+# defaults: 10 anchor accounts, last 24h, output to ./news.json
+./fetch_news_via_grok.js
+
+# explicit accounts / time window / output path
+./fetch_news_via_grok.js \
+  --accounts @OpenAI,@AnthropicAI,@sama \
+  --since 24h \
+  --max-per 5 \
+  --out /var/lib/ai-insight/fallback.json
+
+# absolute date window, written to stdout
+./fetch_news_via_grok.js --since 2026-06-06 --out -
+```
+
+## CLI options
+
+| Flag | Default | Description |
+|------|---------|-------------|
+| `--accounts <list>` | 10 anchor accounts (see below) | Comma-separated list of `@handle`s |
+| `--since <window>` | `24h` | `Nh` for relative hours (e.g. `24h`, `72h`) or `YYYY-MM-DD` for an absolute date |
+| `--start-id <n>` | `1` | Starting value for the output `nextId` field |
+| `--out <path>` | `./news.json` | Output file path; `-` means stdout |
+| `--max-per <n>` | `5` | Maximum tweets per account |
+| `--timeout <secs>` | `180` | Timeout for each grok subprocess, in seconds; values below 30 fall back to the default |
+| `-v, --verbose` | off | Log per-account progress to stderr |
+| `-h, --help` | n/a | Show usage |
+
+Default 10 anchor accounts: `@OpenAI`, `@AnthropicAI`, `@GoogleDeepMind`, `@xai`, `@sama`, `@karpathy`, `@Alibaba_Qwen`, `@deepseek_ai`, `@Kimi_Moonshot`, `@nvidia`.
+
+## Output schema (aligned with `auto_update_news.js --fetch-only`)
+
+```json
+{
+  "nextId": 7,
+  "tweets": [
+    {
+      "account": "@sama",
+      "name": "Sam Altman",
+      "user": "sama",
+      "tweetId": "2062661191969972645",
+      "text": "天啊，互联网的早期真的很特别。",
+      "date": "2026-06-04",
+      "likes": 0,
+      "views": 0,
+      "url": "https://x.com/sama/status/2062661191969972645",
+      "source": "@sama",
+      "category": "行业",
+      "imageUrl": ""
+    }
+  ]
+}
+```
+
+字段规则:
+- `user` = handle 不带 `@`
+- `account` / `source` = `@<user>` (两个字段同值，保留是兼容 A站老 schema)
+- `url` = 必须是真实 `https://x.com/<user>/status/<digits>` 链接，**否则那条 tweet 被脚本静默丢弃**
+- `tweetId` = 从 url 正则抠 (`/status/(\d+)`)
+- `date` = `YYYY-MM-DD`，LLM 返 `YYYY-MM` 自动补 `-01`，全空 fallback 今天
+- `likes` / `views` = `0` (grok native 拿不到 engagement metadata，X 平台限制)
+- `imageUrl` = `""` (grok 也拿不到媒体附件，留空)
+- `category` = `"行业"` 固定值
+
+## 失败语义 (A站硬要求)
+
+| Case | Output | exit |
+|------|--------|------|
+| At least 1 tweet aggregated | normal JSON | `0` |
+| All accounts failed / total is 0 | `{nextId, tweets: []}` is still written | `1` |
+| Bad CLI arguments | error on stderr | `2` |
+| `grok` not on PATH | error on stderr | `3` |
+
+**绝对不塞假数据**。下游翻译看到空 tweets 应当跳过写库，按 ai-insight 现有约定。
+
+## 实测 (2026-06-07, 通信demo马)
+
+> 命令: `./fetch_news_via_grok.js --accounts @sama,@OpenAI,@karpathy --since 72h --max-per 3 -v`
+
+```
+[fetch-grok] accounts=3 since=72h max-per=3
+[fetch-grok] → @sama
+[fetch-grok]   ✓ 3 tweet(s)
+[fetch-grok] → @OpenAI
+[fetch-grok]   ✓ 3 tweet(s)
+[fetch-grok] → @karpathy
+[fetch-grok]   ✓ 0 tweet(s)
+[fetch-grok] wrote 6 tweet(s) to news.json
+
+real    3m26s
+exit    0
+```
+
+- 6/6 URL 实测真链接 (例如 `https://x.com/sama/status/2062661191969972645`, `https://x.com/OpenAI/status/2062927046448431587`)
+- schema 12 字段一字不差
+- 串行 (避免 `auth.json` / rate-limit 冲突)：3 账号 ~3.5min，10 账号大约 12-15min
+
+> 命令: `./fetch_news_via_grok.js --accounts @ThisAccountDefinitelyDoesNotExistFooBar2026 --since 1h --max-per 2`
+
+```
+[fetch-grok] zero tweets aggregated — exit 1 (downstream should skip write)
+exit    1
+news.json: {"nextId": 1, "tweets": []}
+```
+
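+## Cron integration example
+
+```cron
+# /etc/cron.d/ai-insight-grok-fallback
+# Fetch every 6h into the fallback directory. The fetcher always writes the
+# file (even with an empty tweets array), so the translate step is gated on
+# its exit code via "&&" rather than on "test -s".
+# crontab entries cannot be continued with a trailing backslash, so the job
+# must stay on one line. The paths below are examples only.
+#
+0 */6 * * * ai-insight cd /opt/ai-insight && /opt/ai-insight/scripts/fetch_news_via_grok.js --since 6h --out /var/lib/ai-insight/grok-fallback.json >/var/log/ai-insight/grok-fetch.log 2>&1 && /opt/ai-insight/scripts/translate_and_upsert.js --in /var/lib/ai-insight/grok-fallback.json
+```
+
+Downstream should check `jq '.tweets | length'` or the exit code before running translation and database writes.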
+
+## Known limits / gotchas
+
+| 现象 | 原因 | 应对 |
+|------|------|------|
+| stderr 里有 `failed to watch root recursively` / `Error reading from stream: serde error` | grok-build CLI 给 `/tmp` 文件监视和流解析的告警噪音，跟实际任务无关 | 脚本只 parse stdout，stderr 全丢，正确行为 |
+| 某账号 `0 tweet(s)` 但实际有发 | LLM 用 `web_search` 索引可能滞后几小时；或 X 反爬丢 `web_fetch` | 增大 `--since` 时间窗；或重跑 |
+| 某账号 grok 异常退出 (timeout / non-zero exit) | LLM 卡顿 / endpoint 5xx | 脚本静默跳过该账号，继续下一个，最后 stderr 列出失败列表 |
+| `likes` / `views` 永远 `0` | X 不向匿名 / 非 logged-in 客户端开放 engagement metadata，grok web_search 也拿不到，正确诚实地不编造 | A 站下游若需要排序请用其他信号 (如发布时间) |
+| `imageUrl` 永远 `""` | 同上，grok web_search 不会自动拉媒体附件 URL | 需要图片请用 A 站现有的 twitterapi.io 路径，grok 兜底不覆盖 |
+| LLM 返回的 `date` 字段不是 `YYYY-MM-DD` | LLM 输出非确定性 | 脚本对 `YYYY-MM` 补 `-01`，全无效 fallback 今天，永远是合法日期串 |
+| 跑得慢 (10 账号 ~15min) | grok 单轮 LLM 调用 30-90s × 串行 10 次 | cron 调度别小于 30min；并发可加但要先解决 `auth.json` 锁 (本脚本未做) |
+
+## 与 A站 现有 `auto_update_news.js` 的关系
+
+```
+        ┌─ primary: twitterapi.io (--fetch-only)
+        │
+auto_update_news.js  ──→  downstream translation / DB write
+        │
+        └─ fallback: fetch_news_via_grok.js (this script)
+                 writes the same schema; A站 treats the file as a substitute for the --fetch-only result
+```
+
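+How to integrate is up to the A站 owner:
+1. In ai-insight's scheduling script, detect a 402 from twitterapi.io and fall back to this script
+2. Or run this script from an independent cron every 6-12h, with downstream deduplicating on `tweetId` (`nextId` only carries the next sequence number)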
+
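+## Red lines (self-imposed development constraints)
+
+- ❌ **Do not touch the A站 production server `47.77.216.1`**: installing grok and deploying there is A站's own job; this repository only delivers the script and its usage
+- ❌ **No A站 credentials / endpoints / paths are built in**
+- ✅ Tested locally / in Docker, using the anet team's grok login from this repository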
+
+## Related
+
+- Why a fallback: A站 ai-insight hit twitterapi.io 402 quota exceeded
+- Probe of grok's native X search capability: [`demos/grok-x-search/README.md`](../grok-x-search/README.md)
+- E2E evidence: [`docs/tests/p-grok-native-xsearch-e2e/report.md`](../../docs/tests/p-grok-native-xsearch-e2e/report.md)
+- 通信demo马 dispatch chain: commhub task `58afee64-899b-468b-90e6-3b8fb379258d`
+
+## Maintenance
+
+- Author: 通信demo马 (claude-code-cli runtime)
+- Review: 通信龙
+- Downstream contact: A站 owner (designated by Vincent)
diff --git a/demos/grok-news-fetcher/fetch_news_via_grok.js b/demos/grok-news-fetcher/fetch_news_via_grok.js
new file mode 100755
index 00000000..c7b7e1ed
--- /dev/null
+++ b/demos/grok-news-fetcher/fetch_news_via_grok.js
@@ -0,0 +1,385 @@
+#!/usr/bin/env node
+/**
+ * fetch_news_via_grok.js
+ *
+ * Headless grok-build X (Twitter) news fetcher.
+ *
+ * Drop-in fallback for A站 ai-insight's news pipeline when twitterapi.io quota
+ * is exhausted. Output schema is byte-identical to
+ * `auto_update_news.js --fetch-only`.
+ *
+ * Usage:
+ *   ./fetch_news_via_grok.js \
+ *     --accounts @OpenAI,@AnthropicAI,@sama \
+ *     --since 24h \
+ *     --start-id 1 \
+ *     --out news.json
+ *
+ *   ./fetch_news_via_grok.js --since 2026-06-06 --out -    # stdout
+ *
+ * Defaults:
+ *   --accounts   the 10 anchor accounts (see DEFAULT_ACCOUNTS below)
+ *   --since      24h
+ *   --start-id   1
+ *   --out        ./news.json     (use "-" for stdout)
+ *   --max-per    5               (max tweets per account)
+ *   --timeout    180             (per-account grok timeout, seconds)
+ *
+ * Exit codes:
+ *   0  success — at least 1 tweet aggregated
+ *   1  all accounts failed / total tweet count = 0 (output is empty tweets array)
+ *   2  bad CLI arguments
+ *   3  grok binary missing from PATH (a missing login only triggers a warning)
+ *
+ * Strictly:
+ *   - parses STDOUT only (stderr has PermissionDenied + serde noise, ignored)
+ *   - never fabricates data — failed accounts produce zero entries silently
+ *   - writes the canonical schema even on full failure (empty tweets array)
+ */
+
+const { spawnSync, execSync } = require("node:child_process");
+const fs = require("node:fs");
+const path = require("node:path");
+
+const DEFAULT_ACCOUNTS = [
+  "@OpenAI",
+  "@AnthropicAI",
+  "@GoogleDeepMind",
+  "@xai",
+  "@sama",
+  "@karpathy",
+  "@Alibaba_Qwen",
+  "@deepseek_ai",
+  "@Kimi_Moonshot",
+  "@nvidia",
+];
+
+const DEFAULTS = {
+  since: "24h",
+  startId: 1,
+  out: "./news.json",
+  maxPer: 5,
+  timeout: 180,
+};
+
+function parseArgs(argv) {
+  const out = {
+    accounts: null,
+    since: DEFAULTS.since,
+    startId: DEFAULTS.startId,
+    out: DEFAULTS.out,
+    maxPer: DEFAULTS.maxPer,
+    timeout: DEFAULTS.timeout,
+    verbose: false,
+    help: false,
+  };
+  for (let i = 2; i < argv.length; i++) {
+    const a = argv[i];
+    const next = () => (i + 1 < argv.length ? argv[++i] : (process.stderr.write(`missing value for ${a}\n`), process.exit(2)));
+    switch (a) {
+      case "--accounts":
+        out.accounts = next()
+          .split(",")
+          .map((s) => s.trim())
+          .filter(Boolean);
+        break;
+      case "--since":
+        out.since = next();
+        break;
+      case "--start-id":
+        out.startId = Number.parseInt(next(), 10);
+        break;
+      case "--out":
+        out.out = next();
+        break;
+      case "--max-per":
+        out.maxPer = Number.parseInt(next(), 10);
+        break;
+      case "--timeout":
+        out.timeout = Number.parseInt(next(), 10);
+        break;
+      case "-v":
+      case "--verbose":
+        out.verbose = true;
+        break;
+      case "-h":
+      case "--help":
+        out.help = true;
+        break;
+      default:
+        process.stderr.write(`unknown arg: ${a}\n`);
+        process.exit(2);
+    }
+  }
+  if (!out.accounts) out.accounts = DEFAULT_ACCOUNTS;
+  if (!Number.isFinite(out.startId) || out.startId < 0) out.startId = 1;
+  if (!Number.isFinite(out.maxPer) || out.maxPer < 1) out.maxPer = DEFAULTS.maxPer;
+  if (!Number.isFinite(out.timeout) || out.timeout < 30) out.timeout = DEFAULTS.timeout;
+  return out;
+}
+
+function printHelp() {
+  process.stdout.write(`fetch_news_via_grok.js — headless X fetcher (grok-build)
+
+Usage:
+  fetch_news_via_grok.js [options]
+
+Options:
+  --accounts <list>   Comma-separated @handles  (default: 10 anchor accounts)
+  --since <window>    Time window. Either "<N>h" (relative hours) or
+                      "YYYY-MM-DD" (absolute date). Default: 24h
+  --start-id <n>      Sets the "nextId" field in the output. Default: 1
+  --out <path>        Output file path. Use "-" for stdout. Default: ./news.json
+  --max-per <n>       Max tweets per account. Default: 5
+  --timeout <secs>    Per-account grok call timeout. Default: 180
+  -v, --verbose       Log per-account progress to stderr
+  -h, --help          Show this help
+
+Exit codes:
+  0  >=1 tweet aggregated
+  1  zero tweets after all accounts
+  2  bad CLI args
+  3  grok binary missing from PATH
+`);
+}
+
+function preflight() {
+  try {
+    const v = execSync("grok --version", {
+      stdio: ["ignore", "pipe", "ignore"],
+      timeout: 5000,
+    })
+      .toString()
+      .trim();
+    if (!v.match(/^grok 0\.2\./)) {
+      process.stderr.write(`[warn] grok version ${v} is outside the tested 0.2.x band\n`);
+    }
+  } catch {
+    process.stderr.write(`[err] grok binary not on PATH — install grok-build CLI first\n`);
+    process.exit(3);
+  }
+  // Best-effort: warn if no auth.json — grok will still err with a clear message.
+  const home = process.env.HOME || process.env.USERPROFILE || "";
+  if (home && !fs.existsSync(path.join(home, ".grok", "auth.json"))) {
+    process.stderr.write(`[warn] ~/.grok/auth.json missing — run "grok login" before this script\n`);
+  }
+}
+
+function buildPrompt(handle, since, maxPer) {
+  const cleanHandle = handle.replace(/^@/, "");
+  let windowDesc;
+  if (/^\d+h$/i.test(since)) {
+    const hours = Number.parseInt(since, 10);
+    windowDesc = `最近 ${hours} 小时`;
+  } else if (/^\d{4}-\d{2}-\d{2}$/.test(since)) {
+    windowDesc = `自 ${since} 起 (since:${since})`;
+  } else {
+    windowDesc = since;
+  }
+  return [
+    `找一下 X (Twitter) 上 @${cleanHandle} ${windowDesc}内的帖子, 最多 ${maxPer} 条。`,
+    "",
+    "**步骤**:",
+    `1. 先用 web_search (allowed_domains=["x.com"]) 找 @${cleanHandle} 的最新原创帖子 (不含 retweet)`,
+    "2. 必要时用 web_fetch 拿正文",
+    "3. 最后只输出一个**严格 JSON 数组**, 不要 markdown, 不要解释, 不要 code fence",
+    "",
+    "数组每条字段:",
+    `  user      ${cleanHandle}    (handle 不带 @)`,
+    `  name      显示名 (例: Sam Altman)`,
+    `  text      原文或忠实的中文摘要 (~80 字内)`,
+    `  url       完整的 https://x.com/${cleanHandle}/status/<id> URL (必须是真链接)`,
+    `  date      YYYY-MM-DD (拿不到精确日子用大致月份: YYYY-MM)`,
+    "",
+    "硬要求:",
+    `  · url 字段必须以 "https://x.com/${cleanHandle}/status/" 开头,后接纯数字 status id`,
+    "  · 如果搜索不到任何符合时间窗的帖子,返回空数组 []",
+    "  · 绝对禁止编造 URL 或 status id",
+    "  · 绝对禁止返回 markdown / 解释文字 / code fence",
+    "",
+    "示例 (替换为真实数据):",
+    `[{"user":"${cleanHandle}","name":"...","text":"...","url":"https://x.com/${cleanHandle}/status/123","date":"2026-06-01"}]`,
+  ].join("\n");
+}
+
+function callGrok(prompt, timeoutSecs) {
+  const result = spawnSync(
+    "grok",
+    ["-p", prompt, "--output-format", "json", "--always-approve"],
+    {
+      stdio: ["ignore", "pipe", "pipe"],
+      timeout: timeoutSecs * 1000,
+      maxBuffer: 16 * 1024 * 1024,
+      encoding: "utf8",
+    },
+  );
+  if (result.error?.code === "ETIMEDOUT") {
+    return { ok: false, reason: `timeout after ${timeoutSecs}s` };
+  }
+  if (result.error) {
+    return { ok: false, reason: `spawn: ${result.error.message}` };
+  }
+  if (result.status !== 0) {
+    return { ok: false, reason: `grok exit ${result.status ?? `signal ${result.signal}`}` };
+  }
+  return { ok: true, stdout: result.stdout };
+}
+
+function parseEnvelope(stdout) {
+  let envelope;
+  try {
+    envelope = JSON.parse(stdout);
+  } catch (e) {
+    return { ok: false, reason: `envelope JSON.parse: ${e.message}` };
+  }
+  if (envelope == null || typeof envelope !== "object") {
+    return { ok: false, reason: "envelope not an object" };
+  }
+  if (envelope.stopReason && envelope.stopReason !== "EndTurn") {
+    return { ok: false, reason: `non-EndTurn stopReason: ${envelope.stopReason}` };
+  }
+  if (typeof envelope.text !== "string") {
+    return { ok: false, reason: "envelope.text missing or non-string" };
+  }
+  let inner = envelope.text.trim();
+  if (inner.startsWith("```")) {
+    inner = inner.replace(/^```(?:json)?\s*/i, "").replace(/```\s*$/i, "").trim();
+  }
+  const firstBracket = inner.indexOf("[");
+  const lastBracket = inner.lastIndexOf("]");
+  if (firstBracket === -1 || lastBracket === -1 || lastBracket < firstBracket) {
+    return { ok: false, reason: "no JSON array bracket in envelope.text" };
+  }
+  inner = inner.slice(firstBracket, lastBracket + 1);
+  let raw;
+  try {
+    raw = JSON.parse(inner);
+  } catch (e) {
+    return { ok: false, reason: `inner JSON.parse: ${e.message}` };
+  }
+  if (!Array.isArray(raw)) {
+    return { ok: false, reason: "inner payload is not an array" };
+  }
+  return { ok: true, raw };
+}
+
+const URL_RE = /^https:\/\/x\.com\/([^/]+)\/status\/(\d+)/;
+
+function normalize(raw, handle) {
+  const cleanHandle = handle.replace(/^@/, "");
+  const out = [];
+  for (const t of raw) {
+    if (!t || typeof t !== "object") continue;
+    const url = typeof t.url === "string" ? t.url.trim() : "";
+    const m = url.match(URL_RE);
+    if (!m) continue;
+    const user = m[1];
+    const tweetId = m[2];
+    if (user.toLowerCase() !== cleanHandle.toLowerCase()) continue;
+    const text = typeof t.text === "string" ? t.text.trim() : "";
+    if (!text) continue;
+    const dateRaw = typeof t.date === "string" ? t.date.trim() : "";
+    let date = "";
+    if (/^\d{4}-\d{2}-\d{2}$/.test(dateRaw)) date = dateRaw;
+    else if (/^\d{4}-\d{2}$/.test(dateRaw)) date = `${dateRaw}-01`;
+    else date = new Date().toISOString().slice(0, 10);
+    const name = typeof t.name === "string" && t.name.trim() ? t.name.trim() : cleanHandle;
+    out.push({
+      account: `@${user}`,
+      name,
+      user,
+      tweetId,
+      text,
+      date,
+      likes: 0,
+      views: 0,
+      url,
+      source: `@${user}`,
+      category: "行业",
+      imageUrl: "",
+    });
+  }
+  return out;
+}
+
+function dedupe(tweets) {
+  const seen = new Set();
+  const out = [];
+  for (const t of tweets) {
+    const key = t.tweetId || t.url;
+    if (seen.has(key)) continue;
+    seen.add(key);
+    out.push(t);
+  }
+  return out;
+}
+
+async function main() {
+  const opts = parseArgs(process.argv);
+  if (opts.help) {
+    printHelp();
+    process.exit(0);
+  }
+  preflight();
+
+  const log = (msg) => {
+    if (opts.verbose) process.stderr.write(`[fetch-grok] ${msg}\n`);
+  };
+  log(`accounts=${opts.accounts.length} since=${opts.since} max-per=${opts.maxPer}`);
+
+  const failures = [];
+  const allTweets = [];
+
+  for (const handle of opts.accounts) {
+    log(`→ ${handle}`);
+    const prompt = buildPrompt(handle, opts.since, opts.maxPer);
+    const call = callGrok(prompt, opts.timeout);
+    if (!call.ok) {
+      failures.push({ handle, reason: call.reason });
+      log(`  ✗ ${call.reason}`);
+      continue;
+    }
+    const parsed = parseEnvelope(call.stdout);
+    if (!parsed.ok) {
+      failures.push({ handle, reason: parsed.reason });
+      log(`  ✗ ${parsed.reason}`);
+      continue;
+    }
+    const tweets = normalize(parsed.raw, handle);
+    log(`  ✓ ${tweets.length} tweet(s)`);
+    allTweets.push(...tweets);
+  }
+
+  const tweets = dedupe(allTweets);
+  const payload = { nextId: opts.startId + tweets.length, tweets };
+  const json = JSON.stringify(payload, null, 2);
+
+  if (opts.out === "-") {
+    process.stdout.write(json + "\n");
+  } else {
+    fs.writeFileSync(opts.out, json + "\n", "utf8");
+    log(`wrote ${tweets.length} tweet(s) to ${opts.out}`);
+  }
+
+  if (failures.length > 0) {
+    process.stderr.write(
+      `[fetch-grok] ${failures.length}/${opts.accounts.length} accounts failed:\n`,
+    );
+    for (const f of failures) {
+      process.stderr.write(`  - ${f.handle}: ${f.reason}\n`);
+    }
+  }
+
+  if (tweets.length === 0) {
+    process.stderr.write(
+      `[fetch-grok] zero tweets aggregated — exit 1 (downstream should skip write)\n`,
+    );
+    process.exit(1);
+  }
+  process.exit(0);
+}
+
+main().catch((e) => {
+  process.stderr.write(`[fetch-grok] fatal: ${e.stack || e.message}\n`);
+  process.exit(1);
+});

From 224969cade35334494ea7bdeda820f149a381a35 Mon Sep 17 00:00:00 2001
From: vansin <smartflowaiteam@gmail.com>
Date: Sun, 7 Jun 2026 07:18:37 +0800
Subject: [PATCH 2/2] =?UTF-8?q?feat(demos/grok-news-fetcher):=20v2=20recen?=
 =?UTF-8?q?cy=20+=20media=5Furl=20=E6=A0=A1=E9=AA=8C=E7=BA=A2=E7=BA=BF?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Adds 4 pieces, following 通信龙's HIGH acceptance red lines and the completed spec ack:

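1. **Prompt recency enforcement**: buildPrompt injects today's date and
   cutoffISO, plus 4 hard rules in English ("you MUST return real publish
   DATE; do NOT include older posts; do NOT pad with today's date if
   unsure; drop instead"; the LLM is more sensitive to rules written in
   English than in Chinese) and a one-sentence Chinese fallback.
   The LLM is also asked for media_url (a direct pbs.twimg.com /
   video.twimg.com link).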

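2. **verifyMediaUrl (new)**: HEAD request via curl -I -L -m 8. The host
   must be pbs.twimg.com / video.twimg.com / ton.twitter.com, and only
   HTTP 200 with content-type image/* or video/* is stored; otherwise
   imageUrl is set to "" (the tweet is kept, the image is dropped). This
   guards against hallucinated media links and is the key line of defence
   for A站's full-replacement vs fallback pilot.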

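3. **Double-checked date window in normalize**: a tweet is kept only if
   cutoffISO <= date <= todayISO. **A missing or non-YYYY-MM-DD date drops
   the whole tweet** (no more fallback to today, an anti-pattern 通信龙
   caught while reading v1: it disguises old posts as new ones).
   Drop counters: badUrl/wrongUser/emptyText/badDate/outOfWindow/
   mediaVerifyFail.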

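4. **Freshness self-check log**: summarize() computes freshest/oldest/
   ageDays, the per-date distribution byDate, and the media-verified
   ratio; verbose mode prints the curl HTTP code and content-type per
   tweet; if the freshest tweet is more than 2 days old, a WARN goes to
   stderr (this surfaces grok web-index lag early when A站 needs
   last-24h data).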

v2 measured (2026-06-07, 3 accounts, since=72h, max-per=3):
- 6 tweets, all inside the window [2026-06-03, 2026-06-06]
- freshest=2026-06-05 (1d ago), meeting A站's last-24h target
- date distribution: 2026-06-04 × 5, 2026-06-05 × 1
- media_url: the LLM returned 1 video.twimg.com/amplify_video/.../mp4
  link; curl HEAD confirmed content-type=video/mp4, so it was stored.
  For the other 5 tweets the LLM returned no media_url (as expected, it
  does not fabricate)
- exit 0, serial run ~3m30s
- failure path (nonexistent account): tweets=[] + exit 1

README updated in step: tightened v2 field rules, new measurements, and
6 new gotchas (freshness WARN / imageUrl often empty / drop on failed curl
check / drop on missing date / outOfWindow drop / slow).

Author-Agent: 通信demo马
Dispatched-By: 通信龙 (commhub task 4c5085d3 HIGH + 48bc3d35/62ea048b
 spec completion + 8d6204b6 v1 review ping)
---
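Item 3 of the commit message describes a local date-window check: only `cutoffISO <= date <= todayISO` survives, and a missing or malformed date drops the tweet. The sketch below shows that rule in isolation. The function names are illustrative and the real normalize() in this patch may be structured differently; only the drop-counter names badDate and outOfWindow come from the commit message.

```javascript
// Sketch of the window rule from item 3; helper names are illustrative.
const DATE_RE = /^\d{4}-\d{2}-\d{2}$/;

// Turn "N hours ago" into a UTC YYYY-MM-DD cutoff, like computeCutoff() does for "Nh".
function cutoffFromHours(hours, now = new Date()) {
  return new Date(now.getTime() - hours * 3600 * 1000).toISOString().slice(0, 10);
}

// Classify one LLM-supplied date as "keep", "badDate", or "outOfWindow".
function classifyDate(date, cutoffISO, todayISO) {
  if (typeof date !== "string" || !DATE_RE.test(date.trim())) return "badDate";
  const d = date.trim();
  // YYYY-MM-DD strings compare lexicographically in calendar order.
  if (d < cutoffISO || d > todayISO) return "outOfWindow";
  return "keep";
}
```

Because the cutoff is truncated to a UTC date, a 72h window ending late on 2026-06-06 UTC starts at 2026-06-03, which matches the `window=[2026-06-03, 2026-06-06]` shown in the README run.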
 demos/grok-news-fetcher/README.md             |  49 +++--
 .../grok-news-fetcher/fetch_news_via_grok.js  | 196 ++++++++++++++----
 2 files changed, 189 insertions(+), 56 deletions(-)
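
Item 2 of the commit message gates imageUrl on two things: an allowlisted media host, and an HTTP 200 response with an image/* or video/* content type. The sketch below covers only the pure decision logic, with no network call. It parses the URL with the WHATWG `URL` class rather than the regex used by verifyMediaUrl in this patch, and the helper names are hypothetical.

```javascript
// Sketch of the media gate from item 2; no network access, helper names are hypothetical.
const MEDIA_HOSTS = new Set(["pbs.twimg.com", "video.twimg.com", "ton.twitter.com"]);

// True only for https URLs whose exact hostname is on the allowlist.
function isAllowedMediaHost(mediaUrl) {
  try {
    const u = new URL(mediaUrl);
    return u.protocol === "https:" && MEDIA_HOSTS.has(u.hostname);
  } catch {
    return false; // not a parseable absolute URL
  }
}

// Decide from a HEAD response whether the link may be stored as imageUrl.
function acceptMedia(httpCode, contentType) {
  return httpCode === 200 && /^(image|video)\//.test(contentType || "");
}
```

Matching the exact hostname means a look-alike such as `pbs.twimg.com.evil.example` is rejected before any request is made, which is the same outcome the patch's regex reaches by requiring a `/` right after the host.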

diff --git a/demos/grok-news-fetcher/README.md b/demos/grok-news-fetcher/README.md
index 3dd62ea4..54f2e7f5 100644
--- a/demos/grok-news-fetcher/README.md
+++ b/demos/grok-news-fetcher/README.md
@@ -86,14 +86,19 @@ grok login           # 一次性 OAuth
 }
 ```
 
-字段规则:
+字段规则 (v2 已收紧):
 - `user` = handle 不带 `@`
 - `account` / `source` = `@<user>` (两个字段同值，保留是兼容 A站老 schema)
 - `url` = 必须是真实 `https://x.com/<user>/status/<digits>` 链接，**否则那条 tweet 被脚本静默丢弃**
 - `tweetId` = 从 url 正则抠 (`/status/(\d+)`)
-- `date` = `YYYY-MM-DD`，LLM 返 `YYYY-MM` 自动补 `-01`，全空 fallback 今天
-- `likes` / `views` = `0` (grok native 拿不到 engagement metadata，X 平台限制)
-- `imageUrl` = `""` (grok 也拿不到媒体附件，留空)
+- `date` = `YYYY-MM-DD`，**LLM 返不出真实发布日期 (非 `YYYY-MM-DD` 格式 / 日期缺失) → 那条 tweet 整条 drop**。不再 fallback 今天 — 避免老帖被伪装成新帖混进窗口
+- `date` 还做**窗口双保险校验**：`cutoff <= date <= today` 才保留，超窗 silent drop
+- `imageUrl` = LLM 返的 `media_url` 字段，**必须通过 curl HEAD 校验**：
+  - host 必须是 `pbs.twimg.com` / `video.twimg.com` / `ton.twitter.com` 之一
+  - HTTP 200 + content-type `image/*` 或 `video/*`
+  - 校验失败 → `imageUrl` 置 `""`（tweet 本身保留，只是没图）
+  - **防 LLM 幻觉假图链** — A 站全替代 pilot 的关键防线
+- `likes` / `views` = `0` (grok native 拿不到 engagement metadata，X 平台限制，诚实不编造)
 - `category` = `"行业"` 固定值
 
 ## 失败语义 (A站硬要求)
@@ -107,27 +112,36 @@ grok login           # 一次性 OAuth
 
 **绝对不塞假数据**。下游翻译看到空 tweets 应当跳过写库，按 ai-insight 现有约定。
 
-## 实测 (2026-06-07, 通信demo马)
+## 实测 (2026-06-07, 通信demo马, v2)
 
 > 命令: `./fetch_news_via_grok.js --accounts @sama,@OpenAI,@karpathy --since 72h --max-per 3 -v`
 
 ```
-[fetch-grok] accounts=3 since=72h max-per=3
+[fetch-grok] accounts=3 since=72h max-per=3 window=[2026-06-03, 2026-06-06]
 [fetch-grok] → @sama
-[fetch-grok]   ✓ 3 tweet(s)
+[fetch-grok]   ✓ 3 tweet(s) kept
 [fetch-grok] → @OpenAI
-[fetch-grok]   ✓ 3 tweet(s)
+[fetch-grok]   ✓ 3 tweet(s) kept
 [fetch-grok] → @karpathy
-[fetch-grok]   ✓ 0 tweet(s)
+[fetch-grok]   ✓ 0 tweet(s) kept
 [fetch-grok] wrote 6 tweet(s) to news.json
-
-real    3m26s
+[fetch-grok] summary: freshest=2026-06-05 (1d ago) | oldest=2026-06-04 | window=[2026-06-03, 2026-06-06] | media-verified=1/6
+[fetch-grok] date distribution:
+  2026-06-04: 5
+  2026-06-05: 1
+[fetch-grok] media verify: 1/1 passed (HTTP 200 + image/video content-type)
+  [@OpenAI 2062630454537424930] ✓ video/mp4
+    https://video.twimg.com/amplify_video/2062605181427339264/vid/avc1/480x270/0mV_E3qR35fCtFaE.mp4
+
+real    3m30s
 exit    0
 ```
 
-- 6/6 URL 实测真链接 (例如 `https://x.com/sama/status/2062661191969972645`, `https://x.com/OpenAI/status/2062927046448431587`)
+- 6/6 URL 实测真链接（例如 `https://x.com/sama/status/2062661191969972645`, `https://x.com/OpenAI/status/2062927046448431587`）
+- 6/6 date 都在窗口 `[2026-06-03, 2026-06-06]` 内（freshest=2026-06-05，1d ago）
 - schema 12 字段一字不差
-- 串行 (避免 `auth.json` / rate-limit 冲突)：3 账号 ~3.5min，10 账号大约 12-15min
+- LLM 返了 1 条 `video.twimg.com` 真链接，curl HEAD 验证 content-type `video/mp4` → 入库；其余 5 条 LLM 没返 `media_url`（grok 诚实地不编，行为符合预期）
+- 串行（避免 `auth.json` / rate-limit 冲突）：3 账号 ~3.5min，10 账号大约 12-15min
 
 > 命令: `./fetch_news_via_grok.js --accounts @ThisAccountDefinitelyDoesNotExistFooBar2026 --since 1h --max-per 2`
 
@@ -157,11 +171,14 @@ news.json: {"nextId": 1, "tweets": []}
 | 现象 | 原因 | 应对 |
 |------|------|------|
 | stderr 里有 `failed to watch root recursively` / `Error reading from stream: serde error` | grok-build CLI 给 `/tmp` 文件监视和流解析的告警噪音，跟实际任务无关 | 脚本只 parse stdout，stderr 全丢，正确行为 |
-| 某账号 `0 tweet(s)` 但实际有发 | LLM 用 `web_search` 索引可能滞后几小时；或 X 反爬丢 `web_fetch` | 增大 `--since` 时间窗；或重跑 |
+| 某账号 `0 tweet(s)` 但实际有发 | LLM `web_search` 索引可能滞后；或 X 反爬丢 `web_fetch`；或 LLM 拿不到真实日期被双保险丢 | 增大 `--since` 时间窗；或重跑 |
+| `WARN: freshest tweet is N days old — grok web index may be lagging` | grok 索引滞后于实时 X，A 站要 last-24h 时这条提示让你早发现 | 增大 `--since`；或调度更高频；或这个时段 grok 兜底确实拿不到当天新帖，跳过翻译入库 |
 | 某账号 grok 异常退出 (timeout / non-zero exit) | LLM 卡顿 / endpoint 5xx | 脚本静默跳过该账号，继续下一个，最后 stderr 列出失败列表 |
 | `likes` / `views` 永远 `0` | X 不向匿名 / 非 logged-in 客户端开放 engagement metadata，grok web_search 也拿不到，正确诚实地不编造 | A 站下游若需要排序请用其他信号 (如发布时间) |
-| `imageUrl` 永远 `""` | 同上，grok web_search 不会自动拉媒体附件 URL | 需要图片请用 A 站现有的 twitterapi.io 路径，grok 兜底不覆盖 |
-| LLM 返回的 `date` 字段不是 `YYYY-MM-DD` | LLM 输出非确定性 | 脚本对 `YYYY-MM` 补 `-01`，全无效 fallback 今天，永远是合法日期串 |
+| `imageUrl` 经常为 `""` | LLM 诚实地不返 `media_url`（没图 / 拿不准 host），或 `media_url` curl 校验失败被 drop | 这是**好行为**（防幻觉假图链）；如果某账号长期 0 imageUrl，可调 prompt 增强但 LLM 行为非确定，A 站可降权或回退 twitterapi.io |
+| 某条 LLM 返了 `media_url` 但 curl 校验失败 | 假链（LLM 幻觉）/ X CDN 5xx / content-type 不是 image/video | 静默丢图保 tweet（imageUrl 置 ""）；verbose 模式打 `✗ http XXX` 或 `✗ non-media content-type` |
+| 整条 tweet 因 date 缺失被 drop | LLM 没返 `date` 字段或非 `YYYY-MM-DD` 格式 — 我们拒绝把老帖伪装成新的 | 提升 prompt 严格度；或承认那条无法验证日期、跳过 |
+| 整条 tweet 因超窗被 drop | LLM 不守 cutoff 提示，给了老帖；脚本本地双保险拦截 | drops 计数 + verbose 日志会标出 `outOfWindow=N`；调短 `--since` 让 LLM 更准 |
 | 跑得慢 (10 账号 ~15min) | grok 单轮 LLM 调用 30-90s × 串行 10 次 | cron 调度别小于 30min；并发可加但要先解决 `auth.json` 锁 (本脚本未做) |
 
 ## 与 A站 现有 `auto_update_news.js` 的关系
diff --git a/demos/grok-news-fetcher/fetch_news_via_grok.js b/demos/grok-news-fetcher/fetch_news_via_grok.js
index c7b7e1ed..b2099375 100755
--- a/demos/grok-news-fetcher/fetch_news_via_grok.js
+++ b/demos/grok-news-fetcher/fetch_news_via_grok.js
@@ -165,40 +165,55 @@ function preflight() {
   }
 }
 
-function buildPrompt(handle, since, maxPer) {
-  const cleanHandle = handle.replace(/^@/, "");
-  let windowDesc;
+function computeCutoff(since) {
+  const now = new Date();
   if (/^\d+h$/i.test(since)) {
     const hours = Number.parseInt(since, 10);
-    windowDesc = `最近 ${hours} 小时`;
-  } else if (/^\d{4}-\d{2}-\d{2}$/.test(since)) {
-    windowDesc = `自 ${since} 起 (since:${since})`;
-  } else {
-    windowDesc = since;
+    const cutoff = new Date(now.getTime() - hours * 3600 * 1000);
+    return { cutoffISO: cutoff.toISOString().slice(0, 10), hours };
+  }
+  if (/^\d{4}-\d{2}-\d{2}$/.test(since)) {
+    return { cutoffISO: since, hours: null };
   }
+  return { cutoffISO: new Date(now.getTime() - 24 * 3600 * 1000).toISOString().slice(0, 10), hours: 24 };
+}
+
+function buildPrompt(handle, since, maxPer) {
+  const cleanHandle = handle.replace(/^@/, "");
+  const today = new Date().toISOString().slice(0, 10);
+  const { cutoffISO, hours } = computeCutoff(since);
+  const windowDescZh = hours != null ? `最近 ${hours} 小时 (自 ${cutoffISO} 起)` : `自 ${cutoffISO} 起`;
+  const windowDescEn = hours != null ? `the last ${hours} hours (since ${cutoffISO})` : `since ${cutoffISO}`;
   return [
-    `找一下 X (Twitter) 上 @${cleanHandle} ${windowDesc}内的帖子, 最多 ${maxPer} 条。`,
+    `Today is ${today}. Find recent original posts on X (Twitter) from @${cleanHandle} within ${windowDescEn}, at most ${maxPer} posts.`,
+    `(中文: 今天是 ${today}, 找 @${cleanHandle} ${windowDescZh}内的原创帖, 最多 ${maxPer} 条, 不含 retweet)`,
     "",
-    "**步骤**:",
-    `1. 先用 web_search (allowed_domains=["x.com"]) 找 @${cleanHandle} 的最新原创帖子 (不含 retweet)`,
-    "2. 必要时用 web_fetch 拿正文",
-    "3. 最后只输出一个**严格 JSON 数组**, 不要 markdown, 不要解释, 不要 code fence",
+    "**Steps**:",
+    `1. Use web_search (allowed_domains=["x.com"]) to find @${cleanHandle}'s most recent original posts`,
+    "2. Use web_fetch if needed to confirm post body and publish date",
+    "3. Output ONLY a strict JSON array — no markdown, no explanation, no code fence",
     "",
-    "数组每条字段:",
-    `  user      ${cleanHandle}    (handle 不带 @)`,
-    `  name      显示名 (例: Sam Altman)`,
-    `  text      原文或忠实的中文摘要 (~80 字内)`,
-    `  url       完整的 https://x.com/${cleanHandle}/status/<id> URL (必须是真链接)`,
-    `  date      YYYY-MM-DD (拿不到精确日子用大致月份: YYYY-MM)`,
+    "Each array item fields:",
+    `  user        "${cleanHandle}"          (handle without @)`,
+    `  name        display name              (e.g. "Sam Altman")`,
+    `  text        post body or faithful summary in Chinese (~80 chars)`,
+    `  url         https://x.com/${cleanHandle}/status/<id>  (real, browser-openable)`,
+    `  date        YYYY-MM-DD  (real publish date — MUST be accurate to the day)`,
+    `  media_url   https://pbs.twimg.com/... or https://video.twimg.com/...  (real direct media link; "" if no image/video)`,
     "",
-    "硬要求:",
-    `  · url 字段必须以 "https://x.com/${cleanHandle}/status/" 开头,后接纯数字 status id`,
-    "  · 如果搜索不到任何符合时间窗的帖子,返回空数组 []",
-    "  · 绝对禁止编造 URL 或 status id",
-    "  · 绝对禁止返回 markdown / 解释文字 / code fence",
+    "**HARD RECENCY REQUIREMENTS** (post will be dropped otherwise):",
+    `  · date MUST be >= ${cutoffISO} (the cutoff). Do NOT include older posts. Do NOT pad date with today's date if you don't know the real one — drop that post instead.`,
+    "  · If unsure of the real publish date, drop the post — do NOT fabricate a recent date to pass the filter.",
+    `  · If no posts in [${cutoffISO}, ${today}] match, return empty array [].`,
     "",
-    "示例 (替换为真实数据):",
-    `[{"user":"${cleanHandle}","name":"...","text":"...","url":"https://x.com/${cleanHandle}/status/123","date":"2026-06-01"}]`,
+    "**OTHER HARD RULES**:",
+    `  · url MUST start with "https://x.com/${cleanHandle}/status/" followed by digits-only status id`,
+    `  · media_url, if non-empty, MUST be a real direct image/video link on pbs.twimg.com, video.twimg.com, or ton.twitter.com. Do NOT fabricate.`,
+    "  · NEVER fabricate URL / status id / media_url / date.",
+    "  · Output is JSON ONLY. No markdown, no explanation, no code fence.",
+    "",
+    "Example (replace with real data):",
+    `[{"user":"${cleanHandle}","name":"...","text":"...","url":"https://x.com/${cleanHandle}/status/123","date":"${today}","media_url":"https://pbs.twimg.com/media/abc.jpg"}]`,
   ].join("\n");
 }
 
@@ -264,42 +279,87 @@ function parseEnvelope(stdout) {
 }
 
 const URL_RE = /^https:\/\/x\.com\/([^/]+)\/status\/(\d+)/;
+const MEDIA_HOST_RE = /^https:\/\/(pbs\.twimg\.com|video\.twimg\.com|ton\.twitter\.com)\//;
+
+function verifyMediaUrl(mediaUrl, timeoutSecs = 8) {
+  if (!mediaUrl) return { ok: false, reason: "empty" };
+  if (!MEDIA_HOST_RE.test(mediaUrl)) {
+    return { ok: false, reason: `non-twitter media host: ${mediaUrl.slice(0, 60)}` };
+  }
+  const result = spawnSync(
+    "curl",
+    [
+      "-I", "-L", "--max-redirs", "3",
+      "-m", String(timeoutSecs),
+      "-s", "-o", "/dev/null",
+      "-w", "%{http_code}\t%{content_type}",
+      "-A", "Mozilla/5.0 (compatible; ai-insight-fetcher/1.0)",
+      mediaUrl,
+    ],
+    { encoding: "utf8", timeout: (timeoutSecs + 2) * 1000 },
+  );
+  if (result.error || result.status !== 0) {
+    return { ok: false, reason: `curl err: ${result.error?.message || `exit ${result.status}`}` };
+  }
+  const [code, ctype = ""] = (result.stdout || "").split("\t");
+  if (code !== "200") return { ok: false, reason: `http ${code}` };
+  if (!ctype.startsWith("image/") && !ctype.startsWith("video/")) {
+    return { ok: false, reason: `non-media content-type: ${ctype}` };
+  }
+  return { ok: true, contentType: ctype };
+}
 
-function normalize(raw, handle) {
+function normalize(raw, handle, cutoffISO, todayISO, mediaLog) {
   const cleanHandle = handle.replace(/^@/, "");
   const out = [];
+  const drops = { badUrl: 0, wrongUser: 0, emptyText: 0, badDate: 0, outOfWindow: 0, mediaVerifyFail: 0 };
   for (const t of raw) {
     if (!t || typeof t !== "object") continue;
     const url = typeof t.url === "string" ? t.url.trim() : "";
     const m = url.match(URL_RE);
-    if (!m) continue;
+    if (!m) { drops.badUrl++; continue; }
     const user = m[1];
     const tweetId = m[2];
-    if (user.toLowerCase() !== cleanHandle.toLowerCase()) continue;
+    if (user.toLowerCase() !== cleanHandle.toLowerCase()) { drops.wrongUser++; continue; }
     const text = typeof t.text === "string" ? t.text.trim() : "";
-    if (!text) continue;
+    if (!text) { drops.emptyText++; continue; }
     const dateRaw = typeof t.date === "string" ? t.date.trim() : "";
-    let date = "";
-    if (/^\d{4}-\d{2}-\d{2}$/.test(dateRaw)) date = dateRaw;
-    else if (/^\d{4}-\d{2}$/.test(dateRaw)) date = `${dateRaw}-01`;
-    else date = new Date().toISOString().slice(0, 10);
+    if (!/^\d{4}-\d{2}-\d{2}$/.test(dateRaw) || Number.isNaN(Date.parse(`${dateRaw}T00:00:00Z`))) { drops.badDate++; continue; }
+    if (dateRaw < cutoffISO || dateRaw > todayISO) { drops.outOfWindow++; continue; }
     const name = typeof t.name === "string" && t.name.trim() ? t.name.trim() : cleanHandle;
+    let imageUrl = "";
+    const mediaRaw = typeof t.media_url === "string" ? t.media_url.trim() : "";
+    if (mediaRaw) {
+      const v = verifyMediaUrl(mediaRaw);
+      if (mediaLog) {
+        mediaLog.push({
+          tweetId,
+          handle: cleanHandle,
+          url: mediaRaw,
+          ok: v.ok,
+          contentType: v.contentType || "",
+          reason: v.reason || "",
+        });
+      }
+      if (v.ok) imageUrl = mediaRaw;
+      else drops.mediaVerifyFail++; // only the media link is discarded; the tweet itself is still kept
+    }
     out.push({
       account: `@${user}`,
       name,
       user,
       tweetId,
       text,
-      date,
+      date: dateRaw,
       likes: 0,
       views: 0,
       url,
       source: `@${user}`,
       category: "行业",
-      imageUrl: "",
+      imageUrl,
     });
   }
-  return out;
+  return { tweets: out, drops };
 }
 
 function dedupe(tweets) {
@@ -314,6 +374,24 @@ function dedupe(tweets) {
   return out;
 }
 
+function summarize(tweets, cutoffISO, todayISO) {
+  if (tweets.length === 0) return { line: "no tweets", byDate: {}, freshestAgeDays: null };
+  const byDate = {};
+  for (const t of tweets) byDate[t.date] = (byDate[t.date] || 0) + 1;
+  const dates = Object.keys(byDate).sort();
+  const freshest = dates[dates.length - 1];
+  const oldest = dates[0];
+  const withMedia = tweets.filter((t) => t.imageUrl).length;
+  const today = new Date(todayISO + "T00:00:00Z");
+  const f = new Date(freshest + "T00:00:00Z");
+  const ageDays = Math.round((today.getTime() - f.getTime()) / 86400000);
+  return {
+    line: `freshest=${freshest} (${ageDays}d ago) | oldest=${oldest} | window=[${cutoffISO}, ${todayISO}] | media-verified=${withMedia}/${tweets.length}`,
+    byDate,
+    freshestAgeDays: ageDays,
+  };
+}
+
 async function main() {
   const opts = parseArgs(process.argv);
   if (opts.help) {
@@ -322,13 +400,18 @@ async function main() {
   }
   preflight();
 
+  const todayISO = new Date().toISOString().slice(0, 10);
+  const { cutoffISO } = computeCutoff(opts.since);
+
   const log = (msg) => {
     if (opts.verbose) process.stderr.write(`[fetch-grok] ${msg}\n`);
   };
-  log(`accounts=${opts.accounts.length} since=${opts.since} max-per=${opts.maxPer}`);
+  log(`accounts=${opts.accounts.length} since=${opts.since} max-per=${opts.maxPer} window=[${cutoffISO}, ${todayISO}]`);
 
   const failures = [];
   const allTweets = [];
+  const mediaLog = [];
+  const totalDrops = { badUrl: 0, wrongUser: 0, emptyText: 0, badDate: 0, outOfWindow: 0, mediaVerifyFail: 0 };
 
   for (const handle of opts.accounts) {
     log(`→ ${handle}`);
@@ -345,8 +428,10 @@ async function main() {
       log(`  ✗ ${parsed.reason}`);
       continue;
     }
-    const tweets = normalize(parsed.raw, handle);
-    log(`  ✓ ${tweets.length} tweet(s)`);
+    const { tweets, drops } = normalize(parsed.raw, handle, cutoffISO, todayISO, mediaLog);
+    for (const k of Object.keys(totalDrops)) totalDrops[k] += drops[k];
+    const dropTags = Object.entries(drops).filter(([, v]) => v > 0).map(([k, v]) => `${k}=${v}`).join(",");
+    log(`  ✓ ${tweets.length} tweet(s) kept${dropTags ? ` [drops: ${dropTags}]` : ""}`);
     allTweets.push(...tweets);
   }
 
@@ -361,6 +446,31 @@ async function main() {
     log(`wrote ${tweets.length} tweet(s) to ${opts.out}`);
   }
 
+  // freshness self-check — surfaces "grok index lag" early
+  const summary = summarize(tweets, cutoffISO, todayISO);
+  process.stderr.write(`[fetch-grok] summary: ${summary.line}\n`);
+  if (Object.keys(summary.byDate).length > 0) {
+    const dateLines = Object.entries(summary.byDate)
+      .sort()
+      .map(([d, n]) => `  ${d}: ${n}`)
+      .join("\n");
+    process.stderr.write(`[fetch-grok] date distribution:\n${dateLines}\n`);
+  }
+  const droppedTags = Object.entries(totalDrops).filter(([, v]) => v > 0).map(([k, v]) => `${k}=${v}`).join(", ");
+  if (droppedTags) {
+    process.stderr.write(`[fetch-grok] total drops: ${droppedTags}\n`);
+  }
+  if (mediaLog.length > 0) {
+    const kept = mediaLog.filter((m) => m.ok).length;
+    process.stderr.write(`[fetch-grok] media verify: ${kept}/${mediaLog.length} passed (HTTP 200 + image/video content-type)\n`);
+    if (opts.verbose) {
+      for (const m of mediaLog) {
+        const tag = m.ok ? `✓ ${m.contentType}` : `✗ ${m.reason}`;
+        process.stderr.write(`  [@${m.handle} ${m.tweetId}] ${tag}\n    ${m.url}\n`);
+      }
+    }
+  }
+
   if (failures.length > 0) {
     process.stderr.write(
       `[fetch-grok] ${failures.length}/${opts.accounts.length} accounts failed:\n`,
@@ -376,6 +486,12 @@ async function main() {
     );
     process.exit(1);
   }
+  // Freshness warning (non-fatal): A站 ideally wants posts from the last 24h; warn when the freshest kept tweet is more than 2 days old, which suggests the grok index is lagging
+  if (summary.freshestAgeDays != null && summary.freshestAgeDays > 2) {
+    process.stderr.write(
+      `[fetch-grok] WARN: freshest tweet is ${summary.freshestAgeDays} days old — grok web index may be lagging behind real-time X\n`,
+    );
+  }
   process.exit(0);
 }