From 1b9780e13a6ba0228f6b0e7260fa572a45ae88de Mon Sep 17 00:00:00 2001
From: vansin
Date: Sun, 7 Jun 2026 06:39:30 +0800
Subject: [PATCH 1/2] feat(demos/grok-news-fetcher): fallback news fetcher
 script for Site A ai-insight
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Site A's twitterapi.io access has returned 402 quota exceeded since 06-03,
cutting off the pipeline. Vincent routed the work to the headless mode of
grok-build CLI 0.2.x as a fallback source.

Deliverables:
- fetch_news_via_grok.js: calls grok -p ... --output-format json serially,
  extracts the JSON array returned by the LLM from envelope.text, and
  normalizes it to the 12-field schema of Site A's
  auto_update_news.js --fetch-only, matching it exactly
  (account/name/user/tweetId/text/date/likes/views/url/source/
  category/imageUrl). Zero npm dependencies, pure Node.js stdlib.
- README.md: usage + cron integration + test-run output + gotchas + red lines

Test run (2026-06-07):
- 3 accounts (@sama/@OpenAI/@karpathy), serial, 3m26s
- 6/6 URLs are real and open in a browser: https://x.com/<handle>/status/<id>
- All 12 schema fields aligned, exit 0
- Failure path (nonexistent account): empty tweets + exit 1; stderr tells
  downstream to skip the write
- stderr noise (PermissionDenied/serde error) is simply discarded and does
  not affect parsing

Red lines:
- Self-test locally / in Docker only; never touch Site A production at
  47.77.216.1
- Never inject fake data: every failure yields an empty array + exit≠0
- url must be a real link (regex-validated); otherwise that tweet is
  silently dropped

Author-Agent: 通信demo马
Dispatched-By: 通信龙 (commhub task 58afee64-899b-468b-90e6-3b8fb379258d)
Helpers: 通信SDK马 (grok-x-search demo reference implementation)
---
 demos/grok-news-fetcher/README.md             | 199 +++++++++
 .../grok-news-fetcher/fetch_news_via_grok.js  | 385 ++++++++++++++++++
 2 files changed, 584 insertions(+)
 create mode 100644 demos/grok-news-fetcher/README.md
 create mode 100755 demos/grok-news-fetcher/fetch_news_via_grok.js

diff --git a/demos/grok-news-fetcher/README.md b/demos/grok-news-fetcher/README.md
new file mode 100644
index 00000000..3dd62ea4
--- /dev/null
+++ b/demos/grok-news-fetcher/README.md
@@ -0,0 +1,199 @@
+# demos/grok-news-fetcher: fallback news fetcher script for Site A ai-insight
+
+> **Pitch**: use the headless mode of grok-build CLI 0.2.x (`grok -p ... 
--output-format json`) as the *fallback source* for X (Twitter) news fetching, taking over for Site A [ai-insight](https://ai-insight.org) once its twitterapi.io quota is exhausted (HTTP 402).
+>
+> The output schema matches ai-insight's existing `auto_update_news.js --fetch-only` **exactly**, so it plugs straight into the downstream translation / DB-write pipeline.
+
+## Background
+
+- Site A ai-insight uses `twitterapi.io` to fetch the latest tweets from 10 anchor accounts, then translates, stores, and publishes them.
+- Since 2026-06-03 twitterapi.io has returned **HTTP 402 quota exceeded**; the pipeline is cut off and the daily fetch has yielded 0 tweets since 06-03.
+- Vincent decided to use grok-build (already shown to support native X search, see [`demos/grok-x-search/`](../grok-x-search/)) as a **fallback** rather than the primary source: it runs infrequently and tolerates slight jitter in LLM output.
+
+## Deliverables
+
+| File | Purpose |
+|------|------|
+| [`fetch_news_via_grok.js`](./fetch_news_via_grok.js) | Main script (Node.js, no npm dependencies) |
+| `README.md` | This document |
+
+Zero npm dependencies, pure Node.js stdlib (`child_process`, `fs`, `path`), so it is easy to drop onto any machine or into cron.
+
+## Prerequisites
+
+- Node.js ≥ 18 (uses built-in modules such as `node:child_process`)
+- `grok` 0.2.29+ on `$PATH`, already logged in via `grok login` (`~/.grok/auth.json` exists)
+- Network access to the `grok-build` LLM endpoint
+
+```bash
+grok --version # should print grok 0.2.x (alpha builds are fine)
+grok login # one-time OAuth
+```
+
+## Quick start
+
+```bash
+# Defaults: 10 anchor accounts + last 24h + output to ./news.json
+./fetch_news_via_grok.js
+
+# Specify accounts / time window / output location
+./fetch_news_via_grok.js \
+  --accounts @OpenAI,@AnthropicAI,@sama \
+  --since 24h \
+  --max-per 5 \
+  --out /var/lib/ai-insight/fallback.json
+
+# Absolute date window + stdout
+./fetch_news_via_grok.js --since 2026-06-06 --out -
+```
+
+## CLI options
+
+| Flag | Default | Description |
+|------|------|------|
+| `--accounts` | 10 anchor accounts (see below) | Comma-separated list of `@handle`s |
+| `--since` | `24h` | `Nh` for relative hours (e.g. `24h`, `72h`) or `YYYY-MM-DD` for an absolute date |
+| `--start-id` | `1` | Base value for the output `nextId` field |
+| `--out` | `./news.json` | Output file path; `-` means stdout |
+| `--max-per` | `5` | Maximum number of tweets per account |
+| `--timeout` | `180` | Timeout for each grok subprocess, in seconds |
+| `-v, --verbose` | off | Log per-account progress to stderr |
+| `-h, --help` | - | Show usage |
+
+Default 10 anchor accounts: `@OpenAI`, `@AnthropicAI`, `@GoogleDeepMind`, `@xai`, `@sama`, `@karpathy`, `@Alibaba_Qwen`, `@deepseek_ai`, `@Kimi_Moonshot`, `@nvidia`.
+
+## Output schema (aligned with `auto_update_news.js --fetch-only`)
+
+```json
+{
+  "nextId": 7,
+  "tweets": [
+    {
+      "account": "@sama",
+      "name": "Sam Altman",
+      "user": "sama",
+      "tweetId": "2062661191969972645",
+      "text": "天啊,互联网的早期真的很特别。",
+      "date": "2026-06-04",
+      "likes": 0,
+      "views": 0,
+      "url": "https://x.com/sama/status/2062661191969972645",
+      "source": "@sama",
+      "category": "行业",
+      "imageUrl": ""
+    }
+  ]
+}
+```
+
+Field rules:
+- `user` = the handle without `@`
+- `account` / `source` = `@<handle>` (both fields hold the same value; kept for compatibility with Site A's legacy schema)
+- `url` = must be a real `https://x.com/<handle>/status/<id>` link; **otherwise the script silently drops that tweet**
+- `tweetId` = extracted from `url` by regex (`/status/(\d+)`)
+- `date` = `YYYY-MM-DD`; a `YYYY-MM` value from the LLM is padded with `-01`, and an empty or invalid value falls back to today
+- `likes` / `views` = `0` (grok's native search cannot get engagement metadata; an X platform limitation)
+- `imageUrl` = `""` (grok cannot get media attachments either, so it stays empty)
+- `category` = `"行业"` (fixed value)
+
+## Failure semantics (hard requirement from Site A)
+
+| Case | Output | exit |
+|------|------|------|
+| At least 1 tweet aggregated | Normal JSON | `0` |
+| All accounts failed / total is 0 | `{nextId, tweets: []}` is still written to the file | `1` |
+| Invalid CLI arguments | Error on stderr | `2` |
+| `grok` not on PATH | Error on stderr | `3` |
+
+**Never inject fake data.** When the downstream translator sees an empty `tweets` array it should skip the DB write, following ai-insight's existing convention.
+
+## Test run (2026-06-07, 通信demo马)
+
+> Command: `./fetch_news_via_grok.js --accounts @sama,@OpenAI,@karpathy --since 72h --max-per 3 -v`
+
+```
+[fetch-grok] accounts=3 since=72h max-per=3
+[fetch-grok] → @sama
+[fetch-grok] ✓ 3 tweet(s)
+[fetch-grok] → @OpenAI
+[fetch-grok] ✓ 3 tweet(s)
+[fetch-grok] → @karpathy
+[fetch-grok] ✓ 0 tweet(s)
+[fetch-grok] wrote 6 tweet(s) to news.json
+
+real 3m26s
+exit 0
+```
+
+- 6/6 URLs verified as real links (for example `https://x.com/sama/status/2062661191969972645`, `https://x.com/OpenAI/status/2062927046448431587`)
+- All 12 schema fields match exactly
+- Serial execution (avoids `auth.json` / rate-limit conflicts): 3 accounts take ~3.5 min, 10 accounts roughly 12-15 min
+
+> Command: `./fetch_news_via_grok.js --accounts @ThisAccountDefinitelyDoesNotExistFooBar2026 --since 1h --max-per 2`
+
+```
+[fetch-grok] zero tweets aggregated — exit 1 (downstream should skip write)
+exit 1
+news.json: {"nextId": 1, "tweets": []}
+```
+
+## Cron integration example
+
+```cron
+# /etc/cron.d/ai-insight-grok-fallback
+# Fetch every 6h and write to the fallback directory
+0 */6 * * * ai-insight cd /opt/ai-insight 
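&& /opt/ai-insight/scripts/fetch_news_via_grok.js \
+  --since 6h --out /var/lib/ai-insight/grok-fallback.json \
+  >/var/log/ai-insight/grok-fetch.log 2>&1 && \
+  test -s /var/lib/ai-insight/grok-fallback.json && \
+  /opt/ai-insight/scripts/translate_and_upsert.js \
+  --in /var/lib/ai-insight/grok-fallback.json
+```
+
+Downstream should first check `jq '.tweets | length'` or the exit code before running translation and the DB write.
+
+## Known limits / gotchas
+
+| Symptom | Cause | Mitigation |
+|------|------|------|
+| stderr contains `failed to watch root recursively` / `Error reading from stream: serde error` | Warning noise from grok-build CLI's `/tmp` file watcher and stream parser, unrelated to the actual task | The script parses stdout only and discards stderr; this is the intended behavior |
+| An account shows `0 tweet(s)` although it did post | The index behind the LLM's `web_search` can lag by several hours; or X's anti-scraping blocks `web_fetch` | Widen the `--since` window, or rerun |
+| grok exits abnormally for an account (timeout / non-zero exit) | LLM stall / endpoint 5xx | The script skips that account, continues with the next, and lists the failures on stderr at the end |
+| `likes` / `views` are always `0` | X does not expose engagement metadata to anonymous / non-logged-in clients, and grok's web_search cannot get it either; the script reports nothing rather than fabricating | If Site A's downstream needs ranking, use another signal (such as publish time) |
+| `imageUrl` is always `""` | Same as above: grok web_search does not fetch media attachment URLs automatically | If images are needed, use Site A's existing twitterapi.io path; the grok fallback does not cover them |
+| The `date` returned by the LLM is not `YYYY-MM-DD` | LLM output is non-deterministic | The script pads `YYYY-MM` with `-01` and falls back to today when the value is invalid, so the result is always a valid date string |
+| Slow runs (10 accounts ≈ 15 min) | Each grok LLM call takes 30-90 s, run serially 10 times | Do not schedule cron more often than every 30 min; concurrency is possible but requires solving `auth.json` locking first (not done in this script) |
+
+## Relationship to Site A's existing `auto_update_news.js`
+
+```
+                        ┌─ primary: twitterapi.io (--fetch-only)
+                        │
+auto_update_news.js ──→ downstream translation / DB write
+                        │
+                        └─ fallback: fetch_news_via_grok.js (this script)
+                           writes the same schema to a file; Site A uses it in place of the --fetch-only result
+```
+
+The integration approach is up to the Site A owner:
+1. Detect the twitterapi.io 402 in ai-insight's scheduler script and fall back to this script
+2. 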
Or run this script from an independent cron job every 6-12h, with downstream deduplicating via `nextId`
+
+## Red lines (self-imposed development constraints)
+
+- ❌ **Never touch Site A's production server `47.77.216.1`**: installing grok and deploying there is Site A's job; this repo only delivers the script and its usage docs
+- ❌ **No Site A credentials / endpoints / paths are built in**
+- ✅ Self-test locally / in Docker, using this repo's anet team grok login
+
+## Related
+
+- Why the fallback exists: Site A ai-insight twitterapi.io 402 quota exceeded
+- Probe of grok's native X search capability: [`demos/grok-x-search/README.md`](../grok-x-search/README.md)
+- E2E evidence: [`docs/tests/p-grok-native-xsearch-e2e/report.md`](../../docs/tests/p-grok-native-xsearch-e2e/report.md)
+- 通信demo马 dispatch chain: commhub task `58afee64-899b-468b-90e6-3b8fb379258d`
+
+## Maintenance
+
+- Lead author: 通信demo马 (claude-code-cli runtime)
+- Review: 通信龙
+- Downstream integration: Site A owner (appointed by Vincent)
diff --git a/demos/grok-news-fetcher/fetch_news_via_grok.js b/demos/grok-news-fetcher/fetch_news_via_grok.js
new file mode 100755
index 00000000..c7b7e1ed
--- /dev/null
+++ b/demos/grok-news-fetcher/fetch_news_via_grok.js
@@ -0,0 +1,385 @@
+#!/usr/bin/env node
+/**
+ * fetch_news_via_grok.js
+ *
+ * Headless grok-build X (Twitter) news fetcher.
+ *
+ * Drop-in fallback for Site A ai-insight's news pipeline when twitterapi.io quota
+ * is exhausted. Output schema is byte-identical to
+ * `auto_update_news.js --fetch-only`. 
+ *
+ * Usage:
+ *   ./fetch_news_via_grok.js \
+ *     --accounts @OpenAI,@AnthropicAI,@sama \
+ *     --since 24h \
+ *     --start-id 1 \
+ *     --out news.json
+ *
+ *   ./fetch_news_via_grok.js --since 2026-06-06 --out - # stdout
+ *
+ * Defaults:
+ *   --accounts  the 10 anchor accounts (see DEFAULT_ACCOUNTS below)
+ *   --since     24h
+ *   --start-id  1
+ *   --out       ./news.json (use "-" for stdout)
+ *   --max-per   5 (max tweets per account)
+ *   --timeout   180 (per-account grok timeout, seconds)
+ *
+ * Exit codes:
+ *   0 success — at least 1 tweet aggregated
+ *   1 all accounts failed / total tweet count = 0 (output is empty tweets array)
+ *   2 bad CLI arguments
+ *   3 grok binary missing (a missing login only triggers a warning)
+ *
+ * Strictly:
+ *   - parses STDOUT only (stderr has PermissionDenied + serde noise, ignored)
+ *   - never fabricates data — failed accounts produce zero entries silently
+ *   - writes the canonical schema even on full failure (empty tweets array)
+ */
+
+const { spawnSync, execSync } = require("node:child_process");
+const fs = require("node:fs");
+const path = require("node:path");
+
+const DEFAULT_ACCOUNTS = [
+  "@OpenAI",
+  "@AnthropicAI",
+  "@GoogleDeepMind",
+  "@xai",
+  "@sama",
+  "@karpathy",
+  "@Alibaba_Qwen",
+  "@deepseek_ai",
+  "@Kimi_Moonshot",
+  "@nvidia",
+];
+
+const DEFAULTS = {
+  since: "24h",
+  startId: 1,
+  out: "./news.json",
+  maxPer: 5,
+  timeout: 180,
+};
+
+function parseArgs(argv) {
+  const out = {
+    accounts: null,
+    since: DEFAULTS.since,
+    startId: DEFAULTS.startId,
+    out: DEFAULTS.out,
+    maxPer: DEFAULTS.maxPer,
+    timeout: DEFAULTS.timeout,
+    verbose: false,
+    help: false,
+  };
+  for (let i = 2; i < argv.length; i++) {
+    const a = argv[i];
+    const next = () => argv[++i];
+    switch (a) {
+      case "--accounts":
+        out.accounts = next()
+          .split(",")
+          .map((s) => s.trim())
+          .filter(Boolean);
+        break;
+      case "--since":
+        out.since = next();
+        break;
+      case "--start-id":
+        out.startId = Number.parseInt(next(), 10);
+        break;
+      case "--out":
+        out.out = next();
+        break;
+      case "--max-per":
+        out.maxPer = Number.parseInt(next(), 10);
+        break;
+      case "--timeout":
+        out.timeout = Number.parseInt(next(), 10);
+        break;
+      case "-v":
+      case "--verbose":
+        out.verbose = true;
+        break;
+      case "-h":
+      case "--help":
+        out.help = true;
+        break;
+      default:
+        process.stderr.write(`unknown arg: ${a}\n`);
+        process.exit(2);
+    }
+  }
+  if (!out.accounts) out.accounts = DEFAULT_ACCOUNTS;
+  if (!Number.isFinite(out.startId) || out.startId < 0) out.startId = 1;
+  if (!Number.isFinite(out.maxPer) || out.maxPer < 1) out.maxPer = DEFAULTS.maxPer;
+  if (!Number.isFinite(out.timeout) || out.timeout < 30) out.timeout = DEFAULTS.timeout;
+  return out;
+}
+
+function printHelp() {
+  process.stdout.write(`fetch_news_via_grok.js — headless X fetcher (grok-build)
+
+Usage:
+  fetch_news_via_grok.js [options]
+
+Options:
+  --accounts     Comma-separated @handles (default: 10 anchor accounts)
+  --since        Time window. Either "Nh" (relative hours) or
+                 "YYYY-MM-DD" (absolute date). Default: 24h
+  --start-id     Sets the "nextId" field in the output. Default: 1
+  --out          Output file path. Use "-" for stdout. Default: ./news.json
+  --max-per      Max tweets per account. Default: 5
+  --timeout      Per-account grok call timeout. Default: 180
+  -v, --verbose  Log per-account progress to stderr
+  -h, --help     Show this help
+
+Exit codes:
+  0 >=1 tweet aggregated
+  1 zero tweets after all accounts
+  2 bad CLI args
+  3 grok binary missing
+`);
+}
+
+function preflight() {
+  try {
+    const v = execSync("grok --version", {
+      stdio: ["ignore", "pipe", "ignore"],
+      timeout: 5000,
+    })
+      .toString()
+      .trim();
+    if (!v.match(/^grok 0\.2\./)) {
+      process.stderr.write(`[warn] grok version ${v} is outside the tested 0.2.x band\n`);
+    }
+  } catch {
+    process.stderr.write(`[err] grok binary not on PATH — install grok-build CLI first\n`);
+    process.exit(3);
+  }
+  // Best-effort: warn if no auth.json — grok will still err with a clear message.
+  const home = process.env.HOME || process.env.USERPROFILE || "";
+  if (home && !fs.existsSync(path.join(home, ".grok", "auth.json"))) {
+    process.stderr.write(`[warn] ~/.grok/auth.json missing — run "grok login" before this script\n`);
+  }
+}
+
+function buildPrompt(handle, since, maxPer) {
+  const cleanHandle = handle.replace(/^@/, "");
+  let windowDesc;
+  if (/^\d+h$/i.test(since)) {
+    const hours = Number.parseInt(since, 10);
+    windowDesc = `最近 ${hours} 小时`;
+  } else if (/^\d{4}-\d{2}-\d{2}$/.test(since)) {
+    windowDesc = `自 ${since} 起 (since:${since})`;
+  } else {
+    windowDesc = since;
+  }
+  return [
+    `找一下 X (Twitter) 上 @${cleanHandle} ${windowDesc}内的帖子, 最多 ${maxPer} 条。`,
+    "",
+    "**步骤**:",
+    `1. 先用 web_search (allowed_domains=["x.com"]) 找 @${cleanHandle} 的最新原创帖子 (不含 retweet)`,
+    "2. 必要时用 web_fetch 拿正文",
+    "3. 最后只输出一个**严格 JSON 数组**, 不要 markdown, 不要解释, 不要 code fence",
+    "",
+    "数组每条字段:",
+    ` user ${cleanHandle} (handle 不带 @)`,
+    ` name 显示名 (例: Sam Altman)`,
+    ` text 原文或忠实的中文摘要 (~80 字内)`,
+    ` url 完整的 https://x.com/${cleanHandle}/status/ URL (必须是真链接)`,
+    ` date YYYY-MM-DD (拿不到精确日子用大致月份: YYYY-MM)`,
+    "",
+    "硬要求:",
+    ` · url 字段必须以 "https://x.com/${cleanHandle}/status/" 开头,后接纯数字 status id`,
+    " · 如果搜索不到任何符合时间窗的帖子,返回空数组 []",
+    " · 绝对禁止编造 URL 或 status id",
+    " · 绝对禁止返回 markdown / 解释文字 / code fence",
+    "",
+    "示例 (替换为真实数据):",
+    `[{"user":"${cleanHandle}","name":"...","text":"...","url":"https://x.com/${cleanHandle}/status/123","date":"2026-06-01"}]`,
+  ].join("\n");
+}
+
+function callGrok(prompt, timeoutSecs) {
+  const result = spawnSync(
+    "grok",
+    ["-p", prompt, "--output-format", "json", "--always-approve"],
+    {
+      stdio: ["ignore", "pipe", "pipe"],
+      timeout: timeoutSecs * 1000,
+      maxBuffer: 16 * 1024 * 1024,
+      encoding: "utf8",
+    },
+  );
+  if (result.error?.code === "ETIMEDOUT" || result.signal === "SIGTERM" || result.signal === "SIGKILL") { // spawnSync reports a timeout through result.error, so check it first
+    return { ok: false, reason: `timeout after ${timeoutSecs}s` };
+  }
+  if (result.error) {
+    return { ok: false, reason: `spawn: ${result.error.message}` };
+  }
+  
if (result.status !== 0) {
+    return { ok: false, reason: `grok exit ${result.status}` };
+  }
+  return { ok: true, stdout: result.stdout };
+}
+
+function parseEnvelope(stdout) {
+  let envelope;
+  try {
+    envelope = JSON.parse(stdout);
+  } catch (e) {
+    return { ok: false, reason: `envelope JSON.parse: ${e.message}` };
+  }
+  if (envelope == null || typeof envelope !== "object") {
+    return { ok: false, reason: "envelope not an object" };
+  }
+  if (envelope.stopReason && envelope.stopReason !== "EndTurn") {
+    return { ok: false, reason: `non-EndTurn stopReason: ${envelope.stopReason}` };
+  }
+  if (typeof envelope.text !== "string") {
+    return { ok: false, reason: "envelope.text missing or non-string" };
+  }
+  let inner = envelope.text.trim();
+  if (inner.startsWith("```")) {
+    inner = inner.replace(/^```(?:json)?\s*/i, "").replace(/```\s*$/i, "").trim();
+  }
+  const firstBracket = inner.indexOf("[");
+  const lastBracket = inner.lastIndexOf("]");
+  if (firstBracket === -1 || lastBracket === -1 || lastBracket < firstBracket) {
+    return { ok: false, reason: "no JSON array bracket in envelope.text" };
+  }
+  inner = inner.slice(firstBracket, lastBracket + 1);
+  let raw;
+  try {
+    raw = JSON.parse(inner);
+  } catch (e) {
+    return { ok: false, reason: `inner JSON.parse: ${e.message}` };
+  }
+  if (!Array.isArray(raw)) {
+    return { ok: false, reason: "inner payload is not an array" };
+  }
+  return { ok: true, raw };
+}
+
+const URL_RE = /^https:\/\/x\.com\/([^/]+)\/status\/(\d+)/;
+
+function normalize(raw, handle) {
+  const cleanHandle = handle.replace(/^@/, "");
+  const out = [];
+  for (const t of raw) {
+    if (!t || typeof t !== "object") continue;
+    const url = typeof t.url === "string" ? t.url.trim() : "";
+    const m = url.match(URL_RE);
+    if (!m) continue;
+    const user = m[1];
+    const tweetId = m[2];
+    if (user.toLowerCase() !== cleanHandle.toLowerCase()) continue;
+    const text = typeof t.text === "string" ? 
t.text.trim() : "";
+    if (!text) continue;
+    const dateRaw = typeof t.date === "string" ? t.date.trim() : "";
+    let date = "";
+    if (/^\d{4}-\d{2}-\d{2}$/.test(dateRaw)) date = dateRaw;
+    else if (/^\d{4}-\d{2}$/.test(dateRaw)) date = `${dateRaw}-01`;
+    else date = new Date().toISOString().slice(0, 10);
+    const name = typeof t.name === "string" && t.name.trim() ? t.name.trim() : cleanHandle;
+    out.push({
+      account: `@${user}`,
+      name,
+      user,
+      tweetId,
+      text,
+      date,
+      likes: 0,
+      views: 0,
+      url,
+      source: `@${user}`,
+      category: "行业",
+      imageUrl: "",
+    });
+  }
+  return out;
+}
+
+function dedupe(tweets) {
+  const seen = new Set();
+  const out = [];
+  for (const t of tweets) {
+    const key = t.tweetId || t.url;
+    if (seen.has(key)) continue;
+    seen.add(key);
+    out.push(t);
+  }
+  return out;
+}
+
+async function main() {
+  const opts = parseArgs(process.argv);
+  if (opts.help) {
+    printHelp();
+    process.exit(0);
+  }
+  preflight();
+
+  const log = (msg) => {
+    if (opts.verbose) process.stderr.write(`[fetch-grok] ${msg}\n`);
+  };
+  log(`accounts=${opts.accounts.length} since=${opts.since} max-per=${opts.maxPer}`);
+
+  const failures = [];
+  const allTweets = [];
+
+  for (const handle of opts.accounts) {
+    log(`→ ${handle}`);
+    const prompt = buildPrompt(handle, opts.since, opts.maxPer);
+    const call = callGrok(prompt, opts.timeout);
+    if (!call.ok) {
+      failures.push({ handle, reason: call.reason });
+      log(` ✗ ${call.reason}`);
+      continue;
+    }
+    const parsed = parseEnvelope(call.stdout);
+    if (!parsed.ok) {
+      failures.push({ handle, reason: parsed.reason });
+      log(` ✗ ${parsed.reason}`);
+      continue;
+    }
+    const tweets = normalize(parsed.raw, handle);
+    log(` ✓ ${tweets.length} tweet(s)`);
+    allTweets.push(...tweets);
+  }
+
+  const tweets = dedupe(allTweets);
+  const payload = { nextId: opts.startId + tweets.length, tweets };
+  const json = JSON.stringify(payload, null, 2);
+
+  if (opts.out === "-") {
+    process.stdout.write(json + "\n");
+  } else {
+    
fs.writeFileSync(opts.out, json + "\n", "utf8");
+    log(`wrote ${tweets.length} tweet(s) to ${opts.out}`);
+  }
+
+  if (failures.length > 0) {
+    process.stderr.write(
+      `[fetch-grok] ${failures.length}/${opts.accounts.length} accounts failed:\n`,
+    );
+    for (const f of failures) {
+      process.stderr.write(` - ${f.handle}: ${f.reason}\n`);
+    }
+  }
+
+  if (tweets.length === 0) {
+    process.stderr.write(
+      `[fetch-grok] zero tweets aggregated — exit 1 (downstream should skip write)\n`,
+    );
+    process.exit(1);
+  }
+  process.exit(0);
+}
+
+main().catch((e) => {
+  process.stderr.write(`[fetch-grok] fatal: ${e.stack || e.message}\n`);
+  process.exit(1);
+});

From 224969cade35334494ea7bdeda820f149a381a35 Mon Sep 17 00:00:00 2001
From: vansin
Date: Sun, 7 Jun 2026 07:18:37 +0800
Subject: [PATCH 2/2] feat(demos/grok-news-fetcher): v2 recency + media_url
 validation red lines
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Following 通信龙's HIGH-priority acceptance red lines and the completed
spec ack, this commit adds four pieces:

1. **Prompt recency enforcement**: buildPrompt injects today's date and
   cutoffISO plus four hard rules in English ("you MUST return real
   publish DATE; do NOT include older posts; do NOT pad with today's date
   if unsure — drop instead"; the LLM is more sensitive to English rules
   than to Chinese ones), with a one-sentence Chinese fallback. The LLM is
   also asked for media_url (a direct pbs.twimg.com / video.twimg.com
   link).

2. **verifyMediaUrl (new)**: HEAD request via curl -I -L -m 8. The host
   must be pbs.twimg.com / video.twimg.com / ton.twitter.com, and only
   HTTP 200 with content-type image/* or video/* is stored; otherwise
   imageUrl is set to "" (the tweet is kept, the image dropped). This
   guards against hallucinated media links and is the key safeguard for
   Site A's full-replacement vs. fallback pilot.

3. **Second date-window check in normalize**: a tweet is kept only if
   cutoffISO <= date <= todayISO. **A tweet whose date is missing or not
   YYYY-MM-DD is dropped entirely** (no more fallback to today; 通信龙
   caught that anti-pattern while reading v1, since it disguises old posts
   as new ones). Drop counters: badUrl/wrongUser/emptyText/badDate/
   outOfWindow/mediaVerifyFail.

4. 
**Freshness self-check logging**: summarize() computes freshest/oldest/
   ageDays, the byDate date distribution, and the media-verified ratio;
   verbose mode prints the per-tweet curl HTTP code + content-type; when
   the freshest tweet is more than 2d old a WARN goes to stderr (this
   surfaces grok web-index lag, so Site A notices early when it needs
   last-24h coverage).

v2 test run (2026-06-07, 3 accounts, since=72h, max-per=3):
- 6 tweets, all within the window [2026-06-03, 2026-06-06]
- freshest=2026-06-05 (1d ago), which meets Site A's last-24h target
- date distribution: 2026-06-04 × 5, 2026-06-05 × 1
- media_url: the LLM returned 1 video.twimg.com/amplify_video/.../mp4
  link; curl HEAD confirmed content-type=video/mp4, so it was stored. For
  the other 5 the LLM returned no media_url (as expected, nothing
  fabricated)
- exit 0 / serial, ~3m30s
- Failure path (nonexistent account): tweets=[] + exit 1

README updated in sync: field rules tightened for v2 + test run + 6 new
gotchas (freshness WARN / imageUrl often empty / drop on curl verification
failure / drop on missing date / outOfWindow drop / slow runs).

Author-Agent: 通信demo马
Dispatched-By: 通信龙 (commhub task 4c5085d3 HIGH + 48bc3d35/62ea048b spec
 completion + 8d6204b6 v1 review ping)
---
 demos/grok-news-fetcher/README.md             |  49 +++--
 .../grok-news-fetcher/fetch_news_via_grok.js  | 196 ++++++++++++++----
 2 files changed, 189 insertions(+), 56 deletions(-)

diff --git a/demos/grok-news-fetcher/README.md b/demos/grok-news-fetcher/README.md
index 3dd62ea4..54f2e7f5 100644
--- a/demos/grok-news-fetcher/README.md
+++ b/demos/grok-news-fetcher/README.md
@@ -86,14 +86,19 @@ grok login # one-time OAuth
 }
 ```
 
-Field rules:
+Field rules (tightened in v2):
 - `user` = the handle without `@`
 - `account` / `source` = `@<handle>` (both fields hold the same value; kept for compatibility with Site A's legacy schema)
 - `url` = must be a real `https://x.com/<handle>/status/<id>` link; **otherwise the script silently drops that tweet**
 - `tweetId` = extracted from `url` by regex (`/status/(\d+)`)
-- `date` = `YYYY-MM-DD`; a `YYYY-MM` value from the LLM is padded with `-01`, and an empty or invalid value falls back to today
-- `likes` / `views` = `0` (grok's native search cannot get engagement metadata; an X platform limitation)
-- `imageUrl` = `""` (grok cannot get media attachments either, so it stays empty)
+- `date` = `YYYY-MM-DD`; **if the LLM cannot return the real publish date (not in `YYYY-MM-DD` format, or missing), the whole tweet is dropped**. There is no fallback to today any more, so old posts cannot slip into the window disguised as new ones
+- `date` also goes through a **second, window-based check**: a tweet is kept only if `cutoff <= date <= today`; anything outside the window is silently dropped
+- `imageUrl` = the `media_url` field returned by the LLM, which **must pass a curl HEAD check**:
+  - the host must be one of `pbs.twimg.com` / `video.twimg.com` / `ton.twitter.com`
+  - HTTP 
200 + content-type `image/*` or `video/*`
+  - on failure `imageUrl` is set to `""` (the tweet itself is kept, just without media)
+  - **guards against hallucinated media links**, the key safeguard for Site A's full-replacement pilot
+- `likes` / `views` = `0` (grok's native search cannot get engagement metadata; an X platform limitation, and nothing is fabricated)
 - `category` = `"行业"` (fixed value)
 
 ## Failure semantics (hard requirement from Site A)
@@ -107,27 +112,36 @@ grok login # one-time OAuth
 
 **Never inject fake data.** When the downstream translator sees an empty `tweets` array it should skip the DB write, following ai-insight's existing convention.
 
-## Test run (2026-06-07, 通信demo马)
+## Test run (2026-06-07, 通信demo马, v2)
 
 > Command: `./fetch_news_via_grok.js --accounts @sama,@OpenAI,@karpathy --since 72h --max-per 3 -v`
 
 ```
-[fetch-grok] accounts=3 since=72h max-per=3
+[fetch-grok] accounts=3 since=72h max-per=3 window=[2026-06-03, 2026-06-06]
 [fetch-grok] → @sama
-[fetch-grok] ✓ 3 tweet(s)
+[fetch-grok] ✓ 3 tweet(s) kept
 [fetch-grok] → @OpenAI
-[fetch-grok] ✓ 3 tweet(s)
+[fetch-grok] ✓ 3 tweet(s) kept
 [fetch-grok] → @karpathy
-[fetch-grok] ✓ 0 tweet(s)
+[fetch-grok] ✓ 0 tweet(s) kept
 [fetch-grok] wrote 6 tweet(s) to news.json
-
-real 3m26s
+[fetch-grok] summary: freshest=2026-06-05 (1d ago) | oldest=2026-06-04 | window=[2026-06-03, 2026-06-06] | media-verified=1/6
+[fetch-grok] date distribution:
+ 2026-06-04: 5
+ 2026-06-05: 1
+[fetch-grok] media verify: 1/1 passed (HTTP 200 + image/video content-type)
+ [@OpenAI 2062630454537424930] ✓ video/mp4
+ https://video.twimg.com/amplify_video/2062605181427339264/vid/avc1/480x270/0mV_E3qR35fCtFaE.mp4
+
+real 3m30s
 exit 0
 ```
 
-- 6/6 URLs verified as real links (for example `https://x.com/sama/status/2062661191969972645`, `https://x.com/OpenAI/status/2062927046448431587`)
+- 6/6 URLs verified as real links (e.g. `https://x.com/sama/status/2062661191969972645`, `https://x.com/OpenAI/status/2062927046448431587`)
+- 6/6 dates fall within the window `[2026-06-03, 2026-06-06]` (freshest=2026-06-05, 1d ago)
 - All 12 schema fields match exactly
-- Serial execution (avoids `auth.json` / rate-limit conflicts): 3 accounts take ~3.5 min, 10 accounts roughly 12-15 min
+- The LLM returned 1 real `video.twimg.com` link; curl HEAD confirmed content-type `video/mp4`, so it was stored. For the other 5 the LLM returned no `media_url` (grok did not make one up, which is the expected behavior)
+- Serial execution (avoids `auth.json` / rate-limit conflicts): 3 accounts take ~3.5 min; 10 accounts roughly 12-15 min
 
 > Command: `./fetch_news_via_grok.js 
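--accounts @ThisAccountDefinitelyDoesNotExistFooBar2026 --since 1h --max-per 2`
 
@@ -157,11 +171,14 @@ news.json: {"nextId": 1, "tweets": []}
 | Symptom | Cause | Mitigation |
 |------|------|------|
 | stderr contains `failed to watch root recursively` / `Error reading from stream: serde error` | Warning noise from grok-build CLI's `/tmp` file watcher and stream parser, unrelated to the actual task | The script parses stdout only and discards stderr; this is the intended behavior |
-| An account shows `0 tweet(s)` although it did post | The index behind the LLM's `web_search` can lag by several hours; or X's anti-scraping blocks `web_fetch` | Widen the `--since` window, or rerun |
+| An account shows `0 tweet(s)` although it did post | The LLM's `web_search` index may lag; or X's anti-scraping blocks `web_fetch`; or the LLM could not get a real date and the window check dropped the tweet | Widen the `--since` window, or rerun |
+| `WARN: freshest tweet is N days old — grok web index may be lagging` | grok's index lags behind live X; when Site A needs last-24h coverage this warning surfaces the problem early | Widen `--since`; or schedule more frequently; or accept that the grok fallback cannot get same-day posts in this period and skip translation and the DB write |
 | grok exits abnormally for an account (timeout / non-zero exit) | LLM stall / endpoint 5xx | The script skips that account, continues with the next, and lists the failures on stderr at the end |
 | `likes` / `views` are always `0` | X does not expose engagement metadata to anonymous / non-logged-in clients, and grok's web_search cannot get it either; the script reports nothing rather than fabricating | If Site A's downstream needs ranking, use another signal (such as publish time) |
-| `imageUrl` is always `""` | Same as above: grok web_search does not fetch media attachment URLs automatically | If images are needed, use Site A's existing twitterapi.io path; the grok fallback does not cover them |
-| The `date` returned by the LLM is not `YYYY-MM-DD` | LLM output is non-deterministic | The script pads `YYYY-MM` with `-01` and falls back to today when the value is invalid, so the result is always a valid date string |
+| `imageUrl` is often `""` | The LLM returns no `media_url` (no media, or unsure of the host), or the `media_url` failed curl verification and was dropped | This is **good behavior** (it prevents hallucinated media links); if an account never yields an imageUrl, the prompt can be strengthened, but LLM behavior is non-deterministic, so Site A may down-weight such items or fall back to twitterapi.io |
+| The LLM returned a `media_url` but curl verification failed | Fake link (LLM hallucination) / X CDN 5xx / content-type is not image/video | The image is silently dropped and the tweet kept (imageUrl set to ""); verbose mode prints `✗ http XXX` or `✗ non-media content-type` |
+| A whole tweet is dropped for a missing date | The LLM returned no `date` field or one not in `YYYY-MM-DD` format; the script refuses to pass old posts off as new | Tighten the prompt; or accept that the date cannot be verified and skip the tweet |
+| A whole tweet is dropped for being outside the window | The LLM ignored the cutoff hint and returned an old post; the script's local window check caught it | The drops counter and verbose log report `outOfWindow=N`; shorten `--since` to make the LLM more precise |
 | Slow runs (10 accounts ≈ 15 min) | Each grok LLM call takes 30-90 s, run serially 10 times | Do not schedule cron more often than every 30 min; concurrency is possible but requires solving `auth.json` locking first (not done in this script) |
 
 ## Relationship to Site A's existing `auto_update_news.js`
diff --git a/demos/grok-news-fetcher/fetch_news_via_grok.js 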
b/demos/grok-news-fetcher/fetch_news_via_grok.js
index c7b7e1ed..b2099375 100755
--- a/demos/grok-news-fetcher/fetch_news_via_grok.js
+++ b/demos/grok-news-fetcher/fetch_news_via_grok.js
@@ -165,40 +165,55 @@ function preflight() {
   }
 }
 
-function buildPrompt(handle, since, maxPer) {
-  const cleanHandle = handle.replace(/^@/, "");
-  let windowDesc;
+function computeCutoff(since) {
+  const now = new Date();
   if (/^\d+h$/i.test(since)) {
     const hours = Number.parseInt(since, 10);
-    windowDesc = `最近 ${hours} 小时`;
-  } else if (/^\d{4}-\d{2}-\d{2}$/.test(since)) {
-    windowDesc = `自 ${since} 起 (since:${since})`;
-  } else {
-    windowDesc = since;
+    const cutoff = new Date(now.getTime() - hours * 3600 * 1000);
+    return { cutoffISO: cutoff.toISOString().slice(0, 10), hours };
+  }
+  if (/^\d{4}-\d{2}-\d{2}$/.test(since)) {
+    return { cutoffISO: since, hours: null };
   }
+  return { cutoffISO: new Date(now.getTime() - 24 * 3600 * 1000).toISOString().slice(0, 10), hours: 24 }; // unparseable --since: default to a real 24h window
+}
+
+function buildPrompt(handle, since, maxPer) {
+  const cleanHandle = handle.replace(/^@/, "");
+  const today = new Date().toISOString().slice(0, 10);
+  const { cutoffISO, hours } = computeCutoff(since);
+  const windowDescZh = hours != null ? `最近 ${hours} 小时 (自 ${cutoffISO} 起)` : `自 ${cutoffISO} 起`;
+  const windowDescEn = hours != null ? `the last ${hours} hours (since ${cutoffISO})` : `since ${cutoffISO}`;
   return [
-    `找一下 X (Twitter) 上 @${cleanHandle} ${windowDesc}内的帖子, 最多 ${maxPer} 条。`,
+    `Today is ${today}. Find recent original posts on X (Twitter) from @${cleanHandle} within ${windowDescEn}, at most ${maxPer} posts.`,
+    `(中文: 今天是 ${today}, 找 @${cleanHandle} ${windowDescZh}内的原创帖, 最多 ${maxPer} 条, 不含 retweet)`,
     "",
-    "**步骤**:",
-    `1. 先用 web_search (allowed_domains=["x.com"]) 找 @${cleanHandle} 的最新原创帖子 (不含 retweet)`,
-    "2. 必要时用 web_fetch 拿正文",
-    "3. 最后只输出一个**严格 JSON 数组**, 不要 markdown, 不要解释, 不要 code fence",
+    "**Steps**:",
+    `1. Use web_search (allowed_domains=["x.com"]) to find @${cleanHandle}'s most recent original posts`,
+    "2. 
Use web_fetch if needed to confirm post body and publish date",
+    "3. Output ONLY a strict JSON array — no markdown, no explanation, no code fence",
     "",
-    "数组每条字段:",
-    ` user ${cleanHandle} (handle 不带 @)`,
-    ` name 显示名 (例: Sam Altman)`,
-    ` text 原文或忠实的中文摘要 (~80 字内)`,
-    ` url 完整的 https://x.com/${cleanHandle}/status/ URL (必须是真链接)`,
-    ` date YYYY-MM-DD (拿不到精确日子用大致月份: YYYY-MM)`,
+    "Each array item fields:",
+    ` user "${cleanHandle}" (handle without @)`,
+    ` name display name (e.g. "Sam Altman")`,
+    ` text post body or faithful summary in Chinese (~80 chars)`,
+    ` url https://x.com/${cleanHandle}/status/ (real, browser-openable)`,
+    ` date YYYY-MM-DD (real publish date — MUST be accurate to the day)`,
+    ` media_url https://pbs.twimg.com/... or https://video.twimg.com/... (real direct media link; "" if no image/video)`,
     "",
-    "硬要求:",
-    ` · url 字段必须以 "https://x.com/${cleanHandle}/status/" 开头,后接纯数字 status id`,
-    " · 如果搜索不到任何符合时间窗的帖子,返回空数组 []",
-    " · 绝对禁止编造 URL 或 status id",
-    " · 绝对禁止返回 markdown / 解释文字 / code fence",
+    "**HARD RECENCY REQUIREMENTS** (post will be dropped otherwise):",
+    ` · date MUST be >= ${cutoffISO} (the cutoff). Do NOT include older posts. Do NOT pad date with today's date if you don't know the real one — drop that post instead.`,
+    " · If unsure of the real publish date, drop the post — do NOT fabricate a recent date to pass the filter.",
+    ` · If no posts in [${cutoffISO}, ${today}] match, return empty array [].`,
     "",
-    "示例 (替换为真实数据):",
-    `[{"user":"${cleanHandle}","name":"...","text":"...","url":"https://x.com/${cleanHandle}/status/123","date":"2026-06-01"}]`,
+    "**OTHER HARD RULES**:",
+    ` · url MUST start with "https://x.com/${cleanHandle}/status/" followed by digits-only status id`,
+    ` · media_url, if non-empty, MUST be a real direct image/video link (pbs.twimg.com, video.twimg.com, or ton.twitter.com). Do NOT fabricate.`,
+    " · NEVER fabricate URL / status id / media_url / date.",
+    " · Output is JSON ONLY. 
No markdown, no explanation, no code fence.",
+    "",
+    "Example (replace with real data):",
+    `[{"user":"${cleanHandle}","name":"...","text":"...","url":"https://x.com/${cleanHandle}/status/123","date":"${today}","media_url":"https://pbs.twimg.com/media/abc.jpg"}]`,
   ].join("\n");
 }
 
@@ -264,42 +279,87 @@ function parseEnvelope(stdout) {
 }
 
 const URL_RE = /^https:\/\/x\.com\/([^/]+)\/status\/(\d+)/;
+const MEDIA_HOST_RE = /^https:\/\/(pbs\.twimg\.com|video\.twimg\.com|ton\.twitter\.com)\//;
+
+function verifyMediaUrl(mediaUrl, timeoutSecs = 8) {
+  if (!mediaUrl) return { ok: false, reason: "empty" };
+  if (!MEDIA_HOST_RE.test(mediaUrl)) {
+    return { ok: false, reason: `non-twitter media host: ${mediaUrl.slice(0, 60)}` };
+  }
+  const result = spawnSync(
+    "curl",
+    [
+      "-I", "-L", "--max-redirs", "3",
+      "-m", String(timeoutSecs),
+      "-s", "-o", "/dev/null",
+      "-w", "%{http_code}\t%{content_type}",
+      "-A", "Mozilla/5.0 (compatible; ai-insight-fetcher/1.0)",
+      mediaUrl,
+    ],
+    { encoding: "utf8", timeout: (timeoutSecs + 2) * 1000 },
+  );
+  if (result.error || result.status !== 0) {
+    return { ok: false, reason: `curl err: ${result.error?.message || `exit ${result.status}`}` };
+  }
+  const [code, ctype = ""] = (result.stdout || "").split("\t");
+  if (code !== "200") return { ok: false, reason: `http ${code}` };
+  if (!ctype.startsWith("image/") && !ctype.startsWith("video/")) {
+    return { ok: false, reason: `non-media content-type: ${ctype}` };
+  }
+  return { ok: true, contentType: ctype };
+}
 
-function normalize(raw, handle) {
+function normalize(raw, handle, cutoffISO, todayISO, mediaLog) {
   const cleanHandle = handle.replace(/^@/, "");
   const out = [];
+  const drops = { badUrl: 0, wrongUser: 0, emptyText: 0, badDate: 0, outOfWindow: 0, mediaVerifyFail: 0 };
   for (const t of raw) {
     if (!t || typeof t !== "object") continue;
     const url = typeof t.url === "string" ? 
t.url.trim() : ""; const m = url.match(URL_RE); - if (!m) continue; + if (!m) { drops.badUrl++; continue; } const user = m[1]; const tweetId = m[2]; - if (user.toLowerCase() !== cleanHandle.toLowerCase()) continue; + if (user.toLowerCase() !== cleanHandle.toLowerCase()) { drops.wrongUser++; continue; } const text = typeof t.text === "string" ? t.text.trim() : ""; - if (!text) continue; + if (!text) { drops.emptyText++; continue; } const dateRaw = typeof t.date === "string" ? t.date.trim() : ""; - let date = ""; - if (/^\d{4}-\d{2}-\d{2}$/.test(dateRaw)) date = dateRaw; - else if (/^\d{4}-\d{2}$/.test(dateRaw)) date = `${dateRaw}-01`; - else date = new Date().toISOString().slice(0, 10); + if (!/^\d{4}-\d{2}-\d{2}$/.test(dateRaw)) { drops.badDate++; continue; } + if (dateRaw < cutoffISO || dateRaw > todayISO) { drops.outOfWindow++; continue; } const name = typeof t.name === "string" && t.name.trim() ? t.name.trim() : cleanHandle; + let imageUrl = ""; + const mediaRaw = typeof t.media_url === "string" ? 
t.media_url.trim() : ""; + if (mediaRaw) { + const v = verifyMediaUrl(mediaRaw); + if (mediaLog) { + mediaLog.push({ + tweetId, + handle: cleanHandle, + url: mediaRaw, + ok: v.ok, + contentType: v.contentType || "", + reason: v.reason || "", + }); + } + if (v.ok) imageUrl = mediaRaw; + else drops.mediaVerifyFail++; + } out.push({ account: `@${user}`, name, user, tweetId, text, - date, + date: dateRaw, likes: 0, views: 0, url, source: `@${user}`, category: "行业", - imageUrl: "", + imageUrl, }); } - return out; + return { tweets: out, drops }; } function dedupe(tweets) { @@ -314,6 +374,24 @@ function dedupe(tweets) { return out; } +function summarize(tweets, cutoffISO, todayISO) { + if (tweets.length === 0) return { line: "no tweets", byDate: {}, freshestAgeDays: null }; + const byDate = {}; + for (const t of tweets) byDate[t.date] = (byDate[t.date] || 0) + 1; + const dates = Object.keys(byDate).sort(); + const freshest = dates[dates.length - 1]; + const oldest = dates[0]; + const withMedia = tweets.filter((t) => t.imageUrl).length; + const today = new Date(todayISO + "T00:00:00Z"); + const f = new Date(freshest + "T00:00:00Z"); + const ageDays = Math.round((today.getTime() - f.getTime()) / 86400000); + return { + line: `freshest=${freshest} (${ageDays}d ago) | oldest=${oldest} | window=[${cutoffISO}, ${todayISO}] | media-verified=${withMedia}/${tweets.length}`, + byDate, + freshestAgeDays: ageDays, + }; +} + async function main() { const opts = parseArgs(process.argv); if (opts.help) { @@ -322,13 +400,18 @@ async function main() { } preflight(); + const todayISO = new Date().toISOString().slice(0, 10); + const { cutoffISO } = computeCutoff(opts.since); + const log = (msg) => { if (opts.verbose) process.stderr.write(`[fetch-grok] ${msg}\n`); }; - log(`accounts=${opts.accounts.length} since=${opts.since} max-per=${opts.maxPer}`); + log(`accounts=${opts.accounts.length} since=${opts.since} max-per=${opts.maxPer} window=[${cutoffISO}, ${todayISO}]`); const failures = []; 
   const allTweets = [];
+  const mediaLog = [];
+  const totalDrops = { badUrl: 0, wrongUser: 0, emptyText: 0, badDate: 0, outOfWindow: 0, mediaVerifyFail: 0 };
 
   for (const handle of opts.accounts) {
     log(`→ ${handle}`);
@@ -345,8 +428,10 @@ async function main() {
       log(` ✗ ${parsed.reason}`);
       continue;
     }
-    const tweets = normalize(parsed.raw, handle);
-    log(` ✓ ${tweets.length} tweet(s)`);
+    const { tweets, drops } = normalize(parsed.raw, handle, cutoffISO, todayISO, mediaLog);
+    for (const k of Object.keys(totalDrops)) totalDrops[k] += drops[k];
+    const dropTags = Object.entries(drops).filter(([, v]) => v > 0).map(([k, v]) => `${k}=${v}`).join(",");
+    log(` ✓ ${tweets.length} tweet(s) kept${dropTags ? ` [drops: ${dropTags}]` : ""}`);
     allTweets.push(...tweets);
   }
 
@@ -361,6 +446,31 @@ async function main() {
     log(`wrote ${tweets.length} tweet(s) to ${opts.out}`);
   }
 
+  // freshness self-check — surfaces "grok index lag" early
+  const summary = summarize(tweets, cutoffISO, todayISO);
+  process.stderr.write(`[fetch-grok] summary: ${summary.line}\n`);
+  if (Object.keys(summary.byDate).length > 0) {
+    const dateLines = Object.entries(summary.byDate)
+      .sort()
+      .map(([d, n]) => ` ${d}: ${n}`)
+      .join("\n");
+    process.stderr.write(`[fetch-grok] date distribution:\n${dateLines}\n`);
+  }
+  const droppedTags = Object.entries(totalDrops).filter(([, v]) => v > 0).map(([k, v]) => `${k}=${v}`).join(", ");
+  if (droppedTags) {
+    process.stderr.write(`[fetch-grok] total drops: ${droppedTags}\n`);
+  }
+  if (mediaLog.length > 0) {
+    const kept = mediaLog.filter((m) => m.ok).length;
+    process.stderr.write(`[fetch-grok] media verify: ${kept}/${mediaLog.length} passed (HTTP 200 + image/video content-type)\n`);
+    if (opts.verbose) {
+      for (const m of mediaLog) {
+        const tag = m.ok ? `✓ ${m.contentType}` : `✗ ${m.reason}`;
+        process.stderr.write(` [@${m.handle} ${m.tweetId}] ${tag}\n ${m.url}\n`);
+      }
+    }
+  }
+
   if (failures.length > 0) {
     process.stderr.write(
       `[fetch-grok] ${failures.length}/${opts.accounts.length} accounts failed:\n`,
@@ -376,6 +486,12 @@ async function main() {
     );
     process.exit(1);
   }
+  // Freshness warning (non-fatal): A站 needs last-24h ideally, surface if grok index lagged
+  if (summary.freshestAgeDays != null && summary.freshestAgeDays > 2) {
+    process.stderr.write(
+      `[fetch-grok] WARN: freshest tweet is ${summary.freshestAgeDays} days old — grok web index may be lagging behind real-time X\n`,
+    );
+  }
   process.exit(0);
 }