Skip to content

feat(demos/grok-news-fetcher): A站 ai-insight 兜底新闻抓取脚本 (grok-build headless)#208

Open
s2agi wants to merge 2 commits into
mainfrom
feat/grok-news-fetcher
Open

feat(demos/grok-news-fetcher): A站 ai-insight 兜底新闻抓取脚本 (grok-build headless)#208
s2agi wants to merge 2 commits into
mainfrom
feat/grok-news-fetcher

Conversation

@s2agi
Copy link
Copy Markdown
Contributor

@s2agi s2agi commented Jun 6, 2026

Author & Helpers

  • Author-Agent: 通信demo马 (claude-code-cli runtime)
  • Helpers: 通信SDK马 (demos/grok-x-search/ 参考实现, prompt 模板灵感)
  • Dispatched-By: 通信龙 (commhub task chain: 58afee64 初派 → 4c5085d3 HIGH 验收红线 → 48bc3d35+62ea048b spec 凑齐 → 8d6204b6 v1 review ping → 983f9c0e v2 PASS + 转 A 站 pilot)
  • Review: 通信牛 (重点 prompt 工程那几处非显然的, 见下)

背景

A 站 ai-insighttwitterapi.io 抓 10 个锚定账号最新推文翻译入库。2026-06-03 起 twitterapi.io 返 HTTP 402 quota exceeded, 管线断流。 Vincent 决定用 grok-build CLI 0.2.x headless 模式 (已实证 native X 搜索能力, 见 demos/grok-x-search/) 当兜底 → 全替代 pilot

通信龙独立 headless 实测确认: 只要 prompt 强制 recency window, grok 能拿 last-3-days 真帖 + 真 media_url (pbs.twimg.com/video.twimg.com)。

交付

文件 作用
demos/grok-news-fetcher/fetch_news_via_grok.js 主脚本 (Node.js, 零 npm 依赖) — 串行调 grok subprocess + envelope.text 二次 JSON parse + 规范化为 A 站 schema
demos/grok-news-fetcher/README.md 用法 + cron 接入 + 实测 + gotcha + 红线

输出 schema 跟 A 站 auto_update_news.js --fetch-only 一字不差 (12 字段: account, name, user, tweetId, text, date, likes, views, url, source, category, imageUrl)。

Prompt 工程非显然处 (请通信牛 focus review)

  1. 不能纯硬"严格返 JSON" — 实测 v0 prompt 直接要 JSON, grok 会 EndTurn 不调工具直接给空数组 []。改成 "先用 web_search/web_fetch 找, 再返 JSON" 双 phase 引导, grok 自主会用 web_search (allowed_domains x.com) 甚至 x_keyword_search 工具拿数据。(buildPrompt L138-178)

  2. today + cutoff 双注入 + 英文硬规则 — 中文 LLM 对英文规则更敏感, 4 点 recency 用英文写: Today is YYYY-MM-DD + since cutoff + MUST return real publish date + do NOT pad with today's date if unsure — drop instead. 中文一句话兜底。

  3. 媒体 host 白名单verifyMediaUrl 只接受 pbs.twimg.com / video.twimg.com / ton.twitter.com 三个 host (twitter 官方 CDN), 防 LLM 给 imgur/cdn-cgi/伪域名等假链。(MEDIA_HOST_RE L185)

  4. date 缺失 = 整条 drop, 不 fallback today — 通信龙 v1 review 抓到的 anti-pattern: v1 normalize 里 date 不可解析时 new Date() 盖今天, 把没日期的旧帖伪装成新帖混进窗口。v2 改 if (!/^\d{4}-\d{2}-\d{2}$/.test(dateRaw)) { drops.badDate++; continue; }. (normalize L220 附近)

  5. date 窗口双保险 — prompt 已明示 cutoff, 但 LLM 不一定守; normalize 再加 cutoffISO <= date <= todayISO 本地校验, 超窗 silent drop + drops.outOfWindow++. 宁可少一条不能假装新帖。

  6. media_url curl HEAD 三层闸:

    • host 白名单 (上面)
    • HTTP 200 (curl -I -L --max-redirs 3 -m 8)
    • content-type 必须 image/*video/*
    • 任一 fail → imageUrl"" (保 tweet 丢图), 不整条 drop, 不假装 fallback URL. A 站 全替代 pilot 关键防线.
  7. 新鲜度自查summarize() 算 freshest/oldest/ageDays/byDate; freshest > 2d 老 stderr WARN: grok web index may be lagging. A 站要 last-24h, 这个 WARN 帮他们早 surface "grok 这时段拿不到当天新帖" 不入库。

实测 (2026-06-07, v2 commit 224969c)

$ ./fetch_news_via_grok.js --accounts @sama,@OpenAI,@karpathy --since 72h --max-per 3 -v

[fetch-grok] accounts=3 since=72h max-per=3 window=[2026-06-03, 2026-06-06]
[fetch-grok] → @sama       ✓ 3 tweet(s) kept
[fetch-grok] → @OpenAI     ✓ 3 tweet(s) kept
[fetch-grok] → @karpathy   ✓ 0 tweet(s) kept
[fetch-grok] wrote 6 tweet(s)
[fetch-grok] summary: freshest=2026-06-05 (1d ago) | oldest=2026-06-04 | window=[2026-06-03, 2026-06-06] | media-verified=1/6
[fetch-grok] date distribution: 2026-06-04: 5, 2026-06-05: 1
[fetch-grok] media verify: 1/1 passed (HTTP 200 + image/video content-type)
  [@OpenAI 2062630454537424930] ✓ video/mp4
    https://video.twimg.com/amplify_video/2062605181427339264/vid/avc1/480x270/0mV_E3qR35fCtFaE.mp4

real 3m30s, exit 0
  • 6/6 URL 真链接, 全在窗口 [2026-06-03, 2026-06-06] 内
  • freshest=1d ago, 满足 A 站 last-24h target
  • 1 条 LLM 返了 media_url, curl HEAD 验证 video/mp4 → 入库
  • 5 条 LLM 诚实地不返 media_url (空字符串, 不编造)
  • 通信龙独立验真: 该 video.twimg.com 链 GET → HTTP 200 + video/mp4 + 27MB 真视频; 两条 x.com 推文链 GET → 200 真帖

失败路径:

$ ./fetch_news_via_grok.js --accounts @ThisDoesNotExist2026FooBar --since 1h --max-per 1 -v
[fetch-grok] zero tweets aggregated — exit 1 (downstream should skip write)
exit 1
news.json: {"nextId": 1, "tweets": []}

A 站 pilot 状态

  • 已转 A 站负责人跑 1-day pilot (分支 link + 验过的产物 + 配图覆盖率 caveat by 通信龙)
  • 配图覆盖率 1/6 (sama 3 条全空, OpenAI 1/3) 已 flag, 等 pilot 真实数据再决定是否针对性增强 prompt (LLM 行为非确定, 盲调 ROI 低)

红线遵守

  • ✅ 本地 + worktree 自测, 没碰 47.77.216.1 A 站生产
  • ✅ 脚本零内置 A 站凭据/路径/endpoint, 纯交付物
  • ✅ 绝不塞假数据 — 失败一律空数组 + exit≠0
  • ✅ url + media_url 真链接校验, 假的静默 drop
  • ✅ 不加 Co-Authored-By Claude trailer

Merge 时机

  • ⏸️ HOLD — 等 A 站 pilot 1-day + 通信牛 review 都 OK 再 merge
  • PR 跟 pilot 并行不阻塞

关联

A 站 twitterapi.io 6-03 起 402 quota exceeded, 管线断流。Vincent
route grok-build CLI 0.2.x headless 模式当兜底源。

交付:
- fetch_news_via_grok.js: 串行调 grok -p ... --output-format json,
  从 envelope.text 抠 LLM 返的 JSON 数组, 规范化为 A站
  auto_update_news.js --fetch-only 一字不差的 12 字段 schema
  (account/name/user/tweetId/text/date/likes/views/url/source/
  category/imageUrl). 零 npm 依赖, 纯 Node.js stdlib.
- README.md: 用法 + cron 接入 + 实测输出 + gotcha + 红线

实测 (2026-06-07):
- 3 账号 (@sama/@OpenAI/@karpathy) 串行 3m26s
- 6/6 URL 真实可点开 https://x.com/<user>/status/<id>
- schema 12 字段对齐, exit 0
- 失败路径 (nonexistent account): 空 tweets + exit 1, stderr 提示
  downstream skip write
- stderr 噪音 (PermissionDenied/serde error) 只丢弃, 不影响 parse

红线:
- 本地/Docker 自测, 不碰 47.77.216.1 A站生产
- 绝不塞假数据 — 失败一律空数组 + exit≠0
- url 必须真链接 (正则校验) 否则那条 tweet 静默丢弃

Author-Agent: 通信demo马
Dispatched-By: 通信龙 (commhub task 58afee64-899b-468b-90e6-3b8fb379258d)
Helpers: 通信SDK马 (grok-x-search demo 参考实现)
通信龙 HIGH 验收红线 + 凑齐 spec ack 后追加 3 块:

1. **prompt recency 强制**: buildPrompt 注入今天日期 + cutoffISO +
   "you MUST return real publish DATE; do NOT include older posts;
   do NOT pad with today's date if unsure — drop instead" 4 点
   英文硬规则 (LLM 对英文规则比中文敏感) + 中文一句话兜底。
   要求 LLM 给 media_url (pbs.twimg.com / video.twimg.com 直链)。

2. **verifyMediaUrl (新增)**: curl -I -L -m 8s HEAD,
   host 必须是 pbs.twimg.com/video.twimg.com/ton.twitter.com,
   HTTP 200 + content-type image/* 或 video/* 才入库,
   否则 imageUrl 置 "" (保 tweet 丢图)。防 LLM 幻觉假图链 —
   A 站 全替代 vs 兜底 pilot 的关键防线。

3. **normalize date 窗口双保险**: cutoffISO <= date <= todayISO
   才保留。**date 缺失/非 YYYY-MM-DD 整条 drop** (不再 fallback 今天 —
   通信龙 v1 read 时抓到的 anti-pattern,会把老帖伪装成新帖)。
   drops 计数器: badUrl/wrongUser/emptyText/badDate/outOfWindow/
   mediaVerifyFail。

4. **新鲜度自查日志**: summarize() 算 freshest/oldest/ageDays/
   日期分布 byDate/media-verified ratio; verbose 模式打 per-tweet
   curl HTTP code + content-type; freshest > 2d 老 stderr WARN
   (grok web 索引滞后 surface 出来,A 站要 last-24h 时能早发现)。

v2 实测 (2026-06-07, 3 账号 since=72h max-per=3):
- 6 条 tweet, 全在 [2026-06-03, 2026-06-06] 窗口
- freshest=2026-06-05 (1d ago), 满足 A 站 last-24h target
- date 分布: 2026-06-04 × 5, 2026-06-05 × 1
- media_url: LLM 返 1 条 video.twimg.com/amplify_video/.../mp4,
  curl HEAD 验证 content-type=video/mp4 → 入库; 其余 5 条 LLM
  诚实地不返 media_url (符合预期, 不编造)
- exit 0 / 串行 ~3m30s
- 失败路径 (nonexistent account): tweets=[] + exit 1

README 同步更新: 字段规则 v2 收紧 + 实测 + 6 个新 gotcha
(WARN 新鲜度 / imageUrl 多空 / curl 校验失败 drop / date 缺失
drop / outOfWindow drop / 慢)。

Author-Agent: 通信demo马
Dispatched-By: 通信龙 (commhub task 4c5085d3 HIGH + 48bc3d35/62ea048b
spec 凑齐 + 8d6204b6 v1 review ping)
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 224969cade

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +172 to +173
const cutoff = new Date(now.getTime() - hours * 3600 * 1000);
return { cutoffISO: cutoff.toISOString().slice(0, 10), hours };
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve hour precision for relative --since windows

When the documented cron path uses an hourly window such as --since 6h or the default 24h, this truncates the cutoff to YYYY-MM-DD, and normalize() later compares only that date string. Any Grok result from the cutoff calendar day is therefore accepted even if it is outside the requested number of hours (for example, a 23:00 run with --since 6h keeps posts from 00:01 the same day), which undermines the fallback's recency guarantee for A站's last-24h feed. The local guard needs hour-level timestamps, or the option should be documented/implemented as a day-granularity window.

Useful? React with 👍 / 👎.

Comment on lines +161 to +162
>/var/log/ai-insight/grok-fetch.log 2>&1 ; \
test -s /var/lib/ai-insight/grok-fallback.json && \
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Gate the cron import on successful non-empty fetches

In this cron example, ; test -s ... runs regardless of the fetcher's exit status, and the fetcher writes a non-empty JSON file even for the zero-tweet failure case ({"nextId":...,"tweets":[]}), so the translator can still run after the script exits 1. Worse, if the fetcher exits before writing (e.g. missing grok), a stale non-empty fallback file can be re-imported. The sample should chain on the fetch command's success and check .tweets | length > 0 before invoking translate_and_upsert.js.

Useful? React with 👍 / 👎.

return { ok: true, raw };
}

const URL_RE = /^https:\/\/x\.com\/([^/]+)\/status\/(\d+)/;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Anchor status URL validation after the numeric ID

Because this regex is only prefix-anchored, a malformed Grok URL like https://x.com/sama/status/123abc or .../status/123/not-a-status passes validation, gets tweetId set to 123, and is then emitted as the canonical url. That bypasses the script's advertised hard guard that the status id is digits-only and browser-openable, so malformed or fabricated URLs can enter the fallback feed instead of being dropped.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants