feat(demos/grok-news-fetcher): A站 ai-insight 兜底新闻抓取脚本 (grok-build headless)#208
feat(demos/grok-news-fetcher): A站 ai-insight 兜底新闻抓取脚本 (grok-build headless)#208s2agi wants to merge 2 commits into
Conversation
A 站 twitterapi.io 6-03 起 402 quota exceeded, 管线断流。Vincent route grok-build CLI 0.2.x headless 模式当兜底源。 交付: - fetch_news_via_grok.js: 串行调 grok -p ... --output-format json, 从 envelope.text 抠 LLM 返的 JSON 数组, 规范化为 A站 auto_update_news.js --fetch-only 一字不差的 12 字段 schema (account/name/user/tweetId/text/date/likes/views/url/source/ category/imageUrl). 零 npm 依赖, 纯 Node.js stdlib. - README.md: 用法 + cron 接入 + 实测输出 + gotcha + 红线 实测 (2026-06-07): - 3 账号 (@sama/@OpenAI/@karpathy) 串行 3m26s - 6/6 URL 真实可点开 https://x.com/<user>/status/<id> - schema 12 字段对齐, exit 0 - 失败路径 (nonexistent account): 空 tweets + exit 1, stderr 提示 downstream skip write - stderr 噪音 (PermissionDenied/serde error) 只丢弃, 不影响 parse 红线: - 本地/Docker 自测, 不碰 47.77.216.1 A站生产 - 绝不塞假数据 — 失败一律空数组 + exit≠0 - url 必须真链接 (正则校验) 否则那条 tweet 静默丢弃 Author-Agent: 通信demo马 Dispatched-By: 通信龙 (commhub task 58afee64-899b-468b-90e6-3b8fb379258d) Helpers: 通信SDK马 (grok-x-search demo 参考实现)
通信龙 HIGH 验收红线 + 凑齐 spec ack 后追加 3 块: 1. **prompt recency 强制**: buildPrompt 注入今天日期 + cutoffISO + "you MUST return real publish DATE; do NOT include older posts; do NOT pad with today's date if unsure — drop instead" 4 点 英文硬规则 (LLM 对英文规则比中文敏感) + 中文一句话兜底。 要求 LLM 给 media_url (pbs.twimg.com / video.twimg.com 直链)。 2. **verifyMediaUrl (新增)**: curl -I -L -m 8s HEAD, host 必须是 pbs.twimg.com/video.twimg.com/ton.twitter.com, HTTP 200 + content-type image/* 或 video/* 才入库, 否则 imageUrl 置 "" (保 tweet 丢图)。防 LLM 幻觉假图链 — A 站 全替代 vs 兜底 pilot 的关键防线。 3. **normalize date 窗口双保险**: cutoffISO <= date <= todayISO 才保留。**date 缺失/非 YYYY-MM-DD 整条 drop** (不再 fallback 今天 — 通信龙 v1 read 时抓到的 anti-pattern,会把老帖伪装成新帖)。 drops 计数器: badUrl/wrongUser/emptyText/badDate/outOfWindow/ mediaVerifyFail。 4. **新鲜度自查日志**: summarize() 算 freshest/oldest/ageDays/ 日期分布 byDate/media-verified ratio; verbose 模式打 per-tweet curl HTTP code + content-type; freshest > 2d 老 stderr WARN (grok web 索引滞后 surface 出来,A 站要 last-24h 时能早发现)。 v2 实测 (2026-06-07, 3 账号 since=72h max-per=3): - 6 条 tweet, 全在 [2026-06-03, 2026-06-06] 窗口 - freshest=2026-06-05 (1d ago), 满足 A 站 last-24h target - date 分布: 2026-06-04 × 5, 2026-06-05 × 1 - media_url: LLM 返 1 条 video.twimg.com/amplify_video/.../mp4, curl HEAD 验证 content-type=video/mp4 → 入库; 其余 5 条 LLM 诚实地不返 media_url (符合预期, 不编造) - exit 0 / 串行 ~3m30s - 失败路径 (nonexistent account): tweets=[] + exit 1 README 同步更新: 字段规则 v2 收紧 + 实测 + 6 个新 gotcha (WARN 新鲜度 / imageUrl 多空 / curl 校验失败 drop / date 缺失 drop / outOfWindow drop / 慢)。 Author-Agent: 通信demo马 Dispatched-By: 通信龙 (commhub task 4c5085d3 HIGH + 48bc3d35/62ea048b spec 凑齐 + 8d6204b6 v1 review ping)
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 224969cade
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| const cutoff = new Date(now.getTime() - hours * 3600 * 1000); | ||
| return { cutoffISO: cutoff.toISOString().slice(0, 10), hours }; |
There was a problem hiding this comment.
Preserve hour precision for relative --since windows
When the documented cron path uses an hourly window such as --since 6h or the default 24h, this truncates the cutoff to YYYY-MM-DD, and normalize() later compares only that date string. Any Grok result from the cutoff calendar day is therefore accepted even if it is outside the requested number of hours (for example, a 23:00 run with --since 6h keeps posts from 00:01 the same day), which undermines the fallback's recency guarantee for A站's last-24h feed. The local guard needs hour-level timestamps, or the option should be documented/implemented as a day-granularity window.
Useful? React with 👍 / 👎.
| >/var/log/ai-insight/grok-fetch.log 2>&1 ; \ | ||
| test -s /var/lib/ai-insight/grok-fallback.json && \ |
There was a problem hiding this comment.
Gate the cron import on successful non-empty fetches
In this cron example, ; test -s ... runs regardless of the fetcher's exit status, and the fetcher writes a non-empty JSON file even for the zero-tweet failure case ({"nextId":...,"tweets":[]}), so the translator can still run after the script exits 1. Worse, if the fetcher exits before writing (e.g. missing grok), a stale non-empty fallback file can be re-imported. The sample should chain on the fetch command's success and check .tweets | length > 0 before invoking translate_and_upsert.js.
Useful? React with 👍 / 👎.
| return { ok: true, raw }; | ||
| } | ||
|
|
||
| const URL_RE = /^https:\/\/x\.com\/([^/]+)\/status\/(\d+)/; |
There was a problem hiding this comment.
Anchor status URL validation after the numeric ID
Because this regex is only prefix-anchored, a malformed Grok URL like https://x.com/sama/status/123abc or .../status/123/not-a-status passes validation, gets tweetId set to 123, and is then emitted as the canonical url. That bypasses the script's advertised hard guard that the status id is digits-only and browser-openable, so malformed or fabricated URLs can enter the fallback feed instead of being dropped.
Useful? React with 👍 / 👎.
Author & Helpers
demos/grok-x-search/参考实现, prompt 模板灵感)58afee64初派 →4c5085d3HIGH 验收红线 →48bc3d35+62ea048bspec 凑齐 →8d6204b6v1 review ping →983f9c0ev2 PASS + 转 A 站 pilot)背景
A 站 ai-insight 用
twitterapi.io抓 10 个锚定账号最新推文翻译入库。2026-06-03 起 twitterapi.io 返 HTTP 402 quota exceeded, 管线断流。 Vincent 决定用 grok-build CLI 0.2.x headless 模式 (已实证 native X 搜索能力, 见demos/grok-x-search/) 当兜底 → 全替代 pilot。通信龙独立 headless 实测确认: 只要 prompt 强制 recency window, grok 能拿 last-3-days 真帖 + 真 media_url (pbs.twimg.com/video.twimg.com)。
交付
demos/grok-news-fetcher/fetch_news_via_grok.jsdemos/grok-news-fetcher/README.md输出 schema 跟 A 站
auto_update_news.js --fetch-only一字不差 (12 字段:account, name, user, tweetId, text, date, likes, views, url, source, category, imageUrl)。Prompt 工程非显然处 (请通信牛 focus review)
不能纯硬"严格返 JSON" — 实测 v0 prompt 直接要 JSON, grok 会 EndTurn 不调工具直接给空数组
[]。改成 "先用 web_search/web_fetch 找, 再返 JSON" 双 phase 引导, grok 自主会用web_search(allowed_domains x.com) 甚至x_keyword_search工具拿数据。(buildPrompt L138-178)today + cutoff 双注入 + 英文硬规则 — 中文 LLM 对英文规则更敏感, 4 点 recency 用英文写:
Today is YYYY-MM-DD+since cutoff+MUST return real publish date+do NOT pad with today's date if unsure — drop instead. 中文一句话兜底。媒体 host 白名单 —
verifyMediaUrl只接受pbs.twimg.com / video.twimg.com / ton.twitter.com三个 host (twitter 官方 CDN), 防 LLM 给 imgur/cdn-cgi/伪域名等假链。(MEDIA_HOST_RE L185)date 缺失 = 整条 drop, 不 fallback today — 通信龙 v1 review 抓到的 anti-pattern: v1 normalize 里 date 不可解析时
new Date()盖今天, 把没日期的旧帖伪装成新帖混进窗口。v2 改if (!/^\d{4}-\d{2}-\d{2}$/.test(dateRaw)) { drops.badDate++; continue; }. (normalize L220 附近)date 窗口双保险 — prompt 已明示 cutoff, 但 LLM 不一定守; normalize 再加
cutoffISO <= date <= todayISO本地校验, 超窗 silent drop +drops.outOfWindow++. 宁可少一条不能假装新帖。media_url curl HEAD 三层闸:
-I -L --max-redirs 3 -m 8)image/*或video/*imageUrl置""(保 tweet 丢图), 不整条 drop, 不假装 fallback URL. A 站 全替代 pilot 关键防线.新鲜度自查 —
summarize()算 freshest/oldest/ageDays/byDate; freshest > 2d 老 stderrWARN: grok web index may be lagging. A 站要 last-24h, 这个 WARN 帮他们早 surface "grok 这时段拿不到当天新帖" 不入库。实测 (2026-06-07, v2 commit 224969c)
video/mp4→ 入库失败路径:
A 站 pilot 状态
红线遵守
Merge 时机
关联
demos/grok-x-search/(native X 搜索能力探测)docs/tests/p-grok-native-xsearch-e2e/report.md