Hacker News: dangerlego5

New comment by dangerlego5 in "ML research datasets from ArXiv and Semantic Scholar (JSONL, quality-scored)"

dangerlego5 — Tue, 16 Jun 2026 09:31:59 +0000

I kept rebuilding the same arXiv scraper at the start of every ML project. After the third time I wrote a dedup pipeline, I automated the whole thing.

The interesting part is that the pipeline is shared; if two people subscribe to the same topic, they share one crawl and one deduplicated record pool. Happy to talk through the pgvector dedup approach if anyone's curious.
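
A minimal sketch of the shared-pool idea described above, not the actual pipeline: every subscriber to a topic reads from a single per-topic record pool, and a record is stored only once. The comment mentions pgvector for dedup; this sketch swaps in a much simpler exact-match hash of the normalized title so it runs without a database. The names `TopicPools`, `record_key`, `subscribe`, `ingest`, and `records_for` are hypothetical.

```python
import hashlib
import re
from collections import defaultdict


def record_key(title: str) -> str:
    """Hash a case- and punctuation-insensitive form of the title.

    Assumption: a stand-in for the embedding-based dedup mentioned in the
    comment; it only catches titles that match after normalization.
    """
    normalized = re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class TopicPools:
    """One deduplicated record pool per topic, shared by all subscribers."""

    def __init__(self):
        self.subscribers = defaultdict(set)  # topic -> set of user ids
        self.records = defaultdict(dict)     # topic -> {dedup key: record}

    def subscribe(self, user: str, topic: str) -> None:
        self.subscribers[topic].add(user)

    def ingest(self, topic: str, record: dict) -> bool:
        """Add a crawled record; return False if it was already in the pool."""
        key = record_key(record["title"])
        if key in self.records[topic]:
            return False
        self.records[topic][key] = record
        return True

    def records_for(self, user: str, topic: str) -> list:
        """Return the shared pool for a topic the user subscribes to."""
        if user not in self.subscribers[topic]:
            return []
        return list(self.records[topic].values())


pools = TopicPools()
pools.subscribe("alice", "diffusion-models")
pools.subscribe("bob", "diffusion-models")

# The same paper seen twice in one crawl is stored once.
first = pools.ingest("diffusion-models", {"title": "Attention Is All You Need"})
second = pools.ingest("diffusion-models", {"title": "attention is all you need."})
```

Both subscribers get the same single record from `records_for`, which is the point of sharing one crawl per topic rather than one per user.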

ML research datasets from ArXiv and Semantic Scholar (JSONL, quality-scored)

dangerlego5 — Tue, 16 Jun 2026 09:31:18 +0000

Article URL: https://huggingface.co/fineset-io

Comments URL: https://news.ycombinator.com/item?id=48552726

Points: 3

# Comments: 1

New comment by dangerlego5 in "Claude Fable is relentlessly proactive"

dangerlego5 — Sat, 13 Jun 2026 11:46:02 +0000

The visual regression point is interesting. In my experience, the models that do best at catching "overlapping text/bad layout" problems are the ones fed actual screenshots rather than DOM snapshots. If Fable is doing screenshot-based diffs natively, that would explain an improvement there, but I haven't verified it.