<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: dangerlego5</title><link>https://news.ycombinator.com/user?id=dangerlego5</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Sun, 21 Jun 2026 10:08:26 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=dangerlego5" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by dangerlego5 in "ML research datasets from ArXiv and Semantic Scholar (JSONL, quality-scored)"]]></title><description><![CDATA[
<p>I kept rebuilding the same arXiv scraper at the start of every ML project. After the third time I wrote a dedup pipeline, I automated the whole thing.</p><p>The interesting part is that the pipeline is shared; if two people subscribe to the same topic, they share one crawl and one deduplicated record pool. Happy to talk through the pgvector dedup approach if anyone's curious.</p>
]]></description><pubDate>Tue, 16 Jun 2026 09:31:59 +0000</pubDate><link>https://news.ycombinator.com/item?id=48552729</link><dc:creator>dangerlego5</dc:creator><comments>https://news.ycombinator.com/item?id=48552729</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48552729</guid></item><item><title><![CDATA[ML research datasets from ArXiv and Semantic Scholar (JSONL, quality-scored)]]></title><description><![CDATA[
<p>Article URL: <a href="https://huggingface.co/fineset-io">https://huggingface.co/fineset-io</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=48552726">https://news.ycombinator.com/item?id=48552726</a></p>
<p>Points: 3</p>
<p># Comments: 1</p>
]]></description><pubDate>Tue, 16 Jun 2026 09:31:18 +0000</pubDate><link>https://huggingface.co/fineset-io</link><dc:creator>dangerlego5</dc:creator><comments>https://news.ycombinator.com/item?id=48552726</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48552726</guid></item><item><title><![CDATA[New comment by dangerlego5 in "Claude Fable is relentlessly proactive"]]></title><description><![CDATA[
<p>The visual regression point is interesting. In my experience, the models that do best at "overlapping text/bad layout" catches are the ones being fed actual screenshots rather than DOM snapshots. If Fable is doing screenshot-based diffs natively, that would explain an improvement there, but I haven't verified it.</p>
]]></description><pubDate>Sat, 13 Jun 2026 11:46:02 +0000</pubDate><link>https://news.ycombinator.com/item?id=48516298</link><dc:creator>dangerlego5</dc:creator><comments>https://news.ycombinator.com/item?id=48516298</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48516298</guid></item></channel></rss>