<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: ccgreg</title><link>https://news.ycombinator.com/user?id=ccgreg</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Mon, 01 Jun 2026 18:12:21 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=ccgreg" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by ccgreg in "Free full BGP feed. IPv4 and IPv6 (2020)"]]></title><description><![CDATA[
<p>Good timing, I'm about to release that dataset.</p>
]]></description><pubDate>Sun, 31 May 2026 05:30:09 +0000</pubDate><link>https://news.ycombinator.com/item?id=48343285</link><dc:creator>ccgreg</dc:creator><comments>https://news.ycombinator.com/item?id=48343285</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48343285</guid></item><item><title><![CDATA[New comment by ccgreg in "Big tech's anti-labor playbook has come for Wikipedia"]]></title><description><![CDATA[
<p>Common Crawl is working hard to improve diversity in our crawl.</p>
]]></description><pubDate>Fri, 29 May 2026 01:08:31 +0000</pubDate><link>https://news.ycombinator.com/item?id=48317725</link><dc:creator>ccgreg</dc:creator><comments>https://news.ycombinator.com/item?id=48317725</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48317725</guid></item><item><title><![CDATA[New comment by ccgreg in "GPT-5.5"]]></title><description><![CDATA[
<p>I don't know of anyone who uses Common Crawl as pre-training data without filtering it. We have an annotation system that lets people pick and choose which subsets they'd like to use.</p>
]]></description><pubDate>Sun, 26 Apr 2026 22:33:45 +0000</pubDate><link>https://news.ycombinator.com/item?id=47915461</link><dc:creator>ccgreg</dc:creator><comments>https://news.ycombinator.com/item?id=47915461</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47915461</guid></item><item><title><![CDATA[New comment by ccgreg in "Ask HN: Scaling a targeted web crawler beyond 500M pages/day"]]></title><description><![CDATA[
<p>Common Crawl is a sample of the web, so it's not directly helpful for someone wanting to make a product price dataset.</p>
]]></description><pubDate>Sun, 26 Apr 2026 06:18:33 +0000</pubDate><link>https://news.ycombinator.com/item?id=47907853</link><dc:creator>ccgreg</dc:creator><comments>https://news.ycombinator.com/item?id=47907853</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47907853</guid></item><item><title><![CDATA[New comment by ccgreg in "Ask HN: Scaling a targeted web crawler beyond 500M pages/day"]]></title><description><![CDATA[
<p>I'm a life-long hacker, and my crawler crawls with consent.</p>
]]></description><pubDate>Sun, 26 Apr 2026 06:14:14 +0000</pubDate><link>https://news.ycombinator.com/item?id=47907826</link><dc:creator>ccgreg</dc:creator><comments>https://news.ycombinator.com/item?id=47907826</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47907826</guid></item><item><title><![CDATA[New comment by ccgreg in "Ask HN: What funding models exist for a search engine?"]]></title><description><![CDATA[
<p>The largest index we had was 4 billion, which is tiny. Our crawl frontier was much larger.</p>
]]></description><pubDate>Thu, 16 Apr 2026 15:57:32 +0000</pubDate><link>https://news.ycombinator.com/item?id=47795277</link><dc:creator>ccgreg</dc:creator><comments>https://news.ycombinator.com/item?id=47795277</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47795277</guid></item><item><title><![CDATA[New comment by ccgreg in "Ask HN: What funding models exist for a search engine?"]]></title><description><![CDATA[
<p>> and the data that I’ve experimented with from 2014 seemed high quality<p>That's because it's from the blekko search engine.</p>
]]></description><pubDate>Thu, 16 Apr 2026 01:22:29 +0000</pubDate><link>https://news.ycombinator.com/item?id=47787571</link><dc:creator>ccgreg</dc:creator><comments>https://news.ycombinator.com/item?id=47787571</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47787571</guid></item><item><title><![CDATA[New comment by ccgreg in "Scientists invented a fake disease. AI told people it was real"]]></title><description><![CDATA[
<p>That's already been happening for more than a year now.</p>
]]></description><pubDate>Fri, 10 Apr 2026 09:56:44 +0000</pubDate><link>https://news.ycombinator.com/item?id=47715766</link><dc:creator>ccgreg</dc:creator><comments>https://news.ycombinator.com/item?id=47715766</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47715766</guid></item><item><title><![CDATA[New comment by ccgreg in "A Change to Common Crawl Dataset Size Reporting"]]></title><description><![CDATA[
<p>Common Crawl is switching to reporting dataset sizes in nibbles. As an organisation dedicated to data preservation, we feel it would be remiss to allow this underrepresented unit to fall out of use. Our latest crawl now exceeds 689 tebibbles.
Common Crawl Foundation</p>
]]></description><pubDate>Wed, 01 Apr 2026 07:41:49 +0000</pubDate><link>https://news.ycombinator.com/item?id=47598020</link><dc:creator>ccgreg</dc:creator><comments>https://news.ycombinator.com/item?id=47598020</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47598020</guid></item><item><title><![CDATA[A Change to Common Crawl Dataset Size Reporting]]></title><description><![CDATA[
<p>Article URL: <a href="https://commoncrawl.org/blog/announcing-a-change-to-common-crawl-dataset-size-reporting">https://commoncrawl.org/blog/announcing-a-change-to-common-crawl-dataset-size-reporting</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=47598019">https://news.ycombinator.com/item?id=47598019</a></p>
<p>Points: 3</p>
<p># Comments: 1</p>
]]></description><pubDate>Wed, 01 Apr 2026 07:41:48 +0000</pubDate><link>https://commoncrawl.org/blog/announcing-a-change-to-common-crawl-dataset-size-reporting</link><dc:creator>ccgreg</dc:creator><comments>https://news.ycombinator.com/item?id=47598019</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47598019</guid></item><item><title><![CDATA[New comment by ccgreg in "21,864 Yugoslavian .yu domains"]]></title><description><![CDATA[
<p>The complete list hides in the web graph:<p><a href="https://data.commoncrawl.org/projects/hyperlinkgraph/cc-main-2026-jan-feb-mar/index.html" rel="nofollow">https://data.commoncrawl.org/projects/hyperlinkgraph/cc-main...</a><p>and the specific file that's every host we've seen in the latest 3 crawls is:<p><a href="https://data.commoncrawl.org/projects/hyperlinkgraph/cc-main-2026-jan-feb-mar/host/cc-main-2026-jan-feb-mar-host-ranks.txt.gz" rel="nofollow">https://data.commoncrawl.org/projects/hyperlinkgraph/cc-main...</a></p>
]]></description><pubDate>Fri, 27 Mar 2026 18:32:08 +0000</pubDate><link>https://news.ycombinator.com/item?id=47546499</link><dc:creator>ccgreg</dc:creator><comments>https://news.ycombinator.com/item?id=47546499</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47546499</guid></item><item><title><![CDATA[New comment by ccgreg in "90% of Claude-linked output going to GitHub repos w <2 stars"]]></title><description><![CDATA[
<p>> Common Crawl, with over one billion, nine hundred and seventy thousand web pages in their archive: 345TB.<p>Common Crawl is 300 billion webpages and 10 petabytes. I suppose your number is for one of our 122 crawls.</p>
]]></description><pubDate>Fri, 27 Mar 2026 02:00:12 +0000</pubDate><link>https://news.ycombinator.com/item?id=47538289</link><dc:creator>ccgreg</dc:creator><comments>https://news.ycombinator.com/item?id=47538289</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47538289</guid></item><item><title><![CDATA[New comment by ccgreg in "Meta's Omnilingual MT for 1,600 Languages"]]></title><description><![CDATA[
<p>Common Crawl has been running a low-resource language project for 1.5 years now -- it's a hard problem.</p>
]]></description><pubDate>Sat, 21 Mar 2026 20:21:13 +0000</pubDate><link>https://news.ycombinator.com/item?id=47470885</link><dc:creator>ccgreg</dc:creator><comments>https://news.ycombinator.com/item?id=47470885</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47470885</guid></item><item><title><![CDATA[New comment by ccgreg in "Mac mini will be made at a new facility in Houston"]]></title><description><![CDATA[
<p>The guts on the inside changed several times during that timespan.</p>
]]></description><pubDate>Tue, 24 Feb 2026 23:23:47 +0000</pubDate><link>https://news.ycombinator.com/item?id=47144962</link><dc:creator>ccgreg</dc:creator><comments>https://news.ycombinator.com/item?id=47144962</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47144962</guid></item><item><title><![CDATA[New comment by ccgreg in "News publishers limit Internet Archive access due to AI scraping concerns"]]></title><description><![CDATA[
<p>Well, yes, it is a bit distressing that ill-behaved crawlers are causing a lot of damage -- and collateral damage, too, when well-behaved bots get blocked.</p>
]]></description><pubDate>Sun, 15 Feb 2026 22:03:02 +0000</pubDate><link>https://news.ycombinator.com/item?id=47028106</link><dc:creator>ccgreg</dc:creator><comments>https://news.ycombinator.com/item?id=47028106</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47028106</guid></item><item><title><![CDATA[New comment by ccgreg in "News publishers limit Internet Archive access due to AI scraping concerns"]]></title><description><![CDATA[
<p>Please read our email reply. I have no idea if we received your request -- your HN username doesn’t match any request we have received.</p>
]]></description><pubDate>Sun, 15 Feb 2026 04:09:07 +0000</pubDate><link>https://news.ycombinator.com/item?id=47020973</link><dc:creator>ccgreg</dc:creator><comments>https://news.ycombinator.com/item?id=47020973</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47020973</guid></item><item><title><![CDATA[New comment by ccgreg in "News publishers limit Internet Archive access due to AI scraping concerns"]]></title><description><![CDATA[
<p>Oh, and thanks for letting me know that I need to add our reply to Wikipedia.</p>
]]></description><pubDate>Sun, 15 Feb 2026 03:38:14 +0000</pubDate><link>https://news.ycombinator.com/item?id=47020815</link><dc:creator>ccgreg</dc:creator><comments>https://news.ycombinator.com/item?id=47020815</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47020815</guid></item><item><title><![CDATA[New comment by ccgreg in "News publishers limit Internet Archive access due to AI scraping concerns"]]></title><description><![CDATA[
<p>Did you see our reply? Edit: by which I mean, we sent you an email that explains what we did and how to verify it. Did you not receive an email reply? If not, please contact us again.<p>Also, if your site has CC-BY-NC-SA markings, we have preserved them.</p>
]]></description><pubDate>Sun, 15 Feb 2026 02:40:55 +0000</pubDate><link>https://news.ycombinator.com/item?id=47020555</link><dc:creator>ccgreg</dc:creator><comments>https://news.ycombinator.com/item?id=47020555</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47020555</guid></item><item><title><![CDATA[New comment by ccgreg in "News publishers limit Internet Archive access due to AI scraping concerns"]]></title><description><![CDATA[
<p>Did you see our reply? <a href="https://commoncrawl.org/blog/setting-the-record-straight-common-crawls-commitment-to-transparency-fair-use-and-the-public-good" rel="nofollow">https://commoncrawl.org/blog/setting-the-record-straight-com...</a><p>Also, if your site has CC-BY-NC-SA markings, we have preserved them.</p>
]]></description><pubDate>Sun, 15 Feb 2026 02:30:55 +0000</pubDate><link>https://news.ycombinator.com/item?id=47020506</link><dc:creator>ccgreg</dc:creator><comments>https://news.ycombinator.com/item?id=47020506</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47020506</guid></item><item><title><![CDATA[New comment by ccgreg in "News publishers limit Internet Archive access due to AI scraping concerns"]]></title><description><![CDATA[
<p>That 20% number is for a limited list of relatively large news websites. If you include the long tail of news, the percentage of sites blocking is much smaller.</p>
]]></description><pubDate>Sun, 15 Feb 2026 01:45:06 +0000</pubDate><link>https://news.ycombinator.com/item?id=47020323</link><dc:creator>ccgreg</dc:creator><comments>https://news.ycombinator.com/item?id=47020323</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47020323</guid></item></channel></rss>