Hacker News: ccgreg

New comment by ccgreg in "Free full BGP feed. IPv4 and IPv6 (2020)"

ccgreg — Sun, 31 May 2026 05:30:09 +0000

Good timing, I'm about to release that dataset.

New comment by ccgreg in "Big tech's anti-labor playbook has come for Wikipedia"

ccgreg — Fri, 29 May 2026 01:08:31 +0000

Common Crawl is working hard to improve diversity in our crawl.

New comment by ccgreg in "GPT-5.5"

ccgreg — Sun, 26 Apr 2026 22:33:45 +0000

I don't know of anyone who uses Common Crawl as pre-training data without filtering it. We have an annotation system that lets people pick and choose which subsets they'd like to use.

New comment by ccgreg in "Ask HN: Scaling a targeted web crawler beyond 500M pages/day"

ccgreg — Sun, 26 Apr 2026 06:18:33 +0000

Common Crawl is a sample of the web, so it's not that directly helpful for someone wanting to make a product price dataset.

New comment by ccgreg in "Ask HN: Scaling a targeted web crawler beyond 500M pages/day"

ccgreg — Sun, 26 Apr 2026 06:14:14 +0000

I'm a life-long hacker, and my crawler crawls with consent.

New comment by ccgreg in "Ask HN: What funding models exist for a search engine?"

ccgreg — Thu, 16 Apr 2026 15:57:32 +0000

The largest index we had was 4 billion, which is tiny. Our crawl frontier was much larger.

New comment by ccgreg in "Ask HN: What funding models exist for a search engine?"

ccgreg — Thu, 16 Apr 2026 01:22:29 +0000

> and the data that I’ve experimented with from 2014 seemed high quality

That's because it's from the blekko search engine.

New comment by ccgreg in "Scientists invented a fake disease. AI told people it was real"

ccgreg — Fri, 10 Apr 2026 09:56:44 +0000

That's already been happening for more than a year now.

New comment by ccgreg in "A Change to Common Crawl Dataset Size Reporting"

ccgreg — Wed, 01 Apr 2026 07:41:49 +0000

Common Crawl is switching to reporting dataset sizes in nibbles. As an organisation dedicated to data preservation, we feel it would be remiss to allow this underrepresented unit to fall out of use. Our latest crawl now exceeds 689 tebibbles. Common Crawl Foundation

A Change to Common Crawl Dataset Size Reporting

ccgreg — Wed, 01 Apr 2026 07:41:48 +0000

Article URL: https://commoncrawl.org/blog/announcing-a-change-to-common-crawl-dataset-size-reporting

Comments URL: https://news.ycombinator.com/item?id=47598019

Points: 3

# Comments: 1

New comment by ccgreg in "21,864 Yugoslavian .yu domains"

ccgreg — Fri, 27 Mar 2026 18:32:08 +0000

The complete list hides in the web graph:

https://data.commoncrawl.org/projects/hyperlinkgraph/cc-main...

and the specific file that's every host we've seen in the latest 3 crawls is:

https://data.commoncrawl.org/projects/hyperlinkgraph/cc-main...
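For anyone wanting to pull one TLD's hosts out of such a host list, here is a minimal sketch. It assumes each line is a tab-separated numeric id and a hostname in reversed-domain order (for example `yu.co.example`). That layout is an assumption, so check the actual file first. The function name and the sample lines are made up for illustration, and the truncated URLs above are left as they are.

```python
def hosts_with_tld(lines, tld):
    # Assumed line layout: "<id><TAB><reversed hostname>", e.g. "1<TAB>yu.co.example".
    # Yields matching hostnames in normal (un-reversed) order.
    prefix = tld.lower().lstrip(".") + "."
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 2:
            continue  # skip lines that do not match the assumed layout
        reversed_host = parts[1].lower()
        if reversed_host.startswith(prefix):
            yield ".".join(reversed(reversed_host.split(".")))


# Made-up sample lines, not real data from the web graph.
sample = [
    "0\tcom.example",
    "1\tyu.co.example",
    "2\tyu.ac.bg.www",
    "3\tyugo.example",
]
yu_hosts = list(hosts_with_tld(sample, "yu"))
print(yu_hosts)
```

On a real file you would pass an open (and, if needed, decompressed) file handle instead of `sample`, so the list is streamed line by line rather than loaded into memory.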

New comment by ccgreg in "90% of Claude-linked output going to GitHub repos w <2 stars"

ccgreg — Fri, 27 Mar 2026 02:00:12 +0000

> Common Crawl, with over one billion, nine hundred and seventy thousand web pages in their archive: 345TB.

Common Crawl is 300 billion webpages and 10 petabytes. I suppose your number is for just one of our 122 crawls.

New comment by ccgreg in "Meta's Omnilingual MT for 1,600 Languages"

ccgreg — Sat, 21 Mar 2026 20:21:13 +0000

Common Crawl has been running a low-resource language project for 1.5 years now -- it's a hard problem.

New comment by ccgreg in "Mac mini will be made at a new facility in Houston"

ccgreg — Tue, 24 Feb 2026 23:23:47 +0000

The guts on the inside changed several times during that timespan.

New comment by ccgreg in "News publishers limit Internet Archive access due to AI scraping concerns"

ccgreg — Sun, 15 Feb 2026 22:03:02 +0000

Well, yes, it is a bit distressing that ill-behaved crawlers are causing a lot of damage -- and collateral damage, too, when well-behaved bots get blocked.

New comment by ccgreg in "News publishers limit Internet Archive access due to AI scraping concerns"

ccgreg — Sun, 15 Feb 2026 04:09:07 +0000

Please read our email reply. I have no idea if we received your request -- your HN username doesn’t match any request we have received.

New comment by ccgreg in "News publishers limit Internet Archive access due to AI scraping concerns"

ccgreg — Sun, 15 Feb 2026 03:38:14 +0000

Oh, and thanks for letting me know that I need to add our reply to Wikipedia.

New comment by ccgreg in "News publishers limit Internet Archive access due to AI scraping concerns"

ccgreg — Sun, 15 Feb 2026 02:40:55 +0000

Did you see our reply? Edit: by which I mean, we sent you an email that explains what we did and how to verify it. Did you not receive an email reply? If not, please contact us again.

Also, if your site has CC-BY-NC-SA markings, we have preserved them.

New comment by ccgreg in "News publishers limit Internet Archive access due to AI scraping concerns"

ccgreg — Sun, 15 Feb 2026 02:30:55 +0000

Did you see our reply? https://commoncrawl.org/blog/setting-the-record-straight-com...

Also, if your site has CC-BY-NC-SA markings, we have preserved them.

New comment by ccgreg in "News publishers limit Internet Archive access due to AI scraping concerns"

ccgreg — Sun, 15 Feb 2026 01:45:06 +0000

That 20% number is for a limited list of relatively large news websites. If you include the long tail of news sites, the percentage that block is much smaller.