Hacker News: pronoiac

New comment by pronoiac in "1 Trillion Web Pages Archived"

pronoiac — Mon, 06 Oct 2025 13:43:35 +0000

I think SciOp is doing something in that area, with a catalog site and webseeds. https://sciop.net/

New comment by pronoiac in "1 Trillion Web Pages Archived"

pronoiac — Mon, 06 Oct 2025 13:31:15 +0000

Archive Team (which is not part of the Internet Archive) worked on a distributed backup of a portion of the Internet Archive: https://wiki.archiveteam.org/index.php/INTERNETARCHIVE.BAK

It's been dormant / on hiatus for a few years now.

New comment by pronoiac in "Building the heap: racking 30 petabytes of hard drives for pretraining"

pronoiac — Wed, 01 Oct 2025 17:10:25 +0000

I wonder if they'll go with "toploaders" - like Backblaze Storage Pods - later. They have better density and faster setup, as they don't have to screw in every drive.

They got used drives. I wonder if they did any testing? I've gotten used drives that were DOA, which showed up in tests - SMART tests, short and long, then writing pseudorandom data to verify capacity.
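A minimal sketch of the "write pseudorandom data, then verify" step, for illustration only. The function names, the seed, and the 1 MiB chunk size are my own assumptions, not a specific tool's behavior, and `random.Random.randbytes` needs Python 3.9 or later. Pointing it at a raw device would be destructive, so treat the path as a scratch target.

```python
import os
import random

CHUNK = 1024 * 1024  # 1 MiB per write; an arbitrary choice for this sketch


def _stream(seed, total_bytes, chunk=CHUNK):
    """Yield a reproducible pseudorandom byte stream in fixed-size chunks."""
    rng = random.Random(seed)
    remaining = total_bytes
    while remaining > 0:
        n = min(chunk, remaining)
        yield rng.randbytes(n)
        remaining -= n


def write_pattern(path, total_bytes, seed=0):
    """Fill `path` with `total_bytes` of seeded pseudorandom data."""
    with open(path, "wb") as f:
        for block in _stream(seed, total_bytes):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the data to the device


def verify_pattern(path, total_bytes, seed=0):
    """Return the offset of the first mismatching chunk, or None if all match."""
    offset = 0
    with open(path, "rb") as f:
        for expected in _stream(seed, total_bytes):
            actual = f.read(len(expected))
            if actual != expected:
                return offset
            offset += len(expected)
    return None
```

Non-repeating data matters here: a drive that misreports its capacity and wraps writes around would pass a test that writes the same block everywhere. On a real drive the read-back can also be served from the OS page cache, so the verify pass is only meaningful if the data is larger than RAM or the cache is bypassed.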

New comment by pronoiac in "Apple Notes Will Gain Markdown Export at WWDC, and, I Have Thoughts"

pronoiac — Thu, 05 Jun 2025 14:37:51 +0000

There's a flamewar detector, which triggers when there are far more comments than upvotes.

New comment by pronoiac in "Find the Odd Disk"

pronoiac — Mon, 21 Apr 2025 05:36:28 +0000

Kern Type, perhaps? https://type.method.ac/

New comment by pronoiac in "Microsoft’s original source code"

pronoiac — Fri, 04 Apr 2025 21:26:13 +0000

Feel free to run EasyOCR against it and submit a PR

New comment by pronoiac in "Microsoft’s original source code"

pronoiac — Fri, 04 Apr 2025 07:26:07 +0000

I attempted OCR, and while it's not great, it's a start. I considered adding a reference to "software wants to be free!" or the Open Letter, but I'm winding down for the night. https://github.com/pronoiac/altair-basic-source-code

New comment by pronoiac in "Microsoft’s original source code"

pronoiac — Fri, 04 Apr 2025 07:22:13 +0000

I attempted OCR with OCRmyPDF / Tesseract. It's not great, but it's under 1% of the size, at least. https://github.com/pronoiac/altair-basic-source-code

New comment by pronoiac in "Testing DVD-R and CD-R 25 years later: optical disks from Japan"

pronoiac — Wed, 02 Apr 2025 06:23:05 +0000

Checking diskprices.com - https://diskprices.com/?locale=us&condition=new,used&disk_ty... - there's a cheaper outlier for DVD-R, then it's 25GB BD-Rs for a bit.

LTO tape can be cheaper, but the cost of the drives has long been an obstacle to dabbling.

New comment by pronoiac in "Show HN: OCR Benchmark Focusing on Automation"

pronoiac — Sat, 15 Mar 2025 02:44:33 +0000

I've used ocrit, which uses those APIs. https://github.com/insidegui/ocrit

There are also:

* swiftocr - https://github.com/fny/swiftocr

* macos-vision-ocr - https://github.com/bytefer/macos-vision-ocr

New comment by pronoiac in "Fediverse Donut Club"

pronoiac — Fri, 14 Mar 2025 22:12:10 +0000

They asked for something like Bluesky starter packs on Mastodon, not Bluesky starter packs on Bluesky.

New comment by pronoiac in "Fediverse Donut Club"

pronoiac — Fri, 14 Mar 2025 16:32:07 +0000

I knew I'd seen something, so I just searched for it; Fedidevs has something like that - https://fedidevs.com/starter-packs/

New comment by pronoiac in "Ask HN: Where are the good Markdown to PDF tools (that meet these requirements)?"

pronoiac — Sun, 02 Mar 2025 19:14:49 +0000

I think Pandoc and Calibre could work for you.

I've worked on PAIP, Paradigms of Artificial Intelligence Programming, and I might be able to help you a bit. It's around 1k pages long. I used Pandoc to generate an epub file, and then Calibre to turn that into a PDF file. I just tried using Pandoc to generate the PDF file directly, and it/LaTeX choked on some Unicode characters.

For internal ebook links, there's a Lua script. You'll have to keep anchors unique across the book for this:

* good: "chapter1#section1_1" and "chapter2#section2_1"

* bad: "chapter1#section1" and "chapter2#section1"

WIP: https://github.com/norvig/paip-lisp/pull/195
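A quick way to check the uniqueness constraint above before building the book, as a rough sketch. This is not the Lua script from the PR; `find_duplicate_anchors` and the dictionary input shape are hypothetical, and extracting the anchors from your Markdown files is left out.

```python
from collections import defaultdict


def find_duplicate_anchors(anchors_by_file):
    """Map each anchor id defined in more than one place to the files defining it."""
    seen = defaultdict(list)
    for filename, anchors in anchors_by_file.items():
        for anchor in anchors:
            seen[anchor].append(filename)
    return {anchor: files for anchor, files in seen.items() if len(files) > 1}


# The "good" layout: every anchor id is unique across the whole book.
good = {"chapter1": ["section1_1"], "chapter2": ["section2_1"]}
# The "bad" layout: both chapters define an anchor called "section1".
bad = {"chapter1": ["section1"], "chapter2": ["section1"]}

print(find_duplicate_anchors(good))  # {}
print(find_duplicate_anchors(bad))   # {'section1': ['chapter1', 'chapter2']}
```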

For line wrapping of code, there's CSS. I first used it over on "Writing an Operating System in 1,000 Lines"; here's the PR: https://github.com/nuta/operating-system-in-1000-lines/pull/...

New comment by pronoiac in "How to run GUI applications directly in containers"

pronoiac — Thu, 27 Feb 2025 18:24:23 +0000

I've run an X app from Docker, a Linux container on a macOS host. I was able to move the incantations to a Makefile: https://github.com/ryanfb/docker_scantailor

New comment by pronoiac in "These years in Common Lisp: 2023-2024 in review"

pronoiac — Sat, 22 Feb 2025 15:17:20 +0000

I've worked on PAIP, and I think the GitHub.com version - https://github.com/norvig/paip-lisp/ - gets more attention than the GitHub.io version linked here. The GitHub.io version automatically gets updates, I think, but I haven't been verifying that the Markdown renders correctly over there.

New comment by pronoiac in "Ask HN: What is the best method for turning a scanned book as a PDF into text?"

pronoiac — Sun, 16 Feb 2025 18:43:22 +0000

It's still in progress! It's looong - about a thousand pages. There's an ebook, but the printed book got more editing.

New comment by pronoiac in "Ask HN: What is the best method for turning a scanned book as a PDF into text?"

pronoiac — Sun, 16 Feb 2025 18:08:50 +0000

I made a high-quality scan of PAIP (Paradigms of Artificial Intelligence Programming), and worked on OCR'ing and incorporating that into an admittedly imperfect git repo of Markdown files. I used Scantailor to deskew and do other adjustments before applying Tesseract, via OCRmyPDF. I wrote notes for some of my process over at https://github.com/norvig/paip-lisp/releases/tag/v1.2 .

I'd also tried ocrit, which uses Apple's Vision framework for OCR, with some success - https://github.com/insidegui/ocrit

It's an ongoing, iterative process. I'll watch this thread with interest.

Some recent threads that might be helpful:

* https://news.ycombinator.com/item?id=42443022 - Show HN: Adventures in OCR

* https://news.ycombinator.com/item?id=43045801 - Benchmarking vision-language models on OCR in dynamic video environments - driscoll42 posted some stats from research

* https://news.ycombinator.com/item?id=43043671 - OCR4all

(Meaning, I have these browser tabs open, I haven't fully digested them yet)

New comment by pronoiac in "Douglas McIlroy responds to Unix spell article with new implementation details"

pronoiac — Sun, 09 Feb 2025 23:37:49 +0000

> The compression was trivial: store a suffix preceded by one byte that contained the length of the prefix that the word shared with its predecessor in dictionary order.

Oh, that looks familiar; the database for the locate command uses something similar - https://www.gnu.org/software/findutils/manual/html_node/find...
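For anyone who hasn't seen the technique in the quote (often called front coding), here's a small sketch. It's an illustration of the idea only: it is not McIlroy's code and not locate's actual on-disk format, and it keeps (count, suffix) tuples rather than packing bytes.

```python
def front_encode(words):
    """Encode a sorted word list as (shared_prefix_length, suffix) pairs."""
    encoded = []
    previous = ""
    for word in words:
        # Count leading characters shared with the previous word, capped at
        # 255 so the count would fit in the single byte the quote describes.
        limit = min(len(previous), len(word), 255)
        shared = 0
        while shared < limit and previous[shared] == word[shared]:
            shared += 1
        encoded.append((shared, word[shared:]))
        previous = word
    return encoded


def front_decode(encoded):
    """Rebuild the word list by reusing the prefix of the previous word."""
    words = []
    previous = ""
    for shared, suffix in encoded:
        word = previous[:shared] + suffix
        words.append(word)
        previous = word
    return words


words = ["spell", "spelled", "speller", "spelling", "spend"]
print(front_encode(words))
# [(0, 'spell'), (5, 'ed'), (6, 'r'), (5, 'ing'), (3, 'nd')]
```

It works well on sorted input because neighbors share long prefixes, which is the case for both a dictionary and a sorted list of file paths.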

New comment by pronoiac in "The Taylorator – All Your Frequencies Are Belong to Us"

pronoiac — Tue, 28 Jan 2025 02:07:04 +0000

Covering all frequencies? No Blank Space?

New comment by pronoiac in "Migrating Away from Bcachefs"

pronoiac — Fri, 24 Jan 2025 05:17:55 +0000

> Maybe yours did much worse because you aren't splitting files into subdirectories but creating them all in one?

No, and I'd expect that to be awful. I used 1000 folders, each with 1000 folders, each with 1000 files.

Those Arxiv and Phoronix links are great!