Hacker News: tomnicholas1

New comment by tomnicholas1 in "Goes-19 weather satellite enters Safe Hold mode"

tomnicholas1 — Thu, 16 Jul 2026 21:00:38 +0000

Yes! But if you want to share it with anyone else that would be great, since we're advocating for fairly radical changes within a big bureaucracy here, as I'm sure you will appreciate :)

New comment by tomnicholas1 in "Goes-19 weather satellite enters Safe Hold mode"

tomnicholas1 — Thu, 16 Jul 2026 15:57:34 +0000

Anyone interested in accessing GOES data at scale will find this interesting - I created a Zarr index over the 7 billion chunks of data in the GOES-16 archive.

https://www.earthmover.io/blog/virtual-zarr

New comment by tomnicholas1 in "Cloud-optimizing the GOES-16 satellite data archive without copying data"

tomnicholas1 — Thu, 04 Jun 2026 18:49:37 +0000

The last 2 years of my professional life have in some sense all been working towards this blog post.

It shows how you can make an enormous archive of public scientific data[1] much easier to use for everyone, without having to copy all the data, which would be prohibitively expensive at this scale.

I also made some sweet gifs and images of the Earth from space from the raw data so check those out!

[1] In this case satellite data, but this could work for data from many other fields - my background was originally nuclear fusion plasma physics, then oceanography.

Cloud-optimizing the GOES-16 satellite data archive without copying data

tomnicholas1 — Thu, 04 Jun 2026 18:49:37 +0000

Article URL: https://www.earthmover.io/blog/virtual-zarr/

Comments URL: https://news.ycombinator.com/item?id=48402970

Points: 10

# Comments: 1

New comment by tomnicholas1 in "Matadisco – Decentralized Data Discovery"

tomnicholas1 — Sat, 28 Mar 2026 16:03:45 +0000

Awesome to see this project here - it was partly inspired by my blog post (original is linked from the OP, but there's a slightly newer version on my personal site here[0]).

[0]: https://tom-nicholas.com/blog/2025/science-needs-a-social-ne...

Periodic Labs

tomnicholas1 — Fri, 06 Mar 2026 21:24:16 +0000

Article URL: https://periodic.com/

Comments URL: https://news.ycombinator.com/item?id=47281266

Points: 2

# Comments: 0

New comment by tomnicholas1 in "A distributed queue in a single JSON file on object storage"

tomnicholas1 — Tue, 24 Feb 2026 19:10:46 +0000

What you describe is very similar to how Icechunk[1] works. It works beautifully for transactional writes to "repos" containing PBs of scientific array data in object storage.

[1]: https://icechunk.io/en/latest/

New comment by tomnicholas1 in "Show HN: Streaming gigabyte medical images from S3 without downloading them"

tomnicholas1 — Sat, 17 Jan 2026 18:35:20 +0000

People have literally used Zarr for this - at one point Gemini used Zarr for checkpointing model weights. Not sure what the current fashion in that space is though.

It's definitely one of many fields that see convergent evolution towards something that just looks like Zarr. In fact you can use VirtualiZarr to parse HuggingFace's "SafeTensors" format [0].

[0]: https://github.com/zarr-developers/VirtualiZarr/pull/555
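To make the SafeTensors point concrete, here is a minimal sketch of why such a file maps naturally onto chunk references. It assumes the published safetensors layout: an 8-byte little-endian header length, a JSON header, then one raw byte buffer, with each tensor's `data_offsets` given relative to the start of that buffer. The function names are hypothetical and this is not the parser from the linked PR.

```python
import json
import struct


def build_safetensors_like(tensors):
    """Serialize {name: (dtype, shape, raw_bytes)} in a safetensors-style layout."""
    header = {}
    payload = b""
    for name, (dtype, shape, raw) in tensors.items():
        begin = len(payload)
        payload += raw
        header[name] = {
            "dtype": dtype,
            "shape": shape,
            "data_offsets": [begin, len(payload)],  # relative to the data buffer
        }
    header_bytes = json.dumps(header).encode("utf-8")
    return struct.pack("<Q", len(header_bytes)) + header_bytes + payload


def tensor_byte_ranges(blob):
    """Return {name: (absolute_offset, length)} by reading only the header."""
    (header_len,) = struct.unpack("<Q", blob[:8])
    header = json.loads(blob[8 : 8 + header_len])
    data_start = 8 + header_len
    ranges = {}
    for name, meta in header.items():
        if name == "__metadata__":  # optional free-form metadata entry
            continue
        begin, end = meta["data_offsets"]
        ranges[name] = (data_start + begin, end - begin)
    return ranges
```

Once the header has been read, every tensor is just an (offset, length) pair inside the file, which is the same information a virtual chunk reference carries.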

New comment by tomnicholas1 in "Show HN: Streaming gigabyte medical images from S3 without downloading them"

tomnicholas1 — Sat, 17 Jan 2026 17:53:44 +0000

IMO Zarr is that newer format. It abstracts over the features of all these other formats so neatly that it can literally subsume them.

I feel that we no longer really need TIFF etc. - for scientific use cases in the cloud Zarr is all that's needed going forwards. The other file formats become just archival blobs that either are converted to Zarr or pointed at by virtual Zarr stores.

New comment by tomnicholas1 in "Show HN: Streaming gigabyte medical images from S3 without downloading them"

tomnicholas1 — Sat, 17 Jan 2026 16:22:14 +0000

The generalized form of this range-request-based streaming approach looks something like my project VirtualiZarr [0].

Many of these scientific file formats (HDF5, netCDF, TIFF/COG, FITS, GRIB, JPEG and more) are essentially just contiguous multidimensional array(/"tensor") chunks embedded alongside metadata about what's in the chunks. Efficiently fetching these from object storage is just about efficiently fetching the metadata up front so you know where the chunks you want are [1].

The data model of Zarr [2] generalizes this pattern pretty well, so that when backed by Icechunk [3], you can store a "datacube" of "virtual chunk references" that point at chunks anywhere inside the original files on S3.
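As a rough illustration of the "virtual chunk reference" idea, here is a stdlib-only sketch. Everything in it is a stand-in: local files play the role of objects on S3, `read_range` plays the role of an HTTP range request, and the manifest is a plain dict rather than anything VirtualiZarr or Icechunk actually stores.

```python
import os
import tempfile


def read_range(path, offset, length):
    """Stand-in for a range request: read `length` bytes starting at `offset`."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


def write_archival_file(path, header, chunks):
    """Write a header followed by contiguous chunks.

    Returns {chunk_index: (path, offset, length)}, i.e. where each chunk landed.
    """
    refs = {}
    with open(path, "wb") as f:
        f.write(header)
        for index, data in chunks.items():
            refs[index] = (path, f.tell(), len(data))
            f.write(data)
    return refs


def fetch_chunk(manifest, index):
    """Look up a chunk's location in the manifest, then read only those bytes."""
    path, offset, length = manifest[index]
    return read_range(path, offset, length)


def demo():
    """Build a manifest spanning two files and fetch one chunk from each."""
    with tempfile.TemporaryDirectory() as directory:
        manifest = {}
        # Two "original files", each holding two chunks along a time axis.
        manifest.update(write_archival_file(
            os.path.join(directory, "day1.bin"), b"HDR1",
            {(0,): b"aaaa", (1,): b"bbbb"}))
        manifest.update(write_archival_file(
            os.path.join(directory, "day2.bin"), b"HEADER-TWO",
            {(2,): b"cccc", (3,): b"dddd"}))
        return {index: fetch_chunk(manifest, index) for index in [(1,), (2,)]}
```

The original files are never rewritten. The manifest is the only new artifact, and it presents chunks from many files as one logical array.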

This allows you to stream data out as fast as the S3 network connection allows [4], and then you're free to pull that directly, or build tile servers on top of it [5].

In the Pangeo project and at Earthmover we do all this for Weather and Climate science data. But the underlying OSS stack is domain-agnostic, so works for all sorts of multidimensional array data, and VirtualiZarr has a plugin system for parsing different scientific file formats.

I would love to see if someone could create a virtual Zarr store pointing at this WSI data!

[0]: https://virtualizarr.readthedocs.io/en/stable/

[1]: https://earthmover.io/blog/fundamentals-what-is-cloud-optimi...

[2]: https://earthmover.io/blog/what-is-zarr

[3]: https://earthmover.io/blog/icechunk-1-0-production-grade-clo...

[4]: https://earthmover.io/blog/i-o-maxing-tensors-in-the-cloud

[5]: https://earthmover.io/blog/announcing-flux

New comment by tomnicholas1 in "Programmers and software developers lost the plot on naming their tools"

tomnicholas1 — Fri, 12 Dec 2025 15:39:50 +0000

God this article is 10000% better than the posted one. This is great:

> Names should not describe what you currently think the thing you’re naming is for. Imagine naming your newborn child "Doctor", or "SupportsMeInMyOldAge". Poor kid.

New comment by tomnicholas1 in "F3: Open-source data file format for the future [pdf]"

tomnicholas1 — Thu, 02 Oct 2025 14:37:08 +0000

Thank you for the explanation! But what a mess.

I would love to bring these benefits to the multidimensional array world, via integration with the Zarr/Icechunk formats somehow (which I work on). But this fragmentation of formats makes it very hard to know where to start.

New comment by tomnicholas1 in "F3: Open-source data file format for the future [pdf]"

tomnicholas1 — Thu, 02 Oct 2025 01:17:25 +0000

The pitch for this sounds very similar to the pitch for Vortex (i.e. obviating the need to create a new format every time a shift occurs in data processing and computing by providing a data organization structure and a general-purpose API to allow developers to add new encoding schemes easily).

But I'm not totally clear what the relationship between F3 and Vortex is. It says their prototype uses the encoding implementation in Vortex, but does not use the Vortex type system?

New comment by tomnicholas1 in "Progress toward fusion energy gain as measured against the Lawson criteria"

tomnicholas1 — Thu, 08 May 2025 18:19:10 +0000

The really depressing part is that if you plot the rate of new delays against real time elapsed, the projected finishing date is even further away.

This is why much of the fusion research community feel disillusioned with ITER, and so are more interested in these smaller (and supposedly more "agile") machines with high-temperature superconductors instead.

New comment by tomnicholas1 in "Progress toward fusion energy gain as measured against the Lawson criteria"

tomnicholas1 — Thu, 08 May 2025 17:17:26 +0000

Presumably because everyone in MCF has been waiting for ITER for decades, and JET is being decommissioned after a last gasp. Every other tokamak is considerably smaller (or of similar size, like DIII-D or JT-60SA).

Many of the interesting tokamak engineering ideas were on small (so low-power) machines or just concepts using high-temperature superconducting magnets.

New comment by tomnicholas1 in "What Is Cloud-Optimized Scientific Data?"

tomnicholas1 — Thu, 17 Apr 2025 17:59:31 +0000

I wrote the article I wish I could have read back when I first heard of Zarr and cloud-native science back in 2018.

This explains how object storage and conventional filesystems are different, and the key properties that make Zarr work so well in cloud object storage.

What Is Cloud-Optimized Scientific Data?

tomnicholas1 — Thu, 17 Apr 2025 17:59:31 +0000

Article URL: https://earthmover.io/blog/fundamentals-what-is-cloud-optimized-scientific-data

Comments URL: https://news.ycombinator.com/item?id=43720227

Points: 3

# Comments: 1

New comment by tomnicholas1 in "What Is Entropy?"

tomnicholas1 — Mon, 14 Apr 2025 22:41:22 +0000

Isn't that more about enumerating the microstates? The Pauli exclusion principle just ends up forbidding some of the microstates (forbidding a significant fraction of them if you're in the low-temperature regime).
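A small counting sketch of that point, using the standard combinatorial formulas for N identical particles spread over g single-particle states. Treating "low temperature" as "few accessible states per particle" is a simplifying assumption made here for illustration, and the function names are hypothetical.

```python
from math import comb, log


def boson_microstates(g, n):
    """Ways to place n identical bosons in g states (any occupancy allowed)."""
    return comb(g + n - 1, n)


def fermion_microstates(g, n):
    """Ways to place n identical fermions in g states (at most one per state)."""
    return comb(g, n)


def forbidden_fraction(g, n):
    """Fraction of the unrestricted microstates that exclusion rules out."""
    return 1 - fermion_microstates(g, n) / boson_microstates(g, n)


def entropy_in_units_of_k(microstates):
    """Boltzmann entropy S / k_B = ln(W)."""
    return log(microstates)


# Few accessible states per particle: exclusion removes a large share.
crowded = forbidden_fraction(g=4, n=2)    # 1 - 6/10
# Many accessible states per particle: exclusion barely matters.
dilute = forbidden_fraction(g=1000, n=2)  # 1 - 499500/500500
```

In both cases the entropy is still the log of a microstate count. Exclusion only changes which microstates are allowed to be counted.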

New comment by tomnicholas1 in "What Is Entropy?"

tomnicholas1 — Mon, 14 Apr 2025 21:56:25 +0000

Yes, that assumption is called the Ergodic Hypothesis [1], and it is generally justified in undergraduate statistical mechanics courses by proving and appealing to Liouville's theorem.

[1] https://en.wikipedia.org/wiki/Ergodic_hypothesis

New comment by tomnicholas1 in "Tensors vs. Tables: Why tabular tools trip over gridded data"

tomnicholas1 — Thu, 03 Apr 2025 18:07:26 +0000

The scientific community works primarily with array (or "tensor") data, using tools like numpy, xarray, and zarr. People familiar with modern tabular data tools such as DuckDB and Parquet often ask why we can't just use those. This article explains why: it's massively inefficient to use tabular tools on array data. It demonstrates this with a benchmark showing a 10x difference in query speed.
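A toy sketch of the structural difference, in plain Python. It does not reproduce the article's benchmark, and real tabular engines are far smarter than the linear scan below. The only point is that an array encodes coordinates implicitly in position, whereas a table has to store and search them explicitly. All names are hypothetical.

```python
def make_grid(shape):
    """Flat row-major buffer for a (time, y, x) grid; coordinates are implicit."""
    nt, ny, nx = shape
    return [float(i) for i in range(nt * ny * nx)]


def grid_lookup(flat, shape, t, y, x):
    """Locate a value by index arithmetic alone, with no searching."""
    _, ny, nx = shape
    return flat[(t * ny + y) * nx + x]


def to_table(flat, shape):
    """The same data as rows of (t, y, x, value); coordinates are now stored."""
    nt, ny, nx = shape
    return [
        (t, y, x, flat[(t * ny + y) * nx + x])
        for t in range(nt)
        for y in range(ny)
        for x in range(nx)
    ]


def table_lookup(rows, t, y, x):
    """Locate a value by comparing coordinate columns row by row."""
    for row_t, row_y, row_x, value in rows:
        if (row_t, row_y, row_x) == (t, y, x):
            return value
    raise KeyError((t, y, x))


shape = (3, 4, 5)
grid = make_grid(shape)
table = to_table(grid, shape)
```

In this toy the table holds four numbers per cell where the array holds one, and a point query on the array is a single multiply-add rather than a search.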