<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: platypii</title><link>https://news.ycombinator.com/user?id=platypii</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Thu, 09 Apr 2026 11:10:25 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=platypii" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by platypii in "A visual explainer of how to scroll through billions of rows in the browser"]]></title><description><![CDATA[
<p>Sylvain Lesage wrote a cool interactive explainer on visualizing extreme row counts (think billions) inside the browser. His technical deep dive explains how the open-source library HighTable works around scrollbar limits by:<p>- Lazy loading
- Virtual scrolling (allows millions of rows)
- "Infinite Pixel Technique" (allows billions of rows)<p>Hyperparam sponsored Sylvain’s work as part of our broader effort to invest in open-source infrastructure and get ahead of the data-scale problems emerging with LLMs. With a regular table you can view thousands of rows, but the browser breaks down pretty quickly. We created HighTable with virtual scroll so you can see millions of rows, but that still wasn’t enough for massive unstructured datasets. What Sylvain has built virtualizes the virtual scroll, so you can literally view billions of rows, all inside the browser. His write-up goes deep into the mechanics of building a ridiculously large-scale table component in React.</p>
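<p>The two techniques listed above can be sketched in a few lines of plain JavaScript. This is an illustrative sketch only, not HighTable's actual implementation: the MAX_SCROLL_HEIGHT cap and the rescaling formula are assumptions standing in for the real "infinite pixel" logic.</p>

```javascript
// Virtual scrolling: render only the rows that intersect the viewport.
// Browsers cap how tall an element can be (on the order of tens of
// millions of pixels), so for billions of rows the scroll area is
// capped and scroll positions are rescaled onto the full row range.

const MAX_SCROLL_HEIGHT = 16_000_000; // assumed cap, in pixels

function visibleRange(scrollTop, viewportHeight, rowHeight, totalRows) {
  const trueHeight = totalRows * rowHeight;
  const rowsPerViewport = Math.ceil(viewportHeight / rowHeight);

  if (trueHeight <= MAX_SCROLL_HEIGHT) {
    // Plain virtual scroll: scrollTop maps directly to a row index.
    const start = Math.floor(scrollTop / rowHeight);
    return { start, end: Math.min(totalRows, start + rowsPerViewport) };
  }

  // "Infinite pixel" case: rescale the capped scrollbar position
  // onto the full row range so the scrollbar stays proportional.
  const maxScroll = MAX_SCROLL_HEIGHT - viewportHeight;
  const fraction = Math.min(1, scrollTop / maxScroll);
  const start = Math.floor(fraction * (totalRows - rowsPerViewport));
  return { start, end: Math.min(totalRows, start + rowsPerViewport) };
}
```

<p>Either way, only a viewport's worth of DOM rows ever exists; scrolling just changes which slice of the data they display.</p>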
]]></description><pubDate>Thu, 12 Feb 2026 20:49:21 +0000</pubDate><link>https://news.ycombinator.com/item?id=46994945</link><dc:creator>platypii</dc:creator><comments>https://news.ycombinator.com/item?id=46994945</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46994945</guid></item><item><title><![CDATA[A visual explainer of how to scroll through billions of rows in the browser]]></title><description><![CDATA[
<p>Article URL: <a href="https://rednegra.net/blog/20260212-virtual-scroll/">https://rednegra.net/blog/20260212-virtual-scroll/</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=46994944">https://news.ycombinator.com/item?id=46994944</a></p>
<p>Points: 3</p>
<p># Comments: 1</p>
]]></description><pubDate>Thu, 12 Feb 2026 20:49:20 +0000</pubDate><link>https://rednegra.net/blog/20260212-virtual-scroll/</link><dc:creator>platypii</dc:creator><comments>https://news.ycombinator.com/item?id=46994944</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46994944</guid></item><item><title><![CDATA[New comment by platypii in "Ask HN: Where are you keeping your LLM logs?"]]></title><description><![CDATA[
<p>We're willing to spend money, but I've had the "Datadog billing problem" before, where it starts out reasonable, grows to a non-trivial percentage of the SaaS budget, and then there's a scramble to refactor. Trying to get ahead of that, as the LLM logs are MUCH larger than my APM logs.</p>
]]></description><pubDate>Fri, 09 Jan 2026 19:44:05 +0000</pubDate><link>https://news.ycombinator.com/item?id=46558282</link><dc:creator>platypii</dc:creator><comments>https://news.ycombinator.com/item?id=46558282</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46558282</guid></item><item><title><![CDATA[Ask HN: Where are you keeping your LLM logs?]]></title><description><![CDATA[
<p>LLM logs are crushing my application logging system. We recently launched AI features on our app and went from ~100MB/month of normal website logs to 3GB/month of LLM conversation logs, and growing. Our existing logging system was overwhelmed (queries timing out, etc.), and costs started increasing. We’re considering how to re-architect our LLM logging specifically so we can handle more users plus the increasing token use from things like reasoning models, tool calling, and multi-agent systems. I’m not selling any solutions here; genuinely curious what others are doing. Do you store them alongside APM logs? A dedicated LLM logging service? Build it yourself with open-source tools?</p>
<hr>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=46557332">https://news.ycombinator.com/item?id=46557332</a></p>
<p>Points: 1</p>
<p># Comments: 4</p>
]]></description><pubDate>Fri, 09 Jan 2026 18:36:35 +0000</pubDate><link>https://news.ycombinator.com/item?id=46557332</link><dc:creator>platypii</dc:creator><comments>https://news.ycombinator.com/item?id=46557332</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46557332</guid></item><item><title><![CDATA[Show HN: Squirreling: a browser-native SQL engine]]></title><description><![CDATA[
<p>I made a small (~9 KB), open-source SQL engine in JavaScript built for interactive data exploration. Squirreling is unique in that it’s built entirely with modern async JavaScript in mind, and it enables new kinds of interactivity by prioritizing streaming, late materialization, and async user-defined functions. No other database engine does this in the browser.<p>More technical details in the post. Feedback welcome.</p>
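<p>Late materialization, one of the techniques mentioned above, means evaluating a predicate against a single column first and only assembling full rows for the matches. A minimal sketch of the idea in plain JavaScript — illustrative only, not Squirreling's actual API:</p>

```javascript
// Late materialization over a column-oriented table: test the filter
// column first, then materialize full row objects only for matches.

function* lateMaterialize(columns, filterColumn, predicate) {
  const values = columns[filterColumn];
  for (let i = 0; i < values.length; i++) {
    if (!predicate(values[i])) continue;
    // Only now touch the other columns for this row.
    const row = {};
    for (const name of Object.keys(columns)) row[name] = columns[name][i];
    yield row;
  }
}
```

<p>Because it's a generator, results stream out as they're found, which is what enables showing partial results in an interactive UI.</p>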
<hr>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=46435321">https://news.ycombinator.com/item?id=46435321</a></p>
<p>Points: 2</p>
<p># Comments: 0</p>
]]></description><pubDate>Tue, 30 Dec 2025 17:03:33 +0000</pubDate><link>https://blog.hyperparam.app/squirreling-new-sql-engine-for-web/</link><dc:creator>platypii</dc:creator><comments>https://news.ycombinator.com/item?id=46435321</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46435321</guid></item><item><title><![CDATA[Best way to annotate large parquet LLM logs without full rewrites?]]></title><description><![CDATA[
<p>I asked this on the Apache mailing list but haven’t found a good solution yet. Wondering if anyone has ideas for how to engineer this?<p>Here’s my problem: I have gigabytes of LLM conversation logs in Parquet on S3. I want to add per-row annotations (LLM-as-a-judge scores), ideally without touching the original text data.<p>So for a given dataset, I want to add a new column. This seemed like a perfect use case for Iceberg. Iceberg does let you evolve the table schema, including adding a column. BUT you can only add a column with a default value. If I want to fill in that column with annotations, ICEBERG MAKES ME REWRITE EVERY ROW. So despite being based on Parquet, a column-oriented format, I need to rewrite the entire source text data (gigabytes) just to add ~1MB of annotations. This feels wildly inefficient.<p>I considered storing the column in its own table and joining them. This does work, but the joins are annoying to work with, and I suspect query engines don’t optimize a "join on row_number" operation well.<p>I've been exploring little-known features of Parquet, like the file_path field, to store column data in external files. But literally zero Parquet clients support this.<p>I'm running out of ideas for how to work with this data efficiently. It's bad enough that I'm considering building my own table format if I can’t find a solution. Anyone have suggestions?</p>
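<p>The "separate table joined on row_number" workaround described above reduces to a positional merge once both sides are read back in the same row order. A minimal sketch in JavaScript — joinByRowNumber and the column name are made up for illustration, and in practice the base rows would stream out of Parquet rather than sit in memory:</p>

```javascript
// Positional join: attach a small annotation column to base rows
// without rewriting the base data. Assumes both sides were written,
// and are read back, in the same row order.

function joinByRowNumber(baseRows, annotations, columnName) {
  if (baseRows.length !== annotations.length) {
    throw new Error('annotation length must match base row count');
  }
  return baseRows.map((row, i) => ({ ...row, [columnName]: annotations[i] }));
}
```

<p>The fragility is exactly what the post complains about: nothing in either file enforces that the row orders still match.</p>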
<hr>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=46358172">https://news.ycombinator.com/item?id=46358172</a></p>
<p>Points: 2</p>
<p># Comments: 2</p>
]]></description><pubDate>Mon, 22 Dec 2025 19:52:22 +0000</pubDate><link>https://news.ycombinator.com/item?id=46358172</link><dc:creator>platypii</dc:creator><comments>https://news.ycombinator.com/item?id=46358172</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46358172</guid></item><item><title><![CDATA[Ask HN: Local tools for working with LLM datasets?]]></title><description><![CDATA[
<p>I’ve been doing data science for years; I'm very familiar with Jupyter notebooks and have more recently been using DuckDB a lot. But now I have this huge pile of output tokens from my 4090s, and it feels qualitatively different from data I’ve worked with in the past. Notebooks and DuckDB on the CLI don’t feel like they’re built for working with huge volumes of text data like my training set and LLM output traces.<p>What have you found works well for this? I’m trying to fine-tune on a text dataset and be able to inspect the output from eval runs. I would prefer local and open-source tools to a paid service.</p>
<hr>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=46305665">https://news.ycombinator.com/item?id=46305665</a></p>
<p>Points: 2</p>
<p># Comments: 0</p>
]]></description><pubDate>Wed, 17 Dec 2025 21:21:44 +0000</pubDate><link>https://news.ycombinator.com/item?id=46305665</link><dc:creator>platypii</dc:creator><comments>https://news.ycombinator.com/item?id=46305665</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46305665</guid></item><item><title><![CDATA[New comment by platypii in "What UI do you use on top of data engineering tools to look at data?"]]></title><description><![CDATA[
<p>Makes sense. I'm not currently on Snowflake because I'm mostly working with local Parquet files, and I'd prefer not to pay for Snowflake just to explore my data. I'm interested in better data UIs, though, so I might need to check it out.</p>
]]></description><pubDate>Wed, 10 Dec 2025 18:35:49 +0000</pubDate><link>https://news.ycombinator.com/item?id=46221572</link><dc:creator>platypii</dc:creator><comments>https://news.ycombinator.com/item?id=46221572</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46221572</guid></item><item><title><![CDATA[What UI do you use on top of data engineering tools to look at data?]]></title><description><![CDATA[
<p>Tools like DuckDB Wasm and data engineering platforms like Iceberg leverage Parquet’s built-in indexing to very efficiently query files over the network. But as I’ve been building data tools myself, the stack gets complicated fast, especially once you try to visualize or explore the data instead of just querying it. I’m intrigued by some of the modern tricks people are using to do more data engineering client-side.<p>With OPFS + Parquet + Wasm, the browser already has everything it needs to handle multi-GB LLM datasets client-side.<p>Is the world of data UIs evolving? Are there new data tools and best practices beyond notebooks and DuckDB?</p>
<hr>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=46221195">https://news.ycombinator.com/item?id=46221195</a></p>
<p>Points: 1</p>
<p># Comments: 3</p>
]]></description><pubDate>Wed, 10 Dec 2025 18:10:11 +0000</pubDate><link>https://news.ycombinator.com/item?id=46221195</link><dc:creator>platypii</dc:creator><comments>https://news.ycombinator.com/item?id=46221195</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46221195</guid></item><item><title><![CDATA[New comment by platypii in "Show HN: We built an AI tool for working with massive LLM chat log datasets"]]></title><description><![CDATA[
<p>I started Hyperparam one year ago because I knew that the world of data was changing, and existing tools like Python and Jupyter Notebooks were not built for the scale of LLM data. The weights of LLMs may be tensors, but the input and output of LLMs are massive piles of text.<p>No human has the patience to sift through all that text, so we need better tools to help us understand and analyze it. That's why I built Hyperparam to be the first tool specifically designed for working with LLM data at scale. No one else seemed to be solving this problem.</p>
]]></description><pubDate>Wed, 19 Nov 2025 17:09:13 +0000</pubDate><link>https://news.ycombinator.com/item?id=45982014</link><dc:creator>platypii</dc:creator><comments>https://news.ycombinator.com/item?id=45982014</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45982014</guid></item><item><title><![CDATA[Show HN: We built an AI tool for working with massive LLM chat log datasets]]></title><description><![CDATA[
<p>There’s an important problem with AI that nobody’s talking about. AI’s entire lifecycle is tons of data in for training, and an even larger volume of text data out. Traditional tools can’t handle the sheer volume of text, leaving teams overwhelmed and unable to make their data work for them.<p>Today we’re launching Hyperparam, a browser-native app for exploring and transforming multi-gigabyte datasets in real time. It combines a fast UI that can stream huge unstructured datasets with an army of AI agents that can score, label, filter, and categorize them. Now you can actually make sense of AI-scale data instead of drowning in it.<p>Example: using the chat, ask Hyperparam’s AI agent to score every conversation in a 100K-row dataset for sycophancy, filter out the worst responses, adjust prompts, regenerate, and export your dataset v2. It all runs in one browser tab with no waiting and no lag.<p>It’s free while in beta if you want to try it on your own data.</p>
<hr>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=45981930">https://news.ycombinator.com/item?id=45981930</a></p>
<p>Points: 16</p>
<p># Comments: 1</p>
]]></description><pubDate>Wed, 19 Nov 2025 17:02:53 +0000</pubDate><link>https://hyperparam.app/</link><dc:creator>platypii</dc:creator><comments>https://news.ycombinator.com/item?id=45981930</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45981930</guid></item><item><title><![CDATA[New comment by platypii in "Lessons from Hyperparam's year of open source data transformation"]]></title><description><![CDATA[
<p>This is a Q&A I did on what I learned from a year of open source data transformation. Most of all, it reinforced my belief that browser-native tools aren’t “toys” that don’t work for real systems. When Hugging Face integrated my libraries, it confirmed that the browser can handle serious data work, and maybe there's an opportunity for more browser-based data tools.</p>
]]></description><pubDate>Thu, 13 Nov 2025 19:30:07 +0000</pubDate><link>https://news.ycombinator.com/item?id=45919419</link><dc:creator>platypii</dc:creator><comments>https://news.ycombinator.com/item?id=45919419</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45919419</guid></item><item><title><![CDATA[Lessons from Hyperparam's year of open source data transformation]]></title><description><![CDATA[
<p>Article URL: <a href="https://blog.hyperparam.app/lessons-from-open-source-data-transformation/">https://blog.hyperparam.app/lessons-from-open-source-data-transformation/</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=45919418">https://news.ycombinator.com/item?id=45919418</a></p>
<p>Points: 1</p>
<p># Comments: 1</p>
]]></description><pubDate>Thu, 13 Nov 2025 19:30:07 +0000</pubDate><link>https://blog.hyperparam.app/lessons-from-open-source-data-transformation/</link><dc:creator>platypii</dc:creator><comments>https://news.ycombinator.com/item?id=45919418</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45919418</guid></item><item><title><![CDATA[New comment by platypii in "Ask HN: How far can we push the browser for large-scale data parsing?"]]></title><description><![CDATA[
<p>As with anything, there are engineering tradeoffs.<p>What I've found is that moving data processing toward the browser is, for one, a refreshing developer experience, because I don't need to build a paired backend and frontend. From a user-experience point of view, I think you can build MORE interactive data applications by pushing work toward the frontend.</p>
]]></description><pubDate>Thu, 06 Nov 2025 17:16:58 +0000</pubDate><link>https://news.ycombinator.com/item?id=45837593</link><dc:creator>platypii</dc:creator><comments>https://news.ycombinator.com/item?id=45837593</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45837593</guid></item><item><title><![CDATA[Ask HN: How far can we push the browser for large-scale data parsing?]]></title><description><![CDATA[
<p>How far can we push the browser as a data engine — not just for visualizations, but for curating and querying large datasets? Do we need traditional backend architectures?<p>I wanted to see what happens when we treat the browser like part of the data stack, using pure JavaScript to load, slice, and explore datasets interactively. That experiment led to a small set of open-source tools — Hyparquet and HighTable. They’re designed to test the limits of browser-native data processing to see where the browser stops being a thin client and starts acting like a real data engine.<p>Curious what others think about the future of browser-first data tools:<p>- Where do you see the practical limits for client-side data processing?
- What would make browser-based architectures a viable alternative to traditional data stacks?</p>
<hr>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=45837156">https://news.ycombinator.com/item?id=45837156</a></p>
<p>Points: 1</p>
<p># Comments: 2</p>
]]></description><pubDate>Thu, 06 Nov 2025 16:42:54 +0000</pubDate><link>https://news.ycombinator.com/item?id=45837156</link><dc:creator>platypii</dc:creator><comments>https://news.ycombinator.com/item?id=45837156</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45837156</guid></item><item><title><![CDATA[New comment by platypii in "From GPT-4 to GPT-5: Measuring progress through MedHELM [pdf]"]]></title><description><![CDATA[
<p>Why not? We are trying to evaluate AI's capabilities. It's OBVIOUS that we should compare it to our only prior example of intelligence -- humans. Saying we shouldn't compare to or anthropomorphize machines is a ridiculous hill to die on.</p>
]]></description><pubDate>Fri, 22 Aug 2025 04:08:51 +0000</pubDate><link>https://news.ycombinator.com/item?id=44980906</link><dc:creator>platypii</dc:creator><comments>https://news.ycombinator.com/item?id=44980906</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44980906</guid></item><item><title><![CDATA[New comment by platypii in "The Quest for Instant Data"]]></title><description><![CDATA[
<p>This is the story of how I spent a year making the world's fastest Parquet loader in JavaScript. The goals:<p>- Make a faster, more interactive viewer for AI datasets (which are mostly in Parquet format)<p>- Simplify the stack by doing everything from the browser (no backend)<p>TL;DR: My open-source library Hyparquet can load in 155ms a file that takes duckdb-wasm 3466ms.</p>
]]></description><pubDate>Thu, 24 Jul 2025 16:09:51 +0000</pubDate><link>https://news.ycombinator.com/item?id=44672432</link><dc:creator>platypii</dc:creator><comments>https://news.ycombinator.com/item?id=44672432</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44672432</guid></item><item><title><![CDATA[The Quest for Instant Data]]></title><description><![CDATA[
<p>Article URL: <a href="https://blog.hyperparam.app/2025/07/24/quest-for-instant-data/">https://blog.hyperparam.app/2025/07/24/quest-for-instant-data/</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=44672363">https://news.ycombinator.com/item?id=44672363</a></p>
<p>Points: 16</p>
<p># Comments: 1</p>
]]></description><pubDate>Thu, 24 Jul 2025 16:03:32 +0000</pubDate><link>https://blog.hyperparam.app/2025/07/24/quest-for-instant-data/</link><dc:creator>platypii</dc:creator><comments>https://news.ycombinator.com/item?id=44672363</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44672363</guid></item><item><title><![CDATA[New comment by platypii in "Show HN: Hyperparam: OSS Tools for Exploring Datasets Locally in the Browser"]]></title><description><![CDATA[
<p>I don’t have benchmarks specifically against DuckDB. I’m sure native C++ will run faster than JavaScript.<p>But what's important is that with Hyperparam you can do it in the browser, where the bottleneck will always be network-bound, not CPU-bound.</p>
]]></description><pubDate>Thu, 01 May 2025 15:51:11 +0000</pubDate><link>https://news.ycombinator.com/item?id=43859360</link><dc:creator>platypii</dc:creator><comments>https://news.ycombinator.com/item?id=43859360</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43859360</guid></item><item><title><![CDATA[New comment by platypii in "Show HN: Hyperparam: OSS Tools for Exploring Datasets Locally in the Browser"]]></title><description><![CDATA[
<p>Funny you say that, because I built these tools because I wanted to build something very much like what you're describing!<p>I was trying to look at, filter, and transform large AI datasets, and I was frustrated with how bad the existing tooling was for working with datasets containing huge amounts of text (web scrapes, GitHub dumps, reasoning tokens, agent chat logs). Jupyter notebooks are woefully bad at helping you look at your data.<p>So I wanted to build better browser tools for working with AI datasets. But to do that I had to build these components first (there was no working Parquet implementation in JS when I started).<p>Anyway, I'm still working on an app for data processing that uses an LLM chat assistant to help a single user curate entire datasets singlehandedly. But for now I'm releasing these components to the community as open source. And having them "do a single task each" was very much intentional. Thanks for the comment!</p>
]]></description><pubDate>Thu, 01 May 2025 15:26:58 +0000</pubDate><link>https://news.ycombinator.com/item?id=43858985</link><dc:creator>platypii</dc:creator><comments>https://news.ycombinator.com/item?id=43858985</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43858985</guid></item></channel></rss>