<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: szarnyasg</title><link>https://news.ycombinator.com/user?id=szarnyasg</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Thu, 23 Apr 2026 14:05:32 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=szarnyasg" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by szarnyasg in "Distributed DuckDB Instance"]]></title><description><![CDATA[
<p>DuckDB devrel here. You are right. This was in the FAQ but I also added it to the DuckLake documentation's main page at <a href="https://ducklake.select/docs/stable/" rel="nofollow">https://ducklake.select/docs/stable/</a></p>
]]></description><pubDate>Wed, 15 Apr 2026 13:44:13 +0000</pubDate><link>https://news.ycombinator.com/item?id=47778883</link><dc:creator>szarnyasg</dc:creator><comments>https://news.ycombinator.com/item?id=47778883</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47778883</guid></item><item><title><![CDATA[New comment by szarnyasg in "Distributed DuckDB Instance"]]></title><description><![CDATA[
<p>DuckLake does not use B+ trees; it handles fragmentation with techniques such as partial files and compaction upon checkpointing. The developers of DuckLake talk about this here: <a href="https://youtu.be/7Su0aVzbb-U?t=689" rel="nofollow">https://youtu.be/7Su0aVzbb-U?t=689</a><p>(Disclaimer: I work at DuckDB Labs)</p>
]]></description><pubDate>Tue, 14 Apr 2026 19:56:33 +0000</pubDate><link>https://news.ycombinator.com/item?id=47770643</link><dc:creator>szarnyasg</dc:creator><comments>https://news.ycombinator.com/item?id=47770643</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47770643</guid></item><item><title><![CDATA[New comment by szarnyasg in "Distributed DuckDB Instance"]]></title><description><![CDATA[
<p>Hi, DuckDB DevRel here. To have concurrent read-write access to a database, you can use our DuckLake lakehouse format and coordinate concurrent access through a shared Postgres catalog. We released v1.0 yesterday: <a href="https://ducklake.select/2026/04/13/ducklake-10/" rel="nofollow">https://ducklake.select/2026/04/13/ducklake-10/</a><p>I updated your reference [0] with this information.</p>
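<p>A minimal sketch of such a setup, following the documented ATTACH syntax (the database name, host, and bucket path below are placeholders, not values from this thread):<p><pre><code>INSTALL ducklake;
INSTALL postgres;
-- the Postgres database acts as the shared catalog;
-- Parquet data files live under DATA_PATH
ATTACH 'ducklake:postgres:dbname=ducklake_catalog host=localhost' AS my_ducklake
    (DATA_PATH 's3://my-bucket/data/');
USE my_ducklake;
</code></pre><p>Multiple DuckDB clients can attach with the same connection string; the Postgres catalog coordinates their concurrent writes.</p>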
]]></description><pubDate>Tue, 14 Apr 2026 07:56:52 +0000</pubDate><link>https://news.ycombinator.com/item?id=47762638</link><dc:creator>szarnyasg</dc:creator><comments>https://news.ycombinator.com/item?id=47762638</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47762638</guid></item><item><title><![CDATA[New comment by szarnyasg in "Grafeo – A fast, lean, embeddable graph database built in Rust"]]></title><description><![CDATA[
<p>That's a difficult question and I would like to avoid giving a direct answer (because I co-lead a nonprofit that benchmarks graph databases), but even knowing what you need from a graph database can be tricky. See my FOSDEM 2025 talk, where I tried to make sense of the field:<p><a href="https://archive.fosdem.org/2025/schedule/event/fosdem-2025-5413-graph-databases-after-15-years-where-are-they-headed-/" rel="nofollow">https://archive.fosdem.org/2025/schedule/event/fosdem-2025-5...</a></p>
]]></description><pubDate>Sat, 21 Mar 2026 19:38:56 +0000</pubDate><link>https://news.ycombinator.com/item?id=47470498</link><dc:creator>szarnyasg</dc:creator><comments>https://news.ycombinator.com/item?id=47470498</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47470498</guid></item><item><title><![CDATA[New comment by szarnyasg in "Big data on the cheapest MacBook"]]></title><description><![CDATA[
<p>That's a good point. I re-ran the benchmark on two instances:<p>- c8gd.4xlarge - this has a single 950 GB NVMe SSD.<p>- c5ad.4xlarge - this has 2 x 300 GB disks, which I put in a RAID 0 array. There are no c6ad.4xlarge instances, so this is the closest NVMe-enabled approximation of ClickBench's most popular choice, the c6a.4xlarge.<p>I also added results from my local dev machine, a MacBook M1 Max with 64 GB RAM and 10 cores.<p>Here are the results:<p><pre><code>  | machine        | cold_run_avg | cold_run_sum | hot_run_avg | hot_run_sum |
  | -------------- | -----------: | -----------: | ----------: | ----------: |
  | macbook m1 max |         0.48 |        20.68 |        0.43 |       18.60 |
  | macbook neo    |         1.39 |        59.73 |        1.26 |       54.27 |
  | c8gd.4xlarge   |         0.51 |        22.04 |        0.24 |       10.36 |
  | c5ad.4xlarge   |         1.29 |        54.14 |        0.55 |       22.91 |
  | c6a.4xlarge    |         3.37 |       145.08 |        1.11 |       47.86 |
  | c8g.metal-48xl |         3.95 |       169.67 |        0.10 |        4.35 |
</code></pre>
On the cold run, the MacBook is on par with the c5ad.4xlarge, while the c8gd.4xlarge is roughly 2.5x faster.<p>I know this is moving the goalposts, but it's quite interesting that both of these cloud instances with instance-attached storage are still outperformed by the M1 Max (which is 4+ years old) on the cold run. And they would quite likely lose against the latest MacBook Pro with the M5 Pro/Max on both the cold and the hot runs. But that's an experiment for another day.</p>
]]></description><pubDate>Fri, 13 Mar 2026 07:58:53 +0000</pubDate><link>https://news.ycombinator.com/item?id=47361723</link><dc:creator>szarnyasg</dc:creator><comments>https://news.ycombinator.com/item?id=47361723</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47361723</guid></item><item><title><![CDATA[New comment by szarnyasg in "Big data on the cheapest MacBook"]]></title><description><![CDATA[
<p>Indeed, it would have been interesting, but I really wanted to get the blog post out on the launch day of the MacBook Neo and did not have the bandwidth to run additional cloud experiments.<p>I have now run TPC-DS SF300 on the c6a.4xlarge. It turns out that it's still quite limited by the EBS disk's I/O: while 32 GB of memory is much more than 8 GB, DuckDB needs to spill to disk a lot, and this shows in the runtimes. Running all 99 queries took 37 minutes, about half of the MacBook's 79 minutes.<p>> Command being timed: "duckdb tpcds-sf300.db -f bench.sql"<p>> Percent of CPU this job got: 250%<p>> Elapsed (wall clock) time (h:mm:ss or m:ss): 37:00.96<p>> Maximum resident set size (kbytes): 25559652</p>
]]></description><pubDate>Thu, 12 Mar 2026 20:38:24 +0000</pubDate><link>https://news.ycombinator.com/item?id=47356732</link><dc:creator>szarnyasg</dc:creator><comments>https://news.ycombinator.com/item?id=47356732</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47356732</guid></item><item><title><![CDATA[New comment by szarnyasg in "Big Data on the Cheapest MacBook"]]></title><description><![CDATA[
<p>You're right! I pushed an updated TL;DR block.</p>
]]></description><pubDate>Thu, 12 Mar 2026 14:30:03 +0000</pubDate><link>https://news.ycombinator.com/item?id=47351048</link><dc:creator>szarnyasg</dc:creator><comments>https://news.ycombinator.com/item?id=47351048</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47351048</guid></item><item><title><![CDATA[DuckDB 1.4.3 LTS with Native Windows ARM64 Support]]></title><description><![CDATA[
<p>Article URL: <a href="https://duckdb.org/2025/12/09/announcing-duckdb-143">https://duckdb.org/2025/12/09/announcing-duckdb-143</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=46204891">https://news.ycombinator.com/item?id=46204891</a></p>
<p>Points: 2</p>
<p># Comments: 0</p>
]]></description><pubDate>Tue, 09 Dec 2025 13:48:37 +0000</pubDate><link>https://duckdb.org/2025/12/09/announcing-duckdb-143</link><dc:creator>szarnyasg</dc:creator><comments>https://news.ycombinator.com/item?id=46204891</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46204891</guid></item><item><title><![CDATA[DuckLake 0.3 with Iceberg Interoperability and Geometry Support]]></title><description><![CDATA[
<p>Article URL: <a href="https://ducklake.select/2025/09/17/ducklake-03/">https://ducklake.select/2025/09/17/ducklake-03/</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=45291313">https://news.ycombinator.com/item?id=45291313</a></p>
<p>Points: 2</p>
<p># Comments: 0</p>
]]></description><pubDate>Thu, 18 Sep 2025 16:02:53 +0000</pubDate><link>https://ducklake.select/2025/09/17/ducklake-03/</link><dc:creator>szarnyasg</dc:creator><comments>https://news.ycombinator.com/item?id=45291313</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45291313</guid></item><item><title><![CDATA[New comment by szarnyasg in "DuckLake is an integrated data lake and catalog format"]]></title><description><![CDATA[
<p>Yes - updates on existing rows are supported.<p>(I work at DuckDB Labs.)</p>
]]></description><pubDate>Tue, 27 May 2025 19:02:46 +0000</pubDate><link>https://news.ycombinator.com/item?id=44109768</link><dc:creator>szarnyasg</dc:creator><comments>https://news.ycombinator.com/item?id=44109768</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44109768</guid></item><item><title><![CDATA[New comment by szarnyasg in "DuckLake is an integrated data lake and catalog format"]]></title><description><![CDATA[
<p>Great!<p>> About the COPY statement, it means we can drop Parquet files ourselves in the blob storage ?<p>Dropping the Parquet files into the blob storage directly will not work – you have to COPY them through DuckLake so that the catalog database is updated with the required metadata.</p>
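<p>A sketch of what loading an existing Parquet file through DuckLake looks like (the attached catalog name, table name, and file name are made up for illustration):<p><pre><code>-- runs as a transaction against the DuckLake catalog, so the new
-- data files are registered in the catalog metadata as part of the load
COPY my_ducklake.my_table FROM 'exported_rows.parquet';
</code></pre></p>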
]]></description><pubDate>Tue, 27 May 2025 15:03:47 +0000</pubDate><link>https://news.ycombinator.com/item?id=44107670</link><dc:creator>szarnyasg</dc:creator><comments>https://news.ycombinator.com/item?id=44107670</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44107670</guid></item><item><title><![CDATA[New comment by szarnyasg in "DuckLake is an integrated data lake and catalog format"]]></title><description><![CDATA[
<p>The YouTube video “Apache Iceberg: What It Is and Why Everyone’s Talking About It” by Tim Berglund explains data lakes really well in the opening minutes: <a href="https://www.youtube.com/watch?v=TsmhRZElPvM" rel="nofollow">https://www.youtube.com/watch?v=TsmhRZElPvM</a></p>
]]></description><pubDate>Tue, 27 May 2025 14:49:35 +0000</pubDate><link>https://news.ycombinator.com/item?id=44107516</link><dc:creator>szarnyasg</dc:creator><comments>https://news.ycombinator.com/item?id=44107516</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44107516</guid></item><item><title><![CDATA[New comment by szarnyasg in "DuckLake is an integrated data lake and catalog format"]]></title><description><![CDATA[
<p>Yes, you can use standard SQL constructs such as INSERT statements and COPY to load data into DuckLake.<p>(Disclaimer: I work at DuckDB Labs)</p>
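<p>For example, assuming a DuckLake catalog attached as <code>my_ducklake</code> (the table and file names here are illustrative placeholders):<p><pre><code>CREATE TABLE my_ducklake.events (id INTEGER, payload VARCHAR);
-- plain INSERT works against a DuckLake table
INSERT INTO my_ducklake.events VALUES (1, 'quack');
-- so does bulk loading via COPY
COPY my_ducklake.events FROM 'events.csv';
</code></pre></p>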
]]></description><pubDate>Tue, 27 May 2025 14:46:05 +0000</pubDate><link>https://news.ycombinator.com/item?id=44107484</link><dc:creator>szarnyasg</dc:creator><comments>https://news.ycombinator.com/item?id=44107484</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44107484</guid></item><item><title><![CDATA[New comment by szarnyasg in "A lost decade chasing distributed architectures for data analytics?"]]></title><description><![CDATA[
<p>AWS started offering local SSD storage up to 2 TB in 2012 (HI1 instance type) and in late 2013 this went up to 6.4 TB (I2 instance type). While these amounts don't cover all customers, plenty of data fits on these machines. But the software stack to analyze it efficiently was lacking, especially in the open-source space.</p>
]]></description><pubDate>Thu, 22 May 2025 13:43:08 +0000</pubDate><link>https://news.ycombinator.com/item?id=44061989</link><dc:creator>szarnyasg</dc:creator><comments>https://news.ycombinator.com/item?id=44061989</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44061989</guid></item><item><title><![CDATA[New comment by szarnyasg in "The DuckDB Local UI"]]></title><description><![CDATA[
<p>Hi, DuckDB devrel here. DuckDB is an analytical SQL database in the form factor of SQLite (i.e., in-process). This quadrant summarizes its place in the landscape:<p><a href="https://blobs.duckdb.org/slides/goto-amsterdam-2024-duckdb-gabor-szarnyas.pdf#page=19" rel="nofollow">https://blobs.duckdb.org/slides/goto-amsterdam-2024-duckdb-g...</a><p>It works as a replacement for or a complement to dataframe libraries thanks to its speed and (vertical) scalability. It's lightweight and dependency-free, so it also works well as part of data processing pipelines.</p>
]]></description><pubDate>Thu, 13 Mar 2025 10:04:21 +0000</pubDate><link>https://news.ycombinator.com/item?id=43351767</link><dc:creator>szarnyasg</dc:creator><comments>https://news.ycombinator.com/item?id=43351767</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43351767</guid></item><item><title><![CDATA[New comment by szarnyasg in "The DuckDB Local UI"]]></title><description><![CDATA[
<p>I'm a co-author of the blog post. I agree that the wording was confusing – apologies. I added a note at the end:<p>> The repository does not contain the source code for the frontend, which is currently not available as open-source. Releasing it as open-source is under consideration.</p>
]]></description><pubDate>Wed, 12 Mar 2025 17:31:43 +0000</pubDate><link>https://news.ycombinator.com/item?id=43345622</link><dc:creator>szarnyasg</dc:creator><comments>https://news.ycombinator.com/item?id=43345622</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43345622</guid></item><item><title><![CDATA[New comment by szarnyasg in "Be Aware of the Makefile Effect"]]></title><description><![CDATA[
<p>I have observed the Makefile effect many times with LaTeX documents. Most researchers I worked with had a LaTeX file full of macros that they had been carrying from project to project for years. These were often inherited from more senior researchers and hammered into heavily modified forks of the article templates used in their field or the thesis templates used at their institution.</p>
]]></description><pubDate>Sat, 11 Jan 2025 06:50:39 +0000</pubDate><link>https://news.ycombinator.com/item?id=42663859</link><dc:creator>szarnyasg</dc:creator><comments>https://news.ycombinator.com/item?id=42663859</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42663859</guid></item><item><title><![CDATA[New comment by szarnyasg in "DuckDB is faster at counting the lines of a CSV file than wc"]]></title><description><![CDATA[
<p>I am the author of the original post and I also wrote a follow-up blog post yesterday: <a href="https://szarnyasg.org/posts/duckdb-vs-coreutils/" rel="nofollow">https://szarnyasg.org/posts/duckdb-vs-coreutils/</a><p>Yes, if you break the file into parts with GNU Parallel, you can easily beat DuckDB, as I show in the blog post.<p>That said, I maintain that it's surprising that DuckDB outperforms wc (and grep) on many common setups, e.g., on a MacBook. This is not something many databases can do, and the ones that can usually don't run on a laptop.</p>
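<p>The DuckDB side of the comparison is a single SQL query; a sketch (the file name is a placeholder):<p><pre><code>-- DuckDB's parallel CSV reader scans the file with all cores,
-- which is what makes this competitive with wc -l
SELECT count(*) FROM read_csv('measurements.csv', header = false);
</code></pre></p>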
]]></description><pubDate>Thu, 05 Dec 2024 23:27:00 +0000</pubDate><link>https://news.ycombinator.com/item?id=42334022</link><dc:creator>szarnyasg</dc:creator><comments>https://news.ycombinator.com/item?id=42334022</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42334022</guid></item><item><title><![CDATA[New comment by szarnyasg in "DuckDB over Pandas/Polars"]]></title><description><![CDATA[
<p>Hi – DuckDB Labs devrel here. It's great that you find DuckDB useful!<p>On the setup side, I agree that local (instance-attached) disks should be preferred, but does EBS incur an I/O fee? It certainly adds significant latency, but it doesn't have per-operation pricing:<p>> I/O is included in the price of the volumes, so you pay only for each GB of storage you provision.<p>(<a href="https://aws.amazon.com/ebs/pricing/" rel="nofollow">https://aws.amazon.com/ebs/pricing/</a>)</p>
]]></description><pubDate>Wed, 06 Nov 2024 15:30:20 +0000</pubDate><link>https://news.ycombinator.com/item?id=42063813</link><dc:creator>szarnyasg</dc:creator><comments>https://news.ycombinator.com/item?id=42063813</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42063813</guid></item><item><title><![CDATA[DuckDB in Python in the Browser with Pyodide, PyScript, and JupyterLite]]></title><description><![CDATA[
<p>Article URL: <a href="https://duckdb.org/2024/10/02/pyodide.html">https://duckdb.org/2024/10/02/pyodide.html</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=41741410">https://news.ycombinator.com/item?id=41741410</a></p>
<p>Points: 2</p>
<p># Comments: 0</p>
]]></description><pubDate>Fri, 04 Oct 2024 13:52:08 +0000</pubDate><link>https://duckdb.org/2024/10/02/pyodide.html</link><dc:creator>szarnyasg</dc:creator><comments>https://news.ycombinator.com/item?id=41741410</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=41741410</guid></item></channel></rss>