Hacker News: super_ar

New comment by super_ar in "Show HN: 500k+ events/sec transformations for ClickHouse ingestion"

super_ar — Wed, 08 Apr 2026 20:37:06 +0000

Good question. I wouldn’t say this replaces Flink in general. If you already run Flink and are comfortable with it, it’s a very powerful system.

Where we saw friction with Flink was mainly:

1. Operational overhead (jobs, state backends, checkpointing)

2. Generic sinks not being optimized for ClickHouse (batching, small inserts, etc.)

We focused on making scaling a property of the pipeline itself (just add replicas) and optimizing specifically for ClickHouse ingestion patterns.

So Flink is more general; this is more opinionated and focused on this specific use case.

Show HN: 500k+ events/sec transformations for ClickHouse ingestion

super_ar — Wed, 08 Apr 2026 17:26:26 +0000

Hi HN! We are Ashish and Armend, founders of GlassFlow.

Over the last year, we worked with teams running high-throughput pipelines into self-hosted ClickHouse. Mostly for observability and real-time analytics.

A question that came up repeatedly was: What happens when throughput grows?

Usually, things work fine at 10k events/sec, but we started seeing backpressure and errors at >100k.

Once throughput per pipeline stops scaling, adding more CPU/memory doesn’t help, because parts of the pipeline are often not parallelized or are bottlenecked by state handling.

At this point, engineers usually scale by adding more pipeline instances.

That works but comes with some trade-offs:

- You have to split the workload (e.g., multiple pipelines reading from the same source)

- Transformation logic gets duplicated across pipelines

- Stateful logic becomes harder to manage and keep consistent

- Debugging and changes get more difficult because the data flow is fragmented

Another challenge arises when working with high-cardinality keys like user IDs, session IDs, or request IDs, and when you need to handle longer time windows (24h or more). The state grows quickly and many systems rely on in-memory state, which makes it expensive and harder to recover from failures.

We wanted to solve this problem and rebuild our approach at GlassFlow.

Instead of scaling by adding more pipelines, we scale within a single pipeline by using replicas. Each replica consumes, processes, and writes independently, and the workload is distributed across them.
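To make the replica idea concrete, here is a minimal sketch of one way a workload could be spread across replicas while keeping all events for a key on the same replica. Routing by a stable hash of the key is an assumption made for illustration; `replica_for`, `distribute`, and the `user_id` field are hypothetical names, and this is not a description of how GlassFlow actually distributes work.

```python
import hashlib
from collections import defaultdict


def replica_for(key: str, num_replicas: int) -> int:
    """Map a key to a replica index with a stable hash (hypothetical routing rule)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_replicas


def distribute(events, num_replicas, key_field="user_id"):
    """Group events by the replica that would own their key."""
    assignment = defaultdict(list)
    for event in events:
        assignment[replica_for(event[key_field], num_replicas)].append(event)
    return dict(assignment)


# 20 events spread over 5 distinct keys, routed to 3 replicas.
events = [{"user_id": f"user-{i % 5}", "seq": i} for i in range(20)]
by_replica = distribute(events, num_replicas=3)
```

Because the hash is stable, every event for a given key lands on the same replica, so per-key state (dedup, windows) would not need to be shared between replicas in this sketch.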

In the benchmarks we’re sharing, this scales to 500k+ events/sec while still running stateful transformations and writing into ClickHouse.

A few things we think are interesting:

- Scaling is close to linear as you add replicas

- Works with stateful transformations (not just stateless ingestion)

- State is backed by a file-based KV store instead of relying purely on memory

- The ClickHouse sink is optimized for batching to avoid small inserts

- The product is built with Go
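As an illustration of the file-backed state point above, here is a minimal sketch of a key-value store that survives a process restart because it lives on disk. SQLite from the Python standard library is used purely as a stand-in; the `FileKV` class and the key names are hypothetical, and the store GlassFlow uses is not specified here.

```python
import os
import sqlite3
import tempfile


class FileKV:
    """Tiny file-backed key-value store (SQLite used as a stand-in)."""

    def __init__(self, path):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v TEXT NOT NULL)"
        )
        self.conn.commit()

    def put(self, key, value):
        # Insert the key, or overwrite the value if the key already exists.
        self.conn.execute("INSERT OR REPLACE INTO kv (k, v) VALUES (?, ?)", (key, value))
        self.conn.commit()

    def get(self, key, default=None):
        row = self.conn.execute("SELECT v FROM kv WHERE k = ?", (key,)).fetchone()
        return row[0] if row else default

    def close(self):
        self.conn.close()


path = os.path.join(tempfile.mkdtemp(), "state.db")
store = FileKV(path)
store.put("session-42", "seen")
store.close()  # simulate the process going away

# A new process opens the same file and finds the state without rebuilding it.
recovered = FileKV(path)
value_after_restart = recovered.get("session-42")
missing = recovered.get("session-43")
recovered.close()
```

The trade-off in a design like this is disk latency in exchange for state that is not bounded by RAM and does not have to be rehydrated from the source after a failure.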

Full write-up + benchmarks: https://www.glassflow.dev/blog/glassflow-now-scales-to-500k-...

Repo: https://github.com/glassflow/clickhouse-etl

Happy to answer questions about the design or trade-offs.

Comments URL: https://news.ycombinator.com/item?id=47693407

Points: 13

# Comments: 4

New comment by super_ar in "[dead]"

super_ar — Tue, 31 Mar 2026 15:59:08 +0000

I am seeing this pattern a lot lately. Teams start with a simple flow:

logs/metrics → Vector → ClickHouse

Works well as long as they run simple transformations via Vector. When they start adding things like dedupe, longer time windows, more data volume or joins, things start to break. They actually start using Vector as a stream processing engine.

Typical issues that I see:

Time window limits: By default, Vector handles windowing in memory. So with a higher load, it becomes too heavy to run there.

Missing support: When running in prod env, I have seen teams under pressure because there is no support available (except for Datadog customers). But most people I know run it self-hosted.

Scaling hits a ceiling: I keep hearing similar numbers: 250k to 300k rec/sec per instance. Even with more resources, things do not scale. The consequences are backpressure, latency spikes, etc.

At that point, it is no longer a “log pipeline.” It is a streaming system. Just not treated like one.

I wrote a deeper breakdown of this here if anyone’s curious:

https://www.glassflow.dev/blog/when-vector-becomes-your-stre...

Curious how people here are handling this.

Are you still pushing more logic into Vector, or have you split it out elsewhere?

New comment by super_ar in "Show HN: Free Logo API – logos for any company or domain"

super_ar — Tue, 09 Dec 2025 18:25:56 +0000

Really cool!

New comment by super_ar in "Show HN: DeChecker – Detect AI-generated text"

super_ar — Tue, 09 Dec 2025 18:21:12 +0000

Looks cool! Do you have any idea how "good" it is at detecting AI-generated text?

New comment by super_ar in "Show HN: I got 50% of my traffic from ChatGPT instead of Google"

super_ar — Tue, 09 Dec 2025 18:17:50 +0000

This is interesting. Just wondering about your traffic volume and how long you have been running localpdf?

For us, it is more like 5% of the traffic from GEO, but we have been running the company for 2 years and have created a lot of handwritten content for devs.

New comment by super_ar in "Load Test GlassFlow for ClickHouse: Real-Time Dedup at Scale"

super_ar — Sun, 22 Jun 2025 19:21:14 +0000

Yes, same machine.

New comment by super_ar in "Load Test GlassFlow for ClickHouse: Real-Time Dedup at Scale"

super_ar — Sun, 22 Jun 2025 18:39:51 +0000

There is another test that we published on our docs page. You can check it out here:

Setup: https://docs.glassflow.dev/load-test/setup

Results: https://docs.glassflow.dev/load-test/results

New comment by super_ar in "Load Test GlassFlow for ClickHouse: Real-Time Dedup at Scale"

super_ar — Sun, 22 Jun 2025 18:35:54 +0000

Fair point. Thanks for calling it out! To clarify, we’re focused on a specific use case: Kafka to ClickHouse pipelines with exactly-once guarantees. Kafka can’t provide exactly-once out of the box when writing to external systems like ClickHouse. You could use something like Flink, but there’s no native Flink-to-ClickHouse connector and Flink requires certain ops effort from the teams. Our goal was to show users a very easy-to-reproduce load test to validate the results. As a next step, we’re actively working on a Kubernetes-ready version that will scale horizontally and plan to share those higher-throughput results with the HN community soon.

New comment by super_ar in "Load Test GlassFlow for ClickHouse: Real-Time Dedup at Scale"

super_ar — Sun, 22 Jun 2025 16:05:45 +0000

Totally fair point. For stable, known workloads, you can get really far with something lightweight on a single machine. The challenge comes when you need fault tolerance, scaling, and delivery guarantees without constantly jumping in to fix things. We often hear from data teams about data peaks that they cannot easily predict. But yes, a lot of existing tools make you pay a high efficiency cost for that. At GlassFlow we are trying to hit that sweet spot...efficient but still resilient.

New comment by super_ar in "Load Test GlassFlow for ClickHouse: Real-Time Dedup at Scale"

super_ar — Thu, 19 Jun 2025 13:56:10 +0000

Hi HN, A few weeks ago, we shared GlassFlow: Open Source streaming ETL to dedup and join streams from Kafka for ClickHouse (https://news.ycombinator.com/item?id=43953722).

One of the top questions we received was: “How well does it perform at high throughput?”

We ran a load test and would like to share some results with you.

Summary of the test:

- Tested on 20m records

- Kafka produced 55,000 records/sec

- Processing rate of GlassFlow (deduplication): 9,000+ records/sec

- Measured on a MacBook Pro (M3 Max)

- End-to-end latency: <0.12 ms per request

Here is the blog post with the full test results, including runs with different parameters (rps, # of publishers, etc.): https://www.glassflow.dev/blog/load-test-glass-flow-for-clic...

It was important to us to set up the testing in a way that everybody could reproduce. Here are the docs: https://docs.glassflow.dev/load-test/setup

We would love to get feedback, especially from folks ingesting high-throughput data into ClickHouse.

Thanks for reading!

Ashish and Armend (founders)

Load Test GlassFlow for ClickHouse: Real-Time Dedup at Scale

super_ar — Thu, 19 Jun 2025 13:56:10 +0000

Article URL: https://www.glassflow.dev/blog/load-test-glass-flow-for-click-house-real-time-deduplication-at-scale

Comments URL: https://news.ycombinator.com/item?id=44318728

Points: 22

# Comments: 10

New comment by super_ar in "Show HN: GlassFlow – OSS streaming dedup and joins from Kafka to ClickHouse"

super_ar — Sun, 11 May 2025 20:28:15 +0000

Great question! RMT can work well when eventual consistency is acceptable and real-time accuracy isn't critical. But in use cases where results need to be correct immediately (dashboards, alerts, monitoring, etc.), waiting on background merges doesn't work.

Here are two more detailed examples:

Real-time fraud detection in logistics: Let's say you are streaming events from multiple sources (payments, GPS devices, user actions) for a dashboard that should trigger alerts when anomalies happen. Now you have duplicates (retries, partial system failure, etc.). Relying on RMT means incorrect counts until merges happen. This situation can lead to missed fraud, delayed interventions, etc.

Event collection from multi-systems like CRM + E-commerce + Tracking: Similar user or transaction data can come from multiple systems (e.g., CRM, Shopify, internal event logs). The same action might appear in slightly different formats across streams, causing duplicates in Kafka. ClickHouse can store these, but it doesn't enforce primary keys, so you end up with misleading results until RMT resolves.

New comment by super_ar in "Show HN: GlassFlow – OSS streaming dedup and joins from Kafka to ClickHouse"

super_ar — Sun, 11 May 2025 18:42:49 +0000

Thanks for asking those questions. Duplicates often come from how systems interact with Kafka, not from Kafka itself. For example, if a service retries sending a message after a timeout or if you collect similar data from multiple sources (like CRMs and web apps), you can end up with the same event multiple times. Kafka guarantees delivery at least once, so it doesn't remove duplicates.

ClickHouse doesn't enforce primary keys. It stores whatever you send. ReplacingMergeTree and FINAL exist in ClickHouse, but they are not optimal for real-time streams, because the background merge has to finish before queries are guaranteed to return correct results.
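The effect of background merges can be shown with a toy model. This is a deliberately simplified sketch, not ClickHouse code: `ToyReplacingTable` is a hypothetical class in which every insert creates a part, a plain count reads all parts, and a merge keeps one row per key. Real merges are scheduled by ClickHouse and are more involved than this.

```python
class ToyReplacingTable:
    """Toy model: each insert creates a part; a merge keeps one row per key."""

    def __init__(self):
        self.parts = []

    def insert(self, rows):
        # rows is a list of (key, version) tuples.
        self.parts.append(list(rows))

    def count(self):
        """Plain query: counts rows across all parts, duplicates included."""
        return sum(len(part) for part in self.parts)

    def count_final(self):
        """Query-time dedup, loosely analogous to reading with FINAL."""
        return len({key for part in self.parts for key, _ in part})

    def merge(self):
        """Stand-in for a background merge: keep the highest version per key."""
        latest = {}
        for part in self.parts:
            for key, version in part:
                if key not in latest or version >= latest[key]:
                    latest[key] = version
        self.parts = [sorted(latest.items())]


table = ToyReplacingTable()
table.insert([("evt-1", 1), ("evt-2", 1)])
table.insert([("evt-1", 2)])  # a retry delivers evt-1 again

count_before_merge = table.count()
count_with_final = table.count_final()
table.merge()
count_after_merge = table.count()
```

In this model the plain count is too high until `merge()` runs, while the query-time dedup is correct immediately but has to scan and compare keys on every read.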

With GlassFlow, you clean the data streams before they hit ClickHouse, ensuring correct query results and less load for ClickHouse.

In your IoT case, a scenario I can imagine is batch replays (you might resend data already ingested). But if you're sure the data is clean and only sent once, you may not need this.

New comment by super_ar in "Show HN: GlassFlow – OSS streaming dedup and joins from Kafka to ClickHouse"

super_ar — Sun, 11 May 2025 18:27:09 +0000

Great to hear that you are considering it for zenskar. We don't have a publicly available load test, but in internal checks it was able to handle 15k requests per second (locally on a MacBook Pro/M2 Docker). What is the load that you are expecting? Happy to connect.

New comment by super_ar in "Show HN: GlassFlow – OSS streaming dedup and joins from Kafka to ClickHouse"

super_ar — Sun, 11 May 2025 18:17:47 +0000

Good question! RMT does deduplication, but its dependency on background merges that you can't control can lead to incorrect results in queries until the merge is complete. We wanted something that cleans the duplicates in real time. GlassFlow moves deduplication upstream, before data hits ClickHouse. If you think of it from a pipeline perspective, we believe it is easier to understand, as it is a block before ClickHouse.

New comment by super_ar in "Show HN: GlassFlow – OSS streaming dedup and joins from Kafka to ClickHouse"

super_ar — Sun, 11 May 2025 17:43:16 +0000

It is a combination of both. We have a fantastic product designer colleague who takes care of the product, and a few friends who designed the website. I will forward your message to them. I am sure you've made their day. Thank you! :)

New comment by super_ar in "Show HN: GlassFlow – OSS streaming dedup and joins from Kafka to ClickHouse"

super_ar — Sun, 11 May 2025 17:06:33 +0000

Thanks, Sai! Great question. The deduplication works based on the user-defined key, not the entire row. You can specify which field (e.g. a primary key like event_id) to use as the deduplication key. Within a defined time window, GlassFlow guarantees that only the first event with a given key will be forwarded to ClickHouse. Subsequent duplicates are rejected. Our idea was to keep ClickHouse as clean as possible.
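A minimal sketch of that rule, assuming timestamps in seconds that arrive in non-decreasing order: the first event for a key is forwarded, and later events with the same key are rejected until the window has passed. `WindowDeduplicator` and the in-memory dict are hypothetical and only illustrate the semantics, not GlassFlow's implementation or state store.

```python
class WindowDeduplicator:
    """Forward only the first event per key within a time window."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.first_seen = {}  # key -> timestamp of the event that was forwarded

    def accept(self, key, timestamp):
        """Return True if the event should be forwarded, False if it is a duplicate."""
        seen_at = self.first_seen.get(key)
        if seen_at is not None and timestamp - seen_at < self.window_seconds:
            return False
        # First occurrence, or the previous one has aged out of the window.
        self.first_seen[key] = timestamp
        return True


dedup = WindowDeduplicator(window_seconds=60)
events = [("evt-1", 0), ("evt-2", 10), ("evt-1", 30), ("evt-1", 90)]
forwarded = [(key, ts) for key, ts in events if dedup.accept(key, ts)]
```

Here `evt-1` at 30s is dropped because it falls inside the 60s window opened at 0s, while `evt-1` at 90s is forwarded again because the window has expired. A production version would also evict expired keys so the state does not grow without bound.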

Show HN: GlassFlow – OSS streaming dedup and joins from Kafka to ClickHouse

super_ar — Sun, 11 May 2025 13:33:54 +0000

Hi HN! We are Ashish and Armend, founders of GlassFlow. We just launched our open-source streaming ETL that deduplicates and joins Kafka streams before ingesting them to ClickHouse https://github.com/glassflow/clickhouse-etl

Why we built this: Dedup with batch data is straightforward. You load the data into a temporary table. Then, find only the latest versions of the record through hashes or keys and keep them. After that, move the clean data into your main table. But have you tried this with streaming data? Users of our prev product were running real-time analytics pipelines from Kafka to ClickHouse and noticed that the analyses were wrong due to duplicates. The source systems produced duplicates as they ingested similar user data from CRMs, shop systems and click streams.
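For comparison, the batch approach described above can be sketched in a few lines: load everything into a staging area, keep only the latest version per key, then move the result into the main table. The field names `id` and `updated_at` and the function `dedup_batch` are assumptions made for this example.

```python
def dedup_batch(staging_rows, key_field="id", version_field="updated_at"):
    """Keep only the latest version of each record from a staging batch."""
    latest = {}
    for row in staging_rows:
        key = row[key_field]
        if key not in latest or row[version_field] > latest[key][version_field]:
            latest[key] = row
    return list(latest.values())


# Staging table with two versions of the same user record.
staging = [
    {"id": "u1", "updated_at": 1, "plan": "free"},
    {"id": "u2", "updated_at": 1, "plan": "pro"},
    {"id": "u1", "updated_at": 2, "plan": "pro"},
]
main_table = dedup_batch(staging)
```

This works because the whole batch is visible at once. With a stream, the duplicate can arrive minutes or days after the first record, which is why the streaming case needs state that outlives a single batch.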

We wanted to solve this issue for them with the existing ClickHouse options, but ClickHouse ReplacingMergeTree has an uncontrollable background merging process. This means the new data is in the system, but you never know when the merge will finish, and until then, your queries return incorrect results.

We looked into using FINAL but haven't been happy with the speed for real-time workloads.

We tried Flink, but there is too much overhead in managing Java Flink jobs. A self-built solution would have required us to set up and maintain state storage, possibly a very large one (it grows with the number of unique keys), to keep track of whether we have already encountered a record. And if your dedupe service fails, you need to rehydrate that state before processing new records. That would have been too much maintenance for us.

We decided to solve it by building a new product and are excited to share it with you.

The key difference is that the streams are deduplicated before ingesting to ClickHouse. So, ClickHouse always has clean data and less load, eliminating the risk of wrong results. We want more people to benefit from it and decided to open-source it (Apache-2.0).

Main components:

- Streaming deduplication: You define the deduplication key and a time window (up to 7 days), and it handles the checks in real time to avoid duplicates before hitting ClickHouse. The state store is built in.

- Temporal Stream Joins: You can join two Kafka streams on the fly with a few config inputs. You set the join key, choose a time window (up to 7 days), and you're good.

- Built-in Kafka source connector: There is no need to build custom consumers or manage polling logic. Just point it at your Kafka cluster, and it auto-subscribes to the topics you define. Payloads are parsed as JSON by default, so you get structured data immediately. As underlying tech, we decided on NATS to make it lightweight and low-latency.

- ClickHouse sink: Data gets pushed into ClickHouse through a native connector optimized for performance. You can tweak batch sizes and flush intervals to match your throughput needs. It handles retries automatically, so you don't lose data on transient failures.
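To show how batch size and flush interval interact in a sink like the one described above, here is a minimal sketch. `BatchingSink`, `max_batch_size`, and `max_delay_seconds` are hypothetical names rather than GlassFlow configuration options, the write function is a plain callable instead of a real ClickHouse client, and retries are left out to keep the example short.

```python
class BatchingSink:
    """Buffer rows and write them in batches, by size or by age."""

    def __init__(self, write_batch, max_batch_size, max_delay_seconds):
        self.write_batch = write_batch  # callable that receives a list of rows
        self.max_batch_size = max_batch_size
        self.max_delay_seconds = max_delay_seconds
        self.buffer = []
        self.oldest_at = None  # arrival time of the oldest buffered row

    def add(self, row, now):
        if not self.buffer:
            self.oldest_at = now
        self.buffer.append(row)
        if len(self.buffer) >= self.max_batch_size:
            self.flush()

    def tick(self, now):
        """Call periodically so a partial batch does not wait forever."""
        if self.buffer and now - self.oldest_at >= self.max_delay_seconds:
            self.flush()

    def flush(self):
        if self.buffer:
            self.write_batch(self.buffer)
            self.buffer = []
            self.oldest_at = None


inserts = []  # each element stands for one insert sent to the database
sink = BatchingSink(inserts.append, max_batch_size=3, max_delay_seconds=5)
for i in range(7):
    sink.add({"n": i}, now=i * 0.1)
sink.tick(now=1.0)  # too early, the last row stays buffered
sink.tick(now=6.0)  # interval elapsed, the partial batch is flushed
batch_sizes = [len(batch) for batch in inserts]
```

Seven rows result in three inserts instead of seven. A larger batch size means fewer, bigger inserts, and the flush interval caps how long a row can wait when traffic is low.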

We'd love to hear your feedback and know if you solved it nicely with existing tools. Thanks for reading!

ClickHouse Denormalization is not the answer to slow JOINs