<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: super_ar</title><link>https://news.ycombinator.com/user?id=super_ar</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Sun, 21 Jun 2026 19:43:02 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=super_ar" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by super_ar in "Show HN: 500k+ events/sec transformations for ClickHouse ingestion"]]></title><description><![CDATA[
<p>Good question. I wouldn’t say this replaces Flink in general. If you already run Flink and are comfortable with it, it’s a very powerful system.<p>Where we saw friction with Flink was mainly:
1.) Operational overhead (jobs, state backends, checkpointing)
2.) Generic sinks not being optimized for ClickHouse (batching, small inserts, etc.)<p>We focused on making scaling a property of the pipeline itself (just add replicas) and optimizing specifically for ClickHouse ingestion patterns.<p>So Flink is more general, this is more opinionated and focused on this specific use case.</p>
]]></description><pubDate>Wed, 08 Apr 2026 20:37:06 +0000</pubDate><link>https://news.ycombinator.com/item?id=47695926</link><dc:creator>super_ar</dc:creator><comments>https://news.ycombinator.com/item?id=47695926</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47695926</guid></item><item><title><![CDATA[Show HN: 500k+ events/sec transformations for ClickHouse ingestion]]></title><description><![CDATA[
<p>Hi HN! We are Ashish and Armend, founders of GlassFlow.<p>Over the last year, we worked with teams running high-throughput pipelines into self-hosted ClickHouse, mostly for observability and real-time analytics.<p>A question that came up repeatedly was:
What happens when throughput grows?<p>Usually, things work fine at 10k events/sec, but we started seeing backpressure and errors above 100k events/sec.<p>Once the throughput of a single pipeline stops scaling, adding more CPU or memory doesn’t help, because parts of the pipeline are often not parallelized or are bottlenecked by state handling.<p>At this point, engineers usually scale by adding more pipeline instances.<p>That works but comes with some trade-offs:
- You have to split the workload (e.g., multiple pipelines reading from the same source)
- Transformation logic gets duplicated across pipelines
- Stateful logic becomes harder to manage and keep consistent
- Debugging and changes get more difficult because the data flow is fragmented<p>Another challenge arises when working with high-cardinality keys like user IDs, session IDs, or request IDs, and when you need to handle longer time windows (24h or more). The state grows quickly and many systems rely on in-memory state, which makes it expensive and harder to recover from failures.<p>We wanted to solve this problem and rebuild our approach at GlassFlow.<p>Instead of scaling by adding more pipelines, we scale within a single pipeline by using replicas. Each replica consumes, processes, and writes independently, and the workload is distributed across them.<p>In the benchmarks we’re sharing, this scales to 500k+ events/sec while still running stateful transformations and writing into ClickHouse.<p>A few things we think are interesting:
- Scaling is close to linear as you add replicas
- Works with stateful transformations (not just stateless ingestion)
- State is backed by a file-based KV store instead of relying purely on memory
- The ClickHouse sink is optimized for batching to avoid small inserts
- The product is built with Go<p>Full write-up + benchmarks:
<a href="https://www.glassflow.dev/blog/glassflow-now-scales-to-500k-events-per-sec" rel="nofollow">https://www.glassflow.dev/blog/glassflow-now-scales-to-500k-...</a><p>Repo:
<a href="https://github.com/glassflow/clickhouse-etl" rel="nofollow">https://github.com/glassflow/clickhouse-etl</a><p>Happy to answer questions about the design or trade-offs.</p>
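<p>To make "the workload is distributed across replicas" concrete, here is a minimal sketch of one common way to do it: route each event to a replica by hashing its key, so all events for a key land on the same replica and its state stays local. This is an assumption for illustration, not a description of how GlassFlow assigns work; <code>replica_for</code> and the <code>user_id</code> field are hypothetical names.</p>

```python
import zlib


def replica_for(key: str, num_replicas: int) -> int:
    """Map a key to a replica index with a stable hash (illustrative only).

    crc32 is used instead of the built-in hash() because hash() of a str is
    randomized per process, which would route the same key differently
    after a restart.
    """
    return zlib.crc32(key.encode("utf-8")) % num_replicas


# Hypothetical events: 5 distinct users, 20 events in total.
events = [{"user_id": f"user-{i % 5}", "seq": i} for i in range(20)]
num_replicas = 4

# Each replica only ever sees the keys that hash to it.
per_replica = {r: [] for r in range(num_replicas)}
for event in events:
    per_replica[replica_for(event["user_id"], num_replicas)].append(event)
```

<p>With this kind of routing, adding replicas spreads the keys over more workers without duplicating transformation logic, although changing the replica count remaps keys and any per-key state would have to move with them.</p>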
<hr>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=47693407">https://news.ycombinator.com/item?id=47693407</a></p>
<p>Points: 13</p>
<p># Comments: 4</p>
]]></description><pubDate>Wed, 08 Apr 2026 17:26:26 +0000</pubDate><link>https://github.com/glassflow/clickhouse-etl</link><dc:creator>super_ar</dc:creator><comments>https://news.ycombinator.com/item?id=47693407</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47693407</guid></item><item><title><![CDATA[New comment by super_ar in "[dead]"]]></title><description><![CDATA[
<p>I am seeing this pattern a lot lately. Teams start with a simple flow:<p>logs/metrics → Vector → ClickHouse<p>This works well as long as they run simple transformations in Vector. When they start adding things like dedupe, longer time windows, more data volume, or joins, things start to break. They are effectively using Vector as a stream processing engine.<p>Typical issues that I see:<p>Time window limits: By default, Vector handles windowing in memory, so under higher load the state becomes too heavy to keep there.<p>Missing support: In production, I have seen teams under pressure because there is no support available (except for Datadog customers), and most people I know run it self-hosted.<p>Scaling hits a ceiling: I keep hearing similar numbers: 250k to 300k rec/sec per instance. Even with more resources, throughput does not scale. The consequences are backpressure, latency spikes, etc.<p>At that point, it is no longer a “log pipeline.” It is a streaming system that is just not treated like one.<p>I wrote a deeper breakdown of this here if anyone’s curious:<p><a href="https://www.glassflow.dev/blog/when-vector-becomes-your-streaming-engine" rel="nofollow">https://www.glassflow.dev/blog/when-vector-becomes-your-stre...</a><p>Curious how people here are handling this.<p>Are you still pushing more logic into Vector, or have you split it out elsewhere?</p>
]]></description><pubDate>Tue, 31 Mar 2026 15:59:08 +0000</pubDate><link>https://news.ycombinator.com/item?id=47589344</link><dc:creator>super_ar</dc:creator><comments>https://news.ycombinator.com/item?id=47589344</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47589344</guid></item><item><title><![CDATA[New comment by super_ar in "Show HN: Free Logo API – logos for any company or domain"]]></title><description><![CDATA[
<p>Really cool!</p>
]]></description><pubDate>Tue, 09 Dec 2025 18:25:56 +0000</pubDate><link>https://news.ycombinator.com/item?id=46208538</link><dc:creator>super_ar</dc:creator><comments>https://news.ycombinator.com/item?id=46208538</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46208538</guid></item><item><title><![CDATA[New comment by super_ar in "Show HN: DeChecker – Detect AI-generated text"]]></title><description><![CDATA[
<p>Looks cool! Do you have any idea how "good" it is at detecting AI-generated text?</p>
]]></description><pubDate>Tue, 09 Dec 2025 18:21:12 +0000</pubDate><link>https://news.ycombinator.com/item?id=46208469</link><dc:creator>super_ar</dc:creator><comments>https://news.ycombinator.com/item?id=46208469</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46208469</guid></item><item><title><![CDATA[New comment by super_ar in "Show HN: I got 50% of my traffic from ChatGPT instead of Google"]]></title><description><![CDATA[
<p>This is interesting. Just wondering about your traffic volume and how long you have been running lcoalpdf?<p>For us, it is more like 5% of the traffic from GEO, but we have been running the company for 2 years and have created a lot of handwritten content for devs.</p>
]]></description><pubDate>Tue, 09 Dec 2025 18:17:50 +0000</pubDate><link>https://news.ycombinator.com/item?id=46208427</link><dc:creator>super_ar</dc:creator><comments>https://news.ycombinator.com/item?id=46208427</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46208427</guid></item><item><title><![CDATA[New comment by super_ar in "Load Test GlassFlow for ClickHouse: Real-Time Dedup at Scale"]]></title><description><![CDATA[
<p>Yes, same machine.</p>
]]></description><pubDate>Sun, 22 Jun 2025 19:21:14 +0000</pubDate><link>https://news.ycombinator.com/item?id=44349543</link><dc:creator>super_ar</dc:creator><comments>https://news.ycombinator.com/item?id=44349543</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44349543</guid></item><item><title><![CDATA[New comment by super_ar in "Load Test GlassFlow for ClickHouse: Real-Time Dedup at Scale"]]></title><description><![CDATA[
<p>There is another test that we published on our docs page. You can check it out here:<p>Setup: <a href="https://docs.glassflow.dev/load-test/setup" rel="nofollow">https://docs.glassflow.dev/load-test/setup</a><p>Results: <a href="https://docs.glassflow.dev/load-test/results" rel="nofollow">https://docs.glassflow.dev/load-test/results</a></p>
]]></description><pubDate>Sun, 22 Jun 2025 18:39:51 +0000</pubDate><link>https://news.ycombinator.com/item?id=44349244</link><dc:creator>super_ar</dc:creator><comments>https://news.ycombinator.com/item?id=44349244</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44349244</guid></item><item><title><![CDATA[New comment by super_ar in "Load Test GlassFlow for ClickHouse: Real-Time Dedup at Scale"]]></title><description><![CDATA[
<p>Fair point. Thanks for calling it out! To clarify, we’re focused on a specific use case: Kafka to ClickHouse pipelines with exactly-once guarantees. Kafka can’t provide exactly-once out of the box when writing to external systems like ClickHouse. You could use something like Flink, but there’s no native Flink-to-ClickHouse connector, and Flink requires ongoing ops effort from the team.
Our goal was to show users a very easy-to-reproduce load test to validate the results. As a next step, we’re actively working on a Kubernetes-ready version that will scale horizontally and plan to share those higher-throughput results with the HN community soon.</p>
]]></description><pubDate>Sun, 22 Jun 2025 18:35:54 +0000</pubDate><link>https://news.ycombinator.com/item?id=44349205</link><dc:creator>super_ar</dc:creator><comments>https://news.ycombinator.com/item?id=44349205</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44349205</guid></item><item><title><![CDATA[New comment by super_ar in "Load Test GlassFlow for ClickHouse: Real-Time Dedup at Scale"]]></title><description><![CDATA[
<p>Totally fair point. For stable, known workloads, you can get really far with something lightweight on a single machine. The challenge comes when you need fault tolerance, scaling, and delivery guarantees without constantly jumping in to fix things. We often hear from data teams about data peaks that they cannot easily predict. But yes, a lot of existing tools make you pay a high efficiency cost for that. At GlassFlow, we are trying to hit that sweet spot: efficient but still resilient.</p>
]]></description><pubDate>Sun, 22 Jun 2025 16:05:45 +0000</pubDate><link>https://news.ycombinator.com/item?id=44348022</link><dc:creator>super_ar</dc:creator><comments>https://news.ycombinator.com/item?id=44348022</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44348022</guid></item><item><title><![CDATA[New comment by super_ar in "Load Test GlassFlow for ClickHouse: Real-Time Dedup at Scale"]]></title><description><![CDATA[
<p>Hi HN, a few weeks ago we shared GlassFlow, an open-source streaming ETL to dedup and join streams from Kafka for ClickHouse (<a href="https://news.ycombinator.com/item?id=43953722">https://news.ycombinator.com/item?id=43953722</a>).<p>One of the top questions we received was: “How well does it perform at high throughput?”<p>We ran a load test and would like to share some results with you.<p>Summary of the test:<p>- Tested on 20M records<p>- Kafka produced 55,000 records/sec<p>- Processing rate of GlassFlow (deduplication): 9,000+ records/sec<p>- Measured on a MacBook Pro (M3 Max)<p>- End-to-end latency: <0.12 ms per request<p>Here is the blog post with the full test results, including runs with different parameters (RPS, number of publishers, etc.):
<a href="https://www.glassflow.dev/blog/load-test-glass-flow-for-click-house-real-time-deduplication-at-scale" rel="nofollow">https://www.glassflow.dev/blog/load-test-glass-flow-for-clic...</a><p>It was important to us to set up the testing in a way that everybody could reproduce. Here are the docs:
<a href="https://docs.glassflow.dev/load-test/setup" rel="nofollow">https://docs.glassflow.dev/load-test/setup</a><p>We would love to get feedback, especially from folks ingesting high-throughput data into ClickHouse.<p>Thanks for reading!<p>Ashish and Armend (founders)</p>
]]></description><pubDate>Thu, 19 Jun 2025 13:56:10 +0000</pubDate><link>https://news.ycombinator.com/item?id=44318729</link><dc:creator>super_ar</dc:creator><comments>https://news.ycombinator.com/item?id=44318729</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44318729</guid></item><item><title><![CDATA[Load Test GlassFlow for ClickHouse: Real-Time Dedup at Scale]]></title><description><![CDATA[
<p>Article URL: <a href="https://www.glassflow.dev/blog/load-test-glass-flow-for-click-house-real-time-deduplication-at-scale">https://www.glassflow.dev/blog/load-test-glass-flow-for-click-house-real-time-deduplication-at-scale</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=44318728">https://news.ycombinator.com/item?id=44318728</a></p>
<p>Points: 22</p>
<p># Comments: 10</p>
]]></description><pubDate>Thu, 19 Jun 2025 13:56:10 +0000</pubDate><link>https://www.glassflow.dev/blog/load-test-glass-flow-for-click-house-real-time-deduplication-at-scale</link><dc:creator>super_ar</dc:creator><comments>https://news.ycombinator.com/item?id=44318728</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44318728</guid></item><item><title><![CDATA[New comment by super_ar in "Show HN: GlassFlow – OSS streaming dedup and joins from Kafka to ClickHouse"]]></title><description><![CDATA[
<p>Great question! RMT can work well when eventual consistency is acceptable and real-time accuracy isn't critical. But in use cases where results need to be correct immediately (dashboards, alerts, monitoring, etc.), waiting on background merges doesn't work.<p>Here are two more detailed examples:<p>Real-time fraud detection in logistics:
Let's say you are streaming events from multiple sources (payments, GPS devices, user actions) into a dashboard that should trigger alerts when anomalies happen. Now you have duplicates (retries, partial system failures, etc.). Relying on RMT means incorrect counts until merges happen. This can lead to missed fraud, delayed interventions, etc.<p>Event collection from multiple systems such as CRM + e-commerce + tracking:
Similar user or transaction data can come from multiple systems (e.g., CRM, Shopify, internal event logs). The same action might appear in slightly different formats across streams, causing duplicates in Kafka. ClickHouse can store these, but it doesn't enforce primary keys, so you end up with misleading results until RMT resolves.</p>
]]></description><pubDate>Sun, 11 May 2025 20:28:15 +0000</pubDate><link>https://news.ycombinator.com/item?id=43956875</link><dc:creator>super_ar</dc:creator><comments>https://news.ycombinator.com/item?id=43956875</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43956875</guid></item><item><title><![CDATA[New comment by super_ar in "Show HN: GlassFlow – OSS streaming dedup and joins from Kafka to ClickHouse"]]></title><description><![CDATA[
<p>Thanks for asking those questions. Duplicates often come from how systems interact with Kafka, not from Kafka itself. For example, if a service retries sending a message after a timeout, or if you collect similar data from multiple sources (like CRMs and web apps), you can end up with the same event multiple times. Kafka guarantees at-least-once delivery, so it doesn't remove duplicates.<p>ClickHouse doesn't enforce primary keys. It stores whatever you send. ReplacingMergeTree and FINAL are options in ClickHouse, but they are not optimal for real-time streams: ReplacingMergeTree relies on background merges that must finish before queries return correct results, and FINAL slows queries down.<p>With GlassFlow, you clean the data streams before they hit ClickHouse, ensuring correct query results and less load for ClickHouse.<p>In your IoT case, a scenario I can imagine is batch replays (you might resend data already ingested). But if you're sure the data is clean and only sent once, you may not need this.</p>
]]></description><pubDate>Sun, 11 May 2025 18:42:49 +0000</pubDate><link>https://news.ycombinator.com/item?id=43955978</link><dc:creator>super_ar</dc:creator><comments>https://news.ycombinator.com/item?id=43955978</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43955978</guid></item><item><title><![CDATA[New comment by super_ar in "Show HN: GlassFlow – OSS streaming dedup and joins from Kafka to ClickHouse"]]></title><description><![CDATA[
<p>Great to hear that you are considering it for zenskar. We don't have a publicly available load test, but in internal checks it handled 15k requests per second (locally, in Docker on a MacBook Pro M2). What is the load that you are expecting? Happy to connect.</p>
]]></description><pubDate>Sun, 11 May 2025 18:27:09 +0000</pubDate><link>https://news.ycombinator.com/item?id=43955817</link><dc:creator>super_ar</dc:creator><comments>https://news.ycombinator.com/item?id=43955817</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43955817</guid></item><item><title><![CDATA[New comment by super_ar in "Show HN: GlassFlow – OSS streaming dedup and joins from Kafka to ClickHouse"]]></title><description><![CDATA[
<p>Good question! RMT does deduplication, but its dependency on background merges that you can't control can lead to incorrect results in queries until the merge is complete. We wanted something that cleans the duplicates in real time. GlassFlow moves deduplication upstream, before data hits ClickHouse. If you think of it from a pipeline perspective, we believe it is easier to understand, as it is a block before ClickHouse.</p>
]]></description><pubDate>Sun, 11 May 2025 18:17:47 +0000</pubDate><link>https://news.ycombinator.com/item?id=43955747</link><dc:creator>super_ar</dc:creator><comments>https://news.ycombinator.com/item?id=43955747</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43955747</guid></item><item><title><![CDATA[New comment by super_ar in "Show HN: GlassFlow – OSS streaming dedup and joins from Kafka to ClickHouse"]]></title><description><![CDATA[
<p>It is a combination of both. We have a fantastic product designer colleague who takes care of the product, and a few friends who designed the website. I will forward your message to them. I am sure you've made their day. Thank you! :)</p>
]]></description><pubDate>Sun, 11 May 2025 17:43:16 +0000</pubDate><link>https://news.ycombinator.com/item?id=43955445</link><dc:creator>super_ar</dc:creator><comments>https://news.ycombinator.com/item?id=43955445</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43955445</guid></item><item><title><![CDATA[New comment by super_ar in "Show HN: GlassFlow – OSS streaming dedup and joins from Kafka to ClickHouse"]]></title><description><![CDATA[
<p>Thanks, Sai! Great question. The deduplication works based on the user-defined key, not the entire row. You can specify which field (e.g. a primary key like event_id) to use as the deduplication key. Within a defined time window, GlassFlow guarantees that only the first event with a given key will be forwarded to ClickHouse. Subsequent duplicates are rejected. Our idea was to keep ClickHouse as clean as possible.</p>
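<p>A minimal sketch of the behavior described above, assuming an in-memory store and timestamps that never go backwards. It is not GlassFlow's implementation; <code>WindowedDeduplicator</code>, <code>accept</code>, and the sample events are hypothetical names used only to show "first event per key within a window wins".</p>

```python
from collections import OrderedDict


class WindowedDeduplicator:
    """Forward only the first event per key within a time window (sketch)."""

    def __init__(self, window_seconds: float) -> None:
        self.window_seconds = window_seconds
        # key -> timestamp of the first event seen for that key.
        # Insertion order equals time order as long as `now` never decreases.
        self._first_seen = OrderedDict()

    def _evict_expired(self, now: float) -> None:
        # Drop keys whose window has fully elapsed, oldest first.
        while self._first_seen:
            _, first_ts = next(iter(self._first_seen.items()))
            if now - first_ts < self.window_seconds:
                break
            self._first_seen.popitem(last=False)

    def accept(self, event: dict, key_field: str, now: float) -> bool:
        """Return True if the event should be forwarded, False if rejected."""
        self._evict_expired(now)
        key = event[key_field]
        if key in self._first_seen:
            return False  # duplicate inside the window
        self._first_seen[key] = now
        return True


dedup = WindowedDeduplicator(window_seconds=60)
events = [
    ({"event_id": "a", "v": 1}, 0.0),
    ({"event_id": "a", "v": 2}, 10.0),  # duplicate of "a" inside the window
    ({"event_id": "b", "v": 3}, 20.0),
    ({"event_id": "a", "v": 4}, 70.0),  # window for "a" has expired
]
forwarded = [e for e, ts in events if dedup.accept(e, "event_id", ts)]
```

<p>Here only the events with <code>v</code> equal to 1, 3, and 4 are forwarded. A real deployment would need persistent state so the seen-keys set survives restarts, which this sketch deliberately leaves out.</p>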
]]></description><pubDate>Sun, 11 May 2025 17:06:33 +0000</pubDate><link>https://news.ycombinator.com/item?id=43955158</link><dc:creator>super_ar</dc:creator><comments>https://news.ycombinator.com/item?id=43955158</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43955158</guid></item><item><title><![CDATA[Show HN: GlassFlow – OSS streaming dedup and joins from Kafka to ClickHouse]]></title><description><![CDATA[
<p>Hi HN! We are Ashish and Armend, founders of GlassFlow. We just launched our open-source streaming ETL that deduplicates and joins Kafka streams before ingesting them to ClickHouse <a href="https://github.com/glassflow/clickhouse-etl">https://github.com/glassflow/clickhouse-etl</a><p>Why we built this:
Dedup with batch data is straightforward. You load the data into a temporary table. Then, find only the latest versions of the record through hashes or keys and keep them. After that, move the clean data into your main table. But have you tried this with streaming data? 
Users of our previous product were running real-time analytics pipelines from Kafka to ClickHouse and noticed that their analyses were wrong due to duplicates. The source systems produced duplicates as they ingested similar user data from CRMs, shop systems, and click streams.<p>We wanted to solve this issue for them with the existing ClickHouse options, but ClickHouse ReplacingMergeTree has an uncontrollable background merging process. This means the new data is in the system, but you never know when the merge will finish, and until then, your queries return incorrect results.<p>We looked into using FINAL but weren't happy with the speed for real-time workloads.<p>We tried Flink, but there is too much overhead in managing Java Flink jobs. A self-built solution would have required us to set up and maintain state storage, possibly a very large one (it grows with the number of unique keys), to keep track of whether we have already encountered a record. And if your dedupe service fails, you need to rehydrate that state before processing new records. That would have been too much maintenance for us.<p>We decided to solve it by building a new product and are excited to share it with you.<p>The key difference is that the streams are deduplicated before they are ingested into ClickHouse. So ClickHouse always has clean data and less load, which removes the risk of wrong results caused by duplicates. We want more people to benefit from it and decided to open-source it (Apache-2.0).<p>Main components:<p>- Streaming deduplication: 
You define the deduplication key and a time window (up to 7 days), and it handles the checks in real time to avoid duplicates before hitting ClickHouse. The state store is built in.<p>- Temporal Stream Joins:
You can join two Kafka streams on the fly with a few config inputs. You set the join key, choose a time window (up to 7 days), and you're good.<p>- Built-in Kafka source connector:
There is no need to build custom consumers or manage polling logic. Just point it at your Kafka cluster, and it auto-subscribes to the topics you define. Payloads are parsed as JSON by default, so you get structured data immediately. As underlying tech, we decided on NATS to make it lightweight and low-latency.<p>- ClickHouse sink:
Data gets pushed into ClickHouse through a native connector optimized for performance. You can tweak batch sizes and flush intervals to match your throughput needs. It handles retries automatically, so you don't lose data on transient failures.<p>We'd love to hear your feedback and know if you solved it nicely with existing tools. Thanks for reading!</p>
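<p>For readers wondering what "tweak batch sizes and flush intervals" means in practice, here is a minimal sketch of a size-or-time batching buffer. It is an illustration under assumptions, not the GlassFlow sink: <code>BatchingSink</code>, <code>max_batch_size</code>, and <code>max_wait_seconds</code> are hypothetical names, the flush callback stands in for a real ClickHouse insert, and retries are left out.</p>

```python
from typing import Callable, List, Optional


class BatchingSink:
    """Buffer rows and flush when the batch is full or too old (sketch)."""

    def __init__(
        self,
        flush: Callable[[List[dict]], None],
        max_batch_size: int,
        max_wait_seconds: float,
    ) -> None:
        self._flush = flush
        self.max_batch_size = max_batch_size
        self.max_wait_seconds = max_wait_seconds
        self._buffer: List[dict] = []
        self._oldest: Optional[float] = None  # arrival time of the oldest buffered row

    def add(self, row: dict, now: float) -> None:
        if not self._buffer:
            self._oldest = now
        self._buffer.append(row)
        self._maybe_flush(now)

    def tick(self, now: float) -> None:
        """Call periodically so a partially filled batch is not held forever."""
        self._maybe_flush(now)

    def _maybe_flush(self, now: float) -> None:
        if not self._buffer:
            return
        full = len(self._buffer) >= self.max_batch_size
        stale = now - self._oldest >= self.max_wait_seconds
        if full or stale:
            batch, self._buffer = self._buffer, []
            self._flush(batch)  # one large insert instead of many small ones


batches: List[List[dict]] = []
sink = BatchingSink(batches.append, max_batch_size=3, max_wait_seconds=5.0)
for i in range(4):
    sink.add({"id": i}, now=i * 0.1)  # the third row fills the first batch
sink.tick(now=6.0)  # the leftover row is flushed because it is older than 5 s
```

<p>Larger batches mean fewer inserts into ClickHouse, while a shorter flush interval bounds how long a row can sit in the buffer, so the two settings trade throughput against latency.</p>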
<hr>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=43953722">https://news.ycombinator.com/item?id=43953722</a></p>
<p>Points: 78</p>
<p># Comments: 32</p>
]]></description><pubDate>Sun, 11 May 2025 13:33:54 +0000</pubDate><link>https://github.com/glassflow/clickhouse-etl</link><dc:creator>super_ar</dc:creator><comments>https://news.ycombinator.com/item?id=43953722</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43953722</guid></item><item><title><![CDATA[ClickHouse Denormalization is not the answer to slow JOINs]]></title><description><![CDATA[
<p>Article URL: <a href="https://www.glassflow.dev/blog/denormalization-clickhouse">https://www.glassflow.dev/blog/denormalization-clickhouse</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=43635569">https://news.ycombinator.com/item?id=43635569</a></p>
<p>Points: 18</p>
<p># Comments: 5</p>
]]></description><pubDate>Wed, 09 Apr 2025 18:26:11 +0000</pubDate><link>https://www.glassflow.dev/blog/denormalization-clickhouse</link><dc:creator>super_ar</dc:creator><comments>https://news.ycombinator.com/item?id=43635569</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43635569</guid></item></channel></rss>