Hacker News: tang8330

New comment by tang8330 in "Show HN: Artie – Real-time data replication to your warehouse, now self-serve"

tang8330 — Wed, 10 Jun 2026 17:34:07 +0000

Great question - there's no Debezium under the hood. Artie has its own Reader and Transfer layers, built from scratch.

TOAST columns: Artie has automatic detection built in. If a TOAST column hasn't changed, its value won't appear in the WAL. Artie detects this and skips the update for that column in the destination. This works without needing to set REPLICA IDENTITY FULL on your tables.

Schema drift: Artie never requires a schema registry. For relational sources like Postgres, Artie reads the source schema directly and syncs new columns immediately. For DDL changes, Artie uses lazy schema evaluation. On the next DML event for the table, it compares source vs. destination schema and applies any outstanding changes before writing the row.
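As a rough illustration of the lazy approach described above (a hypothetical sketch, not Artie's actual code; `apply_event`, `run_sql`, and the dict-based schemas are made up for the example), the schema diff happens right before the row is written:

```python
from typing import Callable, Dict, Tuple


def pending_columns(source_schema: Dict[str, str], dest_schema: Dict[str, str]) -> Dict[str, str]:
    """Return columns that exist at the source but not yet in the destination."""
    return {name: typ for name, typ in source_schema.items() if name not in dest_schema}


def apply_event(
    table: str,
    row: dict,
    source_schema: Dict[str, str],
    dest_schema: Dict[str, str],
    run_sql: Callable[[str, Tuple], None],
) -> None:
    """Apply outstanding column additions, then write the row.

    `run_sql(sql, params)` is a hypothetical callable that executes a statement.
    """
    for name, typ in pending_columns(source_schema, dest_schema).items():
        run_sql(f"ALTER TABLE {table} ADD COLUMN {name} {typ}", ())
        dest_schema[name] = typ  # remember that the destination now has it

    columns = ", ".join(row)
    placeholders = ", ".join(["%s"] * len(row))
    run_sql(f"INSERT INTO {table} ({columns}) VALUES ({placeholders})", tuple(row.values()))
```

Nothing is altered until a DML event for the table actually arrives, so tables with no traffic never pay for a schema check.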

Let me know if you have any other questions!

Show HN: Artie – Real-time data replication to your warehouse, now self-serve

tang8330 — Wed, 10 Jun 2026 05:27:31 +0000

Hey HN, cofounder of Artie here. We’ve built a real-time data replication tool that captures every row-level change in your source database and streams it to your warehouse in under 60 seconds.

The last time I posted here, people had to book a call with us in order to access Artie. Today, that’s no longer the case. You can now connect your source and destination and start streaming immediately.

I spent years of my career building large-scale data pipelines and experienced firsthand how difficult it was to get real-time data. I believed there must be a better way to stream data into our warehouse, which is how Artie was born. And now with AI agents, reducing data latency has become more and more crucial, as agents need to make decisions off of fresh data.

When I first started building Artie, I quickly learned that the components meant to keep CDC running smoothly are very much bolted on with tons of edge cases. Unfortunately in practice, they were not built to work together. We ended up dealing with schema drift, backfill race conditions, Kafka offset commits, and TOAST columns. I’d love to know if others have hit these same issues while building in-house.

artie.com, would love feedback!

Comments URL: https://news.ycombinator.com/item?id=48471805

Points: 21

# Comments: 5

Show HN: Artie – Real-time data replication to your data warehouse, self-serve

tang8330 — Tue, 09 Jun 2026 17:47:17 +0000

Hey HN, cofounder of Artie here. I’ve been working on real-time database replication using CDC (Postgres/MongoDB into Snowflake, BigQuery, Redshift) with my wife for the last three years. Last time I posted here, people had to book a call with us to get access, but that’s no longer the case. You can connect your source and destination and start streaming immediately.

I encountered this problem firsthand as a heavy data warehouse user at prior jobs. Our warehouse data was always lagged and analytics were always stale. The most visceral version of this today: imagine an AI agent making decisions – on pricing, support routing, risk scoring – off a data warehouse that's 3-12 hours behind.

When we started, I thought the hard part was reading the WAL. The real problems:

Schema drift: CDC events carry row data but not column metadata, so when an engineer adds a column in prod, events with that column start arriving at the destination before you've run ALTER TABLE. In this case, you wouldn’t get an error – you would just silently drop data.

Backfill race conditions: the typical approach (snapshot first, then start CDC) means by the time your snapshot finishes on a large table, the stream has moved on. If you stitch them together wrong, you overwrite newer data with older snapshots.

Kafka offset commits: this sounds obvious, but it's difficult to get right. You can only commit after a successful merge into the destination, or you double-write on replay. Partial failures across a distributed system compound this quickly.
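A minimal sketch of the commit-after-merge ordering, using an in-memory stand-in rather than a real Kafka client (`InMemoryConsumer`, `process_batch`, and `upsert_into` are hypothetical names for illustration only):

```python
class InMemoryConsumer:
    """Hypothetical stand-in for a Kafka consumer; tracks one committed offset."""

    def __init__(self, messages):
        self.messages = messages
        self.committed = 0

    def poll(self, max_records):
        # Always read from the last committed offset, like a restart would.
        return self.messages[self.committed:self.committed + max_records]

    def commit(self, count):
        self.committed += count


def process_batch(consumer, merge, max_records=100):
    """Merge one batch; commit only if the merge succeeded."""
    batch = consumer.poll(max_records)
    if not batch:
        return 0
    merge(batch)  # if this raises, commit() is never reached
    consumer.commit(len(batch))
    return len(batch)


def upsert_into(dest):
    """Build an idempotent merge keyed on primary key, so replays do not double-write."""
    def merge(batch):
        for row in batch:
            dest[row["id"]] = row
    return merge
```

Because the merge is keyed on primary key, a batch that was partially written before a failure can be replayed in full without producing duplicates.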

TOAST columns: Postgres omits unchanged TOAST columns (large text/JSON/bytea – think JSONB config fields, long descriptions, binary blobs) from WAL events entirely for storage optimization. A naive pipeline reads ‘missing’ as ‘set to null’ and silently wipes valid data, which can mean a customer's entire config blob gets wiped out because an unrelated column on the same row got updated. The fix is merge logic that treats absent columns as ‘don't touch’ rather than ‘set to null,’ which breaks most off-the-shelf UPSERT patterns.
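To make the "don't touch" idea concrete, here is a small sketch under the assumption that the event is a dict where a missing key means "unchanged" and a key mapped to `None` means an explicit NULL (`build_update` and the `%s` placeholder style are illustrative, not Artie's actual merge logic):

```python
def build_update(table, pk, event, columns):
    """Build an UPDATE that only touches columns present in the event.

    A key that is absent means "unchanged"; a key mapped to None means an
    explicit NULL. Returns None when there is nothing to update.
    """
    present = [c for c in columns if c != pk and c in event]
    if not present:
        return None
    assignments = ", ".join(f"{c} = %s" for c in present)
    sql = f"UPDATE {table} SET {assignments} WHERE {pk} = %s"
    params = [event[c] for c in present] + [event[pk]]
    return sql, params
```

With columns `id`, `name`, and `config`, an event that only carries `id` and `name` produces a statement that never mentions `config`, so the stored blob survives the update.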

Curious whether others have hit these same walls building in-house, and would love feedback.

Comments URL: https://news.ycombinator.com/item?id=48464686

Points: 3

# Comments: 0

New comment by tang8330 in "Deep Dive into Postgres Write-Ahead Logs"

tang8330 — Wed, 25 Sep 2024 01:38:05 +0000

Thanks for the flag. We'll get that fixed!

Running Redshift at Scale

tang8330 — Wed, 15 Nov 2023 18:48:10 +0000

Article URL: https://blog.artie.so/best-practices-on-running-redshift-at-scale

Comments URL: https://news.ycombinator.com/item?id=38280664

Points: 32

# Comments: 7

New comment by tang8330 in "Preventing replication slot overflow on Postgres DB (AWS RDS)"

tang8330 — Mon, 11 Sep 2023 15:03:15 +0000

Artie OSS: https://github.com/artie-labs/transfer Artie Cloud: https://www.artie.so/

Preventing replication slot overflow on Postgres DB (AWS RDS)

tang8330 — Mon, 11 Sep 2023 15:03:15 +0000

Article URL: https://blog.artie.so/preventing-wal-growth-on-postgres-db-running-on-aws-rds

Comments URL: https://news.ycombinator.com/item?id=37468327

Points: 4

# Comments: 1

New comment by tang8330 in "Launch HN: Artie (YC S23) – Real time data replication to data warehouses"

tang8330 — Tue, 25 Jul 2023 21:11:34 +0000

On a single unbounded (CPU + mem) Debezium running on a VM extracting Postgres, I was able to clock in about 7-10m/hr. You could increase the # of tasks, but then it'll hinder your DB perf. Also, this is on your primary DB.

We found it far more efficient and less risky to do CDC streaming and snapshotting w/o read lock in parallel to two different topics. Once snapshot is done and drained, we then move to drain the CDC topic.
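The ordering can be sketched like this, assuming CDC capture starts no later than the snapshot and events are applied in order as upserts (the `drain` function and the event shape with `op` and `row` keys are hypothetical):

```python
def drain(snapshot_topic, cdc_topic):
    """Apply all snapshot rows first, then replay CDC events in order."""
    dest = {}

    # Phase 1: drain the snapshot topic. Rows may already be stale here.
    for row in snapshot_topic:
        dest[row["id"]] = row

    # Phase 2: drain the CDC topic. Later events overwrite snapshot values,
    # so the destination converges on the newest state of each row.
    for event in cdc_topic:
        key = event["row"]["id"]
        if event["op"] == "delete":
            dest.pop(key, None)
        else:
            dest[key] = event["row"]
    return dest
```

Since the snapshot is fully applied before any CDC event, an older snapshot row can never overwrite a newer change.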

New comment by tang8330 in "Launch HN: Artie (YC S23) – Real time data replication to data warehouses"

tang8330 — Tue, 25 Jul 2023 15:11:48 +0000

Hi! Yes, our next 2 connectors are going to be S3 and DynamoDB.

GitHub Issue: https://github.com/artie-labs/transfer/issues/157

If you'd like to be another design partner for us on this, do reach out. I'm at robin@artie.so.

New comment by tang8330 in "Launch HN: Artie (YC S23) – Real time data replication to data warehouses"

tang8330 — Tue, 25 Jul 2023 15:10:51 +0000

Definitely. For staging temporary tables to merge for our Redshift destination, we're uploading them to S3.

We will be creating an S3 destination with TSV, Avro, and Parquet format support very soon.

New comment by tang8330 in "Launch HN: Artie (YC S23) – Real time data replication to data warehouses"

tang8330 — Tue, 25 Jul 2023 15:09:08 +0000

For Postgres, we have our own custom snapshotter that is capable of doing parallel snapshots against your read replica without incurring WAL growth. More details here: https://news.ycombinator.com/item?id=36855338

For MySQL and MongoDB, we rely on Debezium to perform the initial snapshots.

New comment by tang8330 in "Launch HN: Artie (YC S23) – Real time data replication to data warehouses"

tang8330 — Tue, 25 Jul 2023 15:06:02 +0000

Thank you so much for the support!

New comment by tang8330 in "Launch HN: Artie (YC S23) – Real time data replication to data warehouses"

tang8330 — Mon, 24 Jul 2023 22:49:55 +0000

Thanks for the comment!

Your comment regarding DDL is interesting.

Today, this is what happens:

1/ If a column doesn't exist in the destination, we'll create it based on our typing inference from the data type (important: not the data value).

2/ Certain tools will handle automatic column data type conversion if a change like this is detected at the source. We do not do this. We will simply hard fail and cause head-of-line blocking. The reasoning: this is an anti-pattern and should be rare, in which case it's okay to cause an error and require manual intervention for this breaking change.

3/ If the column has been dropped from the source, you as the end user can decide whether this column should be also dropped in the destination, or not. The default is not to drop it.

^ We hear more customers explicitly don't want columns to be dropped because it could cause downstream errors, such as other views / tables not compiling due to referencing a non-existent column.

We haven't heard much from folks that don't even want columns to be added. If there is a need, we can definitely add that as a config option to provide maximum configurability.
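The three rules above could be sketched roughly as follows. This is a hypothetical illustration, not Transfer's actual code: `plan_ddl`, `IncompatibleTypeChange`, and the `drop_deleted_columns` flag name are all made up for the example.

```python
class IncompatibleTypeChange(Exception):
    """Raised when a column's type differs between source and destination."""


def plan_ddl(table, source, dest, drop_deleted_columns=False):
    """Return the DDL statements needed to reconcile dest with source.

    `source` and `dest` map column name -> type name.
    """
    statements = []
    for name, typ in source.items():
        if name not in dest:
            # Rule 1: missing column is created from the source data type.
            statements.append(f"ALTER TABLE {table} ADD COLUMN {name} {typ}")
        elif dest[name] != typ:
            # Rule 2: type changes hard fail and need manual intervention.
            raise IncompatibleTypeChange(f"{table}.{name}: {dest[name]} -> {typ}")
    if drop_deleted_columns:
        # Rule 3: dropping is opt-in; the default keeps the column.
        for name in dest:
            if name not in source:
                statements.append(f"ALTER TABLE {table} DROP COLUMN {name}")
    return statements
```

Raising on a type change instead of converting is what produces the head-of-line blocking: nothing after that event is written until someone resolves it.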

> Finally, the biggest issue with CDC always ends up being the seed loads, recoveries and the incremental snapshot strategies.

Yep totally. On the recovery bit, this is exactly why we are leveraging Kafka. If there are any particular issues, we simply don't commit the offset and cause head-of-line blocking.

On the incremental snapshot and recoveries bit, we primarily leverage Debezium's DDD-3 high watermark strategy [1] for MySQL and MongoDB. Postgres has a different issue in that replication slots can grow really fast, esp on AWS! [2]. We ended up writing our own custom snapshotter for Postgres that is Debezium compatible to onboard customers that have a massive dataset and cannot afford to have a read lock on their WAL.

[1] https://github.com/debezium/debezium-design-documents/blob/m... [2] https://www.morling.dev/blog/insatiable-postgres-replication...

New comment by tang8330 in "Launch HN: Artie (YC S23) – Real time data replication to data warehouses"

tang8330 — Mon, 24 Jul 2023 19:15:29 +0000

Hm, perhaps I wasn't being clear, apologies for that.

What I am proposing above is ways to provide a view to teams that do not want real-time data while keeping your underlying dataset in real-time.

New comment by tang8330 in "Launch HN: Artie (YC S23) – Real time data replication to data warehouses"

tang8330 — Mon, 24 Jul 2023 18:48:26 +0000

Thanks for your feedback!

> Something McKinley doesn't address is that it's quite advantageous if the values in your data warehouse don't change intra-day because this lets business users reach consensus. Whereas if Bob runs a report and gets $X, and Alice runs the same report 5 minutes later and gets $Y, that creates confusion (much more than you would expect). I recall a particular system I built that refreshed every 6 hours (limited by upstream), that eventually Marketing asked me to dial back to every 24 hours because they couldn't stand things changing in the middle of the day.

If they want to see a consistent view of the report, you could bound this.

1/ SELECT * FROM FOO WHERE DATE_TRUNC('day', updated_at) < DATE_TRUNC('day', DATEADD(day, -1, CURRENT_DATE()));

If your dataset doesn't contain an `updated_at` column, you can turn on `artie_updated_at`, which will add a column with the updated-at timestamp to support incremental ingestion.

2/ If you had stateful data, you could also explore creating a Snowflake task and leveraging the time travel f(x) to create a "snapshot" if your workload depended on it.

3/ Also, if you _did_ want this to be more lagged, you can actually increase the flushIntervalSeconds [1] to 6h, 24h, whichever time interval you fancy. You as the customer should have maximum flexibility when it comes to when to flush to DWH.

4/ You can also choose to refresh the analytical report on Looker / Mode to be daily. [2]

> Now of course I see you're targeting more real-time use cases like fraud detection. That's great! But why you would run a fraud detection process out of your data warehouse, which likely doesn't even have a production-grade uptime SLA? Run it out of your production database, that's what it's for!

You can certainly do this in your production DB (that was our original hypothesis as well!). However, after talking to more companies, it has become more obvious to us that folks running fraud algos actually want to join this across various data sets. Further, using a DWH provides a nice visualization layer on top.

Of course, you could go with something even more bespoke by utilizing real-time DBs such as Materialize / Rockset / RisingWave. It just comes with trade-offs, such as increased architectural complexity.

There are also plenty of additional use cases this can unlock. Given that the DWH is a platform, any post-DWH application can benefit from less lag, such as reverse ETL.

[1] https://docs.artie.so/running-transfer/options

[2] https://mode.com/help/articles/report-scheduling-and-sharing...

New comment by tang8330 in "Launch HN: Artie (YC S23) – Real time data replication to data warehouses"

tang8330 — Mon, 24 Jul 2023 18:40:44 +0000

For now, we're super focused on databases as sources. We really want to do this well before we move on to other data sources such as APIs.

New comment by tang8330 in "Launch HN: Artie (YC S23) – Real time data replication to data warehouses"

tang8330 — Mon, 24 Jul 2023 17:14:51 +0000

Thank you!!

New comment by tang8330 in "Launch HN: Artie (YC S23) – Real time data replication to data warehouses"

tang8330 — Mon, 24 Jul 2023 17:07:55 +0000

Definitely. Do you expect the resulting data to also be in binary / bytes format in your DWH?

I ask because there's a workaround by setting `binary.handling.mode` to a STRING type [1].

Transfer will then automatically pick this up and write this as a B64 string to the DWH.
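If you later need the raw bytes back, decoding on the consumer side is straightforward. A small sketch, assuming the column arrives as a standard base64 string (the `decode_binary_column` helper is hypothetical):

```python
import base64


def decode_binary_column(value: str) -> bytes:
    """Decode a base64-encoded column value back into raw bytes."""
    return base64.b64decode(value)
```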

[1] https://debezium.io/documentation/reference/stable/connector...

New comment by tang8330 in "Launch HN: Artie (YC S23) – Real time data replication to data warehouses"

tang8330 — Mon, 24 Jul 2023 15:53:44 +0000

That's fascinating! Thanks for providing more color, support for geometric shapes coming! https://github.com/artie-labs/transfer/issues/155

New comment by tang8330 in "Launch HN: Artie (YC S23) – Real time data replication to data warehouses"

tang8330 — Mon, 24 Jul 2023 15:28:53 +0000

Eventual consistency FTW!