Hacker News: edrenova

New comment by edrenova in "PostgreSQL Anonymizer"

edrenova — Fri, 17 Jan 2025 17:23:40 +0000

Just to jump in here -> Neosync supports RDS + more, and you can self-host it.

(I'm one of the co-founders)

New comment by edrenova in "Show HN: Greenmask 0.2 – Database anonymization tool"

edrenova — Thu, 17 Oct 2024 05:18:31 +0000

Thanks for the shout-out! Co-founder of Neosync here - love seeing more tools in this space and pushing the envelope further. Good luck!

New comment by edrenova in "In the land of LLMs, can we do better mock data generation?"

edrenova — Wed, 02 Oct 2024 18:46:20 +0000

Yup agreed. We built an orchestration engine into Neosync for that reason. It can handle all of the reading/writing from DBs for you. It can also generate data from scratch (using LLMs or not).

New comment by edrenova in "In the land of LLMs, can we do better mock data generation?"

edrenova — Wed, 02 Oct 2024 17:07:30 +0000

Nice write up, mock data generation with LLMs is pretty tough. We spent time trying to do it across multiple tables and it always had issues. Whether you look at classical ML models like GANs or even LLMs, they struggle with producing a lot of data and respecting FKs, Constraints and other relationships.

Maybe some day, it gets better but for now, we've found that using a more traditional algorithmic approach is more consistent.

Transparency: founder of Neosync - open source data anonymization - github.com/nucleuscloud/neosync

New comment by edrenova in "Show HN: Open-Source Data Anonymization for Developers"

edrenova — Tue, 17 Sep 2024 18:02:10 +0000

Thanks for the question! Faker is useful but doesn't have a lot of features. For example, referential integrity, data orchestration or the ability to read/write to a db. So faker can work for simple API schemas but if you need something more robust for an entire database, then that's where we can help.

New comment by edrenova in "Show HN: Open-Source Data Anonymization for Developers"

edrenova — Tue, 17 Sep 2024 18:00:46 +0000

Thanks! Yeah we generally recommend not making your databases public and instead connecting to them using a bastion host. We support this at Neosync. Also, ideally, not connecting to a live DB and instead a snapshot or back up. A read replica could work as well but a snapshot is better.

Show HN: Open-Source Data Anonymization for Developers

edrenova — Tue, 17 Sep 2024 16:15:17 +0000

Hey HN, we're Evis and Nick from Neosync (https://www.github.com/nucleuscloud/neosync).

Since we last introduced Neosync on HN 4 months ago, we’ve made a lot of progress and we’re excited to be launching several new features.

As a reminder, Neosync is an open source platform that helps developers anonymize production data, generate synthetic data, subset it and sync it across their environments for better testing, debugging and developer experience.

We do all of this while handling referential integrity. Whether you have primary keys, foreign keys, unique constraints, circular dependencies (within a table and across tables), sequences and more, Neosync preserves those references.
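To make the ordering problem concrete, here is a minimal sketch of one common way to respect foreign keys when loading tables: sort them so every referenced table is written before the tables that point at it. This is an illustration, not Neosync's implementation; the table names and the `load_order` and `has_cycle` helpers are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical schema: each table maps to the tables its foreign keys reference.
foreign_keys = {
    "customers": set(),
    "products": set(),
    "orders": {"customers"},
    "order_items": {"orders", "products"},
}


def load_order(fks):
    """Return tables ordered so referenced tables come before their dependents."""
    return list(TopologicalSorter(fks).static_order())


def has_cycle(fks):
    """Report whether the foreign keys form a circular dependency."""
    try:
        load_order(fks)
    except CycleError:
        return True
    return False


order = load_order(foreign_keys)
print(order)
```

A plain topological sort cannot order circular dependencies, which is why `has_cycle` exists. One common workaround (again an assumption, not something stated in the post) is to insert the rows with the circular column left null and fill it in with a follow-up update.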

Our goal is to give every developer production-like, representative data for a better developer experience without any security and privacy issues.

First, we’ve added new integrations. In addition to supporting Postgres and MySQL, we’re introducing first-class support for DynamoDB, MongoDB and SQL Server. You can also sync to object storage like S3 and GCP Cloud Storage.

Next, we’ve completely revamped our transformers. Transformers are how you anonymize sensitive data and generate new data. We’ve added new Transformers that you can use out of the box or you can write your own custom one in javascript. We’ve added real time validation and the ability to combine transformers together to create your own anonymization scheme.

We’ve also added in new features to make Neosync easier to use. For example, the ability to automatically map transformers to your schema. The ability to only append new records instead of a full refresh. And to stop jobs from running when the schema changes.

We've also upgraded our AI Synthetic Data features. You can use any LLM to generate synthetic data and Neosync will handle the orchestration between your database and the LLM.

Lastly, we’re also announcing Neosync Cloud, our hosted platform that allows you to use Neosync without having to run any of the infrastructure yourself. All you have to do is connect your source and destination database(s), configure your schema and you’re done.

Of course, you can use Neosync Open Source on-prem and hundreds of companies do. Neosync is written in Go and Typescript and can be started locally with a single make command.

We'd love any feedback you have and contributions are always welcome.

Comments URL: https://news.ycombinator.com/item?id=41569240

Points: 13

# Comments: 5

New comment by edrenova in "Show HN: An open-source, local-first Webflow for your own app"

edrenova — Thu, 29 Aug 2024 17:24:35 +0000

cool to see this launch, actually came across this a few weeks ago and tried it out, really nice for local dev :)

New comment by edrenova in "How to test without mocking"

edrenova — Mon, 17 Jun 2024 18:57:06 +0000

The ideal experience is that you anonymize prod and sync it locally. Whether it's for testing or debugging, it's the only way to get representative data.

When you write mock data, you almost always write "happy path" data that usually just works. But prod data is messy and chaotic which is really hard to replicate manually.

This is actually exactly what we do at Neosync (https://github.com/nucleuscloud/neosync). We help you anonymize your prod data and then sync it across environments. You can also generate synthetic data as well. We take care of all of the orchestration. And Neosync is open source.

(for transparency: I'm one of the co-founders)

New comment by edrenova in "[dead]"

edrenova — Tue, 28 May 2024 17:48:14 +0000

Excited to announce a new partnership between Neon (open source serverless postgres) and Neosync (open source data anonymization) to give developers the easiest way to create data branches with anonymized production data for better testing, debugging and developer experience.

New comment by edrenova in "Show HN: Neosync – Open-Source Data Anonymization for Postgres and MySQL"

edrenova — Thu, 23 May 2024 16:53:00 +0000

hey! so sorry about this - it's fixed now!

also - happy to chat further if you have any questions - evis@neosync.dev

New comment by edrenova in "Show HN: Neosync – Open-Source Data Anonymization for Postgres and MySQL"

edrenova — Wed, 22 May 2024 23:24:34 +0000

Nice! appreciate you sharing it - would love to see the code at some point but looks like it's confidential.

I spent a lot of time building tokenization solutions at a previous startup so we'll definitely support tokenization at some point. There is a good use-case for it as well!

New comment by edrenova in "Show HN: Neosync – Open-Source Data Anonymization for Postgres and MySQL"

edrenova — Wed, 22 May 2024 23:22:33 +0000

Yup - totally hear you - hopefully we'll have a good solution for that in a few months :)

New comment by edrenova in "Show HN: Neosync – Open-Source Data Anonymization for Postgres and MySQL"

edrenova — Wed, 22 May 2024 22:27:42 +0000

The ideal scenario is that you're able to augment your existing data with more data that looks just like it. The matter of statistical significance really depends on the use-case. For load testing, it's probably not as important as it is for something like feature testing/debugging/analytical queries.

Even if you know the distribution of the data (which imo can be fairly difficult) replicating that can also be tricky. If you know that a gender column is 30-70 male - female, how do you create 30% male names? How about the female names? Are they the same name or do you repeat names? Does it matter? In some cases it does and in others it doesn't.
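As a rough illustration of the 30-70 example, the sketch below draws a gender column with weighted sampling and then picks a name from a per-gender pool, allowing repeats. The name pools, the `generate_people` helper and the choice to allow repeated names are all assumptions made for the example; whether repeats are acceptable is exactly the use-case question raised above.

```python
import random

# Hypothetical name pools; a real generator would use far larger lists.
NAME_POOLS = {
    "male": ["James", "Omar", "Wei", "Mateo"],
    "female": ["Maria", "Aisha", "Yuki", "Chloe"],
}


def generate_people(n, male_share=0.3, seed=None):
    """Return n (gender, name) pairs with roughly male_share males."""
    rng = random.Random(seed)
    genders = rng.choices(
        ["male", "female"], weights=[male_share, 1 - male_share], k=n
    )
    # Names are drawn with replacement, so repeats are expected.
    return [(gender, rng.choice(NAME_POOLS[gender])) for gender in genders]


people = generate_people(10_000, seed=42)
male_fraction = sum(1 for gender, _ in people if gender == "male") / len(people)
print(f"male fraction: {male_fraction:.3f}")
```

This only matches a single column's marginal distribution. Correlations between columns (age and income, for instance) are not preserved by independent sampling like this.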

What we've seen is that it's really use-case specific and there are some tools that can help but there isn't a complete tool set. That's what we're trying to build over time.

New comment by edrenova in "Show HN: Neosync – Open-Source Data Anonymization for Postgres and MySQL"

edrenova — Wed, 22 May 2024 21:47:47 +0000

Thanks for the comment and feedback!

We're actually evaluating a ClickHouse integration at the moment for a customer that we're working with, so that might be coming in the future. Today, though, it's just PG and MySQL.

To answer your questions:

1. We don't support this quite yet although we're working on it for both anonymization and synthetic data. For anonymization, that typically means having deterministic anonymizers that output the same value for the same input (like a hash). For synthetic data that means using a model to be able to maintain those same statistical characteristics. We'll have support for both of these within the next quarter. It also depends on what you want to anonymize. If the values that you want to anonymize wouldn't meaningfully change the distribution of the data (think like a name or an address and you're not doing any analytics or queries on those fields) then the statistical distribution of the data stays the same.

2. A few big differences. PG Anonymizer doesn't handle referential integrity, pretty much everything has to be defined in SQL, it doesn't have a GUI, and it doesn't have any orchestration across environments or databases. Neosync supports all of those.
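The deterministic anonymizer idea from point 1 (same input, same output) can be sketched with a keyed hash. This is a minimal example under assumptions, not Neosync's transformer: the `anonymize_email` helper, the placeholder key and the `example.com` output domain are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical secret; in practice this would come from a secret store.
SECRET_KEY = b"example-key-change-me"


def anonymize_email(value: str, key: bytes = SECRET_KEY) -> str:
    """Map an email to a stable pseudonym: equal inputs give equal outputs."""
    normalized = value.strip().lower().encode("utf-8")
    digest = hmac.new(key, normalized, hashlib.sha256).hexdigest()
    # Truncating keeps the value readable; it also raises collision odds slightly.
    return f"user_{digest[:12]}@example.com"


first = anonymize_email("Jane.Doe@corp.example")
second = anonymize_email("jane.doe@corp.example")
other = anonymize_email("john.roe@corp.example")
print(first == second, first == other)
```

Because the mapping is stable, a value that appears in several tables is replaced by the same pseudonym everywhere, so joins on that column still line up. Using a keyed HMAC rather than a bare hash makes it harder to reverse the mapping by hashing guessed inputs.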

Folks use the cloud service because they don't have the resources or time to deploy/run the OSS offering themselves. These are usually startups who are okay with us streaming their data and anonymizing it and sending it back to them. We're SOC2 type 2 compliant and usually go through a security review for these deployments. Conversely, they can also just run our managed version and keep all of their data on their infra while we host the control plane.

New comment by edrenova in "Show HN: Neosync – Open-Source Data Anonymization for Postgres and MySQL"

edrenova — Wed, 22 May 2024 21:27:50 +0000

yeah the referential integrity and constraints part is usually the most complicated part and everyone does things differently which adds another layer of complexity on it

New comment by edrenova in "Show HN: Neosync – Open-Source Data Anonymization for Postgres and MySQL"

edrenova — Wed, 22 May 2024 20:33:29 +0000

yeah good question, if you're doing any sort of analytical work, then you'll care about the statistical distribution of your data. If you're running queries or sharing data with third parties, then you want to maintain the same stats. If you're just building features then you might not care as much as if you were doing analytical work. But it could still be relevant if you're building metrics/dashboards/anything visual - you'll want to see that you can render your prod data correctly. So more so for analytical work but less so for normal, run of the mill dev work.

New comment by edrenova in "Show HN: Neosync – Open-Source Data Anonymization for Postgres and MySQL"

edrenova — Wed, 22 May 2024 19:36:19 +0000

we're actually working on this right now, can see the PR here -> https://github.com/nucleuscloud/neosync/pull/1832/files

it's a combination of creating a random number of records for foreign keys, e.g. for 1 customer, create between 2 and 5 transactions. Working on giving you control over that, and handling referential integrity with table constraints (foreign keys, unique constraints, etc.)

ML based approaches typically are not very good at this and struggle with handling things like referential integrity. So a more "procedural" or imperative way is slightly better. The ideal is a combination of both.
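A procedural version of the "1 customer, between 2 and 5 transactions" idea might look like the sketch below. It is not the code from the linked PR; the `generate_transactions` helper, the column names and the amount range are hypothetical.

```python
import random


def generate_transactions(customer_ids, min_per=2, max_per=5, seed=None):
    """Create a random number of child rows per customer, each with a valid FK."""
    rng = random.Random(seed)
    transactions = []
    next_id = 1
    for customer_id in customer_ids:
        for _ in range(rng.randint(min_per, max_per)):
            transactions.append(
                {
                    "id": next_id,  # unique primary key
                    "customer_id": customer_id,  # always an existing parent
                    "amount_cents": rng.randint(100, 50_000),
                }
            )
            next_id += 1
    return transactions


customers = [101, 102, 103]
rows = generate_transactions(customers, seed=7)
print(len(rows), "transactions for", len(customers), "customers")
```

Since every child row is built from an existing parent ID, the foreign key cannot dangle. That guarantee is what a purely model-driven generator tends to struggle with.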

New comment by edrenova in "Show HN: Neosync – Open-Source Data Anonymization for Postgres and MySQL"

edrenova — Wed, 22 May 2024 19:32:33 +0000

Thanks for the comment and hear you on the anonymization. What we see is that customers will go through and categorize what is PII and what is not and anonymize as needed. If not, they'll back fill with synthetic data. You can change the gender from male to something else, same with the city, etc.

It's really down to the use-case. If you're strictly doing development, then you'll probably want to use more synthetic data than anonymization. If you care about preserving the statistical characteristics of the data then you can use ML models like CTGAN to create net new data.

Definitely a balance between when do you anonymize vs. when do you create synthetic data.

Show HN: Neosync – Open-Source Data Anonymization for Postgres and MySQL

edrenova — Wed, 22 May 2024 17:53:45 +0000

Hey HN, we're Evis and Nick and we're excited to be launching Neosync (https://www.github.com/nucleuscloud/neosync). Neosync is an open source platform that helps developers anonymize production data, generate synthetic data and sync it across their environments for better testing, debugging and developer experience.

Most developers and teams have some version of a database seed script that creates some mock data for their local and stage databases. The problem is that production data is messy and it’s very difficult to replicate that with mock data. This causes two big problems for developers.

The first problem is that features seem to work locally/stage but have bugs and edge cases in production because the seed data you used to develop against was not representative of production data.

The second problem we saw was that debugging production errors would take a long time and would often resurface. When we see a bug in production, the first thing we want to do is reproduce it locally, but if we can’t reproduce the state of the data locally, then we’re kind of flying blind.

Working directly with production data would solve both of these problems but most teams can’t because of: (1) privacy/security issues and (2) scale. So we set out to solve these two problems with Neosync.

We solve the privacy and security problem using anonymization and synthetic data. We have 40+ pre-built transformers (or you can write your own in code) that can anonymize PII or sensitive data so that it’s safe to use locally. Additionally, you can generate synthetic data from scratch that fits your existing schema across your database.

The second problem is scale. Some production databases are too big to fit locally or just have more data than you need. Also, in some cases, you may want to debug a certain customer’s data and you only want their data. We solve this with subsetting. You can pass in a SQL query to filter your table(s) and Neosync will handle all of the heavy lifting including referential integrity.
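To show what subsetting with referential integrity means in practice, here is a small sketch using an in-memory SQLite database: a filter is applied to the parent table, and only the child rows that reference the surviving parents are kept. The schema, the `build_source` and `subset` helpers and the filter string are made up for the example and say nothing about how Neosync does this internally.

```python
import sqlite3


def build_source():
    """Create a tiny hypothetical source database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
        CREATE TABLE orders (
            id INTEGER PRIMARY KEY,
            customer_id INTEGER REFERENCES customers(id),
            total INTEGER
        );
        INSERT INTO customers VALUES (1, 'eu'), (2, 'us'), (3, 'eu');
        INSERT INTO orders VALUES (10, 1, 500), (11, 2, 900), (12, 3, 100), (13, 2, 50);
        """
    )
    return conn


def subset(conn, customer_filter):
    """Keep customers matching the filter plus only the orders that reference them."""
    # customer_filter is a WHERE-clause fragment from a trusted operator; it is
    # interpolated directly, so untrusted input must never be passed here.
    customers = conn.execute(
        f"SELECT id, region FROM customers WHERE {customer_filter} ORDER BY id"
    ).fetchall()
    orders = conn.execute(
        "SELECT id, customer_id, total FROM orders "
        f"WHERE customer_id IN (SELECT id FROM customers WHERE {customer_filter}) "
        "ORDER BY id"
    ).fetchall()
    return customers, orders


kept_customers, kept_orders = subset(build_source(), "region = 'eu'")
print(kept_customers, kept_orders)
```

With deeper schemas the same idea has to be applied transitively, following each foreign key from the filtered table outward.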

At its core, Neosync does three things: (1) It streams data from a source to one or multiple destination databases. We never store your sensitive data. (2) While that data is being streamed, we transform it. You define which schemas and tables you want to sync and at the column level, select a transformer that defines how you want to anonymize the data or generate synthetic data. (3) We subset your data based on your filters.
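The three steps above (stream, transform per column, subset) can be sketched as a generator pipeline. This is a simplified illustration with hypothetical names (`transform_stream`, `mask_name`, `mask_email_local_part`), not Neosync's API; rows pass through one at a time and nothing is accumulated in between.

```python
from typing import Callable, Dict, Iterable, Iterator, Optional

Row = Dict[str, object]
Transformer = Callable[[object], object]


def identity(value: object) -> object:
    return value


def mask_name(value: object) -> object:
    return "REDACTED"


def mask_email_local_part(value: object) -> object:
    """Replace everything before the @ while keeping the domain."""
    _, _, domain = str(value).partition("@")
    return f"user@{domain}"


def transform_stream(
    rows: Iterable[Row],
    column_transformers: Dict[str, Transformer],
    keep: Optional[Callable[[Row], bool]] = None,
) -> Iterator[Row]:
    """Yield transformed rows one at a time, skipping rows the filter rejects."""
    for row in rows:
        if keep is not None and not keep(row):
            continue  # subsetting step
        # Columns without a configured transformer pass through unchanged.
        yield {
            column: column_transformers.get(column, identity)(value)
            for column, value in row.items()
        }


source = [
    {"id": 1, "name": "Ada Lovelace", "email": "ada@corp.example", "customer_id": 7},
    {"id": 2, "name": "Alan Turing", "email": "alan@corp.example", "customer_id": 9},
]
result = list(
    transform_stream(
        iter(source),
        {"name": mask_name, "email": mask_email_local_part},
        keep=lambda row: row["customer_id"] == 7,
    )
)
print(result)
```

In a real pipeline the source iterator would be a database cursor and each yielded row would be written to the destination rather than collected into a list.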

We also ship with APIs, a Terraform provider, a CLI and Github action that you can use to hydrate a CI database.

Neosync is an open source project written in Go and Typescript and can be run on Docker Compose, Bare Metal, or Kubernetes via Helm. You can also use our managed platform, which you can deploy in your VPC, or our hosted platform, which has a generous free tier - https://neosync.dev

Here's a brief loom demo: https://www.loom.com/share/ac21378d01cd4d848cf723e4960e8338?...

We'd love any feedback you have!

Comments URL: https://news.ycombinator.com/item?id=40443927

Points: 246

# Comments: 44