<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: brady_bastian</title><link>https://news.ycombinator.com/user?id=brady_bastian</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Wed, 15 Apr 2026 11:30:52 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=brady_bastian" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[Show HN: Avalon - Synthetic FHIR R4 patient data as OMOP CDM 5.4 views]]></title><description><![CDATA[
<p>Avalon
Synthetic clinical data pipeline — generate realistic FHIR R4 patient data, normalize it through Forge, and query it as OMOP CDM 5.4 views.<p>What is Avalon?
Avalon is an end-to-end pipeline that turns Synthea-generated FHIR bundles into clean, documented, queryable tables in BigQuery — then layers OMOP CDM 5.4 views on top.<p>The output is published free on BigQuery Analytics Hub, where anyone with a GCP account can subscribe and query it directly.<p>Why?
- Researchers need realistic clinical data in standard formats (OMOP) without PHI concerns
- Engineers need FHIR data to build and test healthcare integrations
- Analysts need pre-normalized tables with documented schemas, not raw nested JSON</p>
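<p>For a sense of what those views provide, here is a minimal Python sketch of mapping a FHIR R4 Condition resource to a flat OMOP-style condition_occurrence row. The field names and helper are illustrative assumptions, not Avalon's actual code:</p>

```python
# Illustrative sketch (not Avalon's actual code): the kind of mapping an
# OMOP CDM 5.4 view performs over a FHIR R4 Condition resource.
def condition_to_omop(condition: dict) -> dict:
    """Map a FHIR R4 Condition to a flat OMOP-style condition_occurrence row."""
    coding = condition["code"]["coding"][0]
    # FHIR references look like "urn:uuid:<id>"; OMOP keys are plain IDs.
    person_id = condition["subject"]["reference"].removeprefix("urn:uuid:")
    return {
        "person_id": person_id,
        "condition_source_value": coding["code"],
        "condition_source_name": coding.get("display"),
        "condition_start_date": condition.get("onsetDateTime", "")[:10],
    }

row = condition_to_omop({
    "resourceType": "Condition",
    "subject": {"reference": "urn:uuid:abc-123"},
    "code": {"coding": [{"system": "http://snomed.info/sct",
                         "code": "44054006",
                         "display": "Diabetes mellitus type 2"}]},
    "onsetDateTime": "2020-05-01T09:00:00Z",
})
```

<p>The same pattern (pre-extract the urn:uuid reference, flatten the first coding) applies to the other resource types.</p>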
<hr>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=47407068">https://news.ycombinator.com/item?id=47407068</a></p>
<p>Points: 1</p>
<p># Comments: 0</p>
]]></description><pubDate>Tue, 17 Mar 2026 00:36:33 +0000</pubDate><link>https://github.com/foxtrotcommunications/foxtrotcommunications-avalon</link><dc:creator>brady_bastian</dc:creator><comments>https://news.ycombinator.com/item?id=47407068</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47407068</guid></item><item><title><![CDATA[Show HN: Synthea FHIR Data in BigQuery]]></title><description><![CDATA[
<p>We generated ~1,100 synthetic patients with Synthea, processed the FHIR R4 output through our normalization engine (Forge), and published it as a free public dataset on BigQuery Analytics Hub.<p>8 resource types: Patient, Encounter, Observation, Condition, Procedure, Immunization, MedicationRequest, DiagnosticReport.<p>The raw Synthea output has 459 nested fields per resource, urn:uuid: references, and no column descriptions. We flatten it to clean views with ~15 columns each, pre-extracted IDs, and descriptions sourced from the FHIR R4 OpenAPI spec. Example:<p>-- Raw FHIR:
SELECT id, code.text FROM diagnostic_report
WHERE subject.reference = CONCAT("urn:uuid:", patient_id)
-- Forge view:
SELECT report_name, patient_id FROM v_diagnostic_report
Data scanned per query drops ~90x (450 MB → 5 MB).<p>Free to subscribe: <a href="https://console.cloud.google.com/bigquery/analytics-hub/exchanges/projects/foxtrot-communications-public/locations/us/dataExchanges/forge_synthetic_fhir" rel="nofollow">https://console.cloud.google.com/bigquery/analytics-hub/exch...</a><p>Updated weekly. Useful if you're building anything against FHIR data and want a realistic test dataset without standing up your own Synthea pipeline.<p>Happy to answer questions about the normalization approach or FHIR data modeling tradeoffs.</p>
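<p>In Python terms, the view logic in the SQL above looks roughly like this (a hypothetical helper, not the actual Forge code):</p>

```python
# Sketch of what v_diagnostic_report pre-extracts from a raw FHIR
# DiagnosticReport resource (illustrative only).
def to_view_row(report: dict) -> dict:
    return {
        "report_id": report["id"],
        "report_name": report["code"]["text"],
        # Strip the "urn:uuid:" prefix so joins use plain patient IDs.
        "patient_id": report["subject"]["reference"].split("urn:uuid:")[-1],
    }

row = to_view_row({
    "resourceType": "DiagnosticReport",
    "id": "r-1",
    "code": {"text": "CBC panel"},
    "subject": {"reference": "urn:uuid:p-42"},
})
```

<p>Because the IDs are pre-extracted, joins become simple equality on a plain column instead of a computed CONCAT expression.</p>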
<hr>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=47394543">https://news.ycombinator.com/item?id=47394543</a></p>
<p>Points: 2</p>
<p># Comments: 0</p>
]]></description><pubDate>Mon, 16 Mar 2026 02:35:57 +0000</pubDate><link>https://news.ycombinator.com/item?id=47394543</link><dc:creator>brady_bastian</dc:creator><comments>https://news.ycombinator.com/item?id=47394543</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47394543</guid></item><item><title><![CDATA[Show HN: Forge, the NoSQL to SQL Compiler]]></title><description><![CDATA[
<p><a href="https://forge.foxtrotcommunications.net/" rel="nofollow">https://forge.foxtrotcommunications.net/</a><p>I've been a data engineer for years and one thing drove me crazy: every
time we integrated a new API, someone had to manually write SQL to flatten
the JSON into tables. LATERAL FLATTEN for Snowflake, UNNEST for BigQuery,
EXPLODE for Databricks — same logic, different syntax, written from
scratch every time.<p>Forge takes an OpenAPI spec (or any JSON schema) and automatically:<p>1. Discovers all fields across all nesting levels
2. Generates dbt models that flatten nested JSON into a star schema
3. Compiles for BigQuery, Snowflake, Databricks, AND Redshift from the 
   same metadata
4. Runs incrementally — new fields get added via schema evolution, 
   no rebuilds<p>The key insight is that JSON-to-table is a compilation problem, not a 
query problem. If you know the schema, you can generate all the SQL 
mechanically. Forge is essentially a compiler: schema in, warehouse-
specific SQL out.<p>How it works under the hood:<p>- An introspection phase scans actual data rows and collects the union
  of ALL keys (not just one sample record), so sparse/optional fields 
  are always discovered
- Each array-of-objects becomes its own child table with a hierarchical 
  index (idx) linking back to the parent — no manual join keys needed
- Warehouse adapters translate universal metadata into dialect-specific 
  SQL:
    BigQuery:    UNNEST(JSON_EXTRACT_ARRAY(...))
    Snowflake:   LATERAL FLATTEN(input => PARSE_JSON(...))
    Databricks:  LATERAL VIEW EXPLODE(from_json(...))
    Redshift:    JSON_PARSE + manual extraction
- dbt handles incremental loads with on_schema_change='append_new_columns'<p>The full pipeline: Bellows (synthetic data generation from OpenAPI specs) 
→ BigQuery staging → Forge (model generation + dbt run) → queryable 
tables + dbt docs. There's also Merlin (AI-powered field enrichment 
via Gemini) that auto-generates realistic data generators for each 
field.<p>I built this because I watched teams spend weeks writing one-off 
FLATTEN queries that broke the moment an API added a field. Every 
Snowflake blog post shows you how to parse 3 fields from a known 
schema — none of them handle schema evolution, arbitrary nesting depth, 
or cross-warehouse portability.<p>Try it: <a href="https://forge.foxtrotcommunications.net" rel="nofollow">https://forge.foxtrotcommunications.net</a><p>Happy to answer questions about the architecture, the cross-warehouse 
compilation approach, or the AI enrichment layer.</p>
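<p>The two introspection ideas above, union-of-keys discovery and child tables keyed by a hierarchical idx, can be sketched in a few lines of Python (hypothetical helpers, not Forge's real API):</p>

```python
# Sketch of the introspection phase (illustrative, not Forge's real API).
def discover_keys(rows):
    """Union of ALL keys across rows, so sparse/optional fields are not missed."""
    keys = set()
    for row in rows:
        keys |= row.keys()
    return sorted(keys)

def split_child_arrays(row, row_id, array_field):
    """Each array-of-objects becomes child rows with (parent_id, idx) links."""
    return [
        {"parent_id": row_id, "idx": i, **item}
        for i, item in enumerate(row.get(array_field, []))
    ]

# "email" only appears in the second record, but is still discovered.
keys = discover_keys([{"id": 1, "name": "a"}, {"id": 2, "email": "x@y.z"}])
children = split_child_arrays(
    {"id": 1, "items": [{"sku": "A"}, {"sku": "B"}]}, 1, "items")
```

<p>The warehouse adapters then render these discovered columns and child tables into the dialect-specific SQL listed above.</p>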
<hr>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=47308072">https://news.ycombinator.com/item?id=47308072</a></p>
<p>Points: 4</p>
<p># Comments: 1</p>
]]></description><pubDate>Mon, 09 Mar 2026 12:16:22 +0000</pubDate><link>https://news.ycombinator.com/item?id=47308072</link><dc:creator>brady_bastian</dc:creator><comments>https://news.ycombinator.com/item?id=47308072</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47308072</guid></item><item><title><![CDATA[New comment by brady_bastian in "Ask HN: What Are You Working On? (March 2026)"]]></title><description><![CDATA[
<p>Completely automating the difficult problem of JSON parsing and normalization in cloud data warehouses: <a href="https://forge.foxtrotcommunications.net/" rel="nofollow">https://forge.foxtrotcommunications.net/</a><p>Today engineers spend dozens of hours agonizing over how to unlock the analytical possibilities of JSON data in their warehouse. The internet is littered with half-solutions and broken promises. Forge automates this end to end.</p>
]]></description><pubDate>Mon, 09 Mar 2026 12:05:34 +0000</pubDate><link>https://news.ycombinator.com/item?id=47307972</link><dc:creator>brady_bastian</dc:creator><comments>https://news.ycombinator.com/item?id=47307972</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47307972</guid></item><item><title><![CDATA[Forge – Automate 3NF Schema Generation from Nested JSON in BigQuery/Snowflake]]></title><description><![CDATA[
<p>I've built a product that completely parses highly nested JSON data in cloud data warehouses. Forge works by methodically dissecting each subcollection and each field of JSON data, one by one, and creating 3NF tables for each JSON sub-object. This completely flattens JSON data of any complexity and depth and fully accounts for schema changes across the entire dataset.<p>While hand-crafted scripts work once and are fine for a quick look, a systematic deconstruction and rebuild of the entire JSON object is required to truly understand the structure. Some companies have JSON data coming from MongoDB or Firestore that has undergone hundreds or even thousands of changes, from changing data types to abstract manipulations such as turning a JSON object into an array. A simple parsing script won't cut it: you will either sacrifice some data to get something out of it, or spend weeks writing dozens of scripts and manipulations to process it correctly. Repeat this for each API and each schema your company utilizes.<p>Forge doesn't stop at unnesting. With the included AI schema classifier, Excalibur, we automatically identify which API your data is coming from, based on tens of thousands of examples. From Stripe to HubSpot to Segment, we detect it, classify it, and automatically apply field mappings. Additionally, Forge uses AI and ML techniques (Pridwen) to document your data and identify PII fields in it. No more painstaking scrubbing and parsing of your data, just quick and ready analytics.<p>How does Forge handle schema changes?
Automatic detection and adaptation. When new fields appear, Forge regenerates models while maintaining backward compatibility. Zero downtime.<p>Does my data leave my warehouse?
SaaS: Forge connects via service account to process data in-place. Only schema fingerprints (not actual data) sent for AI classification. Enterprise: Everything runs in YOUR VPC. Zero data egress.<p>What warehouses do you support?
BigQuery, Snowflake, Databricks, and Redshift. One parse generates native models for all four simultaneously.<p>How accurate is PII detection?
Pridwen uses a 3-layer hybrid system (rules + ML + crowd) with 95%+ accuracy. Context-aware and supports 20+ languages.<p>Do you replace Fivetran/Airbyte?
No, we're complementary. Use Fivetran/Airbyte to load raw JSON → Use Forge to transform it into analytics tables.<p>How much engineering time does this save?
Conservative estimate: 2-4 weeks initial build + 10 hours/month maintenance = $50,000-100,000/year for mid-size teams.</p>
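<p>The schema-change behavior described above (append new columns, never break existing ones) can be sketched as follows; this mirrors dbt's on_schema_change='append_new_columns' behavior rather than Forge's actual implementation:</p>

```python
# Hedged sketch of "append new columns, never drop or reorder old ones"
# schema evolution (illustrative, not the actual implementation).
def evolve_schema(existing: list, incoming_row: dict) -> list:
    """Append any column seen in new data; keep the old columns intact."""
    new_cols = [k for k in incoming_row if k not in existing]
    return existing + new_cols

schema = ["id", "name"]
# A new "plan" field appears upstream; old consumers keep working.
schema = evolve_schema(schema, {"id": 1, "name": "a", "plan": "pro"})
```

<p>Because columns are only appended, existing queries that select the original columns remain valid after the change.</p>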
<hr>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=46975075">https://news.ycombinator.com/item?id=46975075</a></p>
<p>Points: 1</p>
<p># Comments: 1</p>
]]></description><pubDate>Wed, 11 Feb 2026 14:05:04 +0000</pubDate><link>https://news.ycombinator.com/item?id=46975075</link><dc:creator>brady_bastian</dc:creator><comments>https://news.ycombinator.com/item?id=46975075</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46975075</guid></item><item><title><![CDATA[New comment by brady_bastian in "Forge – Transform nested JSON into governed dbt models for BQ/Snowflake"]]></title><description><![CDATA[
<p>Hi HN, I’m Brady. I’ve spent years watching data engineers burn out writing brittle parsers for nested JSON, only to have dashboards crash because an upstream API changed a field name.<p>I built Forge to solve this. It’s an autonomous data infrastructure platform that ingests raw, nested JSON and automatically generates production-ready dbt models.<p>The problem: traditional ETL tools (Fivetran/Stitch) often dump raw JSON into a VARIANT or string column, leaving you to write complex parsing logic manually. This is expensive to query and hard to govern. If the schema changes, your SQL breaks.<p>What Forge does: it parses your JSON and compiles it into optimized, native tables for BigQuery, Snowflake, Databricks, and Redshift.<p>- Deep unnesting: flattens arrays and objects 5+ levels deep into relational tables with proper keys.
- AI classification (Excalibur): a Graph Neural Network (GraphSAGE) classifies data patterns (e.g., distinguishing "customer" from "inventory" data) without the data leaving your environment.
- Auto-governance (Pridwen): detects PII and automatically applies hashing or masking policies based on the classification.
- Multi-warehouse support: one JSON source generates native SQL for all supported warehouses simultaneously.<p>How it works: under the hood, Forge generates a full dbt project. You get the exact SQL code it generates, complete with lineage and documentation. We focused heavily on transparency: no black boxes.<p>Where we're going: we are currently working on Llamrei (Q2 2026), which will handle schema evolution by automatically normalizing legacy API versions into "golden schemas" to prevent breaking changes.<p>We have a free tier (no credit card required) that lets you run full jobs to test the output.<p>I’d love to hear your feedback on the generated SQL structure and our approach to using GNNs for schema inference.</p>
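<p>As a rough illustration of the auto-governance idea, here is a rule-based sketch of PII hashing in Python. The real system layers ML and context awareness on top; the field list and hashing choice here are assumptions:</p>

```python
# Minimal rule-based PII masking sketch (illustrative only; the actual
# governance layer combines rules with ML and context).
import hashlib

PII_FIELDS = {"email", "phone", "ssn", "name"}

def mask_pii(row: dict) -> dict:
    """Hash values of known PII fields; pass everything else through."""
    return {
        k: hashlib.sha256(str(v).encode()).hexdigest() if k in PII_FIELDS else v
        for k, v in row.items()
    }

masked = mask_pii({"id": 7, "email": "a@b.com"})
```

<p>Hashing (rather than dropping) keeps the column joinable across tables while removing the raw value from analytics.</p>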
]]></description><pubDate>Wed, 28 Jan 2026 23:19:32 +0000</pubDate><link>https://news.ycombinator.com/item?id=46803159</link><dc:creator>brady_bastian</dc:creator><comments>https://news.ycombinator.com/item?id=46803159</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46803159</guid></item><item><title><![CDATA[Forge – Transform nested JSON into governed dbt models for BQ/Snowflake]]></title><description><![CDATA[
<p>Article URL: <a href="https://forge.foxtrotcommunications.net/portal">https://forge.foxtrotcommunications.net/portal</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=46803158">https://news.ycombinator.com/item?id=46803158</a></p>
<p>Points: 1</p>
<p># Comments: 1</p>
]]></description><pubDate>Wed, 28 Jan 2026 23:19:32 +0000</pubDate><link>https://forge.foxtrotcommunications.net/portal</link><dc:creator>brady_bastian</dc:creator><comments>https://news.ycombinator.com/item?id=46803158</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46803158</guid></item></channel></rss>