Hacker News: aaronsteers

New comment by aaronsteers in "Show HN: Airbyte Agents – context for agents across multiple data sources"

aaronsteers — Wed, 06 May 2026 03:04:32 +0000

Great question, @Tsarp - Skills and tools work great together. What we've found is that agents generally need both to achieve great results. We're actually not trying to replace skills, but to give them new superpowers.

Are there any examples you've run into where skills were missing tools (or data) that they needed for a specific task?

New comment by aaronsteers in "Show HN: Airbyte Agents – context for agents across multiple data sources"

aaronsteers — Wed, 06 May 2026 01:05:05 +0000

The new Airbyte Agents offering brings a ton of new capabilities actually.

1. Programmatic interfaces: A new REST API, SDK, and MCP Server.

2. New action verbs: Not just replication anymore. We have get/set/list/update/upload, and more!

3. New credentials passthrough: For all of the above, you OAuth to Airbyte and we OAuth on your behalf to the systems your agent needs. No need to give your agent dozens of different secrets in order to access those systems.

4. Context Store: Like your agents' own data warehouse, but completely automatic and hands-free. For those use cases that just aren't possible when calling the REST API directly.
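To illustrate the credentials passthrough idea in point 3, here is a minimal sketch of the general pattern. It is not Airbyte's implementation or API: `CredentialBroker`, its methods, and the system names are all hypothetical, and the outbound call is simulated.

```python
import secrets


class CredentialBroker:
    """Hypothetical gateway that holds per-system secrets so the agent never sees them."""

    def __init__(self):
        self._system_secrets = {}  # system name -> secret held by the gateway
        self._agent_tokens = {}    # agent token -> set of systems it may reach

    def register_system(self, system, secret):
        self._system_secrets[system] = secret

    def issue_agent_token(self, allowed_systems):
        # The agent receives one opaque token instead of many secrets.
        token = secrets.token_hex(16)
        self._agent_tokens[token] = set(allowed_systems)
        return token

    def call(self, agent_token, system, action):
        allowed = self._agent_tokens.get(agent_token)
        if allowed is None or system not in allowed:
            raise PermissionError(f"token not authorized for {system}")
        secret = self._system_secrets[system]
        # A real gateway would attach `secret` to an outbound request here.
        return {"system": system, "action": action, "authenticated": bool(secret)}


broker = CredentialBroker()
broker.register_system("crm", "crm-secret")
broker.register_system("billing", "billing-secret")
agent_token = broker.issue_agent_token(["crm"])
result = broker.call(agent_token, "crm", "list")
```

The agent only ever handles `agent_token`; a call to a system it was not granted (here, `"billing"`) raises `PermissionError`.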

Again - thanks for your comment and sorry for the long-winded response. More info here: https://docs.airbyte.com/ai-agents/

New comment by aaronsteers in "Show HN: Airbyte Agents – context for agents across multiple data sources"

aaronsteers — Wed, 06 May 2026 00:57:42 +0000

Hi, @jessewmc. Thanks for your reply. Regarding your points:

> If I'm reading correctly, the indexing (Context Store) is neutral/unopinionated? How does it select fields for indexing?

While we haven't yet published details on the backend implementation, I can say that our implementation performs very well without needing to prioritize specific fields for indexing. We aim for large text fields to perform decently and retrieval based on small/compressible fields like ints to be fast. (More to come on this in the coming months.)

> Have you done any testing on guided indexing, or metadata layers on top of the data?

We've been testing with different data scales and shapes. Nothing detailed to share yet, but performance has (so far) never itself become the bottleneck in our agent testing. (The LLM thinking itself is often the bottleneck.)

> My experience so far on similar work is that getting data in front of an agent isn't enough context to get useful/reliable answers enough of the time.

Airbyte has rich metadata on our upstream connectors' data models, which I think helps us a lot to deliver helpful context to the agent. Another option, when optimizing for specific use cases, is to build your own agent tools on top of our Agent SDK. This allows you to make the calls organic and build the tools in a way that makes natural sense to the agent, regardless of source shape or which system(s) that data is coming from.

> This does look like a good foundation for that kind of tooling though!

We agree! Thanks again for sharing your thoughts here.

New comment by aaronsteers in "Show HN: Airbyte Agents – context for agents across multiple data sources"

aaronsteers — Wed, 06 May 2026 00:46:37 +0000

Hey, swyx! Great seeing you here.

> airbyte agents could serve as a form of MCP gateway

Exactly! And a single set of tools for agents to access both realtime (direct reads/writes) and cached (Context Store) data, hopefully bringing the best access path for each different use case.

> would love a "data engineering for ai engineers" type braindump ... at AIE

Great idea - we have a booth at AIE, and we'll submit there for a talk. Mario will reach out to you about this. :)

New comment by aaronsteers in "Show HN: Airbyte Agents – context for agents across multiple data sources"

aaronsteers — Wed, 06 May 2026 00:41:19 +0000

Glad to hear this resonates with you also. We're aiming to give agents more control over their context, and easier access paths regardless of the source system.

New comment by aaronsteers in "Show HN: Airbyte Agents – context for agents across multiple data sources"

aaronsteers — Wed, 06 May 2026 00:38:25 +0000

Working with APIs is often frustrating, and the worst ones are terribly inefficient. Our Agent SDK and Agent Context Store insulate you and your agent from this headache, allowing you to query those synced datasets directly.

The feedback about wanting to download a parquet file is super interesting...

New comment by aaronsteers in "Show HN: Airbyte Agents – context for agents across multiple data sources"

aaronsteers — Wed, 06 May 2026 00:34:59 +0000

Thanks! Really appreciate the kind words. Looking forward to seeing what our amazing community builds with these new tools.

New comment by aaronsteers in "Show HN: Airbyte Agents – context for agents across multiple data sources"

aaronsteers — Wed, 06 May 2026 00:33:29 +0000

That's great to hear - great minds think alike!

> give the agent access to the DB

This is where Airbyte really can shine, I think, and the total can be more than the sum of its parts. Because Airbyte excels at data replication already, we can populate your Agent Context Store without users or agents ever needing to think about the words "ELT" or "ETL".

We're listening carefully to feedback so we hope you will give it a try and let us know how it goes! Thanks!

New comment by aaronsteers in "Show HN: Airbyte Agents – context for agents across multiple data sources"

aaronsteers — Wed, 06 May 2026 00:27:14 +0000

Hello, Jared! Small world! Yes, we did deprecate our old PbA (Powered by Airbyte) offering, but in many ways our new Agents and Embedded offering is a more robust and agent-friendly successor to that older offering.

I am happy to hear you are still getting value out of PyAirbyte! If you do try out Airbyte Agents, please let us know how it goes! We are always listening to feedback and would love to hear from you as you explore the new tools and capabilities.

New comment by aaronsteers in "Show HN: Airbyte Agents – context for agents across multiple data sources"

aaronsteers — Tue, 05 May 2026 15:30:04 +0000

AJ here, from Airbyte.

Yes, we've definitely found that some API data models are easier for models to navigate than others.

The largest sources of agent inefficiency we've identified so far are:

1. Many APIs lack robust-enough search, forcing agents to page through hundreds or thousands of paginated responses until they find the record they are looking for (our Context Store addresses this).

2. Many APIs have HUGE response sets. Our MCP helps handle this by letting the agent decide exactly which fields are returned.

3. With our SDK, you can literally build your own MCP on top of any source we support (50+ right now, and growing). This is super powerful, and allows you to build more ergonomic MCP servers and tools - even if the source data models themselves are not intuitive or easy for the LLM to leverage directly.
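Points 1 and 2 can be sketched with a toy example. This is only an illustration under stated assumptions: the in-memory `RECORDS` list, `fetch_page`, and `find_in_cache` are hypothetical stand-ins, not the Airbyte SDK, MCP, or Context Store.

```python
from typing import Optional

# Hypothetical "API": 1,000 records served 50 per page, with no search endpoint.
RECORDS = [
    {"id": i, "email": f"user{i}@example.com", "name": f"User {i}", "notes": "x" * 200}
    for i in range(1000)
]
PAGE_SIZE = 50


def fetch_page(page: int) -> list:
    """Stand-in for one paginated API call."""
    start = page * PAGE_SIZE
    return RECORDS[start:start + PAGE_SIZE]


def find_by_paging(email: str):
    """Scan page by page; return (record or None, number of calls made)."""
    calls = 0
    page = 0
    while True:
        rows = fetch_page(page)
        calls += 1
        if not rows:
            return None, calls
        for row in rows:
            if row["email"] == email:
                return row, calls
        page += 1


# A synced copy keyed by email, standing in for a cached, searchable store.
INDEX = {row["email"]: row for row in RECORDS}


def find_in_cache(email: str, fields: list) -> Optional[dict]:
    """Single lookup that returns only the fields the agent asked for."""
    row = INDEX.get(email)
    if row is None:
        return None
    return {name: row[name] for name in fields}


record, calls = find_by_paging("user975@example.com")
slim = find_in_cache("user975@example.com", ["id", "name"])
```

Here the paged search needs 20 calls and returns the full record (including the large `notes` field), while the cached lookup is a single step that returns just `id` and `name`.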

Combining all three of these together, we see the vast majority of challenges can be addressed via a strong system prompt for guidance. Fine-tuning could get you further, but you'd still want your fine-tuned model to build on this same foundation, since the efficiencies will transfer across use cases and models.

@ecares - Does this answer your question? What do you think?

New comment by aaronsteers in "Airbyte 1.0 – Marketplace, AI Assist, Gen AI Support and Enterprise GA"

aaronsteers — Tue, 24 Sep 2024 17:14:05 +0000

I was very pleased to demo PyAirbyte and my AI Chatbot self-contained in a Jupyter Notebook, along with our new support for the PGVector destination.

My colab notebook is here if you want to kick the tires: https://colab.research.google.com/github/airbytehq/quickstar...

Let me know if you have any questions about our AI connectors or PyAirbyte!

New comment by aaronsteers in "ELTP: Extending ELT for Modern AI and Analytics"

aaronsteers — Tue, 07 Nov 2023 20:42:46 +0000

Thanks for this feedback! I do agree there are some similarities, which I called out as common benefits of using "EL pairs" on both sides of the process.

Here are my thoughts though on the importance of the distinction...

The first place you land the data is almost always a place you control - either a data warehouse or a data lake that you have tuned for fast and flexible data processing. The second (publish) process pushes to a location you most likely can't control, and which is not prepared to receive raw/unshaped data.

This is important because the business logic in our transformations will almost always evolve over time. Running the transforms between EL and P (the second "EL") gives us reproducibility and efficiency to innovate, using the location where we have the best performance profile for running those transforms.
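As a rough sketch of that flow (not code from the article): an in-memory SQLite database stands in for the warehouse you control, and the table names, sample rows, and `min_total` threshold are all made up for illustration.

```python
import sqlite3

# Hypothetical source rows; in practice these come from an extract step.
SOURCE_ROWS = [
    ("a@example.com", 120.0),
    ("b@example.com", 35.5),
    ("a@example.com", 80.0),
]

warehouse = sqlite3.connect(":memory:")  # stand-in for a warehouse you control


def extract_load(conn):
    """EL: land the raw, unshaped rows once."""
    conn.execute("CREATE TABLE raw_orders (email TEXT, amount REAL)")
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?)", SOURCE_ROWS)


def transform(conn, min_total):
    """T: business logic lives here and can change without re-extracting."""
    conn.execute("DROP TABLE IF EXISTS top_customers")
    conn.execute("CREATE TABLE top_customers (email TEXT, total REAL)")
    conn.execute(
        "INSERT INTO top_customers "
        "SELECT email, SUM(amount) FROM raw_orders "
        "GROUP BY email HAVING SUM(amount) >= ?",
        (min_total,),
    )


def publish(conn):
    """P: push shaped rows downstream (here, just return them as a list)."""
    return conn.execute(
        "SELECT email, total FROM top_customers ORDER BY email"
    ).fetchall()


extract_load(warehouse)
transform(warehouse, min_total=100.0)
first_publish = publish(warehouse)

transform(warehouse, min_total=30.0)  # logic evolved; the landed raw data is reused
second_publish = publish(warehouse)
```

The second `transform` call changes the business rule and republishes without touching the source again, which is the reproducibility benefit described above.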

What do you think?

ELTP: Extending ELT for Modern AI and Analytics

aaronsteers — Tue, 07 Nov 2023 15:54:55 +0000

Article URL: https://airbyte.com/blog/eltp-extending-elt-for-modern-ai-and-analytics

Comments URL: https://news.ycombinator.com/item?id=38178297

Points: 74

# Comments: 15

New comment by aaronsteers in "Falcon LLM – A 40B Model"

aaronsteers — Wed, 21 Jun 2023 19:20:05 +0000

Although not evil, adult content should be opt-in, and should be able to be opted-out at a platform level... hence, the need for censored models. Imagine a restaurant booking AI app, built on GPT, that accidentally doubled as a bomb-making tutor or an adult content generator. It's a lawsuit waiting to happen, if nothing else, and it's worth making these use cases harder (if not impossible) to implement in mainstream, commercially available products. Note that for many of these products, the user's age and consent for adult material have not already been established.

So far, the open source ecosystem seems to be doing a good job of providing both censored and uncensored LLMs - and it seems there are valid use cases for both.

Think of this as similar to Falcon LLM being launched in both 40B and smaller 7B variants - the LLM often will need to match the use case, and the 7B model is a good example of making the model smaller (and worse) on purpose in order to reach certain trade-offs.