Hacker News: awendland

Your Immune System Is Not a Muscle

awendland — Tue, 27 Aug 2024 16:06:57 +0000

Article URL: https://rachel.fast.ai/posts/2024-08-13-crowds-vs-friends/

Comments URL: https://news.ycombinator.com/item?id=41369061

Points: 287

# Comments: 197

New comment by awendland in "Launch HN: Metriport (YC S22) – Open-source API for healthcare data exchange"

awendland — Fri, 24 May 2024 02:47:53 +0000

What percentage, or how many millions, of patients are accessible on the network today?

New comment by awendland in "Show HN: Hacker Search – A semantic search engine for Hacker News"

awendland — Thu, 02 May 2024 18:36:59 +0000

Following @isoprohplex, I'll be the fourth comment to say I also built a variant of this: https://hnss.alexwendland.com/

I built mine on top of an RSS feed I generate from Hacker News, which filters out any posts linking to the top 1 million domains [1] and creates a readable version of the content. I use it to surface articles on smaller blogs and personal websites; it's become my main content source. It's generated via GitHub Actions every 4 hours and stored in a detached branch on GitHub (~2 GB of data from the past 4 years). Here's an example for posts with >= 10 upvotes [2].
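As a rough illustration of that filtering step (not the actual code behind the feed), a minimal Python sketch might look like the following. The `is_small_site` helper, the sample posts, and the choice to also match parent domains are all assumptions made for the example; in practice `top_domains` would be loaded from the Majestic Million list [1].

```python
from __future__ import annotations

from urllib.parse import urlparse


def registered_host(url: str) -> str:
    """Return the lowercased hostname without a leading 'www.' (empty if none)."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def is_small_site(url: str, top_domains: set[str]) -> bool:
    """True when neither the host nor any parent domain appears in top_domains."""
    host = registered_host(url)
    if not host:
        return False
    parts = host.split(".")
    # For "blog.example.com" check "blog.example.com" and "example.com",
    # stopping before the bare TLD. Matching parent domains is an assumption.
    candidates = {".".join(parts[i:]) for i in range(max(len(parts) - 1, 1))}
    return candidates.isdisjoint(top_domains)


# Hypothetical sample data; the real list has about a million entries.
top_domains = {"github.com", "medium.com"}
posts = [
    {"title": "A", "url": "https://www.github.com/foo"},
    {"title": "B", "url": "https://tinyblog.example/post"},
    {"title": "C", "url": "https://news.medium.com/x"},
]
small_site_posts = [p for p in posts if is_small_site(p["url"], top_domains)]
```

Keeping the domain list in a `set` makes each lookup constant time, which matters more for the list size than for the handful of posts per run.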

It only took several hours to build the semantic search on top. And that included time for me to try out and learn several different vector DBs, embedding models, data pipelines, and UI frameworks! The current state of AI tooling is wonderfully simple.

In the end I landed on (selected in haste optimizing for developer ergonomics, so only a partial endorsement):

  - BAAI/bge-small-en as an embedding model
  - Python with
    - HuggingFaceBgeEmbeddings from langchain_community for creating embeddings
    - SentenceSplitter from llama_index for chunking documents
    - ChromaDB as a vector DB + chroma-ops to prune the DB
    - sqlite3 for metadata
    - FastAPI, Pydantic, Jinja2, Tailwind for API and server-rendered webpages
  - jsdom and mozilla-readability for article extraction

I generated the index locally on my M2 Mac, which ripped through the ~70k articles and generated all the embeddings in ~12 hours.

I run the search site with Podman on a VM from Hetzner—along with other projects—for ~$8 / month. All requests are handled on CPU w/o calls to external AI providers. Query times are <200 ms, which includes embedding generation → vector DB lookup → metadata retrieval → page rendering. The server source code is here [3].
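To make the query path concrete (embedding generation, then vector lookup, then metadata retrieval), here is a self-contained Python sketch using only the standard library. It is not the server's code: the bag-of-words `embed` function stands in for BAAI/bge-small-en, the in-memory `index` list stands in for ChromaDB, and the article rows and example.com URLs are made up for illustration. Only the sqlite3 metadata step uses the same tool as the real stack.

```python
from __future__ import annotations

import math
import sqlite3
from collections import Counter


def embed(text: str) -> dict[str, float]:
    """Toy stand-in for an embedding model: unit-length bag-of-words vector."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {token: c / norm for token, c in counts.items()}


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity; both inputs are already unit length."""
    return sum(weight * b.get(token, 0.0) for token, weight in a.items())


# Metadata lives in sqlite3 (hypothetical rows).
articles = [
    (1, "Building a vector search engine with sqlite", "https://example.com/vector-search"),
    (2, "Growing tomatoes on a small balcony", "https://example.com/tomatoes"),
    (3, "Notes on learning the cello as an adult", "https://example.com/cello"),
]
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT, url TEXT)")
db.executemany("INSERT INTO articles VALUES (?, ?, ?)", articles)
db.commit()

# Stand-in for the vector DB: (article id, vector) pairs built ahead of time.
index = [(article_id, embed(title)) for article_id, title, _url in articles]


def search(query: str, k: int = 2) -> list[tuple[str, str]]:
    """Embed the query, rank stored vectors, then fetch metadata for the top k."""
    query_vec = embed(query)
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    results = []
    for article_id, _vec in ranked[:k]:
        row = db.execute(
            "SELECT title, url FROM articles WHERE id = ?", (article_id,)
        ).fetchone()
        results.append(row)
    return results


top_hit = search("vector search sqlite", k=1)
```

A real embedding model matches on meaning rather than shared words, and a vector DB replaces the full sort with an approximate nearest-neighbor lookup, but the three stages line up the same way.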

Nice work @jnnnthnn! What you built is fast, the rankings were solid, and the summaries are convenient.

[1] https://majestic.com/reports/majestic-million

[2] https://github.com/awendland/hacker-news-small-sites/blob/ge...

[3] https://github.com/awendland/hacker-news-small-sites-website...

New comment by awendland in "Sociosexual orientations are not reflective of life trajectories"

awendland — Mon, 11 Sep 2023 23:44:31 +0000

I was also curious. Wikipedia addresses the question:

> The theory was popular in the 1970s and 1980s, when it was used as a heuristic device, but lost importance in the early 1990s, when it was criticized by several empirical studies.[5][6] A life-history paradigm has replaced the r/K selection paradigm, but continues to incorporate its important themes as a subset of life history theory.[7] Some scientists now prefer to use the terms fast versus slow life history as a replacement for, respectively, r versus K reproductive strategy.[8]

New comment by awendland in "Show HN: Open-Source Infrastructure for Vector Data Streams"

awendland — Wed, 19 Jul 2023 17:04:26 +0000

I’ve been looking for something like this: eventually consistent syncing of DB content -> embeddings in a vector DB.

So far, I’ve been dealing with a tradeoff between latency and error handling in my API endpoints. I’ll either (1) embed content and upsert it into the vector DB inside a transaction block for my main DB in the handler, which kills latency, or (2) kick off the embedding work separately from the main handler work, which risks the data desynchronizing.
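For illustration, one common middle ground between those two options is a queue table written in the same transaction as the row, with a separate worker draining it later. The Python sketch below shows that pattern with sqlite3; it is not a description of how Retake works, and `fake_embed`, `vector_store`, and the table names are hypothetical stand-ins for a real embedding call, a real vector DB, and a real schema.

```python
from __future__ import annotations

import sqlite3

db = sqlite3.connect(":memory:")
db.executescript(
    """
    CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT NOT NULL);
    CREATE TABLE embed_queue (doc_id INTEGER PRIMARY KEY);  -- docs awaiting embedding
    """
)

vector_store: dict[int, list[float]] = {}  # stand-in for a vector DB


def fake_embed(text: str) -> list[float]:
    """Placeholder for a slow embedding call: [character count, word count]."""
    return [float(len(text)), float(text.count(" ") + 1)]


def handle_write(doc_id: int, body: str) -> None:
    """Request handler: no embedding call, so latency stays low."""
    with db:  # one transaction: the row and its queue entry commit together
        db.execute("INSERT OR REPLACE INTO docs (id, body) VALUES (?, ?)", (doc_id, body))
        db.execute("INSERT OR IGNORE INTO embed_queue (doc_id) VALUES (?)", (doc_id,))


def drain_queue() -> int:
    """Background worker: embed everything still queued, then clear each entry."""
    rows = db.execute(
        "SELECT q.doc_id, d.body FROM embed_queue q JOIN docs d ON d.id = q.doc_id"
    ).fetchall()
    for doc_id, body in rows:
        vector_store[doc_id] = fake_embed(body)  # upsert, so a retry is harmless
        with db:
            db.execute("DELETE FROM embed_queue WHERE doc_id = ?", (doc_id,))
    return len(rows)


# Writes land while the worker is "offline"; nothing is embedded yet.
handle_write(1, "hello world")
handle_write(2, "eventually consistent sync")
pending_before = db.execute("SELECT COUNT(*) FROM embed_queue").fetchone()[0]

# When the worker comes back, it catches up on everything it missed.
processed = drain_queue()
```

This single-threaded sketch ignores concurrency: a real worker would need to guard against a write landing between its read and its delete, for example by versioning queue entries.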

I’d much prefer a set-it-and-forget-it approach like Retake.

A few questions:

* If the “real-time server” goes offline temporarily, will it catch up on any newly added rows in the interim?

* Do you intend to emit any OpenTelemetry metrics? I’d like to monitor lag in production.

* Will I be able to deploy this as a single container on ECS/Kubernetes?

New comment by awendland in "Node-Red 3.0 Released"

awendland — Thu, 14 Jul 2022 12:16:12 +0000

This is a little lengthy, but I wanted to share the tactical details of my use case to give you a full picture:

I use Node-Red for a few scheduled activities: archiving Reddit posts or tweets I upvote and pulling information from real estate websites that match criteria I’m interested in.

I like Node-Red vs. cron-managed shell/Python scripts for several reasons:

  - the admin/editor UI is accessible on any device with a web browser (no git, ssh, etc. tooling required)
  - the node-based visual flow is easy to reason about and debug (so even after years of ignoring my scripts I can quickly come back to them and grok what’s going on)
  - the barrier to entry continues to be low (I can pop in and create a new flow in <1 hr)

I prefer it over Zapier or IFTTT since it’s more flexible. I’ve authored arbitrary JavaScript and request logic to retrieve and filter data in ways these pre-packaged tools can’t.

I run it on an AWS Lightsail server for ~$4 per month. I use Ansible to manage Ubuntu, with podman + systemd running the Node-Red Docker image and TLS provided by Caddy. It took roughly 4 hours to set up from scratch, and I return to it once every ~18 months to update or tweak it with minimal issues.

To sum it up, I appreciate the grok-ability + flexibility + accessibility. It just works and it scales in complexity as I need it to!

New comment by awendland in "Rdrview – Firefox Reader View as a Linux command line tool"

awendland — Mon, 19 Oct 2020 15:25:03 +0000

I needed a reader view library for a side project and decided to compare the most popular options (repo at https://github.com/awendland/readable-web-extractor-comparis...). Among cleanview, metascraper, @postlight/mercury-parser, and mozilla/readability I thought that mozilla/readability performed the best because of its consistent extraction of the primary content and minimal mangling of the semantic structure.

For a quick preview of each library on a random sample of 16 articles posted to HN, see https://github.com/awendland/readable-web-extractor-comparis... (you’ll need to expand a row to see its results).