<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: johnwatson11218</title><link>https://news.ycombinator.com/user?id=johnwatson11218</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Sat, 04 Apr 2026 09:28:20 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=johnwatson11218" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by johnwatson11218 in "It's 2026, Just Use Postgres"]]></title><description><![CDATA[
<p>I'm using postgres as part of my current project - <a href="https://github.com/johnwatson11218/LatentTopicExplorer" rel="nofollow">https://github.com/johnwatson11218/LatentTopicExplorer</a><p>I had added spaCy to my codebase for one of its features, but found that just doing the work in the db was near-instant and my containers would not run out of RAM. I want to get back to spaCy for more involved NLP work, but right now the db "just works". I think Oracle is nicer, but postgres does what I need for a lot less money!<p><a href="https://github.com/johnwatson11218/LatentTopicExplorer/commit/a32bbbcb406f8688e0908e7acdc8b6c88a6cce25" rel="nofollow">https://github.com/johnwatson11218/LatentTopicExplorer/commi...</a></p>
]]></description><pubDate>Fri, 06 Feb 2026 13:12:11 +0000</pubDate><link>https://news.ycombinator.com/item?id=46912411</link><dc:creator>johnwatson11218</dc:creator><comments>https://news.ycombinator.com/item?id=46912411</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46912411</guid></item><item><title><![CDATA[New comment by johnwatson11218 in "Show HN: I used Claude Code to discover connections between 100 books"]]></title><description><![CDATA[
<p>I posted my code <a href="https://github.com/johnwatson11218/LatentTopicExplorer" rel="nofollow">https://github.com/johnwatson11218/LatentTopicExplorer</a></p>
]]></description><pubDate>Fri, 16 Jan 2026 13:42:48 +0000</pubDate><link>https://news.ycombinator.com/item?id=46646252</link><dc:creator>johnwatson11218</dc:creator><comments>https://news.ycombinator.com/item?id=46646252</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46646252</guid></item><item><title><![CDATA[New comment by johnwatson11218 in "Ask HN: Share your personal website"]]></title><description><![CDATA[
<p><a href="https://github.com/johnwatson11218/LatentTopicExplorer" rel="nofollow">https://github.com/johnwatson11218/LatentTopicExplorer</a><p>You have to use docker compose to get to localhost:8000. There are still bugs, but I'm working on them, and there was interest expressed in this project on Hacker News a couple of weeks back.</p>
]]></description><pubDate>Fri, 16 Jan 2026 13:38:15 +0000</pubDate><link>https://news.ycombinator.com/item?id=46646212</link><dc:creator>johnwatson11218</dc:creator><comments>https://news.ycombinator.com/item?id=46646212</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46646212</guid></item><item><title><![CDATA[New comment by johnwatson11218 in "Show HN: I used Claude Code to discover connections between 100 books"]]></title><description><![CDATA[
<p>Thanks for the supportive comments. I'm definitely thinking I should release sooner rather than later. I have been using LLMs for specific tasks, and here is a sample stored procedure I had an LLM write for me.<p><pre><code>--
-- Name: refresh_topic_tables(); Type: PROCEDURE; Schema: public; Owner: postgres
--

CREATE PROCEDURE public.refresh_topic_tables()
    LANGUAGE plpgsql
    AS $$
BEGIN
    -- Drop tables in reverse dependency order
    DROP TABLE IF EXISTS topic_top_terms;
    DROP TABLE IF EXISTS topic_term_tfidf;
    DROP TABLE IF EXISTS term_df;
    DROP TABLE IF EXISTS term_tf;
    DROP TABLE IF EXISTS topic_terms;

    -- Recreate tables in correct dependency order
    CREATE TABLE topic_terms AS
    SELECT
        dt.term_id,
        dot.topic_id,
        COUNT(DISTINCT dt.document_id) as document_count,
        SUM(dt.frequency) as total_frequency
    FROM document_terms dt
    JOIN document_topics dot ON dt.document_id = dot.document_id
    GROUP BY dt.term_id, dot.topic_id;

    CREATE TABLE term_tf AS
    SELECT
        topic_id,
        term_id,
        SUM(total_frequency) as term_frequency
    FROM topic_terms
    GROUP BY topic_id, term_id;

    CREATE TABLE term_df AS
    SELECT
        term_id,
        COUNT(DISTINCT topic_id) as document_frequency
    FROM topic_terms
    GROUP BY term_id;

    CREATE TABLE topic_term_tfidf AS
    SELECT
        tt.topic_id,
        tt.term_id,
        tt.term_frequency as tf,
        tdf.document_frequency as df,
        -- cast to numeric so the idf ratio is not truncated by integer division
        tt.term_frequency * LN( (SELECT COUNT(id)::numeric FROM topics) / GREATEST(tdf.document_frequency, 1)) as tf_idf
    FROM term_tf tt
    JOIN term_df tdf ON tt.term_id = tdf.term_id;

    CREATE TABLE topic_top_terms AS
    WITH ranked_terms AS (
        SELECT
            ttf.topic_id,
            t.term_text,
            ttf.tf_idf,
            ROW_NUMBER() OVER (PARTITION BY ttf.topic_id ORDER BY ttf.tf_idf DESC) as rank
        FROM topic_term_tfidf ttf
        JOIN terms t ON ttf.term_id = t.id
    )
    SELECT
        topic_id,
        term_text,
        tf_idf,
        rank
    FROM ranked_terms
    WHERE rank <= 5
    ORDER BY topic_id, rank;

    RAISE NOTICE 'All topic tables refreshed successfully';

EXCEPTION
    WHEN OTHERS THEN
        RAISE EXCEPTION 'Error refreshing topic tables: %', SQLERRM;
END;
$$;</code></pre></p>
]]></description><pubDate>Sun, 11 Jan 2026 22:40:24 +0000</pubDate><link>https://news.ycombinator.com/item?id=46581179</link><dc:creator>johnwatson11218</dc:creator><comments>https://news.ycombinator.com/item?id=46581179</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46581179</guid></item><item><title><![CDATA[New comment by johnwatson11218 in "Show HN: I used Claude Code to discover connections between 100 books"]]></title><description><![CDATA[
<p>I did something similar whereby I used pdfplumber to extract text from my pdf book collection. I dumped it into postgresql, then chunked the text into 100 char chunks w/ a 10 char overlap. These chunks were directly embedded into a 384D space using python sentence_transformers. Then I simply averaged all chunks for a doc and wrote that single vector back to postgresql. Then I used UMAP + HDBSCAN to perform dimensionality reduction and clustering. I ended up with a 2D data set that I can plot with plotly to see my clusters. It is very cool to play with this. It takes hours to import 100 pdf files, but I can take one folder that contains a mix of programming titles, self-help, math, science fiction, etc., and after the fully automated analysis you can clearly see the different topic clusters.<p>I just spent time getting it all running on docker compose and moved my web ui from express js to flask. I want to get the code cleaned up and open source it at some point.</p>
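<p>The chunk-and-average step can be sketched in a few lines of plain Python (a hypothetical sketch; in the real pipeline the 384D vectors come from sentence_transformers, e.g. SentenceTransformer("all-MiniLM-L6-v2").encode, which is left out here so only the chunking and mean-pooling logic is shown):

```python
# Hypothetical sketch of the chunk-and-average step described above.
# The actual embedding vectors would come from sentence_transformers;
# here we only show the chunking and mean-pooling logic.

def chunk_text(text, size=100, overlap=10):
    """Split text into size-char chunks, each overlapping the previous by `overlap` chars."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def mean_pool(vectors):
    """Average a list of equal-length embedding vectors into one document vector."""
    n = len(vectors)
    return [sum(v[j] for v in vectors) / n for j in range(len(vectors[0]))]
```

With sentence_transformers this would be roughly mean_pool(model.encode(chunk_text(doc_text))): one vector per pdf, written back to postgresql.</p>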
]]></description><pubDate>Sat, 10 Jan 2026 23:14:05 +0000</pubDate><link>https://news.ycombinator.com/item?id=46570929</link><dc:creator>johnwatson11218</dc:creator><comments>https://news.ycombinator.com/item?id=46570929</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46570929</guid></item><item><title><![CDATA[New comment by johnwatson11218 in "Egyptian Hieroglyphs: Lesson 1"]]></title><description><![CDATA[
<p>I've read that by the end of ancient Egyptian history they had used tricks like a picture of an eye for the sound 'I', or a picture of a bee for the sound 'B', so a complete alphabet was embedded within the system. To be literate you had to know the tricks from the Old and Middle Kingdoms as well. The result was three complete writing systems layered together, similar to modern Japanese. From that point of view the invention of the alphabet was more of a simplification. This always reminded me of the situation in modern enterprise development where lots of infrastructure was written in-house.</p>
]]></description><pubDate>Thu, 18 Dec 2025 13:48:54 +0000</pubDate><link>https://news.ycombinator.com/item?id=46312587</link><dc:creator>johnwatson11218</dc:creator><comments>https://news.ycombinator.com/item?id=46312587</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46312587</guid></item><item><title><![CDATA[New comment by johnwatson11218 in "US Tech Force"]]></title><description><![CDATA[
<p>They talk about the specific systems in terms of legacy code and how far removed government agencies are from automated testing and other modern best practices. It has been a couple of years since I read it, but I recall a part about a business process at the IRS that people don't start learning until they have been there for about 17 years, due to the complexity. It talks about how there had been failed attempts to migrate to a new database; some of the data is now duplicated, but the upgrade is de-funded, so all the new code has to be aware that data <i>may</i> be duplicated.<p>I'm not sure if this book got into it, but I've also read that the IRS has assembly code from the 1960s that is very optimized and only a few devs can work on it. ChatGPT knows a lot about this history as well.</p>
]]></description><pubDate>Tue, 16 Dec 2025 00:11:41 +0000</pubDate><link>https://news.ycombinator.com/item?id=46282925</link><dc:creator>johnwatson11218</dc:creator><comments>https://news.ycombinator.com/item?id=46282925</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46282925</guid></item><item><title><![CDATA[New comment by johnwatson11218 in "US Tech Force"]]></title><description><![CDATA[
<p>If you want to know what you are up against I highly recommend - <a href="https://www.amazon.com/Recoding-America-Government-Failing-Digital-ebook/dp/B0B8644ZGY/ref=sr_1_1" rel="nofollow">https://www.amazon.com/Recoding-America-Government-Failing-D...</a><p>This book discusses the IT systems at the IRS and VA and shows the kind of pushback you can expect from entrenched players.</p>
]]></description><pubDate>Mon, 15 Dec 2025 17:40:22 +0000</pubDate><link>https://news.ycombinator.com/item?id=46277687</link><dc:creator>johnwatson11218</dc:creator><comments>https://news.ycombinator.com/item?id=46277687</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46277687</guid></item><item><title><![CDATA[New comment by johnwatson11218 in "Show HN: Strange Attractors"]]></title><description><![CDATA[
<p>I have a pdf of this book and was using an LLM to translate the old code into modern, idiomatic python, and it is very cool. I wonder if somebody will re-release it with modern code and tooling? In fact, Google Gemini was able to do it on the fly using the posted links.</p>
]]></description><pubDate>Fri, 07 Nov 2025 16:25:52 +0000</pubDate><link>https://news.ycombinator.com/item?id=45848059</link><dc:creator>johnwatson11218</dc:creator><comments>https://news.ycombinator.com/item?id=45848059</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45848059</guid></item><item><title><![CDATA[New comment by johnwatson11218 in "Ask HN: Who uses open LLMs and coding assistants locally? Share setup and laptop"]]></title><description><![CDATA[
<p>I have a pipeline in Docker compose that starts up postgresql in one container alongside a python container. The python scripts recursively read all the pdf files in a directory and use pdfplumber to parse the text, which is stored in a postgres table. Then I use sentence_transformers to take 100 char chunks, w/ 10 char overlap, and embed each section as a 384D vector which is written back to the db. Then I average all the chunks to create a single embedding for the entire pdf file. I have used numpy as well as built-in postgres functions to average, and it is fast either way.<p>Then I use UMAP + HDBSCAN to create a 2D projection of my dataset. HDBSCAN writes the clusters to a csv file. I read that back in to create topics and a docs2topics join table. Then I join each topic into a mega doc and, against the original corpus, compute tf-idf using only db functions. This gives me the top 5 or so terms per topic, which serve as useful topic labels.<p>I can do 30 to 50 docs in a couple of hours. I imported 1100 pdf files and it took all weekend on an old gaming laptop w/ a ssd. I have a gpu, and I think the embedding steps would go faster, but I'm still doing it all synchronously w/o any parallel processing.</p>
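<p>The tf-idf-over-mega-docs idea can be sketched in plain Python (an illustrative sketch with made-up names; the real version runs as SQL aggregates in postgres, treating each topic's concatenated documents as one "document"):

```python
import math
from collections import Counter

def topic_tfidf(topic_terms, top_n=5):
    """topic_terms maps topic_id -> list of terms in that topic's mega doc.
    Returns the top_n terms per topic ranked by tf-idf, where df counts
    how many topics a term appears in (mirroring the db version)."""
    n_topics = len(topic_terms)
    # document frequency: in how many topics does each term occur?
    df = Counter()
    for terms in topic_terms.values():
        df.update(set(terms))
    result = {}
    for topic_id, terms in topic_terms.items():
        tf = Counter(terms)  # term frequency within this topic's mega doc
        scored = {t: tf[t] * math.log(n_topics / df[t]) for t in tf}
        result[topic_id] = sorted(scored, key=scored.get, reverse=True)[:top_n]
    return result
```

Terms that appear in every topic get idf = log(1) = 0, so common words drop out and the surviving top terms make serviceable cluster labels.</p>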
]]></description><pubDate>Fri, 07 Nov 2025 14:52:07 +0000</pubDate><link>https://news.ycombinator.com/item?id=45846998</link><dc:creator>johnwatson11218</dc:creator><comments>https://news.ycombinator.com/item?id=45846998</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45846998</guid></item><item><title><![CDATA[New comment by johnwatson11218 in "How AI hears accents: An audible visualization of accent clusters"]]></title><description><![CDATA[
<p>I just got a project running whereby I used python + pdfplumber to read in 1100 pdf files, most of my humble bundle collection. I extracted the text and dumped it into a 'documents' table in postgresql. Then I used sentence transformers to reduce each 1K chunk to a single 384D vector, which I wrote back to the db. Then I averaged these to produce a document-level embedding as a single vector.<p>Then I was able to apply UMAP + HDBSCAN to this dataset and it produced a 2D plot of all my books. Later I put the discovered topics back in the db and used them to compute tf-idf for my clusters, from which I could pick the top 5 terms to serve as a crude cluster label.<p>It took about 20 to 30 hours to finish all these steps and I was very impressed with the results. I could see my cookbooks clearly separated from my programming and math books. I could drill in and see subclusters for baking, bbq, salads, etc.<p>Currently I'm putting it into a 2-container docker compose file: base postgresql + a python container I'm working on.</p>
]]></description><pubDate>Tue, 14 Oct 2025 21:47:26 +0000</pubDate><link>https://news.ycombinator.com/item?id=45585437</link><dc:creator>johnwatson11218</dc:creator><comments>https://news.ycombinator.com/item?id=45585437</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45585437</guid></item><item><title><![CDATA[New comment by johnwatson11218 in "IRS Direct File on GitHub"]]></title><description><![CDATA[
<p>Everyone is talking about having LLMs write software, but what about having them delete code? That can be very hard in a legacy enterprise environment. I think dead code detection overlaps with security, and that is a good way to sell that kind of code cleanup. Having LLMs review your architecture is a fun exercise; being able to incorporate that feedback is a good measure of a dev team.</p>
]]></description><pubDate>Fri, 06 Jun 2025 13:52:03 +0000</pubDate><link>https://news.ycombinator.com/item?id=44200929</link><dc:creator>johnwatson11218</dc:creator><comments>https://news.ycombinator.com/item?id=44200929</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44200929</guid></item><item><title><![CDATA[New comment by johnwatson11218 in "Oracle engineers caused five days software outage at U.S. hospitals"]]></title><description><![CDATA[
<p>I think the Oracle Transaction Manager is one of the best pieces of software I have had to work with in a professional setting. Lots of other stuff in an enterprise setting is very flaky and follows trends, but the Oracle internals seem very nice.</p>
]]></description><pubDate>Tue, 29 Apr 2025 15:50:16 +0000</pubDate><link>https://news.ycombinator.com/item?id=43834284</link><dc:creator>johnwatson11218</dc:creator><comments>https://news.ycombinator.com/item?id=43834284</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43834284</guid></item><item><title><![CDATA[New comment by johnwatson11218 in "Ask HN: Share your AI prompt that stumps every model"]]></title><description><![CDATA[
<p>My prompt that I couldn't get the LLM to understand was the following. I was having it generate images of depressing offices with no windows and lots of depressing, grey cubicles with paper all over the floor. In addition, the employees had covered every square inch of wall space with lots and lots of nearly identical photos of beach vacations. In one of the renditions the many beach images had blended together into an image of a larger beach, a kind of mosaic of a non-existent place. Since so many beach photos were similar, it was an easy effect to recreate here and there. But no matter how I asked the LLM to focus on enhancing the image of the beach that was "not there", which you kind of needed to squint to see, I could not get acceptable results. Some were very funny and entertaining, but I don't think the model grasped what I was asking. Maybe the term 'mosaic' (which I didn't include in my initial prompts) and the ability to reason or do things in stages would allow current models to do this.</p>
]]></description><pubDate>Fri, 25 Apr 2025 12:43:39 +0000</pubDate><link>https://news.ycombinator.com/item?id=43792982</link><dc:creator>johnwatson11218</dc:creator><comments>https://news.ycombinator.com/item?id=43792982</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43792982</guid></item><item><title><![CDATA[New comment by johnwatson11218 in "The psychology behind why children are hooked on Minecraft"]]></title><description><![CDATA[
<p>♠♥♣♦</p>
]]></description><pubDate>Fri, 04 Apr 2025 12:17:25 +0000</pubDate><link>https://news.ycombinator.com/item?id=43581147</link><dc:creator>johnwatson11218</dc:creator><comments>https://news.ycombinator.com/item?id=43581147</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43581147</guid></item><item><title><![CDATA[New comment by johnwatson11218 in "The reality of working in tech: We're not hired to write code (2023)"]]></title><description><![CDATA[
<p>I was working with deepseek to distill my thoughts on this, and the best quote was: "This resembles a lossy compression of software, where the 'loss' is non-essential complexity."<p>It suggested "literate programming meets LLMs", where the goal isn't just fewer lines, but denser meaning. If you're experimenting with it, start small: try distilling a single function with GPT-4 + human review, and see if the result feels correct but simpler. In other words, LLM-assisted refactoring for compression and clarity is the way to resolve the argument between 'more' or 'less' code in general.</p>
]]></description><pubDate>Fri, 04 Apr 2025 12:10:40 +0000</pubDate><link>https://news.ycombinator.com/item?id=43581084</link><dc:creator>johnwatson11218</dc:creator><comments>https://news.ycombinator.com/item?id=43581084</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43581084</guid></item><item><title><![CDATA[New comment by johnwatson11218 in "Dijkstra On the foolishness of "natural language programming""]]></title><description><![CDATA[
<p>Why did mathematicians invent new symbols? Imagine if all of algebra, calculus, and linear algebra looked like those word problems from antiquity. Natural language is not good for describing systems; symbolic forms are more compressed and can be considered a kind of technology in their own right.</p>
]]></description><pubDate>Thu, 03 Apr 2025 19:14:47 +0000</pubDate><link>https://news.ycombinator.com/item?id=43574077</link><dc:creator>johnwatson11218</dc:creator><comments>https://news.ycombinator.com/item?id=43574077</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43574077</guid></item><item><title><![CDATA[New comment by johnwatson11218 in "Ideas from "A Philosophy of Software Design""]]></title><description><![CDATA[
<p>A good game is supposed to be "easy to learn and hard to master". I think software abstractions should have this property as well. Too often the next "fix" in a long chain of failed ideas in over-engineered software feels like the Batman games, where one has to complete a mini tutorial to learn to use the "bat-whatever" for a single application/puzzle. Contrast this with the Borderlands franchise: I can learn to play Borderlands in 5 minutes and explore the skill tree and whatnot at my leisure, if at all. You hear about "deus ex machina" as a lazy trait in writing, but it is commonplace in enterprise software. Load Bearing Abstractions.</p>
]]></description><pubDate>Sun, 22 Dec 2024 17:58:11 +0000</pubDate><link>https://news.ycombinator.com/item?id=42487956</link><dc:creator>johnwatson11218</dc:creator><comments>https://news.ycombinator.com/item?id=42487956</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42487956</guid></item><item><title><![CDATA[New comment by johnwatson11218 in "Ask HN: What open source projects need help?"]]></title><description><![CDATA[
<p>I have wanted to do something like this for news websites for a while now. I tried the recipe site: my first url from lifehacker wasn't supported, and the second attempt, from the curated list, gave an error loading the site. So maybe make the links clickable, and have a top 10 ready to go (cached) at the bottom of the landing page. Write a bot to periodically check the list of sites this works on and indicate how recently said check was done. Then I might spend more time. Good luck!</p>
]]></description><pubDate>Sat, 30 Nov 2024 13:52:20 +0000</pubDate><link>https://news.ycombinator.com/item?id=42281634</link><dc:creator>johnwatson11218</dc:creator><comments>https://news.ycombinator.com/item?id=42281634</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42281634</guid></item><item><title><![CDATA[New comment by johnwatson11218 in "How Blackjack Works (2007)"]]></title><description><![CDATA[
<p>There was a documentary on Prime Video a couple of years ago that followed a card counter as he moved from casino to casino, living out of his RV. I think he was up around $600K after one year. He was getting kicked out of small casinos all over the place and had to keep moving to new ones every few days. At one point he tells the camera crew that he has to start driving half a day to skip over a few spots to stay ahead of the surveillance. It didn't look easy, but not impossible either. He got to set his own hours and be his own boss.</p>
]]></description><pubDate>Thu, 28 Nov 2024 00:27:49 +0000</pubDate><link>https://news.ycombinator.com/item?id=42261211</link><dc:creator>johnwatson11218</dc:creator><comments>https://news.ycombinator.com/item?id=42261211</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42261211</guid></item></channel></rss>