Hacker News: routerl

New comment by routerl in "A DOGE staffer appears to be posting DOGE work on his public GitHub"

routerl — Sat, 01 Mar 2025 16:34:33 +0000

From the perspective of wanting to maintain the integrity of the American federal government, it seems like all this DOGE stuff (and the whole Trumpist movement in general) serves the purpose of a red team, in the cybersecurity sense; people with nebulous intent have gotten access to everything.

So now, if Americans care about the integrity of their government, there needs to be a blue team: how can this catastrophic level of access be dealt with, and how can it be safeguarded against in the future? Alas, I'm not seeing this perspective being enacted. The obvious security compromise is being allowed to stand and continue, usually on the assumption that "separation of powers" and "checks and balances" will be effective; Congress will stop this, or the courts will stop this. But we're watching these mechanisms fail.

So, what's the plan here? Where's the counter-offensive? We're watching a system being hacked, and I've yet to see anyone talk about a recovery plan, or a prevention plan.

New comment by routerl in "The number line freaks me out (2016)"

routerl — Wed, 19 Feb 2025 03:02:58 +0000

It seems to be an article about all those "harmless" lies we tell students.

The vast majority of people think mathematics is about numbers, when it is actually about relations, and numbers are just some of the entities whose relations mathematics studies.

Nobody is born with this misconception; we teach it, and test it, and thereby ingrain it in the minds of every student, most of whom will never study mathematics at a level that makes them go "wait, what?". The overwhelming majority of people never get to this level.

I suspect this is also why statistics feels so counterintuitive to so many people, including me. The Monty Hall problem is only a problem to those who are naive about probability, which is most people, because most of us don't learn any of this stuff early enough to form long lasting, correct instincts.

It's not fair to students to bake "harmless" lies into their early education, as a way to simplify the topic such that it becomes more easily teachable. We've only done this because teaching is hard, and thus expensive. Education is expensive, at every step. It's not fair or productive to build a gate around proper education that makes it available only to those who can afford it at the level where the early misconceptions get corrected. Even those people end up spending a lot of cognitive capital on all those "wait, what?" moments, when their cognitive capital would be better spent elsewhere.

New comment by routerl in "Show HN: Mandarin Word Segmenter with Translation"

routerl — Tue, 11 Feb 2025 22:07:02 +0000

OP here, I'm adding a feature that will allow users to save specific words to lists, and export the lists in formats that can be imported to flashcard apps.

New comment by routerl in "Show HN: Mandarin Word Segmenter with Translation"

routerl — Sat, 08 Feb 2025 15:48:09 +0000

For anonymous users, I'm using OpenNMT, via Argos. Logged-in users get DeepL translations, which correctly translate 气功师.

New comment by routerl in "Show HN: Mandarin Word Segmenter with Translation"

routerl — Sat, 08 Feb 2025 11:52:23 +0000

Thank you, and thanks for checking it out!

I use Pleco almost every day :)

New comment by routerl in "Show HN: Mandarin Word Segmenter with Translation"

routerl — Sat, 08 Feb 2025 11:51:28 +0000

I did! Jieba is the first step in my segmentation pipeline. As far as I can tell, Jieba's default config tends to work better for simplified, but in my case the custom dictionary I feed it has significantly more traditional entries than simplified entries, especially for historical terms and slang.

New comment by routerl in "Show HN: Mandarin Word Segmenter with Translation"

routerl — Sat, 08 Feb 2025 11:48:53 +0000

It supports traditional and simplified, as well as pinyin and bopomofo :)

It's already possible to switch instantly between pinyin and bopomofo, and I'm working on letting users switch between simplified/traditional, but this is also a non-trivial problem. For now, the app will follow the user's lead: if you enter traditional text, it will return traditional text, and same goes for simplified.

New comment by routerl in "Show HN: Mandarin Word Segmenter with Translation"

routerl — Sat, 08 Feb 2025 11:44:56 +0000

Thanks for the kind words, and the bug report!

The (awful and incorrect) translation you've pointed out comes from the segmenter being too greedy, not finding the (non-existent) word in any dictionary, and therefore dispatching the word to be machine translated, without context. This is the final fallback in the segmentation pipeline, to avoid displaying nothing at all, and my priority right now is making the segmentation pipeline more robust so this rarely (or never) happens, since it sometimes produces hilariously bad results!
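A minimal sketch of the fallback order described above, with one extra step that is purely my assumption and not necessarily how mandoBot does it: before dispatching an unknown token to machine translation, try to re-split it into dictionary words by forward maximum matching. The names `resplit`, `gloss`, and `machine_translate` are hypothetical, and the dictionary is a toy.

```python
from typing import Callable, Dict, List, Optional, Tuple


def resplit(token: str, dictionary: Dict[str, str]) -> Optional[List[str]]:
    """Forward maximum matching: cover the token with the longest
    dictionary entries, left to right. Returns None if any part is unknown."""
    parts: List[str] = []
    i = 0
    while i < len(token):
        # Try the longest candidate first, then shrink.
        for j in range(len(token), i, -1):
            if token[i:j] in dictionary:
                parts.append(token[i:j])
                i = j
                break
        else:
            return None  # nothing starting at position i is in the dictionary
    return parts


def gloss(
    token: str,
    dictionary: Dict[str, str],
    machine_translate: Callable[[str], str],
) -> List[Tuple[str, str, str]]:
    """Return (word, definition, source) triples for one segmenter token."""
    if token in dictionary:
        return [(token, dictionary[token], "dictionary")]
    parts = resplit(token, dictionary)  # hypothetical extra step
    if parts is not None:
        return [(p, dictionary[p], "dictionary") for p in parts]
    # Final fallback: context-free machine translation of the whole token.
    return [(token, machine_translate(token), "machine")]


toy_dictionary = {"我": "I", "喜欢": "to like"}
fake_mt = lambda text: "MT(" + text + ")"  # stand-in for a real MT call
print(gloss("我喜欢", toy_dictionary, fake_mt))
```

The point of the extra step is only that an over-greedy token gets a second chance at a dictionary lookup, so the context-free machine translation is reached less often.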

New comment by routerl in "Nontraditional Red Teams"

routerl — Fri, 07 Feb 2025 00:48:49 +0000

'"Tradition" is a set of solutions for which we have forgotten the problems. Throw away the solution and you get the problem back.'

This is, by far, my most conservative opinion. Credit to Donald Kingsbury for the quote.

Honorable mention re: the same problem, "dogfooding"[0] is gone from the software industry, which is why users often feel like they're getting suckered by the companies they patronize; the decision makers, who don't themselves use the product, absolutely see the users as suckers.

[0] https://en.wikipedia.org/wiki/Eating_your_own_dog_food

New comment by routerl in "Show HN: Mandarin Word Segmenter with Translation"

routerl — Wed, 05 Feb 2025 21:19:46 +0000

Got it, thanks!

New comment by routerl in "Show HN: Mandarin Word Segmenter with Translation"

routerl — Tue, 04 Feb 2025 21:44:29 +0000

Could you post the text you used? This kind of thing goes straight into my unit tests.

I'm also working on showing all the pronunciations/definitions for a given hanzi; it should be ready later this week.

New comment by routerl in "Show HN: Mandarin Word Segmenter with Translation"

routerl — Tue, 04 Feb 2025 19:51:53 +0000

Thanks for the kind words!

I'm using Jieba[0] because it hits a nice balance of fast and accurate. But I'm initializing it with a custom dictionary (~800k entries), and have added several layers of heuristic post-segmentation. For example, Jieba tends to split up chengyu into two words, but I've decided they should be displayed as a single word, since chengyu are typically a single entry in dictionaries.

[0] https://github.com/fxsjy/jieba
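To illustrate the kind of post-segmentation heuristic mentioned above, here is a rough sketch that merges adjacent tokens when their concatenation is a known dictionary entry. It is not the actual mandoBot code: `merge_known_words`, the `max_len` cutoff, and the sample token list are all hypothetical, and the input is a hand-written list standing in for segmenter output.

```python
from typing import Iterable, List, Set


def merge_known_words(
    tokens: Iterable[str], dictionary: Set[str], max_len: int = 4
) -> List[str]:
    """Greedily merge runs of adjacent tokens whose concatenation is a
    dictionary entry of at most max_len characters (4 fits most chengyu)."""
    tokens = list(tokens)
    merged: List[str] = []
    i = 0
    while i < len(tokens):
        best_end = i + 1  # default: keep the token as-is
        candidate = tokens[i]
        for j in range(i + 1, len(tokens)):
            candidate += tokens[j]
            if len(candidate) > max_len:
                break
            if candidate in dictionary:
                best_end = j + 1  # remember the longest match so far
        merged.append("".join(tokens[i:best_end]))
        i = best_end
    return merged


# Hypothetical segmenter output in which a chengyu was split in two.
split_tokens = ["他", "画蛇", "添足", "了"]
chengyu = {"画蛇添足"}
print(merge_known_words(split_tokens, chengyu))
```

Keeping this as a separate pass over the token list means the segmenter itself stays untouched, and each heuristic can be unit-tested on plain lists of strings.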

New comment by routerl in "Ask HN: Who wants to be hired? (February 2025)"

routerl — Tue, 04 Feb 2025 17:59:24 +0000

Location: Toronto, Canada

Remote: Yes

Willing to relocate: No

Technologies:

- Frontend: TypeScript, JavaScript, Next.js, React, Redux, Swift

- Backend: Python, Django, PostgreSQL, Docker

Résumé/CV: Via e-mail

Email: roberto.loja+hn@gmail.com

I've submitted my latest work as a Show HN post at https://news.ycombinator.com/item?id=42936085

Show HN: Mandarin Word Segmenter with Translation

routerl — Tue, 04 Feb 2025 17:56:33 +0000

I've built mandoBot, a web app that segments and translates Mandarin Chinese text. This is a Django API (using Django-Ninja and PostgreSQL) and a Next.js front-end (with TypeScript and Chakra). For a sample of what this app does, head to https://mandobot.netlify.app/?share_id=e8PZ8KFE5Y. This is my presentation of the first chapter of a classic story from the Republican era of Chinese fiction, Diary of a Madman by Lu Xun. Other chapters are located in the "Reading Room" section of the app.

This app exists because reading Mandarin is very hard for learners (like me), since Mandarin text does not separate words using spaces in the same way Western languages do. But extensive reading is the most effective way to learn vocabulary and grammar. Thus, learning Mandarin by reading requires first memorizing hundreds or thousands of words, before you can even know where one word ends and the next word begins.

I'm solving this problem by allowing users to input Mandarin text, which is then computationally segmented and machine translated by my server, which also adds dictionary definitions for each word and character. The hard part is the segmentation: it turns out that "Chinese Word Segmentation"[0] is the central problem in Chinese Natural Language Processing; no current solution reaches 100% accuracy, whether it's from Stanford[1], Academia Sinica[2], or Tsinghua University[3]. This includes every LLM currently available.

I could talk about this for hours, but the bottom line is that this app is a way to develop my full-stack skills; the backend should be fast, accurate, secure, well-tested, and well-documented, and the front-end should be pretty, secure, well-tested, responsive, and accessible. I am the sole developer, and I'm open to any comments and suggestions: roberto.loja+hn@gmail.com

Thanks HN!

[0] https://en.wikipedia.org/wiki/Chinese_word-segmented_writing

[1] https://nlp.stanford.edu/software/segmenter.shtml

[2] https://ckip.iis.sinica.edu.tw/project/ws

[3] http://thulac.thunlp.org/

Comments URL: https://news.ycombinator.com/item?id=42936085

Points: 48

# Comments: 35

New comment by routerl in "AI Expert's Testimony Collapses over Fake AI Citations"

routerl — Mon, 03 Feb 2025 23:49:57 +0000

Seems like a straightforward case of malpractice. The guy had every ability to double check the hallucinated references, but didn't do so; he used AI to replace his own expertise, rather than augment it. I.e. he didn't "use AI to help write", he instead outsourced his work to AI with no oversight.

New comment by routerl in "Show HN: DeepSeek Your HN Profile"

routerl — Tue, 28 Jan 2025 21:03:22 +0000

I can't get mad at this, it's so spot-on.

"Predictions for 2025 Personal Projects

Will start working on a novel semantic search engine for Chinese literature, combining their interest in both Chinese culture and search technologies"

Oh wow, I've been working on this for the past two months and haven't posted about it yet.

New comment by routerl in "Forget ChatGPT: why researchers now run small AIs on their laptops"

routerl — Tue, 24 Sep 2024 13:41:02 +0000

Yes.

New comment by routerl in "Forget ChatGPT: why researchers now run small AIs on their laptops"

routerl — Sat, 21 Sep 2024 15:07:25 +0000

Imho, and I'm no expert, but this has been working well for me:

Segment the texts into chunks that make sense (i.e. into the lengths of text you'll want to find, whether this means chapters, sub-chapters, paragraphs, etc), create embeddings of each chunk, and store the resultant vectors in a vector database. Your search workflow will then be to create an embedding of your query, and perform a distance comparison (e.g. cosine similarity) which returns ranked results. This way you can now semantically search your texts.

Everything I've mentioned above is fairly easily doable with existing LLM libraries like langchain or llamaindex. For reference, this is the retrieval half of a RAG workflow.
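Here is a toy, standard-library-only sketch of that chunk, embed, store, and search loop. Big caveat: `embed` below is just a bag-of-words counter standing in for a real embedding model, so it only matches exact words rather than meaning, and `VectorStore` is a hypothetical in-memory list, not any particular vector database. Only the shape of the workflow carries over.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class VectorStore:
    """Hypothetical in-memory store of (chunk, vector) pairs."""

    def __init__(self) -> None:
        self._items: List[Tuple[str, Counter]] = []

    def add(self, chunk: str) -> None:
        self._items.append((chunk, embed(chunk)))

    def search(self, query: str, k: int = 3) -> List[Tuple[float, str]]:
        """Embed the query and return the k most similar chunks, best first."""
        q = embed(query)
        scored = [(cosine_similarity(q, vec), chunk) for chunk, vec in self._items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


store = VectorStore()
for chunk in [
    "The monk travels west to fetch the scriptures.",
    "The monkey king fights the heavenly army.",
    "Recipes for steamed buns and dumplings.",
]:
    store.add(chunk)

for score, chunk in store.search("monkey king", k=2):
    print(round(score, 3), chunk)
```

Swapping `embed` for a real model and the list for a real vector database changes the quality of the results, not the structure of the code.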

New comment by routerl in "Steve Ballmer's incorrect binary search interview question"

routerl — Tue, 03 Sep 2024 13:48:50 +0000

This write-up makes the erroneous assumption that he's choosing randomly. He himself says, in this same write-up, that he's choosing adversarially.

Nice write-up anyway, and yes, Ballmer is wrong.
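To make the random vs. adversarial distinction concrete, here is a small sketch. It assumes the game is guessing a number from 1 to 100 with higher/lower feedback (the range is my assumption, not stated above) and that the guesser uses a plain, deterministic midpoint binary search; `guesses_needed` is a name I made up. It ignores randomized guessing strategies, so it is not a full game-theoretic analysis.

```python
def guesses_needed(target: int, lo: int = 1, hi: int = 100) -> int:
    """Count the guesses a midpoint binary search makes before hitting target."""
    count = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        count += 1
        if mid == target:
            return count
        if mid < target:
            lo = mid + 1
        else:
            hi = mid - 1
    raise ValueError("target is outside the search range")


counts = [guesses_needed(t) for t in range(1, 101)]

# A random chooser makes you pay the average; an adversary who knows
# your strategy makes you pay the worst case every time.
average_case = sum(counts) / len(counts)
worst_case = max(counts)
print("average:", average_case, "worst:", worst_case)
```

On 1..100 this gives an average of 5.8 guesses but a worst case of 7, so any analysis built on the average undersells what an adversarial chooser can do to a predictable guesser.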

New comment by routerl in "China's 'Wukong' Hit Sells 10M Copies in Three Days"

routerl — Thu, 29 Aug 2024 00:35:42 +0000

It's not a "Chinese mythical book", it's one of the classical novels of Chinese civilization. Think of it as a cross between Lord of the Rings and The Iliad, but containing extensive references to ancient Chinese tales, culture, religion (especially Buddhism), and history (the central monk character in the story is based on a real, and revered, historical monk).

It also has beloved and well-known characters who have featured in all kinds of Chinese stories and media for centuries: e.g. Erlang (no relation to the programming language), who features prominently in the opening cutscene, and is often found in stories accompanied by Nezha (who is so popular that generations of Chinese kids grew up hearing this[0] song and watching that show).

And this is by no means just a Chinese phenomenon. This story, Journey to the West, is a cultural keystone in all of Asia: Dragon Ball is very much based on Journey to the West[1], and "Son Goku" is just the Japanese pronunciation of the name "Sun Wukong", who is the monkey protagonist of this game. The two share many of their powers and characteristics, including flying around on a magic cloud and becoming powerful enough to challenge literal gods.

Finally, this is perhaps the first "postmodern" retelling of this extremely popular story. The game is called "black myth" because it is clearly darker and more serious than previous retellings of this story. For someone who knows this story well (i.e. basically anyone who grew up in Asia), it is a fresh version of an old classic. In this sense, this game is the equivalent of what The Witcher was for (mainly Eastern) Europeans; it takes legends, stories, and superstitions you grew up hearing about (e.g. vampires and werewolves) and breathes new life into them.

[0] https://m.youtube.com/watch?v=TG_KTrCetcM

[1] even down to having a cowardly pigman companion, Bajie, who is also in this game.