<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: cschmidt</title><link>https://news.ycombinator.com/user?id=cschmidt</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Fri, 17 Apr 2026 03:28:38 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=cschmidt" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by cschmidt in "The looming college-enrollment death spiral"]]></title><description><![CDATA[
<p>Those are not global students.  Those are people who are already living in the state. Foreign students typically pay the most tuition possible with no financial aid, subsidizing everyone else.</p>
]]></description><pubDate>Mon, 13 Apr 2026 20:48:15 +0000</pubDate><link>https://news.ycombinator.com/item?id=47757620</link><dc:creator>cschmidt</dc:creator><comments>https://news.ycombinator.com/item?id=47757620</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47757620</guid></item><item><title><![CDATA[New comment by cschmidt in "The Brand Age"]]></title><description><![CDATA[
<p>Looks great. I just ordered it. Thanks for the recommendation.</p>
]]></description><pubDate>Fri, 06 Mar 2026 16:16:46 +0000</pubDate><link>https://news.ycombinator.com/item?id=47276910</link><dc:creator>cschmidt</dc:creator><comments>https://news.ycombinator.com/item?id=47276910</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47276910</guid></item><item><title><![CDATA[New comment by cschmidt in "Google boss says AI investment boom has 'elements of irrationality'"]]></title><description><![CDATA[
<p>There are equal-weight S&P ETFs, which avoid having a handful of stocks dominate.  However, they do have to do a lot more rebalancing to keep things in line.</p>
]]></description><pubDate>Tue, 18 Nov 2025 21:36:51 +0000</pubDate><link>https://news.ycombinator.com/item?id=45972478</link><dc:creator>cschmidt</dc:creator><comments>https://news.ycombinator.com/item?id=45972478</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45972478</guid></item><item><title><![CDATA[New comment by cschmidt in "Karpathy on DeepSeek-OCR paper: Are pixels better inputs to LLMs than text?"]]></title><description><![CDATA[
<p>There is other research that works with pixels of text, such as this recent paper I saw at COLM 2025 <a href="https://arxiv.org/abs/2504.02122" rel="nofollow">https://arxiv.org/abs/2504.02122</a>.</p>
]]></description><pubDate>Thu, 23 Oct 2025 12:25:37 +0000</pubDate><link>https://news.ycombinator.com/item?id=45681038</link><dc:creator>cschmidt</dc:creator><comments>https://news.ycombinator.com/item?id=45681038</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45681038</guid></item><item><title><![CDATA[New comment by cschmidt in "Eleven Music"]]></title><description><![CDATA[
<p>I worry how often that is happening already on Spotify.</p>
]]></description><pubDate>Tue, 05 Aug 2025 17:36:49 +0000</pubDate><link>https://news.ycombinator.com/item?id=44801349</link><dc:creator>cschmidt</dc:creator><comments>https://news.ycombinator.com/item?id=44801349</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44801349</guid></item><item><title><![CDATA[Gian-Carlo Rota's Combinatorial Theory Course: The Guidi Notes]]></title><description><![CDATA[
<p>Article URL: <a href="https://www.ellerman.org/gian-carlo-rotas-combinatorial-theory-course-the-guidi-notes/">https://www.ellerman.org/gian-carlo-rotas-combinatorial-theory-course-the-guidi-notes/</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=44730583">https://news.ycombinator.com/item?id=44730583</a></p>
<p>Points: 1</p>
<p># Comments: 0</p>
]]></description><pubDate>Wed, 30 Jul 2025 03:01:38 +0000</pubDate><link>https://www.ellerman.org/gian-carlo-rotas-combinatorial-theory-course-the-guidi-notes/</link><dc:creator>cschmidt</dc:creator><comments>https://news.ycombinator.com/item?id=44730583</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44730583</guid></item><item><title><![CDATA[New comment by cschmidt in "Stanford’s Department of Management Science and Engineering"]]></title><description><![CDATA[
<p>I’m not sure about this master’s program, but the undergrad program seems to be proper ORMS.</p>
]]></description><pubDate>Tue, 29 Jul 2025 23:40:16 +0000</pubDate><link>https://news.ycombinator.com/item?id=44729500</link><dc:creator>cschmidt</dc:creator><comments>https://news.ycombinator.com/item?id=44729500</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44729500</guid></item><item><title><![CDATA[New comment by cschmidt in "Stanford’s Department of Management Science and Engineering"]]></title><description><![CDATA[
<p>I think in this context Management Science is an older term that was synonymous with operations research. The flagship journal of INFORMS (the Institute for Operations Research and the Management Sciences) has the same name. It’s about studying how to optimize things, with lots of statistics and math. Stanford was at the forefront of the field from George Dantzig onwards. So it’s not trying to make management a “science” in this case.</p>
]]></description><pubDate>Tue, 29 Jul 2025 23:32:58 +0000</pubDate><link>https://news.ycombinator.com/item?id=44729457</link><dc:creator>cschmidt</dc:creator><comments>https://news.ycombinator.com/item?id=44729457</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44729457</guid></item><item><title><![CDATA[New comment by cschmidt in "The bitter lesson is coming for tokenization"]]></title><description><![CDATA[
<p>Attention does help, which is why an LLM can learn arithmetic even with arbitrary tokenization.  However, if you put numbers in a standard form, such as right-to-left groups of 3, you make the problem easier for the LLM to learn: all the examples it sees are in the same format. Here, the issue is that BLT operates in an autoregressive manner (strictly left to right), which makes it harder to tokenize the digits in a way that is easy for the LLM to learn. Making each digit its own token (Llama style), or flipping the digits, might be the best option.</p>
]]></description><pubDate>Sat, 28 Jun 2025 15:17:54 +0000</pubDate><link>https://news.ycombinator.com/item?id=44405287</link><dc:creator>cschmidt</dc:creator><comments>https://news.ycombinator.com/item?id=44405287</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44405287</guid></item><item><title><![CDATA[New comment by cschmidt in "The bitter lesson is coming for tokenization"]]></title><description><![CDATA[
<p>Arithmetic proceeds right to left (from the ones place), while we write numbers left to right.  So if you see the digits 123... in an autoregressive manner, you really don't know anything yet, since the number could be 12345 or 1234567.  If you flip 12345 to 54321, you know the place value of each digit as you encounter it.  You know that the 5 you see first is in the ones place, the 4 is in the tens place, etc. It gives the LLM a better chance of learning arithmetic.</p>
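<p>The digit-flipping idea is easy to sketch. This is an illustrative snippet, not code from any particular tokenizer: it reverses every run of digits in a string so the ones place comes first.</p>

```python
import re

def reverse_digits(text: str) -> str:
    """Reverse each run of digits so the ones place is seen first.
    E.g. "12345 + 678" -> "54321 + 876"."""
    return re.sub(r"\d+", lambda m: m.group(0)[::-1], text)

print(reverse_digits("12345 + 678"))  # -> 54321 + 876
```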
]]></description><pubDate>Thu, 26 Jun 2025 12:02:55 +0000</pubDate><link>https://news.ycombinator.com/item?id=44386567</link><dc:creator>cschmidt</dc:creator><comments>https://news.ycombinator.com/item?id=44386567</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44386567</guid></item><item><title><![CDATA[New comment by cschmidt in "The bitter lesson is coming for tokenization"]]></title><description><![CDATA[
<p>And in regard to UTF-8 being a shitty biased tokenizer, here is a recent paper trying to design a better style of encoding: <a href="https://arxiv.org/abs/2505.24689" rel="nofollow">https://arxiv.org/abs/2505.24689</a></p>
]]></description><pubDate>Wed, 25 Jun 2025 13:23:34 +0000</pubDate><link>https://news.ycombinator.com/item?id=44377052</link><dc:creator>cschmidt</dc:creator><comments>https://news.ycombinator.com/item?id=44377052</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44377052</guid></item><item><title><![CDATA[New comment by cschmidt in "The bitter lesson is coming for tokenization"]]></title><description><![CDATA[
<p>Virtually all current tokenization schemes do work at the raw byte level, not on UTF-8 characters. They do this to avoid the out-of-vocabulary (OOV), or unknown token, problem.  In older models, if you came across something in the data you couldn't tokenize, you emitted an <UNK> token.  But tokenization should be exactly reversible, so now people use subword tokenizers whose vocabularies include all 256 single bytes.  That way you can always represent any text by dropping down to the single-byte level.  The other alternative would be to add all UTF-8 code points to the vocabulary, but there are more than 150k of those, and enough of them are rare that many would be undertrained.  You'd have a lot of glitch tokens (<a href="https://arxiv.org/abs/2405.05417" rel="nofollow">https://arxiv.org/abs/2405.05417</a>). That does mean an LLM isn't 100% guaranteed to output well-formed UTF-8.</p>
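<p>A toy sketch of the byte-fallback idea (a greedy longest-match encoder, not real BPE, and the vocab here is made up): because all 256 single bytes are always representable, any input encodes without an <UNK>.</p>

```python
def byte_fallback_encode(data: bytes, merges: dict) -> list:
    """Greedy longest-match encoding with single-byte fallback.
    Ids 0-255 are reserved for raw bytes, so every byte string is
    representable and the encoding is exactly reversible."""
    ids, i = [], 0
    while i < len(data):
        # try the longest learned multi-byte piece starting at i
        for j in range(len(data), i + 1, -1):
            if data[i:j] in merges:
                ids.append(merges[data[i:j]])
                i = j
                break
        else:
            ids.append(data[i])  # fallback: the raw byte id (0-255)
            i += 1
    return ids

# hypothetical vocab: two learned pieces on top of the 256 byte ids
merges = {b"the": 300, b"to": 301}
print(byte_fallback_encode(b"to the", merges))  # -> [301, 32, 300]
```

The non-ASCII case works the same way: an unseen character like "é" just falls back to its two UTF-8 bytes (195, 169).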
]]></description><pubDate>Wed, 25 Jun 2025 13:18:59 +0000</pubDate><link>https://news.ycombinator.com/item?id=44377004</link><dc:creator>cschmidt</dc:creator><comments>https://news.ycombinator.com/item?id=44377004</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44377004</guid></item><item><title><![CDATA[New comment by cschmidt in "The bitter lesson is coming for tokenization"]]></title><description><![CDATA[
<p>I suppose it is.  There is a lot to tokenization - pre-tokenization, how to handle digits, the tokenization training approach - that is about adding cleverness.  In the long run, the bitter lesson would be to just get rid of it all and learn from more data, and many people would love to do that. But I think for the case of BLT, digits will still be an issue.  There is no way an autoregressive entropy model can split numbers sensibly, since it has no idea how many digits are coming.  It seems like it will struggle more with arithmetic. Perhaps you could reverse all the digits in a number; then it has a chance.  So 12334 becomes 43321, and the model gets to start from the ones digit.  This has been suggested as an approach for LLMs.</p>
]]></description><pubDate>Wed, 25 Jun 2025 11:37:47 +0000</pubDate><link>https://news.ycombinator.com/item?id=44376102</link><dc:creator>cschmidt</dc:creator><comments>https://news.ycombinator.com/item?id=44376102</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44376102</guid></item><item><title><![CDATA[New comment by cschmidt in "The bitter lesson is coming for tokenization"]]></title><description><![CDATA[
<p>This paper has a good solution:<p><a href="https://arxiv.org/abs/2402.14903" rel="nofollow">https://arxiv.org/abs/2402.14903</a><p>You tokenize right to left in groups of 3, so 1234567 becomes 1 234 567 rather than the default 123 456 7.  And if you ensure all 1-3 digit groups are in the vocab, it does much better.<p>Both <a href="https://arxiv.org/abs/2503.13423" rel="nofollow">https://arxiv.org/abs/2503.13423</a> and <a href="https://arxiv.org/abs/2504.00178" rel="nofollow">https://arxiv.org/abs/2504.00178</a> (co-author) independently noted that you can do this just by modifying the pre-tokenization regex, without having to explicitly add commas.</p>
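<p>One way to express right-to-left grouping as a pre-tokenization regex (this is a sketch of the idea, not necessarily the exact pattern those papers use): a lookahead requires that the remaining digits after each match form complete groups of 3, so any leftover short group lands at the front.</p>

```python
import re

# split runs of digits into right-to-left groups of 3,
# so place values always line up across examples
DIGIT_GROUPS = re.compile(r"\d{1,3}(?=(?:\d{3})*(?!\d))")

def split_digits(text: str) -> list:
    return DIGIT_GROUPS.findall(text)

print(split_digits("1234567"))  # -> ['1', '234', '567']
```

In a real tokenizer this pattern would be folded into the pre-tokenization regex alongside the rules for words and punctuation.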
]]></description><pubDate>Tue, 24 Jun 2025 18:45:00 +0000</pubDate><link>https://news.ycombinator.com/item?id=44369438</link><dc:creator>cschmidt</dc:creator><comments>https://news.ycombinator.com/item?id=44369438</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44369438</guid></item><item><title><![CDATA[New comment by cschmidt in "Last fifty years of integer linear programming: Recent practical advances"]]></title><description><![CDATA[
<p>Gurobi does have a cloud service where you pay by the hour.  A full non-academic license is pricey.</p>
]]></description><pubDate>Sat, 14 Jun 2025 16:36:21 +0000</pubDate><link>https://news.ycombinator.com/item?id=44277319</link><dc:creator>cschmidt</dc:creator><comments>https://news.ycombinator.com/item?id=44277319</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44277319</guid></item><item><title><![CDATA[New comment by cschmidt in "Quarkdown: A modern Markdown-based typesetting system"]]></title><description><![CDATA[
<p>I'm just saying that these systems don't work for me.  I write ML/AI conference papers in LaTeX, and I think that use case will be tough to dislodge.  I can see this being very attractive to people making other types of documents without a fixed format, especially if you don't already know LaTeX.</p>
]]></description><pubDate>Wed, 04 Jun 2025 10:36:27 +0000</pubDate><link>https://news.ycombinator.com/item?id=44179253</link><dc:creator>cschmidt</dc:creator><comments>https://news.ycombinator.com/item?id=44179253</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44179253</guid></item><item><title><![CDATA[New comment by cschmidt in "Quarkdown: A modern Markdown-based typesetting system"]]></title><description><![CDATA[
<p>One thing that has helped with ease of use is Overleaf.  It is a hosted LaTeX editor with lots of collaboration features (leaving comments, history of edits) that let people collaborate in real time on a paper.  It comes with many templates to get you started on a new document.  If you're working with collaborators, it has a lock on the market.<p>LaTeX itself can be easy for simple things (pick a template, and put text in each section).  And it can grow into almost anything if you put in enough effort.  It is far and away the standard way to write math equations, so if your document has lots of formulas, that's a plus.</p>
]]></description><pubDate>Wed, 04 Jun 2025 10:26:38 +0000</pubDate><link>https://news.ycombinator.com/item?id=44179197</link><dc:creator>cschmidt</dc:creator><comments>https://news.ycombinator.com/item?id=44179197</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44179197</guid></item><item><title><![CDATA[New comment by cschmidt in "Quarkdown: A modern Markdown-based typesetting system"]]></title><description><![CDATA[
<p>You make a fair point - I'm talking specifically about CS/ML/AI conferences.  I shouldn't overgeneralize.</p>
]]></description><pubDate>Wed, 04 Jun 2025 10:18:43 +0000</pubDate><link>https://news.ycombinator.com/item?id=44179156</link><dc:creator>cschmidt</dc:creator><comments>https://news.ycombinator.com/item?id=44179156</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44179156</guid></item><item><title><![CDATA[New comment by cschmidt in "Quarkdown: A modern Markdown-based typesetting system"]]></title><description><![CDATA[
<p>Every conference has its own required LaTeX style file that must be used.  Unless there is an automated way to convert these exactly, I don't see how LaTeX alternatives can be used.</p>
]]></description><pubDate>Tue, 03 Jun 2025 14:54:30 +0000</pubDate><link>https://news.ycombinator.com/item?id=44170780</link><dc:creator>cschmidt</dc:creator><comments>https://news.ycombinator.com/item?id=44170780</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44170780</guid></item><item><title><![CDATA[New comment by cschmidt in "Tokenization for language modeling: BPE vs. Unigram Language Modeling (2020)"]]></title><description><![CDATA[
<p>Anyone reading this in the future, I meant to say the length weighting is a bit nonstandard. It is usually by frequency.  Oops</p>
]]></description><pubDate>Sun, 01 Jun 2025 14:14:50 +0000</pubDate><link>https://news.ycombinator.com/item?id=44151048</link><dc:creator>cschmidt</dc:creator><comments>https://news.ycombinator.com/item?id=44151048</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44151048</guid></item></channel></rss>