<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: cschmidt</title><link>https://news.ycombinator.com/user?id=cschmidt</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Fri, 17 Apr 2026 03:28:38 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=cschmidt" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by cschmidt in "The looming college-enrollment death spiral"]]></title><description><![CDATA[
<p>Those are not global students.  Those are people who are already living in the state. Foreign students typically pay the most tuition possible with no financial aid, subsidizing everyone else.</p>
]]></description><pubDate>Mon, 13 Apr 2026 20:48:15 +0000</pubDate><link>https://news.ycombinator.com/item?id=47757620</link><dc:creator>cschmidt</dc:creator><comments>https://news.ycombinator.com/item?id=47757620</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47757620</guid></item><item><title><![CDATA[New comment by cschmidt in "The Brand Age"]]></title><description><![CDATA[
<p>Looks great. I just ordered it. Thanks for the recommendation.</p>
]]></description><pubDate>Fri, 06 Mar 2026 16:16:46 +0000</pubDate><link>https://news.ycombinator.com/item?id=47276910</link><dc:creator>cschmidt</dc:creator><comments>https://news.ycombinator.com/item?id=47276910</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47276910</guid></item><item><title><![CDATA[New comment by cschmidt in "Google boss says AI investment boom has 'elements of irrationality'"]]></title><description><![CDATA[
<p>There are equal-weight S&P ETFs, which avoid having a handful of stocks dominate.  However, they do have to do a lot more rebalancing to keep things in line.</p>
]]></description><pubDate>Tue, 18 Nov 2025 21:36:51 +0000</pubDate><link>https://news.ycombinator.com/item?id=45972478</link><dc:creator>cschmidt</dc:creator><comments>https://news.ycombinator.com/item?id=45972478</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45972478</guid></item><item><title><![CDATA[New comment by cschmidt in "Karpathy on DeepSeek-OCR paper: Are pixels better inputs to LLMs than text?"]]></title><description><![CDATA[
<p>There is other research that works with pixels of text, such as this recent paper I saw at COLM 2025 <a href="https://arxiv.org/abs/2504.02122" rel="nofollow">https://arxiv.org/abs/2504.02122</a>.</p>
]]></description><pubDate>Thu, 23 Oct 2025 12:25:37 +0000</pubDate><link>https://news.ycombinator.com/item?id=45681038</link><dc:creator>cschmidt</dc:creator><comments>https://news.ycombinator.com/item?id=45681038</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45681038</guid></item><item><title><![CDATA[New comment by cschmidt in "Eleven Music"]]></title><description><![CDATA[
<p>I worry how often that is happening already on Spotify.</p>
]]></description><pubDate>Tue, 05 Aug 2025 17:36:49 +0000</pubDate><link>https://news.ycombinator.com/item?id=44801349</link><dc:creator>cschmidt</dc:creator><comments>https://news.ycombinator.com/item?id=44801349</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44801349</guid></item><item><title><![CDATA[Gian-Carlo Rota's Combinatorial Theory Course: The Guidi Notes]]></title><description><![CDATA[
<p>Article URL: <a href="https://www.ellerman.org/gian-carlo-rotas-combinatorial-theory-course-the-guidi-notes/">https://www.ellerman.org/gian-carlo-rotas-combinatorial-theory-course-the-guidi-notes/</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=44730583">https://news.ycombinator.com/item?id=44730583</a></p>
<p>Points: 1</p>
<p># Comments: 0</p>
]]></description><pubDate>Wed, 30 Jul 2025 03:01:38 +0000</pubDate><link>https://www.ellerman.org/gian-carlo-rotas-combinatorial-theory-course-the-guidi-notes/</link><dc:creator>cschmidt</dc:creator><comments>https://news.ycombinator.com/item?id=44730583</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44730583</guid></item><item><title><![CDATA[New comment by cschmidt in "Stanford’s Department of Management Science and Engineering"]]></title><description><![CDATA[
<p>I’m not sure about this master’s program, but the undergrad program seems to be proper ORMS.</p>
]]></description><pubDate>Tue, 29 Jul 2025 23:40:16 +0000</pubDate><link>https://news.ycombinator.com/item?id=44729500</link><dc:creator>cschmidt</dc:creator><comments>https://news.ycombinator.com/item?id=44729500</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44729500</guid></item><item><title><![CDATA[New comment by cschmidt in "Stanford’s Department of Management Science and Engineering"]]></title><description><![CDATA[
<p>I think in this context Management Science is an older term that was synonymous with operations research. The flagship journal of INFORMS (the Institute for Operations Research and the Management Sciences) has the same name. It’s about studying how to optimize things, with lots of statistics and math. Stanford was at the forefront of the field from George Dantzig onwards. So it’s not trying to make management a “science” in this case.</p>
]]></description><pubDate>Tue, 29 Jul 2025 23:32:58 +0000</pubDate><link>https://news.ycombinator.com/item?id=44729457</link><dc:creator>cschmidt</dc:creator><comments>https://news.ycombinator.com/item?id=44729457</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44729457</guid></item><item><title><![CDATA[New comment by cschmidt in "The bitter lesson is coming for tokenization"]]></title><description><![CDATA[
<p>Attention does help, which is why an LLM can learn arithmetic even with arbitrary tokenization.  However, if you put numbers in a standard form, such as right-to-left groups of 3, you make the problem easier for the LLM to learn: all the examples it sees are in the same format. Here, the issue is that BLT operates in an autoregressive manner (strictly left to right), which makes it harder to tokenize the digits in a way that is easy for the LLM to learn. Making each digit its own token (Llama style), or flipping the digits, might be the best option.</p>
]]></description><pubDate>Sat, 28 Jun 2025 15:17:54 +0000</pubDate><link>https://news.ycombinator.com/item?id=44405287</link><dc:creator>cschmidt</dc:creator><comments>https://news.ycombinator.com/item?id=44405287</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44405287</guid></item><item><title><![CDATA[New comment by cschmidt in "The bitter lesson is coming for tokenization"]]></title><description><![CDATA[
<p>Arithmetic proceeds right to left (from the ones place), while we write numbers left to right.  So if you see the digits 123... in an autoregressive manner, you really don't know anything yet, since the number could be 12345 or 1234567.  If you flip 12345 to 54321, you know the place value of each digit as you encounter it.  You know that the 5 you see first is in the ones place, the 4 is in the tens place, etc. It gives the LLM a better chance of learning arithmetic.</p>
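<p>The digit-flipping idea is easy to sketch. This is an illustrative snippet, not code from any particular tokenizer: it reverses every run of digits in a string so the ones place comes first.</p>

```python
import re

def reverse_digits(text: str) -> str:
    """Reverse each run of digits so the ones place is seen first.
    E.g. "12345 + 678" -> "54321 + 876"."""
    return re.sub(r"\d+", lambda m: m.group(0)[::-1], text)

print(reverse_digits("12345 + 678"))  # -> 54321 + 876
```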
]]></description><pubDate>Thu, 26 Jun 2025 12:02:55 +0000</pubDate><link>https://news.ycombinator.com/item?id=44386567</link><dc:creator>cschmidt</dc:creator><comments>https://news.ycombinator.com/item?id=44386567</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44386567</guid></item><item><title><![CDATA[New comment by cschmidt in "The bitter lesson is coming for tokenization"]]></title><description><![CDATA[
<p>And in regard to UTF-8 being a shitty biased tokenizer, here is a recent paper trying to design a better style of encoding: <a href="https://arxiv.org/abs/2505.24689" rel="nofollow">https://arxiv.org/abs/2505.24689</a></p>
]]></description><pubDate>Wed, 25 Jun 2025 13:23:34 +0000</pubDate><link>https://news.ycombinator.com/item?id=44377052</link><dc:creator>cschmidt</dc:creator><comments>https://news.ycombinator.com/item?id=44377052</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44377052</guid></item><item><title><![CDATA[New comment by cschmidt in "The bitter lesson is coming for tokenization"]]></title><description><![CDATA[
<p>Virtually all current tokenization schemes do work at the raw byte level, not on UTF-8 characters. They do this to avoid the out-of-vocabulary (OOV), or unknown token, problem.  In older models, if you came across something in the data you couldn't tokenize, you emitted an <UNK> token.  But tokenization should be exactly reversible, so now people use subword tokenizers whose vocabularies include all 256 single bytes.  That way you can always represent any text by dropping down to the single-byte level.  The other alternative would be to add all UTF-8 code points to the vocabulary, but there are more than 150k of those, and enough of them are rare that many would be undertrained.  You'd have a lot of glitch tokens (<a href="https://arxiv.org/abs/2405.05417" rel="nofollow">https://arxiv.org/abs/2405.05417</a>). That does mean an LLM isn't 100% guaranteed to output well-formed UTF-8.</p>
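<p>A toy sketch of the byte-fallback idea (a greedy longest-match encoder, not real BPE, and the vocab here is made up): because all 256 single bytes are always representable, any input encodes without an <UNK>.</p>

```python
def byte_fallback_encode(data: bytes, merges: dict) -> list:
    """Greedy longest-match encoding with single-byte fallback.
    Ids 0-255 are reserved for raw bytes, so every byte string is
    representable and the encoding is exactly reversible."""
    ids, i = [], 0
    while i < len(data):
        # try the longest learned multi-byte piece starting at i
        for j in range(len(data), i + 1, -1):
            if data[i:j] in merges:
                ids.append(merges[data[i:j]])
                i = j
                break
        else:
            ids.append(data[i])  # fallback: the raw byte id (0-255)
            i += 1
    return ids

# hypothetical vocab: two learned pieces on top of the 256 byte ids
merges = {b"the": 300, b"to": 301}
print(byte_fallback_encode(b"to the", merges))  # -> [301, 32, 300]
```

The non-ASCII case works the same way: an unseen character like "é" just falls back to its two UTF-8 bytes (195, 169).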
]]></description><pubDate>Wed, 25 Jun 2025 13:18:59 +0000</pubDate><link>https://news.ycombinator.com/item?id=44377004</link><dc:creator>cschmidt</dc:creator><comments>https://news.ycombinator.com/item?id=44377004</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44377004</guid></item><item><title><![CDATA[New comment by cschmidt in "The bitter lesson is coming for tokenization"]]></title><description><![CDATA[
<p>I suppose it is.  There is a lot to tokenization - pre-tokenization, how to handle digits, the tokenization training approach - that is about adding cleverness.  In the long run, the bitter lesson would be to just get rid of it all and learn from more data, and many people would love to do that. But I think for the case of BLT, digits will still be an issue.  There is no way an autoregressive entropy model can split numbers sensibly, since it has no idea how many digits are coming.  It seems like it will struggle more with arithmetic. Perhaps you could reverse all the digits in a number; then it has a chance.  So 12334 becomes 43321, and the model gets to start from the ones digit.  This has been suggested as an approach for LLMs.</p>
]]></description><pubDate>Wed, 25 Jun 2025 11:37:47 +0000</pubDate><link>https://news.ycombinator.com/item?id=44376102</link><dc:creator>cschmidt</dc:creator><comments>https://news.ycombinator.com/item?id=44376102</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44376102</guid></item><item><title><![CDATA[New comment by cschmidt in "The bitter lesson is coming for tokenization"]]></title><description><![CDATA[
<p>This paper has a good solution:<p><a href="https://arxiv.org/abs/2402.14903" rel="nofollow">https://arxiv.org/abs/2402.14903</a><p>You tokenize right to left in groups of 3, so 1234567 becomes 1 234 567 rather than the default 123 456 7.  And if you ensure all 1-3 digit groups are in the vocab, it does much better.<p>Both <a href="https://arxiv.org/abs/2503.13423" rel="nofollow">https://arxiv.org/abs/2503.13423</a> and <a href="https://arxiv.org/abs/2504.00178" rel="nofollow">https://arxiv.org/abs/2504.00178</a> (co-author) independently noted that you can do this just by modifying the pre-tokenization regex, without having to explicitly add commas.</p>
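<p>One way to express right-to-left grouping as a pre-tokenization regex (this is a sketch of the idea, not necessarily the exact pattern those papers use): a lookahead requires that the remaining digits after each match form complete groups of 3, so any leftover short group lands at the front.</p>

```python
import re

# split runs of digits into right-to-left groups of 3,
# so place values always line up across examples
DIGIT_GROUPS = re.compile(r"\d{1,3}(?=(?:\d{3})*(?!\d))")

def split_digits(text: str) -> list:
    return DIGIT_GROUPS.findall(text)

print(split_digits("1234567"))  # -> ['1', '234', '567']
```

In a real tokenizer this pattern would be folded into the pre-tokenization regex alongside the rules for words and punctuation.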
]]></description><pubDate>Tue, 24 Jun 2025 18:45:00 +0000</pubDate><link>https://news.ycombinator.com/item?id=44369438</link><dc:creator>cschmidt</dc:creator><comments>https://news.ycombinator.com/item?id=44369438</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44369438</guid></item><item><title><![CDATA[New comment by cschmidt in "Last fifty years of integer linear programming: Recent practical advances"]]></title><description><![CDATA[
<p>Gurobi does have a cloud service where you pay by the hour.  A full non-academic license is pricey.</p>
]]></description><pubDate>Sat, 14 Jun 2025 16:36:21 +0000</pubDate><link>https://news.ycombinator.com/item?id=44277319</link><dc:creator>cschmidt</dc:creator><comments>https://news.ycombinator.com/item?id=44277319</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44277319</guid></item><item><title><![CDATA[New comment by cschmidt in "Quarkdown: A modern Markdown-based typesetting system"]]></title><description><![CDATA[
<p>I'm just saying that these systems don't work for me.  I write ML/AI conference papers in LaTeX, and I think that use case will be tough to dislodge.  I can see this being very attractive to people making other types of documents without a fixed format, especially if you don't already know LaTeX.</p>
]]></description><pubDate>Wed, 04 Jun 2025 10:36:27 +0000</pubDate><link>https://news.ycombinator.com/item?id=44179253</link><dc:creator>cschmidt</dc:creator><comments>https://news.ycombinator.com/item?id=44179253</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44179253</guid></item><item><title><![CDATA[New comment by cschmidt in "Quarkdown: A modern Markdown-based typesetting system"]]></title><description><![CDATA[
<p>One thing that has helped with ease of use is Overleaf.  It is a hosted LaTeX editor with lots of collaboration features (leaving comments, history of edits) that let people collaborate in real time on a paper.  It comes with many templates to get you started on a new document.  If you're working with collaborators, it has a lock on the market.<p>LaTeX itself can be easy for simple things (pick a template, and put text in each section).  And it can grow into almost anything if you put in enough effort.  It is far and away the standard way to write math equations, so if your document has lots of formulas, that's a plus.</p>
]]></description><pubDate>Wed, 04 Jun 2025 10:26:38 +0000</pubDate><link>https://news.ycombinator.com/item?id=44179197</link><dc:creator>cschmidt</dc:creator><comments>https://news.ycombinator.com/item?id=44179197</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44179197</guid></item><item><title><![CDATA[New comment by cschmidt in "Quarkdown: A modern Markdown-based typesetting system"]]></title><description><![CDATA[
<p>You make a fair point - I'm talking specifically about CS/ML/AI conferences.  I shouldn't overgeneralize.</p>
]]></description><pubDate>Wed, 04 Jun 2025 10:18:43 +0000</pubDate><link>https://news.ycombinator.com/item?id=44179156</link><dc:creator>cschmidt</dc:creator><comments>https://news.ycombinator.com/item?id=44179156</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44179156</guid></item><item><title><![CDATA[New comment by cschmidt in "Quarkdown: A modern Markdown-based typesetting system"]]></title><description><![CDATA[
<p>Every conference has its own required LaTeX style file that must be used.  Unless there is an automated way to convert these exactly, I don't see how LaTeX alternatives can be used.</p>
]]></description><pubDate>Tue, 03 Jun 2025 14:54:30 +0000</pubDate><link>https://news.ycombinator.com/item?id=44170780</link><dc:creator>cschmidt</dc:creator><comments>https://news.ycombinator.com/item?id=44170780</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44170780</guid></item><item><title><![CDATA[New comment by cschmidt in "Tokenization for language modeling: BPE vs. Unigram Language Modeling (2020)"]]></title><description><![CDATA[
<p>Anyone reading this in the future, I meant to say the length weighting is a bit nonstandard. It is usually by frequency.  Oops</p>
]]></description><pubDate>Sun, 01 Jun 2025 14:14:50 +0000</pubDate><link>https://news.ycombinator.com/item?id=44151048</link><dc:creator>cschmidt</dc:creator><comments>https://news.ycombinator.com/item?id=44151048</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44151048</guid></item></channel></rss>