<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: schopra909</title><link>https://news.ycombinator.com/user?id=schopra909</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Wed, 29 Apr 2026 20:22:00 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=schopra909" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by schopra909 in "Launch HN: Freestyle – Sandboxes for Coding Agents"]]></title><description><![CDATA[
<p>Honestly never considered the forking use case, but it makes a ton of sense when explained.<p>Congrats on the launch. This is cool tech.</p>
]]></description><pubDate>Mon, 06 Apr 2026 19:02:29 +0000</pubDate><link>https://news.ycombinator.com/item?id=47665301</link><dc:creator>schopra909</dc:creator><comments>https://news.ycombinator.com/item?id=47665301</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47665301</guid></item><item><title><![CDATA[New comment by schopra909 in "Show HN: Three new Kitten TTS models – smallest less than 25MB"]]></title><description><![CDATA[
<p>Really cool to see innovation in the quality of tiny models. Great work!</p>
]]></description><pubDate>Thu, 19 Mar 2026 18:33:39 +0000</pubDate><link>https://news.ycombinator.com/item?id=47443815</link><dc:creator>schopra909</dc:creator><comments>https://news.ycombinator.com/item?id=47443815</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47443815</guid></item><item><title><![CDATA[New comment by schopra909 in "Exploring JEPA for real-time speech translation"]]></title><description><![CDATA[
<p>Very cool work! We spend a lot of time thinking about "robust representations" in the video space.<p>Are there any alternative ideas to JEPA right now when it comes to speech encoding that couples meaning and sound? Curious to learn more about the journey from the problem space to the solution space (JEPA).<p>For context, in our domain video-JEPA hasn't proved to be as helpful as one would have hoped. It's decent at high-level semantics (e.g. action detection) but doesn't capture enough "detail" (intentionally so) to be used as a powerful enough encoder (or regularizer). That might just be because the research models are too small / haven't been trained on sufficiently large volumes of data yet.</p>
]]></description><pubDate>Sat, 14 Mar 2026 01:49:44 +0000</pubDate><link>https://news.ycombinator.com/item?id=47372485</link><dc:creator>schopra909</dc:creator><comments>https://news.ycombinator.com/item?id=47372485</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47372485</guid></item><item><title><![CDATA[New comment by schopra909 in "Don't post generated/AI-edited comments. HN is for conversation between humans"]]></title><description><![CDATA[
<p>Honest question: why were folks posting AI-generated comments in the first place? There's such high inertia to commenting. I only comment when I have something to contribute OR find something incredibly interesting.<p>So I'm just baffled as to why anyone was using AI to generate comments. What was the incentive driving the behavior?</p>
]]></description><pubDate>Wed, 11 Mar 2026 22:21:30 +0000</pubDate><link>https://news.ycombinator.com/item?id=47343025</link><dc:creator>schopra909</dc:creator><comments>https://news.ycombinator.com/item?id=47343025</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47343025</guid></item><item><title><![CDATA[We Built an $8/Month GPU-Cluster Monitor]]></title><description><![CDATA[
<p>Article URL: <a href="https://www.linum.ai/field-notes/gpu-monitoring-service">https://www.linum.ai/field-notes/gpu-monitoring-service</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=47253237">https://news.ycombinator.com/item?id=47253237</a></p>
<p>Points: 3</p>
<p># Comments: 0</p>
]]></description><pubDate>Wed, 04 Mar 2026 20:20:29 +0000</pubDate><link>https://www.linum.ai/field-notes/gpu-monitoring-service</link><dc:creator>schopra909</dc:creator><comments>https://news.ycombinator.com/item?id=47253237</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47253237</guid></item><item><title><![CDATA[New comment by schopra909 in "Learnings from 4 months of Image-Video VAE experiments"]]></title><description><![CDATA[
<p>It’s a great question. In terms of pre-training, even if there were enough data at that quality, storing it and either demuxing it into raw frames OR compressing it with a sufficiently powerful encoder would likely cost a lot of $. But there’s a case for using a much smaller subset of that data to dial in aesthetics towards the end of training. The gotcha there comes in terms of data diversity. Often you see that models will adapt to the new distribution and forget patterns from the old data. It’s hard to disentangle a model learning clarity of detail from concepts, so you might forget key ideas when picking up these details. Nevertheless, maybe there is a way to use small amounts of this data in an RL finetuning setup? In our experience RL post-training changes very little in the underlying model weights — so it might be a “light” enough touch to elicit the desired details.</p>
]]></description><pubDate>Thu, 26 Feb 2026 06:21:23 +0000</pubDate><link>https://news.ycombinator.com/item?id=47162545</link><dc:creator>schopra909</dc:creator><comments>https://news.ycombinator.com/item?id=47162545</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47162545</guid></item><item><title><![CDATA[New comment by schopra909 in "Learnings from 4 months of Image-Video VAE experiments"]]></title><description><![CDATA[
<p>Honestly, it's really hard to shorten the feedback loop in this space. For this, we really did just run one experiment at a time and visually inspect the results everywhere. When you're going 0 -> 1, you're looking for "signs of life" to make sure the basic thing is working. When it comes to testing which (of the infinite) levers to pull, a lot of it comes from intuition (which I know isn't the most fun answer).<p>We spent a week or so just running experiments on the amount of compression we could squeeze out of the VAE without significant degradation in the final results. In hindsight, spending a week on that seems like a waste, since we got the 8x spatial, 4x compression within the first 1-2 days. But in the moment, you're often unsure WHAT will be the key unlock. So when you're in the middle of the storm, you're running a quick Bayesian process in your head, weighing what you might learn from the outcome of an experiment vs. the time/money it would take to run it. And you hope that your intuitions become stronger over time, as you get more repetitions in.<p>More money might help (e.g. parallel experiments, more detailed explorations), but I don't think money is a cure-all. At some point, you get lost in the sauce trying to tie the threads between all the empirical findings you have at your fingertips. Maybe one day AI models could help here by integrating all these results. As it stands, they still struggle to reason about this stuff in the context of other research papers and findings (likely because all the context on arXiv is so noisy; you can't trust any particular finding, and verifying findings is so hard that it's hard to meta-reason about your experiments correctly).</p>
]]></description><pubDate>Thu, 26 Feb 2026 00:13:59 +0000</pubDate><link>https://news.ycombinator.com/item?id=47160034</link><dc:creator>schopra909</dc:creator><comments>https://news.ycombinator.com/item?id=47160034</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47160034</guid></item><item><title><![CDATA[New comment by schopra909 in "Learnings from 4 months of Image-Video VAE experiments"]]></title><description><![CDATA[
<p>Hadn’t seen that before! Seems very much in line with the broader points about regularization. In Table 4 they show faster convergence in 200 epochs when used alongside REPA. I’d be curious to see if it ended up beating REPA by itself with the full 800 epochs of training — or if something about this new latent space leads to plateauing (learns faster but caps out on expressivity). We’ve seen that phenomenon before in other situations (e.g. a UNet learns faster than a DiT because of convolutions, but stops learning beyond a certain point).</p>
]]></description><pubDate>Wed, 25 Feb 2026 23:08:25 +0000</pubDate><link>https://news.ycombinator.com/item?id=47159351</link><dc:creator>schopra909</dc:creator><comments>https://news.ycombinator.com/item?id=47159351</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47159351</guid></item><item><title><![CDATA[New comment by schopra909 in "Learnings from 4 months of Image-Video VAE experiments"]]></title><description><![CDATA[
<p>yep, Apache 2.0! so anyone's welcome to download and hack away</p>
]]></description><pubDate>Wed, 25 Feb 2026 22:22:25 +0000</pubDate><link>https://news.ycombinator.com/item?id=47158897</link><dc:creator>schopra909</dc:creator><comments>https://news.ycombinator.com/item?id=47158897</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47158897</guid></item><item><title><![CDATA[New comment by schopra909 in "Learnings from 4 months of Image-Video VAE experiments"]]></title><description><![CDATA[
<p>Hi HN, I’m one of the two authors of the post and the Linum v2 text-to-video model (<a href="https://news.ycombinator.com/item?id=46721488">https://news.ycombinator.com/item?id=46721488</a>). We're releasing our Image-Video VAE (open weights) and a deep dive on how we built it. Happy to answer questions about the work!</p>
]]></description><pubDate>Tue, 24 Feb 2026 19:00:26 +0000</pubDate><link>https://news.ycombinator.com/item?id=47141121</link><dc:creator>schopra909</dc:creator><comments>https://news.ycombinator.com/item?id=47141121</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47141121</guid></item><item><title><![CDATA[Learnings from 4 months of Image-Video VAE experiments]]></title><description><![CDATA[
<p>Article URL: <a href="https://www.linum.ai/field-notes/vae-reconstruction-vs-generation">https://www.linum.ai/field-notes/vae-reconstruction-vs-generation</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=47141107">https://news.ycombinator.com/item?id=47141107</a></p>
<p>Points: 129</p>
<p># Comments: 16</p>
]]></description><pubDate>Tue, 24 Feb 2026 18:59:31 +0000</pubDate><link>https://www.linum.ai/field-notes/vae-reconstruction-vs-generation</link><dc:creator>schopra909</dc:creator><comments>https://news.ycombinator.com/item?id=47141107</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47141107</guid></item><item><title><![CDATA[New comment by schopra909 in "Show HN: Steerling-8B, a language model that can explain any token it generates"]]></title><description><![CDATA[
<p>This is very cool. Side note: I really dig the JavaScript animations on the causal block diffusion blog post. Made the concept immediately clear.</p>
]]></description><pubDate>Tue, 24 Feb 2026 15:19:31 +0000</pubDate><link>https://news.ycombinator.com/item?id=47138261</link><dc:creator>schopra909</dc:creator><comments>https://news.ycombinator.com/item?id=47138261</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47138261</guid></item><item><title><![CDATA[New comment by schopra909 in "Show HN: Text-to-video model from scratch (2 brothers, 2 years, 2B params)"]]></title><description><![CDATA[
<p>That all being said, you can just delete the T5 from memory after encoding the text to save on memory.<p>The 2B parameters will take up 4 GB of memory, but activations will be a lot more given the size of the context window for video.<p>A 720p, 5-second video is roughly 100K tokens of context.</p>
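<p>Rough sketch of where a ~100K-token figure can come from, assuming 24 fps, 8x spatial / 4x temporal VAE compression, and 2x2 latent patchification (illustrative values on my part, not confirmed specs):</p>
<pre><code># Back-of-the-envelope context length for a 720p, 5 s clip.
# Assumes 24 fps, 8x spatial / 4x temporal VAE compression, and 2x2
# latent patchification -- illustrative assumptions, not confirmed specs.
width, height = 1280, 720
fps, seconds = 24, 5

latent_w = width // 8                   # 8x spatial compression -> 160
latent_h = height // 8                  # -> 90
latent_frames = (fps * seconds) // 4    # 4x temporal compression -> 30

patch = 2                               # assumed 2x2 patchify before the DiT
tokens_per_frame = (latent_w // patch) * (latent_h // patch)  # 80 * 45 = 3600
total_tokens = tokens_per_frame * latent_frames               # ~108,000

print(f"{total_tokens:,} tokens")       # in the ballpark of the ~100K above
</code></pre>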
]]></description><pubDate>Fri, 23 Jan 2026 14:49:31 +0000</pubDate><link>https://news.ycombinator.com/item?id=46733136</link><dc:creator>schopra909</dc:creator><comments>https://news.ycombinator.com/item?id=46733136</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46733136</guid></item><item><title><![CDATA[New comment by schopra909 in "Show HN: Text-to-video model from scratch (2 brothers, 2 years, 2B params)"]]></title><description><![CDATA[
<p>Great idea! We haven’t tried it but def interested to see if that works as well.<p>When we started down this path, T5 was the standard (back in 2024).<p>Likely won’t be the text encoder for subsequent models, given its size (per your point) and age</p>
]]></description><pubDate>Fri, 23 Jan 2026 14:47:08 +0000</pubDate><link>https://news.ycombinator.com/item?id=46733111</link><dc:creator>schopra909</dc:creator><comments>https://news.ycombinator.com/item?id=46733111</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46733111</guid></item><item><title><![CDATA[New comment by schopra909 in "Show HN: Text-to-video model from scratch (2 brothers, 2 years, 2B params)"]]></title><description><![CDATA[
<p>I think YC just released a video on the basics of diffusion, but honestly I don’t have a good end-to-end guide.<p>We’re going to write up going 0->1 on a video model (all the steps) over the coming months. But it likely won’t be a class or anything like that.<p><a href="https://www.linum.ai/field-notes">https://www.linum.ai/field-notes</a><p>We want to share our learnings with folks who are curious about the space, but we don’t have time to make it a full class experience.<p>Hopefully Karpathy does that with his courses in the future!</p>
]]></description><pubDate>Fri, 23 Jan 2026 14:44:55 +0000</pubDate><link>https://news.ycombinator.com/item?id=46733082</link><dc:creator>schopra909</dc:creator><comments>https://news.ycombinator.com/item?id=46733082</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46733082</guid></item><item><title><![CDATA[New comment by schopra909 in "Show HN: Text-to-video model from scratch (2 brothers, 2 years, 2B params)"]]></title><description><![CDATA[
<p><a href="https://www.linum.ai/field-notes">https://www.linum.ai/field-notes</a></p>
]]></description><pubDate>Fri, 23 Jan 2026 14:41:14 +0000</pubDate><link>https://news.ycombinator.com/item?id=46733038</link><dc:creator>schopra909</dc:creator><comments>https://news.ycombinator.com/item?id=46733038</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46733038</guid></item><item><title><![CDATA[New comment by schopra909 in "Show HN: Text-to-video model from scratch (2 brothers, 2 years, 2B params)"]]></title><description><![CDATA[
<p>Not public yet — we’re going to clean it up so it’s readable and release it as blog posts. The first one will be everything you need to know about building a VAE for image and video. It should be out in a few weeks. We’re figuring out the right balance between spending time writing and all the work we have on our plate for the next model.<p>If you’re interested in this stuff, keep an eye on field notes (our blog).</p>
]]></description><pubDate>Fri, 23 Jan 2026 14:40:39 +0000</pubDate><link>https://news.ycombinator.com/item?id=46733030</link><dc:creator>schopra909</dc:creator><comments>https://news.ycombinator.com/item?id=46733030</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46733030</guid></item><item><title><![CDATA[New comment by schopra909 in "Show HN: Text-to-video model from scratch (2 brothers, 2 years, 2B params)"]]></title><description><![CDATA[
<p>The T5 encoder is ~5B parameters, so back-of-the-envelope that's ~10 GB of VRAM (it's in bfloat16). So 360p should take ~15 GB of VRAM (+/- a few GB based on the duration of the video generated).<p>We can update the code over the next day or two to provide the option to delete the T5 once the text encoding is computed (to save on RAM), and then report back the GB consumed for 360p and 720p, 2-5 second videos, on GitHub so there are more accurate numbers.<p>Beyond the 10 GB from the T5, there's just a lot of VRAM taken up by the context window of 720p video (even though the model itself is only 2B parameters).</p>
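<p>For reference, the weights-only arithmetic behind those numbers (activations and attention buffers for the long video context come on top, so actual usage runs higher):</p>
<pre><code># Weights-only VRAM estimate in bfloat16 (2 bytes per parameter).
# Activations for the video context come on top of this.
BYTES_PER_PARAM_BF16 = 2

t5_encoder_params = 5e9   # ~5B-parameter T5 encoder, as noted above
dit_params = 2e9          # 2B-parameter video transformer

t5_gb = t5_encoder_params * BYTES_PER_PARAM_BF16 / 1e9   # ~10 GB
dit_gb = dit_params * BYTES_PER_PARAM_BF16 / 1e9          # ~4 GB

print(f"T5 encoder: ~{t5_gb:.0f} GB weights, DiT: ~{dit_gb:.0f} GB weights")
</code></pre>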
]]></description><pubDate>Fri, 23 Jan 2026 01:15:26 +0000</pubDate><link>https://news.ycombinator.com/item?id=46727196</link><dc:creator>schopra909</dc:creator><comments>https://news.ycombinator.com/item?id=46727196</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46727196</guid></item><item><title><![CDATA[New comment by schopra909 in "Show HN: Text-to-video model from scratch (2 brothers, 2 years, 2B params)"]]></title><description><![CDATA[
<p>Per the RAM comment, you may be able to get it to run locally with two tweaks:<p><a href="https://github.com/Linum-AI/linum-v2/blob/298b1bb9186b5b9ff60331ee44de746734a79075/linum_v2/models/text2video.py#L285" rel="nofollow">https://github.com/Linum-AI/linum-v2/blob/298b1bb9186b5b9ff6...</a><p>1) Free up the T5 as soon as the text is encoded, so you reclaim GPU RAM<p>2) Manual layer offloading: move layers off the GPU once they're done being used, to free up space for the remaining layers + activations</p>
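<p>A minimal PyTorch-style sketch of those two tweaks; text_encoder, prompt_ids, and blocks below are stand-in names for illustration, not the actual attributes in linum_v2 (see the linked text2video.py for those):</p>
<pre><code>import gc
import torch

def encode_prompt_then_free(text_encoder, prompt_ids):
    """1) Encode the prompt, then move the T5 off the GPU to reclaim its VRAM.
    text_encoder / prompt_ids are stand-ins for whatever the pipeline uses."""
    with torch.no_grad():
        text_embeds = text_encoder(prompt_ids).last_hidden_state
    text_encoder.to("cpu")       # weights leave the GPU...
    gc.collect()
    torch.cuda.empty_cache()     # ...and the cached allocations are released
    return text_embeds

def forward_with_offload(blocks, hidden_states, device="cuda"):
    """2) Manual layer offloading: keep transformer blocks on CPU and move each
    onto the GPU only while it runs. Slower, but peak VRAM stays around one
    block's weights plus the activations."""
    for block in blocks:
        block.to(device)
        hidden_states = block(hidden_states)
        block.to("cpu")
        torch.cuda.empty_cache()
    return hidden_states
</code></pre>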
]]></description><pubDate>Thu, 22 Jan 2026 17:45:52 +0000</pubDate><link>https://news.ycombinator.com/item?id=46722643</link><dc:creator>schopra909</dc:creator><comments>https://news.ycombinator.com/item?id=46722643</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46722643</guid></item><item><title><![CDATA[New comment by schopra909 in "Show HN: Text-to-video model from scratch (2 brothers, 2 years, 2B params)"]]></title><description><![CDATA[
<p>Should be fixed now! Thanks again for the heads up</p>
]]></description><pubDate>Thu, 22 Jan 2026 17:18:34 +0000</pubDate><link>https://news.ycombinator.com/item?id=46722149</link><dc:creator>schopra909</dc:creator><comments>https://news.ycombinator.com/item?id=46722149</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46722149</guid></item></channel></rss>