Hacker News: tgtweak

New comment by tgtweak in "Gemma 4 12B: A unified, encoder-free multimodal model"

tgtweak — Thu, 04 Jun 2026 12:19:29 +0000

It's a disruption game - releasing competent open models disrupts smaller labs trying to release or commercialize their own. It's a similar rationale to the one behind the Chinese labs releasing near-frontier open-weight models: the goal is to disrupt and raise the barrier to entry for would-be competitors.

New comment by tgtweak in "Use your Nvidia GPU's VRAM as swap space on Linux"

tgtweak — Wed, 03 Jun 2026 15:27:16 +0000

I think you can definitely improve the throughput/IOPS by using BAR vs treating it like a file store/mount through CUDA, which adds a lot of overhead.

New comment by tgtweak in "Shopify Is Down"

tgtweak — Wed, 03 Jun 2026 14:37:36 +0000

Critically, it was the webhook/sync that was down which really messed with a lot of external systems (nosto, klaviyo, 3PLs...)

New comment by tgtweak in "MAI-Code-1-Flash"

tgtweak — Wed, 03 Jun 2026 14:34:58 +0000

Is anyone using haiku 4.5?

Why not showcase it against something in a similar domain like qwen3.6 or gemma 4?

New comment by tgtweak in "A 10 year old Xeon is all you need (for 26B-A4B MTP Drafters without GPU)"

tgtweak — Mon, 01 Jun 2026 08:34:10 +0000

It may work - depending on your RAM speeds, it might not even be that much slower.

New comment by tgtweak in "Mythos Finds a Curl Vulnerability"

tgtweak — Mon, 11 May 2026 14:33:31 +0000

I feel like, if this were a codebase that hadn't already been run through security analysis tools, there would have been some more significant findings - perhaps they can re-run it on an 18-month-old commit and see how many issues it finds that were subsequently found and fixed?

Anyway, I think the case is that frontier and next-gen models will get increasingly adept at finding vulnerabilities, and that those on the receiving end of those vulnerabilities need to be on top of it.

New comment by tgtweak in "Cloudflare to cut about 20% of its workforce"

tgtweak — Fri, 08 May 2026 14:49:35 +0000

>Cloudflare expects second-quarter revenue of $664 million to $665 million, just under analysts' estimate of $665.3 million

Is this considered below expectations on Wall Street... enough to merit an 18% stock drop?

New comment by tgtweak in "The map that keeps Burning Man honest"

tgtweak — Thu, 07 May 2026 21:52:42 +0000

You can definitely add some telemetry to this that records and analyzes realtime location to "map" the litter, even when using a device like this. The conveyor actually seems very well suited to an external camera that records and analyzes the mess to a degree that should be suitable for the purpose of "recording" litter types and concentrations based on the location, without resorting to manual sweep/dust bins which actually sounds pretty insane at this scale.

New comment by tgtweak in "Motherboard sales 'collapse' amid unprecedented shortages fueled by AI"

tgtweak — Thu, 07 May 2026 21:41:11 +0000

When RAM and an SSD cost more than an entire system used to, it's not surprising to see this.

New comment by tgtweak in "Qwen3.6-27B: Flagship-Level Coding in a 27B Dense Model"

tgtweak — Thu, 23 Apr 2026 03:31:24 +0000

Depends entirely on quantization. Q6_K with max context length (262144) is ~40GB of VRAM.

Q8 with the same context wouldn't fit in 48GB of VRAM, but it did with 128k of context.
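A rough way to reason about numbers like these is quantized weights plus KV cache. The sketch below is only that: the bits-per-weight values are the nominal llama.cpp block sizes for Q6_K and Q8_0, while the layer count, KV-head count and head dimension are hypothetical placeholders (not the real Qwen3.6-27B architecture), and it assumes a plain fp16 KV cache. It is not meant to reproduce the figures above, just to show which terms scale with quantization and which with context.

```python
# Nominal llama.cpp bits per weight (block bytes * 8 / weights per block).
BITS_PER_WEIGHT = {"Q6_K": 210 * 8 / 256, "Q8_0": 34 * 8 / 32}  # 6.5625, 8.5


def weights_gb(n_params: float, bits_per_weight: float) -> float:
    """Size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_elem: int = 2) -> float:
    """KV cache size in GB for standard attention: one K and one V
    vector per layer, per KV head, per token."""
    elems = 2 * n_layers * n_kv_heads * head_dim * context_len
    return elems * bytes_per_elem / 1e9


N_PARAMS = 27e9  # "27B" taken at face value

# Extra VRAM needed for weights alone when going from Q6_K to Q8_0.
quant_delta_gb = (weights_gb(N_PARAMS, BITS_PER_WEIGHT["Q8_0"])
                  - weights_gb(N_PARAMS, BITS_PER_WEIGHT["Q6_K"]))

# HYPOTHETICAL architecture values, for illustration only.
LAYERS, KV_HEADS, HEAD_DIM = 32, 4, 128
kv_full = kv_cache_gb(LAYERS, KV_HEADS, HEAD_DIM, 262144)
kv_half = kv_cache_gb(LAYERS, KV_HEADS, HEAD_DIM, 131072)

print(f"Q8_0 - Q6_K weights: {quant_delta_gb:.2f} GB")
print(f"KV cache at 262144 vs 131072 tokens: {kv_full:.2f} vs {kv_half:.2f} GB")
```

The weight delta does not depend on the placeholder architecture, and the KV cache term is linear in context, which is why dropping context length is the usual way to make a larger quant fit.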

New comment by tgtweak in "Qwen3.6-27B: Flagship-Level Coding in a 27B Dense Model"

tgtweak — Thu, 23 Apr 2026 03:16:34 +0000

I've been using it in a few harnesses (FP8 quant, max context length) and it does seem to get tripped up by tool use, often repeating the same tool call after it has already failed - that's usually not a great sign for long-term context and multi-step reasoning. It is excellent at one-shotting though, and might be most useful as a sub-agent for a stronger frontier coordinator.

New comment by tgtweak in "Windows 9x Subsystem for Linux"

tgtweak — Wed, 22 Apr 2026 16:02:01 +0000

If you run it in qemu, all good.

New comment by tgtweak in "A new spam policy for “back button hijacking”"

tgtweak — Tue, 14 Apr 2026 17:48:30 +0000

Was honestly thinking "yeah nice Google, now do it for Android" since the worst offenders are apps (looking at you, TikTok)

New comment by tgtweak in "All elementary functions from a single binary operator"

tgtweak — Mon, 13 Apr 2026 20:19:10 +0000

There is a huge market for "it's faster" at the cost of efficiency, but I don't think your claim that an EML hardware block would be inherently less efficient than the same workload running on a GPU holds up. If you think it would be, back it up with some numbers.

A 10-stage EML pipeline would be about the size of an AVX-512 instruction block on a modern CPU, in the realm of ~0.1 mm² on a 5nm process node (collectively including the FMA units behind it), in its entirety about 1% of the CPU die. None of this suggests that even a ~500-wide 10-stage EML pipeline would consume anywhere near the power of a modern datacenter GPU (which wastes a lot of its energy moving things from memory to ALU to shader core...).

Not sure if you're arguing from a hypothetical position or practical one but you seem to be narrowing your argument to "well for simple math it's less efficient" but that's not the argument being made at all.

New comment by tgtweak in "All elementary functions from a single binary operator"

tgtweak — Mon, 13 Apr 2026 17:31:20 +0000

For basic arithmetic, this is not required, nor would it be faster, and it is not likely to be advantageous for bulk static transcendental functions. Where this becomes interesting is when combining them OR when chaining them, where today they must come back out to the main process for reconfiguration and then be re-issued.

In practical terms, take the Jacobian (heavily used in weather and combustion simulation): the transcendental calls, mostly exp(-E_a/(R×T)), are the actual clock-cycle bottleneck. The GPU's SFU computes one exp2 at a time per SM. The ALU then has to convert it (exp(x) = exp2(x × log2(e))), multiply by the pre-exponential factor, and accumulate partial derivatives. It's a long serial chain for each reaction rate.

The core of this is the Arrhenius rate, (A × T^n × exp(-E_a/(R×T))), which involves an exponentiation, a division, a multiplication, and an exponential. On a GPU, that's multiple SFU calls chained with ALU ops. In an EML tree, the whole expression compiles to a single tree that flows through the pipeline in one pass.
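To make the serial chain concrete, here is a minimal sketch of the Arrhenius rate computed two ways: directly, and as the exp2/log2-based sequence a GPU-style special function unit would force. The input values are hypothetical, chosen only so the numbers are in a plausible range; the function names are made up for illustration.

```python
import math

R = 8.314462618             # molar gas constant, J/(mol*K)
LOG2_E = math.log2(math.e)  # conversion factor for exp(x) = exp2(x * log2(e))


def arrhenius_direct(A: float, n: float, E_a: float, T: float) -> float:
    """k = A * T^n * exp(-E_a / (R * T)), written as one expression."""
    return A * T ** n * math.exp(-E_a / (R * T))


def arrhenius_exp2_chain(A: float, n: float, E_a: float, T: float) -> float:
    """Same rate, decomposed into the dependent steps of an exp2/log2 unit."""
    x = -E_a / (R * T)                 # step 1: division (ALU)
    e_term = 2.0 ** (x * LOG2_E)       # step 2: scale (ALU), then exp2 (SFU)
    t_pow = 2.0 ** (n * math.log2(T))  # step 3: log2 (SFU), scale, exp2 (SFU)
    return A * t_pow * e_term          # step 4: two multiplies (ALU)


# HYPOTHETICAL reaction parameters, for illustration only.
A, n, E_a, T = 1.0e10, 0.5, 75_000.0, 1500.0
print(arrhenius_direct(A, n, E_a, T))
print(arrhenius_exp2_chain(A, n, E_a, T))
```

Each step in the second function depends on the previous one, which is the chain that a single compiled EML tree would replace with one pass.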

GPU (PreJacGPU) is currently the state of the art for speed on these simulations - a moderate-width 8-depth EML machine could process a very complex Jacobian as fast as the GPU can evaluate one exp(). Even on a sub-optimal 250 MHz FPGA, an entire 50x50 Jacobian would be about 3.5 microseconds vs 50 microseconds PER Jacobian on an A100.

If you put that same logic path into an ASIC, you'd be at about 20x the FPGA's speed - in the nanoseconds per round. And it's not like you're building one function into an ASIC; it's general purpose. You just feed it a compiled tree configuration and run your data through it.
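Spelling out the arithmetic behind those estimates, using only the figures quoted in this comment (250 MHz, 3.5 µs, 50 µs, 20x) and nothing measured:

```python
FPGA_CLOCK_HZ = 250e6   # FPGA clock quoted above
FPGA_TIME_S = 3.5e-6    # estimated time per 50x50 Jacobian on the FPGA
A100_TIME_S = 50e-6     # quoted time per Jacobian on an A100
ASIC_SPEEDUP = 20       # assumed ASIC-over-FPGA factor

fpga_cycles = FPGA_TIME_S * FPGA_CLOCK_HZ   # clock cycles per Jacobian
fpga_vs_gpu = A100_TIME_S / FPGA_TIME_S     # FPGA speedup over the A100
asic_time_s = FPGA_TIME_S / ASIC_SPEEDUP    # time per Jacobian on the ASIC
asic_vs_gpu = A100_TIME_S / asic_time_s     # ASIC speedup over the A100

print(f"FPGA: {fpga_cycles:.0f} cycles per Jacobian, {fpga_vs_gpu:.1f}x the A100")
print(f"ASIC: {asic_time_s * 1e9:.0f} ns per Jacobian, {asic_vs_gpu:.0f}x the A100")
```

So the FPGA figure corresponds to 875 cycles per Jacobian, and the 20x ASIC estimate lands at 175 ns, which is where "nanoseconds per round" comes from.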

For anything like linear algebra math, which is also used here, you'd delegate that to the dedicated math functions on the processor - it wouldn't make sense to do those in this.

New comment by tgtweak in "All elementary functions from a single binary operator"

tgtweak — Mon, 13 Apr 2026 15:58:04 +0000

I actually don't think this is true -

Traditional processors, even highly dedicated ones like SFUs in GPUs, still need to be substantially preconfigured in order to switch between sin/cos/exp2/log2 function calls, whereas a silicon implementation of an 8-layer EML machine could do that by passing a single config byte along with the inputs. If you had a 512-wide pipeline of EML logic blocks in modern silicon (say 5nm), you could get around 1 trillion elementary function evaluations per second on a 2.5 GHz chip. Compare this with a 96-core Zen 5 server CPU with AVX-512, which can do about 50-100 billion scalar-equivalent evaluations per second across all cores, and only for one specific unchanging function.

Take the fastest current math processors, the SFUs on a modern GPU: they can calculate sin OR cos OR exp2 OR log2 in 1 cycle per shader unit... but that is ONLY for those elementary functions and ONLY if they don't change - changing the function being called incurs a huge cycle hit, and chaining the calculations also incurs latency hits. An EML coprocessor could do arcsinh(x² + ln(y)) in the same hardware block, with the same latency as a modern CPU can do a single FMA instruction.
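The "around 1 trillion" figure is just lanes times clock, assuming a fully pipelined design that retires one result per lane per cycle (an assumption, not something measured):

```python
LANES = 512          # width of the hypothetical EML pipeline
CLOCK_HZ = 2.5e9     # 2.5 GHz

# One completed evaluation per lane per cycle once the pipeline is full.
eml_evals_per_s = LANES * CLOCK_HZ

# CPU range quoted above, in scalar-equivalent evaluations per second.
CPU_LOW, CPU_HIGH = 50e9, 100e9
ratio_best = eml_evals_per_s / CPU_LOW    # against the low CPU estimate
ratio_worst = eml_evals_per_s / CPU_HIGH  # against the high CPU estimate

print(f"{eml_evals_per_s:.2e} evals/s, {ratio_worst:.1f}x to {ratio_best:.1f}x the CPU range")
```

That works out to 1.28 trillion evaluations per second, or roughly 13x to 26x the CPU range quoted above.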

New comment by tgtweak in "All elementary functions from a single binary operator"

tgtweak — Mon, 13 Apr 2026 14:54:20 +0000

You could also make an analog EML circuit in theory, using electrical primitives that have been around since the 60s. You could build a simple EML evaluator on a breadboard. Things like trig functions would be hard to reproduce, but you could technically evaluate output in electrical realtime (the time it takes the electrical signal to travel through these 8-10 analog amplifier stages).

New comment by tgtweak in "All elementary functions from a single binary operator"

tgtweak — Mon, 13 Apr 2026 14:51:27 +0000

Yes actually, it is very regular, which usually lends itself to silicon implementations - the paper even talks about this briefly.

I think the bigger question is whether it will be more energy-optimal or silicon density-optimal than math libraries that are currently baked into these processors (FPUs).

There are also some edge cases, like exp(exp(x)) and infinities, that seem to result in something akin to "division by zero", where you need more than standard floating-point representations to compute - but these edge cases seem like compiler workarounds vs silicon issues.
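A quick illustration of the exp(exp(x)) case in ordinary float64, as a sketch of why nested exponentials run out of range so early (this only shows the floating-point behaviour, not how the paper handles it):

```python
import math
import sys

# Largest argument math.exp accepts before overflowing float64 (~709.78).
MAX_EXP_ARG = math.log(sys.float_info.max)

# exp(exp(x)) overflows once exp(x) exceeds that, i.e. for x above ~6.565.
NESTED_THRESHOLD = math.log(MAX_EXP_ARG)


def nested_exp(x: float) -> float:
    """exp(exp(x)) in float64, returning inf where math.exp overflows."""
    try:
        return math.exp(math.exp(x))
    except OverflowError:
        return math.inf


print(NESTED_THRESHOLD)
print(nested_exp(6.0))  # still finite
print(nested_exp(7.0))  # already out of float64 range
```

Any intermediate node in a deep tree that hits this range poisons everything downstream, so the tree either needs a wider number format or has to be rewritten to avoid the nesting.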

New comment by tgtweak in "All elementary functions from a single binary operator"

tgtweak — Mon, 13 Apr 2026 14:45:17 +0000

This paper seems to suggest that a chip with 10 pipeline stages of EML units could evaluate any elementary function (table 4) in a single pass.
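As a small sketch of what "stages" means here: ASSUMING the operator is eml(x, y) = exp(x) - ln(y) with the constant 1 as the only literal (my reading of the paper, so check it against the paper's definition), each elementary function becomes a binary tree, and the tree depth is the number of pipeline stages it needs. The tree encoding and helper names below are made up for illustration.

```python
import math


def eml(x: float, y: float) -> float:
    # ASSUMPTION: the single binary operator is exp(x) - ln(y).
    return math.exp(x) - math.log(y)


# A tree is the constant 1.0, the variable "x", or a pair (left, right)
# meaning eml(left, right).
EXP_TREE = ("x", 1.0)               # exp(x) - ln(1) = exp(x)
LN_TREE = (1.0, ((1.0, "x"), 1.0))  # e - ln(exp(e - ln(x))) = ln(x)


def evaluate(tree, x: float) -> float:
    """Evaluate an EML tree at x by recursing from the leaves up."""
    if tree == "x":
        return x
    if isinstance(tree, tuple):
        left, right = tree
        return eml(evaluate(left, x), evaluate(right, x))
    return float(tree)


def depth(tree) -> int:
    """Number of eml levels, i.e. pipeline stages the tree would occupy."""
    if isinstance(tree, tuple):
        return 1 + max(depth(tree[0]), depth(tree[1]))
    return 0


print(evaluate(EXP_TREE, 0.5), depth(EXP_TREE))
print(evaluate(LN_TREE, 2.0), depth(LN_TREE))
```

Under that assumption exp needs one stage and ln needs three, so a fixed-depth pipeline only has to be as deep as the deepest tree it is expected to run.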

I'm curious how this would compare to the dedicated SSE or XMX instructions currently inside most processors' instruction sets.

Lastly, you could also create a 5-depth or 6-depth EML tree in hardware (FPGA most likely) and use it in lieu of the Rust implementation to discover weight-optimal EML formulas for input functions much quicker. Those could then feed into a "compiler" that would allow them to run on a similar-scale interpreter on the same silicon.

In simple terms: you can imagine an EML co-processor sitting alongside a CPU's standard math coprocessor(s): XMX, SSE, AMX would do the multiplication/tile math they're optimized for, and would then call the EML coprocessor to do exp, sin, log calls, which are processed by reconfiguring the EML trees internally to process those at single-cycle speed instead of relaying them back to the main CPU to do that math in generalized instructions - likely something that takes many cycles to achieve.

New comment by tgtweak in "All elementary functions from a single binary operator"

tgtweak — Mon, 13 Apr 2026 14:27:18 +0000

This could have some interesting hardware implications as well - it suggests that a large dedicated silicon instruction set could accelerate any mathematical algorithm provided it can be mapped to this primitive. It also suggests a compiler/translation layer should be possible as well as some novel visualization methods for functions and methods.