Hacker News: easygenes

New comment by easygenes in "Uber's $1,500/month AI limit is a useful signal for AI tool pricing"

easygenes — Wed, 03 Jun 2026 23:58:48 +0000

If I were paying API rates this year, I would have already burned through $20k in tokens. Looking forward to the costs of this level of capability coming down.

New comment by easygenes in "Gemma 4 12B: A unified, encoder-free multimodal model"

easygenes — Wed, 03 Jun 2026 23:45:10 +0000

I have now also tried it on this scatter plot: https://3215535692-files.gitbook.io/~/files/v0/b/gitbook-x-p...

Similarly, the 26B A4B Gemma 4 and the 35B A3B Qwen 3.6 identify it clearly and give me the title and a fairly accurate trend analysis, while this 12B spits out gobbledygook about it having something to do with hard-drive capacity. It's like it can barely see: it gets the very broad strokes (it knows it's looking at some kind of chart) but can't identify any details clearly.

New comment by easygenes in "Gemma 4 12B: A unified, encoder-free multimodal model"

easygenes — Wed, 03 Jun 2026 23:36:05 +0000

They haven't made one for this new model, but Unsloth has a comprehensive quant KLD map of Gemma 4 26B A4B here: https://3215535692-files.gitbook.io/~/files/v0/b/gitbook-x-p...

New comment by easygenes in "Gemma 4 12B: A unified, encoder-free multimodal model"

easygenes — Wed, 03 Jun 2026 23:24:04 +0000

I want to like the vision capabilities of the model. However, when I gave it an image that Gemma 26B A4B and Qwen 3.6 35B A3B have no problem describing correctly and in detail, including identifying the Taj Mahal in the background, it utterly failed. Its sense of the image was that it was a "distorted wide panorama", and even when I asked directly if it was the Taj Mahal it said no. The reference models saw it correctly as a normal square image taken with a fairly rectilinear lens (iPhone main camera).

New comment by easygenes in "MAI-Code-1-Flash"

easygenes — Wed, 03 Jun 2026 07:53:37 +0000

Have you run it through DeepSWE? I understand that's probably a high ask for this class of model, but would be interesting to see regardless.

Even if it can't fully pass much, there are so many tests against most of the scenarios that you can get a fairly rich report beyond the pass@1 stat. See e.g. this DeepSWE report against the Minimax M3 model: https://entrpi.github.io/misc/deep-swe-minimax-m3/

New comment by easygenes in "MAI-Code-1-Flash"

easygenes — Wed, 03 Jun 2026 03:57:07 +0000

While I agree directionally, I'll caveat that "cost per token" != "cost per task". In the case of Qwen3.6 it tends to think 1.6x more than Haiku, so the cost of Haiku on the same tasks tends to only be about double. More detail from comparing their Artificial Analysis metrics:

  Qwen3.6-35B-A3B   vs   Claude Haiku 4.5
    reasoning mode · AA Intelligence Index v4.0
  
  46.0 ┤   ↖ better — cheaper · smarter · faster
       │
       │
  44.0 ┤     ╭─────╮
       │     │  ●  │ Qwen3.6-35B-A3B
       │     ╰─────╯
  42.0 ┤
       │
       │
  40.0 ┤
       │
       │
  38.0 ┤                                       ╭───╮
       │                      Claude Haiku 4.5 │ ○ │
       │                                       ╰───╯
  36.0 ┤
       └┬─────────┬─────────┬─────────┬─────────┬────────┬
        $200    $300      $400      $500      $600    $700
  
    x → cost to run the index (USD)        lower is better
    y → AA intelligence index              higher is better
  
    bubble area = output speed (tokens / sec)
          ╭─────╮                  ╭───╮
          │  ●  │ Qwen ~196 t/s    │ ○ │ Haiku ~93 t/s
          ╰─────╯                  ╰───╯
  
    ┌─────────────────────┬──────────┬──────────┬───────────┐
    │ model               │ AA index │ run cost │ out speed │
    ├─────────────────────┼──────────┼──────────┼───────────┤
    │ Qwen3.6-35B-A3B    ●│   43.5   │   $280   │  196 t/s  │
    │ Claude Haiku 4.5   ○│   37.1   │   $620   │   93 t/s  │
    └─────────────────────┴──────────┴──────────┴───────────┘


    COST PER TOKEN   ≠   COST PER TASK  
    output tokens per index run:
       Haiku 4.5    87.3M   (79.3M reasoning + 8.0M answer)
       Qwen3.6     143.2M   (131.7M reasoning + 11.5M answer)
       → Qwen emits 1.64× more output
  
    ── output speed (tokens / sec) ──────────  raw rate · higher = faster
       Qwen3.6     100%   ~196 t/s
       Haiku 4.5   ~47%   ~93 t/s
                                                  → Qwen ~2.1× faster per token
  
          ╎   1.64× more tokens  <  2.1× faster rate
          ▼
  
    ── solution speed (per finished answer) ──  higher = faster
       Qwen3.6     100%
       Haiku 4.5   ~78%
                                                  → Qwen ~1.3× FASTER to a solution
  
    SCORECARD
                            intelligence    cost / task     speed to solution
     Qwen3.6-35B-A3B        43.5            $280            ~1.3× faster 
     Claude Haiku 4.5       37.1            $620            (slower)
  
     → Qwen wins all three. The reasoning blow-up (1.64×) is smaller than
       the raw-speed edge (2.1×), so Qwen stays ahead per task.
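
The arithmetic behind the scorecard can be reproduced in a few lines. This is a minimal sketch that uses only the figures quoted above; the `ModelRun` container and its field names are hypothetical, made up for illustration, and nothing here is re-fetched from Artificial Analysis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelRun:
    """Figures quoted in the comment above (hypothetical container)."""
    name: str
    index_score: float      # AA Intelligence Index
    run_cost_usd: float     # cost to run the whole index
    output_tokens_m: float  # output tokens per index run, in millions
    tokens_per_sec: float   # raw output speed


qwen = ModelRun("Qwen3.6-35B-A3B", 43.5, 280.0, 143.2, 196.0)
haiku = ModelRun("Claude Haiku 4.5", 37.1, 620.0, 87.3, 93.0)


def relative_time_to_solution(a: ModelRun, b: ModelRun) -> float:
    """Return how many times faster `a` finishes the same workload than `b`.

    Time for a run is tokens emitted divided by tokens per second,
    so the speedup is (b's time) / (a's time).
    """
    time_a = a.output_tokens_m / a.tokens_per_sec
    time_b = b.output_tokens_m / b.tokens_per_sec
    return time_b / time_a


token_ratio = qwen.output_tokens_m / haiku.output_tokens_m
speed_ratio = qwen.tokens_per_sec / haiku.tokens_per_sec
cost_ratio = haiku.run_cost_usd / qwen.run_cost_usd
solution_speedup = relative_time_to_solution(qwen, haiku)

print(f"Qwen emits {token_ratio:.2f}x more output tokens")
print(f"Qwen is {speed_ratio:.2f}x faster per token")
print(f"Haiku costs {cost_ratio:.2f}x more for the same index run")
print(f"Qwen is {solution_speedup:.2f}x faster to a finished answer")
```

The per-task speedup is just the raw speed ratio divided by the token ratio, which is why the extra reasoning tokens shrink the 2.1x edge to roughly 1.3x without erasing it.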

New comment by easygenes in "Nvidia RTX Spark"

easygenes — Tue, 02 Jun 2026 06:31:52 +0000

Speaking as someone who has had a DGX Spark all year and has been actively developing at the driver and kernel level for it and other ARM64 Linux devices for the last couple of years: it's not bad now, and it certainly doesn't have any issues that I wouldn't expect to be fully fixed with the second-gen motherboards going into these. The main hardware issues are not with the core SoC. They're with replaceable edge peripherals like the PD PMIC.

New comment by easygenes in "Nvidia RTX Spark"

easygenes — Tue, 02 Jun 2026 06:25:20 +0000

Looks like the RTX Spark desktop is the DGX Spark desktop, minus the expensive 200GbE ConnectX NIC. However, memory and NAND prices have jumped since the DGX Spark was released, so it will likely retail for the same amount as the DGX Spark did at release (the DGX Spark's own price has since gone up significantly).

New comment by easygenes in "DeepSeek reasonix, DeepSeek native coding agent with high caching and low cost"

easygenes — Mon, 25 May 2026 06:40:06 +0000

Claude Opus 4.7 defaults to exactly this design language for a lot of "just make me a rich html presentation page" requests without further specification.

New comment by easygenes in "Don't just paste the AI at me"

easygenes — Sat, 23 May 2026 01:03:27 +0000

While I endorse the message of TFA (though I do find the framing a bit on the overly blunt side), I believe it's unfair to reduce it to "losing the person". The person is still willing to engage with you and still had to use their human words to prompt the AI. The latent space they exposed within the model is still uniquely the result of their words and effort.

We're just missing the establishment of a decorum of, "even if you do feel like you need to prompt the AI before responding, and even if you like the response, you still need to paraphrase and synthesize to avoid coming off rude and inhuman."

New comment by easygenes in "Gemini 3.5 Flash"

easygenes — Thu, 21 May 2026 00:19:26 +0000

Their methods are only calibrated on open models (of course) and they admit very broad confidence bounds. You can also just see from comparing their estimates of the same models at different reasoning levels that there are major confounders to this. I would err on the absolute lowest side of their estimates for frontier models (e.g. 3T for GPT-5.5, 1.5-2T for Opus 4.5+).

New comment by easygenes in "Gemini 3.5 Flash"

easygenes — Wed, 20 May 2026 05:28:48 +0000

This is the reality of the premiums available from being in the lead by ~8 months on model building technicals.

New comment by easygenes in "Gemini 3.5 Flash"

easygenes — Wed, 20 May 2026 03:30:14 +0000

We know from NVIDIA's public Vera Rubin inference engine marketing materials that the frontier lab models are ~1-2T total.

Mythos is an exception that's larger.

New comment by easygenes in "Gemini 3.5 Flash"

easygenes — Wed, 20 May 2026 02:11:57 +0000

For those who would like to know the total and active parameter count of this model: even though Google doesn't disclose the model technicals, we can infer them within relatively tight margins based on what we do know.

We know they serve the model on TPU 8i, for which we have plenty of hard specs (so we know the key constraints: total memory, bandwidth, and compute FLOPs). We can also set a ceiling on the compute complexity and memory demand of the model, since we can assume they will be at least as efficient as what is disclosed in the DeepSeek V4 Technical Report.

We can also assume that the model was explicitly built to run efficiently in a RadixAttention style batched serving scenario on a single TPU 8i (so no tensor parallelism, etc. to avoid unnecessary overheads... Google explicitly designed the 8th-generation inference architecture to eliminate the need for tensor sharding on mid-sized models).

We know Google intends to serve this model at a floor speed of around 280 tok/s too.

Putting all these pieces together, we can confidently say this model is ~250-300B total, and 10-16B active parameters. Likely mostly FP4 with FP8 where it matters most.

Visual:

  ┌────────────────────────────────────────────────────────┐
  │                   TPU 8i VRAM (288 GB)                 │
  ├───────────────────────────┬────────────────────────────┤
  │   Static Model Weights    │  Dynamic Allocations &     │
  │   (250B - 300B @ Mixed    │  Compressed KV Caches      │
  │   FP4/FP8)                │  (RadixAttention / SRAM)   │
  │   ~110 GB - 150 GB        │  ~138 GB - 178 GB          │
  └───────────────────────────┴────────────────────────────┘

I do model serving optimization work. This is napkin math.
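
For anyone who wants to poke at the napkin math, here is a rough sketch of the memory split in the diagram. The 288 GB capacity, the 250-300B total and 10-16B active parameter ranges, and the ~280 tok/s floor come from the text above. Everything else is an assumption for illustration: the function names are hypothetical, weights are treated as pure FP4 (0.5 bytes per parameter) rather than the mixed FP4/FP8 I actually expect, and the bandwidth figure is for a single unbatched stream reading only the active weights, ignoring KV cache reads and any amortization across a batch.

```python
def weight_footprint_gb(total_params_b: float, bytes_per_param: float) -> float:
    """Static weight memory in GB for `total_params_b` billion parameters."""
    return total_params_b * bytes_per_param


def kv_budget_gb(hbm_gb: float, total_params_b: float, bytes_per_param: float) -> float:
    """Memory left over for KV caches and other dynamic allocations, in GB."""
    return hbm_gb - weight_footprint_gb(total_params_b, bytes_per_param)


def single_stream_weight_bandwidth_tbs(
    active_params_b: float, bytes_per_param: float, tokens_per_sec: float
) -> float:
    """TB/s of weight reads needed to decode one unbatched stream.

    Each decoded token has to read every active parameter once, so the
    requirement is active bytes per token times tokens per second.
    """
    return active_params_b * bytes_per_param * tokens_per_sec / 1000.0


HBM_GB = 288.0          # from the diagram above
FP4_BYTES = 0.5         # assumption: pure FP4 weights
FLOOR_TOK_PER_S = 280.0  # serving speed floor quoted above

for total_b in (250.0, 300.0):
    weights = weight_footprint_gb(total_b, FP4_BYTES)
    left = kv_budget_gb(HBM_GB, total_b, FP4_BYTES)
    print(f"{total_b:.0f}B total: {weights:.0f} GB weights, {left:.0f} GB left for KV")

for active_b in (10.0, 16.0):
    bw = single_stream_weight_bandwidth_tbs(active_b, FP4_BYTES, FLOOR_TOK_PER_S)
    print(f"{active_b:.0f}B active: {bw:.2f} TB/s of weight reads per stream")
```

Under these assumptions a 300B model takes 150 GB and leaves 138 GB, which is the upper end of the weights range in the diagram. Changing `FP4_BYTES` or the capacity is the quickest way to see how sensitive the split is.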

Edit: There's one factor I under-rated in my initial estimate... TurboQuant. This trades extra compute for lower KV memory use. It's plausible that with TurboQuant at a quality-neutral setting they've gotten the model up to 400B with similar economics. This is a variable affecting concurrency, and the way they decided total model size was likely based on what they see for the average user's average KV cache depth in real-world usage.

New comment by easygenes in "Ti-84 Evo"

easygenes — Sat, 02 May 2026 01:37:40 +0000

This has me pining for a future professional-class CAS 3D graphing calculator.

I'm thinking something that could be a major upgrade in spirit to the long-in-the-tooth (released a decade ago) Casio FX-CG500.

It could use the soon-to-be-released ARM C-1 Nano and Pro cores in an SoC with stacked 2GB LPDDR4, USB-C charging to a large battery, a high-res transflective LCD...

Mockup "AxiomPad Pro X1": https://enia.cc/out/axiompad-cas-mock.png

New comment by easygenes in "L123: A Lotus 1-2-3–style terminal spreadsheet with modern Excel compatibility"

easygenes — Mon, 27 Apr 2026 21:09:36 +0000

Very early in my career I made friends with the business’s sole Lotus Notes administrator, "the email server guy." He was pretty proud of what it could do, and I sometimes get nostalgic for the admin UI.

New comment by easygenes in "Zindex – Diagram Infrastructure for Agents"

easygenes — Tue, 21 Apr 2026 23:28:23 +0000

You could have it propose a spec or review a proposed spec to also get diagrams in a similar manner.

New comment by easygenes in "Zindex – Diagram Infrastructure for Agents"

easygenes — Tue, 21 Apr 2026 23:10:37 +0000

Claude tends to default to and do best with first making ASCII diagrams in markdown files, which you can then ask it to translate into Mermaid if appropriate.

The rough shape of prompt you want is something like: "Please write a comprehensive report on _____ to work with _____. Include a holistic report on architecture and meaning and purpose of all involved systems. Describe the why and how of the changes in depth and include a full glossary of terms and systems. Write as a new .md in docs when you are sure there are no major gaps in your understanding. Include a report on the plan to _____."

That gets it to dig through all the relevant code and make relevant architectural diagrams. Guide it more or less towards specifics as appropriate. This has worked well since Opus 4.5.

New comment by easygenes in "Zindex – Diagram Infrastructure for Agents"

easygenes — Tue, 21 Apr 2026 23:00:38 +0000

Yeah, agree. This is the sort of thing you release to build brand awareness and either offer a hosted option as a bonus or integrate into a larger stack. It is not the product. Someone will just make a better OSS option if they don't do it themselves.

New comment by easygenes in "Qwen3.6-Max-Preview: Smarter, Sharper, Still Evolving"

easygenes — Tue, 21 Apr 2026 04:06:07 +0000

Unless you're looking at something like a pass@100 benchmark, the benchmarks are heavily confounded by how likely the model is to hit a "golden path" retrieval within its capabilities on any single attempt. This is on top of uncertainties like how well your task within a domain maps to the relevant test sets, as well as factors like context fullness and context complexity (a heavy list of relevant complex instructions can weigh on capabilities in different ways than e.g. having a history where there are prior unrelated tasks still in context).

The best tests are your own custom, personal-task-relevant standardized tests (ones the best models can't saturate, so aim for a pass rate below 70% in the best case).

All this is to say that most people are not doing the latter and their vibes are heavily confounded to the point of being mostly meaningless.
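
To make the pass@1 versus pass@100 point concrete, here is a minimal sketch of the commonly used unbiased pass@k estimator: given n sampled attempts at a task of which c passed, pass@k = 1 - C(n-c, k) / C(n, k). The function name and the sample counts below are made up for illustration and are not from any real benchmark run.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k attempts passes.

    n: total attempts sampled for the task
    c: how many of those attempts passed
    k: attempt budget being scored (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    if n - c < k:
        # Fewer than k failures exist, so any k attempts include a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical task: 100 attempts sampled, only 5 of them passed.
n_attempts, n_passed = 100, 5
for k in (1, 10, 100):
    print(f"pass@{k}: {pass_at_k(n_attempts, n_passed, k):.3f}")
```

A task the model only solves occasionally looks like a failure at pass@1 and a near-certainty at a large k, which is the gap a single-attempt vibe check can't see.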