Hacker News: AMavorParker

New comment by AMavorParker in "PopuLoRA: Co-Evolving LLM Populations for Reasoning Self-Play"

AMavorParker — Wed, 20 May 2026 23:33:34 +0000

The teachers never attempt to solve their own problems; only the students solve problems.

Regarding the TrueSkill of the teachers: the self-play settings in this paper are zero-sum and competitive, so the skills of the two populations cannot both increase together. The objectives are adversarial: teachers try to generate difficult tasks, while students learn to solve them, which makes those tasks easy.
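To make the zero-sum point concrete, here is a minimal sketch. It uses a plain Elo update as a stand-in for TrueSkill (this is a swapped-in technique, not what the paper uses), and the function names, starting ratings, and match outcomes are all hypothetical. The only thing it is meant to show is that whatever rating one side gains in a match, the other side loses.

```python
def expected_score(r_a, r_b):
    # Standard Elo expected score of player A against player B.
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def zero_sum_update(teacher, student, student_solved, k=32.0):
    # The student "wins" a match by solving the teacher's problem.
    # The same delta is added to one side and subtracted from the other,
    # so the sum of the two ratings never changes.
    s_student = 1.0 if student_solved else 0.0
    delta = k * (s_student - expected_score(student, teacher))
    return teacher - delta, student + delta


# Hypothetical match history: the student solves 3 of 4 problems.
teacher, student = 1500.0, 1500.0
for solved in [True, True, False, True]:
    teacher, student = zero_sum_update(teacher, student, solved)

print(teacher, student, teacher + student)
```

In this toy run the student's rating rises and the teacher's falls by the same amount, which is the sense in which both populations cannot improve their ratings at once.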

New comment by AMavorParker in "PopuLoRA: Co-Evolving LLM Populations for Reasoning Self-Play"

AMavorParker — Wed, 20 May 2026 23:30:32 +0000

Thanks for your interest!

Not necessarily. While the held-out downstream evals showed that 1T-1S setups outperformed larger populations like 4T-4S or 8T-8S on some specific benchmarks, that does not invalidate the motivation for population-based training.

The main motivation for larger populations is more diversity in both problems and solutions, which can encourage specialization and broader task coverage. Even if that diversity does not improve on some of the particular benchmarks we used, it is still arguably a desirable property.

Figure 9 in the paper, for example, shows that students trained with larger populations are exposed to a much wider range of tasks than the baseline.

Also, averaged across all the benchmarks we measure, 4v4 (the 4T-4S setup) performs best.

The “creating new population members in seconds” comment refers to operating in LoRA space. The mutation and crossover operators are applied to lightweight LoRA adapters rather than full model weights, making the process very fast and memory efficient.
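As a rough illustration of why operating in LoRA space is cheap, here is a minimal sketch of a mutation and a crossover acting only on the small adapter factors. The specific operators (Gaussian noise, element-wise interpolation), the function names, and the tiny dimensions are assumptions made for this example; they are not the operators from the paper.

```python
import random


def make_adapter(d_out, d_in, rank, rng):
    # A LoRA adapter stores two small factors: A (rank x d_in) and
    # B (d_out x rank). Both are random here so the demo is non-trivial.
    A = [[rng.gauss(0.0, 0.02) for _ in range(d_in)] for _ in range(rank)]
    B = [[rng.gauss(0.0, 0.02) for _ in range(rank)] for _ in range(d_out)]
    return {"A": A, "B": B}


def mutate(adapter, sigma, rng):
    # Hypothetical mutation: add Gaussian noise to every adapter entry.
    return {k: [[w + rng.gauss(0.0, sigma) for w in row] for row in m]
            for k, m in adapter.items()}


def crossover(p1, p2, alpha):
    # Hypothetical crossover: interpolate the factors element-wise.
    # Note this mixes A and B separately, which is not the same as
    # interpolating the products B @ A, but it keeps the child at the same rank.
    return {k: [[alpha * a + (1.0 - alpha) * b for a, b in zip(r1, r2)]
                for r1, r2 in zip(p1[k], p2[k])]
            for k in p1}


def shape(adapter):
    return {k: (len(m), len(m[0])) for k, m in adapter.items()}


rng = random.Random(0)
parent1 = make_adapter(d_out=8, d_in=8, rank=2, rng=rng)
parent2 = make_adapter(d_out=8, d_in=8, rank=2, rng=rng)
child = mutate(crossover(parent1, parent2, alpha=0.5), sigma=0.01, rng=rng)
print(shape(child))
```

The frozen base model is never touched: each new member costs only a pass over the adapter entries, which is a small fraction of the full weight count.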

New comment by AMavorParker in "PopuLoRA: Co-Evolving LLM Populations for Reasoning Self-Play"

AMavorParker — Wed, 20 May 2026 21:11:55 +0000

We introduce PopuLoRA, a population-based asymmetric self-play framework for reinforcement learning with verifiable rewards (RLVR) post-training of LLMs. Teachers and students are specialised LoRA adapters on a shared frozen base: teachers propose problems, matched students solve them under a programmatic verifier, and cross-evaluation between sub-populations replaces the self-calibration that limits single-agent self-play. A family of LoRA weight-space evolution operators (mutations and crossovers that produce same-rank population members in seconds) serves as the replacement step of a population-based training loop at 7B scale. We instantiate PopuLoRA on top of Absolute Zero Reasoner and compare it against a per-adapter compute-matched single-agent baseline. Where the single agent self-calibrates to generating easy problems it can reliably solve, the population enters a co-evolutionary arms race: teachers produce increasingly complex problems, student solve rates oscillate, and problem-space coverage keeps expanding throughout training. Despite lower training-time reward, the population mean outperforms the baseline on three code benchmarks (HumanEval+, MBPP+, LiveCodeBench) and seven math benchmarks (AIME 24/25, AMC 23, MATH-500, Minerva, GSM8K, OlympiadBench), and even the weakest member of the population beats the baseline on aggregate.
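The cross-evaluation idea described above can be sketched with a toy example. Everything here is hypothetical: the "teachers" and "students" are plain Python functions rather than LoRA adapters, the task is integer addition, and the verifier simply recomputes the answer. The sketch only shows the shape of the loop, where every teacher's problems are scored against every student to give a solve-rate table.

```python
def verifier(problem, answer):
    # Programmatic check: the correct answer is recomputed, not judged.
    a, b = problem
    return answer == a + b


def teacher_easy(n):
    return [(i, i + 1) for i in range(n)]


def teacher_hard(n):
    return [(10**6 + i, -i) for i in range(n)]


def student_exact(problem):
    a, b = problem
    return a + b


def student_small_only(problem):
    # Toy student that gives up on large operands.
    a, b = problem
    return a + b if abs(a) < 1000 and abs(b) < 1000 else None


def cross_evaluate(teachers, students, n=5):
    # Solve rate for every (teacher, student) pair.
    table = {}
    for t_name, propose in teachers.items():
        problems = propose(n)
        for s_name, solve in students.items():
            solved = sum(verifier(p, solve(p)) for p in problems)
            table[(t_name, s_name)] = solved / len(problems)
    return table


TEACHERS = {"easy": teacher_easy, "hard": teacher_hard}
STUDENTS = {"exact": student_exact, "small_only": student_small_only}
table = cross_evaluate(TEACHERS, STUDENTS)
for pair, rate in sorted(table.items()):
    print(pair, rate)
```

A single agent grading only its own problems would see one cell of this table; scoring across sub-populations exposes which teachers are hard for which students.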

PopuLoRA: Co-Evolving LLM Populations for Reasoning Self-Play

AMavorParker — Wed, 20 May 2026 21:11:55 +0000

Article URL: https://vmax.ai/team/populora-co-evolving-llm-populations-for-reasoning-self-play

Comments URL: https://news.ycombinator.com/item?id=48214188

Points: 33

# Comments: 6