Hacker News: syntacticsalt

New comment by syntacticsalt in "Uv is fantastic, but its package management UX is a mess"

syntacticsalt — Fri, 22 May 2026 06:49:31 +0000

Pixi uses uv as a backend, and I've enjoyed the UI because it's easy to add task aliases for things like nicely-formatted lists of outdated packages. (I have no affiliation with the project.)

Pixi-diff-to-markdown in particular has made scanning automated CI package updates easier. To view the outdated packages that an update would change, I'd create a task alias for the project:

pixi task add outdated "pixi update --dry-run --json | pixi exec pixi-diff-to-markdown"

And then run the task in the project via:

pixi run outdated

The output is a readable Markdown table listing each package that would be updated, its old version, and the new version that the pixi update command would install. Your mileage and tastes, of course, may vary.

New comment by syntacticsalt in "Books are not too expensive"

syntacticsalt — Thu, 23 Apr 2026 16:13:05 +0000

I don't disagree about the complicity. However, biology and statistics, even at intro level, have had significant updates in material covered over the last 10-20 years.

More subtly, terminology changes. My copy of Rudin's Principles of Mathematical Analysis is just as correct now as it was when it was published in 1976, but I remember one of my professors describing the terminology as somewhat dated, as of the late 2000s.

New comment by syntacticsalt in "Everything is correlated (2014–23)"

syntacticsalt — Sun, 24 Aug 2025 15:34:18 +0000

My opinion of Gwern's piece is that some of the arguments he makes don't require correlations. For example, A/B tests of differences in means using a zero difference null hypothesis will reject the null, given enough data.
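To make the large-sample point concrete, here is a minimal Python sketch. It assumes a two-sided two-sample z-test with a known, equal standard deviation in both arms, and it plugs in the true difference as the observed one, so the numbers show expected behavior rather than a simulated experiment. The function name and the example values (a true difference of 0.01 standard deviations) are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist


def expected_p_value(true_diff: float, sd: float, n_per_arm: int) -> float:
    """Two-sided z-test p-value when the observed difference equals true_diff."""
    # Standard error of a difference in means with equal arm sizes.
    standard_error = sd * sqrt(2.0 / n_per_arm)
    z = true_diff / standard_error
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


if __name__ == "__main__":
    # A tiny but nonzero difference: 0.01 standard deviations (made-up value).
    for n in (1_000, 100_000, 1_000_000):
        p = expected_p_value(true_diff=0.01, sd=1.0, n_per_arm=n)
        print(f"n per arm = {n:>9,}  p = {p:.3g}")
```

The effect never changes, but the standard error shrinks with the square root of the sample size, so the same negligible difference eventually crosses any fixed significance threshold.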

In that A/B testing scenario, I think if someone wants to test whether the difference is zero, that's fine, but if the effect size is small, they shouldn't claim that there's any meaningful difference. I believe the pharma literature calls this scenario equivalence testing.

Assuming a positive difference in means is desirable, I think testing for a null hypothesis of a change of at least some positive value (e.g., +5% of control) is a better idea. I believe the pharma literature calls this scenario superiority testing.
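A rough sketch of that shifted null, assuming a one-sided z-test with a known, equal standard deviation in both arms. The function name, the margin, and the numbers in the demo are hypothetical; the only idea taken from the comment is that the null becomes "the lift is no larger than some positive margin" instead of "the lift is zero".

```python
from math import sqrt
from statistics import NormalDist


def shifted_null_p_value(observed_diff: float, margin: float,
                         sd: float, n_per_arm: int) -> float:
    """One-sided p-value for H0: diff <= margin against H1: diff > margin."""
    standard_error = sd * sqrt(2.0 / n_per_arm)
    z = (observed_diff - margin) / standard_error
    return 1.0 - NormalDist().cdf(z)


if __name__ == "__main__":
    control_mean = 10.0            # made-up control mean
    margin = 0.05 * control_mean   # require at least +5% of control
    for diff in (0.3, 0.8):
        p_zero = shifted_null_p_value(diff, 0.0, sd=2.0, n_per_arm=1_000)
        p_margin = shifted_null_p_value(diff, margin, sd=2.0, n_per_arm=1_000)
        print(f"diff={diff}: p vs zero = {p_zero:.3g}, p vs margin = {p_margin:.3g}")
```

With these made-up numbers, an observed lift of 0.3 rejects the zero null comfortably but does not come close to rejecting the +5% null, while a lift of 0.8 rejects both.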

I believe superiority testing is preferable to equivalence testing, and in professional settings, I have made this case to managers. I have not succeeded in persuading them, and thus do the equivalence testing they request.

I don't think the idea of a zero null hypothesis is necessarily mathematically unsound. In cases like the difference in means, a zero null hypothesis is well-posed. However, I agree with you that there are better practices, like a null hypothesis incorporating a nonzero effect.

I don't entirely agree with the arguments Gwern puts forth in the Implications section because some of them seem at odds with one another. Betting on sparsity would imply neglecting some of the correlations he's arguing are so essential to capture. The bit about algorithmic bias strikes me as a bizarre proposition to include with little supporting evidence, especially when there are empirical examples of algorithmic bias.

What I find lacking about Gwern's piece is that it's a bit like lighting a match to widespread statistical practice, and then walking away. Yes, I think null hypothesis significance testing is widely overused, and that statistical significance alone is not a good determinant of what constitutes a "discovery". I agree that modeling is hard, and that "everything is correlated" is, to an extent, true because the correlations are not literally or exactly zero. But if you're going to take the strong stance that null hypothesis significance testing is meaningless, I believe you need to provide some kind of concrete alternative.

I don't think Gwern's piece explicitly advocates an alternative, and it only hints that the alternative might be causal inference. Asking people who may not have much statistics training to leap from frequentist concepts taught in high school to causal inference would be a big ask. If Gwern isn't asking that, then I'd want to know what a suggested alternative would be. Notably, Gwern does not mention testing for nonzero positive effects (e.g., in the vein of the "c > 0.3" case above). If there isn't an alternative, I'm not sure what the argument is. Don't use statistics, perhaps? It's tough to say.

New comment by syntacticsalt in "Everything is correlated (2014–23)"

syntacticsalt — Sat, 23 Aug 2025 06:55:06 +0000

Are you referring to the first figure, from Smith et al. (2007)? If so, I couldn't evaluate whether Gwern's claim makes sense without reading that paper to get an idea of, e.g., sample size and how they control for false positives. I don't think it's self-evident from that figure alone.

One rule of thumb for interpreting (presumably Pearson) correlation coefficients is given in [0] and states that correlations with magnitude 0.3 or less are negligible, in which case most of the bins in that histogram correspond to cases that aren't considered meaningful.

[0]: https://pmc.ncbi.nlm.nih.gov/articles/PMC3576830/table/T1/

New comment by syntacticsalt in "Everything is correlated (2014–23)"

syntacticsalt — Fri, 22 Aug 2025 17:14:59 +0000

As you yourself point out, a consistent estimator of a parameter converges to that parameter's value in the infinite sample limit. That limit is zero or it's not.

New comment by syntacticsalt in "Everything is correlated (2014–23)"

syntacticsalt — Fri, 22 Aug 2025 09:20:36 +0000

A frequentist interpretation of inference assumes parameters have fixed, but unknown values. In this paradigm, it is sensible to speak of the statement "this parameter's value is zero" as either true or false.

I do not think it is accurate to portray the author as someone who does not understand asymptotic statistics.

New comment by syntacticsalt in "Everything is correlated (2014–23)"

syntacticsalt — Fri, 22 Aug 2025 08:53:50 +0000

Reporting effect size mitigates this problem. If observed effect size is too small, its statistical significance isn't viewed as meaningful.
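As one illustration of reporting an effect size next to a p-value, here is a small sketch that computes Cohen's d for a difference in means. The comment does not name a specific measure, so the choice of Cohen's d, the function name, and the data are all assumptions made for the example.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(control, treatment):
    """Standardized mean difference using the pooled sample standard deviation."""
    n_c, n_t = len(control), len(treatment)
    pooled_variance = (
        (n_c - 1) * variance(control) + (n_t - 1) * variance(treatment)
    ) / (n_c + n_t - 2)
    return (mean(treatment) - mean(control)) / sqrt(pooled_variance)


if __name__ == "__main__":
    control = [1.0, 2.0, 3.0, 4.0, 5.0]    # made-up data
    treatment = [2.0, 3.0, 4.0, 5.0, 6.0]  # made-up data
    print(f"Cohen's d = {cohens_d(control, treatment):.3f}")
```

Unlike a p-value, this quantity does not shrink or grow just because more data was collected, which is what makes it useful for judging whether a significant result is also a meaningful one.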

New comment by syntacticsalt in "Everything is correlated (2014–23)"

syntacticsalt — Fri, 22 Aug 2025 07:40:21 +0000

I don't disagree with the title, but I'm left wondering what they want us to do about it beyond hinting at causal inference. I'd also be curious what the author thinks of minimum effect sizes (re: Implication 1) and noninferiority testing (re: Implication 2).

New comment by syntacticsalt in "P-Hacking in Startups"

syntacticsalt — Sun, 22 Jun 2025 07:15:23 +0000

Permutation tests don't account for family-wise error rate effects, so I'm curious why you would say that "it doesn't overcorrect like traditional methods".

I'm also curious why you say those "cover every case", because permutation tests tend to be underpowered, and also tend to be cumbersome when it comes to constructing confidence intervals of statistics, compared to something like the bootstrap.

Don't get me wrong -- I like permutation tests, especially for their versatility, but as one tool out of a bunch of methods.
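For readers who want to see the two tools side by side, here is a minimal standard-library sketch of a permutation test and a percentile bootstrap interval for a difference in means. The function names, resample counts, and data are illustrative assumptions. Note that neither function adjusts for multiple comparisons; running many such tests would still need a separate family-wise correction.

```python
import random


def _mean(values):
    return sum(values) / len(values)


def permutation_p_value(a, b, n_permutations=5000, seed=0):
    """Two-sided permutation p-value for a difference in means."""
    rng = random.Random(seed)
    observed = abs(_mean(b) - _mean(a))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # relabel observations under the null
        diff = abs(_mean(pooled[len(a):]) - _mean(pooled[:len(a)]))
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimate strictly positive.
    return (extreme + 1) / (n_permutations + 1)


def bootstrap_ci(a, b, n_resamples=5000, level=0.95, seed=0):
    """Percentile bootstrap interval for mean(b) - mean(a)."""
    rng = random.Random(seed)
    diffs = sorted(
        _mean(rng.choices(b, k=len(b))) - _mean(rng.choices(a, k=len(a)))
        for _ in range(n_resamples)
    )
    lower_index = int((1 - level) / 2 * n_resamples)
    upper_index = int((1 + level) / 2 * n_resamples) - 1
    return diffs[lower_index], diffs[upper_index]


if __name__ == "__main__":
    a = [10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 9.7, 10.1]    # made-up data
    b = [10.9, 11.2, 10.8, 11.1, 11.0, 10.7, 11.3, 10.9]  # made-up data
    print("permutation p-value:", permutation_p_value(a, b))
    print("bootstrap 95% interval:", bootstrap_ci(a, b))
```

The permutation test answers only "is there a difference?", while the bootstrap gives an interval for the size of the difference directly, which is the convenience mentioned above.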

New comment by syntacticsalt in "P-Hacking in Startups"

syntacticsalt — Sun, 22 Jun 2025 06:41:20 +0000

> Most companies don't cost peoples' lives when you get it wrong.

True, but it usually costs money to fix it. I think the themes of "this only matters if lives are on the line" or "it's too rigorous" are straw men.

We have limited resources -- time, money, people. We'd like to avoid deploying those resources badly. Statistical inference can be one way to give us more information so we avoid using our resources badly, but as you note, statistical inference also has costs: we have to spend resources to get the data we need to do the inference, plus other costs. We can estimate the costs of getting sufficient data using sample size estimation methods. For go/no-go decision-making, if the cost of getting the decision wrong isn't something like at least 10x the cost of doing the statistical inference, I don't think it's worth doing the inference. It may be worth doing the inference for _other_ reasons, but those reasons are out of scope.

As an example, a common use of statistical inference in medical research is to compare the efficacy of a treatment with a placebo. Some of the motivation is to decide whether to invest more resources in developing the treatment, rather than a concern that people will die from a false positive stating that the treatment is effective when it isn't.

> A lot of companies are, arguably, _too rigorous_ when it comes to testing.

My experience in industry has been the opposite. Companies like the idea of data-driven decision-making, but then they discover pain points. They should have some idea of how much of a change they're looking to detect (i.e., an effect size). They should estimate how much data they're likely to need to run their tests (i.e., sample size estimation). They have to consider other issues like model misfit, calibration, multiple-testing corrections, and so on. Then they also have to rig up the infra to be able to _do_ the testing, collect the data, analyze the results, and communicate the results to their internal stakeholders. These pain points are why companies like Eppo and StatSig exist -- A/B testing ends up being more high-touch than developers expect.

Messing up any one of these issues can yield "flaky tests," which developers hate. Failing to gather a sufficiently large sample size for a given effect size is a pretty common failure mode.
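As a rough illustration of sample size estimation, the sketch below uses the textbook normal-approximation formula for a two-sided, two-sample comparison of means with equal arm sizes: n per arm = 2 * ((z_{1-alpha/2} + z_{power}) * sd / effect)^2. The function name and defaults (alpha of 0.05, power of 0.8) are assumptions for the example, and real tests may need a different formula.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(min_effect: float, sd: float,
                        alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate per-arm sample size to detect min_effect with a z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = NormalDist().inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) * sd / min_effect) ** 2)


if __name__ == "__main__":
    # Made-up effect sizes, in units of the standard deviation.
    for effect in (0.5, 0.1, 0.05):
        n = sample_size_per_arm(effect, sd=1.0)
        print(f"effect = {effect}: about {n} per arm")
```

Because the required sample grows with the inverse square of the effect, halving the effect you want to detect roughly quadruples the data you need, which is one reason underpowered tests are so common.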

> But to "maintain rigor," we waited 6 weeks before turning it... and the final numbers were virtually the same as the 48 hour numbers.

It's difficult to tell precisely what you mean by "maintain rigor" here. The only context I can gather is that whatever procedure you were using needed more data to satisfy the preconditions under which the test meets its nominal design criteria -- usually, its nominal false positive rate. I don't think this is an issue of rigor -- it's an issue of statistical modeling and correctness.

Sometimes, it's possible to use different methods that may require less data at the cost of more (or different) modeling assumptions. Failing to satisfy the assumptions of a test can increase its false positive rate. Whether that matters is really up to you.

> I do like their proposal for "peeking" and subsequent testing.

What the post is suggesting is not a proposal, but a standard class of frequentist statistical inference methods called sequential testing. Daniël Lakens has a good online textbook (https://lakens.github.io/statistical_inferences/) that briefly discusses these methods in Chapter 10 and provides further references.
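To show why peeking needs a corrected threshold at all, here is a small simulation sketch. It assumes a one-sample z-test with known unit variance and data generated under a true null, and it compares testing at the unadjusted level on every look against a simple Bonferroni split across looks. The Bonferroni split is a deliberately conservative stand-in chosen for brevity; group-sequential designs generally use less conservative boundaries. All names and parameter values are illustrative.

```python
import random
from math import sqrt
from statistics import NormalDist


def false_positive_rate(n_looks, batch_size, per_look_alpha,
                        n_sims=2000, seed=0):
    """Share of null experiments that reject at any interim look."""
    rng = random.Random(seed)
    critical_z = NormalDist().inv_cdf(1 - per_look_alpha / 2)
    rejections = 0
    for _sim in range(n_sims):
        total, count = 0.0, 0
        for _look in range(n_looks):
            for _obs in range(batch_size):
                total += rng.gauss(0.0, 1.0)  # the null is true: mean is 0
            count += batch_size
            z = total / sqrt(count)  # z-statistic on all data so far
            if abs(z) > critical_z:
                rejections += 1
                break  # stop the experiment at the first "significant" look
    return rejections / n_sims


if __name__ == "__main__":
    looks, alpha = 5, 0.05
    naive = false_positive_rate(looks, 20, per_look_alpha=alpha)
    split = false_positive_rate(looks, 20, per_look_alpha=alpha / looks)
    print(f"unadjusted peeking: {naive:.3f}")
    print(f"Bonferroni split:   {split:.3f}")
```

Under these assumptions, checking five times at the unadjusted 0.05 level rejects a true null far more often than 5% of the time, while the split threshold keeps the overall rate below the nominal level at the cost of some power.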

> We're shipping software. We can change things if we get them wrong.

That's usually true -- as long as you have the resources needed to make those changes, and are willing to spend them that way.

> IMO, the right framing here is: your startup deserves to be as rigorous as is necessary to achieve its goals.

While I don't disagree with the sentiment, I think you're conflating rigor with correctness here.

> If its goals are "stat sig on every test", then sure, treat it like someone might die if you're wrong.

I think that's a false equivalence. Even the American Statistical Association has issued a statement on p-values (see https://www.amstat.org/asa/files/pdfs/p-valuestatement.pdf) that includes "Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold."

> But if your goals are "do no harm, see if we're heading in the right vector, and trust that you can pivot if it turns out you got a false positive," then you kind of explicitly don't need to treat it with the same rigor as a medical test.

If those are your goals, just ship it; I don't think it makes sense to justify the effort to test in this situation, especially if, as you argue, it's financially feasible to roll back the change or pivot if it doesn't work.