<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: onasta</title><link>https://news.ycombinator.com/user?id=onasta</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Tue, 14 Apr 2026 20:16:56 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=onasta" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[Show HN: TabPFN Scaling Mode – Tabular Foundation Model on millions of rows]]></title><description><![CDATA[
<p>I’m excited to announce Scaling Mode for TabPFN-2.5, our tabular foundation model (Large Tabular Model). It removes the fixed upper limit on dataset size and extends TabPFN to datasets with millions of rows.<p>What is Scaling Mode? A new pipeline around TabPFN-2.5 designed for large-N workloads. It lifts TabPFN's fixed row limit, so the system is constrained only by your compute and memory.<p>We benchmarked Scaling Mode on datasets ranging from 1M to 10M rows, comparing against CatBoost, XGBoost, and LightGBM. Key findings:<p>- Scaling Mode enables TabPFN-2.5 to keep improving as more data is added<p>- It scales dramatically better than TabPFN-2.5 with 50K subsampling<p>- No evidence that TabPFN's lead over gradient boosting shrinks as we scale up<p>- Performance continues to improve strongly with more data, even on the largest tested datasets<p>Here's our scaling trajectory so far:<p>- TabPFN v2 (Jan 2025): 10K rows<p>- TabPFN-2.5 (Nov 2025): 50K rows<p>- Scaling Mode (today): 10M+ rows tested<p>Current limitations: Scaling Mode is aimed at companies for now. If you’re working with data at this scale and want to test it, access is currently by request so we can support early users properly.<p>Full blog post: <a href="https://priorlabs.ai/technical-reports/large-data-model" rel="nofollow">https://priorlabs.ai/technical-reports/large-data-model</a><p>Request access: <a href="https://priorlabs.ai/tabpfn/large-data" rel="nofollow">https://priorlabs.ai/tabpfn/large-data</a><p>Would love to hear feedback from anyone working with large tabular datasets!</p>
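<p>For context, the 50K-subsampling baseline mentioned above simply caps the training context TabPFN-2.5 sees. A minimal sketch of that baseline (a hypothetical helper, not Prior Labs code):</p>

```python
import random

def subsample_context(rows, labels, limit=50_000, seed=0):
    """Cap the training context at TabPFN-2.5's 50K-row limit by random
    subsampling -- the baseline that Scaling Mode is compared against."""
    if len(rows) <= limit:
        return rows, labels                      # small data: use everything
    rng = random.Random(seed)                    # reproducible subsample
    idx = rng.sample(range(len(rows)), limit)    # sample without replacement
    return [rows[i] for i in idx], [labels[i] for i in idx]
```

<p>Scaling Mode's advantage in the benchmarks is precisely that it does not have to discard rows this way.</p>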
<hr>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=46138439">https://news.ycombinator.com/item?id=46138439</a></p>
<p>Points: 5</p>
<p># Comments: 0</p>
]]></description><pubDate>Wed, 03 Dec 2025 18:56:00 +0000</pubDate><link>https://priorlabs.ai/technical-reports/large-data-model</link><dc:creator>onasta</dc:creator><comments>https://news.ycombinator.com/item?id=46138439</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46138439</guid></item><item><title><![CDATA[Show HN: TabPFN-2.5 – SOTA foundation model for tabular data]]></title><description><![CDATA[
<p>I am excited to announce the release of TabPFN-2.5, our tabular foundation model that now scales to datasets of up to 50,000 samples and 2,000 features - a 5x increase in rows over TabPFN v2, published in Nature earlier this year. TabPFN-2.5 delivers state-of-the-art predictions in one forward pass, without hyperparameter tuning, across classification and regression tasks.<p><i>What’s new in 2.5</i>:
TabPFN-2.5 maintains the core approach of v2 - a pretrained transformer trained on more than a hundred million synthetic datasets to perform in-context learning and output a predictive distribution for the test data. It natively supports missing values, categorical features, text, and numerical features, and is robust to outliers and uninformative features.<p>The major improvements:<p>- 5x scale increase: Now handles 50,000 samples × 2,000 features (up from 10,000 × 500 in v2)<p>- SOTA performance: TabPFN-2.5 outperforms tuned tree-based methods and matches the performance of a complex ensemble (AutoGluon 1.4) that itself includes TabPFN v2 and was tuned for 4 hours. Tuning TabPFN-2.5 improves performance further, outperforming AutoGluon 1.4 on regression tasks.<p>- Rebuilt API: A new REST interface along with a Python SDK with dedicated fit & predict endpoints, making deployment and integration more developer-friendly<p>- A distillation engine that converts TabPFN-2.5 into a compact MLP or tree ensemble while preserving accuracy and offering low-latency inference.<p>There are still some limitations. The model is designed for datasets up to 50K samples. It can handle larger datasets, but that hasn’t been our focus with TabPFN-2.5. The distillation engine is not yet available through the API, only under license (though we do show its performance in the model report).<p>We’re actively working on removing these limitations and intend to release newer models focused on context reasoning, causal inference, graph networks, larger data and time-series.
TabPFN-2.5 is available via API and a package on Hugging Face. Would love for you to try it and give us your feedback!<p>Model report: <a href="https://priorlabs.ai/technical-reports/tabpfn-2-5-model-report" rel="nofollow">https://priorlabs.ai/technical-reports/tabpfn-2-5-model-repo...</a><p>Package: <a href="https://github.com/PriorLabs/TabPFN" rel="nofollow">https://github.com/PriorLabs/TabPFN</a><p>Client: <a href="https://github.com/PriorLabs/tabpfn-client" rel="nofollow">https://github.com/PriorLabs/tabpfn-client</a><p>Docs: <a href="https://docs.priorlabs.ai/quickstart" rel="nofollow">https://docs.priorlabs.ai/quickstart</a></p>
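<p>The distillation idea can be illustrated in miniature. In this toy sketch (not Prior Labs code; every name here is illustrative), a fixed logistic "teacher" stands in for TabPFN-2.5, and a one-parameter "student" is fitted to the teacher's soft predictions by gradient descent:</p>

```python
import math, random

# Toy distillation: the "teacher" probability function stands in for
# TabPFN-2.5; the "student" is a 1-parameter logistic model trained to
# match the teacher's soft outputs rather than hard labels.
def teacher(x):
    return 1.0 / (1.0 + math.exp(-3.0 * x))    # pretend foundation model

random.seed(0)
xs = [random.uniform(-2, 2) for _ in range(200)]
soft = [teacher(x) for x in xs]                # soft labels from the teacher

w = 0.0                                        # student weight
for _ in range(2000):                          # plain SGD on cross-entropy
    for x, p in zip(xs, soft):                 # against the soft labels
        q = 1.0 / (1.0 + math.exp(-w * x))
        w -= 0.05 * (q - p) * x                # gradient of CE w.r.t. w

# After distillation the cheap student tracks the teacher closely;
# w ends near the teacher's weight of 3.0.
err = max(abs(teacher(x) - 1.0 / (1.0 + math.exp(-w * x))) for x in xs)
```

<p>The real engine distills into an MLP or tree ensemble rather than a single logistic unit, but the principle - fit a small, fast model to the big model's predictive distribution - is the same.</p>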
<hr>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=45838540">https://news.ycombinator.com/item?id=45838540</a></p>
<p>Points: 73</p>
<p># Comments: 13</p>
]]></description><pubDate>Thu, 06 Nov 2025 18:26:53 +0000</pubDate><link>https://priorlabs.ai/technical-reports/tabpfn-2-5-model-report</link><dc:creator>onasta</dc:creator><comments>https://news.ycombinator.com/item?id=45838540</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45838540</guid></item><item><title><![CDATA[New comment by onasta in "Show HN: TabPFN v2 – A SOTA foundation model for small tabular data"]]></title><description><![CDATA[
<p>There have been a ton of improvements! Much better performance overall, a much larger data size limit (1K-->10K rows, 100-->500 features), regression support, native handling of categorical data and missing values, much better support for uninformative or outlier features, etc.</p>
]]></description><pubDate>Thu, 09 Jan 2025 23:30:26 +0000</pubDate><link>https://news.ycombinator.com/item?id=42650930</link><dc:creator>onasta</dc:creator><comments>https://news.ycombinator.com/item?id=42650930</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42650930</guid></item><item><title><![CDATA[New comment by onasta in "Show HN: TabPFN v2 – A SOTA foundation model for small tabular data"]]></title><description><![CDATA[
<p>TabPFN has been better on numerical data since v1 (see figure 6 in the CARTE paper). CARTE's main strength is on text features, which are now also supported in the TabPFN v2 API version (<a href="https://github.com/PriorLabs/tabpfn-client">https://github.com/PriorLabs/tabpfn-client</a>). We compared this to CARTE and found our model to be generally better, and much faster. CARTE's multi-table approach is also very interesting, and we want to tackle that setting in the future.</p>
]]></description><pubDate>Thu, 09 Jan 2025 22:55:43 +0000</pubDate><link>https://news.ycombinator.com/item?id=42650700</link><dc:creator>onasta</dc:creator><comments>https://news.ycombinator.com/item?id=42650700</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42650700</guid></item><item><title><![CDATA[Show HN: TabPFN v2 – A SOTA foundation model for small tabular data]]></title><description><![CDATA[
<p>I am excited to announce the release of TabPFN v2, a tabular foundation model that delivers state-of-the-art predictions on small datasets in just 2.8 seconds for classification and 4.8 seconds for regression, compared to strong baselines tuned for 4 hours. Published in Nature, the model outperforms traditional methods on datasets with up to 10,000 samples and 500 features.<p>The model is available under an open license: a derivative of the Apache 2 license with a single modification, adding an enhanced attribution requirement inspired by the Llama 3 license: <a href="https://github.com/PriorLabs/tabpfn">https://github.com/PriorLabs/tabpfn</a>. You can also try it via API: <a href="https://github.com/PriorLabs/tabpfn-client">https://github.com/PriorLabs/tabpfn-client</a><p>TabPFN v2 is trained on 130 million synthetic tabular prediction datasets to perform in-context learning and output a predictive distribution for the test data points. Each dataset acts as one meta-datapoint for training the TabPFN weights with SGD. As a foundation model, TabPFN allows for fine-tuning, density estimation and data generation.<p>Compared to TabPFN v1, v2 now natively supports categorical features and missing values, and it performs just as well on datasets with or without them. It also handles outliers and uninformative features naturally - problems that often throw off standard neural nets.<p>TabPFN v2 performs as well with half the data as the next-best baseline (CatBoost) does with all the data.<p>We also compared TabPFN to the SOTA AutoML system AutoGluon 1.0. Standard TabPFN already outperforms AutoGluon on classification and ties on regression, and ensembling multiple TabPFNs (PHE) in TabPFN v2 is even better.<p>There are some limitations: TabPFN v2 is very fast to train and does not require hyperparameter tuning, but inference is slow. The model is also only designed for datasets up to 10K data points and 500 features. While it may perform well on larger datasets, that hasn't been our focus.<p>We're actively working on removing these limitations and intend to release new versions of TabPFN that can handle larger datasets, offer faster inference, and cover additional predictive settings such as time-series and recommender systems.<p>We would love for you to try out TabPFN v2 and give us your feedback!</p>
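<p>The "fast to train, slow at inference" trade-off follows directly from in-context learning: fitting only stores the labeled data, and all the work happens at predict time, conditioning on that context. A toy stand-in (1-nearest-neighbour in place of the transformer; purely illustrative, not the TabPFN API):</p>

```python
import math

# Toy "in-context" predictor: like TabPFN, fit() only stores the labeled
# rows -- no weights are updated -- and predict() conditions on that
# stored context for each test point.
class InContextClassifier:
    def fit(self, X, y):
        self.ctx = list(zip(X, y))   # training is just memorizing context
        return self

    def predict(self, X):
        out = []
        for x in X:                  # all compute happens at inference
            _, label = min(self.ctx, key=lambda r: math.dist(r[0], x))
            out.append(label)
        return out

clf = InContextClassifier().fit([[0.0], [1.0], [5.0]], ["a", "a", "b"])
print(clf.predict([[0.2], [4.5]]))   # -> ['a', 'b']
```

<p>In TabPFN the conditioning is a transformer forward pass over the whole training set, which is why inference cost grows with context size.</p>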
<hr>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=42647343">https://news.ycombinator.com/item?id=42647343</a></p>
<p>Points: 153</p>
<p># Comments: 44</p>
]]></description><pubDate>Thu, 09 Jan 2025 16:38:26 +0000</pubDate><link>https://www.nature.com/articles/s41586-024-08328-6/link</link><dc:creator>onasta</dc:creator><comments>https://news.ycombinator.com/item?id=42647343</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42647343</guid></item><item><title><![CDATA[New comment by onasta in "Why do tree-based models still outperform deep learning on tabular data?"]]></title><description><![CDATA[
<p>Super interesting! Do you know what kind of data it's usually used for? And in the remaining 60% to 80%, do NNs account for a large portion of the best models?<p>Bonus question: are the stats you're mentioning publicly available?</p>
]]></description><pubDate>Thu, 04 Aug 2022 21:20:11 +0000</pubDate><link>https://news.ycombinator.com/item?id=32348430</link><dc:creator>onasta</dc:creator><comments>https://news.ycombinator.com/item?id=32348430</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=32348430</guid></item></channel></rss>