✨ #ICML2024 accepted! CARTE: Pretraining and Transfer for Tabular Learning
A short post on CARTE, our accepted ICML paper: why this is a jump forward for tabular deep learning 🤯, a stepping stone toward tabular foundation models 🎉, and a study with many lessons on learning from real tables 👇
Teaser: the contribution leads to sizeable improvements compared to many strong baselines, across 51 datasets.
We worked really hard on the baselines, testing many of them, some being new combinations of tools (with many lessons on neural networks and handling categories).
Context: a table often comes with many strings: column names, entries... These are a hurdle for learning (they must be encoded/vectorized), but crucial to understanding the meaning of a table. We use them to enable learning across datasets.
This is a new angle on tabular learning.
The key to enabling transfer across tables is to represent each table as a graph and all symbols (column names and table entries) as embeddings.
This makes it possible to learn across tables with different column names, different numbers of columns... A minimal sketch of the idea is below.
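To make this concrete, here is a toy sketch of the "row as a star graph" construction (my illustration, not the paper's exact pipeline; the `embed` function is a hashing stand-in for the language-model string embeddings a real setup would use):

```python
import hashlib
import numpy as np

def embed(text: str, dim: int = 32) -> np.ndarray:
    # Deterministic hashing stand-in for a real string-embedding model.
    seed = int(hashlib.md5(text.encode()).hexdigest(), 16) % 2**32
    return np.random.default_rng(seed).standard_normal(dim)

# One row becomes a small star graph: a center node for the row, one leaf
# node per cell, and each edge labeled with its column-name embedding.
# Two tables with different schemas thus land in the same vector space.
row = {"name": "Le Petit Zinc", "city": "Paris", "cuisine": "French"}
cell_nodes = np.stack([embed(str(v)) for v in row.values()])  # leaf features
edge_labels = np.stack([embed(col) for col in row])           # relation features
center_node = cell_nodes.mean(axis=0)  # crude initialization of the row node
```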
The architecture is a form of graph transformer with a specific attention mechanism that models relation types (distinguishing "G Bush" - "son of" - "G Bush"), as in knowledge-base embeddings.
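For intuition, a minimal sketch of relation-modulated attention on one such star graph (shapes and the multiplicative gating are my assumptions; the exact CARTE layer differs in its details):

```python
import torch
import torch.nn.functional as F

d = 64                                # embedding dimension (assumed)
x = torch.randn(5, d)                 # node 0 = row center, nodes 1-4 = cells
e = torch.randn(4, d)                 # one relation (column-name) embedding per edge
src = torch.tensor([1, 2, 3, 4])      # leaves send messages...
dst = torch.tensor([0, 0, 0, 0])      # ...to the center node

W_q, W_k, W_v, W_e = (torch.nn.Linear(d, d, bias=False) for _ in range(4))

q = W_q(x)[dst]                       # query of the receiving node
k = W_k(x)[src] * W_e(e)              # key gated by the relation embedding
v = W_v(x)[src] * W_e(e)              # value carries the relation too

scores = (q * k).sum(-1) / d**0.5     # one attention logit per edge
attn = F.softmax(scores, dim=0)       # all edges share the same destination here
center_update = (attn.unsqueeze(-1) * v).sum(0)
```

Because the relation embedding gates both key and value, the same entity string contributes differently depending on which column it sits in.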
We then pretrain it on a large general-purpose knowledge base, to bring in background knowledge.
The resulting model (CARTE) helps learning on small-to-midsize datasets compared to a strong set of baselines that use every trick to handle strings. Great lessons learned on the baselines too: @skrub's TableVectorizer rocks (example below), LLMs process strings well 😀, and bagging helps NNs.
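For reference, the TableVectorizer baseline takes one line to set up; here is an example on a made-up dataframe (column names and values invented for illustration):

```python
import pandas as pd
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.pipeline import make_pipeline
from skrub import TableVectorizer

# Toy dataframe: messy strings and dates go straight in, no manual encoding.
df = pd.DataFrame({
    "employee_position_title": ["Office Aide", "Master Police Officer", "Office Aide"],
    "date_first_hired": ["09/12/1988", "06/26/2006", "07/16/2007"],
    "department": ["HHS", "POL", "HHS"],
})
y = [69222.18, 97392.47, 64327.61]

# TableVectorizer picks a sensible encoder per column (dates, numbers,
# low- and high-cardinality strings), then the tree model does the rest.
model = make_pipeline(TableVectorizer(), HistGradientBoostingRegressor())
model.fit(df, y)
```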
As CARTE handles a varying number of features, it deals with missing values naturally (an absent cell is simply an absent node), and it works well 😀
CARTE can transfer without matching entities or columns (schemas). We run two experiments to demonstrate this:
We show that CARTE enriches learning as well as Wikipedia embeddings (KEN) do, both on datasets whose entities are matched to YAGO and on datasets where they are not.
CARTE enables joint learning across datasets with imperfect correspondence between columns. As baselines here, we consider LLM-based learning (à la TabLLM) and CatBoost with manually matched columns, using missing values for columns without a correspondence.
The ability to learn jointly across related datasets without correspondences extends to multiple datasets. Here we use datasets in the wild, grabbing datasets on related topics (restaurant ratings, for instance), and show much broader transfer than is typically studied.
We also compared to TabLLM on the datasets used in its paper. These datasets differ from our focus: they are mostly numerical, with low-cardinality categories. Here, TabPFN shines; with its focus on numbers, it is complementary to CARTE.
A negative aspect of CARTE (let's be honest): it is currently much slower than alternatives.
Some of this will improve with time. Some of it will stay: we are opening the door to pretraining that brings background knowledge into tabular learning. That comes at a cost ☹️
To conclude, I'm super excited about CARTE. It's the first time I see a benefit over tree-based methods on the type of data I work with (socio-demographics, housing, healthcare...). It's a first step to tabular foundation models. More will come 🤯🫣
Also, I want to thank the reviewers. They challenged us and gave us a hard time, which improved the paper a lot, but they were genuinely open to the science and the evidence. That's the way to do reviewing! Looking forward to presenting this at ICML 🎉