XAI Today

Revisiting the Rashomon Set Argument

Julian Hatwell

About eighteen months ago, I posted about a paper discussing the Accuracy-Interpretability Trade-Off (AITO), also known as the Performance-Explainability Trade-Off (PET). The paper revisited the sometimes-overlooked debate over the validity of this trade-off. That is to say, is it even necessary to accept that such a trade-off or dichotomy exists? Are we really forced to choose between an accurate model and an interpretable one, or must we always compromise on our target metrics? You can read my previous post here

One of the arguments against accepting the trade-off is the so-called Rashomon Set (RS) argument. The RS argument suggests that, for many real-world tasks, multiple models from a single function class can achieve nearly the same level of performance. Within this set of models, some will be inherently interpretable. This idea, named after Breiman's Rashomon Effect, has been discussed extensively but never finally settled. The difficulty stems from a fact at the foundation of machine learning: finding an optimal model is, in general, an NP-hard problem. Instead, we approximate a near-optimal solution through risk minimization, a paradigm that encourages the thinking that our single, finally selected model is the best we can do. Decades of research into ensemble models hasn't changed that, because the ensemble simply takes the seat of a single, risk-minimized model. Rashomon Set research throws out this limiting paradigm in favour of exploring the many near-optimal models in the space of all possible models within a single function class.
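The core idea can be sketched in a few lines. This is a toy illustration only, with an invented dataset and a deliberately tiny hypothesis class of one-feature threshold rules (decision stumps); real RS work, such as the sparse decision tree enumeration discussed below, operates over vastly larger classes.

```python
# Toy dataset: rows of (feature_0, feature_1), with binary labels.
X = [(1.0, 5.0), (2.0, 3.0), (3.0, 4.0), (4.0, 1.0), (5.0, 2.0), (6.0, 0.5)]
y = [0, 0, 0, 1, 1, 1]

def stump_accuracy(feature, threshold):
    """Accuracy of the rule: predict 1 if X[feature] > threshold, else 0."""
    preds = [1 if row[feature] > threshold else 0 for row in X]
    return sum(p == t for p, t in zip(preds, y)) / len(y)

# Enumerate the whole (finite) hypothesis class of decision stumps.
hypotheses = [(f, th) for f in (0, 1) for th in (0.5, 1.5, 2.5, 3.5, 4.5, 5.5)]
scores = {h: stump_accuracy(*h) for h in hypotheses}

best = max(scores.values())
epsilon = 0.2  # performance tolerance defining the Rashomon Set

# The Rashomon Set: every model within epsilon of the best achievable score.
rashomon_set = [h for h, acc in scores.items() if acc >= best - epsilon]
print(f"best accuracy: {best:.2f}, Rashomon Set size: {len(rashomon_set)}")
```

Several distinct stumps land inside the tolerance band, and that multiplicity, not the single risk-minimized winner, is what the RS argument asks us to examine.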

In their paper Exploring the Whole Rashomon Set of Sparse Decision Trees, Xin et al. develop a dynamic programming-based method to generate and sample from the RS of sparse decision trees, demonstrated on several benchmark datasets. Three novel applications are presented, including a fascinating take on RS-derived variable importance. Most importantly, they show that traditional tree ensemble methods generate only a fraction of the RS, several orders of magnitude smaller than its theoretical maximum size.

The question of RS-derived feature importance is explored in fine detail in Partial Order in Chaos: Consensus on Feature Attributions in the Rashomon Set. The authors address the variability of feature attribution explanations. Those of us who have worked with foundational models such as Random Forests and Boosting methods are familiar with their capability to provide feature importance measures. We are, however, inevitably frustrated by the inconsistency of feature importance across model classes that differ in only trivial ways, and even across multiple runs of the same model while merely adjusting the random seed. In this paper, Laberge et al. show that the same is true even for theoretically stable methods such as SHAP (Lundberg and Lee, 2017), using a simple example with a simulated data set for which the ground-truth explanation is known. They go on to propose a framework, based on consensus within the Rashomon Set, for achieving much more consistent measures of variable importance.
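The consensus idea can be gestured at with made-up numbers. The attribution values below are invented, and this is only the simplest possible reading of a consensus partial order, not the authors' actual framework: one feature dominates another only when every model in the (hypothetical) Rashomon Set agrees on the comparison.

```python
# Rows: models in the Rashomon Set; columns: |attribution| per feature.
# All values are invented for illustration.
attributions = [
    [0.50, 0.30, 0.15, 0.05],  # model A
    [0.45, 0.35, 0.10, 0.10],  # model B
    [0.55, 0.25, 0.18, 0.02],  # model C
]
n_features = len(attributions[0])

# Feature i dominates feature j only if EVERY model agrees that i > j.
consensus = [
    (i, j)
    for i in range(n_features)
    for j in range(n_features)
    if i != j and all(m[i] > m[j] for m in attributions)
]
print("consensus partial order (i more important than j):", consensus)
```

Note that features 2 and 3 end up incomparable because the models disagree about them, which is exactly why the result is a partial order rather than a full ranking.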

In the paper On the Existence of Simpler Machine Learning Models, Semenova et al. propose the Rashomon Ratio as a means of estimating the likelihood of finding a highly interpretable model for any given problem. However, such enumeration-based methods are limited to the sparse decision tree model class and cannot be adapted to other foundational models that aren't based on recursive partitioning of binary features. A linear model with continuous features, for example, cannot have its RS enumerated by combinatorics.
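As a back-of-envelope sketch, the Rashomon Ratio is the fraction of a finite, enumerable hypothesis class that falls inside the Rashomon Set. The accuracies below are invented for illustration; in practice they would come from enumerating a class such as sparse decision trees.

```python
# Invented accuracies for an enumerated finite hypothesis class.
accuracies = [0.61, 0.72, 0.88, 0.90, 0.91, 0.89, 0.55, 0.87, 0.90, 0.66]
epsilon = 0.05  # tolerance below the best score

best = max(accuracies)
# Rashomon Ratio: the share of the class within epsilon of the best model.
rashomon_ratio = sum(a >= best - epsilon for a in accuracies) / len(accuracies)
print(f"Rashomon Ratio: {rashomon_ratio:.2f}")
```

A high ratio suggests that many near-optimal models exist, and hence a better chance that at least one of them is interpretable, which is precisely what the combinatorics cannot deliver for continuous-parameter classes such as linear models.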

The Rashomon Set argument is compelling. So far, however, the strongest evidence remains empirical rather than theoretical: a formal proof remains elusive. Nevertheless, ongoing research continues to explore the conditions and methodologies that facilitate the identification of interpretable models within the Rashomon Set and, as such, the RS argument remains a fascinating open question in theoretical machine learning research.
