Explaining Random Forests with Representative Trees
The paper “Can’t see the forest for the trees: Analyzing groves to explain random forests” explores a novel take on model-specific explanations, a topic I have also addressed in my own research (see, for example, “CHIRPS: Explaining random forest classification”). This new paper by Szepannek and von Holt seeks to make Random Forests (RF) more interpretable. RF are notoriously hard to explain due to their complexity, and these novel methods work well for both classification and regression, which is a very useful extension to the field.
The authors introduce most representative trees (MRT) and surrogate trees, essentially distilling a simpler model to run side by side with the black box RF. MRTs highlight the individual trees within a random forest that best explain the overall model behavior, while surrogate trees mimic the forest with simpler, more digestible versions. I have some reservations about the latter approach, because my own research showed that any surrogate model comes with a failure rate, which is the proportion of examples that the surrogate classifies differently from the black box model under scrutiny. I also question the assertion that a model of 10 or 24 decision trees really is that interpretable. Even a model of this reduced size likely still contains far too many components for a human-in-the-loop to consider and understand.
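To make the failure rate concrete, here is a minimal sketch that treats it as the share of examples on which a surrogate disagrees with the black box. The function name `failure_rate` and the toy prediction lists are my own illustrative assumptions. They are not taken from the paper or from any library.

```python
from typing import Sequence


def failure_rate(black_box_preds: Sequence, surrogate_preds: Sequence) -> float:
    """Return the fraction of examples where the surrogate disagrees with the black box."""
    if len(black_box_preds) != len(surrogate_preds):
        raise ValueError("prediction sequences must have the same length")
    if len(black_box_preds) == 0:
        raise ValueError("need at least one prediction")
    # Count positions where the two models assign different labels.
    disagreements = sum(b != s for b, s in zip(black_box_preds, surrogate_preds))
    return disagreements / len(black_box_preds)


# Hypothetical class labels for eight examples (made up for illustration only).
forest_preds = ["a", "b", "a", "a", "b", "b", "a", "b"]
surrogate_preds = ["a", "b", "b", "a", "b", "a", "a", "b"]

# The two lists differ at the third and sixth positions: 2 of 8 examples.
print(failure_rate(forest_preds, surrogate_preds))  # 0.25
```

Note that this measures agreement with the black box, not accuracy against the true labels. A surrogate can be accurate and still fail to explain what the forest actually does on a given example.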
In any case, to give the authors due credit, they navigate the trade-offs between accuracy and interpretability of both MRT and surrogate tree methods, and propose a novel concept called groves: small collections of decision trees that balance the need for interpretability with predictive accuracy. Groves provide a middle ground by combining the benefits of MRTs and surrogate models, reducing the overall complexity while still offering meaningful insights into how the model operates. This approach aligns with the goal of making models more transparent and trustworthy.
Through various case studies, the paper shows how groves and surrogate trees can be effectively applied to real-world datasets. The trade-off between model accuracy and explainability remains a central challenge. Yet, in these studies, groves provide a workable compromise by making it easier for humans to understand what is driving the model’s predictions without overwhelming them with unnecessary detail.
The discussion also highlights a key challenge in using groves: deciding on the right number of trees to use for explanation. Using too many trees risks overwhelming the user with information (as I have already pointed out), while too few might fail to capture the complexity of the underlying model and result in an untenable failure rate. I discuss ways to achieve a zero failure rate in my thesis. Keeping explanations concise and accessible is just a part of the complete picture.
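One simple way to frame this size trade-off, sketched under my own assumptions rather than as the authors' procedure, is to measure the failure rate at several candidate sizes and keep the smallest size that stays within a chosen tolerance. The helper `smallest_acceptable_size`, the tolerance, and the numbers below are hypothetical placeholders, not results from the paper.

```python
from typing import Dict, Optional


def smallest_acceptable_size(
    failure_by_size: Dict[int, float], tolerance: float
) -> Optional[int]:
    """Return the smallest number of trees whose failure rate is within tolerance.

    Returns None when no candidate size meets the tolerance.
    """
    acceptable = [size for size, rate in failure_by_size.items() if rate <= tolerance]
    return min(acceptable) if acceptable else None


# Hypothetical failure rates per explanation size (made-up values, illustration only).
failure_by_size = {1: 0.21, 2: 0.15, 4: 0.09, 8: 0.05, 16: 0.04}

print(smallest_acceptable_size(failure_by_size, tolerance=0.10))  # 4
print(smallest_acceptable_size(failure_by_size, tolerance=0.0))   # None
```

With these toy values, demanding a zero failure rate leaves no acceptable size at all, which is exactly why I see the failure rate, and not just conciseness, as part of the complete picture.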
In conclusion, this paper underscores the crucial need for enhancing the interpretability of machine learning models, particularly in high-stakes fields like healthcare and finance, where decision transparency is essential. By extending the work in interpretability through methods like groves and surrogate trees, it addresses the challenge of making powerful models like random forests more understandable.