Mushrooms have always fascinated me because of how many different kinds there are. Some mushrooms are small and grow in large clusters that are perfect for soups, and some are big, red and would make you sick for weeks if you tried to eat them. While I love most any mushroom that is safe to eat, trying to figure out which ones are edible is something that has intimidated me and kept me from foraging on my own.
While looking for a dataset to use with some predictive modeling, I came across one on UCI’s Machine Learning Repository built from 23 different mushroom characteristics (see link here), and I chose to predict whether a mushroom was poisonous or not. I then began to explore.
Getting the Characteristic Weights
After some initial cleaning of the dataset, I began to look at which characteristics/features would or wouldn’t be helpful in determining whether the mushroom was edible. From the initial 22 features, I dropped ‘veil-type’ because it held the same value for every mushroom in the dataset. I also dropped ‘odor’ because it was recorded too subjectively, with words like ‘musty’, ‘fishy’, ‘almond’, and ‘none’, to name a few: all words that could be interpreted differently depending on who was recording the data.
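To illustrate that cleaning step, here’s a minimal pandas sketch. The tiny frame below is made up to stand in for the real dataset; the idea is to drop any column holding a single constant value (which is how ‘veil-type’ gets caught) and then drop ‘odor’ by name:

```python
import pandas as pd

# Hypothetical mini-frame standing in for the UCI mushroom data.
df = pd.DataFrame({
    "class": ["p", "e", "e", "p"],
    "veil-type": ["partial"] * 4,                  # constant -> no signal
    "odor": ["musty", "none", "almond", "fishy"],  # subjectively recorded
    "gill-size": ["broad", "narrow", "broad", "narrow"],
})

# Drop any column with a single unique value, plus 'odor' for the
# subjectivity reasons described above.
constant_cols = [c for c in df.columns if df[c].nunique() == 1]
df = df.drop(columns=constant_cols + ["odor"])
```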
Once the data had been cleaned up, I took a look at the permutation importance to see how much weight each characteristic carried in a basic Random Forest classifier used to determine the edibility of each mushroom.
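As a rough sketch of how that works (on a synthetic toy dataset rather than the mushroom data), scikit-learn’s `permutation_importance` shuffles one column at a time and measures how much the model’s score drops; an informative column should hurt the score when shuffled, and a noise column shouldn’t:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)

# Toy stand-in: two informative binary features and one pure-noise feature.
X = rng.integers(0, 2, size=(500, 3))
y = X[:, 0] ^ X[:, 1]  # target depends only on the first two columns

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
# result.importances_mean[i] is the mean score drop when column i is shuffled
```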
As seen in the graphic to the left, spore-print-color and gill-size carried the most weight in the Random Forest’s calculations. After seeing this, I wanted to find out what a spore print is and how it is collected.
A spore print is collected by first placing a freshly picked mushroom cap on a solid-colored or clear surface (often white paper or a plastic slide) and letting it sit for up to twenty-four hours. This allows the mushroom’s spores to gently float down onto the surface, where they can be easily seen. The colors often range from white to black, with varying shades of red, purple, or brown, and sometimes even green, in between. The spore print is commonly used to identify the genus of a mushroom and can sometimes help distinguish two very similar-looking species.
Upon making a Shapley Force Plot, I found that the spore-print-color often had one of the top weights in the model’s predictions. In the following example, we can see that although it may not always be the case when the spore print is a chocolate color, the mushroom is more likely to be poisonous than if it were a different color.
The model predicts this mushroom is “Poisonous”, with a 100% probability.
The actual value is “Poisonous”.
Top 3 reasons for prediction:
1. Spore-print-color is chocolate.
2. Stalk-surface-above-ring is silky.
3. Gill-color is gray.
Top counter-argument against prediction:
- Cap-surface is fibrous.
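A ranking like the one above can be read straight off the per-feature contribution values a Shapley force plot is built from. The numbers below are made up for illustration only; positive values push the prediction toward “poisonous”, negative values push toward “edible”:

```python
# Hypothetical SHAP-style contributions toward the "poisonous" class.
contributions = {
    "spore-print-color=chocolate": 0.41,
    "stalk-surface-above-ring=silky": 0.22,
    "gill-color=gray": 0.15,
    "gill-size=broad": 0.03,
    "cap-surface=fibrous": -0.09,
}

# Sort by contribution: top positives are the "reasons", negatives
# are the counter-arguments against the prediction.
ranked = sorted(contributions.items(), key=lambda kv: kv[1], reverse=True)
top_reasons = [name for name, v in ranked if v > 0][:3]
counter_args = [name for name, v in ranked if v < 0]
```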
Now that we’ve seen how the model comes to each determination, let’s take a look at its accuracy scores.
I started with a baseline of simply guessing that every mushroom was edible, which made me right about 52% of the time. If you’re going to eat a mushroom, and picking wrong would be very painful and possibly deadly, those are pretty bad odds.
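That baseline is just the share of the majority class, which can be sketched in a few lines (the 52/48 split below is illustrative, matching the rough proportions in the dataset):

```python
from collections import Counter

# Hypothetical labels mirroring a roughly 52/48 edible/poisonous split.
labels = ["e"] * 52 + ["p"] * 48

# Always guessing the most common class scores its relative frequency.
majority_class, majority_count = Counter(labels).most_common(1)[0]
baseline_accuracy = majority_count / len(labels)
```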
My first trial was with a logistic regression model, which gave an accuracy score of 79.1%. That is loads better than the baseline, but I’d still feel pretty wary of relying on those odds. Finally, I took an XGBClassifier model and fit it to the dataset. I had a hard time believing it, but I got a score of 100% accuracy in telling whether a mushroom is poisonous or not, which generally doesn’t happen in machine learning models unless there is some data leakage.
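A sketch of that kind of pipeline, using one-hot encoding and scikit-learn’s GradientBoostingClassifier as a stand-in for XGBClassifier. The data and target here are synthetic and deterministic, so the near-perfect score is by construction, not a claim about the real dataset:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Toy categorical frame; real work would load the UCI mushroom CSV.
df = pd.DataFrame({
    "spore-print-color": rng.choice(["chocolate", "white", "black"], 400),
    "gill-size": rng.choice(["broad", "narrow"], 400),
})
# Synthetic target tied exactly to the features, for illustration only.
y = ((df["spore-print-color"] == "chocolate")
     | (df["gill-size"] == "narrow")).astype(int)

X = pd.get_dummies(df)  # one-hot encode the categorical features
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
acc = accuracy_score(y_test, model.predict(X_test))
```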
However, after scouring the data for any leakage (a situation where the target is somehow hidden within the rest of the features), I couldn’t find any. That doesn’t mean there isn’t any; it means that if there is, I don’t know enough about mushrooms to recognize it. I also checked with a confusion matrix to see whether the score had simply been rounded up, and as seen to the left, the model gave no wrong answers. I even tried it on my test set, which had been held out until the end to ensure the model had never seen that data, and it still scored 100%.
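Checking for rounding with a confusion matrix amounts to looking at the off-diagonal counts, which tally the mistakes. A sketch with made-up (and deliberately perfect) predictions:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels matching the post's perfect score.
y_true = ["p", "e", "e", "p", "p", "e"]
y_pred = ["p", "e", "e", "p", "p", "e"]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=["e", "p"])
# Off-diagonal entries count the mistakes; all zero means no wrong answers.
```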
Finally, I took the full dataset and plotted it as a correlation matrix, as seen above. The top row shows the relationship between edibility and each of the other columns, and not one of them has a correlation higher than 0.5. This suggests that no single column has a direct enough relationship to edibility to be considered data leakage. Unless there is something about mushrooms that is synonymous with edibility that I don’t know about, at my current skill level it looks like this is simply a highly predictive dataset that lends itself well to machine learning.
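That check can be sketched with pandas’ `corr()`. In the toy frame below (made up for illustration), one column is deliberately built to be perfectly correlated with the class, to show what actual leakage would look like next to a weakly related column:

```python
import pandas as pd

# Toy encoded frame; 'class' is 1 for poisonous, 0 for edible.
df = pd.DataFrame({
    "class":     [1, 0, 0, 1, 1, 0],
    "gill-size": [0, 1, 1, 0, 0, 1],  # exact mirror of class -> leakage
    "cap-shape": [1, 1, 0, 0, 1, 0],  # weak relationship -> fine
})

corr = df.corr()
# A |correlation| near 1.0 with 'class' is a leakage red flag;
# moderate values (like the post's <= 0.5) are not.
```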