Confidence is based on two variables:
- The purity of the terminal node
- The number of instances in the node
The purity of the node gives us a base probability of being correct. If there are, say, 10 examples at the node and eight of them are positive, then when BigML predicts positive the prediction should be right eighty percent of the time. However, since there are only ten examples, the estimate of that probability might be wrong; the true value might be seventy or ninety percent, which could only be discovered by looking at more data. Since presumably there isn't more data, BigML uses a formula that balances the proportion of the predicted class against the uncertainty of having only a small number of instances. This formula is the lower bound of the Wilson score interval. In the previous example, where eight out of ten instances belong to the predicted class, the confidence given by that formula is 49%.
The confidence BigML reports can therefore be considered a pessimistic estimate of the accuracy at the terminal node: we can be fairly sure that it is lower than the true accuracy at that node.
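The calculation can be sketched in a few lines of Python. This is a minimal illustration of the standard Wilson score lower bound, not BigML's own code. The function name `wilson_lower_bound` and the default `z = 1.96` (a 95% confidence level) are assumptions chosen for the example; that value of `z` reproduces the 49% figure for eight out of ten.

```python
import math


def wilson_lower_bound(correct: int, total: int, z: float = 1.96) -> float:
    """Lower bound of the Wilson score interval for a proportion.

    correct: instances at the node that belong to the predicted class
    total:   all instances at the node
    z:       normal quantile; 1.96 (95% level) is an assumption made here
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= correct <= total:
        raise ValueError("correct must be between 0 and total")

    p = correct / total  # purity of the node
    z2 = z * z
    centre = p + z2 / (2 * total)
    margin = z * math.sqrt(p * (1 - p) / total + z2 / (4 * total * total))
    return (centre - margin) / (1 + z2 / total)


# Eight of ten instances in the predicted class: roughly 49%
print(f"{wilson_lower_bound(8, 10):.0%}")
```

With the same purity, more instances push the bound toward the observed proportion: 80 out of 100 gives a noticeably higher value than 8 out of 10. This is the balance between purity and node size described above.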
The image below shows how BigML displays the confidence in the BigML Dashboard. In the model shown, the selected prediction path predicts that 60 instances (customers), which represent 2.25% of the data, will churn at the end of the month, with a confidence of 93.98%.
For regression trees (models with a numeric objective field), BigML reports an expected error, which follows the same approach as the confidence for classification models. You can expect the actual prediction error to be lower than the expected error shown at the node you are analyzing. In other words, BigML reports a value for the error that has a high probability of being worse than the average error you will see when using the model.