BigML's algorithm is, in essence, inspired by Leo Breiman's CART decision tree models. However, it adds many proprietary components that make it far more scalable and able to handle multiple data types at once. BigML allows you to create models with numeric, categorical, date-time, and text fields (the last treated as a tokenized and stemmed bag of words).
- The Basics:
For classification models, BigML chooses splits that maximize information gain; for regression models, splits are chosen to reduce the mean squared error (equivalently, to maximize variance reduction). When generating candidate split points for numeric fields, BigML does not consider every possible threshold; instead, it uses streaming histograms to choose the split candidates (normally 32 candidates per split).
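As a rough illustration of the histogram trick (a sketch in the spirit of the Ben-Haim & Tom-Tov streaming histogram, not BigML's proprietary implementation), the snippet below caps the number of bins by merging the closest pair and reads split candidates off the gaps between bin centroids:

```python
import bisect

class StreamingHistogram:
    """Fixed-size histogram: (centroid, count) bins, merged on overflow."""

    def __init__(self, max_bins=32):
        self.max_bins = max_bins
        self.bins = []  # sorted list of [centroid, count] pairs

    def update(self, value):
        # Insert the value as a unit-count bin, keeping bins sorted.
        idx = bisect.bisect([b[0] for b in self.bins], value)
        self.bins.insert(idx, [value, 1])
        if len(self.bins) > self.max_bins:
            # Merge the two adjacent bins whose centroids are closest,
            # so memory stays O(max_bins) for any stream length.
            gaps = [self.bins[i + 1][0] - self.bins[i][0]
                    for i in range(len(self.bins) - 1)]
            i = gaps.index(min(gaps))
            (c1, n1), (c2, n2) = self.bins[i], self.bins[i + 1]
            self.bins[i:i + 2] = [[(c1 * n1 + c2 * n2) / (n1 + n2), n1 + n2]]

    def split_candidates(self):
        # One candidate threshold per adjacent pair of bins.
        return [(a[0] + b[0]) / 2 for a, b in zip(self.bins, self.bins[1:])]
```

A single pass over the data then yields at most 31 candidate thresholds per numeric field, regardless of how many distinct values it contains.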
- Strategies for missing data:
When encountering missing data in input fields during training, there are two possible approaches: either ignore the instances with missing values when generating candidate splits, or explicitly include them with the MIA approach (e.g., age > 30 or age is missing).
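As a minimal sketch of an MIA-style predicate (the field names and the `None`-for-missing convention are illustrative):

```python
def mia_partition(rows, field, threshold, missing_goes_high):
    """Split rows on `field > threshold`, with missing values (None)
    traveling together down one branch, per the MIA convention."""
    low, high = [], []
    for row in rows:
        value = row.get(field)
        if value is None:
            (high if missing_goes_high else low).append(row)
        elif value > threshold:
            high.append(row)
        else:
            low.append(row)
    return low, high
```

Training would score both settings of `missing_goes_high` with the usual split criterion and keep the better one, which is how a predicate like "age > 30 or age is missing" arises.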
When encountering missing data at prediction time, there are also two strategies: either BigML generates a prediction at the tree node whose split encounters the missing value (ignoring that node's children), or BigML evaluates both of the node's children and combines the predictions from the two subtrees in proportion to their training data (similar to the C4.5 algorithm and sometimes referred to as "distribution-based imputation").
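A sketch of both strategies over a toy node layout (dicts carrying a split `field` and `threshold`, `low`/`high` children, a training `count`, and a class `distribution`; this layout is illustrative, not BigML's actual model format):

```python
def predict_last(node, instance):
    # Strategy 1: stop at the node whose split field is missing and
    # return that node's own training distribution.
    while "field" in node:
        value = instance.get(node["field"])
        if value is None:
            break
        node = node["high"] if value > node["threshold"] else node["low"]
    return node["distribution"]

def predict_proportional(node, instance):
    # Strategy 2 (C4.5-style "distribution-based imputation"): on a
    # missing value, descend both subtrees and blend their predictions
    # in proportion to the training instances each subtree saw.
    if "field" not in node:
        return node["distribution"]
    value = instance.get(node["field"])
    if value is not None:
        child = node["high"] if value > node["threshold"] else node["low"]
        return predict_proportional(child, instance)
    low, high = node["low"], node["high"]
    total = low["count"] + high["count"]
    d_low = predict_proportional(low, instance)
    d_high = predict_proportional(high, instance)
    return {c: (d_low.get(c, 0.0) * low["count"] +
                d_high.get(c, 0.0) * high["count"]) / total
            for c in set(d_low) | set(d_high)}
```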
- Stopping criteria:
BigML's decision tree models stop growing when they have exhausted their ability to better fit the training data, when they hit a user-defined depth limit, or when they hit a user-defined node limit. When using the node limit, BigML chooses which nodes to expand by selecting those that reduce training error the most. This is more akin to an A* search than to a breadth-first or depth-first expansion.
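That expansion order amounts to best-first growth with a priority queue keyed on error reduction. A sketch, where `candidate_splits`, `error_reduction`, and `node.apply` are hypothetical stand-ins for the real split machinery:

```python
import heapq

def grow_best_first(root, node_limit, candidate_splits, error_reduction):
    """Expand whichever frontier node's best split most reduces
    training error, rather than growing breadth- or depth-first."""
    frontier, counter = [], 0  # counter breaks ties between equal gains

    def push(node):
        nonlocal counter
        splits = list(candidate_splits(node))
        if not splits:
            return
        best = max(splits, key=lambda s: error_reduction(node, s))
        gain = error_reduction(node, best)
        if gain > 0:
            heapq.heappush(frontier, (-gain, counter, node, best))
            counter += 1

    push(root)
    n_nodes = 1
    while frontier and n_nodes + 2 <= node_limit:
        _, _, node, split = heapq.heappop(frontier)
        node.apply(split)  # hypothetical: attaches node.low and node.high
        n_nodes += 2
        push(node.low)
        push(node.high)
    return root
```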
- Pruning:
BigML offers statistical pruning to help alleviate overfitting when training decision tree models. The pruning is based on statistical confidence estimates and is similar to the pruning in the C4.5 algorithm.
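For reference, C4.5's pruning statistic is an upper confidence bound on a leaf's binomial error rate (the Wilson score bound, with z ≈ 0.674 at C4.5's default 25% confidence). BigML's exact statistic is proprietary; the sketch below shows the C4.5-style test it resembles:

```python
from math import sqrt

def pessimistic_error(n, errors, z=0.674):
    """Wilson score upper bound on the true error rate of a leaf
    that misclassified `errors` of its `n` training instances."""
    if n == 0:
        return 1.0
    f, z2 = errors / n, z * z
    return (f + z2 / (2 * n) +
            z * sqrt(f * (1 - f) / n + z2 / (4 * n * n))) / (1 + z2 / n)

def should_prune(n, errors_as_leaf, children):
    # Collapse a subtree to a single leaf when the leaf's pessimistic
    # error is no worse than the count-weighted pessimistic error of
    # the leaves below it; `children` is a list of (n_i, errors_i).
    subtree = sum(n_i * pessimistic_error(n_i, e_i)
                  for n_i, e_i in children) / n
    return pessimistic_error(n, errors_as_leaf) <= subtree
```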
- Weighting:
BigML's decision tree models also accept weighted training data. For regression models, each instance may be weighted individually; for classification models, the data may instead be weighted by class.
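To make the classification case concrete, here is a hedged sketch of class-weighted entropy, where each instance contributes its class's weight rather than a raw count (illustrative only, not BigML's internals):

```python
from collections import Counter
from math import log2

def weighted_entropy(labels, class_weights):
    """Shannon entropy computed over class-weighted counts."""
    weights = Counter()
    for label in labels:
        weights[label] += class_weights.get(label, 1.0)
    total = sum(weights.values())
    return -sum((w / total) * log2(w / total)
                for w in weights.values() if w > 0)

# Upweighting a rare class 19x balances a 95/5 split exactly, so the
# weighted entropy reaches its 1-bit maximum:
labels = ["ok"] * 95 + ["fraud"] * 5
print(weighted_entropy(labels, {"fraud": 19.0, "ok": 1.0}))  # 1.0
```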
Please read the models chapter for more information about decision tree models.