For classification models, the evaluation results are based on the following metrics or measures:
- Accuracy: the number of correct predictions divided by the total number of instances evaluated.
- Precision: measures how reliable your positive predictions are. The higher this number, the more of the instances you predicted as positive really were positive. If this is a low score, you predicted a lot of positives where there were none (false positives).
- Recall: measures how many of the actual positives you found. If this score is high, you missed few positives; as it gets lower, you are failing to predict positives that are actually there (false negatives).
- F-Measure or F1-score: this is the balanced harmonic mean of Recall and Precision, giving both metrics equal weight. The higher the F-Measure is, the better.
- Phi Coefficient: unlike the F-Measure, the Phi Coefficient explicitly takes the True Negatives into account. Choosing between the two depends on how much attention you want to pay to all of those negative examples you’re getting right. If those are very important to you (as they might be in medical diagnosis), Phi may be better. If they’re not very important (as in trying to return relevant search results), F-Measure is often more appropriate.
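The metrics above can all be derived from the four confusion-matrix counts (true/false positives and negatives). The sketch below uses the standard textbook definitions for illustration; it is not necessarily how BigML computes them internally.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the standard evaluation metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    # Precision: of everything predicted positive, how much really was.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of everything actually positive, how much was found.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall, equal weight to both.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    # Phi (Matthews) coefficient: uses all four counts, including TN.
    denom = ((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) ** 0.5
    phi = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "phi": phi}

# Example with made-up counts: 40 TP, 10 FP, 45 TN, 5 FN.
metrics = classification_metrics(40, 10, 45, 5)
```

Note how Phi reacts to the true-negative count while F1 ignores it: with the same TP, FP, and FN, adding or removing correctly classified negatives changes Phi but leaves F1 untouched.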
Knowing whether these measures are good enough to start making predictions with unseen data is quite subjective. You will need to decide what level of accuracy you are willing to accept before trusting your model. For instance, imagine you own a supermarket and you stock two kinds of products: fresh ones that need to be sold quickly, and other products you do not mind keeping in stock (though you would still like to maintain optimal stock levels). In the second case, a model with 85% accuracy may be acceptable for you. For the fresh products, however, a higher accuracy might be required.
We recommend that you read the evaluations chapter of the BigML Dashboard documentation to gain more insight into these measures and understand how BigML computes them.