BigML supports many proven Machine Learning techniques including decision tree models, ensembles (e.g. Bagging, Random Decision Forests, Boosted Trees), clusters (K-means and G-means), anomalies or Anomaly Detection (Isolation Forest), associations or Association Discovery, Logistic Regression and Topic Models (LDA).
- Models or Decision Tree Models:
The BigML CART-style binary decision tree predicts the value of the target field based on the input fields. It can be used for both classification and regression problems. Each tree node tries to split the data in the best possible way: classification splits maximize information gain and regression splits minimize squared error. For text fields, each word is treated as a separate value, in effect becoming a token.
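The two split criteria mentioned above can be sketched in a few lines. This is a minimal illustration of information gain and squared error on toy data, not BigML's implementation; the function names and the sample labels are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(parent, left, right):
    """Entropy reduction obtained by splitting `parent` into `left` and `right`."""
    total = len(parent)
    weighted = ((len(left) / total) * entropy(left)
                + (len(right) / total) * entropy(right))
    return entropy(parent) - weighted


def squared_error(values):
    """Sum of squared deviations from the mean (criterion for regression splits)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


# Hypothetical toy labels: a split that separates the classes perfectly
# gains the full entropy of the parent; a split that changes nothing gains zero.
labels = ["yes", "yes", "no", "no"]
perfect_gain = information_gain(labels, ["yes", "yes"], ["no", "no"])
useless_gain = information_gain(labels, ["yes", "no"], ["yes", "no"])
```

A tree builder would evaluate candidate splits at each node and keep the one with the highest gain (classification) or the lowest total squared error across both children (regression).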
The main advantage of the BigML decision tree models is that they are easy to understand compared to other Machine Learning techniques. Decision tree models express human-readable rules that can be exported to make new predictions. They can handle redundant or irrelevant variables and offer multiple strategies for handling missing data. Furthermore, because these models are simple, they are easy to tune.
If you are a developer and want to know more about the BigML decision tree models, please click here for more detailed instructions. This information is also available for creating and configuring models through the BigML Dashboard; please read the models chapter.
- Ensembles:
An ensemble combines several individual models built from different subsamples of your data. Ensembles are a robust method that usually reduces overfitting and increases model performance. Random Decision Forests are among the top-tier performers of all Machine Learning algorithms.
Currently, you can build ensembles following three basic Machine Learning techniques: Bagging, Random Decision Forests, and Boosting. Please read the ensembles chapter of this document to see how you can employ these powerful techniques through the BigML Dashboard, or click here to do it through the BigML API.
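As a rough illustration of Bagging, the sketch below trains several tiny one-split trees ("stumps") on bootstrap samples of the data and combines them by majority vote. The stump learner, the function names, and the toy rows are assumptions made for this example; BigML's ensembles use full decision trees and are configured through the Dashboard or API.

```python
import random
from collections import Counter


def train_stump(rows):
    """Fit a one-split tree on (x, label) rows by picking the threshold with fewest errors."""
    best = None
    for threshold, _ in rows:
        left = [label for x, label in rows if x <= threshold]
        right = [label for x, label in rows if x > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0] if right else left_label
        errors = (sum(label != left_label for label in left)
                  + sum(label != right_label for label in right))
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    _, threshold, left_label, right_label = best
    return lambda x: left_label if x <= threshold else right_label


def bagging(rows, n_models=25, seed=0):
    """Train `n_models` stumps on bootstrap samples and return a voting predictor."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        # Bootstrap sample: draw len(rows) rows with replacement.
        sample = [rng.choice(rows) for _ in rows]
        models.append(train_stump(sample))

    def predict(x):
        votes = Counter(model(x) for model in models)
        return votes.most_common(1)[0][0]

    return predict


# Hypothetical toy data: small x values are "low", large ones are "high".
rows = [(1, "low"), (2, "low"), (3, "low"), (7, "high"), (8, "high"), (9, "high")]
predict = bagging(rows)
```

Because each model sees a slightly different sample, individual mistakes tend to average out in the vote, which is the intuition behind the reduced overfitting mentioned above.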
- Logistic Regression:
Known as the workhorse of Machine Learning algorithms, logistic regression is a supervised Machine Learning method for solving classification problems. Available through the BigML Dashboard and API, it seeks to learn the coefficient values from the training data using non-linear optimization techniques. It commonly serves as a benchmark model for other techniques due to its simplicity and fast training speed.
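To make "learning the coefficient values" concrete, here is a minimal sketch that fits an intercept and one coefficient with plain gradient descent on the log loss. Gradient descent is used only because it is short to write; it is an assumption for this example and not necessarily the optimizer BigML uses. The data and function names are hypothetical.

```python
import math


def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, learning_rate=0.1, epochs=2000):
    """Learn (intercept, slope) for a single numeric predictor by gradient descent."""
    intercept, slope = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_intercept = grad_slope = 0.0
        for x, y in zip(xs, ys):
            error = sigmoid(intercept + slope * x) - y  # gradient of the log loss
            grad_intercept += error
            grad_slope += error * x
        intercept -= learning_rate * grad_intercept / n
        slope -= learning_rate * grad_slope / n
    return intercept, slope


def predict_probability(coefficients, x):
    """Probability of the positive class for input x."""
    intercept, slope = coefficients
    return sigmoid(intercept + slope * x)


# Hypothetical toy data: the positive class becomes likely as x grows.
xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]
coefficients = fit_logistic(xs, ys)
```

Calling `predict_probability(coefficients, 1)` should give a probability below 0.5 and `predict_probability(coefficients, 6)` one above it.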
You can create a logistic regression model by selecting which fields from your dataset you want to use as input fields (or predictors) and which categorical field you want to predict (the objective field). Learn more about Logistic Regression in its release page. You will find documentation on how to use Logistic Regression with the BigML Dashboard and the BigML API. You can also watch the webinar, see the slideshow, and read a series of blog posts about this topic.
- Time Series:
Time Series is a supervised learning method for analyzing time-based data when historical patterns can explain future behavior. This resource is available from the BigML Dashboard and API, and from WhizzML for automation. It is commonly used for predicting stock prices, sales forecasting, website traffic, production and inventory analysis, and weather forecasting, among many other use cases.
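One classic way to let historical values drive a forecast is simple exponential smoothing, sketched below. It is shown purely as an illustration of the idea; this text does not state which models BigML fits, and the function name, the `alpha` value, and the sales figures are assumptions.

```python
def exponential_smoothing_forecast(series, alpha):
    """One-step-ahead forecast: a weighted average that favors recent values.

    `alpha` (between 0 and 1) controls how quickly older observations fade.
    """
    level = series[0]
    for value in series[1:]:
        level = alpha * value + (1 - alpha) * level
    return level


# Hypothetical monthly sales figures.
sales = [100, 110, 105, 120]
next_month = exponential_smoothing_forecast(sales, alpha=0.5)  # 112.5
```

A larger `alpha` reacts faster to recent changes, while a smaller one produces a smoother, slower-moving forecast.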
Please visit the dedicated release page to learn all you need to know about Time Series. It includes a series of six blog posts, the BigML Dashboard and API documentation, the webinar slideshow as well as the full webinar recording.
- Clusters:
Clustering splits your data into several groups of similar instances, or clusters, to help you analyze, explore and filter your data. You can use it before training a model (where membership in a given cluster can become another input field) or simply to get a different overview of your data. This type of modeling offers several strategies for handling missing values.
You can select different clustering techniques: K-means, where you specify in advance the number of clusters to be found, and G-means, where BigML learns the number of clusters by iteratively taking existing clusters and testing whether each cluster's neighbourhood appears Gaussian in its distribution. Here you can find a deeper explanation about how clusters work in the BigML Dashboard. If you are a developer, click here to learn how to build clusters programmatically.
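The K-means idea can be sketched as a short loop that alternates between assigning points to the nearest centroid and moving each centroid to the mean of its points. This is a bare-bones illustration with hypothetical function names, starting centroids, and data; it omits the scaling, missing-value handling, and G-means testing that a production service would add.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def kmeans(points, centroids, iterations=10):
    """Run a fixed number of assign-and-update rounds; returns (centroids, clusters)."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: squared_distance(point, centroids[i]))
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its members
        # (an empty cluster keeps its previous centroid).
        centroids = [mean_point(members) if members else centroids[i]
                     for i, members in enumerate(clusters)]
    return centroids, clusters


# Hypothetical toy data with two obvious groups; k = 2 is chosen in advance.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, centroids=[(0, 0), (10, 10)])
```

The cluster index assigned to each point is what could later be used as an extra input field when training another model.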
- Anomalies or Anomaly Detector:
The Anomaly Detector identifies the instances within a dataset that do not conform to an expected pattern. BigML uses the Isolation Forest algorithm to detect anomalies. This algorithm uses an ensemble of randomized trees to generate anomaly scores. The basic idea is to overfit decision tree models and generate an anomaly score based on how many splits are needed to isolate an instance from the rest of the data points. As such, this algorithm does not need labeled data, which some less versatile anomaly detection methods require.
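The "how many splits are needed to isolate an instance" idea can be sketched on one-dimensional data. The code below is a simplified illustration, not the full Isolation Forest algorithm: it skips subsampling and the normalization that turns depths into scores, and the function names and data are hypothetical.

```python
import random


def isolation_depth(point, data, rng, depth=0, max_depth=50):
    """Count the random splits needed to separate `point` from the other values."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    low, high = min(data), max(data)
    if low == high:
        return depth
    split = rng.uniform(low, high)
    # Keep only the values that fall on the same side of the split as the point.
    same_side = [v for v in data if (v < split) == (point < split)]
    return isolation_depth(point, same_side, rng, depth + 1, max_depth)


def average_depth(point, data, n_trees=200, seed=0):
    """Average isolation depth over many random trees; lower means more anomalous."""
    rng = random.Random(seed)
    return sum(isolation_depth(point, data, rng) for _ in range(n_trees)) / n_trees


# Hypothetical toy data: 100 sits far away from the other values.
data = [10, 11, 12, 13, 14, 15, 100]
outlier_depth = average_depth(100, data)
inlier_depth = average_depth(12, data)
```

The outlier is usually cut off by the very first split, while a value in the middle of the dense group needs several, so `outlier_depth` comes out smaller than `inlier_depth`.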
Anomaly Detectors, also called anomalies, are scalable, competitive and almost parameter-free. They can handle missing data and categorical fields, and explain which fields contributed most to an anomaly. There is no data rescaling needed nor distance metric required.
- Associations or Association Discovery:
BigML is the first Machine Learning service offering Association Discovery in the cloud. Association Discovery, also called associations, is a well-known method to find interesting associations between values, rather than variables, in high-dimensional datasets. It can discover meaningful relations among values across thousands of variables, which traditional statistical methods cannot deal with. Association Discovery is commonly used for a wide variety of purposes, such as market basket analysis, web usage patterns, intrusion detection, fraud detection, and bioinformatics (for example, analyzing public genomic and proteomic databases), among others.
Specifically, BigML uses Magnum Opus technology, a proprietary implementation of the standard Association Discovery algorithms, optimized for performance. It was developed over the course of a decade by Professor Geoff Webb of Melbourne’s Monash University and was acquired by BigML in July 2015.
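For readers new to the topic, the sketch below computes three measures commonly used to judge an association rule (support, confidence, and lift) on a tiny market basket. It illustrates the general concept only and says nothing about how Magnum Opus works internally; the function names and transactions are hypothetical.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= transaction for transaction in transactions) / len(transactions)


def rule_metrics(transactions, antecedent, consequent):
    """Support, confidence and lift for the rule antecedent -> consequent."""
    both = support(transactions, set(antecedent) | set(consequent))
    confidence = both / support(transactions, antecedent)
    lift = confidence / support(transactions, consequent)
    return {"support": both, "confidence": confidence, "lift": lift}


# Hypothetical market baskets.
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread"},
    {"milk"},
]
metrics = rule_metrics(transactions, {"bread"}, {"butter"})
# support = 0.5, confidence = 2/3, lift = 4/3
```

A lift above 1 suggests that the two values occur together more often than they would if they were independent.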
- Topic Models:
With Topic Models, BigML's implementation of the underlying Latent Dirichlet Allocation (LDA) technique, words in your text data that often occur together are grouped into different topics. With the model, you can assign a given instance a score for each topic, which indicates the relevance of that topic to the instance. You can then use these topic scores as input features to train other models, as a starting point for collaborative filtering, or for assessing document similarity, among many other uses.
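As an example of the document-similarity use mentioned above, per-topic scores can be treated as a vector and compared with cosine similarity. The sketch assumes the topic scores have already been obtained from a model; the score values and the function name are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length score vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical topic scores for three documents over three topics.
doc_a = [0.70, 0.20, 0.10]
doc_b = [0.60, 0.30, 0.10]
doc_c = [0.05, 0.05, 0.90]

similar = cosine_similarity(doc_a, doc_b)
different = cosine_similarity(doc_a, doc_c)
```

Documents dominated by the same topics score close to 1, so `similar` is much larger than `different` here.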
To find out more about Topic Models, we recommend that you visit the dedicated release page and check our documentation on creating, interpreting, and predicting with Topic Models through the BigML Dashboard and API. A series of six blog posts and the complete webinar video also guide you through this resource.