BigML offers two different algorithms for clustering: K-means and G-means. Both algorithms group the most similar instances in your dataset. The difference between them is how they accomplish the task.
- With K-means you need to select the number of clusters to create. It is up to you to decide how each field in your dataset influences which group each instance belongs to. K-means is the default algorithm when you select CONFIGURE CLUSTER from the configure option menu. By default, BigML creates eight clusters and applies automatic scaling to all the numeric fields. The number of clusters, the weighting, and the field scaling can easily be modified when you configure your cluster from the BigML Dashboard (see image below) or through the BigML API by defining the appropriate arguments. K-means is useful when you already know how many clusters to expect when grouping the instances in your dataset.
- G-means (Gaussian-means algorithm), on the other hand, is the default algorithm for the 1-click action menu. It discovers the number of clusters automatically, using a statistical test to decide whether to split a k-means center into two. The algorithm takes a hierarchical approach: it tests the hypothesis that a subset of the data follows a Gaussian distribution (a continuous, bell-shaped distribution that, among other things, approximates the binomial distribution of events), and if the hypothesis is rejected it splits the cluster. It starts with a small number of centers, say one cluster only (k=1), then splits it into two centers (k=2), and then splits each of these two centers again, giving four centers in total (k=4). If G-means does not accept these four centers, the answer is the previous step: two centers in this case (k=2). This is the number of clusters your dataset will be divided into. G-means is very useful when you do not have an estimate of the number of clusters you will get after grouping your instances. Note that a poor choice of the "k" parameter can give you misleading results.
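
The split-and-test loop described for G-means can be sketched in a few lines of Python. This is only a simplified, one-dimensional illustration, not BigML's implementation. It assumes an Anderson-Darling statistic as the Gaussian test, an illustrative `critical_value` of 1.0, and a hypothetical `min_size` guard. The function names (`anderson_darling`, `two_means_1d`, `g_means_1d`, `gaussian_quantiles`) are made up for this example. It also skips the global k-means refinement that a full G-means would normally run after each round of splits.

```python
import math
from statistics import NormalDist, fmean, pstdev


def anderson_darling(values):
    """Anderson-Darling statistic for normality; larger means less Gaussian."""
    n = len(values)
    mu, sigma = fmean(values), pstdev(values)
    if sigma == 0:
        return 0.0  # all values identical: nothing to split
    standard = NormalDist()
    eps = 1e-12  # keeps log() finite for extreme values
    u = [min(max(standard.cdf((v - mu) / sigma), eps), 1 - eps)
         for v in sorted(values)]
    total = sum((2 * i + 1) * (math.log(u[i]) + math.log(1 - u[n - 1 - i]))
                for i in range(n))
    return -n - total / n


def two_means_1d(values, left, right, max_iterations=50):
    """Plain k-means with k=2 on 1-D data, starting from the given centers."""
    near_left, near_right = [], []
    for _ in range(max_iterations):
        near_left = [v for v in values if abs(v - left) <= abs(v - right)]
        near_right = [v for v in values if abs(v - left) > abs(v - right)]
        if not near_left or not near_right:
            break
        new_left, new_right = fmean(near_left), fmean(near_right)
        if (new_left, new_right) == (left, right):
            break  # centers stopped moving
        left, right = new_left, new_right
    return near_left, near_right


def g_means_1d(values, critical_value=1.0, min_size=8):
    """Split clusters until each one passes the Gaussian test.

    critical_value and min_size are illustrative assumptions,
    not BigML defaults.
    """
    pending = [list(values)]  # clusters still to be tested
    accepted = []             # clusters that look Gaussian
    while pending:
        cluster = pending.pop()
        if len(cluster) < min_size or anderson_darling(cluster) <= critical_value:
            accepted.append(cluster)
            continue
        center, spread = fmean(cluster), pstdev(cluster)
        # Try two child centers on either side of the parent center.
        a, b = two_means_1d(cluster, center - spread, center + spread)
        if not a or not b:
            accepted.append(cluster)
        else:
            pending.extend([a, b])
    return sorted(fmean(c) for c in accepted)


def gaussian_quantiles(mu, sigma, n):
    """Deterministic, evenly spaced quantiles of a Gaussian (toy data)."""
    dist = NormalDist(mu, sigma)
    return [dist.inv_cdf((i + 0.5) / n) for i in range(n)]


# Two well-separated Gaussian groups: the first test rejects the single
# cluster, the two children each pass, so the loop stops at two centers.
data = gaussian_quantiles(0.0, 1.0, 200) + gaussian_quantiles(10.0, 1.0, 200)
centers = g_means_1d(data)
print(len(centers), [round(c, 3) for c in centers])
```

With this toy data the sketch settles on two centers, close to 0 and 10, without being told the number of clusters in advance. With K-means you would have to supply that number yourself.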