The BigML clustering algorithm groups data points by their proximity to one another. How this proximity is computed depends on the field type:
- For numeric fields, proximity is measured with the Euclidean distance, and the algorithm minimizes the total distance from each data point to its assigned centroid.
- For categorical fields, BigML uses a special binary distance (0 or 1) function where:
    if valA == valB then
        distance = 0
    else
        distance = 1 (or a user-defined scale value)
    endif
BigML also assigns the most common category among the member instances as the centroid, and then computes the Euclidean distance as normal.
- For text and items fields, BigML follows a different approach and uses cosine similarity to calculate the distance. The terms the algorithm picks for a centroid are those that minimize the average cosine distance between the centroid and the points in its neighborhood.
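
The three distance rules above can be illustrated with a minimal Python sketch. This is not BigML's implementation: the function names, the `scale` parameter, and the choice to treat text as raw term counts are assumptions made for illustration only. How BigML weights and combines the per-field distances is not shown here.

```python
import math
from collections import Counter


def numeric_distance(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def categorical_distance(val_a, val_b, scale=1.0):
    """Binary distance: 0 when the categories match, otherwise `scale`.

    `scale` stands in for the user-defined scale value (hypothetical name).
    """
    return 0.0 if val_a == val_b else scale


def categorical_centroid(values):
    """Most common category among the member instances."""
    return Counter(values).most_common(1)[0][0]


def cosine_distance(terms_a, terms_b):
    """1 - cosine similarity between two bags of terms.

    Terms are weighted by raw counts, which is an assumption of this sketch.
    """
    a, b = Counter(terms_a), Counter(terms_b)
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        # An empty bag shares no terms with anything: treat as maximally distant.
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)
```

For example, `numeric_distance([0, 0], [3, 4])` gives the familiar 3-4-5 result, `categorical_centroid(["red", "blue", "red"])` picks `"red"`, and two bags of terms with no overlap are at cosine distance 1.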