Datasets are the fundamental building block of your BigML workflows and the starting point for any modeling procedure. A dataset in BigML is a structured version of your data. BigML computes general statistics for the dataset as a whole as well as individual statistics per field. The general statistics include the count of valid instances, missing values, and errors. For each field, BigML computes statistics such as the minimum, mean, median, maximum, standard deviation, kurtosis, skewness, and term counts. The statistics provided per field depend on the field's type (numeric, categorical, text and items, or date-time).
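As a rough illustration, here is a minimal sketch of building a dataset and reading its per-field statistics with the BigML Python bindings (`pip install bigml`). The file name `iris.csv` is a placeholder, and the snippet assumes your credentials are available in the `BIGML_USERNAME` and `BIGML_API_KEY` environment variables.

```python
from bigml.api import BigML

# Assumes BIGML_USERNAME and BIGML_API_KEY are set in the environment.
api = BigML()

# Create a source from a local CSV (placeholder name) and build a dataset.
source = api.create_source("iris.csv")
api.ok(source)                        # wait until the source is ready
dataset = api.create_dataset(source)
api.ok(dataset)                       # wait until the dataset is ready

# Each field carries a "summary" with its per-field statistics;
# the available keys depend on the field's type (optype).
for field_id, field in dataset["object"]["fields"].items():
    summary = field.get("summary", {})
    print(field["name"], field["optype"],
          summary.get("mean"), summary.get("standard_deviation"))
```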
The main goal of datasets is to enable effective wrangling of your data so that you can build the right BigML model for your problem. This is a key step toward ultimately achieving the best results in your Machine Learning tasks.
Datasets can be built from an existing resource; each option below is illustrated in the sketch after this list:
- Source
- Dataset, to sample your data
- Dataset, to split it into training and test subsets
- Dataset, to filter its rows
- Dataset, to extend it with new fields
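As a rough sketch of these options, the snippet below reuses the `api` and `dataset` objects from the previous example. The sampling rates, the seed string, and the field id `"000002"` are illustrative assumptions; the creation arguments (`sample_rate`, `out_of_bag`, `lisp_filter`, `new_fields`) are the ones documented for the Datasets API.

```python
# Sample: keep a random 50% of the rows.
sample = api.create_dataset(dataset, {"sample_rate": 0.5})

# Split: an 80/20 train/test split; reusing the same seed with
# out_of_bag=True yields the complementary 20% of rows.
train = api.create_dataset(dataset, {"sample_rate": 0.8, "seed": "my split"})
test = api.create_dataset(
    dataset, {"sample_rate": 0.8, "seed": "my split", "out_of_bag": True})

# Filter: keep only the rows where field "000002" (an assumed
# numeric field id) is greater than 1.5.
filtered = api.create_dataset(
    dataset, {"lisp_filter": '(> (f "000002") 1.5)'})

# Extend: add a new field computed with a Flatline expression.
extended = api.create_dataset(
    dataset, {"new_fields": [{"name": "doubled",
                              "field": '(* 2 (f "000002"))'}]})

for resource in (sample, train, test, filtered, extended):
    api.ok(resource)  # wait until each new dataset is ready
```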
To learn more about datasets, please:
- Watch this video for a gentle introduction to datasets. This demo showcases the options listed above.
- Read the detailed documentation about datasets here if you are using the BigML Dashboard, or check the Datasets API documentation for more details on the arguments and properties of a dataset.
To continue building your Machine Learning workflows in BigML, please:
- Visit these questions: What is a Model?, Which models does BigML work with?, What is a BigML resource?