201 Data Sufficiency
Use Capacity Progression to answer two questions:
Does my data have a discernible function?
How much data do I really need?
"Capacity progression measures the learnability of a dataset, by plotting the number of decisions needed to memorize the function presented by the training data relative to the number of instances presented to the predictor (for an ideal model)." (From the Brainome Glossary)
Random Data Set Capacity Progression (CP)
Deterministic Data Set CP
Real World Data Set CP
Measures to Improve CP
This notebook assumes brainome is installed as per notebook brainome_101_Quick_Start.
The training data sets are:
test_data9.csv for random data.
vehicle.csv for deterministic data.
titanic_train.csv for real world data.
!python3 -m pip install brainome --quiet
!brainome -version
1. Random Data Set Capacity Progression
The Capacity Progression of a random data set keeps increasing across the whole range of sample sizes and never reaches a plateau.
!brainome https://download.brainome.ai/data/public/test_data9.csv -y -target SomeEmail -measureonly | grep -A 1 Capacity -
Capacity Progression: at [ 5%, 10%, 20%, 40%, 80%, 100% ]
Ideal Machine Learner: 6, 7, 8, 9, 9, 10
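To work with these numbers in later cells, the report printed above can be turned into pairs of sample size and capacity value. The following is a minimal sketch, not part of brainome: the helper name `parse_capacity_progression` is hypothetical, and it assumes the report keeps the exact wording shown in this notebook's output.

```python
import re


def parse_capacity_progression(text):
    """Return [(percent_of_data, capacity_value), ...] from the report text.

    Hypothetical helper: it assumes the "Capacity Progression: at [...]" and
    "Ideal Machine Learner: ..." wording shown in this notebook's output.
    """
    pct_match = re.search(r"Capacity Progression:\s*at\s*\[([^\]]*)\]", text)
    cap_match = re.search(r"Ideal Machine Learner:\s*([\d,\s]+)", text)
    if not pct_match or not cap_match:
        raise ValueError("no Capacity Progression report found")

    # Percentages look like "5%", capacity values are plain integers.
    percents = [int(p) for p in re.findall(r"(\d+)\s*%", pct_match.group(1))]
    values = [int(v) for v in re.findall(r"\d+", cap_match.group(1))]
    if len(percents) != len(values):
        raise ValueError("percentages and capacity values do not line up")
    return list(zip(percents, values))


# Output copied from the random data set run above.
sample = """Capacity Progression: at [ 5%, 10%, 20%, 40%, 80%, 100% ]
Ideal Machine Learner: 6, 7, 8, 9, 9, 10"""

print(parse_capacity_progression(sample))
```

In a notebook, the text could come from capturing the cell output of the `!brainome ... | grep -A 1 Capacity -` command instead of a pasted string.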
2. Deterministic Data Set CP
The Capacity Progression of a deterministic data set displays a plateau. In this case, 40% of the data has enough information content to train a strong model.
!brainome https://download.brainome.ai/data/public/vehicle.csv -y -measureonly | grep -A 1 Capacity -
Capacity Progression: at [ 5%, 10%, 20%, 40%, 80%, 100% ]
Ideal Machine Learner: 6, 7, 8, 9, 9, 9
3. Real World Data Set CP
The Capacity Progression of a real world data set falls between these two cases. Here the values keep growing up to 80% of the data and only then level off.
!brainome https://download.brainome.ai/data/public/titanic_train.csv -y -measureonly | grep -A 1 Capacity -
Capacity Progression: at [ 5%, 10%, 20%, 40%, 80%, 100% ]
Ideal Machine Learner: 6, 7, 8, 8, 9, 9
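The plateau reading used in the three sections above can be sketched in a few lines. This is a simple heuristic written for this notebook, not brainome's own criterion: the hypothetical function `plateau_start` reports the smallest sample size from which the capacity value stays equal to its final value, and returns `None` when the final value appears only at 100% of the data.

```python
def plateau_start(percents, values):
    """Return the smallest percentage from which `values` stays at its final
    value, or None if the progression is still rising at the last step.

    Hypothetical heuristic for illustration; it looks only at the trailing
    run of equal values and ignores flat stretches earlier in the series.
    """
    final = values[-1]
    start = len(values) - 1
    # Walk backwards while the previous sample size already reached `final`.
    while start > 0 and values[start - 1] == final:
        start -= 1
    if start == len(values) - 1:
        return None  # the last step still increased: no plateau observed
    return percents[start]


percents = [5, 10, 20, 40, 80, 100]

# Capacity values copied from the three runs in this notebook.
runs = {
    "test_data9.csv (random)": [6, 7, 8, 9, 9, 10],
    "vehicle.csv (deterministic)": [6, 7, 8, 9, 9, 9],
    "titanic_train.csv (real world)": [6, 7, 8, 8, 9, 9],
}

for name, values in runs.items():
    print(name, "->", plateau_start(percents, values))
```

With the values above, the random data set yields `None`, the deterministic one 40, and the real world one 80, which matches the readings given in the text. With only six sample sizes the result is coarse, so it is best treated as a rough guide.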
4. How to improve your data set’s CP
Typical actions to improve the learnability of your data:
Add more instances
Check out brainome_203_Right_Sizing_Data_Sets