201 Data Sufficiency
Use Capacity Progression (CP) to answer two questions: does my data have a discernible function, and how much data do I really need?
"Capacity progression measures the learnability of a dataset by plotting the number of decisions needed to memorize the function presented by the training data relative to the number of instances presented to the predictor (for an ideal model)." (From the Brainome Glossary)
Random Data Set Capacity Progression
Deterministic Data Set CP
Real World Data Set CP
Measures to Improve CP
Prerequisites
This notebook assumes brainome is installed as described in the notebook brainome_101_Quick_Start.
The training data sets are:
test_data9.csv for random data (sourced from Kaggle).
vehicle.csv for deterministic data.
titanic_train.csv for real world data.
!python3 -m pip install brainome --quiet
!brainome -version
WARNING: You are using pip version 22.0.3; however, version 22.0.4 is available.
You should consider upgrading via the '/opt/hostedtoolcache/Python/3.9.10/x64/bin/python3 -m pip install --upgrade pip' command.
/opt/hostedtoolcache/Python/3.9.10/x64/lib/python3.9/site-packages/xgboost/compat.py:31: FutureWarning: pandas.Int64Index is deprecated and will be removed from pandas in a future version. Use pandas.Index with the appropriate dtype instead.
from pandas import MultiIndex, Int64Index
brainome v1.8-120-prod
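1. Random Data Set Capacity Progression
The Capacity Progression of a random data set keeps increasing as more instances are presented and does not level off. In the output below, the value still rises on the final step, from 9 at 80% of the data to 10 at 100%.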
!brainome https://download.brainome.ai/data/public/test_data9.csv -y -target SomeEmail -measureonly | grep -A 1 Capacity -
Capacity Progression: at [ 5%, 10%, 20%, 40%, 80%, 100% ]
Ideal Machine Learner: 6, 7, 8, 9, 9, 10
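The shell command above can also be driven from Python. The following is a minimal sketch, not part of brainome itself: the helper names `build_measure_command`, `capacity_lines`, and `measure` are hypothetical, and the only flags used are the ones shown in this notebook (`-y`, `-target`, `-measureonly`). It assumes the report is written to standard output and that `brainome` is on the PATH.

```python
import subprocess


def build_measure_command(data_source, target=None):
    """Build the argument list for a measure-only run (flags as used in this notebook)."""
    command = ["brainome", data_source, "-y"]
    if target is not None:
        command += ["-target", target]
    command.append("-measureonly")
    return command


def capacity_lines(output_text):
    """Keep each line containing 'Capacity' plus the line after it, like `grep -A 1 Capacity`."""
    lines = output_text.splitlines()
    keep = set()
    for index, line in enumerate(lines):
        if "Capacity" in line:
            keep.update({index, index + 1})
    return [lines[index] for index in sorted(keep) if index < len(lines)]


def measure(data_source, target=None):
    """Run brainome and return only the capacity progression lines.

    Assumption: the report is printed on stdout; this is not verified here.
    """
    result = subprocess.run(
        build_measure_command(data_source, target),
        capture_output=True,
        text=True,
        check=False,
    )
    return capacity_lines(result.stdout)
```

Calling `measure("https://download.brainome.ai/data/public/test_data9.csv", target="SomeEmail")` would then correspond to the cell above, with the filtering done in Python rather than by `grep`.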
2. Deterministic Data Set CP
The Capacity Progression of a deterministic data set reaches a plateau. In this case the value stays at 9 from 40% of the data onward, which indicates that 40% of the data has enough information content to train a strong model.
!brainome https://download.brainome.ai/data/public/vehicle.csv -y -measureonly | grep -A 1 Capacity -
Capacity Progression: at [ 5%, 10%, 20%, 40%, 80%, 100% ]
Ideal Machine Learner: 6, 7, 8, 9, 9, 9
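Reading the plateau off the report can be automated. The sketch below parses the two-line report format shown above and returns the smallest percentage from which the value no longer changes. The function names are hypothetical, and the "at least two equal trailing measurements" rule for calling something a plateau is an assumption made for illustration, not a brainome definition.

```python
import re


def parse_capacity_progression(text):
    """Parse the two-line report into a list of (percent, value) pairs."""
    percents = values = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Capacity Progression"):
            percents = [int(p) for p in re.findall(r"(\d+)%", stripped)]
        elif stripped.startswith("Ideal Machine Learner"):
            values = [int(v) for v in re.findall(r"\d+", stripped.split(":", 1)[1])]
    if percents is None or values is None or len(percents) != len(values):
        raise ValueError("could not parse capacity progression output")
    return list(zip(percents, values))


def plateau_start(progression):
    """Return the smallest percentage from which the value stays constant.

    Returns None when the value still changes on the final step
    (assumption: a plateau needs at least two equal trailing measurements).
    """
    final_percent, final_value = progression[-1]
    start = None
    for percent, value in reversed(progression):
        if value != final_value:
            break
        start = percent
    return None if start == final_percent else start


report = """Capacity Progression: at [ 5%, 10%, 20%, 40%, 80%, 100% ]
Ideal Machine Learner: 6, 7, 8, 9, 9, 9"""
vehicle = parse_capacity_progression(report)
print(plateau_start(vehicle))  # 40
```

Applied to the random data set's values (6, 7, 8, 9, 9, 10), the same function returns `None`, because the value is still rising at the last step.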
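3. Real World Data Set CP
The Capacity Progression of a real-world data set falls between these two cases. Here the value rises more slowly than for the random data and levels off later than for the deterministic data: it reaches 9 at 80% of the data and stays there at 100%.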
!brainome https://download.brainome.ai/data/public/titanic_train.csv -y -measureonly | grep -A 1 Capacity -
Capacity Progression: at [ 5%, 10%, 20%, 40%, 80%, 100% ]
Ideal Machine Learner: 6, 7, 8, 8, 9, 9
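To compare the three reports side by side, it can help to look at how much the value grows at each step. The sketch below hard-codes the values printed in this notebook; the labels and the `step_growth` helper are hypothetical names chosen for illustration.

```python
# Measurement points and values copied from the three reports in this notebook.
PERCENTS = [5, 10, 20, 40, 80, 100]
PROGRESSIONS = {
    "test_data9 (random)": [6, 7, 8, 9, 9, 10],
    "vehicle (deterministic)": [6, 7, 8, 9, 9, 9],
    "titanic_train (real world)": [6, 7, 8, 8, 9, 9],
}


def step_growth(values):
    """Return the change in value between consecutive measurement points."""
    return [later - earlier for earlier, later in zip(values, values[1:])]


for name, values in PROGRESSIONS.items():
    growth = step_growth(values)
    print(f"{name:28s} growth per step: {growth}  last step: {growth[-1]}")
```

A last-step growth of zero means the final 20% of the data did not change the value. Only the random data set is still growing at that point.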
4. How to improve your data set’s CP
Typical actions to improve the learnability of your data:
Add more instances
Reduce features
Refactor features
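As an illustration of the "reduce features" action, the sketch below drops columns from CSV text using only the Python standard library. The helper name `drop_columns` and the sample column names (`id`, `age`, `label`) are hypothetical and do not come from the data sets above. Whether removing a given column actually improves the capacity progression has to be checked by re-running the `-measureonly` command on the reduced file.

```python
import csv
import io


def drop_columns(csv_text, columns_to_drop):
    """Return CSV text with the named columns removed; the header row is preserved."""
    reader = csv.DictReader(io.StringIO(csv_text))
    kept = [name for name in reader.fieldnames if name not in columns_to_drop]
    out = io.StringIO()
    # extrasaction="ignore" makes the writer skip the dropped columns in each row.
    writer = csv.DictWriter(out, fieldnames=kept, extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    for row in reader:
        writer.writerow(row)
    return out.getvalue()


# Hypothetical sample data: 'id' is a row identifier that carries no signal.
sample = "id,age,label\n1,30,yes\n2,41,no\n"
print(drop_columns(sample, {"id"}))
```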
Next Steps
Check out brainome_203_Right_Sizing_Data_Sets.