201 Data Sufficiency

Use Capacity Progression (CP) to answer two questions:

  • Does my data have a discernible function?

  • How much data do I really need?

"Capacity progression measures the learnability of a dataset, by plotting the number of decisions needed to memorize the function presented by the training data relative to the number of instances presented to the predictor (for an ideal model)." (From the Brainome Glossary)

  1. Random Data Set Capacity Progression

  2. Deterministic Data Set CP

  3. Real World Data Set CP

  4. How to improve your data set’s CP

Prerequisites

This notebook assumes brainome is installed as per notebook brainome_101_Quick_Start

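The training data sets are downloaded from download.brainome.ai by the commands in each section: test_data9.csv, vehicle.csv, and titanic_train.csv.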

!python3 -m pip install brainome  --quiet
!brainome -version
WARNING: You are using pip version 22.0.3; however, version 22.0.4 is available.
You should consider upgrading via the '/opt/hostedtoolcache/Python/3.9.10/x64/bin/python3 -m pip install --upgrade pip' command.

/opt/hostedtoolcache/Python/3.9.10/x64/lib/python3.9/site-packages/xgboost/compat.py:31: FutureWarning: pandas.Int64Index is deprecated and will be removed from pandas in a future version. Use pandas.Index with the appropriate dtype instead.
  from pandas import MultiIndex, Int64Index
brainome v1.8-120-prod

1. Random Data Set Capacity Progression

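The Capacity Progression of a random data set displays an ever-increasing linear function: with no underlying rule to learn, every additional instance has to be memorized. In the output below, the value is still rising between 80% and 100% of the data.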

!brainome https://download.brainome.ai/data/public/test_data9.csv -y -target SomeEmail -measureonly | grep -A 1 Capacity -
Capacity Progression:             at [ 5%, 10%, 20%, 40%, 80%, 100% ]
    Ideal Machine Learner:              6,   7,   8,   9,   9,  10

2. Deterministic Data Set CP

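The Capacity Progression of a deterministic data set displays a plateau: once the data covers the underlying rule, additional instances require no additional decisions. In this case, the value reaches 9 at 40% of the data and stays there, so 40% of the data has enough information content to train a strong model.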

!brainome https://download.brainome.ai/data/public/vehicle.csv -y -measureonly | grep -A 1 Capacity -
Capacity Progression:             at [ 5%, 10%, 20%, 40%, 80%, 100% ]
    Ideal Machine Learner:              6,   7,   8,   9,   9,   9
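
To build intuition for why random labels keep climbing while rule-based labels level off, the toy sketch below counts how often the label flips when instances are sorted by the sum of their features, and repeats that count on growing prefixes of the data. This is only a crude stand-in for "decisions needed to memorize the data": it is not how brainome computes its report, its numbers are not comparable to the output above, and the helper names (count_label_changes, toy_progression) and the synthetic data are assumptions made for illustration.

```python
import random

FRACTIONS = (0.05, 0.10, 0.20, 0.40, 0.80, 1.00)

def count_label_changes(rows):
    """Sort instances by the sum of their features and count how often
    the label flips between neighbours (a crude 'decisions needed' proxy)."""
    ordered = sorted(rows, key=lambda row: sum(row[0]))
    return sum(1 for a, b in zip(ordered, ordered[1:]) if a[1] != b[1])

def toy_progression(rows):
    """Evaluate the proxy on growing prefixes of the data set."""
    return [count_label_changes(rows[: round(len(rows) * f)]) for f in FRACTIONS]

rng = random.Random(0)  # fixed seed so the run is repeatable
features = [(rng.random(), rng.random()) for _ in range(2000)]

# Random labels: no rule connects the features to the label.
random_rows = [(x, rng.randint(0, 1)) for x in features]
# Deterministic labels: one rule (feature sum above 1.0) decides the class.
rule_rows = [(x, int(sum(x) > 1.0)) for x in features]

random_cp = toy_progression(random_rows)
rule_cp = toy_progression(rule_rows)
print("random labels:       ", random_cp)
print("deterministic labels:", rule_cp)
```

With random labels the count grows with every larger prefix, because each new instance can introduce new flips. With the rule-based labels a single flip separates the two classes no matter how many instances are added, which is the plateau behaviour described in this section.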

3. Real World Data Set CP

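The Capacity Progression of a real-world data set is somewhere in the middle: it flattens in places but shows no long, clean plateau. Here the value holds at 8 from 20% to 40% of the data, then rises to 9 at 80% and stays at 9 at 100%.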

!brainome https://download.brainome.ai/data/public/titanic_train.csv -y -measureonly | grep -A 1 Capacity -
Capacity Progression:             at [ 5%, 10%, 20%, 40%, 80%, 100% ]
    Ideal Machine Learner:              6,   7,   8,   8,   9,   9
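
Reading the three reports side by side is easier with a small helper. The sketch below parses the two-line Capacity Progression output shown in this notebook and reports the smallest percentage from which the value no longer changes. The function names (parse_capacity_progression, plateau_start) are hypothetical, the sample strings are copied from the outputs above, and the parser assumes the report keeps exactly this two-line layout.

```python
import re

RANDOM_REPORT = """Capacity Progression:             at [ 5%, 10%, 20%, 40%, 80%, 100% ]
    Ideal Machine Learner:              6,   7,   8,   9,   9,  10"""
VEHICLE_REPORT = """Capacity Progression:             at [ 5%, 10%, 20%, 40%, 80%, 100% ]
    Ideal Machine Learner:              6,   7,   8,   9,   9,   9"""
TITANIC_REPORT = """Capacity Progression:             at [ 5%, 10%, 20%, 40%, 80%, 100% ]
    Ideal Machine Learner:              6,   7,   8,   8,   9,   9"""

def parse_capacity_progression(text):
    """Return (percentages, values) from the two-line report."""
    lines = text.splitlines()
    for i, line in enumerate(lines[:-1]):
        if line.lstrip().startswith("Capacity Progression"):
            percents = [int(p) for p in re.findall(r"(\d+)%", line)]
            # The values follow the colon on the next line.
            values = [int(v) for v in re.findall(r"\d+", lines[i + 1].split(":", 1)[1])]
            return percents, values
    raise ValueError("no Capacity Progression block found")

def plateau_start(percents, values):
    """Smallest percentage from which the value equals the final value.
    A result of 100 means no plateau was observed."""
    final = values[-1]
    start = percents[-1]
    for pct, val in zip(reversed(percents), reversed(values)):
        if val != final:
            break
        start = pct
    return start

for name, report in [("random", RANDOM_REPORT),
                     ("vehicle", VEHICLE_REPORT),
                     ("titanic", TITANIC_REPORT)]:
    print(name, plateau_start(*parse_capacity_progression(report)))
# prints: random 100, vehicle 40, titanic 80 (one per line)
```

The vehicle result of 40 matches the reading given in section 2. A flat stretch at the very end, as in the titanic report, is weaker evidence of sufficiency than a plateau that spans several measurement points.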

4. How to improve your data set’s CP

Typical actions to improve the learnability of your data:

  • Add more instances

  • Reduce features

  • Refactor features

Next Steps

  • Check out brainome_203_Right_Sizing_Data_Sets