
106 Describing Your Data Set

Brainome assumes your CSV file has certain characteristics:

  • the first row is the column headers

  • the target is the last column

  • every other column is used as a feature for training

Use the following parameters to override these assumptions:

  1. -headerless CSV file

  2. Selecting the -target column

  3. -ignorecolumns to omit unique identifiers
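
As a quick orientation, the three parameters can be combined on one command line. The sketch below assembles such a command in Python. The helper name build_brainome_command and its arguments are hypothetical (they are not part of Brainome); only the flags -headerless, -target, -ignorecolumns, -y and -o are taken from the commands used in this notebook.

```python
import shlex


def build_brainome_command(data, headerless=False, target=None,
                           ignore_columns=None, output=None):
    """Hypothetical helper: assemble a brainome argument list.

    Only flags that appear in this notebook are emitted.
    """
    args = ["brainome", data]
    if headerless:
        # The CSV file has no header row.
        args.append("-headerless")
    if target is not None:
        # Override the default target (the last column).
        args += ["-target", target]
    if ignore_columns:
        # Column names are passed as one comma-separated value.
        args += ["-ignorecolumns", ",".join(ignore_columns)]
    # -y is included because every command in this notebook uses it.
    args.append("-y")
    if output is not None:
        # File name for the generated predictor.
        args += ["-o", output]
    return args


cmd = build_brainome_command(
    "titanic_train.csv",
    target="Cabin_Class",
    ignore_columns=["PassengerId", "Ticket_Number"],
    output="predictor.py",
)
# Print a shell-ready version of the command.
print(shlex.join(cmd))
```

The returned list can be passed to subprocess.run directly, which avoids shell quoting problems when a column name contains spaces.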

Prerequisites

This notebook assumes brainome is installed as described in the notebook brainome_101_Quick_Start.

!python3 -m pip install brainome  --quiet
!brainome -version
WARNING: You are using pip version 22.0.3; however, version 22.0.4 is available.
You should consider upgrading via the '/opt/hostedtoolcache/Python/3.9.10/x64/bin/python3 -m pip install --upgrade pip' command.

/opt/hostedtoolcache/Python/3.9.10/x64/lib/python3.9/site-packages/xgboost/compat.py:31: FutureWarning: pandas.Int64Index is deprecated and will be removed from pandas in a future version. Use pandas.Index with the appropriate dtype instead.
  from pandas import MultiIndex, Int64Index
brainome v1.8-120-prod

1. -headerless CSV file

Brainome assumes your CSV file has a header row.

Use -headerless when your CSV file omits the header row.

In this example, we use bank.csv, which has no header row.

import urllib.request as request
response1 = request.urlretrieve('https://download.brainome.ai/data/public/bank.csv', 'bank.csv')
print(" Headerless data set bank.csv ".center(80,"-"))
!head -4 bank.csv
print("\n"," Ranking a headerless data file ".center(80,"-"))
!brainome bank.csv -headerless -y -o predictor_106_headerless.py | grep -A 6 "Attribute Ranking:"
------------------------- Headerless data set bank.csv -------------------------
3.6216,8.6661,-2.8073,-0.44699,0
4.5459,8.1674,-2.4586,-1.4621,0
3.866,-2.6383,1.9242,0.10645,0
3.4566,9.5228,-4.0112,-3.5944,0
 ------------------------ Ranking a headerless data file ------------------------
    Attribute Ranking:
                                      Feature | Relative Importance
                                            0 :   0.5880
                                            1 :   0.2494
                                            2 :   0.1482
                                            3 :   0.0144
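
If you are unsure whether a file needs -headerless, a rough check is to test whether the first row is entirely numeric, as it is in the bank.csv sample above. The sketch below is a simple heuristic written for this notebook, not something Brainome does internally. The function name first_row_looks_like_data and the header names in the second sample are hypothetical.

```python
import csv
import io


def first_row_looks_like_data(csv_text):
    """Return True when every field in the first row parses as a number.

    An all-numeric first row suggests the file has no header row.
    """
    first_row = next(csv.reader(io.StringIO(csv_text)))
    try:
        for field in first_row:
            float(field)
    except ValueError:
        # At least one field is not numeric, so it is probably a header.
        return False
    return True


# First two rows of bank.csv, copied from the output above.
bank_sample = (
    "3.6216,8.6661,-2.8073,-0.44699,0\n"
    "4.5459,8.1674,-2.4586,-1.4621,0\n"
)
# Same rows with a made-up header row; these column names are hypothetical.
with_header = "col_a,col_b,col_c,col_d,label\n" + bank_sample

print(first_row_looks_like_data(bank_sample))
print(first_row_looks_like_data(with_header))
```

The heuristic fails for files whose data rows contain text in every position, so treat it as a hint and confirm by looking at the first few lines of the file.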
         

2. Selecting the -target column

Brainome assumes the last column is the target.

Use -target to specify a different column.

In this example, we use titanic_train.csv, but instead of predicting Survived, we predict Cabin_Class.

!brainome https://download.brainome.ai/data/public/titanic_train.csv -target Cabin_Class -y -o predictor_106_target.py | grep "Target Column:"
    Target Column:              Cabin_Class
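
The rule described in this section (last column by default, a named column when -target is given) can be sketched in a few lines of Python. This is an illustration of the rule only, not Brainome's implementation. The function name resolve_target is hypothetical, and the header list is a shortened, assumed column order rather than the real header of titanic_train.csv.

```python
def resolve_target(header, target=None):
    """Return the name of the column that would be predicted.

    With no explicit target, the last column is used.
    """
    if not header:
        raise ValueError("the header row is empty")
    if target is None:
        return header[-1]
    if target not in header:
        raise ValueError(f"column {target!r} is not in the header")
    return target


# Shortened header with an assumed column order, for illustration only.
header = ["PassengerId", "Cabin_Class", "Name", "Sex", "Age", "Survived"]

print(resolve_target(header))                        # default: last column
print(resolve_target(header, target="Cabin_Class"))  # explicit choice
```

Checking the name against the header before launching a training run catches misspelled column names early.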

3. -ignorecolumns to omit unique identifiers

By default, Brainome uses every column in your data set as a feature. Many data sets include unique identifiers whose only purpose is to tie predictions back to an external source.

Use -ignorecolumns with a comma-separated list of column names to exclude those columns from your model.

In this example, we ignore PassengerId and Ticket_Number from titanic_train.csv.

!brainome https://download.brainome.ai/data/public/titanic_train.csv -ignorecolumns "PassengerId,Ticket_Number" -y -o predictor_106_ignorecolumns.py | grep -A 10 "Attribute Ranking:"
    Attribute Ranking:
                                      Feature | Relative Importance
                                          Sex :   0.5270
                                  Cabin_Class :   0.1876
                                 Cabin_Number :   0.0661
                                          Age :   0.0522
                               Sibling_Spouse :   0.0502
                                         Fare :   0.0331
                                         Name :   0.0304
                          Port_of_Embarkation :   0.0289
                              Parent_Children :   0.0246
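
One way to find candidates for -ignorecolumns is to look for columns in which every value is distinct, since unique identifiers have that property. The sketch below is a minimal heuristic under that assumption. The function name unique_valued_columns is hypothetical, and the sample rows are made up for illustration; they are not taken from titanic_train.csv.

```python
import csv
import io


def unique_valued_columns(csv_text):
    """List the columns in which no value repeats across rows."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    names = reader.fieldnames or []
    if not rows:
        return []
    return [
        name for name in names
        if len({row[name] for row in rows}) == len(rows)
    ]


# Made-up sample rows, for illustration only.
sample = (
    "PassengerId,Sex,Age,Survived\n"
    "1,male,22,0\n"
    "2,female,38,1\n"
    "3,female,22,1\n"
    "4,male,35,0\n"
)

candidates = unique_valued_columns(sample)
# Join the names the way -ignorecolumns expects them.
print(",".join(candidates))
```

Review the candidates before ignoring them: a continuous feature such as a price can also be distinct in every row, especially in a small sample, and it should stay in the model.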