105 Sourcing Your Data Set

Brainome can read CSV data sets from several kinds of input:

Local file system
HTTP/HTTPS URL
Compressed data sets
Multiple data sets

Prerequisites

This notebook assumes brainome is installed as described in notebook brainome_101_Quick_Start.

!python3 -m pip install brainome --quiet
!brainome -version

/opt/hostedtoolcache/Python/3.9.10/x64/lib/python3.9/site-packages/xgboost/compat.py:31: FutureWarning: pandas.Int64Index is deprecated and will be removed from pandas in a future version. Use pandas.Index with the appropriate dtype instead.
  from pandas import MultiIndex, Int64Index

brainome v1.8-120-prod

1. Local file system

Brainome defaults to reading data files from the current directory.

In this example, we download cancer.csv to the local file system before using it.

from urllib.request import urlretrieve

# Download cancer.csv into the current working directory.
urlretrieve('https://download.brainome.ai/data/public/cancer.csv', 'cancer.csv')
print("Downloaded cancer.csv to local file system")
%ls -lh cancer.csv
print("\nRunning brainome")
# grep -A 6 keeps only the "Data:" line and the six lines after it.
!brainome cancer.csv -y -o predictor_105_local.py | grep -A 6 "Data:"

Downloaded cancer.csv to local file system

-rw-r--r-- 1 runner docker 20K Mar 12 21:07 cancer.csv

Running brainome

Data:
    Input:                      cancer.csv
    Target Column:              1.0
    Number of instances:        481
    Number of attributes:         9 out of 9
    Number of classes:            2
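
Before pointing Brainome at a local file, it can help to confirm that the file really is in the current directory and to glance at its shape. The following is a minimal sketch using only the Python standard library; `summarize_csv` is a hypothetical helper name, not part of Brainome, and it makes no assumption about whether the first row is a header.

```python
import csv
from pathlib import Path


def summarize_csv(path):
    """Return the first row, its column count, and the total row count of a CSV file."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"{path} not found (current directory: {Path.cwd()})")
    with path.open(newline="") as handle:
        reader = csv.reader(handle)
        first_row = next(reader, None)
        if first_row is None:
            raise ValueError(f"{path} is empty")
        # Count the remaining rows without loading the file into memory.
        row_count = 1 + sum(1 for _ in reader)
    return {"first_row": first_row, "columns": len(first_row), "rows": row_count}


# Only runs when the file downloaded in this section is present.
if Path("cancer.csv").is_file():
    print(summarize_csv("cancer.csv"))
```

The row count includes the first row, so it may differ by one from the instance count Brainome reports, depending on how the first row is interpreted.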

2. HTTP/HTTPS URL

Brainome can download a CSV data set directly from an HTTP or HTTPS URL.

In this example, we use titanic_train.csv.

!brainome https://download.brainome.ai/data/public/titanic_train.csv -y -o predictor_105_http.py | grep -A 6 "Data:"

Data:
    Input:                      https://download.brainome.ai/data/public/titanic_train.csv
    Target Column:              Survived
    Number of instances:        800
    Number of attributes:        11 out of 11
    Number of classes:            2
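
When a script handles both local paths and URLs, it can be useful to tell them apart and to peek at a remote file's first line without downloading all of it. This is a rough sketch with the standard library only; `is_remote_source` and `read_remote_header` are hypothetical helper names, and the sketch assumes the first line is a UTF-8 encoded header with no embedded line breaks. It does not describe how Brainome itself fetches data.

```python
import csv
from urllib.parse import urlparse
from urllib.request import urlopen


def is_remote_source(source):
    """Return True when the source looks like an HTTP or HTTPS URL."""
    parsed = urlparse(source)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def read_remote_header(url, timeout=10):
    """Fetch only the first line of a CSV at the given URL and split it into fields."""
    with urlopen(url, timeout=timeout) as response:
        first_line = response.readline().decode("utf-8")
    # csv.reader accepts any iterable of strings, so a one-item list is enough.
    return next(csv.reader([first_line]))
```

For instance, `read_remote_header("https://download.brainome.ai/data/public/titanic_train.csv")` would return the column names as a list, provided the URL is reachable from your machine.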

3. Compressed data sets

Brainome can stream a compressed data set, so there is no need to decompress it first.

In this example, we use the gzip-compressed file titanic_compressed.csv.gz.

!brainome https://download.brainome.ai/data/public/titanic_compressed.csv.gz -y  -o predictor_105_gz.py | grep -A 6 "Data:"

Data:
    Input:                      https://download.brainome.ai/data/public/titanic_compressed.csv.gz
    Target Column:              Survived
    Number of instances:        800
    Number of attributes:        11 out of 11
    Number of classes:            2
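
To illustrate what streaming a compressed CSV means, the sketch below reads rows from a local .csv.gz file on the fly with Python's gzip and csv modules, so the file is never fully decompressed to disk or memory. The helper names `iter_csv_rows` and `first_rows` are hypothetical, the sketch assumes a local copy of the file, and it is not a description of Brainome's internals.

```python
import csv
import gzip
from itertools import islice


def iter_csv_rows(path):
    """Yield rows from a CSV file, decompressing on the fly when the name ends in .gz."""
    opener = gzip.open if str(path).endswith(".gz") else open
    # Text mode with newline="" is what the csv module expects.
    with opener(path, mode="rt", newline="") as handle:
        yield from csv.reader(handle)


def first_rows(path, count):
    """Return at most `count` rows without reading the rest of the file."""
    return list(islice(iter_csv_rows(path), count))
```

Calling `first_rows("titanic_compressed.csv.gz", 3)` on a downloaded copy would return the first three rows as lists of strings.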

4. Multiple data sets

Brainome can accept multiple data sets in a single run. All of them must have the same columns.

In this example, we use vehicle.csv, vehicle_A.csv.gz, and vehicle_B.csv.gz.

!brainome https://download.brainome.ai/data/public/vehicle.csv https://download.brainome.ai/data/public/vehicle_A.csv.gz https://download.brainome.ai/data/public/vehicle_B.csv.gz -y  -o predictor_105_multi.py | grep -A 6 "Data:"

Data:
    Input:                      https://download.brainome.ai/data/public/vehicle.csv https://download.brainome.ai/data/public/vehicle_A.csv.gz https://download.brainome.ai/data/public/vehicle_B.csv.gz
    Target Column:              Class
    Number of instances:       2538
    Number of attributes:        18 out of 18
    Number of classes:            4
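
Because all data sets must share the same columns, a quick header comparison before the run can catch a mismatched file early. The sketch below is one way to do that with the standard library; `read_header` and `find_header_mismatches` are hypothetical helper names. It assumes local copies of the files, that each file starts with a header row, and that "same columns" means identical names in identical order, which is stricter than Brainome may actually require.

```python
import csv
import gzip


def read_header(path):
    """Return the first row of a plain or gzip-compressed CSV file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, mode="rt", newline="") as handle:
        return next(csv.reader(handle), [])


def find_header_mismatches(paths):
    """Return {path: header} for each file whose header differs from the first file's."""
    reference = read_header(paths[0])
    mismatches = {}
    for path in paths[1:]:
        header = read_header(path)
        if header != reference:
            mismatches[path] = header
    return mismatches
```

An empty result from `find_header_mismatches(["vehicle.csv", "vehicle_A.csv.gz", "vehicle_B.csv.gz"])` would indicate that the three files have matching header rows.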

Next Steps

Check out 106 Describe Your CSV
Check out Using Measurement to Create Better Models
