
105 Sourcing Your Data Set

Brainome accepts CSV files from several sources:

  1. Local file system

  2. HTTP/HTTPS URL

  3. Compressed data sets

  4. Multiple data sets

Prerequisites

This notebook assumes brainome is installed as described in the notebook brainome_101_Quick_Start.

!python3 -m pip install brainome --quiet
!brainome -version
/opt/hostedtoolcache/Python/3.9.10/x64/lib/python3.9/site-packages/xgboost/compat.py:31: FutureWarning: pandas.Int64Index is deprecated and will be removed from pandas in a future version. Use pandas.Index with the appropriate dtype instead.
  from pandas import MultiIndex, Int64Index
brainome v1.8-120-prod

1. Local file system

Brainome defaults to reading data files from the current directory.

In this example, we download cancer.csv to the local file system before using it.

import urllib.request as request
response1 = request.urlretrieve('https://download.brainome.ai/data/public/cancer.csv', 'cancer.csv')
print("Downloaded cancer.csv to local file system")
%ls -lh cancer.csv
print("\nRunning brainome")
!brainome cancer.csv -y  -o predictor_105_local.py | grep -A 6 "Data:"
Downloaded cancer.csv to local file system
-rw-r--r-- 1 runner docker 20K Mar 12 21:07 cancer.csv
Running brainome
Data:
    Input:                      cancer.csv
    Target Column:              1.0
    Number of instances:        481
    Number of attributes:         9 out of 9
    Number of classes:            2
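
Because a bare file name is read from the current directory, it can help to confirm the file is really there before starting a run. The sketch below is a minimal illustration of that check; `resolve_local_dataset` is a hypothetical helper written for this notebook, not part of brainome.

```python
from pathlib import Path


def resolve_local_dataset(name: str) -> Path:
    """Resolve a data file name against the current working directory.

    Hypothetical helper for illustration only; brainome does not provide it.
    """
    # Joining with an absolute `name` simply yields that absolute path.
    path = Path.cwd() / name
    if not path.is_file():
        raise FileNotFoundError(f"{name} not found in {Path.cwd()}")
    return path.resolve()
```

Calling `resolve_local_dataset("cancer.csv")` after the download step above would return the absolute path of the file, or raise `FileNotFoundError` if the download did not land in the current directory.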

2. HTTP/HTTPS URL

Brainome can download a CSV data set from an HTTP or HTTPS URL.

In this example, we use titanic_train.csv, served over HTTPS.

!brainome https://download.brainome.ai/data/public/titanic_train.csv -y -o predictor_105_http.py | grep -A 6 "Data:"
Data:
    Input:                      https://download.brainome.ai/data/public/titanic_train.csv
    Target Column:              Survived
    Number of instances:        800
    Number of attributes:        11 out of 11
    Number of classes:            2
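
When a script handles both local files and URLs, it is useful to tell the two apart before deciding whether to look for a file on disk. The following sketch uses only the Python standard library; `is_remote_source` is a hypothetical name, and the rule it applies (an `http` or `https` scheme plus a host) is our assumption, not a description of how brainome itself decides.

```python
from urllib.parse import urlparse


def is_remote_source(source: str) -> bool:
    """Return True when `source` looks like an HTTP or HTTPS URL.

    Hypothetical helper; the rule used here is an assumption for illustration.
    """
    parsed = urlparse(source)
    # urlparse lowercases the scheme, so "HTTPS://..." is also accepted.
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)
```

With this helper, the titanic_train.csv URL above is classified as remote, while a plain name such as cancer.csv is treated as a local file.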

3. Compressed data sets

Brainome can stream a compressed data set, such as the gzip-compressed CSV file used here.

In this example, we use titanic_compressed.csv.gz

!brainome https://download.brainome.ai/data/public/titanic_compressed.csv.gz -y  -o predictor_105_gz.py | grep -A 6 "Data:"
Data:
    Input:                      https://download.brainome.ai/data/public/titanic_compressed.csv.gz
    Target Column:              Survived
    Number of instances:        800
    Number of attributes:        11 out of 11
    Number of classes:            2
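
To see what reading a compressed CSV as a stream looks like, the sketch below counts the data rows of a local .csv.gz file without unpacking it to disk. It is an illustration with the Python standard library, not brainome's implementation; `count_gzip_csv_rows` is a hypothetical name, and the function assumes the first row is a header.

```python
import csv
import gzip


def count_gzip_csv_rows(path: str) -> int:
    """Count data rows in a gzip-compressed CSV, decompressing as it reads.

    Hypothetical helper; assumes the first row of the file is a header.
    """
    # Text mode with newline="" lets the csv module handle line endings itself.
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header row, if any
        return sum(1 for _ in reader)
```

Run against a local copy of a compressed data set, the result can be compared with the "Number of instances" line that brainome reports.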

4. Multiple data sets

Brainome accepts multiple data sets in a single run, provided they all have the same columns.

In this example, we use vehicle.csv, vehicle_A.csv.gz, and vehicle_B.csv.gz

!brainome https://download.brainome.ai/data/public/vehicle.csv https://download.brainome.ai/data/public/vehicle_A.csv.gz https://download.brainome.ai/data/public/vehicle_B.csv.gz -y  -o predictor_105_multi.py | grep -A 6 "Data:"
Data:
    Input:                      https://download.brainome.ai/data/public/vehicle.csv https://download.brainome.ai/data/public/vehicle_A.csv.gz https://download.brainome.ai/data/public/vehicle_B.csv.gz
    Target Column:              Class
    Number of instances:       2538
    Number of attributes:        18 out of 18
    Number of classes:            4
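
Since all data sets must share the same columns, a quick header comparison on local copies can catch a mismatch before a run. The sketch below is one way to do that with the Python standard library. `read_header` and `same_columns` are hypothetical helpers, and treating "same columns" as identical header rows in the same order is our assumption; brainome's own check may differ.

```python
import csv
import gzip


def read_header(path: str) -> list:
    """Return the first row of a CSV file, plain or gzip-compressed."""
    # Assumption: a ".gz" suffix means the file is gzip-compressed.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", newline="") as handle:
        return next(csv.reader(handle), [])


def same_columns(paths) -> bool:
    """Return True when every file has the same header row, in the same order."""
    headers = [read_header(p) for p in paths]
    return all(h == headers[0] for h in headers)
```

For local copies of the three files above, `same_columns(["vehicle.csv", "vehicle_A.csv.gz", "vehicle_B.csv.gz"])` would report whether their header rows agree.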