105 Sourcing Your Data Set¶
Brainome accepts CSV files from several kinds of sources:
Local file system
HTTP/HTTPS URL
Compressed data sets
Multiple data sets
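To make the four cases concrete, here is a minimal sketch that labels a source string the way the list above does. The helper `describe_source` is hypothetical and is not part of Brainome; it only inspects the string and assumes that a `.gz` suffix means gzip compression.

```python
from urllib.parse import urlparse


# Hypothetical helper, not part of Brainome: describe a data source string.
def describe_source(source: str) -> dict:
    parsed = urlparse(source)
    is_url = parsed.scheme in ("http", "https")
    # For URLs, look only at the path component so query strings are ignored.
    name = parsed.path if is_url else source
    return {
        "location": "url" if is_url else "local",
        # Assumption: a .gz suffix indicates a gzip-compressed file.
        "gzip": name.lower().endswith(".gz"),
    }


print(describe_source("cancer.csv"))
print(describe_source("https://download.brainome.ai/data/public/titanic_compressed.csv.gz"))
```

Multiple data sets are simply several such source strings passed on the same command line.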
Prerequisites¶
This notebook assumes Brainome is installed as described in the notebook brainome_101_Quick_Start.
!python3 -m pip install brainome --quiet
!brainome -version
/opt/hostedtoolcache/Python/3.9.10/x64/lib/python3.9/site-packages/xgboost/compat.py:31: FutureWarning: pandas.Int64Index is deprecated and will be removed from pandas in a future version. Use pandas.Index with the appropriate dtype instead.
from pandas import MultiIndex, Int64Index
brainome v1.8-120-prod
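If you prefer to check for the command-line tool from Python before running the cells below, a minimal sketch follows. The helper `cli_available` is hypothetical; it only checks whether an executable with the given name is on the PATH and says nothing about the installed version.

```python
import shutil


# Hypothetical helper: report whether a command-line tool is on the PATH.
def cli_available(name: str = "brainome") -> bool:
    return shutil.which(name) is not None


if not cli_available():
    print("brainome not found on PATH; install it first (see notebook 101).")
```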
1. Local file system¶
Given a bare file name, Brainome reads the data file from the current working directory.
In this example, we download cancer.csv to the local file system before using it.
import urllib.request as request
# Download the data set into the current directory
request.urlretrieve('https://download.brainome.ai/data/public/cancer.csv', 'cancer.csv')
print("Downloaded cancer.csv to local file system")
%ls -lh cancer.csv
# Build a predictor from the local file and show only the "Data:" summary
print("\nRunning brainome")
!brainome cancer.csv -y -o predictor_105_local.py | grep -A 6 "Data:"
Downloaded cancer.csv to local file system
-rw-r--r-- 1 runner docker 20K Mar 12 21:07 cancer.csv
Running brainome
Data:
Input: cancer.csv
Target Column: 1.0
Number of instances: 481
Number of attributes: 9 out of 9
Number of classes: 2
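Before handing a local file to Brainome, it can help to look at its shape yourself. The sketch below uses only the standard library; `csv_shape` is a hypothetical helper, and the row count it returns includes a header line if the file has one, so it will not necessarily match the instance count that Brainome reports.

```python
import csv
import os


# Hypothetical helper: count non-blank rows and the columns of the first row.
def csv_shape(path: str) -> tuple:
    with open(path, newline="") as handle:
        rows = [row for row in csv.reader(handle) if row]  # skip blank lines
    n_columns = len(rows[0]) if rows else 0
    return len(rows), n_columns


# Only inspect the file if it has already been downloaded.
if os.path.exists("cancer.csv"):
    print(csv_shape("cancer.csv"))
```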
2. HTTP/HTTPS URL¶
Brainome can download a CSV data set directly from an HTTP or HTTPS URL.
In this example, we use titanic_train.csv, served over HTTPS.
!brainome https://download.brainome.ai/data/public/titanic_train.csv -y -o predictor_105_http.py | grep -A 6 "Data:"
Data:
Input: https://download.brainome.ai/data/public/titanic_train.csv
Target Column: Survived
Number of instances: 800
Number of attributes: 11 out of 11
Number of classes: 2
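The cells in this notebook pipe Brainome's output through `grep -A 6 "Data:"` to show only the data summary. Outside a notebook, roughly the same filtering can be sketched in Python. Both helpers below are hypothetical; `run_brainome` assumes the `brainome` command is on the PATH and uses only the `-y` and `-o` flags that appear in the cells above. Unlike grep, `lines_after` does not print `--` separators between groups of matches.

```python
import subprocess


# Hypothetical helper: roughly mimic `grep -A n marker` on captured text.
def lines_after(text: str, marker: str, n: int = 6) -> list:
    kept = []
    remaining = 0
    for line in text.splitlines():
        if marker in line:
            remaining = n + 1  # the matching line plus n trailing lines
        if remaining > 0:
            kept.append(line)
            remaining -= 1
    return kept


# Hypothetical wrapper: run the CLI and keep only the "Data:" summary.
def run_brainome(source: str, output: str) -> list:
    result = subprocess.run(
        ["brainome", source, "-y", "-o", output],
        capture_output=True,
        text=True,
    )
    return lines_after(result.stdout, "Data:")
```

For example, `run_brainome("https://download.brainome.ai/data/public/titanic_train.csv", "predictor_105_http.py")` would correspond to the cell above.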
3. Compressed data sets¶
Brainome can stream a compressed data set without a separate decompression step.
In this example, we use titanic_compressed.csv.gz, a gzip-compressed CSV file.
!brainome https://download.brainome.ai/data/public/titanic_compressed.csv.gz -y -o predictor_105_gz.py | grep -A 6 "Data:"
Data:
Input: https://download.brainome.ai/data/public/titanic_compressed.csv.gz
Target Column: Survived
Number of instances: 800
Number of attributes: 11 out of 11
Number of classes: 2
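To illustrate what streaming a compressed file means, the sketch below reads just the first row of a local gzip-compressed CSV with Python's standard `gzip` module, which decompresses on the fly as the file is read. This is an illustration only and makes no claim about how Brainome implements it; `gz_first_row` is a hypothetical helper and assumes the file has already been downloaded.

```python
import csv
import gzip


# Hypothetical helper: read the first row of a gzip-compressed CSV file.
def gz_first_row(path: str) -> list:
    # Text mode ("rt") decompresses lazily as the reader pulls lines.
    with gzip.open(path, mode="rt", newline="") as handle:
        return next(csv.reader(handle), [])
```

Only as much of the file as the first row requires is decompressed, which is the practical benefit of streaming for large data sets.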
4. Multiple data sets¶
Brainome can accept multiple data sets in a single run. They must all have the same columns.
In this example, we use vehicle.csv, vehicle_A.csv.gz, and vehicle_B.csv.gz.
!brainome https://download.brainome.ai/data/public/vehicle.csv https://download.brainome.ai/data/public/vehicle_A.csv.gz https://download.brainome.ai/data/public/vehicle_B.csv.gz -y -o predictor_105_multi.py | grep -A 6 "Data:"
Data:
Input: https://download.brainome.ai/data/public/vehicle.csv https://download.brainome.ai/data/public/vehicle_A.csv.gz https://download.brainome.ai/data/public/vehicle_B.csv.gz
Target Column: Class
Number of instances: 2538
Number of attributes: 18 out of 18
Number of classes: 4
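Because all data sets must share the same columns, a quick check before combining them can save a failed run. The sketch below compares header rows of local files. Both helpers are hypothetical, and the sketch assumes every file has a header row, has been downloaded locally, and is gzip-compressed only if its name ends in `.gz`.

```python
import csv
import gzip


# Hypothetical helper: read the header row of a plain or gzip-compressed CSV.
def read_header(path: str) -> list:
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, mode="rt", newline="") as handle:
        return next(csv.reader(handle), [])


# Hypothetical helper: True when every file has the same header, in order.
def same_columns(paths: list) -> bool:
    headers = [read_header(p) for p in paths]
    return all(h == headers[0] for h in headers)
```

With local copies of the three files, `same_columns(["vehicle.csv", "vehicle_A.csv.gz", "vehicle_B.csv.gz"])` would perform the check.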
Next Steps¶
Check out 106 Describe Your CSV