Data loading and cleaning¶
Loading data¶
We will use data stored in a CSV file. To process the data, we will use the pandas library.
import pandas as pd
import numpy as np
from IPython.display import display
pd.set_option("display.precision", 2)
pd.options.display.max_columns = 50
CSV_FILENAME = './res/cleveland_data.csv'
names = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'num']
raw_data = pd.read_csv(
    filepath_or_buffer = CSV_FILENAME,
    names = names,
    index_col = False
)
Data cleaning¶
Checking and removing incorrect data¶
First of all, let’s check if the dataset contains non-numeric data:
dtypes = raw_data.dtypes
non_num_dtypes = dtypes[(dtypes != np.float64) & (dtypes != np.int64)]
print('non-numeric columns:\n', non_num_dtypes)
non-numeric columns:
ca object
thal object
dtype: object
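The same check can also be written with `DataFrame.select_dtypes`, which matches any numeric dtype rather than only `float64` and `int64`. This is a minimal sketch on a made-up two-column frame; the name `frame` and its values are hypothetical and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical frame: 'age' is numeric, 'ca' was read as strings.
frame = pd.DataFrame({'age': [63.0, 67.0], 'ca': ['0.0', '?']})

# Keep only the columns whose dtype is not numeric.
non_numeric = frame.select_dtypes(exclude='number')
print(list(non_numeric.columns))
```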
Columns ca and thal were read as non-numeric, so they may contain invalid entries. Let’s look at their unique values:
print('unique values in "ca": ', raw_data['ca'].unique())
print('unique values in "thal": ', raw_data['thal'].unique())
unique values in "ca": ['0.0' '3.0' '2.0' '1.0' '?']
unique values in "thal": ['6.0' '3.0' '7.0' '?']
Some entries in both columns contain the ‘?’ symbol. Let’s drop the affected rows and convert both columns to a numeric type:
raw_data = raw_data.drop(raw_data[(raw_data['ca'] == '?') | (raw_data['thal'] == '?')].index)
raw_data['ca'] = pd.to_numeric(raw_data['ca'])
raw_data['thal'] = pd.to_numeric(raw_data['thal'])
raw_data = raw_data.reset_index(drop = True)
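An alternative to filtering on the literal ‘?’ is to let `pd.to_numeric` coerce every unparseable entry to NaN and then drop the rows with NaN. The sketch below uses a small made-up frame (the name `frame` and its values are hypothetical) and is one possible variant, not the approach used in this notebook.

```python
import pandas as pd

# Hypothetical sample; the values are not taken from the real file.
frame = pd.DataFrame({
    'age': [63.0, 67.0, 41.0],
    'ca': ['0.0', '?', '2.0'],
    'thal': ['6.0', '3.0', '?'],
})

cols = ['ca', 'thal']
for col in cols:
    # errors='coerce' turns unparseable strings such as '?' into NaN.
    frame[col] = pd.to_numeric(frame[col], errors='coerce')

# Drop rows where either column is missing, then renumber the index.
cleaned = frame.dropna(subset=cols).reset_index(drop=True)
print(cleaned)
```

Unlike the explicit comparison with ‘?’, this variant also discards any other non-numeric string, which may or may not be desirable.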
Splitting categorical fields¶
Categorical features cannot be used directly, because their numeric codes are labels rather than quantities: they indicate which category is present, not a magnitude. Let’s replace each of these features with a one-hot vector:
cp_one_hot = pd.get_dummies(
    data = raw_data['cp'],
    dtype = np.float64
).set_axis(
    labels = ['typical angina', 'atypical angina', 'non-anginal pain', 'asymptomatic'],
    axis = 'columns'
)
thal_one_hot = pd.get_dummies(
    data = raw_data['thal'],
    dtype = np.float64
).set_axis(
    labels = ['thal norm', 'thal fixed def', 'thal reversable def'],
    axis = 'columns'
)
restecg_one_hot = pd.get_dummies(
    data = raw_data['restecg'],
    dtype = np.float64
).set_axis(
    labels = ['ecg norm', 'ecg ST-T abnormal', 'ecg hypertrophy'],
    axis = 'columns'
)
original_data = raw_data.drop(columns = ['cp', 'thal', 'restecg']).copy(deep = True)
original_data = pd.concat(
    objs = [original_data, cp_one_hot, thal_one_hot, restecg_one_hot],
    axis = 'columns',
    join = 'outer',
    ignore_index = False
)
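Renaming the dummy columns with `set_axis` relies on every category being present and on the columns coming out in sorted code order. A more defensive sketch maps codes to labels first and then fixes the column order with `reindex`. It assumes that ‘cp’ uses the codes 1.0 to 4.0 in the same order as the labels above; the dictionary `cp_labels` and the sample values are hypothetical.

```python
import numpy as np
import pandas as pd

# Assumed code-to-label mapping (codes 1.0..4.0 in label order).
cp_labels = {
    1.0: 'typical angina',
    2.0: 'atypical angina',
    3.0: 'non-anginal pain',
    4.0: 'asymptomatic',
}

# Hypothetical sample in which code 2.0 never occurs.
cp = pd.Series([4.0, 1.0, 4.0, 3.0], name='cp')

one_hot = (
    pd.get_dummies(cp.map(cp_labels), dtype=np.float64)
    # reindex fixes the column order and adds all-zero columns for absent codes
    .reindex(columns=list(cp_labels.values()), fill_value=0.0)
)
print(one_hot)
```

With this variant a missing category produces an all-zero column instead of a length-mismatch error from `set_axis`.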
Next, we recode the ‘slope’ values so that their sign reflects the direction of the slope:
# Update values: flat = 0.0, upsloping = 1.0, downsloping = -1.0
original_data.loc[original_data['slope'] == 1.0, 'slope'] = 1.0
original_data.loc[original_data['slope'] == 2.0, 'slope'] = 0.0
original_data.loc[original_data['slope'] == 3.0, 'slope'] = -1.0
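The same recoding can be expressed as a single lookup with `Series.map`. This is a minimal sketch on a made-up series; the dictionary `slope_codes` is a hypothetical name that mirrors the three assignments above.

```python
import pandas as pd

# Hypothetical sample of original 'slope' codes.
slope = pd.Series([1.0, 2.0, 3.0, 2.0], name='slope')

# 1.0 (upsloping) -> 1.0, 2.0 (flat) -> 0.0, 3.0 (downsloping) -> -1.0
slope_codes = {1.0: 1.0, 2.0: 0.0, 3.0: -1.0}

recoded = slope.map(slope_codes)
print(recoded.tolist())
```

Note that `map` turns any value missing from the dictionary into NaN, whereas the `.loc` assignments leave such values unchanged.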
And finally, in the ‘num’ column, assign a value of “0” to healthy patients and a value of “1” to patients with heart disease (regardless of the narrowing percentage):
# Update values: 0.0 = no heart disease; 1.0 = heart disease
original_data.loc[original_data['num'] == 0.0, 'num'] = 0.0
original_data.loc[original_data['num'] > 0.0, 'num'] = 1.0
%store original_data
Stored 'original_data' (DataFrame)
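The binarization of ‘num’ can also be written as one vectorized comparison. A minimal sketch on a made-up series, assuming that every value greater than zero indicates heart disease, as in the step above:

```python
import pandas as pd

# Hypothetical sample of original 'num' values.
num = pd.Series([0, 2, 1, 0, 4], name='num')

# True where num > 0, then cast to 0.0 / 1.0.
binary = (num > 0).astype('float64')
print(binary.tolist())
```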