Train a Neural Network to Predict Whether a Mammogram Mass Is Benign or Malignant
We'll be using the "mammographic masses" public data set from the UCI repository (source: https://archive.ics.uci.edu/ml/datasets/Mammographic+Mass)
This data contains 961 instances of masses detected in mammograms, and contains the following attributes:
- BI-RADS assessment: 1 to 5 (ordinal)
- Age: patient's age in years (integer)
- Shape: mass shape: round=1 oval=2 lobular=3 irregular=4 (nominal)
- Margin: mass margin: circumscribed=1 microlobulated=2 obscured=3 ill-defined=4 spiculated=5 (nominal)
- Density: mass density high=1 iso=2 low=3 fat-containing=4 (ordinal)
- Severity: benign=0 or malignant=1 (binominal)
Although "shape" and "margin" are nominal data types, which sklearn typically doesn't deal with well, they are close enough to ordinal that we shouldn't just discard them. The "shape" for example is ordered increasingly from round to irregular.
A lot of unnecessary anguish and surgery arises from false positives arising from mammogram results. If we can build a better way to interpret them through supervised machine learning, it could improve a lot of lives.
Let's Start
Prepare the data set
Start by importing the mammographic_masses.data.txt file into a Pandas data frame.
Make sure you use the optional parameters in read_csv to convert missing data (indicated by a ?) into NaN, and to add the appropriate column names (BI-RADS, age, shape, margin, density, and severity):
import pandas as pd
masses_data = pd.read_csv('mammographic_masses.data.txt', na_values=['?'], names = ['BI-RADS', 'age', 'shape', 'margin', 'density', 'severity'])
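To see what na_values does without needing the file on disk, here is a minimal sketch that parses a tiny in-memory sample. The three rows are a hypothetical stand-in in the same comma-separated layout, and the name sample_data is only for illustration.

```python
import io
import pandas as pd

# Hypothetical three-row sample in the same comma-separated layout as the data file
sample = io.StringIO("5,67,3,5,3,1\n4,43,1,1,?,1\n5,58,4,5,3,1\n")

columns = ['BI-RADS', 'age', 'shape', 'margin', 'density', 'severity']
sample_data = pd.read_csv(sample, na_values=['?'], names=columns)

# The '?' in the second row is parsed as NaN, so density becomes a float column
print(sample_data)
print(sample_data.isnull().sum())
```

Calling masses_data.describe() on the real data frame is a quick way to confirm that the column counts differ, which is the sign that some values are missing.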
There are quite a few missing values in the data set. Before we just
drop every row that's missing data, let's make sure we don't bias our
data in doing so. Is there any apparent pattern in which rows have
missing fields? If there were, we'd have to try and go back and fill
that data in.
masses_data.loc[(masses_data['age'].isnull()) |
                (masses_data['shape'].isnull()) |
                (masses_data['margin'].isnull()) |
                (masses_data['density'].isnull())]
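Eyeballing the rows is one check. A rough numeric check could look like the sketch below, which counts missing values per column and compares the share of malignant cases between incomplete and complete rows. The toy frame is hypothetical and only stands in for masses_data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for masses_data
toy = pd.DataFrame({
    'age':      [67, 43, np.nan, 28, 74, 65],
    'shape':    [3, 1, 4, 1, np.nan, 1],
    'margin':   [5, 1, 5, 1, 5, np.nan],
    'density':  [3, np.nan, 3, 3, np.nan, 3],
    'severity': [1, 1, 1, 0, 1, 0],
})

# Number of missing values in each column
missing_per_column = toy.isnull().sum()

# Flag rows with at least one missing feature, then compare the share of
# malignant cases between complete (False) and incomplete (True) rows
has_missing = toy[['age', 'shape', 'margin', 'density']].isnull().any(axis=1)
malignant_rate = toy.groupby(has_missing)['severity'].mean()

print(missing_per_column)
print(malignant_rate)
```

If the two rates were very different on the real data, dropping the incomplete rows would shift the class balance, which is the kind of bias to watch for.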
If the missing data seems randomly distributed, go ahead and drop rows with missing data.
masses_data.dropna(inplace=True)
Next you'll need to convert the Pandas data frame into numpy arrays that
can be used by scikit-learn. Create an array that extracts only the
feature data we want to work with (age, shape, margin, and density) and
another array that contains the classes (severity). You'll also need an
array of the feature name labels.
all_features = masses_data[['age', 'shape', 'margin', 'density']].values
all_classes = masses_data['severity'].values
feature_names = ['age', 'shape', 'margin', 'density']
Some of our models require the input data to be normalized, so go ahead and normalize the attribute data. We will use preprocessing.StandardScaler() to do that.
from sklearn import preprocessing
scaler = preprocessing.StandardScaler()
all_features_scaled = scaler.fit_transform(all_features)
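As a rough illustration of what standardization computes, the sketch below does it by hand with numpy: subtract each column's mean and divide by its population standard deviation (ddof=0). The feature rows are hypothetical values, not taken from the data set.

```python
import numpy as np

# Hypothetical feature rows in the order: age, shape, margin, density
features = np.array([
    [67.0, 3.0, 5.0, 3.0],
    [58.0, 4.0, 5.0, 3.0],
    [28.0, 1.0, 1.0, 3.0],
    [57.0, 1.0, 5.0, 2.0],
])

mean = features.mean(axis=0)
std = features.std(axis=0)  # population standard deviation (ddof=0)
scaled = (features - mean) / std

print(scaled.mean(axis=0))  # approximately 0 for each column
print(scaled.std(axis=0))   # approximately 1 for each column
```

After scaling, age no longer dominates the other features simply because its raw values are larger.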
Now set up an actual MLP model using Keras:
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential
def create_model():
    model = Sequential()
    # 4 feature inputs going into a 6-unit layer (more does not seem to help; in fact you can go down to 4)
    model.add(Dense(6, input_dim=4, kernel_initializer='normal', activation='relu'))
    # Output layer with a binary classification (benign or malignant)
    model.add(Dense(1, kernel_initializer='normal', activation='sigmoid'))
    # Compile model with the adam optimizer
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model
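To make the shape of this network concrete, here is a minimal numpy sketch of the same 4-input, 6-unit ReLU, 1-unit sigmoid forward pass. The weights are small random values chosen for illustration, not trained values, and the names W1, b1, W2, b2, and forward are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Randomly initialized weights, for illustration only (not trained values)
W1 = rng.normal(scale=0.05, size=(4, 6))  # 4 inputs -> 6 hidden units
b1 = np.zeros(6)
W2 = rng.normal(scale=0.05, size=(6, 1))  # 6 hidden units -> 1 output
b2 = np.zeros(1)

def forward(x):
    hidden = np.maximum(0.0, x @ W1 + b1)   # ReLU activation
    logits = hidden @ W2 + b2
    return 1.0 / (1.0 + np.exp(-logits))    # sigmoid squashes to (0, 1)

batch = rng.normal(size=(3, 4))  # three hypothetical standardized samples
probabilities = forward(batch)
print(probabilities.shape)
```

Each output can be read as the model's estimated probability that a mass is malignant. The network has only 4*6 + 6 + 6*1 + 1 = 37 trainable parameters, which suits a data set this small.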
from sklearn.model_selection import cross_val_score
# Note: newer TensorFlow releases no longer ship this wrapper; the SciKeras package provides a replacement
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier
# Wrap our Keras model in an estimator compatible with scikit-learn
estimator = KerasClassifier(build_fn=create_model, epochs=100, verbose=0)
# Now we can use scikit-learn's cross_val_score to evaluate this model identically to the others
cv_scores = cross_val_score(estimator, all_features_scaled, all_classes, cv=10)
cv_scores.mean()
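The mean above is an average of ten held-out accuracies, one per fold. The sketch below shows that idea by hand with a trivial majority-class baseline. It is a simplification: it uses plain shuffled folds and random hypothetical labels, so it is not a reproduction of how cross_val_score splits the data, and the function name k_fold_accuracy is made up for this example.

```python
import numpy as np

def k_fold_accuracy(labels, k, rng):
    """Per-fold accuracy of a majority-class baseline over k shuffled folds."""
    indices = rng.permutation(len(labels))
    folds = np.array_split(indices, k)
    scores = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        # "Train": pick the most common class in the training folds
        majority = np.bincount(labels[train_idx]).argmax()
        # "Score": fraction of held-out labels that match the prediction
        scores.append(np.mean(labels[test_idx] == majority))
    return np.array(scores)

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=100)  # hypothetical binary labels
scores = k_fold_accuracy(labels, k=10, rng=rng)
print(scores.mean())
```

A baseline like this gives a floor to compare against: the neural network is only useful if its cross-validated accuracy is clearly higher than always guessing the most common class.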
If you need the full code, you can find it on my GitHub account. This post covers all the steps you need to follow to train the model.