Machine Learning (Vol 4) - Neural Networks

Tuesday, May 7, 2019

Train a Neural Network to Predict Whether a Mammogram Mass Is Benign or Malignant


We'll be using the "mammographic masses" public data set from the UCI repository (source: https://archive.ics.uci.edu/ml/datasets/Mammographic+Mass).
This data set contains 961 instances of masses detected in mammograms, with the following attributes:
  1. BI-RADS assessment: 1 to 5 (ordinal)
  2. Age: patient's age in years (integer)
  3. Shape: mass shape: round=1 oval=2 lobular=3 irregular=4 (nominal)
  4. Margin: mass margin: circumscribed=1 microlobulated=2 obscured=3 ill-defined=4 spiculated=5 (nominal)
  5. Density: mass density high=1 iso=2 low=3 fat-containing=4 (ordinal)
  6. Severity: benign=0 or malignant=1 (binominal)
BI-RADS is an assessment of how confident the severity classification is; it is not a "predictive" attribute and so we will discard it. The age, shape, margin, and density attributes are the features that we will build our model with, and "severity" is the classification we will attempt to predict based on those attributes.
Although "shape" and "margin" are nominal data types, which sklearn typically doesn't deal with well, they are close enough to ordinal that we shouldn't just discard them. The "shape" for example is ordered increasingly from round to irregular.

False positives in mammogram results cause a lot of unnecessary anguish and surgery. If we can build a better way to interpret them through supervised machine learning, it could improve a lot of lives.

Let's Start

Prepare the data set

Start by importing the mammographic_masses.data.txt file into a Pandas data frame.

Make sure you use the optional parameters in read_csv to convert missing data (indicated by a ?) into NaN, and to add the appropriate column names (BI-RADS, age, shape, margin, density, and severity):

import pandas as pd

masses_data = pd.read_csv('mammographic_masses.data.txt', na_values=['?'], names = ['BI-RADS', 'age', 'shape', 'margin', 'density', 'severity'])
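
To see what na_values and names do, here is a small sketch that reads a few rows from an in-memory string instead of the real file. The sample rows are illustrative stand-ins in the same comma-separated layout, and the names sample_text and sample_data are assumptions made for this example.

```python
import io
import pandas as pd

# A few illustrative rows in the same layout as the data file;
# "?" marks a missing value.
sample_text = "5,67,3,5,3,1\n4,43,1,1,?,1\n5,58,4,5,3,1\n"

column_names = ['BI-RADS', 'age', 'shape', 'margin', 'density', 'severity']

# na_values=['?'] turns every "?" into NaN; names= supplies the column
# names, since the text itself has no header row.
sample_data = pd.read_csv(io.StringIO(sample_text), na_values=['?'], names=column_names)

print(sample_data)
print(sample_data.isnull().sum())
```

The second print shows one missing value in the density column and none elsewhere, which is the kind of summary worth checking on the full data set too.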

There are quite a few missing values in the data set. Before we just drop every row that's missing data, let's make sure we don't bias our data in doing so. Is there any pattern in which rows have missing fields? If there were, we'd have to go back and try to fill that data in instead of dropping it.

masses_data.loc[(masses_data['age'].isnull()) |
              (masses_data['shape'].isnull()) |
              (masses_data['margin'].isnull()) |
              (masses_data['density'].isnull())]
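
Eyeballing the rows is one option. A slightly more systematic check, sketched below on a small invented frame, is to count missing values per column and compare the share of malignant cases in incomplete rows against complete rows. The frame demo and its values are assumptions for illustration; on the real data the same calls would be applied to masses_data.

```python
import numpy as np
import pandas as pd

# Invented stand-in for masses_data (values are illustrative only).
demo = pd.DataFrame({
    'age':      [67, 43, np.nan, 28, 74, 65],
    'shape':    [3, 1, 4, 1, np.nan, 1],
    'margin':   [5, 1, 5, 1, 5, np.nan],
    'density':  [3, np.nan, 3, 3, np.nan, 3],
    'severity': [1, 1, 1, 0, 1, 0],
})

# How many values are missing in each column?
missing_per_column = demo.isnull().sum()

# Flag rows with at least one missing feature.
has_missing = demo[['age', 'shape', 'margin', 'density']].isnull().any(axis=1)

# Share of malignant cases in complete (False) vs. incomplete (True) rows.
# A large gap would suggest that dropping rows could bias the data.
malignant_rate = demo.groupby(has_missing)['severity'].mean()

print(missing_per_column)
print(malignant_rate)
```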

If the missing data seems randomly distributed, go ahead and drop rows with missing data.

masses_data.dropna(inplace=True)

Next you'll need to convert the Pandas data frame into NumPy arrays that can be used by scikit-learn. Create an array that extracts only the feature data we want to work with (age, shape, margin, and density) and another array that contains the classes (severity). You'll also need an array of the feature name labels.

all_features = masses_data[['age', 'shape', 'margin', 'density']].values
all_classes = masses_data['severity'].values
feature_names = ['age', 'shape', 'margin', 'density']

Neural networks generally train better when the input features are on a similar scale, so go ahead and normalize the attribute data. We will use preprocessing.StandardScaler() to do that.

from sklearn import preprocessing

scaler = preprocessing.StandardScaler()
all_features_scaled = scaler.fit_transform(all_features)
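
StandardScaler standardizes each column by subtracting its mean and dividing by its standard deviation. A minimal NumPy sketch of that arithmetic on a small invented array follows; it is meant to illustrate the transformation, not to replace the scaler, and the array values are assumptions.

```python
import numpy as np

# Invented feature rows: age, shape, margin, density (illustrative only).
features = np.array([
    [67.0, 3.0, 5.0, 3.0],
    [58.0, 4.0, 5.0, 3.0],
    [28.0, 1.0, 1.0, 2.0],
    [57.0, 1.0, 4.0, 1.0],
])

# Per-column mean and population standard deviation (ddof=0),
# which matches StandardScaler's default behavior.
column_means = features.mean(axis=0)
column_stds = features.std(axis=0)

features_scaled = (features - column_means) / column_stds

# After scaling, every column has mean ~0 and standard deviation ~1.
print(features_scaled.mean(axis=0))
print(features_scaled.std(axis=0))
```

This puts age (tens of years) on the same footing as the small integer codes used for shape, margin, and density.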

Now set up an actual MLP model using Keras:

from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential

def create_model():
    model = Sequential()
    # 4 feature inputs going into a 6-unit layer (more does not seem to help - in fact you can go down to 4)
    model.add(Dense(6, input_dim=4, kernel_initializer='normal', activation='relu'))

    # Output layer with a binary classification (benign or malignant)
    model.add(Dense(1, kernel_initializer='normal', activation='sigmoid'))

    # Compile model with binary cross-entropy loss and the adam optimizer
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

    return model
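
To make the architecture concrete, the sketch below reproduces the forward pass of this 4-6-1 network in plain NumPy. The weights are random placeholders rather than trained values, and the helper names (relu, sigmoid, forward) are assumptions made for illustration.

```python
import numpy as np

def relu(x):
    # Rectified linear unit: negative values become 0.
    return np.maximum(0.0, x)

def sigmoid(x):
    # Squashes any real number into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)

# Placeholder weights with the same shapes as the two Dense layers:
# hidden layer maps 4 inputs -> 6 units, output layer maps 6 units -> 1 unit.
hidden_weights = rng.normal(scale=0.05, size=(4, 6))
hidden_bias = np.zeros(6)
output_weights = rng.normal(scale=0.05, size=(6, 1))
output_bias = np.zeros(1)

def forward(batch):
    hidden = relu(batch @ hidden_weights + hidden_bias)
    return sigmoid(hidden @ output_weights + output_bias)

# Three made-up, already-scaled feature rows.
batch = rng.normal(size=(3, 4))
probabilities = forward(batch)

# Total trainable parameters: (4*6 + 6) + (6*1 + 1)
n_parameters = (hidden_weights.size + hidden_bias.size
                + output_weights.size + output_bias.size)

print(probabilities.shape, n_parameters)
```

Each output is a value between 0 and 1 that can be read as the predicted probability of malignancy, which is why a sigmoid output pairs with the binary cross-entropy loss.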

from sklearn.model_selection import cross_val_score
# Note: this wrapper module was removed from recent TensorFlow releases;
# there, the separate scikeras package offers a KerasClassifier replacement.
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier

# Wrap our Keras model in an estimator compatible with scikit-learn
estimator = KerasClassifier(build_fn=create_model, epochs=100, verbose=0)
# Now we can use scikit-learn's cross_val_score to evaluate this model with 10-fold cross-validation
cv_scores = cross_val_score(estimator, all_features_scaled, all_classes, cv=10)
cv_scores.mean()
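
With cv=10, cross_val_score splits the data into ten folds, trains on nine of them, scores on the held-out fold, and repeats so that each fold is held out once. The sketch below shows the same call on synthetic data, with a logistic regression standing in for the Keras model so that it runs quickly. Both the synthetic data and the stand-in classifier are assumptions for illustration, so the scores say nothing about the mammography task.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)

# Synthetic stand-in data: 200 rows, 4 features; the label depends on
# the first two features plus some noise.
demo_features = rng.normal(size=(200, 4))
noise = rng.normal(scale=0.5, size=200)
demo_classes = (demo_features[:, 0] + demo_features[:, 1] + noise > 0).astype(int)

# One accuracy score per fold.
demo_scores = cross_val_score(LogisticRegression(), demo_features, demo_classes, cv=10)

print(demo_scores)
print(demo_scores.mean())
```

Averaging the ten fold scores gives a steadier estimate of accuracy than a single train/test split, which matters on a data set of only a few hundred rows.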

If you need the full code, you can find it on my GitHub account. This post contains all the steps you need to follow to train the model.
