Logistic Regression Applied to Classification of Breast Tumors

In this notebook, we use logistic regression to classify breast tumors into two classes: benign or malignant. The dataset used in this short tutorial is available here: https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/. Note: a few values were missing (labeled as ‘?’) and were replaced with zeros.

The full documentation of the dataset can be found in the breast-cancer-wisconsin.names file available at the link above. Nonetheless, here is a brief summary of its characteristics.

This dataset has nine integer-valued features that biologically characterize a given tumor, e.g., uniformity of cell size, clump thickness, etc. Every sample in the dataset has a label (or class) which indicates whether the tumor is benign or malignant: benign samples have Class == 2, whereas malignant samples have Class == 4.
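As noted above, the raw data file contains a few ‘?’ entries (they occur in the Bare Nuclei column). A minimal sketch of that preprocessing step, assuming you start from the raw breast-cancer-wisconsin.data file, could look like this:

import pandas as pd

# Read the raw file (it has no header row), interpret '?' as missing,
# and replace the missing entries with zeros, as described above.
raw_df = pd.read_csv('breast-cancer-wisconsin.data', header=None, na_values='?')
raw_df = raw_df.fillna(0)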

1. Data Visualization

Let’s load and visualize the dataset using pandas:

In [1]:
import pandas as pd
import numpy as np
np.random.seed(123)
In [2]:
names = ['Sample code number', 'Clump Thickness', 'Uniformity of Cell Size',
          'Uniformity of Cell Shape', 'Marginal Adhesion', 'Single Epithelial Cell Size',
          'Bare Nuclei', 'Bland Chromatin', 'Normal Nucleoli', 'Mitoses', 'Class']
In [3]:
breast_cancer_df = pd.read_csv('breast-cancer-wisconsin.data', names=names)
In [4]:
breast_cancer_df
Out[4]:
Sample code number Clump Thickness Uniformity of Cell Size Uniformity of Cell Shape Marginal Adhesion Single Epithelial Cell Size Bare Nuclei Bland Chromatin Normal Nucleoli Mitoses Class
0 1000025 5 1 1 1 2 1 3 1 1 2
1 1002945 5 4 4 5 7 10 3 2 1 2
2 1015425 3 1 1 1 2 2 3 1 1 2
3 1016277 6 8 8 1 3 4 3 7 1 2
4 1017023 4 1 1 3 2 1 3 1 1 2
5 1017122 8 10 10 8 7 10 9 7 1 4
6 1018099 1 1 1 1 2 10 3 1 1 2
7 1018561 2 1 2 1 2 1 3 1 1 2
8 1033078 2 1 1 1 2 1 1 1 5 2
9 1033078 4 2 1 1 2 1 2 1 1 2
10 1035283 1 1 1 1 1 1 3 1 1 2
11 1036172 2 1 1 1 2 1 2 1 1 2
12 1041801 5 3 3 3 2 3 4 4 1 4
13 1043999 1 1 1 1 2 3 3 1 1 2
14 1044572 8 7 5 10 7 9 5 5 4 4
15 1047630 7 4 6 4 6 1 4 3 1 4
16 1048672 4 1 1 1 2 1 2 1 1 2
17 1049815 4 1 1 1 2 1 3 1 1 2
18 1050670 10 7 7 6 4 10 4 1 2 4
19 1050718 6 1 1 1 2 1 3 1 1 2
20 1054590 7 3 2 10 5 10 5 4 4 4
21 1054593 10 5 5 3 6 7 7 10 1 4
22 1056784 3 1 1 1 2 1 2 1 1 2
23 1057013 8 4 5 1 2 0 7 3 1 4
24 1059552 1 1 1 1 2 1 3 1 1 2
25 1065726 5 2 3 4 2 7 3 6 1 4
26 1066373 3 2 1 1 1 1 2 1 1 2
27 1066979 5 1 1 1 2 1 2 1 1 2
28 1067444 2 1 1 1 2 1 2 1 1 2
29 1070935 1 1 3 1 2 1 1 1 1 2
... ... ... ... ... ... ... ... ... ... ... ...
669 1350423 5 10 10 8 5 5 7 10 1 4
670 1352848 3 10 7 8 5 8 7 4 1 4
671 1353092 3 2 1 2 2 1 3 1 1 2
672 1354840 2 1 1 1 2 1 3 1 1 2
673 1354840 5 3 2 1 3 1 1 1 1 2
674 1355260 1 1 1 1 2 1 2 1 1 2
675 1365075 4 1 4 1 2 1 1 1 1 2
676 1365328 1 1 2 1 2 1 2 1 1 2
677 1368267 5 1 1 1 2 1 1 1 1 2
678 1368273 1 1 1 1 2 1 1 1 1 2
679 1368882 2 1 1 1 2 1 1 1 1 2
680 1369821 10 10 10 10 5 10 10 10 7 4
681 1371026 5 10 10 10 4 10 5 6 3 4
682 1371920 5 1 1 1 2 1 3 2 1 2
683 466906 1 1 1 1 2 1 1 1 1 2
684 466906 1 1 1 1 2 1 1 1 1 2
685 534555 1 1 1 1 2 1 1 1 1 2
686 536708 1 1 1 1 2 1 1 1 1 2
687 566346 3 1 1 1 2 1 2 3 1 2
688 603148 4 1 1 1 2 1 1 1 1 2
689 654546 1 1 1 1 2 1 1 1 8 2
690 654546 1 1 1 3 2 1 1 1 1 2
691 695091 5 10 10 5 4 5 4 4 1 4
692 714039 3 1 1 1 2 1 1 1 1 2
693 763235 3 1 1 1 2 1 2 1 2 2
694 776715 3 1 1 1 3 2 1 1 1 2
695 841769 2 1 1 1 2 1 1 1 1 2
696 888820 5 10 10 3 7 3 8 10 2 4
697 897471 4 8 6 4 3 4 10 6 1 4
698 897471 4 8 8 5 4 5 10 4 1 4

699 rows × 11 columns

In [5]:
features = ['Clump Thickness', 'Uniformity of Cell Size',
            'Uniformity of Cell Shape', 'Marginal Adhesion', 'Single Epithelial Cell Size',
            'Bare Nuclei', 'Bland Chromatin', 'Normal Nucleoli', 'Mitoses']
In [6]:
corr = []
for f in features:
    c = breast_cancer_df[f].corr(breast_cancer_df['Class'], method='spearman')
    corr.append(c)
In [7]:
corr
Out[7]:
[0.68245186937823676,
 0.85548668244535364,
 0.83639412545877556,
 0.7279952033877698,
 0.76273086721512906,
 0.81376763955180775,
 0.74035036553976241,
 0.74382258149235514,
 0.52676617489092259]
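(The same Spearman correlations can be computed in a single call with corrwith; note that the method argument of corrwith requires a reasonably recent version of pandas.)

breast_cancer_df[features].corrwith(breast_cancer_df['Class'], method='spearman')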

Let’s take a look at the class distribution of the dataset:

In [8]:
benign_samples = breast_cancer_df[breast_cancer_df['Class'] == 2]
In [9]:
malignant_samples = breast_cancer_df[breast_cancer_df['Class'] == 4]
In [10]:
print("Percentage of benign examples: {}%".format(np.round(len(benign_samples) / len(breast_cancer_df) * 100)))
Percentage of benign examples: 66.0%
In [11]:
print("Percentage of malignant examples: {}%".format(np.round(len(malignant_samples) / len(breast_cancer_df) * 100)))
Percentage of malignant examples: 34.0%
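Equivalently, the class balance can be read off directly with value_counts:

# Fraction of samples per class, expressed as a percentage
breast_cancer_df['Class'].value_counts(normalize=True) * 100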

2. Model Fitting

Let’s use scikit-learn to split the dataset into training and testing sets:

In [12]:
from sklearn.model_selection import train_test_split
In [13]:
X_train, X_test, y_train, y_test = train_test_split(breast_cancer_df.loc[:, 'Clump Thickness':'Mitoses'],
                                                    breast_cancer_df['Class'] / 2 - 1, test_size=.3)

Note that I rescaled the 'Class' label so that 0 represents benign samples and 1 represents malignant samples. This is needed solely because of the label convention assumed by the logistic regression implementation in macaw.
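The expression 'Class' / 2 - 1 simply maps 2 to 0 and 4 to 1; an equivalent, more explicit way to build the labels would be (y here is only an illustrative name):

# Explicit mapping from the original class codes to 0/1 labels
y = breast_cancer_df['Class'].map({2: 0, 4: 1})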

Now, let’s import the LogisticRegression objective function from macaw:

In [14]:
from macaw.objective_functions import LogisticRegression

See https://mirca.github.io/macaw/api/objective_functions.html#macaw.objective_functions.LogisticRegression for documentation.
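As a quick, macaw-independent refresher: logistic regression models the probability of the positive class as the sigmoid of a linear combination of the features, and hard labels are obtained by thresholding that probability at 0.5. A minimal sketch:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, weights, bias):
    # Probability that each sample belongs to the positive (malignant) class
    return sigmoid(X @ weights + bias)

def predict_class(X, weights, bias, threshold=0.5):
    # Hard 0/1 labels obtained by thresholding the probabilities
    return (predict_proba(X, weights, bias) >= threshold).astype(float)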

Let’s instantiate an object from LogisticRegression passing the labels y_train and the features X_train:

In [15]:
logreg = LogisticRegression(y=np.array(y_train, dtype=float), X=np.array(X_train, dtype=float))

Let’s use the fit method to get the maximum likelihood weights.

Note that we need to pass an initial estimate for the linear weights and the bias of the ``LogisticRegression`` objective:

In [16]:
res = logreg.fit(x0=np.zeros(X_train.shape[1] + 1))

The maximum likelihood weights can be accessed through the .x attribute:

In [17]:
res.x
Out[17]:
array([  0.6716211 ,  -0.12269987,   0.22323592,   0.37896363,
        -0.06950043,   0.48099004,   0.65926442,   0.25699509,
         0.58662442, -11.18542664])

Additionally, we can check the status of the fit and the number of iterations that it took to converge.

In [18]:
res.status
Out[18]:
'Success: parameters have not changed by 1e-06 since the previous iteration.'
In [19]:
print("Number of iterations needed: {}".format(res.niters))
Number of iterations needed: 237

Now, let’s compute the accuracy of our model on the test set. For that, we can use the predict method, passing the testing samples; it outputs the predicted class of each sample:

In [20]:
logreg.predict(np.array(X_test))
Out[20]:
array([ 1.,  0.,  1.,  0.,  0.,  0.,  1.,  0.,  1.,  1.,  0.,  0.,  0.,
        0.,  0.,  1.,  0.,  1.,  0.,  0.,  0.,  0.,  1.,  0.,  1.,  0.,
        0.,  1.,  0.,  1.,  0.,  1.,  0.,  1.,  1.,  0.,  0.,  0.,  0.,
        0.,  0.,  1.,  0.,  1.,  0.,  0.,  0.,  1.,  0.,  1.,  0.,  0.,
        1.,  0.,  0.,  1.,  1.,  0.,  1.,  1.,  1.,  0.,  1.,  0.,  0.,
        0.,  1.,  0.,  0.,  1.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,
        0.,  1.,  0.,  1.,  1.,  1.,  0.,  0.,  1.,  1.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  1.,  1.,  0.,  1.,  0.,  0.,  0.,  1.,  0.,
        1.,  0.,  0.,  1.,  0.,  1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,
        1.,  1.,  1.,  1.,  0.,  1.,  0.,  1.,  1.,  0.,  0.,  1.,  0.,
        0.,  1.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,
        0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  1.,  0.,  1.,  1.,
        0.,  0.,  0.,  1.,  1.,  0.,  0.,  1.,  0.,  1.,  0.,  0.,  0.,
        1.,  0.,  1.,  1.,  0.,  1.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  1.,  1.,  0.,  1.,  0.,  0.,  1.,  1.,  0.,  0.,  1.,
        0.,  0.,  1.,  1.,  0.,  1.,  1.,  0.,  1.,  0.,  0.,  0.,  0.,
        0.,  0.])

Now we can compute the percentage of samples correctly classified:

In [21]:
accuracy = np.round((np.array(y_test) == logreg.predict(np.array(X_test))).sum() / len(np.array(y_test)) * 100, decimals=5)
In [22]:
print('The accuracy of the model is {}%'.format(accuracy))
The accuracy of the model is 96.19048%
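For reference, the same number can be obtained with scikit-learn's accuracy_score:

from sklearn.metrics import accuracy_score

# Fraction of correctly classified test samples, as a percentage
accuracy_score(np.array(y_test), logreg.predict(np.array(X_test))) * 100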

3. Comparison against scikit-learn

Let’s compare macaw against scikit-learn:

In [23]:
from sklearn.linear_model import LogisticRegression
In [24]:
logit = LogisticRegression()
In [25]:
logit.fit(X_train, y_train)
Out[25]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
In [26]:
logit.score(X_test, y_test)
Out[26]:
0.96190476190476193

Looks like macaw is in good agreement with sklearn :)
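Beyond the overall accuracy, the fitted parameters can also be compared: scikit-learn exposes them as coef_ and intercept_ on the fitted estimator, while macaw's are in res.x (presumably the nine weights followed by the bias, given the shape of the initial estimate). Keep in mind that sklearn's LogisticRegression applies L2 regularization by default (penalty='l2', C=1.0, as shown in Out[25]), so the two sets of coefficients will not agree exactly.

print(logit.coef_, logit.intercept_)  # sklearn's weights and bias (L2-regularized)
print(res.x)                          # macaw's maximum likelihood parameters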

4. Logistic Regression with L1 Regularization
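macaw also provides an L1-regularized variant of the objective, L1LogisticRegression, whose alpha parameter sets the strength of the penalty. Conceptually, the objective adds an alpha-weighted L1 term on the weights to the logistic negative log-likelihood; an illustrative sketch (macaw's exact conventions, e.g. whether the bias is penalized, may differ):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def l1_logreg_objective(w, b, X, y, alpha):
    # Negative log-likelihood of the logistic model plus an L1 penalty on the weights
    p = sigmoid(X @ w + b)
    nll = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    return nll + alpha * np.sum(np.abs(w))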

In [27]:
from macaw.objective_functions import L1LogisticRegression
In [28]:
alpha = [.1, 1., 10., 100.]
In [29]:
acc = []
for a in alpha:
    l1logreg = L1LogisticRegression(y=np.array(y_train, dtype=float), X=np.array(X_train, dtype=float), alpha=a)
    res_l1 = l1logreg.fit(x0=np.zeros(X_train.shape[1] + 1) + 1e-1)
    accuracy = np.round((np.array(y_test) == l1logreg.predict(np.array(X_test))).sum() / len(np.array(y_test)) * 100,
                        decimals=5)
    acc.append(accuracy)
In [30]:
acc
Out[30]:
[95.238100000000003,
 95.714290000000005,
 96.666669999999996,
 62.380949999999999]
In [31]:
import matplotlib.pyplot as plt
%matplotlib inline
In [32]:
plt.loglog(alpha, acc, '*', markersize=15)
plt.ylabel('accuracy')
plt.xlabel('alpha')
Out[32]:
<matplotlib.text.Text at 0x112580278>
[Figure: test accuracy as a function of the regularization strength alpha, plotted on log-log axes]