Robust fitting of linear models using Least Absolute Deviations

https://en.wikipedia.org/wiki/Least_absolute_deviations
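Unlike ordinary least squares, which minimizes the sum of squared residuals \(\sum_i (y_i - \hat{y}_i)^2\), least absolute deviations minimizes \(\sum_i |y_i - \hat{y}_i|\). Because residuals are not squared, points far from the trend contribute proportionally rather than quadratically, which makes the fit much more robust to outliers.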

In [18]:
%matplotlib inline
%config InlineBackend.figure_format = "retina"
import matplotlib.pyplot as plt
from matplotlib import rcParams
rcParams["figure.dpi"] = 300
rcParams["savefig.dpi"] = 300
rcParams['figure.figsize'] = [7, 5]
rcParams["text.usetex"] = True
In [19]:
import tensorflow as tf
import numpy as np
In [20]:
np.random.seed(123)
In [21]:
from lad import lad

Let’s create some fake data with outliers:

In [22]:
# True model: y = 10 * x + 10, with Gaussian noise of standard deviation 10
x = np.linspace(0, 10, 100)
noise = 10 * np.random.normal(size=len(x))
y = 10 * x + 10 + noise
In [23]:
# Turn every fifth point into an outlier by inflating it with a factor
# that decreases linearly from 6 to 3
mask = np.arange(1, len(x) + 1) % 5 == 0
y[mask] = np.linspace(6, 3, len(y[mask])) * y[mask]
In [24]:
plt.plot(x[~mask], y[~mask], 'ok', markersize=3., alpha=.5)
plt.plot(x[mask], y[mask], 'o', markersize=3., color='red', alpha=.5, label='Outliers')
plt.plot(x, 10 * x + 10, 'k', label='True line')
plt.legend()
Out[24]:
<matplotlib.legend.Legend at 0x12c8f2128>
[Figure: simulated data (black) with outliers (red) and the true line]
In [25]:
X = np.vstack([x, np.ones(len(x))])
In [26]:
X.shape
Out[26]:
(2, 100)
In [27]:
y.shape
Out[27]:
(100,)
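Each column of X pairs an \(x\) value with a constant 1, so the two coefficients of the fit are the slope and the intercept.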

Let’s compute the linear least squares model using tf.matrix_solve_ls. It expects one row per sample, so we pass the transposed design matrix X.T and reshape y into a column vector:

In [28]:
with tf.Session() as sess:
    X_tensor = tf.convert_to_tensor(X.T, dtype=tf.float64)
    y_tensor = tf.reshape(tf.convert_to_tensor(y, dtype=tf.float64), (-1, 1))
    # Solve for [m, b] minimizing ||X_tensor @ beta - y_tensor||^2
    coeffs = tf.matrix_solve_ls(X_tensor, y_tensor)
    m, b = sess.run(coeffs)
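For reference, the same least-squares coefficients can be computed directly in NumPy with np.linalg.lstsq; the two results should agree to floating-point precision:

# Cross-check: solve the identical least-squares problem with NumPy.
coeffs_np, *_ = np.linalg.lstsq(X.T, y, rcond=None)
m_np, b_np = coeffs_np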
In [29]:
plt.plot(x[~mask], y[~mask], 'ok', markersize=3., alpha=.5)
plt.plot(x[mask], y[mask], 'o', markersize=3., color='red', alpha=.5, label='Outliers')
plt.plot(x, 10 * x + 10, 'k', label="True line")
plt.plot(x, m * x + b, 'g', label="Least squares line")
plt.legend()
Out[29]:
<matplotlib.legend.Legend at 0x12de64e80>
[Figure: data with the true line and the least squares line]

Now let’s use lad to fit the linear model by least absolute deviations; with cov=True it also returns the covariance matrix of the estimated coefficients:

In [30]:
l1coeffs, cov = lad(X.T, y, yerr=np.std(y), cov=True)
In [31]:
sess = tf.Session()
m_l1, b_l1 = sess.run(l1coeffs)

And the estimated values are:

In [32]:
m_l1, b_l1
Out[32]:
(array([ 10.5233137]), array([ 13.6857041]))
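For intuition, the L1 objective can also be minimized approximately by iteratively reweighted least squares (IRLS). The following is only a sketch of that general technique in plain NumPy (the helper lad_irls is hypothetical, and this is not necessarily how lad works internally):

def lad_irls(A, y, n_iter=50, eps=1e-8):
    # Start from the ordinary least-squares solution.
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    for _ in range(n_iter):
        # Weight each sample by 1/|residual|: points with large residuals
        # (the outliers) are down-weighted in the next weighted LS solve.
        r = np.abs(y - A @ beta)
        w = np.sqrt(1.0 / np.maximum(r, eps))
        beta, *_ = np.linalg.lstsq(w[:, None] * A, w * y, rcond=None)
    return beta

m_irls, b_irls = lad_irls(X.T, y)

Each iteration down-weights points in proportion to the size of their current residual, so the outliers progressively lose influence on the fit.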

The uncertainties on the fitted coefficients are:

In [33]:
unc = np.sqrt(np.diag(sess.run(cov)))
unc
Out[33]:
array([ 0.42935153,  3.5996013 ])
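For comparison, the true slope and intercept (both equal to 10) lie within about 1.2 standard deviations of the LAD estimates above.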
In [34]:
plt.plot(x[~mask], y[~mask], 'ok', markersize=3., alpha=.5)
plt.plot(x[mask], y[mask], 'o', markersize=3., color='red', alpha=.5, label='Outliers')
plt.plot(x, 10 * x + 10, 'k', label="True line")
plt.plot(x, m * x + b, 'g', label="Least squares line")
plt.plot(x, m_l1 * x + b_l1, 'blue', label="Least absolute deviations line")
plt.fill_between(x, (m_l1 - 1.96 * unc[0]) * x + b_l1 - 1.96 * unc[1],
                    (m_l1 + 1.96 * unc[0]) * x + b_l1 + 1.96 * unc[1], alpha=.5)
plt.legend()
Out[34]:
<matplotlib.legend.Legend at 0x12f80def0>
[Figure: data with the true, least squares, and least absolute deviations lines, plus the shaded confidence band]

The shaded region is the \(95\%\) confidence interval, obtained by shifting the fitted slope and intercept by \(\pm 1.96\) standard deviations (1.96 being the standard normal quantile for \(95\%\) coverage).