Data Fitting with Error¶

Suppose you have a set of $N$ data points $\{x_i, y_i\}$ and a set of estimated uncertainties $\{\delta y_i\}$ in the $y$ values. You wish to fit the data to the function $$y(x) = a x + b,$$ where $a$ and $b$ are adjustable parameters that give the "best fit." This page describes only the programming needed to find the best-fit parameters $a$ and $b$, their associated uncertainties, and a $\chi^2$ value that measures the quality of the fit. We will use a small synthetic data set to illustrate the process. You can read more about what constitutes the "best fit" on the Scientific Computing Toolbox website or refer to one of the books referenced below.

A table of the data with uncertainties is shown below. A text file with this data, FakeData_with_error.csv, is included in the fitting_with_error.zip file.

$x$       $y$       $\delta y$
8.213     3.261     9.71
7.402     2.52      5.59
6.876     2.239     7.08
5.491     1.299     6.83
5.196     1.175     8.93

The code is very similar to the code for a fit without uncertainties, but this time we must also read the uncertainties from the file and pass an array containing them to curve_fit().

The first step is to import the needed Python modules.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

We also need to define the linear function to be fit.

In [2]:
def linearFunc(x,intercept,slope):
    """Defines the function to be fit. In this case a linear
    function.
    
    Parameters
    ----------
    x : independent variable
    intercept : intercept of the line
    slope : slope of the line
    
    Returns
    -------
    y : dependent variable
    """
    y = intercept + slope * x
    return y

Next we have our program read the data from the FakeData_with_error.csv file. Reading the uncertainties into an array is simple. The file contains three columns of data: the first contains the $x$ values, the second the $y$ values, and the third the uncertainties in $y$.

In [3]:
xdata,ydata,d_y = np.loadtxt('FakeData_with_error.csv',delimiter=',',unpack=True)

We pass the uncertainties to curve_fit by adding the argument sigma=d_y to the function call.

In [4]:
a_fit,cov=curve_fit(linearFunc,xdata,ydata,sigma=d_y,absolute_sigma=True)

We need to include absolute_sigma=True to make sure sigma is used in an absolute sense and the estimated parameter covariance cov is computed with the actual sigma values. If absolute_sigma=False (the default), only the relative magnitudes of the sigma values are used to weight the fit, and the estimated slope and intercept uncertainties are computed assuming $\chi_r^2 = 1$. See the Scientific Computing Toolbox: Data Fitting page or the references listed at the end of this page for a more complete discussion of this subtle distinction.
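The effect of absolute_sigma is easy to demonstrate. In this sketch (which uses a small made-up data set, not the tutorial's file), the best-fit parameters come out identical either way; only the covariance differs, with the absolute_sigma=False result equal to the absolute_sigma=True result rescaled by the reduced chi-squared:

```python
import numpy as np
from scipy.optimize import curve_fit

def linearFunc(x, intercept, slope):
    return intercept + slope * x

# Small made-up data set, just for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.1, 1.9, 3.2, 3.9, 5.1])
dy = np.array([0.2, 0.2, 0.2, 0.2, 0.2])

# Absolute uncertainties: cov reflects the actual sigma values.
p_abs, cov_abs = curve_fit(linearFunc, x, y, sigma=dy, absolute_sigma=True)

# Relative weighting (the default): cov is rescaled so that chi_r^2 = 1.
p_rel, cov_rel = curve_fit(linearFunc, x, y, sigma=dy, absolute_sigma=False)

# The best-fit parameters agree; the uncertainties differ by a factor
# of sqrt(chi_r^2).
print(np.sqrt(np.diag(cov_abs)))
print(np.sqrt(np.diag(cov_rel)))
```

With uniform uncertainties the two fits weight every point equally, so the distinction only shows up in the reported parameter uncertainties, not in the fitted line.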

We extract the best-fit intercept and slope, and their uncertainties, from a_fit and cov.

In [5]:
inter = a_fit[0]
slope = a_fit[1]

# The diagonal elements of the covariance matrix are the variances of
# the fit parameters, so the uncertainties are their square roots.
d_inter = np.sqrt(cov[0][0])
d_slope = np.sqrt(cov[1][1])

We use the function errorbar() to plot the data showing error bars on the data points. This function works very much like the plot() command. We just have to supply an additional array containing the uncertainties with the argument yerr=d_y. The errorbar() function also requires that we specify the format used to display the data points; in this case the argument fmt='r.' displays the data as red dots.

In [6]:
# Create a graph showing the data.
plt.errorbar(xdata,ydata,yerr=d_y,fmt='r.',label='Data')

# Compute a best fit line from the fit intercept and slope.
yfit = inter + slope*xdata

# Create a graph of the fit to the data. We just use the ordinary plot
# command for this.
plt.plot(xdata,yfit,label='Fit')

# Display a legend, label the x and y axes and title the graph.
plt.legend()
plt.xlabel('x')
plt.ylabel('y')

# Save the figure to a file
plt.savefig('FakeDataPlot_with_error.png',dpi=300)

# Show the graph in a new window on the users screen.
plt.show()

Now we can display the numerical result.

In [7]:
print(f'The slope = {slope}, with uncertainty {d_slope}')
print(f'The intercept = {inter}, with uncertainty {d_inter}')
The slope = 0.6656028702881751, with uncertainty 0.03157719566099185
The intercept = -2.3430681719234285, with uncertainty 0.21311098958953478

When we have estimated uncertainties in the data, we can assess the goodness of fit by computing the reduced chi-squared statistic. For a linear fit to a set of $N$ data points $\{x_i,y_i\}$ with estimated uncertainties $\{\delta y_i\}$ in the $y_i$ values,
$$ \chi_r^2 = \frac{1}{N-2}\sum_{i=1}^N \frac{\left(y_i-y(x_i)\right)^2}{(\delta y_i)^2}, $$ where $y(x)$ is the best-fit line. For a good fit, $\chi_r^2$ should be approximately equal to one.

In [8]:
chisqr = sum((ydata-linearFunc(xdata,inter,slope))**2/d_y**2)
dof = len(ydata) - 2
chisqr_red = chisqr/dof
print(f'Reduced chi^2 = {chisqr_red}')
Reduced chi^2 = 1.2633310164063059

In this case, the computed value of $\chi_r^2$ is close to one, indicating a good fit. See the Scientific Computing Toolbox: Data Fitting page or the references listed at the end of this page for a more complete discussion of the reduced chi-squared statistic.
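As an aside (this step is not part of the tutorial's script), scipy.stats can turn the total $\chi^2$ into a probability: the chance of obtaining a $\chi^2$ at least this large if the model and the uncertainty estimates are correct. A sketch, using the reduced chi-squared and degrees of freedom computed above:

```python
from scipy.stats import chi2

# Total (not reduced) chi-squared from the fit above:
# chisqr_red * dof, with dof = N - 2 = 3 for the 5 data points.
chisqr = 1.2633 * 3
dof = 3

# Survival function: probability of a chi^2 value at least this large
# arising by chance when the model and uncertainties are correct.
p_value = chi2.sf(chisqr, dof)
print(f'p = {p_value:.2f}')
```

A p-value that is not very small (here well above 0.05) is consistent with the conclusion that the linear model fits the data.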

Summary¶

The code above is all you should need to do linear fits to data. When you're confronted with a new data analysis problem that requires a fit, it is usually faster to modify existing code than to reinvent it every time. I've included a Python script, the data file, and the Jupyter Notebook that generated this page in a ZIP file. (Download the fitting_with_error.zip file) Feel free to modify the program and use it when you need to do a linear fit.

A table with a short description of the Python functions used in this tutorial is posted on the Scientific Computing Toolbox: Command Summary Page.


References

  1. Bevington and Robinson, Data Reduction and Error Analysis for the Physical Sciences, 3rd Edition (McGraw-Hill Education, 2002)
  2. Taylor, An Introduction to Error Analysis: The Study of Uncertainties in Physical Measurements, 2nd Edition (University Science Books, 1996)