Suppose you have a set of $N$ data points $\{x_i, y_i\}$ and a set of estimated uncertainties in the $y$ values $\{\delta y_i\}$. You wish to fit the data to the function $$y(x) = a x + b,$$ where $a$ and $b$ are adjustable parameters that give the "best fit." This page describes the programming needed to find the best-fit parameters $a$ and $b$ and their associated uncertainties, and to compute a $\chi^2$ value to assess the quality of the fit. We will use data from an experiment to measure Planck's constant to illustrate the process. You can read more about what constitutes the "best fit" on the Scientific Computing Toolbox website or refer to one of the books referenced below.
A table of the data with uncertainties is shown below. A text file with this data called `FakeData_with_error.csv` is included in the `fitting_with_error.zip` file.
$x$ | $y$ | $\delta y$ |
---|---|---|
8.213 | 3.261 | 9.71 |
7.402 | 2.52 | 5.59 |
6.876 | 2.239 | 7.08 |
5.491 | 1.299 | 6.83 |
5.196 | 1.175 | 8.93 |
The code is very similar to the code for a fit without uncertainties, but this time we must read the uncertainties from the file and pass an array containing these uncertainties to `curve_fit()`.
The first step is to import the needed Python modules.
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
```
We also need to define the linear function to be fit.
```python
def linearFunc(x, intercept, slope):
    """This function defines the function to be fit. In this case a linear
    function.

    Parameters
    ----------
    x : independent variable
    intercept : intercept
    slope : slope

    Returns
    -------
    y : dependent variable
    """
    y = intercept + slope * x
    return y
```
Next we have our program read the data from the `FakeData_with_error.csv` file. It's simple to read the uncertainties into an array. The file contains three columns of data: the first column contains the $x$ data values, the second the $y$ data values, and the third the uncertainties in $y$.
```python
xdata, ydata, d_y = np.loadtxt('FakeData_with_error.csv', delimiter=',', unpack=True)
```
We pass the uncertainties to `curve_fit` by adding the argument `sigma=d_y` to the function call.
```python
a_fit, cov = curve_fit(linearFunc, xdata, ydata, sigma=d_y, absolute_sigma=True)
```
We need to include `absolute_sigma=True` to make sure `sigma` is used in an absolute sense and the estimated parameter covariance `cov` is computed with the actual `sigma` values. If `absolute_sigma=False` (the default), the relative magnitudes of the `sigma` values are used to weight the fit, but the estimated slope and intercept uncertainties are computed assuming $\chi_r^2 = 1$. See the Scientific Computing Toolbox: Data Fitting page or the references listed at the end of this page for a more complete discussion of this subtle distinction.
We get the best-fit intercept and slope from `a_fit`, and their uncertainties from the square roots of the diagonal elements of the covariance matrix `cov`.
```python
inter = a_fit[0]
slope = a_fit[1]
d_inter = np.sqrt(cov[0][0])
d_slope = np.sqrt(cov[1][1])
```
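The off-diagonal elements of `cov` are the covariances between the fitted parameters. Although the script above doesn't use them, it can be instructive to look at the correlation between the intercept and slope estimates; this short sketch assumes the variables defined above.

```python
# Correlation coefficient between the fitted intercept and slope.
# Values near +1 or -1 mean the two parameter estimates are strongly correlated.
corr = cov[0][1] / (d_inter * d_slope)
print(f'intercept-slope correlation = {corr}')
```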
We use the function `errorbar()` to plot the data, showing error bars on the data points. This function works very much like the `plot()` command. We just have to specify an additional array containing the uncertainties with the argument `yerr=d_y`. The `errorbar()` function also requires that we specify the format to be used to display the data points. In this case the argument `fmt='r.'` will display the data as red dots.
```python
# Create a graph showing the data.
plt.errorbar(xdata, ydata, yerr=d_y, fmt='r.', label='Data')

# Compute a best-fit line from the fit intercept and slope.
yfit = inter + slope*xdata

# Create a graph of the fit to the data. We just use the ordinary plot
# command for this.
plt.plot(xdata, yfit, label='Fit')

# Display a legend and label the x and y axes.
plt.legend()
plt.xlabel('x')
plt.ylabel('y')

# Save the figure to a file.
plt.savefig('FakeDataPlot_with_error.png', dpi=300)

# Show the graph in a new window on the user's screen.
plt.show()
```
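Although it's not part of the original script, a plot of the normalized residuals $(y_i - y(x_i))/\delta y_i$ is another quick visual check of the fit: for a good fit, most points should scatter within about $\pm 2$ of zero with no obvious trend. A minimal sketch, using the arrays defined above:

```python
# Plot the normalized residuals; each has an uncertainty of 1 by construction.
resid = (ydata - yfit)/d_y
plt.errorbar(xdata, resid, yerr=np.ones_like(resid), fmt='r.')
plt.axhline(0, color='gray')
plt.xlabel('x')
plt.ylabel('normalized residual')
plt.show()
```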
Now we can display the numerical results.
```python
print(f'The slope = {slope}, with uncertainty {d_slope}')
print(f'The intercept = {inter}, with uncertainty {d_inter}')
```
```
The slope = 0.6656028702881751, with uncertainty 0.03157719566099185
The intercept = -2.3430681719234285, with uncertainty 0.21311098958953478
```
When we have estimated uncertainties in the data, we can estimate the goodness of fit by computing the reduced chi-squared statistic. For a linear fit to a set of $N$ data points $\{x_i,y_i\}$ that have estimated uncertainties in the $y_i$ values of $\{\delta y_i\}$,
$$
\chi_r^2 = \frac{1}{N-2}\sum_{i=1}^N \frac{\left(y_i-y(x_i)\right)^2}{(\delta y_i)^2},
$$
where $y(x) = a x + b$ is the fitted line and $N-2$ is the number of degrees of freedom (two parameters are determined by the fit). For a good fit, $\chi_r^2$ should be approximately equal to one.
```python
chisqr = sum((ydata - linearFunc(xdata, inter, slope))**2/d_y**2)
dof = len(ydata) - 2
chisqr_red = chisqr/dof
print(f'Reduced chi^2 = {chisqr_red}')
```
```
Reduced chi^2 = 1.2633310164063059
```
In this case, the computed value of $\chi_r^2$ is close to one, indicating a good fit. See the Scientific Computing Toolbox: Data Fitting page or the references listed at the end of this page for a more complete discussion of the reduced chi-squared statistic.
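If you want a more quantitative statement than "close to one," you can convert the $\chi^2$ value and the number of degrees of freedom into a p-value: the probability of getting a $\chi^2$ at least this large by chance if the linear model and the quoted uncertainties are correct. This sketch uses `scipy.stats.chi2`, which isn't used elsewhere on this page:

```python
from scipy.stats import chi2

# Probability of a chi^2 this large or larger with dof degrees of freedom.
p_value = chi2.sf(chisqr, dof)
print(f'p-value = {p_value}')
```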
The code above is all you should need to do linear fits to data. When you're confronted with a new data analysis problem that requires a fit, it is usually faster to modify existing code than to reinvent it every time. I've included a Python script, the data file, and the Jupyter Notebook that generated this page in a ZIP file. (Download the `fitting_with_error.zip` file.) Feel free to modify the program and use it when you need to do a linear fit.
A table with a short description of the Python functions used in this tutorial is posted on the Scientific Computing Toolbox: Command Summary Page.