0:00:00.820,0:00:03.673
When performing linear regression, we have a number of data points.
0:00:03.673,0:00:09.043
Let's say that we have 1, 2, 3 and so on up through M data points.
0:00:09.043,0:00:12.013
Each data point has an output variable, Y,
0:00:12.013,0:00:15.400
and a number of input variables, X1 through X N.
0:00:16.410,0:00:20.400
So in our baseball example Y is the lifetime number of home runs.
0:00:20.400,0:00:24.370
And our X1 and XN are things like height and weight.
0:00:24.370,0:00:27.762
Our one through M samples might be different baseball players.
0:00:27.762,0:00:32.572
So maybe data point one is Derek Jeter, data point two is Barry Bonds, and
0:00:32.572,0:00:34.870
data point M is Babe Ruth.
0:00:34.870,0:00:37.823
Generally speaking, we are trying to predict the values of
0:00:37.823,0:00:41.779
the output variable for each data point, by multiplying the input variables by
0:00:41.779,0:00:45.579
some set of coefficients that we're going to call theta 1 through theta N.
0:00:45.579,0:00:49.105
Each theta, which we'll from here on out call the parameters or
0:00:49.105,0:00:52.826
the weights of the model, tell us how important an input variable is
0:00:52.826,0:00:55.589
when predicting a value for the output variable.
0:00:55.589,0:00:57.381
So if theta 1 is very small,
0:00:57.381,0:01:01.880
X1 must not be very important in general when predicting Y.
0:01:01.880,0:01:03.940
Whereas if theta N is very large,
0:01:03.940,0:01:07.300
then XN is generally a big contributor to the value of Y.
0:01:07.300,0:01:10.420
This model is built in such a way that we can multiply each X by
0:01:10.420,0:01:13.500
the corresponding theta, and sum them up to get Y.
0:01:13.500,0:01:17.080
So that our final equation will look something like the equation down here.
0:01:17.080,0:01:20.250
Theta 1 plus X1 plus theta 2 times X2,
0:01:20.250,0:01:23.948
all the way up to theta N plus XN equals Y.
0:01:23.948,0:01:26.670
And we'd want to be able to predict Y for each of our M data points.
0:01:27.780,0:01:31.844
In this illustration, the dark blue points represent our reserve data points,
0:01:31.844,0:01:34.712
whereas the green line shows the predictive value of Y for
0:01:34.712,0:01:37.598
every value of X given the model that we may have created.
0:01:37.598,0:01:41.380
The best equation is the one that's going to minimize the difference across all
0:01:41.380,0:01:44.160
data points between our predicted Y, and our observed Y.
0:01:45.280,0:01:48.876
What we need to do is find the thetas that produce the best predictions.
0:01:48.876,0:01:52.960
That is, making these differences as small as possible.
0:01:52.960,0:01:56.650
If we wanted to create a value that describes the total areas of our model,
0:01:56.650,0:01:58.380
we'd probably sum up the areas.
0:01:58.380,0:02:02.440
That is, sum over all of our data points from I equals 1, to M.
0:02:02.440,0:02:05.290
The predicted Y minus the actual Y.
0:02:05.290,0:02:08.320
However, since these errors can be both negative and
0:02:08.320,0:02:11.570
positive, if we simply sum them up, we could have
0:02:11.570,0:02:17.040
a total error term that's very close to 0, even if our model is very wrong.
0:02:17.040,0:02:20.710
In order to correct this, rather than simply adding up the error terms,
0:02:20.710,0:02:23.100
we're going to add the square of the error terms.
0:02:23.100,0:02:26.940
This guarantees that the magnitude of each individual error term,
0:02:26.940,0:02:29.680
Y predicted minus Y actual is positive.
0:02:30.920,0:02:33.620
Why don't we make sure the distinction between input of variables and
0:02:33.620,0:02:34.830
output of variables is clear.