Regression Lines and Estimation
Scope Label
Core 9758. Regression lines, line selection, interpolation, extrapolation, and reliability of estimates are core parts of correlation and linear regression.
Role in the Topic
This branch starts after the scatter diagram suggests that a linear model is sensible.
Regression as Best-Fit Modelling
A regression line is a best-fit line used to estimate one variable from another.
The phrase “best fit” is made precise by least squares.
For regression of $y$ on $x$, the line is chosen to minimise the sum of squared vertical errors.
For regression of $x$ on $y$, the line is chosen to minimise the sum of squared horizontal errors.
Caption: Regression of $y$ on $x$ minimises vertical errors, while regression of $x$ on $y$ minimises horizontal errors.
This is the key reason the two regression lines usually differ.
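The difference between the two lines can be checked numerically. The sketch below is illustrative only: the data are made up, the function names are hypothetical, and the forms $y = a + bx$ and $x = c + dy$ are assumed notation for the two lines. It uses the standard least-squares formulas, where the gradient is the sum of products of deviations divided by the sum of squared deviations of the input variable.

```python
from statistics import mean

def fit_y_on_x(xs, ys):
    """Least-squares line y = a + b*x (minimises squared vertical errors)."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b

def fit_x_on_y(xs, ys):
    """Least-squares line x = c + d*y (minimises squared horizontal errors)."""
    # Same formula with the roles of the two variables swapped.
    return fit_y_on_x(ys, xs)

# Made-up illustrative data, not taken from any real data set.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = fit_y_on_x(xs, ys)   # y = 2.2 + 0.6x
c, d = fit_x_on_y(xs, ys)   # x = -1 + 1.0y, which rearranges to y = 1 + x

print(f"y on x: y = {a:.2f} + {b:.2f}x")
print(f"x on y: x = {c:.2f} + {d:.2f}y")
```

On usual axes the two fitted lines have gradients $0.6$ and $1$, so they are different lines even though they come from the same data.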
Regression of $y$ on $x$
The regression line of $y$ on $x$ is used when $y$ is modelled or estimated from $x$.
It has the form
$$y = a + bx$$
or equivalent calculator notation.
This line treats $x$ as the explanatory or input variable and $y$ as the response or output variable.
If the fitted line is $y = a + bx$, then the gradient $b$ means:
for each one-unit increase in $x$, the predicted value of $y$ changes by approximately $b$ units.
The intercept $a$ is the predicted value of $y$ when $x = 0$, but it should only be interpreted if $x = 0$ is meaningful in context and not far outside the observed data range.
Regression of $x$ on $y$
The regression line of $x$ on $y$ is used when $x$ is modelled or estimated from $y$.
It has the form
$$x = c + dy$$
If this line is later rearranged to draw it on the usual $x$-$y$ axes, it is still the regression line of $x$ on $y$. The direction of least-squares error has not changed.
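Rearranging is pure algebra and does not refit anything. A minimal sketch, assuming the line of $x$ on $y$ is written as $x = c + dy$ (the function name and the numbers are hypothetical):

```python
def rearrange_x_on_y(c, d):
    """Rewrite x = c + d*y as y = intercept + gradient*x.

    This only changes how the line is written. It is still the
    regression line of x on y, fitted by minimising horizontal errors.
    """
    if d == 0:
        raise ValueError("x = c is a vertical line and cannot be written as y = ...")
    return -c / d, 1 / d

# Hypothetical fitted line of x on y: x = 2 + 0.5y
intercept, gradient = rearrange_x_on_y(2.0, 0.5)
print(f"y = {intercept} + {gradient}x")   # y = -4.0 + 2.0x
```

The rearranged equation is convenient for plotting, but it should still be used only for the purpose the original line was fitted for.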
Choosing the Correct Regression Line
The safest question is:
Which variable is being modelled as depending on the other?
If the context gives a dependence direction, follow it.
| Context | Suitable model |
|---|---|
| $y$ depends on $x$ | regression of $y$ on $x$ |
| $x$ depends on $y$ | regression of $x$ on $y$ |
| no clear dependence, estimate $y$ from $x$ | regression of $y$ on $x$ |
| no clear dependence, estimate $x$ from $y$ | regression of $x$ on $y$ |
Caption: Choosing the regression line depends first on dependence direction, then on the prediction target when no direction is clear.
The subtle case is calibration. If an instrument reading $y$ depends on the true concentration $x$, the modelling direction may still be $y$ on $x$ even when the question asks for $x$ from a given reading.
Mean Point
Both regression lines pass through the mean point $(\bar{x}, \bar{y})$.
This point is the centre of the bivariate data.
Caption: Both regression lines pass through $(\bar{x}, \bar{y})$, but they usually have different gradients.
The stronger the linear correlation, the closer the two regression lines are. If $r = 1$ or $r = -1$, all points are collinear and the two regression lines coincide.
When the regression line of $x$ on $y$ is rearranged into the usual form with $y$ as the subject, it has a larger absolute gradient than the regression line of $y$ on $x$, unless the two lines coincide in the perfect linear case.
Caption: Stronger linear correlation makes the two regression lines closer; perfect linear correlation makes them coincide.
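The mean-point and gradient facts follow from the standard least-squares formulas. In the sketch below, $S_{xy} = \sum (x - \bar{x})(y - \bar{y})$, $S_{xx} = \sum (x - \bar{x})^2$ and $S_{yy} = \sum (y - \bar{y})^2$, and $b$ and $d$ are assumed names for the gradients of the $y$ on $x$ and $x$ on $y$ lines respectively.

```latex
\begin{aligned}
y \text{ on } x:&\quad y - \bar{y} = b\,(x - \bar{x}), &\quad b &= \frac{S_{xy}}{S_{xx}} \\
x \text{ on } y:&\quad x - \bar{x} = d\,(y - \bar{y}), &\quad d &= \frac{S_{xy}}{S_{yy}} \\
&\quad bd = \frac{S_{xy}^{2}}{S_{xx}\,S_{yy}} = r^{2}
\end{aligned}
```

Substituting $x = \bar{x}$, $y = \bar{y}$ satisfies both equations, so both lines pass through the mean point. When $r \neq 0$, the rearranged $x$ on $y$ line has gradient $1/d = b / r^{2}$, and since $r^{2} \le 1$ its absolute gradient is at least $|b|$, with equality only when $r = \pm 1$.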
Interpolation and Extrapolation
An estimate is interpolation if the input value lies within the observed data range.
An estimate is extrapolation if the input value lies outside the observed data range.
Caption: Estimates inside the observed data range are more defensible than extrapolated estimates outside it.
Interpolation is generally more reliable because it stays within the evidence supplied by the data.
Extrapolation assumes the same relationship continues outside the observed range. That assumption may be false even when $|r|$ is close to $1$ within the sample.
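The distinction depends only on the observed range of the input variable. A minimal sketch with a hypothetical helper and made-up values:

```python
def classify_estimate(observed_inputs, new_input):
    """Label an estimate by comparing the input with the observed input range."""
    lowest, highest = min(observed_inputs), max(observed_inputs)
    if lowest <= new_input <= highest:
        return "interpolation"
    return "extrapolation"

# Made-up observed input values.
observed = [10, 14, 20, 25]
print(classify_estimate(observed, 18))   # interpolation
print(classify_estimate(observed, 40))   # extrapolation
```

Note that the check uses the range of the input variable of the chosen regression line, not the range of the variable being estimated.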
Reliability of Estimates
To judge reliability, ask:
- Is the input value within the observed data range?
- Is the scatter diagram roughly linear?
- Is $|r|$ close to $1$?
- Are there outliers or clusters that weaken the model?
- Does the estimate make sense in context?
Caption: A regression estimate is more reliable when it stays within the observed range and is supported by a strong linear pattern.
A good reliability statement names both range and linear strength:
The estimate is likely to be reliable because the input is within the observed range and the scatter diagram shows a strong linear pattern.
or:
The estimate is not reliable because it is extrapolation beyond the observed data range.
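The two numerical checks (range and linear strength) can be combined in a short sketch. Everything here is an assumption for illustration: the function names are hypothetical, the data are made up, and the cut-off of $0.8$ for a "strong" correlation is an arbitrary choice rather than a fixed rule. The code cannot judge outliers, curvature or context, which still need the scatter diagram.

```python
from math import sqrt
from statistics import mean

def pearson_r(xs, ys):
    """Product moment correlation coefficient r."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

def reliability_statement(xs, ys, new_x, strong_r=0.8):
    """Draft a reliability sentence for estimating y at new_x.

    strong_r is an assumed threshold for 'strong' linear correlation.
    """
    in_range = min(xs) <= new_x <= max(xs)
    strong = abs(pearson_r(xs, ys)) >= strong_r
    if not in_range:
        return "Not reliable: the estimate is extrapolation beyond the observed data range."
    if strong:
        return "Likely reliable: the input is within the observed range and the linear correlation is strong."
    return "Doubtful: the input is within the observed range but the linear correlation is weak."

# Made-up illustrative data.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
print(reliability_statement(xs, ys, 3))
print(reliability_statement(xs, ys, 9))
```

The range check is applied first, so a strong correlation never rescues an extrapolated estimate.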
Core Example: Choosing a Line
Suppose $y$ is a machine reading and $x$ is the true concentration of a chemical. The machine reading depends on the true concentration.
If the question gives a machine reading and asks for the concentration, it may seem natural to use the regression line of $x$ on $y$.
But if the modelling relationship is
$$y = a + bx$$
then the regression line of $y$ on $x$ may be the correct calibration model. After fitting it, solve the equation for $x$.
The reason is not algebraic convenience. It is that the reading error is in $y$.
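The calibration idea can be sketched as follows: fit the reading on the concentration, then invert the fitted equation to recover a concentration from a new reading. The data, variable names and helper functions are made up for illustration.

```python
from statistics import mean

def fit_line(inputs, outputs):
    """Least-squares line outputs = a + b*inputs."""
    in_bar, out_bar = mean(inputs), mean(outputs)
    s_io = sum((i - in_bar) * (o - out_bar) for i, o in zip(inputs, outputs))
    s_ii = sum((i - in_bar) ** 2 for i in inputs)
    b = s_io / s_ii
    a = out_bar - b * in_bar
    return a, b

# Made-up calibration data: known concentrations and the readings they produced.
concentration = [1.0, 2.0, 3.0, 4.0, 5.0]
reading = [2.1, 3.9, 6.2, 7.8, 10.1]

# The reading is modelled from the concentration, so the error is in the reading.
a, b = fit_line(concentration, reading)

def concentration_from_reading(new_reading):
    """Solve new_reading = a + b*concentration for the concentration."""
    return (new_reading - a) / b

print(round(concentration_from_reading(6.02), 3))
```

The inversion step reuses the same fitted line, so the least-squares direction stays with the variable that carries the measurement error.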
Core Example: Reliability
Suppose the observed $x$-values range from $x_{\min}$ to $x_{\max}$, and the regression line of $y$ on $x$ is used to estimate $y$ at a value of $x$ between $x_{\min}$ and $x_{\max}$.
This is interpolation, so it may be reasonable if the scatter diagram is roughly linear and $|r|$ is close to $1$.
If the same line is used at a value of $x$ below $x_{\min}$ or above $x_{\max}$, the estimate is extrapolation. It should be treated as unreliable unless there is strong external justification that the same linear pattern continues.
Common Pitfalls
- Choosing the regression line only by looking at which variable is unknown.
- Forgetting that context may determine the dependence direction.
- Treating the two regression lines as interchangeable.
- Rearranging the regression line of $x$ on $y$ to make $y$ the subject and forgetting it is still regression of $x$ on $y$.
- Forgetting both regression lines pass through $(\bar{x}, \bar{y})$.
- Forgetting that the rearranged $x$ on $y$ line has a larger absolute gradient than the $y$ on $x$ line on usual axes.
- Interpreting an intercept even when $x = 0$ is outside the meaningful context.
- Calling an extrapolated estimate reliable just because $|r|$ is close to $1$.
- Ignoring outliers or curvature when using a regression line.
Revision Checklist
- Can you explain why regression of $y$ on $x$ minimises vertical errors?
- Can you explain why regression of $x$ on $y$ minimises horizontal errors?
- Can you choose the correct regression line from the context?
- Can you explain why both regression lines pass through $(\bar{x}, \bar{y})$?
- Can you interpret the gradient and intercept of a regression line in context?
- Can you distinguish interpolation from extrapolation?
- Can you write a reliability sentence that mentions both range and linear strength?