Regression Lines and Estimation

Scope Label

Core 9758. Regression lines, line selection, interpolation, extrapolation, and reliability of estimates are core parts of correlation and linear regression.

Role in the Topic

This branch starts after the scatter diagram suggests that a linear model is sensible.

Regression as Best-Fit Modelling

A regression line is a best-fit line used to estimate one variable from another.

The phrase “best fit” is made precise by least squares.

For regression of $y$ on $x$, the line is chosen to minimise the sum of squared vertical errors.

For regression of $x$ on $y$, the line is chosen to minimise the sum of squared horizontal errors.

Caption: Regression of $y$ on $x$ minimises vertical errors, while regression of $x$ on $y$ minimises horizontal errors.

This is the key reason the two regression lines usually differ.
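Written out, the two least-squares criteria look like this. The notation is an assumption made for illustration: $y = a + bx$ for the line of $y$ on $x$, $x = c + dy$ for the line of $x$ on $y$, and $(x_i, y_i)$ for the $n$ data points.

```latex
\begin{align*}
\text{$y$ on $x$:}\quad & \text{choose } a, b \text{ to minimise } \sum_{i=1}^{n} \bigl(y_i - (a + b x_i)\bigr)^2, \\
\text{$x$ on $y$:}\quad & \text{choose } c, d \text{ to minimise } \sum_{i=1}^{n} \bigl(x_i - (c + d y_i)\bigr)^2.
\end{align*}
```

The first sum measures vertical distances from each point to the line and the second measures horizontal distances, so the two minimisations generally produce different lines.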

Regression of $y$ on $x$

The regression line of $y$ on $x$ is used when $y$ is modelled or estimated from $x$.

It has the form

$$y = a + bx$$

or equivalent calculator notation.

This line treats $x$ as the explanatory or input variable and $y$ as the response or output variable.

If the fitted line is

$$y = a + bx$$

then the gradient $b$ means:

for each one-unit increase in $x$, the predicted value of $y$ changes by approximately $b$ units.

The intercept $a$ is the predicted value of $y$ when $x = 0$, but it should only be interpreted if $x = 0$ is meaningful in context and not far outside the observed data range.
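As a rough illustration of reading a gradient and intercept in context, the sketch below fits a line of $y$ on $x$ using the standard least-squares formulas. The data (hours studied and test score) and the function name `fit_y_on_x` are made up for this example.

```python
def fit_y_on_x(xs, ys):
    """Least-squares line y = a + b*x (regression of y on x)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx          # gradient
    a = y_bar - b * x_bar    # intercept
    return a, b

# Made-up illustrative data: x = hours studied, y = test score.
hours = [1, 2, 3, 4, 5]
score = [52, 55, 61, 64, 68]

a, b = fit_y_on_x(hours, score)
print(f"score = {a:.2f} + {b:.2f} * hours")
```

For this data the gradient is 4.1, so each extra hour of study raises the predicted score by about 4.1 marks. The intercept 47.7 is the predicted score at 0 hours, which lies outside the observed range of 1 to 5 hours and so should be interpreted with caution.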

Regression of $x$ on $y$

The regression line of $x$ on $y$ is used when $x$ is modelled or estimated from $y$.

It has the form

$$x = c + dy$$

If this line is later rearranged to draw it on the usual $x$-$y$ axes, it is still the regression line of $x$ on $y$. The direction of least-squares error has not changed.
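A small numerical sketch can make this concrete. It fits both lines to the same made-up data and then rearranges the $x$ on $y$ line so that $y$ is the subject. The helper `fit` and the symbols `a`, `b`, `c`, `d` are illustrative names, not fixed notation.

```python
def fit(us, vs):
    """Least-squares line v = p + q*u, minimising squared errors in v."""
    n = len(us)
    u_bar, v_bar = sum(us) / n, sum(vs) / n
    s_uv = sum((u - u_bar) * (v - v_bar) for u, v in zip(us, vs))
    s_uu = sum((u - u_bar) ** 2 for u in us)
    q = s_uv / s_uu
    return v_bar - q * u_bar, q

# Made-up illustrative data.
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]

a, b = fit(xs, ys)   # y on x: y = a + b*x
c, d = fit(ys, xs)   # x on y: x = c + d*y (roles swapped)

# Rearranging x = c + d*y gives y = -c/d + (1/d)*x.
rearranged_intercept, rearranged_gradient = -c / d, 1 / d

print(f"y on x:             y = {a:.3f} + {b:.3f} x")
print(f"x on y, rearranged: y = {rearranged_intercept:.3f} + {rearranged_gradient:.3f} x")
```

For this data the $y$ on $x$ gradient is 0.8, while the rearranged $x$ on $y$ line has gradient 1.25. Rearranging changes how the equation is written, not which errors were minimised, so the rearranged line does not turn into the $y$ on $x$ line.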

Choosing the Correct Regression Line

The safest question is:

Which variable is being modelled as depending on the other?

If the context gives a dependence direction, follow it.

| Context | Suitable model |
| $y$ depends on $x$ | regression of $y$ on $x$ |
| $x$ depends on $y$ | regression of $x$ on $y$ |
| no clear dependence, estimate $y$ from $x$ | regression of $y$ on $x$ |
| no clear dependence, estimate $x$ from $y$ | regression of $x$ on $y$ |

Caption: Choosing the regression line depends first on dependence direction, then on the prediction target when no direction is clear.

The subtle case is calibration. If an instrument reading $y$ depends on the true concentration $x$, the modelling direction may still be $y$ on $x$ even when the question asks for $x$ from a given reading.

Mean Point

Both regression lines pass through the mean point $(\bar{x}, \bar{y})$.

This point is the centre of the bivariate data.

Caption: Both regression lines pass through $(\bar{x}, \bar{y})$, but they usually have different gradients.
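One way to see why, sketched under the assumed notation $y = a + bx$ for the line of $y$ on $x$ and $x = c + dy$ for the line of $x$ on $y$: setting the derivative of each least-squares sum with respect to its intercept to zero forces the residuals to sum to zero.

```latex
\begin{align*}
\frac{\partial}{\partial a} \sum_{i=1}^{n} (y_i - a - b x_i)^2 = 0
  &\;\Longrightarrow\; \sum_{i=1}^{n} (y_i - a - b x_i) = 0
  \;\Longrightarrow\; \bar{y} = a + b\bar{x}, \\
\frac{\partial}{\partial c} \sum_{i=1}^{n} (x_i - c - d y_i)^2 = 0
  &\;\Longrightarrow\; \sum_{i=1}^{n} (x_i - c - d y_i) = 0
  \;\Longrightarrow\; \bar{x} = c + d\bar{y}.
\end{align*}
```

Each final equation says that $(\bar{x}, \bar{y})$ satisfies the corresponding line. This derivation is offered as background and uses calculus that the regression topic itself may not require.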

The stronger the linear correlation, the closer the two regression lines are. If $r = 1$ or $r = -1$, all points are collinear and the two regression lines coincide.

When the regression line of $x$ on $y$ is rearranged to make $y$ the subject, it has a larger absolute gradient than the regression line of $y$ on $x$, unless the two lines coincide in the perfect linear case.

Caption: Stronger linear correlation makes the two regression lines closer; perfect linear correlation makes them coincide.
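The gradient comparison can be checked with the standard least-squares formulas. The symbols are assumptions for illustration: $b$ is the gradient of the $y$ on $x$ line, $d$ is the coefficient of $y$ in the $x$ on $y$ line $x = c + dy$, and $S_{xx}$, $S_{yy}$, $S_{xy}$ are the usual sums of squares and products about the means.

```latex
\begin{align*}
b &= \frac{S_{xy}}{S_{xx}}, \qquad
d = \frac{S_{xy}}{S_{yy}}, \qquad
r = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}}, \\
bd &= \frac{S_{xy}^{2}}{S_{xx}\,S_{yy}} = r^{2}, \\
\frac{|b|}{\left|1/d\right|} &= |bd| = r^{2} \le 1 \qquad (r \neq 0).
\end{align*}
```

On the usual axes the rearranged $x$ on $y$ line has gradient $1/d$, so the last line shows $|b| \le |1/d|$, with equality exactly when $r = \pm 1$. As $|r|$ approaches $1$ the two gradients approach each other, which is why stronger correlation brings the lines closer together.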

Interpolation and Extrapolation

An estimate is interpolation if the input value lies within the observed data range.

An estimate is extrapolation if the input value lies outside the observed data range.

Caption: Estimates inside the observed data range are more defensible than extrapolated estimates outside it.

Interpolation is generally more reliable because it stays within the evidence supplied by the data.

Extrapolation assumes the same relationship continues outside the observed range. That assumption may be false even when $|r|$ is close to $1$ within the sample.
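The range test itself is mechanical. A minimal sketch, using a hypothetical helper `classify_estimate` and made-up input values, might look like this:

```python
def classify_estimate(x_new, observed_xs):
    """Label an estimate by whether x_new lies inside the observed range."""
    lo, hi = min(observed_xs), max(observed_xs)
    return "interpolation" if lo <= x_new <= hi else "extrapolation"

observed = [12, 15, 19, 23, 30]   # made-up input values

print(classify_estimate(20, observed))  # interpolation
print(classify_estimate(45, observed))  # extrapolation
```

The check only compares the input with the smallest and largest observed values. It says nothing about whether the linear pattern is strong, so it is one part of a reliability judgement rather than the whole of it.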

Reliability of Estimates

To judge reliability, ask:

  1. Is the input value within the observed data range?
  2. Is the scatter diagram roughly linear?
  3. Is $|r|$ close to $1$?
  4. Are there outliers or clusters that weaken the model?
  5. Does the estimate make sense in context?

Caption: A regression estimate is more reliable when it stays within the observed range and is supported by a strong linear pattern.

A good reliability statement names both range and linear strength:

The estimate is likely to be reliable because the input is within the observed range and the scatter diagram shows a strong linear pattern.

or:

The estimate is not reliable because it is extrapolation beyond the observed data range.
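The two numerical parts of such a statement, range and linear strength, can be checked with a short sketch. The function names, the data, and the cut-off `strong=0.9` are all illustrative assumptions. No fixed threshold for "strong" is implied, and the code cannot judge curvature, outliers, or whether the estimate makes sense in context.

```python
from math import sqrt

def pmcc(xs, ys):
    """Product moment correlation coefficient r of paired data."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

def reliability_checks(x_new, xs, ys, strong=0.9):
    """Check range and linear strength; `strong` is an arbitrary cut-off."""
    r = pmcc(xs, ys)
    return {
        "within_range": min(xs) <= x_new <= max(xs),
        "strong_linear": abs(r) >= strong,
        "r": r,
    }

# Made-up illustrative data.
xs = [1, 2, 3, 4, 5]
ys = [52, 55, 61, 64, 68]

print(reliability_checks(3.5, xs, ys))  # input inside the observed range
print(reliability_checks(9.0, xs, ys))  # input outside the observed range
```

Both calls report the same value of $r$, because the correlation belongs to the data rather than to the input. Only the range check changes, which is why a strong correlation alone does not make an extrapolated estimate reliable.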

Core Example: Choosing a Line

Suppose $y$ is a machine reading and $x$ is the true concentration of a chemical. The machine reading depends on the true concentration.

If the question gives a machine reading and asks for the concentration, it may seem natural to use the regression line of $x$ on $y$.

But if the modelling relationship is

$$y = a + bx$$

then the regression line of $y$ on $x$ may be the correct calibration model. After fitting it, solve the equation for $x$.

The reason is not algebraic convenience. It is that the reading error is in $y$.
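The calibration procedure can be sketched as follows, assuming $x$ is the known concentration and $y$ is the machine reading. The calibration data and function names are made up for illustration.

```python
def fit_y_on_x(xs, ys):
    """Least-squares line y = a + b*x, minimising squared errors in y."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx
    return y_bar - b * x_bar, b

# Made-up calibration data: known concentrations x, machine readings y.
conc = [0.0, 1.0, 2.0, 3.0, 4.0]
reading = [0.1, 2.1, 3.9, 6.2, 7.9]

# Fit reading on concentration, because the error is in the reading.
a, b = fit_y_on_x(conc, reading)

def concentration_from_reading(y_obs):
    """Solve the fitted line y = a + b*x for x at an observed reading."""
    return (y_obs - a) / b

print(round(concentration_from_reading(5.0), 3))
```

The line is fitted in the direction of dependence and only afterwards solved for $x$. Fitting concentration on reading instead would minimise errors in the concentration, which this setup treats as known.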

Core Example: Reliability

Suppose the observed $x$-values range from $x_{\min}$ to $x_{\max}$, and the regression line is used to estimate $y$ at a value of $x$ inside that range.

This is interpolation, so it may be reasonable if the scatter diagram is roughly linear and $|r|$ is close to $1$.

If the same line is used at a value of $x$ outside that range, the estimate is extrapolation. It should be treated as unreliable unless there is strong external justification that the same linear pattern continues.

Common Pitfalls

  • Choosing the regression line only by looking at which variable is unknown.
  • Forgetting that context may determine the dependence direction.
  • Treating the two regression lines as interchangeable.
  • Rearranging $x = c + dy$ and forgetting it is still regression of $x$ on $y$.
  • Forgetting both regression lines pass through $(\bar{x}, \bar{y})$.
  • Forgetting that the rearranged $x$ on $y$ line has a larger absolute gradient than the $y$ on $x$ line on usual axes.
  • Interpreting an intercept even when $x = 0$ is outside the meaningful context.
  • Calling an extrapolated estimate reliable just because $|r|$ is close to $1$.
  • Ignoring outliers or curvature when using a regression line.

Revision Checklist

  • Can you explain why regression of $y$ on $x$ minimises vertical errors?
  • Can you explain why regression of $x$ on $y$ minimises horizontal errors?
  • Can you choose the correct regression line from the context?
  • Can you explain why both regression lines pass through $(\bar{x}, \bar{y})$?
  • Can you interpret the gradient and intercept of a regression line in context?
  • Can you distinguish interpolation from extrapolation?
  • Can you write a reliability sentence that mentions both range and linear strength?