Scatter, Correlation, and PMCC
Scope Label
Core 9758. Scatter diagrams and the product moment correlation coefficient are core tools for describing bivariate data.
Role in the Topic
This branch explains the first half of correlation and regression: before fitting a regression line, you must decide what the paired data show.
Bivariate Data
Bivariate data consists of paired observations: $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$.
Each pair belongs to one case. If the pairings are broken, the relationship is broken.
For example, if students’ Mathematics and Physics scores are recorded, the pair $(x_i, y_i)$ means one student scored $x_i$ in Mathematics and $y_i$ in Physics. The relationship is not captured by two separate unpaired lists.
Scatter Diagrams
A scatter diagram plots each pair as a point. It should be inspected before computing or interpreting $r$.
Ask five questions:
- Direction: Does $y$ tend to increase or decrease as $x$ increases?
- Strength: Are the points close to a line, or widely scattered?
- Shape: Is the pattern roughly linear, curved, or irregular?
- Outliers: Are there unusual points that may distort the model?
- Clusters: Are there subgroups that should not be treated as one simple relationship?
Caption: Scatter diagrams can show positive or negative linear trends, no clear linear trend, or a non-linear pattern that regression should not ignore.
Independent and Dependent Variables
In some contexts, one variable is naturally the input and the other is the response.
For example, suppose $x$ is advertising time and $y$ is sales.
Here advertising time may be treated as independent and sales as dependent.
But not every association has a clear dependence direction. Mathematics score and English score may be associated without either score directly causing the other.
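This distinction matters later because regression lines are directional: the regression line of $y$ on $x$ is generally not the same line as the regression line of $x$ on $y$.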
Product Moment Correlation Coefficient
The product moment correlation coefficient measures the direction and strength of a linear relationship.
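When full data are given, $r$ may be calculated from the summary form
$$r = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sqrt{\left(\sum x^2 - \dfrac{(\sum x)^2}{n}\right)\left(\sum y^2 - \dfrac{(\sum y)^2}{n}\right)}}$$
In practice, the graphing calculator often computes $r$ directly, but the formula shows the structure: it compares how $x$ and $y$ vary together with how much each variable varies on its own.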
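The structure of the calculation can be sketched in a few lines of Python. This is only an illustration: the function name `pmcc` and the small dataset are hypothetical, and in an exam the graphing calculator would normally be used instead.

```python
import math

def pmcc(xs, ys):
    """Product moment correlation coefficient for paired lists xs and ys."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two paired observations")
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    sum_yy = sum(y * y for y in ys)
    s_xy = sum_xy - sum_x * sum_y / n   # how x and y vary together
    s_xx = sum_xx - sum_x ** 2 / n      # how much x varies on its own
    s_yy = sum_yy - sum_y ** 2 / n      # how much y varies on its own
    if s_xx == 0 or s_yy == 0:
        raise ValueError("r is undefined when one variable does not vary")
    return s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical paired data, made up for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
print(round(pmcc(x, y), 4))  # 0.7746
```

The pairing is preserved by `zip`, which matches each `x` value with the `y` value from the same case.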
The sign gives direction:
- $r > 0$: positive linear association
- $r < 0$: negative linear association
The magnitude gives strength:
- $|r|$ close to $1$: strong linear association
- $|r|$ close to $0$: weak or no linear association
Caption: The sign of $r$ gives direction, while $|r|$ reflects the strength of a linear relationship.
Calculating $r$ from Summarised Data
Sometimes a question gives summary values instead of the full dataset.
For example, suppose and
Then
So
This indicates a strong positive linear relationship, provided the scatter diagram does not reveal a serious structural problem such as curvature or an influential outlier.
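A calculation of this kind can also be sketched in Python. The totals below are hypothetical, chosen only to illustrate the arithmetic, and the function name `pmcc_from_sums` is an assumption rather than a standard library routine.

```python
import math

def pmcc_from_sums(n, sum_x, sum_y, sum_xx, sum_yy, sum_xy):
    """Compute r from summary totals instead of the full dataset."""
    s_xy = sum_xy - sum_x * sum_y / n
    s_xx = sum_xx - sum_x ** 2 / n
    s_yy = sum_yy - sum_y ** 2 / n
    return s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical summary values (not taken from any real question).
r = pmcc_from_sums(n=10, sum_x=50, sum_y=80,
                   sum_xx=290, sum_yy=700, sum_xy=445)
# Here S_xy = 45, S_xx = 40, S_yy = 60, so r = 45 / sqrt(2400).
print(round(r, 3))  # 0.919
```

With these made-up totals the value is close to $1$, which would be read as a strong positive linear relationship, subject to the same check of the scatter diagram.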
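Further Properties of $r$
The coefficient $r$ is dimensionless. It has no units, even when $x$ and $y$ have units.
The value of $r$ is unchanged by linear changes of scale or origin with a positive scale factor, such as converting temperatures from Celsius to Fahrenheit. Shifting the origin does not alter how far each value lies from its mean, and rescaling stretches the joint variation and the individual variation by the same overall factor, so the ratio is preserved. (A negative scale factor would reverse the sign of $r$ but leave $|r|$ unchanged.)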
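The invariance can be checked numerically. The sketch below uses hypothetical temperature and sales figures, and a helper `pmcc` written for this illustration, to compare $r$ before and after a Celsius to Fahrenheit conversion.

```python
import math

def pmcc(xs, ys):
    """Product moment correlation coefficient using deviations from the means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical data: daily temperature (Celsius) and units sold.
celsius = [18.0, 21.5, 25.0, 28.5, 31.0]
sales = [120, 135, 160, 158, 190]

# Linear change of scale and origin: F = 1.8 C + 32.
fahrenheit = [1.8 * c + 32 for c in celsius]

r_c = pmcc(celsius, sales)
r_f = pmcc(fahrenheit, sales)
print(math.isclose(r_c, r_f))  # True, up to floating-point rounding
```

The two values agree because the factor 1.8 appears once in the top of the ratio and once under the square root, while the shift of 32 cancels when deviations from the mean are taken.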
However, $r$ can change if:
- new data pairs are added
- an outlier is removed
- a different subset of the data is used
So $r$ belongs to the particular dataset being analysed. It is not a permanent property of the real-world variables.
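A small numerical sketch shows how sensitive $r$ can be to the dataset. The data and the added point are hypothetical, and `pmcc` is a helper defined only for this illustration.

```python
import math

def pmcc(xs, ys):
    """Product moment correlation coefficient from the summary form."""
    n = len(xs)
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    s_xx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    s_yy = sum(y * y for y in ys) - sum(ys) ** 2 / n
    return s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical dataset with a moderately strong positive trend.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r_before = pmcc(x, y)

# Add one unusual pair (10, 0) far from the rest of the data.
r_after = pmcc(x + [10], y + [0])

print(round(r_before, 3), round(r_after, 3))  # roughly 0.775 and -0.553
```

In this made-up case a single added pair is enough to reverse the sign of $r$, which is why the value should always be quoted for the dataset actually used.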
What $r$ Does Not Say
The coefficient $r$ is not a measure of every possible kind of relationship.
If $r \approx 0$, the correct interpretation is:
There is little evidence of a linear relationship.
It is not:
There is no relationship.
A curved pattern may have $r$ close to $0$ even when the variables are strongly related.
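One way to see this is with an exact quadratic relationship. In the sketch below, the data are constructed (not observed) so that $y = x^2$ on values of $x$ symmetric about zero, and `pmcc` is a helper written for the illustration.

```python
import math

def pmcc(xs, ys):
    """Product moment correlation coefficient from the summary form."""
    n = len(xs)
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    s_xx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    s_yy = sum(y * y for y in ys) - sum(ys) ** 2 / n
    return s_xy / math.sqrt(s_xx * s_yy)

# Constructed data: y is completely determined by x, but not linearly.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [value ** 2 for value in x]

r = pmcc(x, y)
print(abs(r) < 1e-12)  # True: no linear association despite a perfect curve
```

Here $y$ can be predicted exactly from $x$, yet $r$ is zero because the positive and negative halves of the curve cancel in the joint variation.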
Correlation Is Not Causation
A strong correlation does not prove that one variable causes the other.
Possible explanations include:
- one variable may influence the other
- a third hidden variable may influence both
- both variables may change over time for unrelated reasons
- the association may be coincidental
So an exam interpretation should not say “causes” unless the context provides causal evidence.
Why the Scatter Diagram Still Matters
Different datasets can have similar values of $r$ but very different structures.
Caption: Datasets with similar correlation coefficients can have very different scatter-plot structures.
This is why the order should be:
- inspect the scatter diagram
- decide whether a linear model is sensible
- interpret $r$ in light of the diagram
not the other way around.
Core Example
Suppose a scatter diagram is roughly linear and the calculator gives a value of $r$ close to $-1$.
A good interpretation is:
There is a strong negative linear relationship between the two variables for the observed data.
A poor interpretation is:
One variable causes the other to decrease.
The first sentence describes association. The second claims causation without evidence.
Common Pitfalls
- Saying “no relationship” when $r$ is close to $0$.
- Ignoring a curved scatter diagram because the calculator still gives a value of $r$.
- Forgetting that outliers can heavily affect $r$.
- Treating correlation as causation.
- Describing strength without mentioning linearity.
- Using $r$ before checking the scatter diagram.
- Forgetting that $r$ has no units.
- Treating $r$ as fixed even after changing the dataset.
Revision Checklist
- Can you explain why bivariate data must preserve pairings?
- Can you read direction, strength, shape, outliers, and clusters from a scatter diagram?
- Can you interpret the sign and magnitude of $r$?
- Can you explain why $r$ measures linear association only?
- Can you calculate $r$ from summarised data if required?
- Can you explain why $r$ is dimensionless?
- Can you distinguish association from causation in words?