Which of the following scatterplots shows a correlation affected by an influential point?

Sometimes in regression analysis, a few data points have disproportionate effects on the slope of the regression equation. In this lesson, we describe how to identify those influential points.


Outliers

Data points that diverge in a big way from the overall pattern are called outliers. There are four ways that a data point might be considered an outlier.

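  • It could have an extreme X value compared to other data points.
  • It could have an extreme Y value compared to other data points.
  • It could have extreme X and Y values.
  • It might be distant from the rest of the data, even without extreme X or Y values.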

Each type of outlier is depicted graphically in the scatterplots below.

[Scatterplots, one per panel: Extreme X value; Extreme Y value; Extreme X and Y; Distant data point]

Influential Points

An influential point is an outlier that greatly affects the slope of the regression line. One way to test the influence of an outlier is to compute the regression equation with and without the outlier.

This type of analysis is illustrated below. The scatterplots are identical, except that one plot includes an outlier. When the outlier is present, the slope is flatter (-3.32 with the outlier vs. -4.10 without it); so this outlier would be considered an influential point.

Without Outlier

Regression equation: ŷ = 104.78 - 4.10x
Coefficient of determination: R² = 0.94

With Outlier

Regression equation: ŷ = 97.51 - 3.32x
Coefficient of determination: R² = 0.55
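The with-and-without comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data values and the helper name fit_line are made up for the example and are not the data behind the charts in this lesson.

```python
import numpy as np

def fit_line(x, y):
    """Least-squares line; returns (intercept, slope, r_squared)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)   # degree 1: [slope, intercept]
    r = np.corrcoef(x, y)[0, 1]
    return intercept, slope, r ** 2

# Hypothetical data: a tight negative trend plus one stray point.
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([100, 96, 93, 88, 85, 80, 77, 72], dtype=float)
outlier_x, outlier_y = 14.0, 95.0

without = fit_line(x, y)
with_outlier = fit_line(np.append(x, outlier_x), np.append(y, outlier_y))

print("without outlier: intercept=%.2f slope=%.2f R^2=%.2f" % without)
print("with outlier:    intercept=%.2f slope=%.2f R^2=%.2f" % with_outlier)
```

If the two fitted slopes differ greatly, as they do for this made-up data, the stray point would be treated as influential.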

The charts below compare regression statistics for another data set with and without an outlier. Here, one chart has a single outlier, located at the high end of the X axis (where x = 24). As a result of that single outlier, the slope of the regression line changes greatly, from -2.5 to -1.6; so the outlier would be considered an influential point.

Without Outlier

Regression equation: ŷ = 92.54 - 2.5x
Slope: b1 = -2.5
Coefficient of determination: R² = 0.46

With Outlier

Regression equation: ŷ = 87.59 - 1.6x
Slope: b1 = -1.6
Coefficient of determination: R² = 0.52

Sometimes, an influential point will cause the coefficient of determination to be bigger; sometimes, smaller. In the first example above, the coefficient of determination is smaller when the influential point is present (0.55 with the point vs. 0.94 without it). In the second example, it is bigger (0.52 with the point vs. 0.46 without it).
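A small numerical sketch can show both directions. The two data sets below are hypothetical and chosen only for illustration: in the first, a far-off point that breaks a tight trend lowers R²; in the second, a far-off point that happens to line up with a weak trend raises it.

```python
import numpy as np

def r_squared(x, y):
    """Coefficient of determination for a simple linear fit (squared correlation)."""
    return np.corrcoef(x, y)[0, 1] ** 2

def with_point(x, y, px, py):
    """Return copies of x and y with one extra point appended."""
    return np.append(x, px), np.append(y, py)

x = np.arange(1.0, 7.0)  # x = 1, 2, ..., 6

# Case 1: points exactly on a line, plus a stray point far from the line.
y_tight = np.array([12, 10, 8, 6, 4, 2], dtype=float)
r2_tight = r_squared(x, y_tight)
r2_tight_plus = r_squared(*with_point(x, y_tight, 12.0, 12.0))

# Case 2: a noisy cloud, plus a distant point in line with the weak trend.
y_noisy = np.array([10, 4, 9, 3, 8, 2], dtype=float)
r2_noisy = r_squared(x, y_noisy)
r2_noisy_plus = r_squared(*with_point(x, y_noisy, 20.0, -10.0))

print(f"tight trend: R^2 {r2_tight:.2f} -> {r2_tight_plus:.2f}")
print(f"noisy cloud: R^2 {r2_noisy:.2f} -> {r2_noisy_plus:.2f}")
```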

If your data set includes an influential point, here are some things to consider.

  • An influential point may represent bad data, possibly the result of measurement error. If possible, check the validity of the data point.
  • Compare the decisions that would be made based on regression equations defined with and without the influential point. If the equations lead to contrary decisions, use caution.

Test Your Understanding

In the context of regression analysis, which of the following statements are true?

I. When the data set includes an influential point, the data set is nonlinear.
II. Influential points always reduce the coefficient of determination.
III. All outliers are influential data points.

(A) I only
(B) II only
(C) III only
(D) All of the above
(E) None of the above

Solution

The correct answer is (E). Data sets with influential points can be linear or nonlinear. Influential points do not always reduce the coefficient of determination. In this lesson, we went over an example in which an influential point increased the coefficient of determination. With respect to regression, outliers are influential only if they have a big effect on the regression equation. Sometimes, outliers do not have big effects. For example, when the data set is very large, a single outlier may not have a big effect on the regression equation.
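The last point of the solution, that one outlier matters less in a large data set, can be checked with a rough sketch. The line y = 1 + 2x, the outlier location, and the function name slope_shift are all assumptions made for this illustration.

```python
import numpy as np

def slope_shift(n, outlier=(9.0, 100.0)):
    """Absolute change in the fitted slope caused by adding one outlier to n points."""
    x = np.linspace(0.0, 9.0, n)
    y = 1.0 + 2.0 * x                      # points lie exactly on an illustrative line
    base = np.polyfit(x, y, 1)[0]          # slope without the outlier
    shifted = np.polyfit(np.append(x, outlier[0]),
                         np.append(y, outlier[1]), 1)[0]
    return abs(shifted - base)

for n in (10, 100, 10_000):
    print(n, round(slope_shift(n), 4))
```

The same outlier moves the slope less and less as n grows, so whether an outlier is influential depends on the rest of the data as well as on the point itself.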

a.
1.
Residual:
The residual for an observation is the difference between the actual value of the response variable and its predicted value. That is, e = y − ŷ, where y is the actual value of the response variable and ŷ is the value predicted for the same value of the predictor variable.
Leverage:
An observation whose predictor value (x value) is far from the mean of the predictor values is called a high-leverage point. A high-leverage point can pull the regression line toward itself and so can have a large effect on the line. Because of this pull, a high-leverage observation often has a small residual, though not always.
The point pulls the regression line toward itself and has a large effect on the line. In addition, the difference between the observed and predicted values of the response variable at this point is large.
Thus, the point has high leverage and a large residual.
2.
Influential point:
A point that does not fit the pattern of the rest of the data, and whose omission results in a very different regression model, is called an influential point.
The point is far from the mean of the explanatory variable. Moreover, omitting the point results in a very different regression model, because the point reinforces the association. With the point included, the scatterplot shows an overall positive direction that the remaining points do not show.
Thus, the point is an influential point.
3.
Association:
Two variables are associated (related) if the value of one variable gives information about the value of the other.
Correlation measures the strength of the linear relationship between two variables.
The point supports the positive association, so removing it would weaken the association.
Thus, removing the point would result in a weaker correlation.
4.
In a linear regression model ŷ = b0 + b1x, where ŷ is the predicted value of the response variable and x is the predictor variable, b1 is the slope and b0 is the intercept of the line.
The slope gives the rate of change of y with respect to x, and the slope estimate is
b1 = r (sy / sx), where r is the correlation between x and y, sy is the standard deviation of y, and sx is the standard deviation of x.
The slope of the regression line would increase from a negative value to a value near 0.
Thus, if the point were removed, the slope of the regression line would be nearly flat.
b.
1.
The point pulls the regression line toward itself and has a large effect on the line.
Thus, the point has high leverage and a small residual.
2.
The point is far from the rest of the points, and the positive direction of the scatterplot is due to this point alone; the remaining points show no clear direction. Moreover, omitting the point results in a very different regression model.
Thus, the point is an influential point.
3.
Removing this point would weaken the association: without it, there would be little evidence of a linear association.
Thus, removing the point would result in a weaker correlation.
4.
Because the point is influential, removing it would result in a very different regression line.
The slope of the regression line would increase from a negative value to a value near 0.
Thus, if the point were removed, the slope of the regression line would be nearly flat.
c.
1.
The point does not pull the regression line toward itself and does not have a large effect on the line. The difference between the observed and predicted values of the response variable at that point is quite large.
Thus, the point has little leverage and a large residual.
2.
The point is close to the mean of the explanatory variable. Moreover, omitting the point does not result in a very different regression model.
Thus, the point is not an influential point.
3.
Removing this point would strengthen the association, because the point departs from the overall pattern.
Thus, removing the point would result in a slightly stronger correlation, decreasing to become more negative.
4.
Because the point is not influential, removing it would not result in a very different regression line.
The slope of the regression line would be essentially unaffected.
Thus, if the point were removed, the slope of the regression line would remain about the same.
d.
1.
The point pulls the regression line toward itself and has a large effect on the line.
Thus, the point has high leverage and a small residual.
2.
The point is far from the mean of the explanatory variable, which gives it high leverage. However, omitting the point does not result in a very different regression model, because the point reinforces the association.
Thus, the point is not an influential point.
3.
The point supports the negative association, so removing it would weaken the association.
Thus, removing the point would result in a weaker correlation.
4.
Because the point is not influential, removing it would not result in a very different regression line.
Thus, if the point were removed, the slope of the regression line would remain about the same.
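The quantities used throughout this solution (residual, leverage, and the slope identity b1 = r (sy / sx)) can be computed directly. The sketch below uses hypothetical data with one point far out in x; the leverage expression is the standard one for simple linear regression, h = 1/n + (x − x̄)² / Σ(x − x̄)², which the solution itself does not state.

```python
import numpy as np

# Hypothetical data: a cluster plus one point far out in x (the last one).
x = np.array([1, 2, 3, 4, 5, 6, 15], dtype=float)
y = np.array([3, 5, 4, 6, 7, 6, 16], dtype=float)

b1, b0 = np.polyfit(x, y, 1)             # slope, intercept
residuals = y - (b0 + b1 * x)            # e = y - y_hat

# Leverage in simple linear regression (assumed standard formula):
# h_i = 1/n + (x_i - mean(x))^2 / sum((x_j - mean(x))^2)
dev = x - x.mean()
leverage = 1.0 / len(x) + dev ** 2 / np.sum(dev ** 2)

# Slope identity from the solution: b1 = r * (sy / sx)
r = np.corrcoef(x, y)[0, 1]
b1_from_r = r * y.std(ddof=1) / x.std(ddof=1)

for xi, h, e in zip(x, leverage, residuals):
    print(f"x={xi:5.1f}  leverage={h:.3f}  residual={e:+.3f}")
print(f"slope from fit={b1:.4f}, slope from r*sy/sx={b1_from_r:.4f}")
```

In this made-up data set the last point has by far the largest leverage because its x value is far from the mean of x; whether its residual is large or small depends on how well it lines up with the other points.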
