The Three Assumptions of Linear Regression

Linear regression is an extremely powerful tool for making predictions. But how do we know that we can trust the results we get from linear regression?

We can only trust the results if three conditions are met; without them, we will get biased and unreliable predictions. In this blog, I will discuss the three assumptions and how you can check for them so that you can trust the results of a linear regression. For simplicity's sake, I will assume a simple linear regression (one independent variable).

Linearity

The first assumption may be the most obvious. Linearity means that there must be a linear relationship between the independent (feature) variable and the dependent (target) variable. Without this linear relationship, accurate predictions cannot be made.

We can test for linearity by creating a scatterplot of the independent and dependent variables. In the scatterplot we are looking for two things: first, a roughly linear shape (the direction of the trend does not matter); second, given a linear shape, that there are no outliers.

Linear regression requires that the independent and dependent variables have a linear relationship.

You can use the following code to graph the scatterplot.

import matplotlib.pyplot as plt

# Scatterplot of the independent variable (x) against the dependent variable (y)
plt.scatter(df.x, df.y)
plt.show()

Should you find that you do not have a linear relationship, you can perform a transformation (such as a log transformation) to achieve a linear shape. There are also other regression tools, such as polynomial regression, that may yield better results.
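As a minimal sketch, assuming all of the y values are positive, a log transformation of the target could look like this:

import numpy as np
import matplotlib.pyplot as plt

# Log-transform the target to straighten a curved relationship
# (np.log assumes every value in df.y is positive)
df['log_y'] = np.log(df.y)
plt.scatter(df.x, df.log_y)
plt.show()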

If you have an outlier, you want to think about how it may have come to be and whether you should exclude it.

Normality

The normality assumption sounds straightforward, but it does NOT mean that the independent variable has to be normally distributed. In fact, a linear regression can be successful with non-normally distributed variables. Instead, the normality assumption means that the residuals produced by the linear regression model should be normally distributed.

We can only collect the residuals after we have created the model. To collect the residuals we can use the following code:

residuals = model.resid
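Here, model refers to a fitted regression model. As a minimal sketch, assuming the statsmodels library and the same df as above, the model could be created like this:

import statsmodels.api as sm

X = sm.add_constant(df.x)      # add an intercept column to the predictor
model = sm.OLS(df.y, X).fit()  # fit an ordinary least squares model
residuals = model.resid        # residual = observed y minus fitted y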

One of the first things we can do to visually check the normality of the residuals is to use a Q-Q (quantile-quantile) plot. A Q-Q plot is a graphical method that compares two probability distributions by plotting their quantiles against each other. It lets us assess whether a set of data plausibly comes from a known distribution (such as the normal distribution). Keep in mind that a Q-Q plot can only show that something is NOT normally distributed; it cannot prove that it is.

We can graph a Q-Q plot using our residuals and the following:

import scipy.stats as stats
import statsmodels.api as sm

# Q-Q plot of the residuals against a fitted normal distribution,
# with a 45-degree reference line
fig = sm.graphics.qqplot(residuals, dist=stats.norm, line='45', fit=True)
fig.show()

We are looking for our dots (blue) to roughly align with the red line.

When we look at a Q-Q plot, we are looking for the dots to line up along the 45-degree line. If there is too much deviation from the red line, we can assume that the residuals are not normally distributed. Some curvature at the edges of the plot is expected. If the fit looks too perfect, we should be cautious about the data.

This Q-Q plot is not what we are looking for when checking the normality assumption.

It is also important to check a histogram of the residuals to see whether it looks normal.
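As a minimal sketch, reusing the residuals collected above:

import matplotlib.pyplot as plt

# Histogram of the residuals: we want a roughly symmetric, bell-shaped plot
plt.hist(residuals, bins=30)
plt.xlabel('Residual')
plt.ylabel('Frequency')
plt.show()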

Homoscedasticity

When we talk about homoscedasticity we also want to talk about heteroscedasticity.

Heteroscedasticity refers to the circumstance in which the variability of the dependent variable (and therefore of the residuals) is unequal across the range of values of the predictor(s).

Homoscedasticity, then, refers to the circumstance in which that variability is equal across the range of values of the predictor(s).

What this means is that if you were to plot the residuals against the independent variable, the plot should be roughly symmetric around the x-axis and the residuals should be consistently spread across the predictor values. Symmetry around the x-axis means that the residuals are of similar size in both the positive and negative directions. Consistent spread means that no particular range of predictor values influences the model more strongly than the rest because its residuals are systematically larger.

We want homoscedasticity. To check for it, we create a scatterplot of the residuals against the independent variable, as sketched below. In this plot we are looking for a rectangular shape, which is indicative of homoscedasticity; a cone shape is indicative of heteroscedasticity.

Homoscedasticity is the third assumption for linear regression.
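As a minimal sketch, reusing df.x and the residuals from above:

import matplotlib.pyplot as plt

# Residuals against the predictor: a roughly rectangular band suggests
# homoscedasticity, while a cone shape suggests heteroscedasticity
plt.scatter(df.x, residuals)
plt.axhline(0, color='red')  # reference line at zero
plt.xlabel('x')
plt.ylabel('Residual')
plt.show()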

Conclusion

When we create a model, we not only want to make a prediction, but we also want to make sure that the prediction is reliable. With linear regression, we have three assumptions that need to be met for us to be confident in our results: linearity, normality, and homoscedasticity.
