Regression
Regression is defined as “the measure of the average relationship between two or more variables in terms of the original units of the data”.
Correlation analysis studies the strength of the relationship between two variables ‘x’ and ‘y’. Regression analysis goes further: it attempts to predict the average value of one variable for a given value of the other, that is, to quantify the dependence of one variable on the other. For example, if there are two variables ‘x’ and ‘y’ and ‘y’ depends on ‘x’, then the dependence is expressed in the form of an equation.
Regression Analysis
Regression analysis is used to estimate the values of the dependent variable from the values of the independent variables. An independent variable is also known as a regressor, predictor, or explanatory variable, while the dependent variable is also known as the regressed or explained variable. Regression analysis also gives a measure of the error involved in using the regression line as a basis for estimation. The regression coefficient of y on x is the coefficient of the variable ‘x’ in the line of regression of y on x. The regression coefficients are used to calculate the correlation coefficient: the square of the correlation coefficient equals the product of the two regression coefficients.
Regression Lines (Linear Regression)
When the variables of a bivariate distribution are plotted in a scatter diagram and the points so formed cluster around a straight line, the regression between the variables is said to be linear.
The line of regression is the line which gives the best estimate of the value of one variable for any specified value of the other variable. It is the line of best fit and is obtained by the principle of least squares.
For a set of paired observations there exist two straight lines. The line drawn in such a way that the sum of the vertical deviations is zero and the sum of their squares is minimum is called the regression line of ‘y’ on ‘x’; it is used to estimate ‘y’ for given values of ‘x’. The line drawn in such a way that the sum of the horizontal deviations is zero and the sum of their squares is minimum is called the regression line of ‘x’ on ‘y’; it is used to estimate ‘x’ for given values of ‘y’. The smaller the angle between these lines, the higher the correlation between the variables. The two regression lines always intersect at the point of means (x̄, ȳ).
The regression lines have equations,
The regression equation of ‘y’ on ‘x’ is given by
\[y-\bar{y}={{b}_{yx}}\left( x-\bar{x} \right)\]
The regression equation of ‘x’ on ‘y’ is given by
\[x-\bar{x}={{b}_{xy}}\left( y-\bar{y} \right)\]
Where,
\[{{b}_{xy}}=\frac{N\sum{xy-\left( \sum{x} \right)\left( \sum{y} \right)}}{N\sum{{{y}^{2}}-{{\left( \sum{y} \right)}^{2}}}}=\frac{\sum{\left( X-\bar{X} \right)\left( Y-\bar{Y} \right)}}{\sum{{{\left( Y-\bar{Y} \right)}^{2}}}}=r\frac{{{\sigma }_{x}}}{{{\sigma }_{y}}}=\frac{Cov\left( xy \right)}{\sigma _{y}^{2}}\]
\[{{b}_{yx}}=\frac{N\sum{xy-\left( \sum{x} \right)\left( \sum{y} \right)}}{N\sum{{{x}^{2}}-{{\left( \sum{x} \right)}^{2}}}}=\frac{\sum{\left( X-\bar{X} \right)\left( Y-\bar{Y} \right)}}{\sum{{{\left( X-\bar{X} \right)}^{2}}}}=r\frac{{{\sigma }_{y}}}{{{\sigma }_{x}}}=\frac{Cov\left( xy \right)}{\sigma _{x}^{2}}\]
Here b_{yx} and b_{xy} are called the regression coefficients and ‘r’ is the correlation coefficient.
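These formulas can be checked numerically. A minimal sketch in Python (NumPy; the paired data are illustrative, and population variances are used, matching the formulas above):

```python
# Sketch: computing the regression coefficients b_yx, b_xy and the
# correlation coefficient r directly from their covariance formulas.
import numpy as np

# Illustrative paired observations
x = np.array([18, 19, 20, 21, 22, 23, 24, 25, 26, 27], dtype=float)
y = np.array([17, 17, 18, 18, 19, 19, 19, 20, 21, 22], dtype=float)

cov_xy = np.mean((x - x.mean()) * (y - y.mean()))  # Cov(x, y)
b_yx = cov_xy / np.var(x)                 # slope of the line of y on x
b_xy = cov_xy / np.var(y)                 # slope of the line of x on y
r = cov_xy / (np.std(x) * np.std(y))      # correlation coefficient

print(round(b_yx, 3), round(b_xy, 3), round(r, 3))  # 0.521 1.792 0.966
```

Note that `np.var` and `np.std` default to the population (N-denominator) forms, which is what the formulas above use; the factor 1/N cancels in each ratio anyway.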
Regression Coefficient
When a regression is linear, then the regression coefficient is given by the slope of the regression line.
The geometric mean of regression coefficients gives the correlation coefficient.
\[{{b}_{yx}}\cdot {{b}_{xy}}={{r}^{2}}\]
\[\Rightarrow r=\pm \sqrt{{{b}_{yx}}\cdot {{b}_{xy}}}\]
where the sign of ‘r’ is taken to be the common sign of the two regression coefficients.
\[\text{Note that}~~r=\frac{{{\mu }_{11}}}{{{\sigma }_{X}}{{\sigma }_{Y}}},~~{{b}_{yx}}=\frac{{{\mu }_{11}}}{\sigma _{X}^{2}},~~{{b}_{xy}}=\frac{{{\mu }_{11}}}{\sigma _{Y}^{2}}\]
Properties:
1. The product of the regression coefficients cannot exceed 1, that is, b_{yx} · b_{xy} ≤ 1 (since b_{yx} · b_{xy} = r² and |r| ≤ 1).
2. Regression coefficients are independent of a change of origin but not of a change of scale.
3. It is an absolute measure.
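Property 2 can be verified numerically; a minimal sketch assuming illustrative data (the helper `b_yx` is hypothetical, implementing the covariance formula above):

```python
# Sketch: shifting the origin of x and y leaves b_yx unchanged,
# while rescaling x changes it.
import numpy as np

def b_yx(x, y):
    """Regression coefficient of y on x: Cov(x, y) / Var(x)."""
    return np.mean((x - x.mean()) * (y - y.mean())) / np.var(x)

x = np.array([18., 19, 20, 21, 22, 23, 24, 25, 26, 27])
y = np.array([17., 17, 18, 18, 19, 19, 19, 20, 21, 22])

b0 = b_yx(x, y)
b_shift = b_yx(x - 22, y - 19)  # change of origin: coefficient unchanged
b_scale = b_yx(2 * x, y)        # doubling the scale of x halves b_yx

print(np.isclose(b0, b_shift))      # True
print(np.isclose(b_scale, b0 / 2))  # True
```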
Angle between two lines of Regression
If θ is the acute angle between two lines of regression, then
\[\theta ={{\tan }^{-1}}\left\{ \frac{1-{{r}^{2}}}{|r|}\left( \frac{{{\sigma }_{X}}{{\sigma }_{Y}}}{\sigma _{X}^{2}+\sigma _{Y}^{2}} \right) \right\}\]
Case 1: If r = 0, then θ = π/2; that is, if the two variables are uncorrelated, the lines of regression are perpendicular to each other.
Case 2: If r = ±1, then tan θ = 0, so θ = 0 or π. In this case the two lines of regression either coincide or are parallel to each other. But since both lines pass through the point (x̄, ȳ), they cannot be parallel. Hence, in the case of perfect correlation, positive or negative, the two lines of regression coincide.
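The angle formula and both special cases can be sketched as follows (the helper `regression_angle` and the sample values are illustrative):

```python
# Sketch: acute angle between the two regression lines for given
# r, sigma_x, sigma_y, per the formula above.
import math

def regression_angle(r, sx, sy):
    """Return the acute angle (radians) between the lines of regression."""
    if r == 0:
        return math.pi / 2  # uncorrelated: lines are perpendicular
    tan_theta = (1 - r**2) / abs(r) * (sx * sy) / (sx**2 + sy**2)
    return math.atan(tan_theta)

print(regression_angle(0.0, 10, 15))  # Case 1: pi/2
print(regression_angle(1.0, 10, 15))  # Case 2: 0.0, lines coincide
print(math.degrees(regression_angle(0.5, 10, 15)))  # an intermediate angle
```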
The differences between correlation and regression coefficient
| Correlation Coefficient | Regression Coefficient |
| --- | --- |
| The correlation coefficient is symmetric: r_{xy} = r_{yx}. | The regression coefficients are not symmetric: in general b_{yx} ≠ b_{xy}. |
| ‘r’ lies between −1 and 1. | ‘b_{yx}’ can be greater than 1, in which case ‘b_{xy}’ must be less than 1 so that b_{yx} · b_{xy} ≤ 1. |
| It has no units attached to it. | It has units attached to it. |
| It is not based on a cause-and-effect relationship. | It is based on a cause-and-effect relationship. |
| It indirectly helps in estimation. | It is meant for estimation. |
Example 01
Find the regression equations for the data in the table below. Then calculate the correlation coefficient.
| Age of Husband (x) | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Age of Wife (y) | 17 | 17 | 18 | 18 | 19 | 19 | 19 | 20 | 21 | 22 |
Solution:
Data required for calculation of correlation and regression coefficients
| Age of Husband (x) | dx = x − 22 | dx² | Age of Wife (y) | dy = y − 19 | dy² | dx·dy |
| --- | --- | --- | --- | --- | --- | --- |
| 18 | −4 | 16 | 17 | −2 | 4 | 8 |
| 19 | −3 | 9 | 17 | −2 | 4 | 6 |
| 20 | −2 | 4 | 18 | −1 | 1 | 2 |
| 21 | −1 | 1 | 18 | −1 | 1 | 1 |
| 22 | 0 | 0 | 19 | 0 | 0 | 0 |
| 23 | 1 | 1 | 19 | 0 | 0 | 0 |
| 24 | 2 | 4 | 19 | 0 | 0 | 0 |
| 25 | 3 | 9 | 20 | 1 | 1 | 3 |
| 26 | 4 | 16 | 21 | 2 | 4 | 8 |
| 27 | 5 | 25 | 22 | 3 | 9 | 15 |
| Total = 225 | 5 | 85 | Total = 190 | 0 | 24 | 43 |
\[\bar{X}=\frac{225}{10}=22.5~~~~\bar{Y}=\frac{190}{10}=19\]
Regression equation of Y on X is given by
\[Y-\bar{Y}={{b}_{yx}}\left( X-\bar{X} \right)\]
\[{{b}_{yx}}=\frac{N\sum{dxdy-\left( \sum{dx} \right)\left( \sum{dy} \right)}}{N\sum{d{{x}^{2}}-{{\left( \sum{dx} \right)}^{2}}}}\]
\[\Rightarrow {{b}_{yx}}=\frac{10\times 43-(5)(0)}{10\times 85-{{(5)}^{2}}}=\frac{430}{825}=0.521\]
\[\therefore Y-19=0.521\left( X-22.5 \right)\]
\[\Rightarrow Y=0.521X+7.2775\]
Regression equation of X on Y is given by
\[X-\bar{X}={{b}_{xy}}\left( Y-\bar{Y} \right)\]
\[{{b}_{xy}}=\frac{N\sum{dxdy-\left( \sum{dx} \right)\left( \sum{dy} \right)}}{N\sum{d{{y}^{2}}-{{\left( \sum{dy} \right)}^{2}}}}\]
\[\Rightarrow {{b}_{xy}}=\frac{10\times 43-(5)(0)}{10\times 24-{{(0)}^{2}}}=\frac{43}{24}=1.792\]
\[\therefore X-22.5=1.792\left( Y-19 \right)\]
\[\Rightarrow X=1.792Y-11.548\]
\[\therefore r=\sqrt{0.521\times 1.792}=0.966\]
Hence, the correlation coefficient ‘r’ is 0.966.
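As a cross-check, `numpy.polyfit` fits the same least-squares lines, so it should reproduce the hand-computed coefficients (a sketch; the intercepts differ slightly from the worked solution because the solution rounds the slopes before substituting):

```python
# Sketch: verifying Example 01 with numpy.polyfit, which performs the
# same least-squares straight-line fit (degree-1 polynomial).
import numpy as np

x = np.array([18., 19, 20, 21, 22, 23, 24, 25, 26, 27])  # husband
y = np.array([17., 17, 18, 18, 19, 19, 19, 20, 21, 22])  # wife

b_yx, a_yx = np.polyfit(x, y, 1)  # line of Y on X: slope, intercept
b_xy, a_xy = np.polyfit(y, x, 1)  # line of X on Y: slope, intercept
r = np.sqrt(b_yx * b_xy)          # geometric mean of the coefficients

print(round(b_yx, 3), round(a_yx, 4))  # 0.521 7.2727
print(round(b_xy, 3), round(a_xy, 3))  # 1.792 -11.542
print(round(r, 3))                     # 0.966
```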
Example 02
The table shows the results that were worked out from scores in Statistics and Mathematics in a certain examination.
|  | Scores in Statistics (X) | Scores in Mathematics (Y) |
| --- | --- | --- |
| Mean | 40 | 48 |
| Standard Deviation | 10 | 15 |
Karl Pearson’s correlation coefficient between ‘x’ and ‘y’ is +0.42. Find the regression lines of ‘x’ on ‘y’ and of ‘y’ on ‘x’. Use the regression lines to find the value of ‘y’ when x = 50 and of ‘x’ when y = 30.
Solution:
Given the following data:
\[\bar{X}=40,~~~~\bar{Y}=48,~~~~{{\sigma }_{x}}=10,~~~~{{\sigma }_{y}}=15,~~~~r=0.42\]
The regression line X on Y is:
\[\left( X-\bar{X} \right)=r\frac{{{\sigma }_{x}}}{{{\sigma }_{y}}}\left( Y-\bar{Y} \right)……….(1)\]
The regression line Y on X is:
\[\left( Y-\bar{Y} \right)=r\frac{{{\sigma }_{y}}}{{{\sigma }_{x}}}\left( X-\bar{X} \right)………..(2)\]
Substituting the values, with \(r\frac{{{\sigma }_{x}}}{{{\sigma }_{y}}}=0.42\times \frac{10}{15}=0.28\) and \(r\frac{{{\sigma }_{y}}}{{{\sigma }_{x}}}=0.42\times \frac{15}{10}=0.63\), the respective equations are:
\[X=0.28Y+26.56……….(3)\]
\[Y=0.63X+22.80………..(4)\]
Therefore,
When Y = 30, X = 0.28(30) + 26.56 = 34.96 by equation (3); and when X = 50, Y = 0.63(50) + 22.80 = 54.3 by equation (4).
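Example 02 needs only the summary statistics, not the raw scores; a minimal sketch:

```python
# Sketch: Example 02 worked from the given means, standard deviations,
# and correlation coefficient alone.
x_bar, y_bar = 40.0, 48.0   # mean scores in Statistics (X) and Mathematics (Y)
sx, sy = 10.0, 15.0         # standard deviations
r = 0.42                    # Karl Pearson's correlation coefficient

b_xy = r * sx / sy  # 0.28: slope of the line of X on Y
b_yx = r * sy / sx  # 0.63: slope of the line of Y on X

# Evaluate the line of X on Y at Y = 30, and the line of Y on X at X = 50
x_at_y30 = x_bar + b_xy * (30 - y_bar)
y_at_x50 = y_bar + b_yx * (50 - x_bar)

print(round(x_at_y30, 2), round(y_at_x50, 2))  # 34.96 54.3
```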