Statistical notes for clinical researchers: simple linear regression 1 – basic concepts
Restor Dent Endod : Restorative Dentistry & Endodontics

OPEN ACCESS

Restor Dent Endod, Volume 43(2); 2018

Open Lecture on Statistics
Hae-Young Kim

DOI: https://doi.org/10.5395/rde.2018.43.e21
Published online: April 12, 2018

Department of Health Policy and Management, College of Health Science, and Department of Public Health Science, Graduate School, Korea University, Seoul, Korea.

Correspondence to Hae-Young Kim, DDS, PhD. Professor, Department of Health Policy and Management, Korea University College of Health Science, and Department of Public Health Science, Korea University Graduate School, 145 Anam-ro, Seongbuk-gu, Seoul 02841, Korea. kimhaey@korea.ac.kr

Copyright © 2018. The Korean Academy of Conservative Dentistry

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (https://creativecommons.org/licenses/by-nc/4.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

In the last section, correlation was discussed as a measure describing the linear relationship between 2 variables [1]. Regression is another method for describing that relationship, using a regression line: the best-fitting straight line that we can draw through the data points on a scatter plot. Correlation depicts the direction and strength of the relationship, while regression explains or predicts the response variable (Y) using one or more predictor variables (X).
Figure 1A depicts a bivariate relationship between Y, which might represent a health problem score, and X, a pollution level. As we found in the section on correlation, the scattered X and Y pairs show a positive relationship: as X values move above their mean, the corresponding Y values tend to move above their own mean, and vice versa. We can try to draw a straight line as close as possible to those data points. However, we cannot connect the data points with an exact straight line, because generally they do not follow an exact mathematical relationship. The mathematical relationship is expressed by a straight line, and the values on the line correspond to b0 + b1X for various X values, as in Figure 1A. We call b1 the slope and b0 the intercept.
Figure 1
(A) Description of relationship of 2 variables, X and Y; (B) A conceptual relationship between the regression line and surrounding subgroups in the regression model.
Why do we try to express the scattered data points as a straight line? The main idea is that the line represents the means of Y for subgroups of X, not individual data points (Figure 1B). The subgroups of X have certain distributions around their means, e.g., a normal distribution. Therefore, individual data points can lie at some distance from the straight line, deviating from their subgroup means by various amounts. We call this model a 'regression model', and specifically a 'simple linear regression model' when only one X variable is included. In the regression model, the values on the line are regarded as the means of Y corresponding to each X value, and we call the Y values on the line Ŷ (Y hat), the predicted Y values.
The regression model was developed as a typical statistical model based on an idea of Francis Galton in 1886 [2]. To establish the simplest typical regression model, we set the following four assumptions, with the acronym 'LINE'.
  • Linear: The subgroup means of Y lie on the straight line representing the linear relationship between X and Y. In Figure 1B, the points on the line represent subgroup means, and they are connected as a straight line.

  • Independent: The observations are independent of each other, which is a common assumption for classical statistical models.

  • Normal: Within each X subgroup, the Y values are normally distributed. Based on this assumption, we can fully describe each subgroup using only its mean and variance, without any further information.

  • Equal variance: All the subgroup variances are equal. Based on this assumption, we can simplify the calculation by obtaining one common variance estimate instead of calculating each subgroup variance separately.

Figure 1B depicts the concept of linear regression by showing the subgroup means (dots) on the straight line and the normal distributions of the subgroups with equal variances. The scattered dots in Figure 1A are points observed from these subgroup distributions.
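The LINE assumptions can be made concrete with a small simulation. The sketch below (plain Python; the intercept, slope, subgroup X values, and error spread are illustrative values, not taken from the article's data) draws Y observations for several X subgroups from normal distributions that share one variance and whose means lie on a straight line:

```python
import random

random.seed(0)

b0, b1 = 30.0, 0.7   # illustrative intercept and slope (assumed values)
sigma = 5.0          # one common standard deviation for every subgroup (Equal variance)

data = []
for x in [50, 60, 70, 80, 90]:            # X subgroups
    mean_y = b0 + b1 * x                  # Linear: subgroup means lie on the line
    for _ in range(30):                   # Independent draws within each subgroup
        y = random.gauss(mean_y, sigma)   # Normal: Y varies normally around its mean
        data.append((x, y))

# Each observation deviates from its subgroup mean by a normal 'error'
errors = [y - (b0 + b1 * x) for x, y in data]
print(len(data), round(sum(errors) / len(errors), 1))
```

By construction, every observation equals its subgroup mean plus a normal error, so the average of the errors comes out close to zero in a sample of this size.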
Above, we tried to draw a straight line as 'close' as possible to the data points. In other words, the difference between the line and the data points should be minimized. To express this idea mathematically, we use the 'least squares method'. As depicted in Figure 2, the vertical difference between the line and a data point represents the deviation between the estimated mean (on the line) and the actually observed Y value for a specific X value. This deviation is called an 'error'. The theoretical distribution of the errors is assumed to be normal, as in Figure 1B. The mean of the errors in a subgroup is zero, and the variance of every subgroup is set to a common value, e.g., σ².
Figure 2
Error represents the vertical distance between the line and data point.
The errors are squared and summed to form the sum of squared errors. The least squares method aims to minimize this sum and to obtain the slope and intercept values that achieve the minimum. The estimated line should show the best fit, lying closest to the observed data points. Specifically, we take the first derivatives of the sum of squared errors with respect to the slope and intercept and set them to zero to obtain the values that produce the minimum.
The least squares estimates for the slope b1 and intercept b0 are represented as follows [3]:
b1 = ∑(X − X̄)(Y − Ȳ) / ∑(X − X̄)²
b0 = Ȳ − b1X̄
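These two estimators translate directly into code. The function below (`least_squares` is a hypothetical helper name, not from the article) computes both from paired lists:

```python
def least_squares(xs, ys):
    """Return (b0, b1) minimizing the sum of squared errors."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx           # slope: sum of cross products / sum of squared X deviations
    b0 = y_bar - b1 * x_bar  # intercept: Y mean minus slope times X mean
    return b0, b1

# Sanity check: points lying exactly on y = 2 + 3x recover b0 = 2, b1 = 3
b0, b1 = least_squares([1, 2, 3, 4], [5, 8, 11, 14])
print(b0, b1)  # → 2.0 3.0
```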
Table 1 displays the calculation procedure using EXCEL. Deviations from the means of X and Y (X − X̄ and Y − Ȳ), squared deviations of X [(X − X̄)²], and cross products of the deviations of X and Y [(X − X̄)(Y − Ȳ)] are calculated. The sum of squared deviations of X and the sum of cross products of the deviations are 2,912.95 and 2,072.85, respectively. The straight line with the least sum of squared errors is finally obtained as: Ŷ = 30.79 + 0.71X.
Table 1

Calculation of least squares estimators of slope and intercept in simple linear regression.

No  X  Y  X − X̄  Y − Ȳ  (X − X̄)²  (X − X̄)(Y − Ȳ)
1 73 90 0.55 7.65 0.30 4.21
2 52 74 −20.45 −8.35 418.20 170.76
3 68 91 −4.45 8.65 19.80 −38.49
4 47 62 −25.45 −20.35 647.70 517.91
5 60 63 −12.45 −19.35 155.00 240.91
6 71 78 −1.45 −4.35 2.10 6.31
7 67 60 −5.45 −22.35 29.70 121.81
8 80 89 7.55 6.65 57.00 50.21
9 86 82 13.55 −0.35 183.60 −4.74
10 91 105 18.55 22.65 344.10 420.16
11 67 76 −5.45 −6.35 29.70 34.61
12 73 82 0.55 −0.35 0.30 −0.19
13 71 93 −1.45 10.65 2.10 −15.44
14 57 73 −15.45 −9.35 238.70 144.46
15 86 82 13.55 −0.35 183.60 −4.74
16 76 88 3.55 5.65 12.60 20.06
17 91 97 18.55 14.65 344.10 271.76
18 69 80 −3.45 −2.35 11.90 8.11
19 87 87 14.55 4.65 211.70 67.66
20 77 95 4.55 12.65 20.70 57.56
X̄ = 72.45  Ȳ = 82.35  ∑(X − X̄)² = 2,912.95  ∑(X − X̄)(Y − Ȳ) = 2,072.85

b1 = ∑(X − X̄)(Y − Ȳ) / ∑(X − X̄)² = 2,072.85 / 2,912.95 = 0.7116

b0 = Ȳ − b1X̄ = 82.35 − 0.7116 × 72.45 = 30.79
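The arithmetic of Table 1 can be reproduced in a few lines of plain Python, using the 20 (X, Y) pairs from the table:

```python
X = [73, 52, 68, 47, 60, 71, 67, 80, 86, 91, 67, 73, 71, 57, 86, 76, 91, 69, 87, 77]
Y = [90, 74, 91, 62, 63, 78, 60, 89, 82, 105, 76, 82, 93, 73, 82, 88, 97, 80, 87, 95]

n = len(X)
x_bar, y_bar = sum(X) / n, sum(Y) / n                       # 72.45, 82.35
sxx = sum((x - x_bar) ** 2 for x in X)                      # sum of (X - X̄)² = 2,912.95
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))  # sum of cross products = 2,072.85

b1 = sxy / sxx           # slope
b0 = y_bar - b1 * x_bar  # intercept
print(round(b1, 4), round(b0, 2))  # → 0.7116 30.79
```

The rounded results match the hand calculation: b1 = 0.7116 and b0 = 30.79.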

Using the straight line, we can obtain the mean (or predicted) values of Y corresponding to specific X values. Consider the observed data point (73, 90). The predicted value Ŷ is calculated as 30.79 + 0.7116 × 73 ≈ 82.74. The error, the deviation between the observed and predicted Y values, is about 7.26 (= 90 − 82.74). In this way, we can apply the estimated regression line to predict expected Y values and the related errors. After establishing the estimated regression equation, we need to evaluate it with respect to the basic assumptions and its predictive ability.
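Prediction with the fitted line is then a single expression. A sketch using the estimates obtained above:

```python
b0, b1 = 30.79, 0.7116  # intercept and slope estimated from Table 1

def predict(x):
    """Predicted (mean) Y on the fitted line for a given X."""
    return b0 + b1 * x

y_hat = predict(73)   # predicted Y for the observed point (73, 90)
error = 90 - y_hat    # observed minus predicted
print(round(y_hat, 2), round(error, 2))  # → 82.74 7.26
```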
Recalling the Pearson correlation coefficient [1], r was calculated as follows:
r = Cov(X, Y) / [SD(X) × SD(Y)] = [∑(X − X̄)(Y − Ȳ) / (n − 1)] / [SD(X) × SD(Y)] = 109.1 / (12.38 × 12.01) = 0.73
The slope can be expressed in terms of r through some algebra:
b1 = ∑(X − X̄)(Y − Ȳ) / ∑(X − X̄)² = [∑(X − X̄)(Y − Ȳ) / (n − 1)] / [∑(X − X̄)² / (n − 1)] = [∑(X − X̄)(Y − Ȳ) / (n − 1)] / [SD(X)]² = r × SD(Y) / SD(X) = 0.73 × 12.01 / 12.38 ≈ 0.71
We can use this relationship to calculate the slope estimate as well.
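This relationship between the slope and r can be checked numerically on the Table 1 data (plain Python; SD denotes the sample standard deviation with divisor n − 1):

```python
import math

X = [73, 52, 68, 47, 60, 71, 67, 80, 86, 91, 67, 73, 71, 57, 86, 76, 91, 69, 87, 77]
Y = [90, 74, 91, 62, 63, 78, 60, 89, 82, 105, 76, 82, 93, 73, 82, 88, 97, 80, 87, 95]

n = len(X)
x_bar, y_bar = sum(X) / n, sum(Y) / n
sd_x = math.sqrt(sum((x - x_bar) ** 2 for x in X) / (n - 1))          # ≈ 12.38
sd_y = math.sqrt(sum((y - y_bar) ** 2 for y in Y) / (n - 1))          # ≈ 12.01
cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y)) / (n - 1)  # ≈ 109.1

r = cov / (sd_x * sd_y)  # Pearson correlation coefficient
b1 = r * sd_y / sd_x     # same slope as the least squares formula
print(round(r, 2), round(b1, 4))  # → 0.73 0.7116
```

The slope recovered through r agrees with the least squares estimate to four decimal places.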
  • 1. Kim HY. Statistical notes for clinical researchers: covariance and correlation. Restor Dent Endod 2018;43:e4.
  • 2. Galton F. Regression towards mediocrity in hereditary stature. J Anthropol Inst G B Irel 1886;15:246-263.
  • 3. Daniel WW. Biostatistics: basic concepts and methodology for the health sciences. 9th ed. New York (NY): John Wiley & Sons; 2010. p. 410-440.
Appendix 1

Procedure of analysis of simple linear regression model using IBM SPSS

The procedure for analysis of a simple linear regression model using IBM SPSS Statistics for Windows Version 23.0 (IBM Corp., Armonk, NY, USA) is as follows.
rde-43-e21-a001.jpg
