Fundamentals of data analysis. Methods of mathematical statistics. Regression analysis

The method was first described in works dating back to 1908, using the example of an agent selling real estate. In his records, the house sales specialist kept track of a wide range of input data for each specific building. Based on the results of the auctions, it was determined which factor had the greatest influence on the transaction price.

Analysis of a large number of transactions yielded interesting results. The final price was influenced by many factors, sometimes leading to paradoxical conclusions and even obvious “outliers” when a house with high initial potential was sold at a reduced price.

The second example of such an analysis is the work of a specialist who was entrusted with determining employee remuneration. The difficulty of the task was that it required not distributing a fixed amount equally to everyone, but making pay correspond strictly to the specific work performed. The appearance of many problems with essentially similar solutions called for a more detailed study of them at the mathematical level.

A significant place was given to the section on regression analysis, which brought together the practical methods used to study dependencies that fall under the concept of regression. Such relationships are observed between data obtained from statistical studies.

Among the many tasks to be solved, three main goals stand out: determining a general regression equation; constructing estimates of the unknown parameters that enter the regression equation; and testing statistical regression hypotheses. When studying the relationship between a pair of quantities obtained from experimental observations and forming a series (set) of the form (x1, y1), ..., (xn, yn), one relies on regression theory and assumes that one quantity, Y, has a certain probability distribution while the other, X, is held fixed.

The result Y depends on the value of the variable X; this dependence can be described by various patterns, and the accuracy of the results obtained is influenced by the nature of the observations and the purpose of the analysis. The experimental model is based on assumptions that are simplified but plausible. The main condition is that the parameter X is a controlled quantity: its values are set before the start of the experiment.

If a pair of uncontrolled variables X and Y is used during an experiment, regression analysis is carried out in the same way, but methods are applied for interpreting the results that study the relationship between the random variables. Methods of mathematical statistics are not an abstract topic; they find application in many spheres of human activity.

In the scientific literature, the term linear regression analysis is widely used for the method described above. The variable X is called a regressor or predictor, and the dependent variable Y is also called a criterion variable. This terminology reflects only the mathematical dependence of the variables, not a cause-and-effect relationship.

Regression analysis is the most common method used in processing the results of a wide variety of observations. Physical and biological dependencies are studied using this method; it is implemented both in economics and in technology. A lot of other fields use regression analysis models. Analysis of variance and multivariate statistical analysis work closely with this method of study.

In statistical modeling, regression analysis is a study used to evaluate the relationship between variables. This mathematical method includes many other methods for modeling and analyzing multiple variables where the focus is on the relationship between a dependent variable and one or more independent ones. More specifically, regression analysis helps us understand how the typical value of a dependent variable changes if one of the independent variables changes while the other independent variables remain fixed.

In all cases, the target of the estimation is a function of the independent variables, called the regression function. In regression analysis it is also of interest to characterize the variation of the dependent variable around the regression function, which can be described using a probability distribution.

Regression Analysis Problems

This statistical research method is widely used for forecasting, where it offers significant advantages, but it can also suggest illusory or false relationships, so it is recommended to apply it with care: correlation, for example, does not imply causation.

A large number of methods have been developed for regression analysis, such as linear and ordinary least squares regression, which are parametric. Their essence is that the regression function is defined in terms of a finite number of unknown parameters that are estimated from the data. Nonparametric regression allows its function to lie within a specific set of functions, which can be infinite-dimensional.

As a statistical research method, regression analysis in practice depends on the form of the data-generating process and on how it relates to the regression approach. Since the true form of the data-generating process is usually unknown, regression analysis often depends to some extent on assumptions about this process. These assumptions are sometimes testable if enough data are available. Regression models are often useful even when the assumptions are moderately violated, although they may not perform at peak efficiency.

In a narrower sense, regression may refer specifically to the estimation of continuous response variables, as opposed to the discrete response variables used in classification. The continuous output variable case is also called metric regression to distinguish it from related problems.

History

The earliest form of regression is the well-known least squares method. It was published by Legendre in 1805 and Gauss in 1809. Legendre and Gauss applied the method to the problem of determining from astronomical observations the orbits of bodies around the Sun (mainly comets, but later also newly discovered minor planets). Gauss published a further development of least squares theory in 1821, including a version of the Gauss–Markov theorem.

The term "regression" was coined by Francis Galton in the 19th century to describe a biological phenomenon. The idea was that the height of descendants from that of their ancestors tends to regress downwards towards the normal mean. For Galton, regression had only this biological meaning, but later his work was continued by Udney Yoley and Karl Pearson and brought into a more general statistical context. In the work of Yule and Pearson, the joint distribution of response and explanatory variables is assumed to be Gaussian. This assumption was rejected by Fischer in papers of 1922 and 1925. Fisher suggested that the conditional distribution of the response variable is Gaussian, but the joint distribution need not be. In this regard, Fischer's proposal is closer to Gauss's formulation of 1821. Before 1970, it sometimes took up to 24 hours to get the result of a regression analysis.

Regression analysis methods continue to be an area of active research. In recent decades, new methods have been developed for robust regression; regression with correlated responses; regression methods that accommodate different types of missing data; nonparametric regression; Bayesian regression methods; regression in which predictor variables are measured with error; regression with more predictors than observations; and causal inference with regression.

Regression models

Regression analysis models include the following variables:

  • Unknown parameters, denoted β, which may be a scalar or a vector.
  • Independent Variables, X.
  • Dependent Variables, Y.

Different fields of science where regression analysis is used use different terms in place of dependent and independent variables, but in all cases the regression model relates Y to a function of X and β.

The approximation is usually written as E(Y | X) = f(X, β). To carry out regression analysis, the form of the function f must be determined. Sometimes it is based on knowledge about the relationship between Y and X that does not rely on the data; if such knowledge is not available, a flexible or convenient form of f is chosen.
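As a minimal illustrative sketch (not part of the original text), the example below assumes a linear form f(X, β) = β0 + β1·X and fits it to simulated data with NumPy; the data and coefficient values are purely hypothetical.

```python
# Minimal sketch: approximating E(Y | X) = f(X, beta) with an assumed linear form
# f(X, beta) = beta0 + beta1 * X, fitted to simulated data.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)                 # controlled values of X
y = 2.0 + 0.5 * x + rng.normal(0, 1, 50)   # observed Y = f(X, beta) + noise

# Fit the assumed linear form by least squares (np.polyfit returns [slope, intercept]).
beta1, beta0 = np.polyfit(x, y, deg=1)
print(f"estimated f(X, beta) = {beta0:.2f} + {beta1:.2f} * X")
```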

Dependent variable Y

Let us now assume that the vector of unknown parameters β has length k. To perform regression analysis, the user must provide information about the dependent variable Y:

  • If N data points of the form (Y, X) are observed, where N < k, most classical approaches to regression analysis cannot be carried out, since the system of equations defining the regression model is underdetermined: there is not enough data to recover β.
  • If exactly N = k points are observed and the function f is linear, the equation Y = f(X, β) can be solved exactly rather than approximately. This amounts to solving a set of N equations with N unknowns (the elements of β), which has a unique solution as long as the X values are linearly independent. If f is nonlinear, there may be no solution, or many solutions may exist.
  • The most common situation is where N > k data points are observed. In this case, there is enough information in the data to estimate a unique value for β that best fits the data, and the regression model applied to the data can be viewed as an overdetermined system in β.

In the latter case, regression analysis provides tools for:

  • Finding a solution for the unknown parameters β, which will, for example, minimize the distance between the measured and predicted value of Y.
  • Under certain statistical assumptions, regression analysis uses excess information to provide statistical information about the unknown parameters β and the predicted values ​​of the dependent variable Y.
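An illustrative sketch of the overdetermined case (N > k), not from the original text: ordinary least squares via NumPy's lstsq on simulated data picks the β that minimizes the distance between measured and predicted Y.

```python
# Sketch: with N > k observations the system Y = F(X, beta) is overdetermined,
# and least squares picks the beta minimizing the sum of squared residuals.
import numpy as np

rng = np.random.default_rng(1)
N, k = 100, 3                                  # N data points, k unknown parameters
X = np.column_stack([np.ones(N),               # intercept column
                     rng.uniform(0, 1, N),
                     rng.uniform(0, 1, N)])
true_beta = np.array([1.0, -2.0, 0.5])
Y = X @ true_beta + rng.normal(0, 0.1, N)      # noisy measurements

beta_hat, residual_ss, rank, _ = np.linalg.lstsq(X, Y, rcond=None)
print("estimated beta:", beta_hat)             # close to [1.0, -2.0, 0.5]
```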

Required number of independent measurements

Consider a regression model that has three unknown parameters: β 0 , β 1 and β 2 . Suppose the experimenter makes 10 measurements on the same value of the independent variable vector X. In this case, regression analysis does not produce a unique set of values. The best you can do is estimate the mean and standard deviation of the dependent variable Y. Similarly, by measuring two different values ​​of X, you can obtain enough data for regression with two unknowns, but not with three or more unknowns.

If the experimenter's measurements were made at three different values ​​of the independent variable vector X, then the regression analysis will provide a unique set of estimates for the three unknown parameters in β.

In the case of general linear regression, the above statement is equivalent to the requirement that the matrix XᵀX is invertible.
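A hypothetical check of that statement (not from the original text): for a model with three parameters β0, β1, β2, measurements taken at only two distinct values of X leave XᵀX singular, while three distinct values make it invertible.

```python
# Sketch: rank of X^T X for a quadratic model (beta0, beta1, beta2) when
# measurements are taken at two vs. three distinct values of X.
import numpy as np

def design(x_values):
    x = np.asarray(x_values, dtype=float)
    return np.column_stack([np.ones_like(x), x, x**2])   # columns for beta0, beta1, beta2

X_two = design([1, 1, 1, 1, 1, 2, 2, 2, 2, 2])      # only two distinct X values
X_three = design([1, 1, 1, 2, 2, 2, 3, 3, 3, 3])    # three distinct X values

print(np.linalg.matrix_rank(X_two.T @ X_two))       # 2  -> X^T X not invertible
print(np.linalg.matrix_rank(X_three.T @ X_three))   # 3  -> X^T X invertible
```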

Statistical assumptions

When the number of measurements N is greater than the number of unknown parameters k, the excess information contained in the measurements can be used to make statistical inferences about the unknown parameters and about the measurement errors εᵢ. This excess information is called the regression degrees of freedom.

Fundamental Assumptions

Classic assumptions for regression analysis include:

  • The sample is representative of the population for which inference and prediction are made.
  • The error term is a random variable with a mean of zero conditional on the explanatory variables.
  • Independent variables are measured without errors.
  • The independent variables (predictors) are linearly independent, that is, no predictor can be expressed as a linear combination of the others.
  • The errors are uncorrelated, that is, the error covariance matrix is diagonal, and each non-zero element is the variance of the error.
  • The error variance is constant across observations (homoscedasticity). If not, then weighted least squares or other methods can be used.

These are sufficient conditions for the least squares estimator to possess the required properties; in particular, these assumptions imply that the parameter estimates will be unbiased, consistent, and efficient within the class of linear estimators. It is important to note that real data rarely satisfy all the conditions fully; that is, the method is used even when the assumptions do not hold exactly. Departure from the assumptions can sometimes be used as a measure of how useful the model is. Many of these assumptions can be relaxed in more advanced methods. Statistical analysis reports typically include tests on the sample data and assessments of the model's usefulness.

Additionally, variables in some cases refer to values measured at point locations. There may be spatial trends and spatial autocorrelation in the variables that violate statistical assumptions. Geographically weighted regression is one method that deals with such data.

A feature of linear regression is that the dependent variable Yi is a linear combination of the parameters. For example, simple linear regression uses one independent variable, xi, and two parameters, β0 and β1, to model n points.

In multiple linear regression, there are multiple independent variables or functions of them.

When a random sample is taken from a population, its parameters allow one to obtain a sample linear regression model.

In this respect, the most popular approach is the least squares method. It is used to obtain parameter estimates that minimize the sum of squared residuals. This kind of minimization (typical of linear regression) leads to a set of normal equations — a set of linear equations in the parameters — which are solved to obtain the parameter estimates.

Under the further assumption that the population error is normally distributed, a researcher can use these estimates of standard errors to create confidence intervals and conduct hypothesis tests about the parameters.
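A sketch of this procedure on simulated data (assumed values, not from the text): solve the normal equations XᵀXβ = Xᵀy, estimate the residual variance, and form 95% confidence intervals for the parameters.

```python
# Sketch: normal equations, residual variance, standard errors and 95% CIs.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 60
x = rng.uniform(0, 10, n)
y = 3.0 + 1.5 * x + rng.normal(0, 2.0, n)

X = np.column_stack([np.ones(n), x])           # design matrix with intercept
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y                   # normal-equation solution

residuals = y - X @ beta_hat
dof = n - X.shape[1]                           # regression degrees of freedom
s2 = residuals @ residuals / dof               # residual variance estimate
se = np.sqrt(np.diag(s2 * XtX_inv))            # standard errors of the estimates

t_crit = stats.t.ppf(0.975, dof)
for name, b, s in zip(["intercept", "slope"], beta_hat, se):
    print(f"{name}: {b:.3f}  95% CI [{b - t_crit*s:.3f}, {b + t_crit*s:.3f}]")
```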

Nonlinear regression analysis

When the function is not linear with respect to the parameters, the sum of squares must be minimized using an iterative procedure. This introduces many complications that define the differences between the linear and nonlinear least squares methods. Consequently, the results of regression analysis using a nonlinear method are sometimes less predictable.
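An illustrative sketch (with a hypothetical exponential model, not from the original text): SciPy's curve_fit performs such an iterative minimization of the sum of squares, and the choice of starting point p0 affects convergence.

```python
# Sketch: a model nonlinear in its parameters, fitted by iterative least squares.
import numpy as np
from scipy.optimize import curve_fit

def model(x, a, b):
    return a * np.exp(b * x)        # nonlinear in the parameter b

rng = np.random.default_rng(3)
x = np.linspace(0, 2, 40)
y = model(x, 2.0, 1.3) + rng.normal(0, 0.1, 40)

# A poor initial guess p0 may lead to a different local minimum or fail to
# converge, which is why nonlinear results can be less predictable.
params, cov = curve_fit(model, x, y, p0=[1.0, 1.0])
print("estimated a, b:", params)
```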

Calculation of power and sample size

There are generally no agreed rules regarding the number of observations versus the number of independent variables in the model. One rule of thumb was proposed by Good and Hardin and looks like N = m^n, where N is the sample size, n is the number of independent variables, and m is the number of observations needed to achieve the desired accuracy if the model had only one independent variable. For example, a researcher builds a linear regression model using a data set that contains 1000 patients (N). If the researcher decides that five observations are needed to accurately define the line (m = 5), then the maximum number of independent variables the model can support is 4, as the short calculation below shows.
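A worked check of that rule of thumb (the numbers follow the example above):

```python
# N = m^n rule of thumb: with N = 1000 and m = 5, the largest n with m^n <= N
# is 4, since 5^4 = 625 <= 1000 < 5^5 = 3125.
import math

N, m = 1000, 5
n_max = math.floor(math.log(N) / math.log(m))
print(n_max, m**n_max, m**(n_max + 1))   # 4 625 3125
```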

Other methods

Although regression model parameters are typically estimated using the least squares method, there are other methods that are used much less frequently. For example, these are the following methods:

  • Bayesian methods (for example, Bayesian linear regression).
  • Percentage regression, used for situations where reducing percentage errors is considered more appropriate.
  • Least absolute deviations, which is more robust in the presence of outliers and leads to quantile regression.
  • Nonparametric regression, which requires a large number of observations and calculations.
  • Distance metric learning, in which a meaningful distance metric is learned for a given input space.

Software

All major statistical software packages perform least squares regression analysis. Simple linear regression and multiple regression analysis can be used in some spreadsheet applications as well as some calculators. Although many statistical software packages can perform various types of nonparametric and robust regression, these methods are less standardized; different software packages implement different methods. Specialized regression software has been developed for use in areas such as examination analysis and neuroimaging.

The regression analysis method is used to determine the technical and economic parameters of products belonging to a specific parametric series in order to build and align value relationships. This method is used to analyze and justify the level and price ratios of products characterized by the presence of one or more technical and economic parameters that reflect the main consumer properties. Regression analysis allows us to find an empirical formula that describes the dependence of price on the technical and economic parameters of products:

P = f(X1, X2, ..., Xn),

where P is the value of the unit price of the product, rub.; (X1, X2, ... Xn) - technical and economic parameters of products.

The method of regression analysis - the most advanced of the used normative-parametric methods - is effective when carrying out calculations based on the use of modern information technologies and systems. Its application includes the following main steps:

  • determination of classification parametric groups of products;
  • selection of parameters that most influence the price of the product;
  • selection and justification of the form of connection between price changes when parameters change;
  • construction of a system of normal equations and calculation of regression coefficients.

The main classification group of products, the price of which is subject to equalization, is a parametric series, within which products can be grouped into different designs depending on their application, operating conditions and requirements, etc. When forming parametric series, automatic classification methods can be used, which make it possible to distinguish homogeneous groups from the total mass of products. The selection of technical and economic parameters is made based on the following basic requirements:

  • the selected parameters include parameters recorded in standards and technical specifications; in addition to technical parameters (power, load capacity, speed, etc.), indicators of product serialization, complexity coefficients, unification, etc. are used;
  • the set of selected parameters should sufficiently fully characterize the design, technological and operational properties of the products included in the series, and have a fairly close correlation with price;
  • parameters should not be interdependent.

To select technical and economic parameters that significantly affect the price, a matrix of pair correlation coefficients is calculated. Based on the magnitude of the correlation coefficients between the parameters, one can judge the closeness of their connection. At the same time, a correlation close to zero shows an insignificant influence of the parameter on the price. The final selection of technical and economic parameters is carried out in the process of step-by-step regression analysis using computer technology and appropriate standard programs.

In pricing practice, the following set of functions is used:

linear

P = a0 + a1X1 + ... + anXn;

linear-power

P = a0 + a1X1 + ... + anXn + an+1X1^2 + ... + a2nXn^2;

inverse logarithmic

P = a0 + a1/ln X1 + ... + an/ln Xn;

power

P = a0 · X1^a1 · X2^a2 · ... · Xn^an;

exponential

P = e^(a0 + a1X1 + ... + anXn);

hyperbolic

P = a0 + a1/X1 + a2/X2 + ... + an/Xn,

where P is the equalized price; X1, X2, ..., Xn are the values of the technical and economic parameters of the products of the series; a0, a1, ..., an are the calculated coefficients of the regression equation.

In practical work on pricing, depending on the form of relationship between prices and technical and economic parameters, other regression equations can be used. The type of function of the connection between price and a set of technical and economic parameters can be preset or selected automatically during computer processing. The closeness of the correlation between the price and the set of parameters is assessed by the value of the multiple correlation coefficient. Its proximity to one indicates a close connection. Using the regression equation, equalized (calculated) price values ​​for products of a given parametric series are obtained. To evaluate the results of equalization, the relative values ​​of the deviation of the calculated price values ​​from the actual ones are calculated:

Cr = (Pf - Pr) / Pf x 100,

where Pf and Pr are the actual and calculated prices, respectively.

The value of Cr should not exceed 8-10%. In case of significant deviations of the calculated values from the actual ones, it is necessary to investigate:

  • the correctness of the formation of a parametric series, since it may contain products that, in their parameters, differ sharply from other products in the series. They must be excluded;
  • correct selection of technical and economic parameters. A set of parameters is possible that is weakly correlated with price. In this case, it is necessary to continue searching and selecting parameters.

The procedure and methodology for conducting regression analysis, finding unknown parameters of the equation and economic assessment of the results obtained are carried out in accordance with the requirements of mathematical statistics.
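A hypothetical sketch of this pricing procedure (the parameter values and product data below are invented for illustration): fit a linear price model by least squares, compute the multiple correlation coefficient R, and check the relative deviations Cr of calculated prices from actual ones.

```python
# Sketch: fit P = a0 + a1*X1 + a2*X2 for a hypothetical parametric series,
# compute the multiple correlation R and the relative deviations Cr (%).
import numpy as np

P  = np.array([120.0, 150.0, 180.0, 210.0, 260.0, 300.0])   # actual prices
X1 = np.array([10.0, 14.0, 18.0, 22.0, 28.0, 34.0])         # e.g. power
X2 = np.array([1.0, 1.2, 1.3, 1.5, 1.8, 2.0])               # e.g. load capacity

X = np.column_stack([np.ones_like(X1), X1, X2])
coeffs, *_ = np.linalg.lstsq(X, P, rcond=None)               # a0, a1, a2
P_calc = X @ coeffs                                          # equalized (calculated) prices

R = np.sqrt(1 - np.sum((P - P_calc) ** 2) / np.sum((P - P.mean()) ** 2))
Cr = (P - P_calc) / P * 100                                  # relative deviations, %

print("coefficients:", coeffs)
print("multiple correlation R:", round(R, 3))
print("deviations Cr, %:", np.round(Cr, 1))                  # should stay within ~8-10%
```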

During their studies, students very often encounter a variety of equations. One of them - the regression equation - is discussed in this article. This type of equation is used specifically to describe the characteristics of the relationship between mathematical parameters. This type of equality is used in statistics and econometrics.

Definition of regression

In mathematics, regression means a quantity describing the dependence of the average value of one quantity on the values of another. The regression equation expresses, as a function of one characteristic, the average value of another characteristic. The regression function has the form of a simple equation y = f(x), in which y acts as the dependent variable and x as the independent variable (the feature-factor).

What are the types of relationships between variables?

In general, there are two opposing types of relationships: correlation and regression.

The first is characterized by the equal standing of the variables: in this case, it is not reliably known which variable depends on the other.

If there is no such equality between the variables and the conditions indicate which variable is explanatory and which is dependent, then we can speak of a relationship of the second type. To construct a linear regression equation, it is necessary to find out what type of relationship is observed.

Types of regressions

Today, there are 7 different types of regression: hyperbolic, linear, multiple, nonlinear, pairwise, inverse, logarithmically linear.

Hyperbolic, linear and logarithmic

The linear regression equation is used in statistics to clearly explain the parameters of the equation. It looks like y = c + m·x + E. A hyperbolic equation has the form of a regular hyperbola: y = c + m/x + E. A logarithmically linear equation expresses the relationship using a logarithmic function: ln y = ln c + m·ln x + ln E.

Multiple and nonlinear

Two more complex types of regression are multiple and nonlinear. The multiple regression equation is expressed by the function y = f(x1, x2, ..., xm) + E. In this situation, y acts as the dependent variable and the x's act as explanatory variables. The variable E is stochastic; it includes the influence of other factors on the equation. The nonlinear regression equation is somewhat more subtle: it is not linear with respect to the indicators taken into account, but it is linear with respect to the parameters being estimated.

Inverse and paired types of regressions

An inverse regression is a type of function that needs to be converted to a linear form. In the most traditional application programs it has the form y = 1/(c + m·x + E). A pairwise regression equation shows the relationship between the data as a function y = f(x) + E. Just as in the other equations, y depends on x, and E is a stochastic parameter.

Concept of correlation

This is an indicator demonstrating the existence of a relationship between two phenomena or processes. The strength of the relationship is expressed by the correlation coefficient. Its value lies in the interval [-1; +1]. A negative value indicates an inverse relationship, a positive value a direct one. If the coefficient equals 0, there is no relationship. The closer the value is to 1 in absolute value, the stronger the relationship between the parameters; the closer to 0, the weaker it is.
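A small illustrative sketch (simulated data, not from the text) of how the correlation coefficient behaves for direct, inverse, and absent relationships:

```python
# Sketch: the correlation coefficient lies in [-1, +1]; negative values indicate
# an inverse relationship, values near 0 indicate almost no relationship.
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=200)
y_direct  =  2 * x + rng.normal(scale=0.5, size=200)   # direct relationship
y_inverse = -2 * x + rng.normal(scale=0.5, size=200)   # inverse relationship
y_none    = rng.normal(size=200)                        # no relationship

print(np.corrcoef(x, y_direct)[0, 1])    # close to +1
print(np.corrcoef(x, y_inverse)[0, 1])   # close to -1
print(np.corrcoef(x, y_none)[0, 1])      # close to 0
```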

Methods

Parametric correlation methods can assess the strength of the relationship. They are based on distribution estimates and are used to study parameters that obey the normal distribution law.

The parameters of the linear regression equation are needed to identify the type of dependence, to specify the function of the regression equation, and to evaluate the indicators of the selected relationship formula. A correlation field is used as a method of identifying the relationship. To build it, all available data must be depicted graphically: the points are plotted in a rectangular two-dimensional coordinate system, with the values of the explanatory factor marked along the abscissa axis and the values of the dependent factor along the ordinate axis. If there is a functional relationship between the parameters, the points line up in the form of a line.

If the correlation coefficient of such data is less than 30%, we can speak of an almost complete absence of connection. If it is between 30% and 70%, then this indicates the presence of medium-close connections. A 100% indicator is evidence of a functional connection.

A nonlinear regression equation, just like a linear one, must be supplemented with a correlation index (R).

Correlation for Multiple Regression

The coefficient of determination is the square of the multiple correlation coefficient. It describes how closely the presented set of indicators is related to the characteristic under study, and it can also describe the nature of the influence of the parameters on the result. The multiple regression equation is assessed using this indicator.

In order to calculate the multiple correlation indicator, it is necessary to calculate its index.

Least squares method

This method is a way of estimating regression coefficients. Its essence is to minimize the sum of squared deviations of the observed values of the dependent variable from the values given by the function.

A pairwise linear regression equation can be estimated using such a method. This type of equation is used when a paired linear relationship is detected between indicators.

Equation Parameters

Each parameter of the linear regression function has a specific meaning. The paired linear regression equation contains two parameters: c and m. The parameter m shows the average change in the resulting indicator y when the variable x increases (or decreases) by one conventional unit. If the variable x is zero, the function equals the parameter c. If the variable x is not zero, the factor c carries no economic meaning; what matters is only the sign in front of it. A minus indicates that the result changes more slowly than the factor, a plus that it changes faster.

Each parameter of the regression equation can itself be expressed through an equation. For example, the factor c has the form c = y̅ − m·x̅, where y̅ and x̅ are the mean values of y and x.
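A short sketch of those relations on simulated data (the coefficients are hypothetical): the slope m is estimated from the covariance of x and y, and the intercept c from c = y̅ − m·x̅.

```python
# Sketch: slope from cov(x, y) / var(x), intercept from c = mean(y) - m * mean(x).
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(0, 10, 100)
y = 4.0 + 0.8 * x + rng.normal(0, 1.0, 100)

m = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)   # slope estimate
c = y.mean() - m * x.mean()                           # intercept estimate
print(f"y = {c:.2f} + {m:.2f} * x")
```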

Grouped data

There are problem settings in which all information is grouped by the attribute x, and for each group the corresponding average values of the dependent indicator are given. In this case, the average values characterize how the indicator depending on x changes, so the grouped information helps to find the regression equation and is used in the analysis of relationships. However, this method has its drawbacks: average values are often subject to external fluctuations, which do not reflect the pattern of the relationship but merely mask it as "noise." Averages show the pattern of the relationship much worse than a linear regression equation; nevertheless, they can be used as a basis for finding the equation. By multiplying the size of an individual group by the corresponding average, one can obtain the sum of y within the group. Next, all such sums are added up to find the overall total of y. It is a little more difficult to compute the sum of the products xy. If the intervals are small, we can conditionally take the x value for all units within a group to be the same, multiply it by the group sum of y to obtain the sum of the products of x and y within the group, and then add up all the group sums to obtain the total sum xy.

Multiple pairwise regression equation: assessing the importance of a relationship

As discussed earlier, multiple regression has a function of the form y = f(x1, x2, ..., xm) + E. Most often, such an equation is used to solve the problem of supply and demand for a product, interest income on repurchased shares, and to study the causes and shape of the production cost function. It is also actively used in a wide variety of macroeconomic studies and calculations, but at the microeconomic level this equation is used somewhat less frequently.

The main task of multiple regression is to build a model of data containing a huge amount of information in order to further determine what influence each of the factors has individually and in their totality on the indicator that needs to be modeled and its coefficients. The regression equation can take on a wide variety of values. In this case, to assess the relationship, two types of functions are usually used: linear and nonlinear.

The linear function has the following form: y = a0 + a1x1 + a2x2 + ... + amxm. Here a1, a2, ..., am are the "pure" regression coefficients. They characterize the average change in the parameter y when the corresponding parameter x changes (increases or decreases) by one unit, provided the values of the other indicators remain constant.

Nonlinear equations have, for example, the form of a power function y = a·x1^b1·x2^b2·...·xm^bm. Here the indicators b1, b2, ..., bm are called elasticity coefficients; they show by how many percent the result will change when the corresponding indicator x increases (or decreases) by 1%, with the other factors held constant.
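An illustrative sketch of the power form (hypothetical data and exponents, not from the text): taking logarithms turns it into a linear multiple regression, so the elasticity coefficients can be estimated by ordinary least squares.

```python
# Sketch: fit y = a * x1^b1 * x2^b2 by log-log regression,
# ln y = ln a + b1*ln x1 + b2*ln x2.
import numpy as np

rng = np.random.default_rng(6)
n = 200
x1 = rng.uniform(1, 10, n)
x2 = rng.uniform(1, 10, n)
y = 2.0 * x1**0.7 * x2**(-0.3) * np.exp(rng.normal(0, 0.05, n))  # multiplicative noise

X = np.column_stack([np.ones(n), np.log(x1), np.log(x2)])
coef, *_ = np.linalg.lstsq(X, np.log(y), rcond=None)
ln_a, b1, b2 = coef
print(f"a = {np.exp(ln_a):.2f}, elasticities b1 = {b1:.2f}, b2 = {b2:.2f}")
```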

What factors need to be taken into account when constructing multiple regression

In order to correctly build multiple regression, it is necessary to find out which factors should be paid special attention to.

It is necessary to have some understanding of the nature of the relationships between economic factors and what is being modeled. Factors that will need to be included must meet the following criteria:

  • Must be subject to quantitative measurement. In order to use a factor that describes the quality of an object, in any case it should be given a quantitative form.
  • There should be no intercorrelation of factors, and no functional relationship between them. Such relationships most often lead to irreversible consequences: the system of normal equations becomes ill-conditioned, which entails unreliable and poorly defined estimates.
  • When the correlation between factors is very high, it is impossible to determine the isolated influence of each factor on the final result, so the coefficients become uninterpretable.

Construction methods

There are a great many methods and approaches that explain how to select factors for an equation. However, all of them are based on selecting coefficients using a correlation indicator. Among them are:

  • Elimination method.
  • Inclusion method.
  • Stepwise regression analysis.

The first method involves screening out factors from the full set. The second involves introducing additional factors one by one. The third is the elimination of factors that were previously introduced into the equation. Each of these methods has a right to exist; they have their pros and cons, but they all solve the problem of discarding unnecessary indicators in their own way. As a rule, the results obtained by the individual methods are quite close.

Multivariate analysis methods

Such methods for determining factors are based on consideration of individual combinations of interrelated characteristics. These include discriminant analysis, shape recognition, principal component analysis, and cluster analysis. In addition, there is also factor analysis, but it appeared due to the development of the component method. All of them apply in certain circumstances, subject to certain conditions and factors.

What is regression?

Consider two continuous variables x = (x1, x2, ..., xn), y = (y1, y2, ..., yn).

Let us place the points on a two-dimensional scatter plot and say that we have a linear relationship if the data are approximated by a straight line.

If we believe that y depends on x, and changes in y are caused precisely by changes in x, we can determine the regression line (regression y on x), which best describes the linear relationship between these two variables.

The statistical use of the word regression comes from the phenomenon known as regression to the mean, attributed to Sir Francis Galton (1889).

He showed that although tall fathers tend to have tall sons, the average height of sons is shorter than that of their tall fathers. The average height of sons "regressed" and "moved backward" towards the average height of all fathers in the population. Thus, on average, tall fathers have shorter (but still quite tall) sons, and short fathers have taller (but still quite short) sons.

Regression line

A simple (pairwise) linear regression line is estimated by a mathematical equation of the form Y = a + bx, where:

x is the independent variable or predictor.

Y is the dependent variable or response variable; this is the value we expect for y (on average) if we know the value of x, i.e., the "predicted value of y".

  • a is the free term (intercept) of the estimated line; it is the value of Y when x = 0 (Fig. 1).
  • b is the slope or gradient of the estimated line; it represents the amount by which Y increases on average when x increases by one unit.
  • a and b are called the regression coefficients of the estimated line, although this term is often used only for b.

Pairwise linear regression can be extended to include more than one independent variable; in this case it is known as multiple regression.

Fig.1. Linear regression line showing the intercept a and the slope b (the amount Y increases as x increases by one unit)

Least squares method

We perform regression analysis using a sample of observations, where a and b are sample estimates of the true (general population) parameters α and β, which determine the linear regression line in the population.

The simplest method for determining the coefficients a and b is the least squares method (OLS).

The fit is assessed by examining the residuals (the vertical distance of each point from the line, i.e., residual = observed y − predicted y, Fig. 2).

The line of best fit is chosen so that the sum of squares of the residuals is minimal.

Fig. 2. Linear regression line with residuals depicted (vertical dotted lines) for each point.

Linear Regression Assumptions

So, for each observed value, the residual is equal to the difference between the observed value and the corresponding predicted value. Each residual can be positive or negative.

You can use residuals to test the following assumptions behind linear regression:

  • The relationship between x and y is linear;
  • The residuals are normally distributed with a mean of zero;
  • The variance of the residuals is constant across values of x.

If the assumptions of linearity, normality, and/or constant variance are questionable, we can transform x or y and calculate a new regression line for which these assumptions are satisfied (for example, use a logarithmic transformation, etc.).

Anomalous values ​​(outliers) and influence points

An "influential" observation, if omitted, changes one or more model parameter estimates (ie, slope or intercept).

An outlier (an observation that is inconsistent with the majority of values ​​in a data set) can be an "influential" observation and can be easily detected visually by inspecting a bivariate scatterplot or residual plot.

For both outliers and "influential" observations (points), models are fitted with and without them, and attention is paid to the changes in the estimates (regression coefficients).

When conducting an analysis, you should not automatically discard outliers or influence points, since simply ignoring them can affect the results obtained. Always study the reasons for these outliers and analyze them.

Linear regression hypothesis

When constructing linear regression, the null hypothesis is tested that the general slope of the regression line β is equal to zero.

If the slope of the line is zero, there is no linear relationship between x and y: changes in x do not affect y.

To test the null hypothesis that the true slope is zero, you can use the following algorithm:

Calculate the test statistic T = b / SE(b), which follows a t distribution with n − 2 degrees of freedom, where SE(b) is the standard error of the coefficient b:

SE(b) = s_res / sqrt( Σ(xi − x̅)² ),

and s_res² is the estimate of the variance of the residuals.

Typically, if the attained significance level is small (for example, below 0.05), the null hypothesis is rejected.

A 95% confidence interval for the slope is b ± t* × SE(b), where t* is the percentage point of the t distribution with n − 2 degrees of freedom that gives a two-sided probability of 0.05. This is the interval that contains the general (population) slope with a probability of 95%.

For large samples, t* can be approximated by 1.96 (that is, the test statistic tends to be normally distributed).
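A sketch of this algorithm on simulated data (the values are hypothetical): estimate b and SE(b), compute the t statistic and two-sided p-value, and form the 95% confidence interval.

```python
# Sketch: test H0: beta = 0 for a paired linear regression.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n = 30
x = rng.uniform(0, 10, n)
y = 1.0 + 0.4 * x + rng.normal(0, 1.0, n)

b = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)      # estimated slope
a = y.mean() - b * x.mean()
residuals = y - (a + b * x)
s_res = np.sqrt(residuals @ residuals / (n - 2))        # residual std. deviation
se_b = s_res / np.sqrt(np.sum((x - x.mean()) ** 2))     # standard error of b

T = b / se_b                                            # test statistic
p_value = 2 * stats.t.sf(abs(T), df=n - 2)              # two-sided p-value
t_crit = stats.t.ppf(0.975, df=n - 2)
print(f"b = {b:.3f}, T = {T:.2f}, p = {p_value:.4f}, "
      f"95% CI [{b - t_crit*se_b:.3f}, {b + t_crit*se_b:.3f}]")
```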

Assessing the quality of linear regression: coefficient of determination R 2

Because of the linear relationship between x and y, we expect that y changes as x changes; we call this the variation that is due to, or explained by, the regression. The residual variation should be as small as possible.

If this is true, then most of the variation will be explained by regression, and the points will lie close to the regression line, i.e. the line fits the data well.

The proportion of the total variance that is explained by the regression is called the coefficient of determination; it is usually expressed as a percentage and denoted R² (in paired linear regression it equals r², the square of the correlation coefficient). It allows one to assess the quality of the regression equation.

The difference 100% − R² represents the percentage of variance that cannot be explained by the regression.

There is no formal test for evaluating R²; we must rely on subjective judgment to determine the goodness of fit of the regression line.
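An illustrative sketch (simulated data) of the R² calculation, showing that for paired regression it coincides with the squared correlation coefficient:

```python
# Sketch: R^2 as the share of total variation explained by the regression.
import numpy as np

rng = np.random.default_rng(8)
x = rng.uniform(0, 10, 100)
y = 2.0 + 0.9 * x + rng.normal(0, 1.5, 100)

b, a = np.polyfit(x, y, 1)
y_pred = a + b * x

ss_res = np.sum((y - y_pred) ** 2)            # residual (unexplained) variation
ss_tot = np.sum((y - y.mean()) ** 2)          # total variation
r2 = 1 - ss_res / ss_tot
print(round(r2, 3), round(np.corrcoef(x, y)[0, 1] ** 2, 3))   # the two agree
```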

Applying a Regression Line to Forecast

You can use the regression line to predict a value of y from a value of x within the observed range (never extrapolate beyond these limits).

We predict the mean of y for observations that have a particular value of x by plugging that value of x into the equation of the regression line.

So, if we predict y as ŷ = a + bx, we use this predicted value and its standard error to estimate a confidence interval for the true population mean.

Repeating this procedure for different values of x allows you to construct confidence limits for the whole line. This is the band or region that contains the true line, for example with 95% confidence.

Simple regression plans

Simple regression designs contain one continuous predictor. If there are 3 observations with predictor values P of 7, 4, and 9, and the design includes a first-order effect of P, then the design matrix X consists of an intercept column and a column with the values of P:

[ 1  7 ]
[ 1  4 ]
[ 1  9 ]

and the regression equation using P for X1 is

Y = b0 + b1 P

If a simple regression design contains a higher-order effect of P, such as a quadratic effect, then the values in column X1 of the design matrix are raised to the second power:

[ 1  49 ]
[ 1  16 ]
[ 1  81 ]

and the equation will take the form

Y = b0 + b1·P²

Sigma-restricted and overparameterized coding methods do not apply to simple regression designs or to other designs containing only continuous predictors (since there are simply no categorical predictors to code). Regardless of the coding method chosen, the values of the continuous variables are entered as-is and used as the values of the X variables; no recoding is performed. In addition, when describing regression designs, you can omit the design matrix X and work only with the regression equation.
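A small sketch (not from the original text) of the two design matrices described above, built in NumPy for the predictor values P = 7, 4, 9:

```python
# Sketch: design matrices for a first-order and a quadratic effect of P.
import numpy as np

P = np.array([7.0, 4.0, 9.0])

X_linear = np.column_stack([np.ones_like(P), P])        # for Y = b0 + b1*P
X_quadratic = np.column_stack([np.ones_like(P), P**2])  # for Y = b0 + b1*P^2

print(X_linear)
print(X_quadratic)
```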

Example: Simple Regression Analysis

This example uses the data presented in the table:

Fig. 3. Table of initial data.

The data were compiled from a comparison of the 1960 and 1970 censuses for 30 randomly selected counties. County names are given as observation names. Information on each variable is presented below:

Fig. 4. Table of variable specifications.

Research problem

For this example, we will analyze the degree to which population change predicts the poverty rate, i.e., the percentage of families below the poverty line. We will therefore treat variable 3 (Pt_Poor) as the dependent variable.

We can put forward a hypothesis: changes in population size and the percentage of families that are below the poverty line are related. It seems reasonable to expect that poverty leads to out-migration, so there would be a negative correlation between the percentage of people below the poverty line and population change. Therefore, we will treat variable 1 (Pop_Chng) as a predictor variable.
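A hypothetical sketch of this analysis (the actual census data are not reproduced here, so simulated values stand in for Pop_Chng and Pt_Poor): SciPy's linregress returns the slope, correlation, and p-value for the simple regression.

```python
# Sketch: regressing the poverty rate (Pt_Poor) on population change (Pop_Chng).
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(9)
pop_chng = rng.normal(0, 10, 30)                          # % population change, 30 counties
pt_poor = 20 - 0.4 * pop_chng + rng.normal(0, 3, 30)      # % families below poverty line

result = linregress(pop_chng, pt_poor)
print(f"slope = {result.slope:.3f}, r = {result.rvalue:.2f}, p = {result.pvalue:.4f}")
```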

View results

Regression coefficients

Fig. 5. Regression coefficients of Pt_Poor on Pop_Chng.

At the intersection of the Pop_Chng row and the Param. column, note the standardized coefficient, which for simple regression designs is also the Pearson correlation coefficient; it equals -.65, meaning that for each one-standard-deviation decrease in population there is a .65-standard-deviation increase in the poverty rate.

The unstandardized coefficient for the regression of Pt_Poor on Pop_Chng is -0.40374. This means that for every one-unit decrease in population, there is an increase in the poverty rate of .40374. The upper and lower (default) 95% confidence limits for this unstandardized coefficient do not include zero, so the regression coefficient is significant at the p < .05 level.

Variable distribution

Correlation coefficients can become significantly overestimated or underestimated if large outliers are present in the data. Let's study the distribution of the dependent variable Pt_Poor by district. To do this, let's build a histogram of the variable Pt_Poor.

Fig. 6. Histogram of the Pt_Poor variable.

As you can see, the distribution of this variable differs markedly from the normal distribution. However, although even two counties (the two right columns) have a higher percentage of families that are below the poverty line than expected under a normal distribution, they appear to be "within the range."

Fig. 7. Histogram of the Pt_Poor variable.

This judgment is somewhat subjective. The rule of thumb is that outliers should be considered if an observation (or observations) does not fall within the interval (mean ± 3 standard deviations). In this case, it is worth repeating the analysis with and without the outliers to make sure they do not have a major effect on the correlation between the variables.

Scatterplot

If there is an a priori hypothesis about the relationship between the given variables, it is useful to test it on the corresponding scatterplot.

The scatterplot shows a clear negative correlation (-.65) between the two variables. It also shows the 95% confidence interval for the regression line, i.e., there is a 95% probability that the regression line lies between the two dotted curves.

Significance criteria

Fig. 9. Table containing significance criteria.

The test for the Pop_Chng regression coefficient confirms that Pop_Chng is strongly related to Pt_Poor, p < .001.

Bottom line

This example showed how to analyze a simple regression design. Interpretations of unstandardized and standardized regression coefficients were also presented. The importance of studying the response distribution of a dependent variable is discussed, and a technique for determining the direction and strength of the relationship between a predictor and a dependent variable is demonstrated.