Lecture. Dependence between quantities. Types of dependencies



In the previous lecture we looked at the distribution of a single feature: we singled out one feature and studied it individually.

Quite often, however, the task of research is to study several features together and to examine the dependence between them.

Thus, in the simplest case we have two features, i.e. we have a set of pairs of values (x_i; y_i), where x_i is the value of feature X in the i-th experiment and y_i is the value of feature Y in the i-th experiment.

It is required to determine whether these features are dependent or not and, if they are, to recover the form of the dependence.

Before we talk about how to determine whether there is a dependence or not, we need to consider the types of dependence.

These types are also discussed in section 3.6 of this course, but as that is an elective section, we will look at them again.

The most general kind is statistical dependence, when the value of one variable affects the distribution law of the other. If there is no statistical dependence between the variables, we call these random variables independent: one does not affect the other.

The second kind of dependence is correlation dependence.

This is the dependence of the mean of one random variable on the value of another random variable.

This dependence is narrower, that is, less common than statistical dependence, but it is correlation dependence that statistics studies, and further in this lecture, when we say "dependent", we will mean correlation dependence.

A special case of correlation dependence, and the rarest kind of dependence, is functional dependence.

This is a dependence in which each value of one random variable corresponds to exactly one value of the other random variable.

It is very rare in practice; we will occasionally note that in one case or another a functional dependence can be observed.

I repeat: when in statistics we say "to study the dependence between random variables", we usually mean correlation dependence.

The dependence between variables is studied by two types of analysis: correlation analysis and regression analysis.

Correlation analysis computes correlation coefficients, estimates them and makes it possible to draw conclusions about the relationships.

Regression analysis examines the form of this dependence: it allows you to specify its shape, choose its parameters accordingly, and assess the quality of the resulting curve.

First, consider correlation analysis.

From the probability theory section we know that correlation dependence is measured by a correlation coefficient.

In fact, we have considered only one correlation coefficient; in statistics there are several correlation coefficients, depending on what kind of dependence we want to analyze.

If we are interested in a linear dependence, we consider the Pearson correlation coefficient; if we are interested in a nonlinear dependence or a multiple dependence, we consider other correlation coefficients.

Correlation dependences are classified by form (linear or curvilinear) and by direction (direct and inverse, or positive and negative).

In the case of a positive correlation, firstly, the correlation coefficient is positive and, secondly, the mean value of the second, dependent feature increases as the value of the factor feature increases, and vice versa; when the connection is negative, the value of the dependent (effective) feature decreases as the value of the factor feature increases, and vice versa.

Once again: factor features are those that underlie the dependence, on which the dependence is built, and effective features are those that we study; of each kind there may be one or more.

Within this lecture we consider only the binary dependence, when we have one factor feature and one effective feature.

The correlation dependence is also classified by degree, or tightness: depending on the absolute value of the correlation coefficient, the dependence can be strong, average, moderate, weak or very weak.

Weak and very weak are sometimes not distinguished.

The table is shown on the slide.

To assess the tightness of a linear dependence, the Pearson correlation coefficient is used.

The correlation coefficient between two random variables is shown on the slide; below it is its estimate. Here M[X], M[Y] are the mathematical expectations of the random variables, and x̄, ȳ (x and y with a bar) are their estimates, the sample means. Accordingly, D[X], D[Y] are the dispersions of the random variables, s_0x², s_0y² are their estimates, s_0x, s_0y are the estimates of the standard deviations, and x and y under a common bar denote the mean product. In the estimate, the expectation is replaced by the sample mean and the dispersion by the square of the standard deviation, so no roots remain.

The absolute value of the coefficient does not exceed one; that is, the expectation of the product of deviations is no greater than the square root of the product of the dispersions.
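Since the slide itself is not reproduced in this transcript, here is a reconstruction of the standard formulas being described (the notation follows the lecture):

```latex
r_{XY} = \frac{M\bigl[(X - M[X])\,(Y - M[Y])\bigr]}{\sqrt{D[X]\,D[Y]}},
\qquad
r^{*}_{xy} = \frac{\overline{xy} - \bar{x}\,\bar{y}}{s_{0x}\,s_{0y}}
```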

Further, if the absolute value of the correlation coefficient equals 1, there is a functional linear dependence: if it equals +1, we have a direct linear dependence; if it equals -1, an inverse linear dependence.

If the correlation coefficient equals 0, there is no linear correlation.

Here we have a subtle nuance: it is the true correlation coefficient that equals zero, while in practice we obtain only its estimate.

We understand from the previous lesson that the estimate and the true value are not equal.

A question arises: if r*_xy is not zero but close to zero, does that mean there is no dependence?

We leave this question to one of the following sessions, when we consider the hypothesis of the insignificance of the correlation coefficient and learn how to test it.

If we consider more than two features, a correlation table is usually analyzed.

With its help, multiple correlation coefficients can be calculated, and so on.

The table consists of the entries r_ij, where for i ≠ j the entry r_ij is the correlation coefficient between the i-th and j-th features, and r_ii is always equal to 1.
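As an illustration of such a table (this example is not from the lecture; the data are invented), here is a minimal Python sketch that estimates the pairwise correlation matrix of three features:

```python
import numpy as np

# Invented example data: each row is one feature observed in 10 experiments.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0])
x2 = 2.5 * x1 + np.array([0.3, -0.1, 0.4, 0.0, -0.2, 0.1, 0.3, -0.4, 0.2, 0.0])
x3 = np.array([5.0, 1.0, 4.0, 2.0, 5.0, 3.0, 1.0, 4.0, 2.0, 3.0])

# np.corrcoef returns the table r_ij: the Pearson correlation coefficient
# between the i-th and j-th features; the diagonal r_ii is always 1.
R = np.corrcoef(np.vstack([x1, x2, x3]))
print(np.round(R, 3))
```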

Let us now proceed to regression analysis.

Let us consider a pair of random variables X and Y. Then the expectation of Y, given that X has taken the value x, regarded as a function of x, is called the regression curve, or the regression equation.

Then we can represent the variable as Y_i = φ(x_i) + ε_i, where x_i is a non-random value of the first variable, Y_i is a random variable describing the set of values that can be obtained at that point, and certain properties are formulated for ε_i.

The first four properties are mandatory; property 5 is optional and is used to assess the regression equation.
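The properties themselves are listed on the slide and are not reproduced in the transcript; as an assumption about what the slide contains, the standard set usually stated at this point in regression analysis is:

```latex
\begin{aligned}
&1)\ M[\varepsilon_i] = 0; \\
&2)\ D[\varepsilon_i] = \sigma^2 \ \text{(the same for all } i\text{)}; \\
&3)\ \varepsilon_i,\ \varepsilon_j \ \text{are uncorrelated for } i \ne j; \\
&4)\ \varepsilon_i \ \text{does not depend on } x_i; \\
&5)\ \varepsilon_i \sim N(0, \sigma^2).
\end{aligned}
```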

The stages of regression analysis are: first, an assumption about the form of the dependence between the variables; then the parameters are estimated; and the last step is to check the reliability of the estimate, i.e. to evaluate the regression equation.

We need to check how correctly we have chosen the form of the relationship and how well the regression parameters were estimated.

To find the parameters of a regression equation of known form, we use the method of least squares.

Suppose we have pairs of values (x, y) and want to fit the regression curve shown on the slide, where a stands for one or more parameters. We then construct an error function: a function of the deviations of the values given by the equation from the true values of the variable y observed in the study, namely the sum of the squares of these deviations.

We already know that the plain sum of the deviations is simply equal to 0, since positive and negative deviations cancel out.

So the sum of the squared deviations should be minimal, which is why the method is called the method of least squares.

To find the minimum of this function of the parameters a_1, a_2, ..., a_k, we need to consider the partial derivatives with respect to each parameter.

Since the error function is quadratic in all the parameters, its stationary point is a minimum point, so there is no need to carry out the second step of verifying the minimum.
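In symbols (a reconstruction, since the slide is not reproduced here), the error function and the minimum condition are:

```latex
E(a_1, \dots, a_k) = \sum_{i=1}^{n} \bigl(y_i - \varphi(x_i;\, a_1, \dots, a_k)\bigr)^2 \to \min,
\qquad
\frac{\partial E}{\partial a_j} = 0, \quad j = 1, \dots, k
```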

Consider the most common and simplest case, where the conditional expectation of Y depends linearly on the value of x.

In this case we fit the dependence ŷ ("y with a hat") shown on the slide, where E(a, b) is the error function; we take the derivatives and obtain the following system. Complete these calculations yourselves.

Having solved this system, you can find parameters a and b.

In fact, in practice, if the calculations are done by hand, this way is more convenient: it minimizes the amount of computation.

If everything is computed automatically, you can instead express the values of the parameters a and b explicitly.

For this purpose, both sides of both equations are divided by n.

We get the following system, where x̄ and ȳ are the mean values of x and y, the bar over x² denotes the mean square, and the bar over xy denotes the mean product.

The solution of this system again gives the values of a and b.

Let us express b from the second equation and substitute it into the first one.

Continuing, we can express a and find that a equals the mean product minus the product of the means, divided by the mean square minus the square of the mean.

Now recall that the dispersion estimate is nothing but the mean of x² minus x̄ squared; that is, the denominator contains the dispersion estimate and the numerator contains the estimate of the correlation moment (covariance), so ŷ equals the expression written on the slide.
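Written out explicitly (again a reconstruction of the slide; K*_xy denotes the covariance estimate), the system after dividing by n and its solution are:

```latex
\begin{cases}
a\,\overline{x^2} + b\,\bar{x} = \overline{xy},\\[2pt]
a\,\bar{x} + b = \bar{y},
\end{cases}
\qquad
a = \frac{\overline{xy} - \bar{x}\,\bar{y}}{\overline{x^2} - \bar{x}^{\,2}}
  = \frac{K^{*}_{xy}}{s_{0x}^{2}},
\qquad
b = \bar{y} - a\,\bar{x},
\qquad
\hat{y} = a x + b
```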

This ratio is called the regression coefficient and characterizes the tightness of the dependence.

It is clear that, depending on the units of measurement used, the regression coefficients will be different.

That is why it is customary to analyze not the regression coefficient itself, but the correlation coefficient.

The slide does not show this, but the regression coefficient can be expressed through the dispersion estimates and the correlation coefficient.

Do it yourself.
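As a sketch for checking that exercise (the data here are invented, and the identity a = r*_xy · s_0y / s_0x is assumed to be the intended result), the following Python snippet estimates a and b by the formulas above and compares a with its expression through the correlation coefficient:

```python
import numpy as np

# Invented sample of (x, y) pairs, used only for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1])

# Least-squares estimates from the lecture's formulas:
# a = (mean(xy) - mean(x)*mean(y)) / (mean(x^2) - mean(x)^2), b = mean(y) - a*mean(x)
a = (np.mean(x * y) - x.mean() * y.mean()) / (np.mean(x ** 2) - x.mean() ** 2)
b = y.mean() - a * x.mean()

# The same regression coefficient through the correlation coefficient
# and the standard deviation estimates (biased, ddof=0, as in the formulas above).
r = np.corrcoef(x, y)[0, 1]
a_via_r = r * y.std() / x.std()

print(a, b)        # least-squares estimates
print(a_via_r)     # coincides with a up to rounding
```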

Sometimes the regression is not linear after all: if we plot the available points on the plane, we can see that they do not lie along a line.

In this case it can be very difficult to judge the type of dependence by eye.

For example, you cannot always tell a hyperbola from a parabola if it is rotated in some tricky way.

Therefore, various nonlinear regressions are considered.

For example, a regression that is linear in the estimated parameters: in that case you can simply change variables to reduce the equation to a linear one and assess which form is more suitable.

A regression that is nonlinear in the estimated parameters may be internally linear or internally nonlinear.

The question is: does there exist a change of variables such that the equation is reduced to a linear one?

Most often we try to look for dependences that can be reduced to linear ones and analyze them in that form; they can be reduced either to a binary (pairwise) linear regression or to a multiple linear one.

For example, polynomial dependences are reduced to multiple linear ones by a change of variables in which each power of x gets its own variable.
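As an illustration of this change of variables (invented data, not from the lecture), here is a Python sketch that fits a quadratic dependence as a multiple linear regression, with one column per power of x:

```python
import numpy as np

# Invented data that roughly follow y = 1 + 2x + 0.5x^2.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.1, 3.4, 7.2, 11.4, 17.1, 23.6])

# Change of variables: z1 = x, z2 = x^2 turns the polynomial regression
# into a multiple linear one; the design matrix has columns [1, z1, z2].
Z = np.column_stack([np.ones_like(x), x, x ** 2])
coeffs, *_ = np.linalg.lstsq(Z, y, rcond=None)
print(coeffs)  # least-squares estimates of a0, a1, a2
```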

Next comes the analysis itself.

We will not consider this analysis in this course.

It is given in the reference materials; for more detail, refer to the literature.

Thank you for your attention.

