Lecture. The basic concepts of mathematical statistics
We now turn to the second part of our module, the section on mathematical statistics.
This section will consist of two parts: the main part (basic mathematical statistics) and a variable component, which will be devoted to testing statistical hypotheses.
Today, let us consider the basic concepts of mathematical statistics.
Mathematical statistics as a science studies the patterns of mass statistical phenomena, and the object of statistical research is a statistical population.
A statistical population (or statistical set) is a set of objects that are homogeneous with respect to some characteristic.
Each object is called a unit of the statistical population, and an object can have several attributes.
Some attributes may be common to all the objects, and some will be individual.
The goal of mathematical statistics is to analyze all these attributes, or those that are most interesting to the researcher, for underlying regularities, variation and so on.
The following are examples of statistical sets: the set of residents of the Russian Federation, the set of cruciferous plants, the set of components made on some machine, or a set of books.
The variety of these sets shows that mathematical statistics can be used in many branches of science.
Thus, it is a very widely applied science.
When we examine a statistical set, we pass through three stages.
The first stage is a statistical survey, which consists in collecting data and recording them in some way; that is, the attributes of some set of population units are recorded.
The second stage is summarizing and grouping the results of the survey.
It consists of organizing the data in tables, grouping by selected characteristics and calculating primary results,
for example, the mean value.
The last stage, with which mathematical statistics is primarily concerned, is the analysis of the resulting values.
We’ll briefly review the first two stages, and most of the time will be given to the third stage.
The first stage relies on the sampling method.
Why is it called so?
Recall that statistical observation is the study of the units of a statistical population.
It can be of two types. The first is continuous (complete) observation, when we study all the units in the population.
For example, we want to find the average height of the students in a second-year Applied Mathematics group of, say, 25 people.
Then we can measure the height of all 25 people and calculate the average.
In this case, we carry out continuous observation.
If we want to find the average height of all the residents of the Russian Federation, the task is much more difficult, as there are too many units in the set.
Studying each unit takes some time and some money, so if there are a lot of units, a lot of time and money will be spent.
Another point that can interfere with continuous observation is the high cost of studying one unit.
For example, a study of cars.
Each car is expensive, so studying it requires money,
and the fewer cars we need, the better.
The third point is that after crash tests the cars can no longer be used.
Therefore, if we make a continuous study of cars, we cannot use them afterwards.
For these reasons sample (non-continuous) studies are conducted, in which we investigate not the whole population but only part of it.
The population under study is called the general population, and the part that will be explored (i.e., the portion for which the characteristic values will be recorded) is called the sample.
The number of units in a population is called its volume (size).
The volume of the general population is denoted N, and the volume of the sample is usually denoted n.
The way of selecting the sample is also important.
A sample is considered good (or representative) when selection is completely random, that is, every unit has an equal chance of getting into the sample.
Representativeness means that the sample adequately represents the general population.
Random selection can be made in two ways.
We can take an object, examine its features, and return it to the population.
Then we select an object again from the entire general population.
This means that the same object can get into the sample again.
Such selection is called repeated (sampling with replacement).
If we study the object and set it aside, and then choose from the general population without this object, the selection is called non-repeated (sampling without replacement).
Repeated sampling is easier to analyze, but in actual research non-repeated sampling is carried out more often.
When the volumes are large (more precisely, when the general population is much larger than the sample), the boundary between repeated and non-repeated samples is almost erased.
Therefore, mathematical statistics usually examines large samples.
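As a small illustration, the two ways of random selection can be sketched in Python. This is only a sketch: the population of 25 numbered units, the sample size of 10, the fixed seed and the function names are assumptions made for the example.

```python
import random


def repeated_sample(population, n, rng):
    """Repeated selection: every draw is made from the whole population."""
    return [rng.choice(population) for _ in range(n)]


def non_repeated_sample(population, n, rng):
    """Non-repeated selection: a unit that was drawn cannot be drawn again."""
    return rng.sample(population, n)


rng = random.Random(0)            # fixed seed so the run is reproducible
population = list(range(1, 26))   # hypothetical general population, N = 25

with_return = repeated_sample(population, 10, rng)
without_return = non_repeated_sample(population, 10, rng)

print("repeated:    ", with_return)
print("non-repeated:", without_return)
```

In the repeated sample the same unit may appear several times; in the non-repeated sample all units are distinct by construction.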
Random sampling is not always convenient or feasible.
There is also mechanical sampling, or simple sampling by a regular procedure, where population units are ordered and selected at regular intervals.
The disadvantage of such sampling is that an unfortunate ordering or a poorly chosen selection step can lead to a non-representative sample.
It is quite easy to give an example.
Suppose we have a multi-storey apartment building and study one entrance, recording, for example, the number of children in each apartment.
Suppose also that it is a standard house with four apartments on each floor: a one-room, a two-room, a three-room and a four-room apartment.
If we take a step of four apartments, that is, take every fourth apartment, we will always get apartments of the same type, for example only one-room or only four-room apartments.
If we always select a one-room apartment, the results will be understated.
If we always select a four-room apartment, the results will be too high.
In either case, the sample is not representative.
That is, simple regular selection is convenient, but one should always think about the representativeness of the sample.
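The apartment example can be checked with a few lines of Python. The sketch assumes a hypothetical entrance with 9 floors and the same order of apartments (1, 2, 3 and 4 rooms) on every floor; the function name and the alternative step of 5 are also assumptions made for illustration.

```python
# Hypothetical entrance: 9 floors, 4 apartments per floor,
# with 1, 2, 3 and 4 rooms in that order on every floor.
FLOORS = 9
rooms_by_apartment = [rooms for _ in range(FLOORS) for rooms in (1, 2, 3, 4)]


def mechanical_sample(units, start, step):
    """Take every `step`-th unit of an ordered list, beginning at index `start`."""
    return units[start::step]


# A step of 4 coincides with the period of the ordering.
step_4_from_first = mechanical_sample(rooms_by_apartment, start=0, step=4)
step_4_from_last = mechanical_sample(rooms_by_apartment, start=3, step=4)

# A step of 5 does not coincide with the period.
step_5 = mechanical_sample(rooms_by_apartment, start=0, step=5)

print(step_4_from_first)  # only one-room apartments
print(step_4_from_last)   # only four-room apartments
print(step_5)             # all four apartment types appear
```

With a step of four the sample contains a single apartment type, whichever apartment we start from; a step that does not match the period of the ordering avoids this particular problem.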
The third way is stratified sampling.
This selection is applied when the general population is initially divided into large typical groups.
For example, we study residents of the Kirov region. The Kirov region is divided into districts, so it is convenient to carry out stratified sampling.
This sampling is also called typical or regionalized.
In this case, we select units from each district, that is, from each stratum.
The number of units selected from a district is proportional to the size of the district.
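Proportional selection from strata can be sketched as follows. The district names and sizes are hypothetical (not real data), and the largest-remainder rounding rule is an assumption added so that the parts add up exactly to the required sample volume; the lecture itself only states proportionality.

```python
def proportional_allocation(stratum_sizes, n):
    """Split a sample of volume n across strata proportionally to their sizes.

    Rounding uses the largest-remainder rule (an assumption of this sketch),
    so the allocated parts always add up to exactly n.
    """
    total = sum(stratum_sizes.values())
    exact = {name: n * size / total for name, size in stratum_sizes.items()}
    allocation = {name: int(share) for name, share in exact.items()}
    leftover = n - sum(allocation.values())
    # Give the remaining units to the strata with the largest fractional parts.
    by_remainder = sorted(
        exact, key=lambda name: exact[name] - allocation[name], reverse=True
    )
    for name in by_remainder[:leftover]:
        allocation[name] += 1
    return allocation


# Hypothetical district sizes.
district_sizes = {"District A": 50_000, "District B": 30_000, "District C": 20_000}
allocation = proportional_allocation(district_sizes, n=200)
print(allocation)
```

Within each district the allocated number of units would then be chosen by random selection.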
The final way of sampling is serial selection.
Serial selection is carried out when the general population is divided into many similar small groups.
In this case, we conduct a random sampling of series (small groups), and within each selected group a continuous study is carried out.
For example, you have a truck carrying boxes of canned meat.
You pick a few boxes at random and carry out a continuous study of the cans of stewed meat inside each of them.
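The truck example can be sketched in Python. The number of boxes, the number of cans per box, the labels and the fixed seed are all assumptions made for illustration.

```python
import random


def serial_sample(series, k, rng):
    """Pick k whole series at random and return every unit inside them."""
    chosen = rng.sample(sorted(series), k)   # random choice of series names
    units = [unit for name in chosen for unit in series[name]]
    return chosen, units


# Hypothetical truck: 6 boxes with 4 cans each; a can is labelled (box, index).
boxes = {f"box{b}": [(f"box{b}", i) for i in range(4)] for b in range(1, 7)}

rng = random.Random(1)                       # fixed seed for reproducibility
chosen_boxes, cans = serial_sample(boxes, k=2, rng=rng)

print(chosen_boxes)
print(len(cans), "cans are studied")
```

Randomness applies only to the choice of boxes; inside a chosen box every can is examined.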
The data obtained as a result of a statistical survey are grouped into statistical series.
Before grouping, you need to select some feature to group by.
What kinds of features are there?
Firstly, there are attributive (or qualitative) features, which are not expressed in quantitative units.
Secondly, there are quantitative features, that is, features expressed by numbers.
In turn, attributive features are divided into ordinal features, which cannot be expressed in explicit numbers but can be sorted, that is, ranked,
and nominal features, which cannot be sorted.
Examples of these features are shown on the slide.
Quantitative features, in turn, are divided into discrete and continuous.
This is more or less clear.
When the set of possible values is finite or countable, we obtain a discrete feature.
When the possible values fill a whole interval, we get a continuous feature.
Statistical series, depending on the choice of the grouping feature, are also divided into attributive and variation series.
An attributive series is grouped by an attributive feature, a variation series by a quantitative one.
Mathematical statistics primarily examines variation series.
A variation series, like an attributive one, consists of two rows.
The first row presents the variants, that is, particular values of the feature, and the second row shows their frequencies, that is, the number of times each variant occurred in the sample.
Quite often, instead of frequencies we consider relative frequencies (shares).
A relative frequency is a frequency expressed as a fraction or percentage; that is, the second row is normalized so that the sum of all the values in it is equal to one.
We see that a statistical series is an analogue of the distribution series of a random variable.
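To make the two rows concrete, here is a minimal Python sketch that builds a discrete variation series. The survey data (number of children in 10 families) are hypothetical, and the function name is an assumption of the example.

```python
from collections import Counter


def variation_series(observations):
    """Return (variants, frequencies, relative frequencies), variants ascending."""
    counts = Counter(observations)
    variants = sorted(counts)
    frequencies = [counts[v] for v in variants]
    n = len(observations)
    relative = [f / n for f in frequencies]
    return variants, frequencies, relative


# Hypothetical survey: number of children in 10 families.
children = [1, 2, 1, 3, 2, 1, 1, 2, 3, 1]
variants, frequencies, relative = variation_series(children)

print("variants:            ", variants)
print("frequencies:         ", frequencies)
print("relative frequencies:", relative)
```

The frequencies add up to the sample volume n, and the relative frequencies add up to one.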
Examples are given on the slide.
An example of a discrete series: the number of children in a family, which can be 1, 2 or 3.
In an interval series the variants are given as intervals.
An interval series is used not only when the feature is continuous, but also when the variation of the feature is large: there are a lot of distinct values and it is inconvenient to group them into a discrete series.
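Grouping into an interval series can be sketched like this. The heights, the starting point, the interval width and the choice of half-open intervals [a, b) are assumptions made for the example, not the data from the slide.

```python
def interval_series(values, left, width, k):
    """Count values in k equal half-open intervals [a, a + width) starting at `left`.

    Values that fall outside all k intervals are ignored in this sketch.
    """
    frequencies = [0] * k
    for v in values:
        index = int((v - left) // width)
        if 0 <= index < k:
            frequencies[index] += 1
    intervals = [(left + i * width, left + (i + 1) * width) for i in range(k)]
    return intervals, frequencies


# Hypothetical heights in cm.
heights = [158, 162, 165, 167, 170, 171, 173, 175, 178, 181, 184, 189]
intervals, frequencies = interval_series(heights, left=150, width=10, k=4)

for (a, b), f in zip(intervals, frequencies):
    print(f"[{a}, {b}): {f}")
```

Twelve distinct values are reduced to four intervals, which is much easier to read than a discrete series with twelve variants of frequency one.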
It is convenient to represent statistical series graphically.
There are two main ways: the polygon and the histogram.
A polygon is drawn as follows: variants are plotted on the abscissa, and the ordinate axis shows the frequencies or relative frequencies.
Each point corresponds to a pair (variant, frequency).
These points are connected by segments, the end points are connected with the abscissa, and we get a polygon.
The histogram is constructed differently.
There are intervals, and these intervals are laid out on the abscissa.
A discrete series can, in principle, also be represented by a histogram.
The height of a column is a frequency or a relative frequency.
If we consider the histogram of relative frequencies, the sum of the heights of all the columns is equal to one. If we additionally divide the height of each column by the width of its interval, the sum of the areas of all the columns becomes equal to one, and the histogram then estimates the probability density.
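The normalization by interval width can be checked numerically. In this sketch the frequencies and the (deliberately unequal) interval widths are hypothetical, and the function name is an assumption of the example.

```python
def density_heights(frequencies, widths):
    """Column heights of a density histogram: relative frequency / interval width."""
    n = sum(frequencies)
    return [f / n / w for f, w in zip(frequencies, widths)]


# Hypothetical interval series with unequal interval widths.
frequencies = [2, 5, 3]
widths = [5.0, 10.0, 20.0]

heights = density_heights(frequencies, widths)
total_area = sum(h * w for h, w in zip(heights, widths))

print("heights:   ", heights)
print("total area:", total_area)
```

The total area of the columns is one (up to rounding), which is what allows the histogram to be read as an estimate of the density.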
Also, we often want to understand how the feature is distributed.
To do this, we can estimate the distribution function.
To estimate the distribution function we consider the empirical distribution function, which is calculated by the formula shown on the slide, where the sum is taken over indicators, one for each observation.
Consider an example.
Suppose we study the height of students in a group; the results of the twenty-five measurements are arranged in the table.
It is required to construct an empirical distribution function.
To begin, we order these values, then group them and draw a graph where the x-axis shows the values and the y-axis shows the accumulated frequencies.
For the first value the accumulated frequency is one, for the second it is two, and so on.
Then each accumulated frequency is divided by the total number of values.
If some value occurs more than once, then, accordingly, there will be a bigger jump at that value.
Please note that the graph of the distribution function rises quite steeply.
In Excel this is done with the help of the date axis.
In the chart options the x-axis is set to "date axis", and the result is the empirical distribution function.
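The same construction can be sketched outside Excel, in Python. The ten heights below are hypothetical (the 25 measurements from the slide are not reproduced here), and the sketch assumes the convention F(x) = (number of observations not exceeding x) / n; some textbooks use a strict inequality instead, which changes the value only at the jump points.

```python
from bisect import bisect_right


def empirical_cdf(sample):
    """Return a function F with F(x) = (number of observations <= x) / n."""
    ordered = sorted(sample)
    n = len(ordered)

    def F(x):
        # bisect_right counts how many ordered values are <= x.
        return bisect_right(ordered, x) / n

    return F


# Hypothetical heights in cm.
heights = [165, 170, 170, 172, 175, 175, 175, 180, 183, 190]
F = empirical_cdf(heights)

for x in (164, 165, 170, 175, 190):
    print(x, F(x))
```

The value 175 occurs three times in this sample, so the function jumps by 3/10 at that point, while at a value that occurs once it jumps by 1/10.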