
Normality check

There are four methods to check whether or not the collected data satisfy the normality requirement: inspecting a distribution plot, comparing the mean and the median, examining the skewness and kurtosis, and drawing a Q–Q plot.

1) Distribution plot

A distribution plot of the collected data is useful for checking the normality of the data. The distribution of the data should be checked to ensure that it does not deviate too much from the normal distribution.
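For readers who want to reproduce this check outside SPSS, a minimal sketch in Python with Matplotlib and SciPy is given below; the array name `data` and the values in it are illustrative assumptions, not part of the example data set used later.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# illustrative sample; replace with the collected data
data = np.array([3.8, 2.8, 9.5, 8.0, 7.4, 8.1, 7.7, 6.1, 7.0, 6.2])

# histogram scaled to a density so it can be compared with the normal curve
plt.hist(data, bins='auto', density=True, alpha=0.6, label='sample')

# normal curve using the sample mean and standard deviation
x = np.linspace(data.min(), data.max(), 200)
plt.plot(x, stats.norm.pdf(x, loc=data.mean(), scale=data.std(ddof=1)),
         label='normal curve')
plt.legend()
plt.show()
```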

2) Difference value between mean and median

The mean is a simple arithmetic average of the given set of values or quantities. The median is a positional average and is defined as the middle number in an ordered list of values. In a normal distribution, the graph appears as a classical, symmetrical “bell-shaped curve,” and the mean, the median, and the mode (the maximum point on the curve) are all equal. Hence, the difference between the mean and the median is close to zero in a normal distribution. However, when the difference between the mean and the median is large, the distribution is skewed to the right or to the left.

3) Skewness and kurtosis

Skewness is a measure of the “asymmetry” of the probability distribution, in which the curve appears distorted or skewed either to the left or to the right. In a perfect normal distribution, the tails on either side of the curve are exact mirror images of each other. When a distribution is skewed to the left, the tail on the curve's left-hand side is longer than that on the right-hand side, and the mean is less than the mode. This situation is also referred to as negative skewness. When a distribution is skewed to the right, the tail on the curve's right-hand side is longer than the tail on the left-hand side, and the mean is greater than the mode. This situation is also referred to as positive skewness.

Kurtosis is a measure of the “tailedness” of the probability distribution, i.e., of how quickly the tails approach zero relative to the normal distribution. Distributions with zero excess kurtosis are called mesokurtic or mesokurtotic. The most prominent example of a mesokurtic distribution is the normal distribution. A distribution with a positive excess kurtosis is called leptokurtic or leptokurtotic. In terms of shape, a leptokurtic distribution has fatter tails. Examples of leptokurtic distributions include the Student's t-distribution, the exponential distribution, the Poisson distribution, and the logistic distribution. A distribution with a negative excess kurtosis is called platykurtic or platykurtotic. Examples of platykurtic distributions include the continuous and discrete uniform distributions and the raised cosine distribution. The most platykurtic distribution of all is the Bernoulli distribution with p = 1/2.
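As a rough sketch of how these statistics can be computed outside SPSS, the SciPy functions below return the skewness and the excess kurtosis (0 for a normal distribution); the sample values are illustrative, and the estimates may differ slightly from SPSS, which applies a bias correction.

```python
import numpy as np
from scipy import stats

# illustrative sample; replace with the collected data
data = np.array([3.8, 2.8, 9.5, 8.0, 7.4, 8.1, 7.7, 6.1, 7.0, 6.2])

print("mean    :", np.mean(data))
print("median  :", np.median(data))
print("skewness:", stats.skew(data))
# kurtosis() returns excess kurtosis by default (normal distribution = 0)
print("excess kurtosis:", stats.kurtosis(data))
```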

4) Q–Q plot

A Q–Q plot is a plot of the quantiles of two distributions against each other, or a plot based on the estimates of the quantiles. The pattern of points in the plot is used to compare the two distributions. The main step in constructing a Q–Q plot is calculating or estimating the quantiles to be plotted. If one or both of the axes in a Q–Q plot is based on a theoretical distribution with a continuous cumulative distribution function (CDF), all quantiles are uniquely defined and can be obtained by inverting the CDF. If a theoretical probability distribution with a discontinuous CDF is one of the two compared distributions, some quantiles may not be defined, so an interpolated quantile may be plotted. If the Q–Q plot is based on the data, there are multiple quantile estimators in use. The rules for forming Q–Q plots when quantiles must be estimated or interpolated are called plotting positions.

A simple case is when there are two data sets of the same size. In that case, to make the Q–Q plot, each set is ordered in the increasing order, then paired off, and the corresponding values are plotted. A more complicated construction is the case where two data sets of different sizes are being compared. To construct the Q–Q plot in this case, it is necessary to use an interpolated quantile estimate so that quantiles corresponding to the same underlying probability can be constructed.

The points plotted in a Q–Q plot are always non-decreasing when viewed from left to right. If the two compared distributions are identical, the Q–Q plot follows the 45° line y = x. If the two distributions agree after linearly transforming the values in one of the distributions, then the Q–Q plot follows some line, but not necessarily the line y = x. If the general trend of the Q–Q plot is flatter than the line y = x, the distribution plotted on the horizontal axis is more dispersed than the distribution plotted on the vertical axis. Conversely, if the general trend of the Q–Q plot is steeper than the line y = x, the distribution plotted on the vertical axis is more dispersed than the distribution plotted on the horizontal axis. Q–Q plots are frequently arced, or “S” shaped, indicating that one of the distributions is more skewed than the other, or that one of the distributions has heavier tails than the other. Although a Q–Q plot is based on quantiles, in a standard Q–Q plot it cannot be determined which point corresponds to a given quantile. For example, it is not possible to determine the median of either of the two compared distributions by inspecting the Q–Q plot. Some Q–Q plots indicate the deciles to enable determinations of this type.

Q–Q plots are commonly used to compare the distribution of a sample to a theoretical distribution, such as the standard normal distribution N(0,1), as in a normal probability plot. As in the case of comparing two data samples, one orders the data (formally, computes the order statistics) and then plots them against certain quantiles of the theoretical distribution.
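A minimal sketch of a normal Q–Q plot in Python is shown below; `scipy.stats.probplot` orders the sample, plots it against the quantiles of the theoretical normal distribution, and adds a least-squares reference line. The sample values are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# illustrative sample; replace with the collected data
data = np.array([3.8, 2.8, 9.5, 8.0, 7.4, 8.1, 7.7, 6.1, 7.0, 6.2])

# Q-Q plot of the ordered sample against normal quantiles
stats.probplot(data, dist="norm", plot=plt)
plt.show()
```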

Normality test

In the previous section, we described the methods for checking normality. However, these methods do not allow us to draw conclusions about whether or not the collected data satisfy the normality requirement; only a rough guess can be made in this respect. Therefore, to obtain a definite answer, we have to consider a statistical test for normality. There are several methods to perform a normality test. The Kolmogorov-Smirnov test, the Shapiro-Wilk test, and the Anderson-Darling test are among the most popular methods. Specifically, the Kolmogorov-Smirnov test and the Shapiro-Wilk test are supported by IBM SPSS. All these tests follow the same procedure: 1) hypothesis set-up; 2) significance level determination; 3) test statistic calculation; 4) p-value calculation; 5) conclusion.

1) Hypothesis set-up

In general, all statistical tests have a statistical hypothesis. A statistical hypothesis is an assumption about a population parameter. This assumption may or may not be true. A researcher might conduct a statistical experiment to test the validity of this hypothesis. The hypotheses typically include the null hypothesis and the alternative hypothesis. In a normality test, the assumption is that the population from which the data were sampled follows a normal distribution. Hence, the null hypothesis (H0) and the alternative hypothesis (Ha) are as follows:

H0: The data are normally distributed.
Ha: The data are not normally distributed.

2) Significance level determination

The significance level α is the probability of making the wrong decision when the null hypothesis is true. Alpha levels (sometimes called simply “significance levels”) are used in hypothesis tests. An alpha level is the probability of a type I error, i.e., of rejecting the null hypothesis when it is true. Usually, these tests are run with an alpha level of 0.05 (5%); other commonly used levels are 0.01 and 0.10.

3) Test statistic calculation

Next, the test statistic for the normality test should be calculated. The calculation of the test statistic differs according to which of the normality test methods is used. The formulas for calculating the test statistic according to each statistical method are as follows.

(1) Shapiro-Wilk test statistic

The Shapiro-Wilk test tests the null hypothesis that a sample $x_1, \cdots, x_n$ comes from a normally distributed population. The test statistic is as follows (see Eq. (1)):

(1) $W = \frac{\left(\sum_{i=1}^{n} a_i x_{(i)}\right)^2}{\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2}$

where $x_{(i)}$ (with parentheses enclosing the subscript index $i$; not to be confused with $x_i$) is the $i$-th order statistic, i.e., the $i$-th smallest number in the sample, and the sample mean is given by Eq. (2):

(2) $\bar{x} = \frac{x_1 + \cdots + x_n}{n}$

and the constants $a_i$ are given by Eq. (3):

(3) $(a_1, \cdots, a_n) = \frac{m^{T} V^{-1}}{\left(m^{T} V^{-1} V^{-1} m\right)^{1/2}}$

where $m = (m_1, \cdots, m_n)^{T}$ and $m_1, \cdots, m_n$ are the expected values of the order statistics of independent and identically distributed random variables sampled from the standard normal distribution, and $V$ is the covariance matrix of those order statistics.
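Outside SPSS, the Shapiro-Wilk statistic and its p-value can be obtained, for example, with SciPy, which computes W from Eq. (1) using an approximation to the constants $a_i$ rather than the exact covariance matrix $V$; the sample values below are illustrative.

```python
import numpy as np
from scipy import stats

# illustrative sample; replace with the collected data
data = np.array([3.8, 2.8, 9.5, 8.0, 7.4, 8.1, 7.7, 6.1, 7.0, 6.2])

# W statistic of Eq. (1) and the corresponding p-value
w_stat, p_value = stats.shapiro(data)
print(w_stat, p_value)
```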

(2) Kolmogorov–Smirnov test statistic

The Kolmogorov–Smirnov statistic for a given cumulative distribution function $F(x)$ is computed using Eq. (4).

(4) $D_n = \sup_x \left| F_n(x) - F(x) \right|$

where $\sup_x$ is the supremum of the set of distances and $F_n$ is the empirical distribution function for $n$ i.i.d. (independent and identically distributed) ordered observations $X_i$, defined as shown in Eq. (5):

(5) $F_n(x) = \frac{1}{n} \sum_{i=1}^{n} I_{[-\infty, x]}(X_i)$

where $I_{[-\infty, x]}$ is the indicator function, equal to 1 if $X_i \le x$ and to 0 otherwise. By the Glivenko–Cantelli theorem, if the sample comes from distribution $F(x)$, then $D_n$ converges to 0 almost surely in the limit when $n$ goes to infinity. Kolmogorov strengthened this result by effectively providing the rate of this convergence.
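A sketch of the Kolmogorov–Smirnov statistic of Eq. (4) in SciPy is given below. Note that plugging the sample mean and standard deviation into the reference normal CDF, as done here for illustration, strictly calls for the Lilliefors-corrected p-value (which SPSS applies); the sample values are illustrative.

```python
import numpy as np
from scipy import stats

# illustrative sample; replace with the collected data
data = np.array([3.8, 2.8, 9.5, 8.0, 7.4, 8.1, 7.7, 6.1, 7.0, 6.2])

# D_n of Eq. (4) against a normal CDF with parameters taken from the sample
d_stat, p_value = stats.kstest(data, 'norm',
                               args=(data.mean(), data.std(ddof=1)))
print(d_stat, p_value)
```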

(3) Anderson-Darling test statistic

The Anderson–Darling test assesses whether a sample comes from a specified distribution. It makes use of the fact that, when given a hypothesized underlying distribution and assuming the data do arise from this distribution, the values of the CDF evaluated at the data can be assumed to follow a uniform distribution. The data can then be tested for uniformity with a distance test (Shapiro 1980). The formula for the test statistic $A^2$ to assess if the ordered data $\{Y_1 < \cdots < Y_n\}$ come from a CDF $\Phi$ is shown in Eq. (6).

(6) $A^2 = -n - S$, where $S = \sum_{i=1}^{n} \frac{2i-1}{n} \left[\ln\left(\Phi(Y_i)\right) + \ln\left(1 - \Phi(Y_{n+1-i})\right)\right]$

The test statistic can then be compared against the critical values of the theoretical distribution. Note that, in this case, no parameters are estimated in relation to the distribution function, Φ.
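For illustration, SciPy's Anderson–Darling routine returns the $A^2$ statistic of Eq. (6) together with critical values at several significance levels (it estimates the normal parameters from the data, so the critical values are adjusted accordingly); the sample values below are illustrative.

```python
import numpy as np
from scipy import stats

# illustrative sample; replace with the collected data
data = np.array([3.8, 2.8, 9.5, 8.0, 7.4, 8.1, 7.7, 6.1, 7.0, 6.2])

# A^2 statistic and the critical values to compare it against
result = stats.anderson(data, dist='norm')
print(result.statistic)
print(result.critical_values, result.significance_level)
```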

4) p-value calculation

Next, the significance value (p-value) should be calculated using the test statistic of the normality test calculated in step 3). The significance value is the probability of observing a test statistic equal to or more extreme than the statistic observed in the sample, assuming that the null hypothesis is true. A small p-value therefore means that the observed data would be unlikely if the null hypothesis were true; in this sense, the p-value reflects the degree of support the data lend to the null hypothesis. Since it is a probability value, it is calculated as a value between zero and one.

5) Conclusions

Finally, in order to draw the conclusions of the normality test, we compare the significance level set in step 2) with the significance value (p-value) calculated in step 4) and conclude as follows.

If α ≥ p-value, then the null hypothesis is rejected.
If α < p-value, then the null hypothesis is not rejected.

If the null hypothesis is rejected because the significance value is smaller than the significance level, the hypothesis that the data sample satisfies the normality requirement is rejected, and it can be said that the data do not satisfy the normality requirement. If we set the significance level, i.e., the probability of falsely rejecting the null hypothesis, to 5%, we can conclude that the data sample does not satisfy normality at the 5% significance level. Conversely, if the significance value is greater than the significance level and the null hypothesis is not rejected, the conclusion can be drawn that “the data of the sample satisfy the normality requirement at the 5% significance level.”

Example for normality check and normality test

In this section, we illustrate the process of checking normality and testing normality using the IBM SPSS software, version 21.0 (IBM Co., Armonk, NY, USA), with uric acid (mg/dL) data (Table 1). First, we draw the histogram of the distribution with the normal distribution curve (Figure 1). The distribution plot does not deviate much from the normal distribution curve, so it can be assumed that the data satisfy the normality requirement. Second, the mean and the median are computed (6.11 and 6.00, respectively). The two values are not largely different, so it can be guessed that the data sample satisfies the normality requirement. Furthermore, the skewness and kurtosis are 0.09 and 0.68, respectively. Since both values are close to 0, the shape of the distribution can be regarded as a mesokurtic distribution without a shift to the left or right. Finally, we draw a Q–Q plot (Figure 2). In the Q–Q plot, the dots do not deviate much from the line, so it can be guessed that the data satisfy the normality requirement.

Table 1. Example data set.

No.  Uric acid   No.  Uric acid   No.  Uric acid   No.  Uric acid   No.  Uric acid
1    3.8          6   8.1         11   5.0         16   5.8         21   6.8
2    2.8          7   7.7         12   6.2         17   5.6         22   4.8
3    9.5          8   6.1         13   5.9         18   5.4         23   5.6
4    8.0          9   7.0         14   6.0         19   5.3         24   4.9
5    7.4         10   6.2         15   6.5         20   5.0         25   7.3

Figure 1. Histogram with normal distribution curve.
Figure 2. Q–Q plot for example data set.

Next, we test whether the uric acid (mg/dL) data for 25 patients satisfy the normality requirement using the Shapiro-Wilk test method and the Kolmogorov-Smirnov test method. First, we set up the hypotheses. The null hypothesis (H0) is that the uric acid data are normally distributed, and the alternative hypothesis (Ha) is that the uric acid data are not normally distributed. Second, we set the significance level to 0.05. Third, the test statistic is calculated. The test statistic according to the Shapiro-Wilk test method is 0.984, while the test statistic according to the Kolmogorov-Smirnov test method is 0.115. Fourth, we calculate the p-value. The p-value according to the Shapiro-Wilk test method is 0.949, and the p-value according to the Kolmogorov-Smirnov test method is 0.200. Finally, we interpret the results and draw conclusions. Since the p-values according to the two normality test methods are greater than the significance level of 0.05, the null hypothesis (the uric acid data are normally distributed) is not rejected. Therefore, the uric acid data for 25 patients are considered to satisfy the normality requirement at the 5% significance level.
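The SPSS results above can also be approximated outside SPSS; the SciPy sketch below applies both tests to the uric acid data of Table 1. The Shapiro-Wilk result should be close to the SPSS output, whereas the plain Kolmogorov-Smirnov p-value may differ because SPSS applies the Lilliefors correction for parameters estimated from the sample.

```python
import numpy as np
from scipy import stats

# uric acid data (mg/dL) from Table 1
uric_acid = np.array([3.8, 2.8, 9.5, 8.0, 7.4, 8.1, 7.7, 6.1, 7.0, 6.2,
                      5.0, 6.2, 5.9, 6.0, 6.5, 5.8, 5.6, 5.4, 5.3, 5.0,
                      6.8, 4.8, 5.6, 4.9, 7.3])

# Shapiro-Wilk test (SPSS reported W = 0.984, p = 0.949)
print(stats.shapiro(uric_acid))

# Kolmogorov-Smirnov test against N(mean, sd); SPSS reported D = 0.115 with
# a Lilliefors-corrected p-value of 0.200
print(stats.kstest(uric_acid, 'norm',
                   args=(uric_acid.mean(), uric_acid.std(ddof=1))))
```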

Statistical analysis methods with or without normality

In selecting and using statistical analysis methods, there is a need to fully understand the statistical analysis methods being used. When establishing a hypothesis in a clinical study and analyzing collected data to test it, the most appropriate statistical analysis method should be selected and used to solve the given problem. The statistical analysis method is determined according to factors such as the number of dependent variables, the kind of dependent variable, the number of independent variables, and the kind of independent variable. In addition, each statistical analysis method is based on various assumptions, such as normality, linearity, independence, and so on. Therefore, before using a statistical analysis method, it should first be checked whether the data satisfy the assumptions of the statistical analysis method to be used; only then can the selected statistical analysis method be applied. That is, if an assumption is not satisfied, the statistical analysis method cannot be used. For example, when trying to compare the quantitative variables of two independent groups, the independent two-group t-test that is commonly used assumes normality. Therefore, an independent two-group t-test can be used only if the normality requirement is satisfied. If the normality is not satisfied, the Mann-Whitney U-test, a statistical method other than the independent two-group t-test, should be used.

In this section, we introduce some statistical analysis methods that are widely used in clinical research and assume normality. The section concludes with a discussion of the statistical analysis methods that should be used when the normality requirement is not satisfied.

1) Two sample t-test

The two sample t-test is a statistical analysis method used to compare the means of a quantitative variable between two independent groups. For example, this analysis is used to compare the mean serum uric acid concentrations of a group taking steroids and a group taking placebo. The two sample t-test assumes normality. Therefore, it can be used when the normality is satisfied through the normality test. In this case, the normality test should be performed for each group, and it can be said that the normality is satisfied when the normality is satisfied in both groups. If the normality is not satisfied, the Mann-Whitney U-test should be used [1]. The Mann-Whitney U-test tests whether or not the distributions of the data collected from the two independent groups are the same; it does not compare the means of the quantitative variables of the two independent groups.
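A minimal Python sketch of this workflow is given below; the steroid and placebo values are hypothetical and serve only to show how the normality check, the t-test, and the Mann-Whitney U-test fit together.

```python
import numpy as np
from scipy import stats

# hypothetical serum uric acid values for two independent groups
steroid = np.array([6.8, 7.2, 7.7, 6.1, 7.0, 8.1, 6.5, 7.4])
placebo = np.array([5.0, 5.6, 5.4, 6.0, 5.3, 4.9, 5.8, 6.2])

# check normality separately in each group
print(stats.shapiro(steroid), stats.shapiro(placebo))

# if both groups satisfy normality, compare the means with the two sample t-test
print(stats.ttest_ind(steroid, placebo))

# otherwise, use the Mann-Whitney U-test
print(stats.mannwhitneyu(steroid, placebo))
```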

2) Paired t-test

The paired t-test is a statistical analysis method used to compare whether the mean difference of a quantitative variable measured twice for each subject in two dependent groups is 0. This is a statistical analysis method that examines whether there is a change between two measurements. For instance, this analysis method can be used to compare the uric acid concentration in the blood measured before taking the steroid with the uric acid concentration in the blood measured after taking the steroid. The paired t-test assumes normality. Therefore, it can be used when normality is established through the normality test. In this case, the normality test should be carried out on the differences between the before and after measurements. If the normality is not satisfied, the Wilcoxon signed rank test should be used [3]. The Wilcoxon signed rank test tests whether or not the median of the differences of the quantitative variables is zero in the two dependent groups, rather than whether or not the mean of the differences is zero.
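A similar hypothetical sketch for the paired case is shown below; note that the normality check is applied to the paired differences, not to the raw measurements.

```python
import numpy as np
from scipy import stats

# hypothetical uric acid values measured before and after taking the steroid
before = np.array([5.0, 5.6, 5.4, 6.0, 5.3, 4.9, 5.8, 6.2])
after  = np.array([6.1, 6.0, 5.9, 6.8, 5.6, 5.5, 6.5, 7.1])

# normality test on the paired differences
diff = after - before
print(stats.shapiro(diff))

# if the differences satisfy normality, use the paired t-test
print(stats.ttest_rel(after, before))

# otherwise, use the Wilcoxon signed rank test
print(stats.wilcoxon(after, before))
```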

3) One-way ANOVA

One-way ANOVA is a statistical analysis method used to compare the means of a quantitative variable over three or more independent groups. For example, this analysis is used to compare the mean serum uric acid concentrations of a group taking steroids, a group taking steroids+vitamins, and a group taking vitamins. One-way ANOVA assumes normality. Therefore, it can be used when the normality is satisfied through the normality test. In this case, the normality test should be performed for each group, and it can be said that the normality requirement is satisfied when it is satisfied in all three groups. If the normality is not satisfied, the Kruskal-Wallis test should be used [4]. The Kruskal-Wallis test does not compare the means of the quantitative variables over the three or more independent groups, but tests whether or not the distributions of data collected over the three or more independent groups are the same. One-way ANOVA also assumes homoscedasticity, i.e., equal dispersion. Therefore, it can be used when the homoscedasticity is satisfied through a homoscedasticity test. If the homoscedasticity is not satisfied, a Brown-Forsythe test or Welch test should be used.
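The sketch below illustrates this decision with three hypothetical groups; Levene's test is used here as one common way to check homoscedasticity, although it is not the only option.

```python
import numpy as np
from scipy import stats

# hypothetical serum uric acid values for three independent groups
steroid         = np.array([6.8, 7.2, 7.7, 6.1, 7.0, 8.1, 6.5])
steroid_vitamin = np.array([6.0, 6.4, 6.9, 5.8, 6.2, 7.1, 6.6])
vitamin         = np.array([5.0, 5.6, 5.4, 6.0, 5.3, 4.9, 5.8])

# homoscedasticity check (normality should also be checked per group,
# e.g., with stats.shapiro)
print(stats.levene(steroid, steroid_vitamin, vitamin))

# if the assumptions hold, compare the group means with one-way ANOVA
print(stats.f_oneway(steroid, steroid_vitamin, vitamin))

# if normality is not satisfied, use the Kruskal-Wallis test instead
print(stats.kruskal(steroid, steroid_vitamin, vitamin))
```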

4) Repeated measure one-factor analysis

Repeated measure one-factor analysis is a statistical analysis method used to compare whether or not there is a change in the mean value of a quantitative variable measured repeatedly for each subject over three or more dependent groups. This statistical analysis method examines whether or not there is a change between three or more measurements. For example, this method can be used to compare the uric acid concentration in the blood measured before taking the steroid, 1 day after taking the steroid, and 3 days after taking the steroid. Repeated measure one-factor analysis assumes normality. Therefore, it can be used when the normality is satisfied through the normality test. In this case, the normality test should be carried out on the differences between the measurement points. If the normality requirement is not satisfied, the Friedman test [5] should be used. The Friedman test tests whether or not there is a difference in the medians of the quantitative variables across the dependent groups, rather than comparing the mean values of the quantitative variables across the dependent groups.
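For the non-parametric alternative, the Friedman test is available in SciPy, as sketched below with hypothetical repeated measurements (one array per time point, aligned by subject); repeated measures ANOVA itself is available, for example, in the statsmodels package.

```python
import numpy as np
from scipy import stats

# hypothetical uric acid values per subject at three time points
before = np.array([5.0, 5.6, 5.4, 6.0, 5.3, 4.9, 5.8, 6.2])
day1   = np.array([6.1, 6.0, 5.9, 6.8, 5.6, 5.5, 6.5, 7.1])
day3   = np.array([5.5, 5.8, 5.7, 6.3, 5.4, 5.2, 6.1, 6.6])

# Friedman test for three or more dependent (repeated) measurements
print(stats.friedmanchisquare(before, day1, day3))
```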

5) Linear regression

Linear regression is a statistical analysis method used to estimate the relationship between a quantitative dependent variable and various explanatory variables through regression coefficients, as well as to examine the explanatory power of the estimated regression model for the data. For example, this method can be used to examine the factors affecting the uric acid concentration in the blood. As for the factors, qualitative and quantitative variables can be set and analyzed in various ways. The regression model of linear regression assumes normality of the error terms [6]. Therefore, it can be used when the normality is satisfied through the normality test. In this case, the normality test should be performed on the residuals, which are estimates of the errors. If the normality requirement is not satisfied, the regression model should be modified through model and data checks, and the regression analysis should be repeated until the normality requirement is satisfied. In addition, besides the normality assumption with respect to the error terms, the regression model of linear regression assumes homoscedasticity, independence, and linearity. Therefore, the regression model should be modified through model and data checks so that the analysis satisfies normality, homoscedasticity, independence, and linearity.
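A hypothetical sketch of checking the residual normality of a fitted regression model with statsmodels and SciPy is given below; the age and uric acid values are invented for illustration.

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

# hypothetical data: uric acid concentration (y) explained by age (x)
age       = np.array([34, 45, 52, 61, 29, 48, 55, 40, 63, 37])
uric_acid = np.array([4.9, 5.6, 6.1, 6.8, 4.5, 5.9, 6.4, 5.3, 7.0, 5.1])

# fit the linear regression model with an intercept term
X = sm.add_constant(age)
model = sm.OLS(uric_acid, X).fit()

# the normality assumption concerns the error terms, so the test is
# applied to the residuals (estimates of the errors)
print(stats.shapiro(model.resid))
```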
