An Introduction to qqplotr (2024)

Table of Contents
  - Q-Q plot
  - P-P plot
  - FAQs
  - References

Q-Q plot

Start by loading the qqplotr package:

require(qqplotr)

Let’s start by simulating from a standard Normal distribution:

set.seed(0)
smp <- data.frame(norm = rnorm(100))

Then, we use the provided stat_qq_* functions to construct a complete Q-Q plot with the points, reference line, and the confidence bands. By default, the standard Q-Q Normal plot with Normal confidence bands is constructed:

gg <- ggplot(data = smp, mapping = aes(sample = norm)) +
  stat_qq_band() +
  stat_qq_line() +
  stat_qq_point() +
  labs(x = "Theoretical Quantiles", y = "Sample Quantiles")
gg


As we can see, all the points lie within the confidence bands, which is expected for the given distribution.
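To make the point layer less of a black box, the coordinates of a Normal Q-Q plot can be approximated in base R. This is only a sketch: it assumes the plotting positions returned by ppoints(), which may not match what qqplotr computes internally, and the names qq, theoretical, and sample are illustrative.

```r
# Recreate the simulated sample used above.
set.seed(0)
smp <- data.frame(norm = rnorm(100))

n <- nrow(smp)
probs <- ppoints(n)                 # plotting positions in (0, 1)

qq <- data.frame(
  theoretical = qnorm(probs),       # standard Normal quantiles (x-axis)
  sample      = sort(smp$norm)      # ordered observations (y-axis)
)

# For Normal data the two columns are close to linearly related.
cor(qq$theoretical, qq$sample)
```

A correlation near 1 corresponds to points that hug the reference line.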

As previously described in the Details section, several types of confidence bands are available (the example below uses four), and they may be selected with the bandType parameter. Here, we may use geom_qq_band instead of stat_qq_band, which permits a little more flexibility with the graphical parameters when constructing and visualizing different confidence bands.

gg <- ggplot(data = smp, mapping = aes(sample = norm)) +
  geom_qq_band(bandType = "ks", mapping = aes(fill = "KS"), alpha = 0.5) +
  geom_qq_band(bandType = "ts", mapping = aes(fill = "TS"), alpha = 0.5) +
  geom_qq_band(bandType = "pointwise", mapping = aes(fill = "Normal"), alpha = 0.5) +
  geom_qq_band(bandType = "boot", mapping = aes(fill = "Bootstrap"), alpha = 0.5) +
  stat_qq_line() +
  stat_qq_point() +
  labs(x = "Theoretical Quantiles", y = "Sample Quantiles") +
  scale_fill_discrete("Bandtype")
gg


To construct Q-Q plots with other theoretical distributions we may use the distribution parameter. Specific distributional parameters may be passed as a list to the dparams parameter.

Note that distributional parameters have little impact when building Q-Q plots, as changing them will only modify the x-axis range. In contrast, those parameters will have a greater effect on P-P plots.
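The claim about the x-axis can be checked numerically for a scale-type parameter such as the Exponential rate. The sketch below uses only base R and compares theoretical quantiles under two rates; the variable names are illustrative.

```r
p <- ppoints(20)

q_rate1 <- qexp(p, rate = 1)
q_rate2 <- qexp(p, rate = 2)

# Changing the rate rescales every theoretical quantile by the same factor,
# so the point pattern keeps its shape; only the x-axis range moves.
all.equal(q_rate2, q_rate1 / 2)
range(q_rate1)
range(q_rate2)
```

This linear rescaling applies to location and scale parameters; a shape parameter would generally change the curvature of the plot as well.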

Now, let’s use the Q-Q plot functions to draw a plot for the mean ozone levels from the airquality dataset. Since the data are non-negative, let’s choose the Exponential distribution (exp) as the theoretical distribution.

It is important to note that the distribution nomenclature follows that from the stats package. So, if you wish to provide a custom distribution, you may do so by creating the density, cumulative, quantile, and random functions following the standard nomenclature from the stats package, i.e., for the "custom" distribution, you must define "dcustom", "pcustom", "qcustom", and "rcustom" functions.
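As an illustration of that naming scheme, the sketch below defines the four functions for a hypothetical "custom" distribution. The choice of a Laplace (double exponential) law and the parameter names mu and b are assumptions made for this example only.

```r
# Hypothetical "custom" distribution: Laplace with location `mu`, scale `b`.
dcustom <- function(x, mu = 0, b = 1) {
  exp(-abs(x - mu) / b) / (2 * b)
}

pcustom <- function(q, mu = 0, b = 1) {
  ifelse(q < mu,
         0.5 * exp((q - mu) / b),
         1 - 0.5 * exp(-(q - mu) / b))
}

qcustom <- function(p, mu = 0, b = 1) {
  mu - b * sign(p - 0.5) * log(1 - 2 * abs(p - 0.5))
}

rcustom <- function(n, mu = 0, b = 1) {
  qcustom(runif(n), mu = mu, b = b)   # inverse-transform sampling
}

# Sanity check: the quantile function inverts the CDF.
p <- c(0.1, 0.5, 0.9)
all.equal(pcustom(qcustom(p)), p)
```

With these functions defined in the session, the description above suggests that distribution = "custom" together with dparams = list(mu = 0, b = 1) could then be passed to the stat_qq_* functions.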

That being said, let’s set distribution = "exp" and rate = 2 (the latter just to exemplify the usage of dparams):

di <- "exp"            # exponential distribution
dp <- list(rate = 2)   # exponential rate parameter
gg <- ggplot(data = airquality, mapping = aes(sample = Ozone)) +
  stat_qq_band(distribution = di, dparams = dp) +
  stat_qq_line(distribution = di, dparams = dp) +
  stat_qq_point(distribution = di, dparams = dp) +
  labs(x = "Theoretical Quantiles", y = "Sample Quantiles")
gg


The qqplotr package also offers the option to detrend Q-Q and P-P plots in order to help reduce visual bias caused by the orthogonal distances from points to the reference lines (Thode, 2002). That bias may cause wrong conclusions to be drawn via visual inference of the plot. To do that we must set detrend = TRUE:

di <- "exp"
dp <- list(rate = 2)
de <- TRUE   # enabling the detrend option
gg <- ggplot(data = airquality, mapping = aes(sample = Ozone)) +
  stat_qq_band(distribution = di, dparams = dp, detrend = de) +
  stat_qq_line(distribution = di, dparams = dp, detrend = de) +
  stat_qq_point(distribution = di, dparams = dp, detrend = de) +
  labs(x = "Theoretical Quantiles", y = "Sample Quantiles")
gg


Note that the detrend option causes the plot to “rotate”, that is, instead of being presented as a diagonal plot, we visualize the data in a horizontal manner.
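One way to picture detrending is as subtracting the reference line from each point, so that a perfect fit becomes a horizontal line at zero. The base-R sketch below assumes a reference line drawn through the first and third quartiles and plotting positions from ppoints(); both are assumptions made for illustration and may differ from qqplotr's internals.

```r
set.seed(0)
x <- sort(rnorm(100))
theo <- qnorm(ppoints(length(x)))

# Reference line through the first and third quartiles (assumed here).
probs <- c(0.25, 0.75)
slope <- unname(diff(quantile(x, probs)) / diff(qnorm(probs)))
intercept <- unname(quantile(x, probs[1])) - slope * qnorm(probs[1])

# Detrended values: vertical distance of each point from the reference line.
detrended <- x - (intercept + slope * theo)

# The detrended values scatter around zero with far less spread than x.
summary(detrended)
```

Plotting detrended against theo would give the horizontal layout described above.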

The Q-Q plot functions are also compatible with many ggplot2 operators, such as Facets (sub-plots). For instance, let’s consider the barley dataset from the lattice package to illustrate how Facets behave when applied to the Q-Q plot functions:

# install.packages("lattice")
data("barley", package = "lattice")
gg <- ggplot(data = barley, mapping = aes(sample = yield, color = site, fill = site)) +
  stat_qq_band(alpha = 0.5) +
  stat_qq_line() +
  stat_qq_point() +
  facet_wrap(~ site) +
  labs(x = "Theoretical Quantiles", y = "Sample Quantiles")
gg


P-P plot

Let’s start by plotting the previously simulated Normal data versus the standard Normal distribution:

gg <- ggplot(data = smp, mapping = aes(sample = norm)) +
  stat_pp_band() +
  stat_pp_line() +
  stat_pp_point() +
  labs(x = "Probability Points", y = "Cumulative Probability")
gg


Notice that the label names are different from those of the Q-Q plots. Here, the cumulative probability points (y-axis) are constructed by evaluating the theoretical CDF on sample quantiles.
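That construction can be sketched in base R as follows. The probability points are taken from ppoints(), which is an assumption about the plotting positions rather than a statement of what qqplotr uses, and the column names are illustrative.

```r
set.seed(0)
x <- sort(rnorm(100))

pp <- data.frame(
  prob_points = ppoints(length(x)),  # x-axis: probability points
  cum_prob    = pnorm(x)             # y-axis: theoretical CDF at ordered sample
)

# For a well-specified distribution the points stay near the identity line.
max(abs(pp$cum_prob - pp$prob_points))
```

Both axes are confined to the unit interval, which is why the identity line is the natural reference for a P-P plot.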

As discussed before, in the case of P-P plots the distributional parameters do impact the results. For instance, say we want to evaluate the same standard Normal data with a shifted and rescaled Normal(2,2) distribution:

dp <- list(mean = 2, sd = 2)   # shifted and rescaled Normal parameters
gg <- ggplot(data = smp, mapping = aes(sample = norm)) +
  stat_pp_band(dparams = dp) +
  stat_pp_line() +
  stat_pp_point(dparams = dp) +
  labs(x = "Probability Points", y = "Cumulative Probability")
gg


As we already know, the plot shows that the chosen Normal distribution parameters are not appropriate for the input data.
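The mismatch can also be put into numbers. The sketch below, in base R, measures the largest vertical gap between the P-P points and the identity line under both parameter choices; the helper gap() and the use of ppoints() are assumptions introduced for this illustration.

```r
set.seed(0)
x <- sort(rnorm(100))
prob_points <- ppoints(length(x))

# Largest vertical gap between the P-P points and the identity line.
gap <- function(cum_prob) max(abs(cum_prob - prob_points))

gap_standard <- gap(pnorm(x, mean = 0, sd = 1))  # matches the simulated data
gap_shifted  <- gap(pnorm(x, mean = 2, sd = 2))  # Normal(2, 2) from the text

c(standard = gap_standard, shifted = gap_shifted)
```

A much larger gap for the shifted parameters is the numerical counterpart of points leaving the confidence bands.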

Also notice that the stat_pp_line function lacks the dparams parameter. The reason is that stat_pp_line draws the identity line by default and, thus, does not depend on the sample data or its distribution. However, if the user wishes to draw another line (different from the identity), they may do so by providing the intercept and slope values, respectively, as a vector of length two to the ab parameter:

gg <- ggplot(data = smp, mapping = aes(sample = norm)) +
  stat_pp_band() +
  stat_pp_line(ab = c(.2, .5)) +   # intercept = 0.2, slope = 0.5
  stat_pp_point() +
  labs(x = "Probability Points", y = "Cumulative Probability")
gg


We may also detrend the P-P plots in the same way as before. Let’s evaluate again the mean ozone levels from the airquality dataset. We had previously seen that the Exponential distribution was more appropriate than the Normal distribution, so let’s take that into account:

di <- "exp"
dp <- list(rate = .022)   # value is based on some empirical tests
de <- TRUE
gg <- ggplot(data = airquality, mapping = aes(sample = Ozone)) +
  stat_pp_band(distribution = di, detrend = de, dparams = dp) +
  stat_pp_line(detrend = de) +
  stat_pp_point(distribution = di, detrend = de, dparams = dp) +
  labs(x = "Probability Points", y = "Cumulative Probability") +
  scale_y_continuous(limits = c(-.5, .5))
gg


Based on empirical tests, we set the rate parameter to rate = .022. That value left the most P-P points inside the confidence bands. Even so, the group of points outside the confidence bands (at the lower tail) indicates that a more appropriate distribution should be selected.
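The rate above was chosen by trial and error. As an alternative starting point (not the approach used in the text), the maximum-likelihood estimate of an Exponential rate is the reciprocal of the sample mean. The sketch below computes it for the Ozone column after dropping missing values; ozone and rate_mle are illustrative names.

```r
# Ozone contains missing values, so they are removed before averaging.
ozone <- airquality$Ozone[!is.na(airquality$Ozone)]

# Maximum-likelihood estimate of the Exponential rate: 1 / sample mean.
rate_mle <- 1 / mean(ozone)
rate_mle
```

The estimate comes out at roughly 0.024, in the same neighborhood as the empirically chosen .022.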


FAQs

Is my Q-Q plot normal?

Examining data distributions using QQ plots

Points on the Normal QQ plot provide an indication of univariate normality of the dataset. If the data is normally distributed, the points will fall on the 45-degree reference line. If the data is not normally distributed, the points will deviate from the reference line.

How to interpret Q-Q plot in R?

On the horizontal axis, it shows the expected value of an individual with the same quantile if the distribution were normal (“theoretical quantiles” in the same figure). The QQ plot should follow more or less along a straight line if the data come from a normal distribution (with some tolerance for sampling variation).

What does a good Q-Q plot look like?

A QQ plot is a scatterplot created by plotting two sets of quantiles against one another. If both sets of quantiles came from the same distribution, we should see the points forming a line that's roughly straight.

What proportion of these numbers is within one standard deviation away from the list's average?

The empirical rule states that, for normally distributed data, approximately:
  - 68% of the data points will fall within one standard deviation of the mean.
  - 95% of the data points will fall within two standard deviations of the mean.
  - 99.7% of the data points will fall within three standard deviations of the mean.
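These percentages follow directly from the standard Normal CDF, as the short base-R check below shows; within_k_sd is an illustrative name.

```r
k <- 1:3

# Probability that a Normal variable lies within k standard deviations
# of its mean, for k = 1, 2, 3.
within_k_sd <- pnorm(k) - pnorm(-k)

round(100 * within_k_sd, 1)
```

The rounded values are 68.3, 95.4, and 99.7 percent.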

What are the disadvantages of Q-Q plot?

Quantile-Quantile (Q-Q) plots are often difficult to interpret because it is unclear how large the deviation from the theoretical distribution must be to indicate a lack of fit.

How do you judge a Q-Q plot?

If the two distributions being compared are similar, the points in the Q–Q plot will approximately lie on the identity line y = x. If the distributions are linearly related, the points in the Q–Q plot will approximately lie on a line, but not necessarily on the line y = x.

What is the math behind Q-Q plot?

The theoretical Q-Q plot is the graph of the quantiles of the CDF $F$, $x_p = F^{-1}(p)$, versus the corresponding quantiles of the CDF $G$, $y_p = G^{-1}(p)$; that is, the graph of $[F^{-1}(p), G^{-1}(p)]$ for $p \in (0, 1)$. If the two CDFs are identical, the theoretical Q-Q plot will be the line $y = x$.

What does a heavy tailed Q-Q plot mean?

A heavy-tailed Q-Q plot means that, compared to the normal distribution, there is much more data located at the extremes of the distribution and less data in the center of the distribution.
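A small base-R comparison makes this concrete. The sketch below uses a Student t distribution with 3 degrees of freedom as a stand-in for a heavy-tailed law; that choice, like the variable names, is an assumption made for illustration.

```r
p <- c(0.75, 0.90, 0.99)

heavy  <- qt(p, df = 3)   # Student t with 3 degrees of freedom (heavy tails)
normal <- qnorm(p)        # standard Normal quantiles

# The ratio grows toward the tail, which is why the upper end of a Normal
# Q-Q plot bends upward for heavy-tailed data (and the lower end downward).
heavy / normal
```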

How do you know if a Q-Q plot is right skewed?

When the upper end of the Q-Q plot deviates from a straight line while the lower end follows one, the distribution has a longer tail to its right and is right-skewed, also called positively skewed. The opposite pattern, with the lower end deviating, indicates a left-skewed (negatively skewed) distribution.

What is the main purpose of a Q-Q plot?

The quantile-quantile (q-q) plot is a graphical technique for determining if two data sets come from populations with a common distribution.

What is the best fit for a Q-Q plot?

The Quantile-Quantile (or Q-Q) plot is useful for finding the best fitting distribution within a family of distributions. The first step therefore, is to choose which of the theoretical distributions to fit to the data.

What is the Z score of the Q-Q plot?

When the option Q-Q plot is selected, the horizontal axis shows the z-scores of the observed values, z=(x−mean)/SD. A straight reference line represents the Normal distribution. If the sample data are near a Normal distribution, the data points will be near this straight line.

What is the 68-95-99.7 rule?

The empirical rule states that in a normal distribution, virtually all observed data will fall within three standard deviations of the mean. Under this rule, 68% of the data will fall within one standard deviation, 95% within two standard deviations, and 99.7% within three standard deviations from the mean.

What is the 3 sigma rule for normal distribution?

In the empirical sciences, the so-called three-sigma rule of thumb (or 3σ rule) expresses a conventional heuristic that nearly all values are taken to lie within three standard deviations of the mean, and thus it is empirically useful to treat 99.7% probability as near certainty.

How to tell if data is normally distributed with mean and standard deviation?

Key Takeaways
  1. The normal distribution is the proper term for a probability bell curve.
  2. In a standard normal distribution, the mean is zero and the standard deviation is 1. Any normal distribution has zero skew and a kurtosis of 3.
  3. Normal distributions are symmetrical, but not all symmetrical distributions are normal.

How to tell if a box plot is normally distributed?

Normal Distribution: If a box plot has equal proportions around the median, we can say the distribution is symmetric, which is consistent with normality but does not establish it. Positively Skewed: For a distribution that is positively skewed, the box plot will show the median closer to the lower or bottom quartile.

What is the difference between a normal PP plot and a normal Q-Q plot?

On a P-P plot, changes in location or scale do not necessarily preserve linearity. On a Q-Q plot, the reference line representing a particular theoretical distribution depends on the location and scale parameters of that distribution, having intercept and slope equal to the location and scale parameters.
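The statement about intercept and slope can be verified with theoretical quantiles alone. The base-R sketch below compares a Normal(2, 3) distribution against the standard Normal; the parameter values and variable names are chosen purely for illustration.

```r
p <- ppoints(50)

x_std       <- qnorm(p)                     # quantiles of Normal(0, 1)
y_loc_scale <- qnorm(p, mean = 2, sd = 3)   # quantiles of Normal(2, 3)

# Regressing one set of quantiles on the other recovers location and scale.
line_coef <- unname(coef(lm(y_loc_scale ~ x_std)))
line_coef
```

The fitted intercept and slope equal the location (2) and scale (3) of the second distribution.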

How to tell if data is normally distributed?

A histogram is an effective way to tell if a frequency distribution appears to have a normal distribution. Plot a histogram and look at the shape of the bars. If the bars roughly follow a symmetrical bell or hill shape, then the distribution is approximately normal.

References


Author: Dean Jakubowski Ret
