Q-Q plot
Start by loading the qqplotr package:
require(qqplotr)
Let’s start by simulating from a standard Normal distribution:
set.seed(0)
smp <- data.frame(norm = rnorm(100))
Then, we use the provided stat_qq_* functions to construct a complete Q-Q plot with the points, reference line, and confidence bands. By default, the standard Normal Q-Q plot with Normal confidence bands is constructed:
gg <- ggplot(data = smp, mapping = aes(sample = norm)) +
  stat_qq_band() +
  stat_qq_line() +
  stat_qq_point() +
  labs(x = "Theoretical Quantiles", y = "Sample Quantiles")
gg
As we can see, all the points lie within the confidence bands, which is expected for the given distribution.
As previously described in the Details section, several confidence band constructions are available, which may be adjusted with the bandType parameter. Here, we may use geom_qq_band instead of stat_qq_band, which permits a little more flexibility with the graphical parameters when constructing and visualizing different confidence bands.
gg <- ggplot(data = smp, mapping = aes(sample = norm)) +
  geom_qq_band(bandType = "ks", mapping = aes(fill = "KS"), alpha = 0.5) +
  geom_qq_band(bandType = "ts", mapping = aes(fill = "TS"), alpha = 0.5) +
  geom_qq_band(bandType = "pointwise", mapping = aes(fill = "Normal"), alpha = 0.5) +
  geom_qq_band(bandType = "boot", mapping = aes(fill = "Bootstrap"), alpha = 0.5) +
  stat_qq_line() +
  stat_qq_point() +
  labs(x = "Theoretical Quantiles", y = "Sample Quantiles") +
  scale_fill_discrete("Bandtype")
gg
To construct Q-Q plots with other theoretical distributions we may use the distribution parameter. Specific distributional parameters may be passed as a list to the dparams parameter.
Note that distributional parameters have little impact when building Q-Q plots, as changing them will only modify the x-axis range. In contrast, those parameters will have a greater effect on P-P plots.
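The claim above can be checked directly in base R for scale and location-scale families: changing the parameters rescales the theoretical quantiles linearly, so the shape of the point pattern is unchanged and only the x-axis values differ. The following is a minimal sketch using only the stats package; the variable names are illustrative, and the reasoning does not extend to shape parameters.

```r
# Probability points for a sample of size 100
p <- ppoints(100)

# Theoretical Exponential quantiles under two different rates
q_rate1 <- qexp(p, rate = 1)
q_rate2 <- qexp(p, rate = 2)

# The two sets of quantiles differ only by a constant factor
exp_is_rescaled <- isTRUE(all.equal(q_rate2, q_rate1 / 2))
exp_is_rescaled

# The same holds for the Normal family (a location-scale family)
q_std <- qnorm(p)
q_shifted <- qnorm(p, mean = 2, sd = 2)
norm_is_linear <- isTRUE(all.equal(q_shifted, 2 + 2 * q_std))
norm_is_linear
```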
Now, let’s use the Q-Q plot functions on the mean ozone levels from the airquality dataset. Since the data is non-negative, let’s choose the Exponential distribution (exp) as the theoretical distribution.
It is important to note that the distribution nomenclature follows that of the stats package. So, if you wish to provide a custom distribution, you may do so by creating the density, cumulative, quantile, and random functions following the standard nomenclature from the stats package, i.e., for the "custom" distribution, you must define the "dcustom", "pcustom", "qcustom", and "rcustom" functions.
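As an illustration of that naming convention, the sketch below defines a hypothetical "custom" distribution, an Exponential shifted by a `shift` parameter. The distribution itself and the `shift` argument are assumptions made for this example only; the four functions simply wrap their stats counterparts.

```r
# Hypothetical "custom" distribution: an Exponential shifted by `shift`
dcustom <- function(x, rate = 1, shift = 0) dexp(x - shift, rate = rate)  # density
pcustom <- function(q, rate = 1, shift = 0) pexp(q - shift, rate = rate)  # cumulative
qcustom <- function(p, rate = 1, shift = 0) shift + qexp(p, rate = rate)  # quantile
rcustom <- function(n, rate = 1, shift = 0) shift + rexp(n, rate = rate)  # random

# Consistency check: the quantile function should invert the CDF
p <- c(0.1, 0.5, 0.9)
round_trip <- pcustom(qcustom(p, rate = 2, shift = 1), rate = 2, shift = 1)
isTRUE(all.equal(round_trip, p))
```

With these functions defined in the workspace, the plotting functions would presumably be called with distribution = "custom" and dparams = list(rate = 2, shift = 1); that usage is inferred from the description above and is not run here.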
That being said, let’s set distribution = "exp" and rate = 2 (the latter just to exemplify the usage of dparams):
di <- "exp" # exponential distribution
dp <- list(rate = 2) # exponential rate parameter

gg <- ggplot(data = airquality, mapping = aes(sample = Ozone)) +
  stat_qq_band(distribution = di, dparams = dp) +
  stat_qq_line(distribution = di, dparams = dp) +
  stat_qq_point(distribution = di, dparams = dp) +
  labs(x = "Theoretical Quantiles", y = "Sample Quantiles")
gg
The qqplotr package also offers the option to detrend Q-Q and P-P plots in order to help reduce the visual bias caused by the orthogonal distances from points to the reference lines (Thode, 2002). That bias may cause wrong conclusions to be drawn via visual inference of the plot. To do that we must set detrend = TRUE:
di <- "exp"
dp <- list(rate = 2)
de <- TRUE # enabling the detrend option

gg <- ggplot(data = airquality, mapping = aes(sample = Ozone)) +
  stat_qq_band(distribution = di, dparams = dp, detrend = de) +
  stat_qq_line(distribution = di, dparams = dp, detrend = de) +
  stat_qq_point(distribution = di, dparams = dp, detrend = de) +
  labs(x = "Theoretical Quantiles", y = "Sample Quantiles")
gg
Note that the detrend option causes the plot to “rotate”, that is, instead of being presented as a diagonal plot, the data are displayed horizontally.
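To make the “rotation” concrete, the following base R sketch detrends a Normal Q-Q plot by hand: it subtracts the reference line from the ordered sample, so that points on the line map to zero. The choice of a line through the first and third quartiles is an assumption made for this sketch; the detrending performed inside qqplotr may define the line differently.

```r
set.seed(0)
x <- sort(rnorm(100))              # ordered sample values
theo <- qnorm(ppoints(length(x)))  # theoretical Normal quantiles

# Reference line through the first and third quartiles (an assumption)
probs <- c(0.25, 0.75)
slope <- unname(diff(quantile(x, probs)) / diff(qnorm(probs)))
intercept <- unname(quantile(x, probs[1]) - slope * qnorm(probs[1]))

# Detrended values: vertical distance from each point to the reference line
detrended <- x - (intercept + slope * theo)
summary(detrended)
```

Plotting `detrended` against `theo` gives a cloud scattered around a horizontal line at zero instead of around a diagonal.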
The Q-Q plot functions are also compatible with many ggplot2 operators, such as Facets (sub-plots). For instance, let’s consider the barley dataset from the lattice package to illustrate how Facets behave when applied to the Q-Q plot functions:
# install.packages("lattice")
data("barley", package = "lattice")

gg <- ggplot(data = barley, mapping = aes(sample = yield, color = site, fill = site)) +
  stat_qq_band(alpha = 0.5) +
  stat_qq_line() +
  stat_qq_point() +
  facet_wrap(~ site) +
  labs(x = "Theoretical Quantiles", y = "Sample Quantiles")
gg
P-P plot
Let’s start by plotting the previously simulated Normal data versus the standard Normal distribution:
gg <- ggplot(data = smp, mapping = aes(sample = norm)) +
  stat_pp_band() +
  stat_pp_line() +
  stat_pp_point() +
  labs(x = "Probability Points", y = "Cumulative Probability")
gg
Notice that the label names are different from those of the Q-Q plots. Here, the cumulative probability points (y-axis) are constructed by evaluating the theoretical CDF on sample quantiles.
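The construction just described can be sketched in base R as follows. The use of ppoints() for the x-axis probability points is an assumption made for illustration; the exact probability points used internally by qqplotr may differ.

```r
set.seed(0)
x <- sort(rnorm(100))  # ordered sample values (the sample quantiles)
n <- length(x)

prob_points <- ppoints(n)  # x-axis: probability points (assumed definition)
cum_prob <- pnorm(x)       # y-axis: theoretical CDF evaluated at the sample quantiles

# For a well-specified distribution the points stay close to the identity line
max(abs(cum_prob - prob_points))
```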
As discussed before, in the case of P-P plots the distributional parameters do impact the results. For instance, say we want to evaluate the same standard Normal data with a shifted and rescaled Normal(2,2) distribution:
dp <- list(mean = 2, sd = 2) # shifted and rescaled Normal parameters

gg <- ggplot(data = smp, mapping = aes(sample = norm)) +
  stat_pp_band(dparams = dp) +
  stat_pp_line() +
  stat_pp_point(dparams = dp) +
  labs(x = "Probability Points", y = "Cumulative Probability")
gg
As we already know, the plot shows that the chosen Normal distribution parameters are not appropriate for the input data.
Also notice that the stat_pp_line function lacks the dparams parameter. The reason is that stat_pp_line draws the identity line by default and, thus, does not depend on the sample data or its distribution. However, if the user wishes to draw another line (different from the identity), they may do so by providing the intercept and slope values, respectively, as a vector of length two to the ab parameter:
gg <- ggplot(data = smp, mapping = aes(sample = norm)) +
  stat_pp_band() +
  stat_pp_line(ab = c(.2, .5)) + # intercept = 0.2, slope = 0.5
  stat_pp_point() +
  labs(x = "Probability Points", y = "Cumulative Probability")
gg
We may also detrend the P-P plots in the same way as before. Let’s evaluate again the mean ozone levels from the airquality dataset. We had previously seen that the Exponential distribution was more appropriate than the Normal distribution, so let’s take that into account:
di <- "exp"
dp <- list(rate = .022) # value is based on some empirical tests
de <- TRUE

gg <- ggplot(data = airquality, mapping = aes(sample = Ozone)) +
  stat_pp_band(distribution = di, detrend = de, dparams = dp) +
  stat_pp_line(detrend = de) +
  stat_pp_point(distribution = di, detrend = de, dparams = dp) +
  labs(x = "Probability Points", y = "Cumulative Probability") +
  scale_y_continuous(limits = c(-.5, .5))
gg
Based on empirical tests, we set the rate parameter to rate = .022. That value left the most P-P points inside the confidence bands. Even so, the group of points outside the confidence bands (at the lower tail) indicates that a more appropriate distribution should be selected.
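The rate above was chosen by trial and error. As a point of comparison only (this is not the method used in this vignette), the maximum likelihood estimate of an Exponential rate is the reciprocal of the sample mean. A minimal sketch using the airquality data that ships with R, with missing values dropped:

```r
# Drop missing Ozone measurements before estimating
ozone <- airquality$Ozone[!is.na(airquality$Ozone)]

# Maximum likelihood estimate of the Exponential rate: 1 / sample mean
rate_mle <- 1 / mean(ozone)
rate_mle
```

The estimate can be passed as dparams = list(rate = rate_mle) to compare the resulting plot with the one obtained from the hand-tuned value.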