sgd {sgd}                                                        R Documentation
Description
Run stochastic gradient descent in order to optimize the induced loss function given a model and data.
Usage
sgd(x, ...)

## S3 method for class 'formula'
sgd(formula, data, model, model.control = list(), sgd.control = list(...), ...)

## S3 method for class 'matrix'
sgd(x, y, model, model.control = list(), sgd.control = list(...), ...)

## S3 method for class 'big.matrix'
sgd(x, y, model, model.control = list(), sgd.control = list(...), ...)
Arguments
x, y: a design matrix and the respective vector of outcomes (a call sketch for the matrix method follows this list).

...: arguments to be used to form the default sgd.control arguments if it is not supplied directly.

formula: an object of class "formula" (or one that can be coerced to that class): a symbolic description of the model to be fitted.

data: an optional data frame, list or environment (or object coercible by as.data.frame to a data frame) containing the variables in the model.

model: character specifying the model to be used, e.g., "lm" for linear regression, "glm" for generalized linear models, or "cox" for the Cox proportional hazards model; see 'Details'.

model.control: a list of parameters for controlling the model.

sgd.control: an optional list of parameters for controlling the estimation, e.g., the stochastic gradient method and the learning rate; see 'Details'.
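A minimal call sketch for the matrix method, assuming a linear model and that an intercept column is added to the design matrix by hand (an assumption about the interface, not stated above):

library(sgd)
set.seed(42)
N <- 1e3; d <- 3
X <- cbind(1, matrix(rnorm(N * d), ncol = d))  # design matrix with explicit intercept
theta <- rep(2, d + 1)
y <- as.numeric(X %*% theta + rnorm(N))
fit <- sgd(X, y, model = "lm")                 # matrix method from 'Usage'
fit$coefficients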
Details
Models:

The Cox model assumes that the survival data is ordered when passed in, i.e., such that the risk set of an observation i is all data points after it.
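For example, a small sketch of that ordering (the time and status columns are hypothetical; only the sorting step matters here):

# Hypothetical survival data; sort by increasing time so that the risk set
# of row i is the set of rows after it, as assumed above.
set.seed(1)
surv.dat <- data.frame(time = rexp(100), status = rbinom(100, 1, 0.7),
                       x1 = rnorm(100))
surv.dat <- surv.dat[order(surv.dat$time), ]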
Methods (a selection sketch follows this list):

sgd: stochastic gradient descent (Robbins and Monro, 1951)

implicit: implicit stochastic gradient descent (Toulis et al., 2014)

asgd: stochastic gradient with averaging (Polyak and Juditsky, 1992)

ai-sgd: implicit stochastic gradient with averaging (Toulis et al., 2015)

momentum: "classical" momentum (Polyak, 1964)

nesterov: Nesterov's accelerated gradient (Nesterov, 1983)
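A hedged sketch of selecting one of these methods; passing the name via a method entry of sgd.control is an assumption, since this page does not spell out the entry name:

library(sgd)
set.seed(1)
dat <- data.frame(y = rnorm(200), x1 = rnorm(200), x2 = rnorm(200))
# 'method' as an sgd.control entry name is assumed, not confirmed above
fit <- sgd(y ~ ., data = dat, model = "lm",
           sgd.control = list(method = "ai-sgd"))
fit$coefficients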
Learning rates and hyperparameters (a numerical sketch of the one-dim schedule follows this list):

one-dim: scalar value prescribed in Xu (2011) as

    a_n = scale * gamma / (1 + alpha*gamma*n)^c

where the defaults are lr.control = (scale=1, gamma=1, alpha=1, c), and c is 1 if implemented without averaging and 2/3 if with averaging.

one-dim-eigen: diagonal matrix; lr.control = NULL

d-dim: diagonal matrix; lr.control = (epsilon=1e-6)

adagrad: diagonal matrix prescribed in Duchi et al. (2011) as lr.control = (eta=1, epsilon=1e-6)

rmsprop: diagonal matrix prescribed in Tieleman and Hinton (2012) as lr.control = (eta=1, gamma=0.9, epsilon=1e-6)
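To make the one-dim schedule concrete, a minimal sketch evaluating it at the defaults (the helper a_n is purely illustrative):

# a_n = scale * gamma / (1 + alpha*gamma*n)^c with the defaults above;
# c = 1 without averaging, c = 2/3 with averaging.
a_n <- function(n, scale = 1, gamma = 1, alpha = 1, c = 1) {
  scale * gamma / (1 + alpha * gamma * n)^c
}
round(a_n(1:5), 3)           # without averaging: 0.500 0.333 0.250 0.200 0.167
round(a_n(1:5, c = 2/3), 3)  # with averaging: decays more slowly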
Value
An object of class "sgd", which is a list containing the following components (an inspection sketch follows the list):

model: name of the model

coefficients: a named vector of coefficients

converged: logical. Was the algorithm judged to have converged?

estimates: estimates from the algorithm stored at each iteration specified in pos

fitted.values: the fitted mean values

pos: vector of indices specifying the iteration number each estimate was stored for

residuals: the residuals, that is response minus fitted values

times: vector of times in seconds it took to complete the number of iterations specified in pos

model.out: a list of model-specific output attributes
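A brief inspection sketch using these components; the trace plot assumes that estimates stores one column per entry of pos, which is not stated explicitly above:

library(sgd)
set.seed(42)
dat <- data.frame(y = rnorm(500), x1 = rnorm(500), x2 = rnorm(500))
fit <- sgd(y ~ ., data = dat, model = "lm")
fit$converged                 # logical convergence flag
fit$coefficients              # named coefficient vector
# trace of the first coefficient across the stored iterations in fit$pos
plot(as.numeric(fit$pos), fit$estimates[1, ], type = "l",
     xlab = "iteration", ylab = "first coefficient")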
Author(s)
Dustin Tran, Tian Lan, Panos Toulis, Ye Kuang, Edoardo Airoldi
References
John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121-2159, 2011.

Yurii Nesterov. A method for solving a convex programming problem with convergence rate O(1/k^2). Soviet Mathematics Doklady, 27(2):372-376, 1983.

Boris T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1-17, 1964.

Boris T. Polyak and Anatoli B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838-855, 1992.

Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pp. 400-407, 1951.

Panos Toulis, Jason Rennie, and Edoardo M. Airoldi. "Statistical analysis of stochastic gradient methods for generalized linear models". In Proceedings of the 31st International Conference on Machine Learning, 2014.

Panos Toulis, Dustin Tran, and Edoardo M. Airoldi. "Stability and optimality in stochastic gradient descent". arXiv preprint arXiv:1505.02417, 2015.

Wei Xu. Towards optimal one pass large scale learning with averaged stochastic gradient descent. arXiv preprint arXiv:1107.2490, 2011.
Examples

## Linear regression
set.seed(42)
# Dimensions
N <- 1e4
d <- 5
X <- matrix(rnorm(N*d), ncol=d)
theta <- rep(5, d+1)
eps <- rnorm(N)
y <- cbind(1, X) %*% theta + eps
dat <- data.frame(y=y, x=X)
sgd.theta <- sgd(y ~ ., data=dat, model="lm")
sprintf("Mean squared error: %0.3f", mean((theta - as.numeric(sgd.theta$coefficients))^2))
[Package sgd version 1.1.2]