| Title: | (Bayesian) Additive Voronoi Tessellations |
|---|---|
| Description: | Implements the Bayesian Additive Voronoi Tessellation model for non-parametric regression and machine learning as introduced in Stone and Gosling (2025) <doi:10.1080/10618600.2024.2414104>. This package provides a flexible alternative to BART (Bayesian Additive Regression Trees) using Voronoi tessellations instead of trees. Users can fit Bayesian regression models, estimate the associated posterior distributions and make predictions. It is particularly useful for spatial data analysis, machine learning regression, complex function approximation and Bayesian modeling where the underlying structure is unknown. The method is well-suited to capturing spatial patterns and non-linear relationships. |
| Authors: | Adam Stone [aut] (ORCID: <https://orcid.org/0009-0004-0058-6117>), John Paul Gosling [aut, cre] (ORCID: <https://orcid.org/0000-0002-4072-3022>) |
| Maintainer: | John Paul Gosling <[email protected]> |
| License: | GPL (>= 3) |
| Version: | 0.6.1 |
| Built: | 2026-06-05 20:00:18 UTC |
| Source: | https://github.com/johnpaulgosling/addivortes |
AddiVortes implements Bayesian Additive Voronoi Tessellation models for machine learning regression and non-parametric statistical modeling. This package provides a flexible alternative to BART (Bayesian Additive Regression Trees), using Voronoi tessellations instead of trees for spatial partitioning. The method is particularly effective for spatial data analysis, complex function approximation, and Bayesian regression modeling.
Key features include:
Machine learning regression with Bayesian inference
Alternative to BART using Voronoi tessellations
Spatial data analysis and modeling
Non-parametric regression capabilities
Complex function approximation
Uncertainty quantification through posterior inference
Maintainer: John Paul Gosling [email protected] (ORCID)
Authors:
John Paul Gosling [email protected] (ORCID)
Adam Stone [email protected] (ORCID)
Stone, A. and Gosling, J.P. (2025). AddiVortes: (Bayesian) additive Voronoi tessellations. Journal of Computational and Graphical Statistics.
https://johnpaulgosling.github.io/AddiVortes/
The AddiVortes model is a Bayesian nonparametric regression model that uses a sum of tessellations to model the relationship between the covariates and the output values. The model uses a backfitting algorithm to sample from the posterior distribution of the output values for each tessellation. Alongside fitting details and the posterior sample, the returned object stores the in-sample RMSE.
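The backfitting idea can be illustrated with a small base R sketch. This is not the package's sampler: the tessellations are held fixed, and each cell output is set to the mean of the partial residuals rather than drawn from a conditional posterior. The helper `nearest_centre` and all object names are hypothetical.

```r
# Deterministic backfitting over fixed tessellations (illustration only).
set.seed(1)
n <- 100
x <- matrix(runif(n * 2), n, 2)
y <- sin(2 * pi * x[, 1]) + x[, 2] + rnorm(n, sd = 0.1)

# Row index of the nearest centre for every row of x (squared Euclidean distance).
nearest_centre <- function(x, centres) {
  d2 <- vapply(seq_len(nrow(centres)), function(j) {
    rowSums(sweep(x, 2, centres[j, ])^2)
  }, numeric(nrow(x)))
  max.col(-matrix(d2, nrow = nrow(x)), ties.method = "first")
}

m <- 5                                                       # number of tessellations
tess <- lapply(seq_len(m), function(j) matrix(runif(8), 4, 2))  # 4 centres each
cells <- lapply(tess, function(centres) nearest_centre(x, centres))
values <- lapply(tess, function(centres) numeric(nrow(centres)))  # cell outputs start at 0

fitted_j <- function(j) values[[j]][cells[[j]]]
total_fit <- function() Reduce(`+`, lapply(seq_len(m), fitted_j))

rss_before <- sum((y - total_fit())^2)
for (sweep_id in 1:10) {
  for (j in seq_len(m)) {
    # Partial residuals: remove the contribution of every tessellation except j.
    partial_resid <- y - (total_fit() - fitted_j(j))
    cell_factor <- factor(cells[[j]], levels = seq_len(nrow(tess[[j]])))
    cell_means <- tapply(partial_resid, cell_factor, mean)
    cell_means[is.na(cell_means)] <- 0                       # empty cells keep output 0
    values[[j]] <- as.numeric(cell_means)
  }
}
rss_after <- sum((y - total_fit())^2)
```

Each inner update is the least-squares optimum for tessellation j given the others, so the residual sum of squares cannot increase from one update to the next.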
The function can handle multiple types of covariates, including continuous,
spherical and categorical. Categorical covariates are automatically detected
and one-hot encoded, with the first level of each categorical variable used
as the reference category. The catScaling parameter allows control over
the weight of categorical differences in distance calculations. For spherical
covariates, the function assumes that the final spherical dimension
corresponds to the polar angle, which has a range of 0 to 2*pi. The metric
parameter can be used to specify the type of each covariate (Euclidean,
Spherical, or Categorical), and the members parameter can indicate
membership of covariates into different subspaces when using multiple spheres
in covariate space.
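As a rough illustration of the encoding described above, the base R sketch below builds d-1 indicator columns with the first factor level as the reference. The helper `encode_categorical` is hypothetical, and multiplying the indicators by the scaling value is an assumption about how catScaling enters the distance calculation, not a description of the package internals.

```r
# d-1 indicator encoding with the first level as reference (illustration only).
encode_categorical <- function(f, cat_scaling = 1) {
  f <- as.factor(f)
  lv <- levels(f)
  # One column per non-reference level; reference rows are all zero.
  ind <- vapply(lv[-1], function(l) as.numeric(f == l), numeric(length(f)))
  cat_scaling * ind
}

grp3 <- c("low", "mid", "high", "low")
enc <- encode_categorical(grp3, cat_scaling = 2)
```

Because character vectors are converted to factors with alphabetically sorted levels, "high" is the reference level here and the two columns correspond to "low" and "mid".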

    AddiVortes(
      y, x, m = 200, totalMCMCIter = 1200, mcmcBurnIn = 200,
      nu = 6, q = 0.85, k = 3, sd = 0.8,
      Omega = min(3, ncol(x)), LambdaRate = 25,
      InitialSigma = "Linear", thinning = 1,
      metric = "E", members = NULL, catScaling = 1,
      showProgress = interactive()
    )


| Argument | Description |
|---|---|
| y | A vector of the output values. |
| x | A matrix or data frame of the covariates. Character and factor columns are treated as categorical variables and automatically converted to d-1 binary indicator variables via one-hot encoding (with the first level as reference). |
| m | The number of tessellations. |
| totalMCMCIter | The number of iterations. |
| mcmcBurnIn | The number of burn in iterations. |
| nu | The degrees of freedom. |
| q | The quantile. |
| k | The number of centres. |
| sd | The standard deviation used in centre proposals. |
| Omega | Omega/(number of covariates) is the prior probability of adding a dimension. |
| LambdaRate | The rate of the Poisson distribution for the number of centres. |
| InitialSigma | The method used to calculate the initial variance. |
| thinning | The thinning rate. |
| metric | Either "E" (Euclidean, default), "S" (Spherical), or "C" (Categorical). |
| members | If needed, indicates membership of covariates into different subspaces (needed if using multiple spheres in covariate space). Default |
| catScaling | Numeric scalar controlling the scale of binary indicator variables created from categorical covariates. Each binary indicator takes values 0 (reference level) or |
| showProgress | Logical; if TRUE, progress bars and messages are shown during fitting. |

An AddiVortes object containing the posterior samples of the tessellations, dimensions and predictions.

    # Simple example with simulated data
    set.seed(123)
    x <- matrix(rnorm(50), 10, 5)
    y <- rnorm(10)

    # Fit model with reduced iterations for quick example
    fit <- AddiVortes(y, x, m = 5, totalMCMCIter = 50, mcmcBurnIn = 10)

    # Larger example with categorical covariates (d=2 and d=3) and a test set
    set.seed(456)
    n_train <- 200
    n_test <- 50
    x_train <- data.frame(
      x1 = rnorm(n_train),
      x2 = runif(n_train),
      grp2 = sample(c("A", "B"), n_train, replace = TRUE),
      grp3 = sample(c("low", "mid", "high"), n_train, replace = TRUE)
    )
    y_train <- x_train$x1 + ifelse(x_train$grp2 == "B", 1, 0) +
      rnorm(n_train, sd = 0.5)
    fit2 <- AddiVortes(y_train, x_train,
      m = 10, totalMCMCIter = 200, mcmcBurnIn = 50,
      catScaling = 1, showProgress = FALSE
    )
    x_test <- data.frame(
      x1 = rnorm(n_test),
      x2 = runif(n_test),
      grp2 = sample(c("A", "B"), n_test, replace = TRUE),
      grp3 = sample(c("low", "mid", "high"), n_test, replace = TRUE)
    )
    y_test <- x_test$x1 + ifelse(x_test$grp2 == "B", 1, 0) +
      rnorm(n_test, sd = 0.5)
    preds <- predict(fit2, x_test, showProgress = FALSE)
    test_rmse <- sqrt(mean((y_test - preds)^2))

The Boston Housing Dataset, derived from information from the U.S. Census Service.
Boston
A data.frame with 506 rows and 14 columns:
The covariates in the data
The response variable
Harrison, D. and Rubinfeld, D.L. 'Hedonic prices and the demand for clean air', J. Environ. Economics & Management, vol. 5, 81-102, 1978
For a given tessellation, this function identifies which cell (centre) each observation belongs to based on nearest neighbour classification.

    cellIndices(x, tess, dim, metric = "E", members)


| Argument | Description |
|---|---|
| x | A numeric matrix of covariates where each row is an observation. |
| tess | A numeric matrix representing the tessellation centres, where each row is a unique centre. |
| dim | An integer vector specifying the column indices of |
| metric | Either "Euclidean" or "Spherical". |
| members | A vector indicating covariate membership. |

It finds the closest tessellation centre for each observation (row) in the covariate matrix, considering only the specified dimensions. This is achieved using the k-nearest neighbour algorithm where k=1.
A numeric vector of integers where each element corresponds to a row
in x and its value is the row index of the nearest centre in tess.
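The nearest-centre assignment can be sketched in base R for the Euclidean case. The helper `nearest_centre_index` below is hypothetical and is not the package's implementation; it assumes that tess has one column per entry of dim, in the same order.

```r
# 1-nearest-neighbour cell assignment on a subset of covariate columns.
nearest_centre_index <- function(x, tess, dim) {
  xs <- x[, dim, drop = FALSE]
  # Squared Euclidean distance from every observation to every centre.
  d2 <- vapply(seq_len(nrow(tess)), function(j) {
    rowSums(sweep(xs, 2, tess[j, ])^2)
  }, numeric(nrow(xs)))
  d2 <- matrix(d2, nrow = nrow(xs))
  max.col(-d2, ties.method = "first")   # index of the smallest distance per row
}

x <- rbind(c(0.1, 9, 0.2), c(0.9, 9, 0.8), c(0.5, 9, 0.1))
tess <- rbind(c(0, 0), c(1, 1))         # two centres in the (column 1, column 3) subspace
idx <- nearest_centre_index(x, tess, dim = c(1, 3))
```

Column 2 of x is ignored because it is not listed in dim, so the three observations are assigned to centres 1, 2 and 1.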
Synthetic data generated for cylindrical parameters. The response is given by the function f(z, theta) = sin(z)cos(theta) + z*sin(theta)^2, with added noise.
Cylinder
A data.frame with 400 rows and 3 columns:
Height
Polar angle
Response
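Data of a similar shape can be simulated with the short sketch below. The response function is the one stated above; the range of z, the noise standard deviation, the seed and the column names are assumptions made for illustration, so the result will not reproduce the shipped dataset.

```r
# Simulate cylinder-style data: height z, polar angle theta, noisy response y.
set.seed(42)
n <- 400
f <- function(z, theta) sin(z) * cos(theta) + z * sin(theta)^2

z <- runif(n, 0, 5)                     # assumed height range
theta <- runif(n, 0, 2 * pi)            # polar angle in [0, 2*pi)
y <- f(z, theta) + rnorm(n, sd = 0.1)   # assumed noise level

cyl <- data.frame(z = z, theta = theta, y = y)
```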
A constructor for the AddiVortes class.

    new_AddiVortes(
      posteriorTess, posteriorDim, posteriorSigma, posteriorPred,
      xCentres, xRanges, yCentre, yRange, inSampleRmse,
      metric = "E", members = rep(1, length(xCentres)),
      metric_aug = "E", member_aug = rep(1, length(xCentres)),
      catEncoding = NULL
    )


| Argument | Description |
|---|---|
| posteriorTess | A list of the posterior samples of the tessellations. |
| posteriorDim | A list of the posterior samples of the dimensions. |
| posteriorSigma | A list of the posterior samples of the error variance. |
| posteriorPred | A list of the posterior samples of the predictions. |
| xCentres | The centres of the covariates. |
| xRanges | The ranges of the covariates. |
| yCentre | The centre of the output values. |
| yRange | The range of the output values. |
| inSampleRmse | The in-sample RMSE. |
| metric | The metric used for scaling covariates (default "E" for Euclidean). |
| members | The membership vector for the covariates |
| metric_aug | The augmented metric after categorical variables are converted to one-hot |
| member_aug | The membership vector corresponding to metric_aug |
| catEncoding | Optional list of categorical encoding metadata returned by |

An object of class AddiVortes.
Generates comprehensive diagnostic plots for a fitted AddiVortes object.
This function creates multiple diagnostic plots including residuals,
MCMC traces for sigma, and tessellation complexity over iterations.

    ## S3 method for class 'AddiVortes'
    plot(
      x, x_train, y_train, sigma_trace = NULL,
      which = c(1, 2, 3), ask = FALSE, ...
    )


| Argument | Description |
|---|---|
| x | An object of class |
| x_train | A matrix of the original training covariates. |
| y_train | A numeric vector of the original training true outcomes. |
| sigma_trace | An optional numeric vector of sigma values from MCMC samples. If not provided, the method will attempt to extract it from the model object. |
| which | A numeric vector specifying which plots to generate: 1 = Residuals plot, 2 = Sigma trace, 3 = Tessellation complexity trace, 4 = Predicted vs Observed. Default is c(1, 2, 3). |
| ask | Logical; if TRUE, the user is asked to press Enter before each plot. |
| ... | Additional arguments passed to plotting functions. |

The function generates up to four diagnostic plots:
Residuals Plot: Residuals vs fitted values with smoothed trend line
Sigma Trace: MCMC trace plot for the error variance parameter
Tessellation Complexity: Trace of average tessellation size over iterations
Predicted vs Observed: Scatter plot with credible intervals
This function is called for its side effect of creating plots and returns
NULL invisibly.
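The quantity behind the tessellation complexity trace can be approximated by hand. The sketch below assumes the posterior tessellation samples are stored as a list over iterations, each holding a list of centre matrices (one per tessellation, one row per centre). That layout is an assumption for illustration and may differ from the internal structure of an AddiVortes object, so a toy object is simulated here.

```r
# Average number of centres per tessellation at each stored iteration.
set.seed(7)
n_iter <- 20
m <- 5

# Toy stand-in for posterior tessellation samples (assumed layout).
posterior_tess <- lapply(seq_len(n_iter), function(i) {
  lapply(seq_len(m), function(j) {
    n_centres <- sample(1:6, 1)
    matrix(runif(n_centres * 2), ncol = 2)
  })
})

avg_centres <- vapply(posterior_tess, function(tess_list) {
  mean(vapply(tess_list, nrow, integer(1)))
}, numeric(1))

# A trace plot could then be drawn with:
# plot(avg_centres, type = "l", xlab = "Iteration", ylab = "Average centres")
```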

    ## Not run:
    # Assuming 'fit' is a trained AddiVortes object
    plot(fit, x_train = x_train_data, y_train = y_train_data)

    # Show only specific plots
    plot(fit, x_train = x_train_data, y_train = y_train_data, which = c(1, 3))

    # With custom sigma trace
    plot(fit,
      x_train = x_train_data, y_train = y_train_data,
      sigma_trace = my_sigma_samples
    )
    ## End(Not run)

Predicts outcomes for new data using a fitted AddiVortes model object.
It can return mean predictions, quantiles and optionally calculate the
Root Mean Squared Error (RMSE) if true outcomes are provided.

    ## S3 method for class 'AddiVortes'
    predict(
      object, newdata,
      type = c("response", "quantile"),
      quantiles = c(0.025, 0.975),
      interval = c("credible", "prediction"),
      showProgress = interactive(),
      parallel = TRUE, cores = NULL, ...
    )


| Argument | Description |
|---|---|
| object | An object of class |
| newdata | A matrix of covariates for the new test set. The number of columns must match the original training data. |
| type | The type of prediction required. The default |
| quantiles | A numeric vector of probabilities to compute for the predictions when |
| interval | The type of interval calculation. The default |
| showProgress | Logical; if TRUE, a progress bar is shown during prediction. |
| parallel | Logical; if TRUE (default), predictions are computed in parallel. |
| cores | The number of CPU cores to use for parallel processing. If NULL (default), it defaults to one less than the total number of available cores. |
| ... | Further arguments passed to or from other methods (currently unused). |

This function relies on the internal helper function applyScaling_internal
being available in the environment, which is used by the main
AddiVortes function.
When interval = "prediction" and type = "quantile", the function samples
additional Gaussian noise with variance equal to the sampled sigma squared
from the posterior. This accounts for the inherent variability in individual
predictions, not just uncertainty in the mean function. The noise is added
in the scaled space before unscaling predictions.
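The difference between the two interval types can be seen in a small base R sketch that works on simulated posterior draws rather than on a fitted model. The objects `mean_draws` and `sigma_draws` are hypothetical stand-ins for posterior samples of the mean function and of sigma, and the scaling and unscaling step mentioned above is omitted.

```r
# Credible vs prediction intervals from posterior draws (illustration only).
set.seed(99)
n_draws <- 2000
n_new <- 3

# Rows are posterior draws, columns are new observations.
mean_draws <- matrix(
  rnorm(n_draws * n_new, mean = rep(c(0, 1, 2), each = n_draws), sd = 0.2),
  n_draws, n_new
)
sigma_draws <- runif(n_draws, 0.4, 0.6)   # one sigma per posterior draw

# Add Gaussian noise with the sampled sigma to every draw of the mean.
noise <- matrix(rnorm(n_draws * n_new), n_draws, n_new) * sigma_draws
noisy_draws <- mean_draws + noise

probs <- c(0.025, 0.975)
credible <- t(apply(mean_draws, 2, quantile, probs = probs))
prediction <- t(apply(noisy_draws, 2, quantile, probs = probs))
```

Both results have one row per new observation and one column per quantile, and the prediction intervals are wider because they include the observation noise.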
If type = "response", a numeric vector of mean predictions.
If type = "quantile", a matrix where each row corresponds to an observation
in newdata and each column to a quantile.

    # Fit a model
    set.seed(123)
    X <- matrix(rnorm(100), 20, 5)
    Y <- rnorm(20)
    fit <- AddiVortes(Y, X, m = 5, totalMCMCIter = 50, mcmcBurnIn = 10)

    # New data for prediction
    X_new <- matrix(rnorm(25), 5, 5)

    # Mean predictions
    pred_mean <- predict(fit, X_new, type = "response")

    # Credible intervals (uncertainty in mean only)
    pred_conf <- predict(fit, X_new,
      type = "quantile",
      interval = "credible", quantiles = c(0.025, 0.975)
    )

    # Prediction intervals (includes error variance)
    pred_pred <- predict(fit, X_new,
      type = "quantile",
      interval = "prediction", quantiles = c(0.025, 0.975)
    )

    # Prediction intervals are wider than credible intervals
    mean(pred_pred[, 2] - pred_pred[, 1]) > mean(pred_conf[, 2] - pred_conf[, 1])

Prints a summary of a fitted AddiVortes object, providing information
about the model structure, dimensions, and fit quality similar to the
output of a linear model summary.

    ## S3 method for class 'AddiVortes'
    print(x, ...)


| Argument | Description |
|---|---|
| x | An object of class |
| ... | Further arguments passed to or from other methods (currently unused). |

The print method displays:
The model formula representation
Number of covariates and posterior samples
Number of tessellations used
In-sample RMSE
Covariate scaling information
The function is called for its side effect of printing model information
and returns the input object x invisibly.
Provides a detailed summary of a fitted AddiVortes object, including
more comprehensive information than the print method.

    ## S3 method for class 'AddiVortes'
    summary(object, ...)


| Argument | Description |
|---|---|
| object | An object of class |
| ... | Further arguments passed to or from other methods (currently unused). |

The function is called for its side effect of printing detailed model
information and returns its input, object, invisibly.
A subset of data collected as part of the GES DISC datasets for calibrated brightness temperatures.
Weather
A data.frame with 2000 rows and 3 columns:
Polar angle
Azimuthal angle
Response variable
https://disc.gsfc.nasa.gov/datasets/GPM_1CGPMGMI_07/summary?keywords=10.5067%2FGPM%2FGMI%2FGPM%2F1C%2F07