--- title: "Modelling Spherical Data with AddiVortes" author: "Andy Iskauskas and John Paul Gosling" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Modelling Spherical Data with AddiVortes} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` This vignette demonstrates how to use `AddiVortes` with **spherical data**, such as measurements taken at geographic locations on the surface of a globe. Spherical data requires a specialised distance measure because the usual Euclidean distance does not respect the geometry of a curved surface — in particular it does not handle the wrap-around at the poles and the date line correctly. ### 1. What is Spherical Data? Spherical data are observations whose locations lie on the surface of a sphere. A familiar example is geographic data specified by **latitude** and **longitude**. Two important properties make this different from ordinary planar data: 1. **Wrap-around**: Longitude values near −180° and +180° are actually very close to each other on the globe, but their Euclidean distance is large. A good distance metric must recognise this. 2. **Curvature**: Distances should follow great-circle arcs (the shortest path along the sphere's surface) rather than straight lines. `AddiVortes` uses the **great-circle distance** (also known as the geodesic distance or the haversine-type formula) when `metric = "S"` is specified. This ensures that points near the poles and near the date line are treated as neighbours when they are genuinely close on the sphere. ### 2. Coordinate Convention `AddiVortes` expects spherical coordinates in **radians**, following this convention: - **Latitude-type dimensions** (all spherical dimensions except the last): values in [−π/2, π/2]. - **Longitude / azimuthal dimension** (the *last* spherical column): values in [−π, π]. For a single-surface model (a globe), you need exactly two columns: ``` x <- cbind(latitude_rad, longitude_rad) ``` with `metric = "S"`. The model will use the spherical (great-circle) distance between all pairs of tessellation centres and observations. If your data combine spherical and non-spherical covariates, you can pass a vector of metric indicators, e.g. `metric = c(1, 1, 0)` for two spherical dimensions followed by one Euclidean dimension. ### 3. Generating Synthetic Spherical Data We simulate a dataset of 300 observations distributed across the surface of a globe. Each observation has a latitude and a longitude, and the response variable represents a smoothly varying quantity — here, a synthetic "temperature" that is warmest at the equator and varies with longitude. ```{r simulate_data, message=FALSE, warning=FALSE} library(AddiVortes) set.seed(42) n <- 300 # Sample random locations on the globe lat <- runif(n, -pi / 2, pi / 2) # latitude in radians: [-pi/2, pi/2] lon <- runif(n, -pi, pi) # longitude in radians: [-pi, pi] # True function: warmer at the equator, slight east-west gradient y_true <- 20 * cos(lat) + 5 * sin(lon) # Add observation noise y <- y_true + rnorm(n, sd = 2) # Covariate matrix: latitude first, longitude last (required convention) x <- cbind(lat, lon) ``` The plot below shows the spatial distribution of the simulated response values: ```{r plot_data, fig.width=7, fig.height=4, fig.align='center'} # Colour scale for the response cols <- colorRampPalette(c("blue", "white", "red"))(100) col_index <- cut(y, breaks = 100, labels = FALSE) # Convert radians to degrees for a readable plot lat_deg <- lat * 180 / pi lon_deg <- lon * 180 / pi plot(lon_deg, lat_deg, col = cols[col_index], pch = 19, cex = 0.7, xlab = "Longitude (degrees)", ylab = "Latitude (degrees)", main = "Simulated Response on the Globe" ) legend("bottomleft", legend = c("High", "Low"), col = c("red", "blue"), pch = 19, bty = "n" ) ``` ### 4. Fitting the Spherical Model We fit `AddiVortes` with `metric = "S"` to tell the model that both columns are spherical coordinates. The model will then use great-circle distance when assigning observations to Voronoi cells. ```{r fit_spherical, results='hide'} fit_sph <- AddiVortes( y = y, x = x, m = 50, totalMCMCIter = 500, mcmcBurnIn = 100, metric = "S", # use great-circle distance for all columns showProgress = FALSE ) ``` ```{r rmse_spherical} cat("In-sample RMSE (spherical metric):", round(fit_sph$inSampleRmse, 3), "\n") ``` For comparison, we fit an identical model using the default **Euclidean** metric. This model treats latitude and longitude as ordinary continuous variables and will not handle the wrap-around near ±π correctly. ```{r fit_euclidean, results='hide'} fit_euc <- AddiVortes( y = y, x = x, m = 50, totalMCMCIter = 500, mcmcBurnIn = 100, metric = "E", # Euclidean distance (default) showProgress = FALSE ) ``` ```{r rmse_euclidean} cat("In-sample RMSE (Euclidean metric):", round(fit_euc$inSampleRmse, 3), "\n") ``` ### 5. Out-of-Sample Evaluation We generate a held-out test set and compare the two models. We also include a group of observations near the date line (longitude ≈ ±π) to highlight where the Euclidean metric is most likely to struggle. ```{r test_data} set.seed(101) n_test <- 200 lat_test <- runif(n_test, -pi / 2, pi / 2) lon_test <- runif(n_test, -pi, pi) y_true_test <- 20 * cos(lat_test) + 5 * sin(lon_test) y_test <- y_true_test + rnorm(n_test, sd = 2) x_test <- cbind(lat_test, lon_test) ``` ```{r predictions, results='hide'} preds_sph <- predict(fit_sph, x_test, showProgress = FALSE) preds_euc <- predict(fit_euc, x_test, showProgress = FALSE) ``` ```{r compare_rmse} rmse_sph <- sqrt(mean((y_test - preds_sph)^2)) rmse_euc <- sqrt(mean((y_test - preds_euc)^2)) cat("Test RMSE — spherical metric:", round(rmse_sph, 3), "\n") cat("Test RMSE — Euclidean metric:", round(rmse_euc, 3), "\n") ``` ### 6. Visualising Predictions The plot below shows predicted versus observed values for both models on the test set. A well-calibrated model should have points lying close to the diagonal. ```{r plot_predictions, fig.width=7, fig.height=5, fig.align='center'} y_range <- range(c(y_test, preds_sph, preds_euc)) plot(y_test, preds_sph, pch = 19, col = "darkblue", cex = 0.7, xlab = "Observed values", ylab = "Predicted values", main = "Spherical vs. Euclidean Metric: Predicted vs. Observed", xlim = y_range, ylim = y_range ) points(y_test, preds_euc, pch = 4, col = "darkred", cex = 0.7) abline(0, 1, lwd = 2, lty = 2, col = "grey40") legend("topleft", legend = c("Spherical metric", "Euclidean metric", "y = x"), col = c("darkblue", "darkred", "grey40"), pch = c(19, 4, NA), lty = c(NA, NA, 2), lwd = c(NA, NA, 2), bty = "n" ) ``` ### 7. Multiple Spherical Covariates If the dataset contains covariates whose embedding is on separate n-spheres, then these must be treated separately in `AddiVortes`. We augment the data above to illustrate this. ```{r augmented-data} lat_prime1 <- runif(n, -pi/2, pi/2) lat_prime2 <- runif(n, -pi/2, pi/2) lon_prime <- runif(n, -pi, pi) y_prime_true <- 20 * cos(lat) + 5 * sin(lon) - 5 * cos(lat_prime1) * sin(lat_prime2) + 10*sin(lon_prime)^2 * sin(abs(lat_prime1 - lat_prime2)) y_prime <- y_prime_true + rnorm(n, sd = 3) x_prime <- cbind(lat, lon, lat_prime1, lat_prime2, lon_prime) ``` If we try to use `AddiVortes` as before, we will encounter an error, since there are two azimuthal coordinates in the covariates. ```{r fit_multi_spherical_error, error=TRUE} fit_sph_new <- AddiVortes( y = y_prime, x = x_prime, m = 50, totalMCMCIter = 500, mcmcBurnIn = 100, metric = "S", # use great-circle distance for all columns showProgress = FALSE ) ``` To recitfy this, we provide `AddiVortes` with an additional argument, `members`, which indicates which coordinates belong to which space: in this case we give the first two coordinates membership 1, and the other three membership 2. ```{r fit_multi_spherical, results='hide'} fit_sph_new <- AddiVortes( y = y_prime, x = x_prime, m = 50, totalMCMCIter = 500, mcmcBurnIn = 100, metric = "S", # use great-circle distance for all columns, members = rep(1:2, times = c(2,3)), # indicate membership showProgress = FALSE ) ``` We follow the same process as before; fitting with Euclidean coordinates, and comparing the results relative to a hold-out test set. ```{r multi_sphere_euc, results='hide'} fit_euc_new <- AddiVortes( y = y_prime, x = x_prime, m = 50, totalMCMCIter = 500, mcmcBurnIn = 100, metric = "E", # Euclidean distance (default) showProgress = FALSE ) ``` ```{r multi_sph_compare, fig.width=7, fig.height=5, fig.align='center'} ## Set up test data lat_prime_test1 <- runif(n_test, -pi/2, pi/2) lat_prime_test2 <- runif(n_test, -pi/2, pi/2) lon_prime_test <- runif(n_test, -pi, pi) y_prime_test_true <- 20 * cos(lat_test) + 5 * sin(lon_test) - 5 * cos(lat_prime_test1) * sin(lat_prime_test2) + 10*sin(lon_prime_test)^2 * sin(abs(lat_prime_test1 - lat_prime_test2)) y_prime_test <- y_prime_test_true + rnorm(n_test, sd = 3) x_prime_test <- cbind(lat_test, lon_test, lat_prime_test1, lat_prime_test2, lon_prime_test) ## Make predictions preds_prime_sph <- predict(fit_sph_new, x_prime_test, showProgress = FALSE) preds_prime_euc <- predict(fit_euc_new, x_prime_test, showProgress = FALSE) rmse_prime_sph <- sqrt(mean((y_prime_test - preds_prime_sph)^2)) rmse_prime_euc <- sqrt(mean((y_prime_test - preds_prime_euc)^2)) cat("Test RMSE — spherical metric:", round(rmse_prime_sph, 3), "\n") cat("Test RMSE — Euclidean metric:", round(rmse_prime_euc, 3), "\n") y_prime_range <- range(c(y_prime_test, preds_prime_sph, preds_prime_euc)) plot(y_prime_test, preds_prime_sph, pch = 19, col = "darkblue", cex = 0.7, xlab = "Observed values", ylab = "Predicted values", main = "Spherical vs. Euclidean Metric: Predicted vs. Observed", xlim = y_prime_range, ylim = y_prime_range ) points(y_prime_test, preds_prime_euc, pch = 4, col = "darkred", cex = 0.7) abline(0, 1, lwd = 2, lty = 2, col = "grey40") legend("topleft", legend = c("Spherical metric", "Euclidean metric", "y = x"), col = c("darkblue", "darkred", "grey40"), pch = c(19, 4, NA), lty = c(NA, NA, 2), lwd = c(NA, NA, 2), bty = "n" ) ``` ### 8. Summary of Key Points - Set `metric = "S"` (or `metric = "Spherical"`) to use the great-circle distance for **all** covariate columns. - For datasets that mix spherical and non-spherical covariates, pass a vector of 0s and 1s, e.g. `metric = c(1, 1, 0)` for two spherical dimensions and one Euclidean dimension. - For datasets with covariates taken from multiple different spherical coordinate systems, pass a vector of membership to indicate which belongs to which. For example, for five covariates where the first three are from S3 and the remaining two are from S2, `members = rep(1:2, times = c(3,2))`. - Spherical coordinates must be in **radians**: latitude in [−π/2, π/2] and longitude in [−π, π]. - The **last** spherical column for each set is treated as the azimuthal (longitude) dimension; all other spherical columns are treated as polar (latitude) dimensions. - The great-circle distance correctly handles wrap-around, so points near longitude ±π are recognised as neighbours.