SCRIPT PART 2 of 3. This script, “field-vs-lab-doses-r-2”, is the second of three used for this project and deals with the analysis. The first script, “field-vs-lab-doses-r-1”, tidies the input databases and combines them into a single data frame for analysis; the third deals with a specific subset for Quality Assurance and Quality Control (QA/QC) analysis. If you download the data files for this project from the Open Science Framework (https://osf.io/h6cde/, DOI 10.17605/OSF.IO/H6CDE), you can run this script independently.
ARTICLE This R script is associated with Martin et al (2025) “Aligning Behavioural Ecotoxicology Tests with Real-World Water Concentrations: Current Minimum Tested Levels Far Exceed Environmental Reality” DOI: <to be added>
AUTHORS
Jake M. Martin 1,2,3*
Erin S. McCallum 1
Jack A. Brand 2,4
AFFILIATIONS
(1) School of Life and Environmental Sciences, Deakin University,
Geelong, Australia
(2) Department of Wildlife, Fish, and Environmental Studies, Swedish
University of Agricultural Sciences, Umeå, Sweden
(3) School of Biological Sciences, Monash University, Melbourne,
Australia
(4) Institute of Zoology, Zoological Society of London, London, United
Kingdom
(*) Corresponding author
AIM
The aim of this study is two-fold: (1) to identify whether tested
concentrations in behavioural ecotoxicology studies reflect those
detected in surface waters and/or wastewater; and (2) to compare the
target pharmaceuticals tested in behavioural ecotoxicology with those
that have been detected in the environment, highlighting potential
targets for future investigation.
METHODS
To achieve these aims, this study capitalises on four extensive open
access databases: one focuses on peer-reviewed research on the
behavioural ecotoxicology of pharmaceuticals¹, and the
remaining three focus on pharmaceutical environmental occurrence and
concentrations²,³,⁴.
SCRIPT
This is an R Markdown script written in RStudio (2023.09.0+463 “Desert
Sunflower” release). The ‘field-vs-lab-doses’ git repository hosts this
script, and if downloaded (or pulled) it will reproduce all data
tidying/filtering, analysis, and visualisations used in Martin et al (2025).
The GitHub repository includes all raw input data from the four open
access databases, as well as all output data and figures. I have tried
to structure this code with Open, Reliable, and Transparent (ORT) coding
practices in mind. Feel free to reach out if anything is unclear.
The R script first cleans and filters each database to make them comparable. In terms of filtering, we target pharmaceutical compounds, surface waters/wastewater, and environmental/lab samples measured in units of mass per volume of water (e.g. µg/L); a rough sketch of this style of filter is shown below.
We then summarise occurrence and concentrations between EIPAAB and each of the three environmental databases (PHARMS-UBA, Wilkinson, and NORMAN) separately, to highlight the consistency of the different environmental databases.
Lastly, we combine all environmental datasets for an overall assessment of occurrence and concentrations between EIPAAB and the environmental data.
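For illustration only, the filtering in script one looks roughly like the sketch below. Here matrix_group and source are real columns used later in this script; the other column names are hypothetical stand-ins, as the actual filtering happens in “field-vs-lab-doses-r-1”.
raw_databases %>% # hypothetical name for the pre-filter data frame
dplyr::filter(
compound_class == "pharmaceutical", # hypothetical column
matrix_group %in% c("surfacewater", "effluent"),
units_standardised == "ug/L" # hypothetical column
)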
DISCLAIMER
I (Jake Martin) am dyslexic. I have made an effort to review the script
for grammatical errors, but some will likely remain. I apologise. Please
reach out via the contact details below if anything is unclear.
🐟 Jake M. Martin
📧 Email: jake.martin@deakin.edu.au
📧 Alt Email: jake.martin.research@gmail.com
🌐 Web: jakemartin.org
🐙 GitHub: JakeMartinResearch
Here we define our knit settings, to make the output more user friendly and to cache output for faster knitting.
# knitr settings
knitr::opts_chunk$set(
message = FALSE,
warning = FALSE, # no warnings
cache = TRUE, # caching to save time when knitting
cache.lazy = FALSE, # lazy caching causes issues because the files are very large
tidy = TRUE
)
# ---- Install pacman if it's not already installed ----
if (!requireNamespace("pacman", quietly = TRUE)) install.packages("pacman")
# ---- List of required packages ----
pkgs <- c(
# ----- Data Visualisation -----
"ggthemes", "bayesplot", "gt", "gtsummary", "plotly", "qqplotr", "ggrepel",
"colorspace", "flextable", "officer", "scales",
# ----- Tidy Data and Wrangling -----
"tidyverse", "janitor", "readxl", "broom.mixed", "data.table", "devtools", "scales", "glue",
# ----- Modelling and Statistical Analysis -----
"brms", "rstan", "cmdstanr", "marginaleffects", "performance", "emmeans",
"tidybayes","future", "coda"
)
# ---- Install and load all packages using pacman ----
suppressPackageStartupMessages(
pacman::p_load(char = pkgs, install = TRUE)
)
Here’s a list of the package names
pkgs
## [1] "ggthemes" "bayesplot" "gt" "gtsummary"
## [5] "plotly" "qqplotr" "ggrepel" "colorspace"
## [9] "flextable" "officer" "scales" "tidyverse"
## [13] "janitor" "readxl" "broom.mixed" "data.table"
## [17] "devtools" "scales" "glue" "brms"
## [21] "rstan" "cmdstanr" "marginaleffects" "performance"
## [25] "emmeans" "tidybayes" "future" "coda"
Here are some custom functions used within this script.
sentence_case()
Changes a string into sentence case.
sentence_case <- function(text) {
text <- tolower(text) # Convert the entire text to lowercase
text <- sub("^(\\s*\\w)", "\\U\\1", text, perl = TRUE) # Capitalise the first letter
return(text)
}
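For example:
sentence_case("FLUOXETINE HYDROCHLORIDE") # example call (not in the original script); returns "Fluoxetine hydrochloride"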
Here we define the directories for the project. We also make the output directories if they don’t already exist.
These are the input directories for the databases used in this project.
📥 data_wd: Directory for the input data.
wd <- getwd() # getwd tells us what the current wd is, we are using this to drop it in a variable called wd
data_wd <- paste0(wd, "./data") # creates a variable with the name of the wd we want to use
fig_wd: Directory for the output figures.
fig_wd <- paste0(wd, "./fig")
if (!dir.exists("fig")) {
dir.create("fig")
}
mods_wd: Directory for the output models.
mods_wd <- paste0(wd, "./mods")
if (!dir.exists("mods")) {
dir.create("mods")
}
This is the database made in part one of this project,
martin-field-vs-lab-all-databases.csv. It can be downloaded
from the Open Science Framework (OSF) project page “Aligning
Behavioural Ecotoxicology with Real-World Water Concentrations” by Jake
M. Martin, Jack A. Brand, and Erin S. McCallum [LINK https://osf.io/h6cde/].
You can cite the database as:
Martin JM, Brand JA, & McCallum E (2025). Aligning Behavioural Ecotoxicology with Real-World Water Concentrations [martin-field-vs-lab-all-databases.csv]. https://doi.org/10.17605/OSF.IO/H6CDE
all_databases <- fread(paste0(data_wd, "./martin-field-vs-lab-all-databases.csv"))
If you use the collated database, please also cite:
Wilkinson JL, Boxall ABA, Kolpin DW, Leung KMY, Lai RWS, Wong D, et al. Pharmaceutical pollution of the world’s rivers. Proceedings of the National Academy of Sciences. 2022;119:1–10. DOI: 10.1073/pnas.2113947119.
NORMAN Network. NORMAN EMPODAT Database - Chemical Occurrence Data [Internet]. Available from: https://www.norman-network.com/nds/empodat
aus der Beek T, Weber FA, Bergmann A, Hickmann S, Ebert I, Hein A, et al. Pharmaceuticals in the environment: global occurrences and perspectives. Environ Toxicol Chem. 2016;35:823–35. DOI: 10.1002/etc.3339.
Let’s see how much data is present in the filtered and combined databases.
total_compounds_eipaab <- all_databases %>%
dplyr::filter(source == "eipaab") %>%
nrow(.)
total_water_samples <- all_databases %>%
dplyr::filter(source != "eipaab") %>%
dplyr::reframe(n = sum(n_units_est, na.rm = TRUE)) %>%
dplyr::pull(n)
glue::glue("There are {total_compounds_eipaab}, behavioural ecotoxicology exposures (compounds by species) and {total_water_samples}, water samples.")
## There are 767, behavioural ecotoxicology exposures (compounds by species) and 10009385, water samples.
all_databases %>%
dplyr::filter(value < 0)
## Empty data.table (0 rows and 31 cols): compound_cas_corrected,compound_name_corrected,source,compound_cas,compound_name,concentration_type...
How many water samples are surface water vs wastewater?
all_databases %>%
dplyr::filter(source != "eipaab") %>%
dplyr::group_by(matrix_group) %>%
dplyr::reframe(n = sum(n_units_est, na.rm = TRUE)) %>%
flextable() %>%
font(fontname = "Arial", part = "all")
| matrix_group | n |
| --- | --- |
| effluent | 281,752 |
| surfacewater | 9,727,633 |
A breakdown of the data contributed by each data source. In the manuscript this is Table 1.
Table 1: The number of pharmaceutical compounds within each of the four datasets (EIPAAB, NORMAN, PHARMS-UBA, and Wilkinson) used for this investigation, and the number of individual data rows within each database (i.e. tests (a) for EIPAAB, and water samples (b) for the environmental occurrence databases).
soruce_data <- all_databases %>%
dplyr::group_by(source) %>%
dplyr::reframe(Data = sum(n_units_est, na.rm = TRUE), `Number of compounds` = length(unique(compound_name_corrected))) %>%
dplyr::rename(Source = source) %>%
dplyr::mutate(Source = case_when(Source == "eipaab" ~ "EIPAAB", Source == "norman" ~
"NORMAN‡", Source == "uba" ~ "PHARMS-UBA", Source == "wilkson" ~ "Wilkson et al.",
))
total_data <- all_databases %>%
dplyr::reframe(Data = sum(n_units_est, na.rm = TRUE), `Number of compounds` = length(unique(compound_name_corrected))) %>%
dplyr::mutate(Source = "Total")
n_summary <- rbind(soruce_data, total_data)
table_1 <- n_summary %>%
mutate(Data = round(Data, 0)) %>%
flextable() %>%
align(align = "center", part = "all") %>%
autofit()
table_1
| Source | Data | Number of compounds |
| --- | --- | --- |
| EIPAAB | 767 | 184 |
| NORMAN‡ | 9,382,388 | 1,379 |
| PHARMS-UBA | 562,865 | 911 |
| Wilkinson et al. | 64,132 | 61 |
| Total | 10,010,152 | 1,760 |
Note: (a) each row of the EIPAAB database represents a unique species-by-compound exposure; (b) each row within the NORMAN and PHARMS-UBA databases represents one or more samples, as a row can be a measure of central tendency (e.g. mean or median) or a single value; all rows in the Wilkinson database represent a single sample; ‡ the NORMAN database was filtered to remove data from the German Environment Agency (UBA) and restricted to 2014-2022, as PHARMS-UBA also includes data from NORMAN prior to 2014. Thus, this number is not the true total number of pharmaceutical samples present in the NORMAN database.
Save the table as a Word document, table_1.docx.
save_as_docx(table_1, path = paste0(fig_wd, "./table_1.docx"))
Number of aggregate values that had a missing sample size.
all_data <- all_databases %>%
dplyr::filter(source != "eipaab") %>%
dplyr::reframe(total_data_n = length(compound_cas)) %>%
dplyr::pull(total_data_n)
source_units_nas <- all_databases %>%
dplyr::filter(concentration_type == "central") %>%
dplyr::group_by(source) %>%
dplyr::reframe(total_central = length(compound_cas), total_missing_units = sum(is.na(n_units)),
all_data_n = all_data, percent_aggregate = round((total_central/all_data_n) *
100, 2), percent_missing = round((total_missing_units/all_data_n) * 100,
2)) %>%
dplyr::rename(Source = source, `Total central data` = total_central, `Total missing units` = total_missing_units,
`All data` = all_data_n, `Percent aggregate` = percent_aggregate, `Percent missing` = percent_missing) %>%
dplyr::mutate(Source = case_when(Source == "eipaab" ~ "EIPAAB", Source == "norman" ~
"NORMAN‡", Source == "uba" ~ "PHARMS-UBA", Source == "wilkson" ~ "Wilkson et al.",
))
source_units_nas %>%
flextable() %>%
autofit()
| Source | Total central data | Total missing units | All data | Percent aggregate | Percent missing |
| --- | --- | --- | --- | --- | --- |
| PHARMS-UBA | 31,725 | 4,023 | 9,569,449 | 0.33 | 0.04 |
Checking the cross-over between the compounds assessed in behavioural tests and all environmental occurrence data.
eipaab_compounds_list <- all_databases %>%
dplyr::filter(source == "eipaab") %>%
dplyr::distinct(compound_name_corrected) %>%
dplyr::pull(compound_name_corrected)
n_eipaab_compounds <- length(eipaab_compounds_list)
enviro__cross_over_compounds_list <- all_databases %>%
dplyr::filter(source != "eipaab" & compound_name_corrected %in% eipaab_compounds_list) %>%
dplyr::distinct(compound_name_corrected) %>%
dplyr::pull(compound_name_corrected)
n_cross_over_compounds <- length(enviro__cross_over_compounds_list)
percent_cross_over = round(n_cross_over_compounds/n_eipaab_compounds * 100, 2)
glue::glue("{n_cross_over_compounds} of the {n_eipaab_compounds} pharmaceuticals tested in the EIPAAB database (with an environmental motivation) were present in the environmental databases (i.e. {percent_cross_over}%).")
## 167 of the 184 pharmaceuticals tested in the EIPAAB database (with an environmental motivation) were present in the environmental databases (i.e. 90.76%).
For questions around concentration, we filter for only cases where the levels are above zero or above the limit of detection (i.e. positive detections, or above the LOD). We do this for two reasons. First, when researchers are trying to estimate the risk of environmental exposure, we assume that they are emulating potential exposure events at a contaminated site, not trying to replicate a theoretical global surface water average. Second, by removing zero values and values below LODs, we are being more conservative in the relative comparison between tested doses in behavioural ecotoxicology and observed environmental doses (as opposed to using some substitute/estimated value for samples below the detection limit). We use both single values and summary values of central tendency (means, medians, etc.).
In this section we calculate summary metrics for the concentrations used in the behavioural tests and detected in surface waters and wastewater in the environmental data, separately for wastewater and surface water samples. We calculate the mean, median, min, max, range, n (number of contributing data points), standard deviation (sd), and the lower and upper 95% credible intervals (ci_lower, ci_upper) using empirical quantiles.
conc_summary <- all_databases %>%
dplyr::filter(value > 0) %>%
dplyr::group_by(matrix_group, compound_name_corrected) %>%
dplyr::reframe(mean = mean(value, na.rm = TRUE), median = median(value, na.rm = TRUE),
min = min(value, na.rm = TRUE), max = max(value, na.rm = TRUE), range = max -
min, n = length(unique_row_id), sd = sd(value, na.rm = TRUE), ci_lower = quantile(value,
0.025, na.rm = TRUE), ci_upper = quantile(value, 0.975, na.rm = TRUE)) %>%
dplyr::mutate(ci_lower = if_else(is.na(ci_lower), NA, ci_lower), ci_upper = if_else(is.na(ci_upper),
NA, ci_upper)) %>%
tidyr::complete(tidyr::expand(nesting(matrix_group, compound_name_corrected)))
Making a more user-friendly table with the concentrations of each compound that has a positive detection in wastewater and/or surface waters.
cas_number <- all_databases %>%
dplyr::filter(value > 0) %>%
dplyr::group_by(compound_cas_corrected) %>%
dplyr::slice(1) %>%
dplyr::ungroup() %>%
dplyr::select(compound_cas_corrected, compound_name_corrected)
conc_summary_df <- conc_summary %>%
dplyr::left_join(., cas_number, by = "compound_name_corrected") %>%
dplyr::select(compound_name_corrected, compound_cas_corrected, n, everything()) %>%
dplyr::rename(compound_name = compound_name_corrected, compound_cas = compound_cas_corrected,
matrix = matrix_group, cri_95_lower = ci_lower, cri_95_upper = ci_upper)
Saving the data frame as a CSV file.
write.csv(conc_summary_df, paste0(data_wd, "./martin-concentration-summary-table.csv"),
row.names = FALSE)
For the Supplementary material, we will make a smaller table.
eff_table <- conc_summary_df %>%
dplyr::filter(matrix == "effluent") %>%
dplyr::mutate(across(where(is.double), ~formatC(.x, format = "f", digits = 4))) %>%
dplyr::mutate(Compound = paste0(compound_name, " (", compound_cas, ")"), `Wastewater 95% credible interval (µg/L)‡` = if_else(n ==
1, NA_character_, paste0(cri_95_lower, "–", cri_95_upper)), `Wastewater range (µg/L)` = if_else(n ==
1, NA_character_, paste0(min, "–", max)), `Wastewater n` = round(n, 0),
`Wastewater median (µg/L)†` = median) %>%
dplyr::select(Compound, "Wastewater n", "Wastewater median (µg/L)†", "Wastewater 95% credible interval (µg/L)‡",
"Wastewater range (µg/L)")
surf_table <- conc_summary_df %>%
dplyr::filter(matrix == "surfacewater") %>%
dplyr::mutate(across(where(is.double), ~formatC(.x, format = "f", digits = 4))) %>%
dplyr::mutate(Compound = paste0(compound_name, " (", compound_cas, ")"), `Surface water 95% credible interval (µg/L)‡` = if_else(n ==
1, NA_character_, paste0(cri_95_lower, "–", cri_95_upper)), `Surface water range (µg/L)` = if_else(n ==
1, NA_character_, paste0(min, "–", max)), `Surface water n` = round(n,
0), `Surface water median (µg/L)†` = median) %>%
dplyr::select(Compound, "Surface water n", "Surface water median (µg/L)†",
"Surface water 95% credible interval (µg/L)‡", "Surface water range (µg/L)")
table_all <- full_join(surf_table, eff_table, by = "Compound") %>%
gt() %>%
cols_align(align = "center", columns = everything()) %>%
tab_options(table.font.size = px(11)) %>%
tab_header(title = md("**Summary of Measured Pharmaceutical Concentrations**"),
subtitle = md("This summary table was compiled for *Martin et al. (2025)*. It is based on a filtered synthesis of three publicly available datasets: (1) the NORMAN EMPODAT database for chemical occurrence (accessed 18/03/2025), (2) the Umweltbundesamt Pharmaceuticals in the Environment database (PHARMS-UBA; accessed 19/12/2024), and (3) Wilkinson et al. (2022) Pharmaceutical Pollution of the World’s Rivers database. Data were restricted to entries reported in mass per volume of water (e.g., µg/L) and relevant to surface water and wastewater matrices (for details of the filtering process, please refer to Martin et al. (2025)). Users of this table should **cite the original sources of the data: NORMAN EMPODAT, PHARMS-UBA, and Wilkinson et al. (2022).**")) %>%
tab_source_note(md("(†) Summary statistics reflect only positive detections. (‡) Credible intervals represent the lower and upper 95% empirical quantiles."))
Saving the table as an HTML file.
gtsave(table_all, filename = paste0(fig_wd, "./martin-et-al-supplemratry-file-1-concentration-summary-table.html"))
Here we will add the summary statistics from above to the full database, so we can see how many EIPAAB tested concentrations fall under the median surface water and wastewater concentrations for each compound.
rename_cols <- conc_summary %>%
dplyr::select(-matrix_group, -compound_name_corrected) %>%
colnames()
conc_summary_eff <- conc_summary %>%
dplyr::filter(matrix_group == "effluent") %>%
dplyr::rename_with(.cols = all_of(rename_cols), .fn = ~paste0("effluent_", .)) %>%
dplyr::select(-matrix_group)
conc_summary_surf <- conc_summary %>%
dplyr::filter(matrix_group == "surfacewater") %>%
dplyr::rename_with(.cols = all_of(rename_cols), .fn = ~paste0("surfacewater_",
.)) %>%
dplyr::select(-matrix_group)
all_databases_with_sum <- all_databases %>%
dplyr::filter(source == "eipaab") %>%
dplyr::left_join(., conc_summary_eff, by = "compound_name_corrected") %>%
dplyr::left_join(., conc_summary_surf, by = "compound_name_corrected") %>%
dplyr::mutate(under_med_surf = if_else(value < surfacewater_median, 1, 0), under_hci_surf = if_else(value <
surfacewater_ci_upper, 1, 0), under_med_effl = if_else(value < effluent_median,
1, 0), under_hci_effl = if_else(value < effluent_ci_upper, 1, 0), diff_med_surf = value/surfacewater_median,
diff_hic_surf = value/surfacewater_ci_upper, diff_med_effl = value/effluent_median,
diff_hic_effl = value/effluent_ci_upper)
Making separate data frames for surface waters and wastewaters.
surfacewater_conc <- all_databases_with_sum %>%
dplyr::filter(!is.na(surfacewater_median), !is.na(value)) %>%
dplyr::mutate(value_log = log(value), surfacewater_median_log = log(surfacewater_median),
relative_year = year - min(year))
wastewater_conc <- all_databases_with_sum %>%
dplyr::filter(!is.na(effluent_median), !is.na(value)) %>%
dplyr::mutate(value_log = log(value), effluent_median_log = log(effluent_median),
relative_year = year - min(year))
The relative_year is based on the earliest year, 1992.
all_databases_with_sum %>%
dplyr::filter(!is.na(surfacewater_median), !is.na(value)) %>%
dplyr::summarise(min_year = min(year, na.rm = TRUE)) %>%
dplyr::pull(min_year)
## [1] 1992
In these models, we examine the relationship between the log-transformed minimum tested concentration (value_log) and the log-transformed median surface water concentration (surfacewater_median_log).
n = nrow(surfacewater_conc)
print(glue("We have a sample size of {n} for this analysis, as its on the level of exposure but only for compounds with correspond surfacewater detections"))
## We have a sample size of 706 for this analysis, as its on the level of exposure but only for compounds with correspond surfacewater detections
To account for temporal trends and study design differences, we include two additional covariates:
relative_year, defined as the number of years since the first publication in 1992, and doses, the number of concentration levels tested in each study.
We fit a full model that includes all three predictors, and a null model that excludes surfacewater_median_log, allowing us to assess the added predictive value of surface water concentrations through model comparison.
concentration_str <- bf(value_log ~ surfacewater_median_log + relative_year + doses,
family = gaussian())
concentration_str_null <- bf(value_log ~ relative_year + doses, family = gaussian())
These are the default priors.
suppressWarnings(get_prior(concentration_str, data = surfacewater_conc, family = gaussian())) %>%
flextable() %>%
autofit()
| prior | class | coef | group | resp | dpar | nlpar | lb | ub | source |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | b | | | | | | | | default |
| | b | doses | | | | | | | default |
| | b | relative_year | | | | | | | default |
| | b | surfacewater_median_log | | | | | | | default |
| student_t(3, -0.2, 4.4) | Intercept | | | | | | | | default |
| student_t(3, 0, 4.4) | sigma | | | | | | 0 | | default |
Run the models. Because the file argument is supplied, brms will load the saved model from disk (if it exists) instead of re-fitting it.
options(brms.backend = "cmdstanr")
concentration_mod_log <- brm(
concentration_str,
data = surfacewater_conc,
cores = 4,
chains = 4,
#prior = mod_priors, #use defaults
warmup = 1000,
seed = 20250418,
thin = 2,
iter = 8000,
control = list(max_treedepth = 20, adapt_delta = 0.95),
save_pars = save_pars(all = TRUE),
sample_prior = TRUE,
file = paste0(mods_wd, './concentration_mod_log')
)
concentration_mod_null_log <- brm(
concentration_str_null,
data = surfacewater_conc,
cores = 4,
chains = 4,
#prior = mod_priors, #use defaults
warmup = 1000,
seed = 20250418,
thin = 2,
iter = 8000,
control = list(max_treedepth = 20, adapt_delta = 0.95),
save_pars = save_pars(all = TRUE),
sample_prior = TRUE,
file = paste0(mods_wd, './concentration_mod_null_log')
)
Reload the model if necessary.
concentration_mod_log <- readRDS(file = paste0(mods_wd, "./concentration_mod_log.rds"))
concentration_mod_null_log <- readRDS(file = paste0(mods_wd, "./concentration_mod_null_log.rds"))
There’s no clear evidence that the concentration model is better or worse than the null model for out-of-sample prediction, based on the difference in expected log predictive density (ELPD) between the two models using approximate leave-one-out cross-validation (LOO-CV).
loo_null <- loo(concentration_mod_null_log, reloo = TRUE)
loo_mod <- loo(concentration_mod_log, reloo = TRUE)
loo_compare(loo_mod, loo_null)
## elpd_diff se_diff
## concentration_mod_null_log 0.0 0.0
## concentration_mod_log -0.7 1.1
A Bayes factor (BF) quantifies how much more likely the data are under one model compared to another: BF > 1 supports the first model (concentration_mod_log), BF < 1 supports the second model (concentration_mod_null_log).
In this case, BF = 0.537, so the null model is about 1.86× more supported by the data (1/0.537 ≈ 1.86).
bayes_factor(concentration_mod_log, concentration_mod_null_log)
## Iteration: 1
## Iteration: 2
## Iteration: 3
## Iteration: 4
## Iteration: 1
## Iteration: 2
## Iteration: 3
## Iteration: 4
## Estimated Bayes factor in favor of concentration_mod_log over concentration_mod_null_log: 0.53717
This is a very low R², suggesting the predictors in the model account for very little of the variability in the outcome.

| Metric | Value |
| --- | --- |
| Conditional R² | 0.022 |
| 95% CI | [0.004, 0.042] |
performance::r2_bayes(concentration_mod_log, robust = FALSE, ci = 0.95)
## # Bayesian R2 with Compatibility Interval
##
## Conditional R2: 0.022 (95% CI [0.004, 0.042])
The model fits the marginal distribution of the response variable well.
color_scheme_set("red")
brms::pp_check(concentration_mod_log, ndraws = 25, type = "dens_overlay")
brms::pp_check(concentration_mod_log, ndraws = 25, type = "ecdf_overlay")
y_rep <- posterior_predict(concentration_mod_log, ndraws = 1000)
y_obs <- concentration_mod_log$data$value_log # Replace with your actual response variable
ppc_stat_2d(y = y_obs, yrep = y_rep, stat = c("median", "mad"))
performance::check_model(concentration_mod_log)
Checking R-hat values to confirm model convergence, and effective sample sizes (ESS) for the estimates.
max_rhat <- max(rhat(concentration_mod_log), na.rm = TRUE)
print(glue::glue("All R-hat values ≤ {max_rhat} (good convergence)."))
## All R-hat values ≤ 1.00041606613586 (good convergence).
print(summary(concentration_mod_log, prob = 0.95)) #All Rhat = 1 (good).
## Family: gaussian
## Links: mu = identity; sigma = identity
## Formula: value_log ~ surfacewater_median_log + relative_year + doses
## Data: surfacewater_conc (Number of observations: 706)
## Draws: 4 chains, each with iter = 8000; warmup = 1000; thin = 2;
## total post-warmup draws = 14000
##
## Regression Coefficients:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
## Intercept 2.63 1.11 0.48 4.82 1.00 13376
## surfacewater_median_log 0.13 0.13 -0.13 0.39 1.00 13082
## relative_year -0.11 0.03 -0.17 -0.04 1.00 13016
## doses 0.07 0.08 -0.09 0.23 1.00 12523
## Tail_ESS
## Intercept 11549
## surfacewater_median_log 11490
## relative_year 12209
## doses 12514
##
## Further Distributional Parameters:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma 4.38 0.12 4.16 4.62 1.00 13039 12408
##
## Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
Diagnostic trace and density plots look fine.
color_scheme_set("red")
plot(concentration_mod_log)
Pulling the marginal effect estimates from the model for our plots.
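Note that marginal_effects() is deprecated in recent versions of brms in favour of conditional_effects(); the calls below still work but may emit a deprecation warning.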
estimate_data_1 <- marginal_effects(concentration_mod_log, effects = "surfacewater_median_log")[[1]]
estimate_data_year_1 <- marginal_effects(concentration_mod_log, effects = "relative_year")[[1]]
These estimated coefficients represent the average effects of each predictor on the log-transformed tested concentration, controlling for (not averaging over) the other variables in the model. Because this is a log–log model, the coefficient for log(surfacewater_median) can be interpreted as an elasticity, that is, the percent change in the response associated with a percent change in the predictor.
The estimated effect of log(surfacewater_median) was β = 0.128 (95% CrI: –0.133 to 0.389), suggesting that a 1% increase in surface water concentration is associated with an average 0.128% increase in tested concentration. However, the 95% credible interval includes zero, indicating substantial uncertainty and that the effect is not well supported by the data.
The estimated effect of relative_year was β = –0.106 (95% CrI: –0.170 to –0.042), providing strong evidence of a temporal decline in tested concentrations. Since the predictor is linear and the outcome is log-transformed, this corresponds to an average 10% decrease in tested concentration per year (exp(–0.106) ≈ 0.899; (0.899 − 1) × 100 = −10.1%), holding all else constant.
The estimated coefficient for doses was β = 0.068 (95% CrI: –0.094 to 0.234), suggesting a weak positive association, but again with considerable uncertainty—the effect is not credibly different from zero.
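As a quick check of the back-transformation arithmetic above (using the rounded coefficient, not the full posterior):
# annual percent change implied by the relative_year coefficient
round((exp(-0.106) - 1) * 100, 1)
## [1] -10.1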
model_summary_1 <- as.data.frame(fixef(concentration_mod_log, summary = TRUE)) %>%
rownames_to_column() %>%
clean_names()
estimate_1 <- model_summary_1$estimate[2]
L95CI_1 <- model_summary_1$q2_5[2]
H95CI_1 <- model_summary_1$q97_5[2]
estimate_year_1 <- model_summary_1$estimate[3]
L95CI_year_1 <- model_summary_1$q2_5[3]
H95CI_year_1 <- model_summary_1$q97_5[3]
model_summary_1_ft <- model_summary_1 %>%
dplyr::mutate(across(where(is.double), ~round(.x, 3))) %>%
dplyr::rename(predictor = rowname) %>%
dplyr::mutate(CrI = paste0(q2_5, " to ", q97_5)) %>%
dplyr::select(predictor, estimate, CrI) %>%
flextable() %>%
set_header_labels(predictor = "Predictor", estimate = "Estimate", CrI = "95% CrI") %>%
autofit()
model_summary_1_ft
| Predictor | Estimate | 95% CrI |
| --- | --- | --- |
| Intercept | 2.631 | 0.481 to 4.817 |
| surfacewater_median_log | 0.128 | -0.133 to 0.389 |
| relative_year | -0.106 | -0.17 to -0.042 |
| doses | 0.068 | -0.094 to 0.234 |
Saving the table
save_as_docx(model_summary_1_ft, path = paste0(fig_wd, "./table-s2.docx"))
This gives the model-derived expected number of years (from the 1992 baseline) until tested and detected concentrations converge, for an average compound with the average number of doses and average water concentration. Setting the predicted log tested concentration equal to the log median surface water concentration and solving for relative_year gives t = (−b0 + (1 − b1)·x̄ − b3·d̄) / b2, where x̄ is the mean log surface water concentration and d̄ is the mean number of doses.
b0 <- model_summary_1$estimate[1] # Intercept
b1 <- model_summary_1$estimate[2] # surfacewater_median_log
b2 <- model_summary_1$estimate[3] # relative_year
b3 <- model_summary_1$estimate[4] # doses
x_bar <- mean(surfacewater_conc$surfacewater_median_log, na.rm = TRUE)
d_bar <- mean(surfacewater_conc$doses, na.rm = TRUE)
average_equal_years <- round(((-b0 + (1 - b1) * x_bar - b3 * d_bar)/b2), 0)
years_to_equal <- (1992 + average_equal_years)
glue::glue("Based on model predictions for average compound, with the average number of doses, and average water concentration, it would take {average_equal_years} years from our baseline (i.e. 1992), for the tested dose to be equal with the meassured dose. In other words, based on current trends, tested doses wont be equal to field levels until {years_to_equal}")
## Based on model predictions for average compound, with the average number of doses, and average water concentration, it would take 63 years from our baseline (i.e. 1992), for the tested dose to be equal with the meassured dose. In other words, based on current trends, tested doses wont be equal to field levels until 2055
text_coord <- surfacewater_conc %>%
summarise(y_max = max(value_log, na.rm = TRUE), x_max = max(surfacewater_median_log,
na.rm = TRUE), x_min = min(surfacewater_median_log, na.rm = TRUE), x = mean(c(x_max,
x_min)))
y_coord <- text_coord %>%
pull(y_max)
x_coord <- text_coord %>%
pull(x_min)
fig_1a <- ggplot() + geom_line(data = estimate_data_1, aes(x = surfacewater_median_log,
y = estimate__), color = "#2D5F34", linewidth = 1) + geom_ribbon(data = estimate_data_1,
aes(x = surfacewater_median_log, ymin = lower__, ymax = upper__), fill = "#2D5F34",
alpha = 0.1) + geom_point(data = surfacewater_conc, aes(x = surfacewater_median_log,
y = value_log), alpha = 0.4, shape = 21, color = "#2D5F34", fill = "#2D5F34") +
annotate("text", x = x_coord, y = y_coord, label = paste0("Bayesian log-log Gaussian regression",
"\n", "Effect estimate: ", round(estimate_1, 3), "\n", "95%CrI: ", round(L95CI_1,
3), " to ", round(H95CI_1, 3)), hjust = 0, vjust = 1, size = 4) + labs(title = "(A) Surface water vs exposure concentration",
x = "Exposure concentration (μg/L)*", y = "Environmental concentration (μg/L)*",
caption = "*Values in μg/L space on a log scale") + scale_y_continuous(labels = function(x) (scales::label_scientific())(exp(x))) +
scale_x_continuous(labels = function(x) (scales::label_scientific())(exp(x))) +
theme_few() + theme(legend.position = "bottom", legend.justification = "center",
plot.caption = element_text(size = 8, hjust = 0.5, margin = margin(t = 6)))
fig_1a
Save this figure for the MS
ggsave(filename = paste0(fig_wd, "./figure-1a.pdf"), plot = fig_1a, width = 210,
height = 297/1.5, units = "mm")
Plot tested (exposure) concentrations by year, for compounds with surface water detections.
text_coord <- wastewater_conc %>%
summarise(y_max = max(value_log, na.rm = TRUE), x_max = max(relative_year, na.rm = TRUE),
x_min = min(relative_year, na.rm = TRUE), x_mean = mean(c(x_max, x_min)))
y_coord <- text_coord %>%
pull(y_max)
x_coord <- text_coord %>%
pull(x_min)
fig_s1a <- ggplot() + geom_line(data = estimate_data_year_1, aes(x = relative_year,
y = estimate__), color = "#2D5F34", linewidth = 1) + geom_ribbon(data = estimate_data_year_1,
aes(x = relative_year, ymin = lower__, ymax = upper__), fill = "#2D5F34", alpha = 0.1) +
geom_point(data = surfacewater_conc, aes(x = relative_year, y = value_log), alpha = 0.4,
shape = 21, color = "#2D5F34", fill = "#2D5F34") + annotate("text", x = x_coord,
y = y_coord, label = paste0("Bayesian log-log Gaussian regression", "\n", "Effect estimate: ",
round(estimate_year_1, 3), "\n", "95%CrI: ", round(L95CI_year_1, 3), " to ",
round(H95CI_year_1, 3)), hjust = 0, vjust = 1, size = 4) + labs(title = "(A) Exposure concentration vs year (surface water compounds)",
x = "Year", y = "Exposure concentration (μg/L)*", caption = "*Concentration values in μg/L space on a log scale") +
scale_y_continuous(labels = function(x) (scales::label_scientific())(exp(x))) +
scale_x_continuous(labels = function(x) x + 1992) + theme_few() + theme(legend.position = "bottom",
legend.justification = "center", plot.caption = element_text(size = 8, hjust = 0.5,
margin = margin(t = 6)))
fig_s1a
Save this figure for the MS
ggsave(filename = paste0(fig_wd, "./figure-s1a.pdf"), plot = fig_s1a, width = 210,
height = 297/1.5, units = "mm")
In these models, we examine the relationship between the log-transformed minimum tested concentration (value_log) and the log-transformed median wastewater concentration (effluent_median_log), but only for compounds with corresponding wastewater detections.
n = nrow(wastewater_conc)
print(glue("We have a sample size of {n} for this analysis, as its on the level of exposure, but only for compounds with correspond wastewater detections"))
## We have a sample size of 714 for this analysis, as its on the level of exposure, but only for compounds with correspond wastewater detections
To account for temporal trends and study design differences, we include two additional covariates:
relative_year, defined as the number of years since the first publication in 1992, and doses, the number of concentration levels tested in each study.
We fit a full model that includes all three predictors, and a null model that excludes effluent_median_log, allowing us to assess the added predictive value of wastewater concentrations through model comparison.
eff_concentration_str <- bf(value_log ~ effluent_median_log + relative_year + doses,
family = gaussian())
eff_concentration_str_null <- bf(value_log ~ relative_year + doses, family = gaussian())
These are the default priors.
suppressWarnings(get_prior(eff_concentration_str, data = wastewater_conc, family = gaussian())) %>%
flextable()
| prior | class | coef | group | resp | dpar | nlpar | lb | ub | source |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | b | | | | | | | | default |
| | b | doses | | | | | | | default |
| | b | effluent_median_log | | | | | | | default |
| | b | relative_year | | | | | | | default |
| student_t(3, -0.2, 4.5) | Intercept | | | | | | | | default |
| student_t(3, 0, 4.5) | sigma | | | | | | 0 | | default |
Run the models. Because the file argument is supplied, brms will load the saved model from disk (if it exists) instead of re-fitting it.
options(brms.backend = "cmdstanr")
eff_concentration_mod_log <- brm(
eff_concentration_str,
data = wastewater_conc,
cores = 4,
chains = 4,
#prior = mod_priors, #use defaults
warmup = 1000,
seed = 20250418,
thin = 2,
iter = 8000,
control = list(max_treedepth = 20, adapt_delta = 0.95),
save_pars = save_pars(all = TRUE),
sample_prior = TRUE,
file = paste0(mods_wd, './eff_concentration_mod_log')
)
eff_concentration_mod_null_log <- brm(
eff_concentration_str_null,
data = wastewater_conc,
cores = 4,
chains = 4,
#prior = mod_priors, #use defaults
warmup = 1000,
seed = 20250418,
thin = 2,
iter = 8000,
control = list(max_treedepth = 20, adapt_delta = 0.95),
save_pars = save_pars(all = TRUE),
sample_prior = TRUE,
file = paste0(mods_wd, './eff_concentration_mod_null_log')
)
Reload the model if necessary.
eff_concentration_mod_log <- readRDS(file = paste0(mods_wd, "./eff_concentration_mod_log.rds"))
eff_concentration_mod_null_log <- readRDS(file = paste0(mods_wd, "./eff_concentration_mod_null_log.rds"))
We compared the full model (including effluent_median_log) with the null model (excluding it) using approximate leave-one-out cross-validation (LOO-CV).
The expected log predictive density (ELPD) difference was:
ΔELPD = 31.7 in favour of the full model, with a standard error (SE) of 8.1.
Since the ELPD difference is approximately four times larger than its standard error (Δ/SE ≈ 3.9), this indicates strong evidence that the full model provides substantially better out-of-sample predictive performance than the null model.
In other words, including effluent_median_log as a predictor significantly improves the model’s ability to predict tested concentrations across studies.
loo_compare(loo(eff_concentration_mod_log, moment_match = TRUE), loo(eff_concentration_mod_null_log,
moment_match = TRUE))
## elpd_diff se_diff
## eff_concentration_mod_log 0.0 0.0
## eff_concentration_mod_null_log -31.7 8.1
The Bayes factor comparing the full model (including effluent_median_log) to the null model (excluding it) was approximately 3.09 × 10¹³, overwhelmingly favouring the full model.
A Bayes factor (BF) this large indicates decisive evidence that the full model provides a better explanation of the data than the null model. In other words, the observed data are roughly 31 trillion times more likely under the full model than under the null.
This provides exceptionally strong support for including effluent_median_log as a predictor of tested concentrations, reinforcing the conclusion from the LOO-CV comparison.
bayes_factor(eff_concentration_mod_log, eff_concentration_mod_null_log)
## Iteration: 1
## Iteration: 2
## Iteration: 3
## Iteration: 4
## Iteration: 1
## Iteration: 2
## Iteration: 3
## Iteration: 4
## Estimated Bayes factor in favor of eff_concentration_mod_log over eff_concentration_mod_null_log: 30862546421405.30469
The Bayesian R² for the full model (including effluent_median_log) was 0.108, with a 95% compatibility interval ranging from 0.070 to 0.150.
This indicates that the model explains approximately 10.8% of the variance in tested concentrations across studies, with credible support for this value lying between 7.0% and 15.0%.
While this represents a modest proportion of explained variance, it nonetheless reflects a meaningful improvement over the null model — especially when considered alongside the strong support from both the LOO-CV comparison and the Bayes factor. This suggests that effluent_median_log is an informative predictor, even if the overall model explains a limited portion of the variability, which may reflect substantial heterogeneity across studies.
performance::r2_bayes(eff_concentration_mod_log, robust = FALSE, ci = 0.95)
## # Bayesian R2 with Compatibility Interval
##
## Conditional R2: 0.108 (95% CI [0.070, 0.150])
The model fits the marginal distribution of the response variable well.
color_scheme_set("red")
brms::pp_check(eff_concentration_mod_log, ndraws = 25, type = "dens_overlay")
brms::pp_check(eff_concentration_mod_log, ndraws = 25, type = "ecdf_overlay")
y_rep <- posterior_predict(eff_concentration_mod_log, ndraws = 1000)
y_obs <- eff_concentration_mod_log$data$value_log # Replace with your actual response variable
ppc_stat_2d(y = y_obs, yrep = y_rep, stat = c("median", "mad"))
Based on the diagnostic plots: (1) the residuals exhibit no major departures from randomness; (2) the Q-Q and density plots support the normality assumption of the residuals; (3) the absence of strong patterns in the autocorrelation and scale-location plots further suggests that the model is well specified, with homoscedastic and independent errors.
performance::check_model(eff_concentration_mod_log)
Checking R-hat values to confirm model convergence, and effective sample sizes (ESS) for the estimates.
max_rhat <- max(rhat(eff_concentration_mod_log), na.rm = TRUE)
print(glue::glue("All R-hat values ≤ {max_rhat}"))
## All R-hat values ≤ 1.00029834752456
General model output
print(summary(eff_concentration_mod_log, prob = 0.95))
## Family: gaussian
## Links: mu = identity; sigma = identity
## Formula: value_log ~ effluent_median_log + relative_year + doses
## Data: wastewater_conc (Number of observations: 714)
## Draws: 4 chains, each with iter = 8000; warmup = 1000; thin = 2;
## total post-warmup draws = 14000
##
## Regression Coefficients:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept 5.58 0.95 3.71 7.45 1.00 13485 12258
## effluent_median_log 0.68 0.08 0.52 0.84 1.00 12501 12460
## relative_year -0.13 0.03 -0.19 -0.07 1.00 12979 12064
## doses -0.08 0.08 -0.24 0.08 1.00 13233 12270
##
## Further Distributional Parameters:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma 4.19 0.11 3.98 4.42 1.00 12826 11539
##
## Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
Diagnostic trace and density plots look fine.
color_scheme_set("red")
plot(eff_concentration_mod_log)
Pulling the marginal effect estimates from the model for our plots.
# Extract marginal effects for effluent and year
estimate_data_2 <- marginal_effects(eff_concentration_mod_log, effects = "effluent_median_log")[[1]]
estimate_data_year_2 <- marginal_effects(eff_concentration_mod_log, effects = "relative_year")[[1]]
These estimated coefficients represent the average effects across the dataset, controlling for (rather than averaging over) the other predictors in the model. Because the model is log–log for value_log and effluent_median_log, the coefficient for log(effluent_median) reflects an elasticity, that is, the percent change in the response for a percent change in the predictor.
The estimated effect of log(effluent_median) was β = 0.679 (95% CrI: 0.516 to 0.841), indicating that a 1% increase in effluent concentration is associated with an average 0.679% increase in tested concentration, all else being equal. Since the 95% credible interval does not include zero, this effect is statistically well supported.
The coefficient for relative_year was β = –0.134 (95% CrI: –0.195 to –0.073), suggesting a consistent temporal decline in tested concentrations over time. As this predictor is linear while the response is log-transformed, this corresponds to an average 12.5% decrease in tested concentration per year (exp(–0.134) ≈ 0.875), holding other variables constant.
The estimated effect of doses was β = –0.080 (95% CrI: –0.242 to 0.084), suggesting a weak and uncertain negative association between the number of doses tested and the minimum reported concentration. However, the credible interval includes zero, indicating no strong evidence for an effect of doses in this model.
model_summary_2 <- as.data.frame(fixef(eff_concentration_mod_log, summary = TRUE)) %>%
rownames_to_column() %>%
clean_names()
estimate_2 <- model_summary_2$estimate[2]
L95CI_2 <- model_summary_2$q2_5[2]
H95CI_2 <- model_summary_2$q97_5[2]
estimate_year_2 <- model_summary_2$estimate[3]
L95CI_year_2 <- model_summary_2$q2_5[3]
H95CI_year_2 <- model_summary_2$q97_5[3]
model_summary_2_ft <- model_summary_2 %>%
dplyr::mutate(across(where(is.double), ~round(.x, 3))) %>%
dplyr::rename(predictor = rowname) %>%
dplyr::mutate(CrI = paste0(q2_5, " to ", q97_5)) %>%
dplyr::select(predictor, estimate, CrI) %>%
flextable() %>%
set_header_labels(predictor = "Predictor", estimate = "Estimate", CrI = "95% CrI") %>%
autofit()
model_summary_2_ft
| Predictor | Estimate | 95% CrI |
| --- | --- | --- |
| Intercept | 5.576 | 3.709 to 7.45 |
| effluent_median_log | 0.679 | 0.516 to 0.841 |
| relative_year | -0.134 | -0.195 to -0.073 |
| doses | -0.080 | -0.242 to 0.084 |
Saving the table
save_as_docx(model_summary_2_ft, path = paste0(fig_wd, "./table-s3.docx"))
text_coord <- wastewater_conc %>%
summarise(y = max(value_log, na.rm = TRUE), x_max = max(effluent_median_log,
na.rm = TRUE), x_min = min(effluent_median_log, na.rm = TRUE), x = mean(c(x_max,
x_min)))
y_coord <- text_coord %>%
pull(y)
x_coord <- text_coord %>%
pull(x)
fig_1b <- ggplot() + geom_line(data = estimate_data_2, aes(x = effluent_median_log,
y = estimate__), color = "#5B489D", linewidth = 1) + geom_ribbon(data = estimate_data_2,
aes(x = effluent_median_log, ymin = lower__, ymax = upper__), fill = "#5B489D",
alpha = 0.1) + geom_point(data = wastewater_conc, aes(x = effluent_median_log,
y = value_log), alpha = 0.4, shape = 21, color = "#5B489D", fill = "#5B489D") +
annotate("text", x = x_coord, y = y_coord, label = paste0("Bayesian log-log Gaussian regression",
"\n", "Effect estimate: ", round(estimate_2, 3), "\n", "95%CrI: ", round(L95CI_2,
3), " to ", round(H95CI_2, 3)), hjust = 0, vjust = 1, size = 4) + labs(title = "(B) Wastewater vs exposure concentration",
x = "Exposure concentration (μg/L)*", y = "Environmental concentration (μg/L)*",
caption = "*Values in μg/L space on a log scale") + scale_y_continuous(labels = function(x) (scales::label_scientific())(exp(x))) +
scale_x_continuous(labels = function(x) (scales::label_scientific())(exp(x))) +
theme_few() + theme(legend.position = "bottom", legend.justification = "center",
plot.caption = element_text(size = 8, hjust = 0.5, margin = margin(t = 6)))
fig_1b
Saving figure
ggsave(filename = paste0(fig_wd, "./figure-1b.pdf"),
plot = fig_1b,
width = 210,
height = 297/1.5, # Specify the height of the plot
units = "mm")
Plot tested (exposure) concentrations by year, for compounds with wastewater detections.
text_coord <- wastewater_conc %>%
summarise(y_max = max(value_log, na.rm = TRUE), x_max = max(relative_year, na.rm = TRUE),
x_min = min(relative_year, na.rm = TRUE), x_mean = mean(c(x_max, x_min)))
y_coord <- text_coord %>%
pull(y_max)
x_coord <- text_coord %>%
pull(x_min)
fig_s1b <- ggplot() + geom_line(data = estimate_data_year_2, aes(x = relative_year,
y = estimate__), color = "#5B489D", linewidth = 1) + geom_ribbon(data = estimate_data_year_2,
aes(x = relative_year, ymin = lower__, ymax = upper__), fill = "#5B489D", alpha = 0.1) +
geom_point(data = wastewater_conc, aes(x = relative_year, y = value_log), alpha = 0.4,
shape = 21, color = "#5B489D", fill = "#5B489D") + annotate("text", x = x_coord,
y = y_coord, label = paste0("Bayesian log-log Gaussian regression", "\n", "Effect estimate: ",
round(estimate_year_2, 3), "\n", "95%CrI: ", round(L95CI_year_2, 3), " to ",
round(H95CI_year_2, 3)), hjust = 0, vjust = 1, size = 4) + labs(title = "(B) Exposure concentration vs year (wastewater compounds)",
x = "Year", y = "Exposure concentration (μg/L)*", caption = "*Concentration values in μg/L space on a log scale") +
scale_y_continuous(labels = function(x) (scales::label_scientific())(exp(x))) +
scale_x_continuous(labels = function(x) x + 1992) + theme_few() + theme(legend.position = "bottom",
legend.justification = "center", plot.caption = element_text(size = 8, hjust = 0.5,
margin = margin(t = 6)))
fig_s1b
Save this figure for the MS
ggsave(filename = paste0(fig_wd, "./figure-s1b.pdf"), plot = fig_s1b, width = 210,
height = 297/1.5, units = "mm")
Is this trend still present if the extreme concentration is removed? The compound with the highest median effluent concentration is streptomycin (CAS: 57-92-1).
wastewater_conc %>%
dplyr::arrange(desc(effluent_median_log)) %>%
dplyr::slice(1) %>%
dplyr::select(compound_cas_corrected, compound_name_corrected, value) %>%
flextable()
| compound_cas_corrected | compound_name_corrected | value |
| --- | --- | --- |
| 57-92-1 | streptomycin | 20,400 |
Building an identical model without streptomycin (outlier removed, hence the _or suffix).
options(brms.backend = "cmdstanr")
eff_concentration_mod_log_or <- brm(eff_concentration_str,
data = wastewater_conc %>%
dplyr::filter(compound_cas_corrected != "57-92-1"),
cores = 4,
chains = 4,
#prior = mod_priors, #use defaults
warmup = 1000,
seed = 20250418,
thin = 2,
iter = 8000,
control = list(max_treedepth = 20, adapt_delta = 0.95),
save_pars = save_pars(all=TRUE),
sample_prior = TRUE,
file = paste0(mods_wd, './eff_concentration_mod_log_or'))
Reload the model if necessary.
eff_concentration_mod_log_or <- readRDS(file = paste0(mods_wd, "./eff_concentration_mod_log_or.rds"))
The effect of year seems robust to the outlier.
model_summary_2_1 <- as.data.frame(fixef(eff_concentration_mod_log_or, summary = TRUE)) %>%
rownames_to_column() %>%
clean_names()
table_i <- model_summary_2_1 %>%
dplyr::mutate(across(where(is.double), ~round(.x, 3))) %>%
dplyr::rename(predictor = rowname) %>%
dplyr::mutate(CrI = paste0(q2_5, " to ", q97_5)) %>%
dplyr::select(predictor, estimate, CrI) %>%
flextable() %>%
set_header_labels(predictor = "Predictor", estimate = "Estimate (outlier removed)",
CrI = "95% CrI (outlier removed)") %>%
autofit()
table_i
| Predictor | Estimate (outlier removed) | 95% CrI (outlier removed) |
| --- | --- | --- |
| Intercept | 5.635 | 3.726 to 7.548 |
| effluent_median_log | 0.685 | 0.521 to 0.851 |
| relative_year | -0.135 | -0.195 to -0.073 |
| doses | -0.083 | -0.245 to 0.078 |
Here we summarise how many tests were below the median surface water and effluent concentrations, as well as below the upper 95% CrI. We also calculate the fold difference in concentrations (summarised as the median fold difference).
conc_summary_overall <- all_databases_with_sum %>%
dplyr::reframe(
eipaab_n = length(unique_row_id),
n_compounds_sw = length(unique(compound_name_corrected[!is.na(surfacewater_median)])),
n_compounds_ef = length(unique(compound_name_corrected[!is.na(effluent_median)])),
n_sw = sum(!is.na(under_med_surf)),
less_med_sw = sum(under_med_surf, na.rm = TRUE),
less_hci_sw = sum(under_hci_surf, na.rm = TRUE),
percent_less_med_sw = (less_med_sw / n_sw) * 100,
percent_less_hic_sw = (less_hci_sw / n_sw) * 100,
n_ef = sum(!is.na(under_med_effl)),
less_med_ef = sum(under_med_effl, na.rm = TRUE),
less_hci_ef = sum(under_hci_effl, na.rm = TRUE),
percent_less_med_ef = (less_med_ef / n_ef) * 100,
percent_less_hic_ef = (less_hci_ef / n_ef) * 100,
# Median values
fdiff_med_sw = median(diff_med_surf, na.rm = TRUE),
fdiff_hci_sw = median(diff_hic_surf, na.rm = TRUE),
fdiff_med_ef = median(diff_med_effl, na.rm = TRUE),
fdiff_hci_ef = median(diff_hic_effl, na.rm = TRUE),
# Standard Deviation
sd_fdiff_med_sw = sd(diff_med_surf, na.rm = TRUE),
sd_fdiff_hci_sw = sd(diff_hic_surf, na.rm = TRUE),
sd_fdiff_med_ef = sd(diff_med_effl, na.rm = TRUE),
sd_fdiff_hci_ef = sd(diff_hic_effl, na.rm = TRUE)
) %>%
dplyr::mutate(across(everything(), ~ ifelse(is.nan(.), NA, .))) # Convert NaNs to NAs
For the 706 exposures to compounds with surface water detections (in PHARMS-UBA, NORMAN, or Wilkinson), only 18.84% and 38.10% employed test concentrations lower than the median and the upper 95% credible interval of surface water concentrations, respectively. Further, for the 714 exposures to compounds with wastewater detections, only 23.39% and 53.08% employed test concentrations lower than the median and the upper 95% credible interval of wastewater concentrations. On average (median fold difference), concentrations used in behavioural ecotoxicology were about 43 times higher than median surface water concentrations, and about 10 times higher than those in effluent.
surf <- conc_summary_overall %>%
dplyr::mutate(fdiff_med_sw = round(fdiff_med_sw,2),
fdiff_hci_sw = round(fdiff_hci_sw)) %>%
dplyr::select(n_sw, less_med_sw, less_hci_sw, fdiff_med_sw, fdiff_hci_sw, percent_less_med_sw, percent_less_hic_sw, n_compounds_sw) %>%
dplyr::rename(total_n = n_sw, less_med = less_med_sw, less_hci = less_hci_sw, fdiff_med = fdiff_med_sw, fdiff_hci = fdiff_hci_sw, percent_less_med = percent_less_med_sw, percent_less_hic = percent_less_hic_sw, compounds = n_compounds_sw) %>%
dplyr::mutate(matrix = "Surface water")
effluent <- conc_summary_overall %>%
dplyr::mutate(fdiff_med_ef = round(fdiff_med_ef, 2),
fdiff_hci_ef = round(fdiff_hci_ef, 2)) %>%
dplyr::select(n_ef, less_med_ef, less_hci_ef, fdiff_med_ef, fdiff_hci_ef, percent_less_med_ef, percent_less_hic_ef, n_compounds_ef) %>%
dplyr::rename(total_n = n_ef, less_med = less_med_ef, less_hci = less_hci_ef, fdiff_med = fdiff_med_ef, fdiff_hci = fdiff_hci_ef, percent_less_med = percent_less_med_ef, percent_less_hic = percent_less_hic_ef, compounds = n_compounds_ef) %>%
dplyr::mutate(matrix = "Effluent")
overall_summary_table <- rbind(surf, effluent) %>%
dplyr::select(matrix, everything()) %>%
dplyr::mutate(
percent_less_med = round(percent_less_med, 2),
percent_less_hic = round(percent_less_hic, 2)
) %>%
dplyr::rename(
Matrix = matrix,
`Total tests` = total_n,
`Total compounds` = compounds,
`Dose less than median` = less_med,
`Dose less than upper 95%CrI` = less_hci,
`Mean-fold difference from median` = fdiff_med,
`Mean-fold difference from upper 95%CrI` = fdiff_hci,
`Percent less than median (%)` = percent_less_med,
`Percent less than upper 95%CrI (%)` = percent_less_hic
) %>%
flextable() %>%
colformat_num(j = "Total tests", digits = 0) %>% # Format total tests as whole numbers
align(align = "center", part = "all") %>%
autofit()
overall_summary_table
| Matrix | Total tests | Dose less than median | Dose less than upper 95%CrI | Mean-fold difference from median | Mean-fold difference from upper 95%CrI | Percent less than median (%) | Percent less than upper 95%CrI (%) | Total compounds |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Surface water | 706 | 133 | 269 | 42.70 | 2.00 | 18.84 | 38.10 | 139 |
| Effluent | 714 | 167 | 379 | 10.11 | 0.78 | 23.39 | 53.08 | 148 |
save_as_docx(overall_summary_table, path = paste0(fig_wd, "./table-s6-1.docx"))
Now we will make a summary for each compound
sample_counts <- all_databases_with_sum %>%
dplyr::select(compound_name_corrected, surfacewater_n, effluent_n) %>%
group_by(compound_name_corrected) %>%
dplyr::slice(1) %>%
ungroup()
under_levels <- all_databases_with_sum %>%
dplyr::group_by(compound_name_corrected) %>%
dplyr::reframe(eipaab_n = length(unique_row_id), n_sw = sum(!is.na(under_med_surf)),
less_med_sw = sum(under_med_surf, na.rm = TRUE), less_hci_sw = sum(under_hci_surf,
na.rm = TRUE), prop_less_med_sw = less_med_sw/n_sw, prop_less_hic_sw = less_hci_sw/n_sw,
n_ef = sum(!is.na(under_med_effl)), less_med_ef = sum(under_med_effl, na.rm = TRUE),
less_hci_ef = sum(under_hci_effl, na.rm = TRUE), prop_less_med_ef = less_med_ef/n_ef,
prop_less_hic_ef = less_hci_ef/n_ef, fdiff_med_sw = median(diff_med_surf,
na.rm = TRUE), fdiff_hci_sw = median(diff_hic_surf, na.rm = TRUE), fdiff_med_ef = median(diff_med_effl,
na.rm = TRUE), fdiff_hci_ef = median(diff_hic_effl, na.rm = TRUE), ) %>%
dplyr::left_join(., sample_counts, by = "compound_name_corrected") %>%
dplyr::mutate(across(everything(), ~ifelse(is.nan(.), NA, .)))
under_levels
## # A tibble: 184 × 18
## compound_name_corrected eipaab_n n_sw less_med_sw less_hci_sw
## <chr> <int> <int> <dbl> <dbl>
## 1 17-alpha-ethinylestradiol 56 56 51 56
## 2 17-alpha-methyldihydrotestosterone 1 0 0 0
## 3 17-beta-estradiol 12 12 1 11
## 4 17-beta-trenbolone 16 16 2 2
## 5 5-fluorouracil 1 0 0 0
## 6 acetylsalicylic acid 3 3 0 2
## 7 albendazole 1 1 0 0
## 8 alprazolam 1 1 0 1
## 9 aminosidine 1 0 0 0
## 10 amiodarone 1 1 0 0
## # ℹ 174 more rows
## # ℹ 13 more variables: prop_less_med_sw <dbl>, prop_less_hic_sw <dbl>,
## # n_ef <int>, less_med_ef <dbl>, less_hci_ef <dbl>, prop_less_med_ef <dbl>,
## # prop_less_hic_ef <dbl>, fdiff_med_sw <dbl>, fdiff_hci_sw <dbl>,
## # fdiff_med_ef <dbl>, fdiff_hci_ef <dbl>, surfacewater_n <int>,
## # effluent_n <int>
Let’s plot the median fold difference, building the plot in two halves so that it can fit in our manuscript. A sketch of the surface water data preparation follows.
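The chunk that builds plot_sw_data and the Figure 2 panels is not shown in this rendering. A minimal sketch of the data preparation, mirroring the effluent version given later in this script (object names follow the effluent code; the published Figure 2 may differ in its exact aesthetics):
plot_sw_data <- under_levels %>%
dplyr::filter(!is.na(fdiff_med_sw)) %>%
dplyr::arrange(desc(fdiff_med_sw)) %>%
dplyr::mutate(
order = 1:nrow(.),
compound_name_corrected = sentence_case(compound_name_corrected),
compound_name_short = if_else(
nchar(as.character(compound_name_corrected)) > 13,
str_trunc(compound_name_corrected, 16, ellipsis = "..."),
compound_name_corrected
),
compound_name_short = factor(compound_name_short, levels = rev(compound_name_short)),
eipaab_n_cat = case_when(
eipaab_n == 1 ~ "1",
eipaab_n >= 2 & eipaab_n <= 5 ~ "2 to 5",
eipaab_n >= 6 & eipaab_n <= 10 ~ "6 to 10",
eipaab_n > 10 ~ "10+",
TRUE ~ "ERROR"
),
eipaab_n_cat = factor(eipaab_n_cat, levels = c("1", "2 to 5", "6 to 10", "10+"))
)
# plot_left and plot_right are then built from plot_sw_data exactly as for
# Figure S2 below, substituting fdiff_med_sw for fdiff_med_ef.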
Fig 2: The median fold difference between tested concentrations in EIPAAB and the median surface water concentration (i.e. tested dose / median surface water concentration).
plot_left
plot_right
Note: the colour of the points represents the number of occurrences (tests) in EIPAAB for each compound.
Save the figure
invisible(fig_2 <- plot_left | plot_right)
ggsave(filename = paste0(fig_wd, "./figure-2.pdf"),
plot = fig_2,
width = 210,
height = 297/1.25, # Specify the height of the plot
units = "mm")
For surface waters there were 139 compounds
nrow(plot_sw_data)
## [1] 139
Let’s plot the median fold difference for effluent, building the plot in two halves so that it can fit in our manuscript.
plot_ef_data <- under_levels %>%
dplyr::filter(!is.na(fdiff_med_ef)) %>%
dplyr::arrange(desc(fdiff_med_ef)) %>%
dplyr::mutate(
order = 1:nrow(.),
compound_name_corrected = sentence_case(compound_name_corrected),
compound_name_short = if_else(
nchar(as.character(compound_name_corrected)) > 13,
str_trunc(compound_name_corrected, 16, ellipsis = "..."),
compound_name_corrected
),
compound_name_short = factor(compound_name_short, levels = rev(compound_name_short)), # reverse levels so compounds plot top-to-bottom
eipaab_n_cat = case_when(
eipaab_n == 1 ~ "1",
eipaab_n >= 2 & eipaab_n <= 5 ~ "2 to 5",
eipaab_n >= 6 & eipaab_n <= 10 ~ "6 to 10",
eipaab_n > 10 ~ "10+",
TRUE ~ "ERROR" # This catches unexpected values
),
eipaab_n_cat = factor(eipaab_n_cat, levels = c("1", "2 to 5", "6 to 10", "10+"))
)
vline_data_left <- data.frame(xintercept = log(10^(-1:2)),
xlabs = as.character(10^(-1:2))) %>%
dplyr::mutate(xlabs = paste0("×",xlabs))
vline_data_right <- data.frame(xintercept = log(10^(2:7)),
xlabs = as.character(10^(2:7))) %>%
dplyr::mutate(xlabs = paste0("×",xlabs))
n_half <- ceiling(nrow(plot_ef_data) / 2)
plot_ef_data_right <- plot_ef_data %>%
dplyr::slice(1:n_half)
plot_ef_data_left <- plot_ef_data %>%
dplyr::slice((n_half + 1):nrow(plot_ef_data))
plot_left <- plot_ef_data_left %>%
ggplot(aes(x = log(fdiff_med_ef), y = compound_name_short, fill = eipaab_n_cat)) +
geom_vline(data = vline_data_left, aes(xintercept = xintercept),
linetype = "dashed", colour = "black", alpha = 0.5) +
geom_text(data = vline_data_left, aes(x = xintercept, y = "Amitriptyline", label = xlabs),
inherit.aes = FALSE, angle = 90, vjust = -0.5, hjust = 1, size = 3) +
geom_point(alpha = 0.5, shape = 21, size = 2) +
theme_classic() +
# Adjusted log scale with 10-fold increments
scale_x_continuous(
breaks = log(10^(-1:2)), # Explicitly set log-scale breaks
labels = function(x) format(exp(x), scientific = TRUE)
) +
labs(
subtitle = "Tested concentrations vs median waste water",
x = "Log mean fold difference of tested and median waste water concentrations",
y = "Compounds",
fill = "EIPAAB data"
) +
theme(
legend.position = "bottom",
legend.justification = "right"
)
plot_right <- plot_ef_data_right %>%
ggplot(aes(x = log(fdiff_med_ef), y = compound_name_short, fill = eipaab_n_cat)) +
geom_vline(data = vline_data_right, aes(xintercept = xintercept),
linetype = "dashed", colour = "black", alpha = 0.5) +
geom_text(data = vline_data_right, aes(x = xintercept, y = "Risperidone", label = xlabs),
inherit.aes = FALSE, angle = 90, vjust = -0.5, hjust = 1, size = 3) +
geom_point(alpha = 0.5, shape = 21, size = 2) +
theme_classic() +
# Adjusted log scale: removes 10^2 while keeping 10^3 to 10^7
scale_x_continuous(
breaks = log(c(10^(3:7))), # Explicitly setting ticks without 10^2
labels = function(x) format(exp(x), scientific = TRUE)
) +
theme(
axis.title = element_blank(), # Removes axis labels
legend.position = "none"
)
Fig S2: The median fold difference between tested concentrations in EIPAAB and the median waste water concentration (i.e. tested dose / median waste water concentration)
plot_left
plot_right
Note: the size of the points represents the number of occurrences in EIPAAB, and the colour of the points represents the number of detections in the environmental databases
Save the figure
invisible(fig_s2 <- plot_left | plot_right)
ggsave(filename = paste0(fig_wd, "./figure-s2.pdf"), plot = fig_s2, width = 210,
height = 297/1.25, units = "mm")
Looking at the number of compounds in the database that have data at concentrations below the median and upper 95% CrI.
The percentage of compounds in the EIPAAB database that have data at concentrations below the median and upper 95% credible interval of surface water and wastewater concentrations:
compound_sum_table_surf <- under_levels %>%
dplyr::reframe(total_compounds = sum(!is.na(prop_less_med_sw)), count_less_med_sw = sum(prop_less_med_sw ==
0, na.rm = TRUE), count_less_hic_sw = sum(prop_less_hic_sw == 0, na.rm = TRUE),
percent_less_med_sw = (count_less_med_sw/total_compounds) * 100, percent_less_hic_sw = (count_less_hic_sw/total_compounds) *
100, `Percent with no data under median concentration` = paste0(round(percent_less_med_sw,
2), "%", " (", count_less_med_sw, " of ", total_compounds, ")"), `Percent with no data under upper 95 CI concentration` = paste0(round(percent_less_hic_sw,
2), "%", " (", count_less_hic_sw, " of ", total_compounds, ")")) %>%
dplyr::mutate(Matrix = "Surface water") %>%
dplyr::select(Matrix, "Percent with no data under median concentration", "Percent with no data under upper 95 CI concentration")
compound_sum_table_waste <- under_levels %>%
dplyr::reframe(total_compounds = sum(!is.na(prop_less_med_ef)), count_less_med_ef = sum(prop_less_med_ef ==
0, na.rm = TRUE), count_less_hic_ef = sum(prop_less_hic_ef == 0, na.rm = TRUE),
percent_less_med_ef = (count_less_med_ef/total_compounds) * 100, percent_less_hic_ef = (count_less_hic_ef/total_compounds) *
100, `Percent with no data under median concentration` = paste0(round(percent_less_med_ef,
2), "%", " (", count_less_med_ef, " of ", total_compounds, ")"), `Percent with no data under upper 95 CI concentration` = paste0(round(percent_less_hic_ef,
2), "%", " (", count_less_hic_ef, " of ", total_compounds, ")")) %>%
dplyr::mutate(Matrix = "Wastewater") %>%
dplyr::select(Matrix, "Percent with no data under median concentration", "Percent with no data under upper 95 CI concentration")
compound_sum_table <- rbind(compound_sum_table_surf, compound_sum_table_waste)
compound_sum_table %>%
gt() %>%
fmt_number(columns = c("Percent with no data under median concentration", "Percent with no data under upper 95 CI concentration"),
decimals = 0) %>%
cols_align(align = "center", columns = everything())
Matrix | Percent with no data under median concentration | Percent with no data under upper 95 CI concentration |
---|---|---|
Surface water | 74.82% (104 of 139) | 50.36% (70 of 139) |
Wastewater | 64.86% (96 of 148) | 44.59% (66 of 148) |
Making a summary data frame for each compound, all_surfacewater, that includes: (1) the number of samples, (2) the number of positive detections (based on single samples, not summary statistics), (3) the rank order of detections, (4) the percentage of positive detections (only for compounds with more than 10 sample units), and (5) the rank order of the percentage of positive detections.
all_surfacewater <- all_databases %>%
dplyr::filter(source != "eipaab", matrix_group == "surfacewater") %>%
dplyr::group_by(compound_name_corrected) %>%
dplyr::reframe(combind_samples = sum(n_units_est, na.rm = TRUE), combind_rows = length(unique_row_id),
combind_singles = sum(concentration_type == "single", na.rm = TRUE), combind_detection = sum(value >
0, na.rm = TRUE), combind_single_detection = sum(value > 0 & concentration_type ==
"single", na.rm = TRUE), combind_percent_detection = if_else(combind_singles >
10, (combind_single_detection/combind_singles) * 100, NA)) %>%
ungroup() %>%
dplyr::mutate(combind_rank_samples = rank(-combind_samples, ties.method = "max",
na.last = "keep"), combind_rank_detection = rank(-combind_detection, ties.method = "max",
na.last = "keep"), combind_rank_detection_percent = rank(-combind_percent_detection,
ties.method = "max", na.last = "keep"))
There’s a total of 1650 compounds in this surface water summary
all_surfacewater %>%
dplyr::distinct(compound_name_corrected) %>%
nrow(.)
## [1] 1650
Making an occurrence count for EIPAAB, 📊 eipaab_counts.
all_surfacewater_list <- all_surfacewater %>%
dplyr::distinct(compound_name_corrected) %>%
dplyr::pull(compound_name_corrected)
eipaab_counts <- all_databases %>%
dplyr::filter(source == "eipaab" & compound_name_corrected %in% all_surfacewater_list) %>%
dplyr::group_by(compound_name_corrected) %>%
dplyr::reframe(eipaab_n = sum(n_units_est, na.rm = TRUE), eipaab_n_log = log(eipaab_n)) %>%
ungroup() %>%
dplyr::mutate(eipaab_rank = rank(-eipaab_n, ties.method = "max", na.last = "keep")) %>%
tidyr::complete(., compound_name_corrected)
Combining the data to make 📊 test_surfacewater
test_surfacewater <- left_join(all_surfacewater, eipaab_counts, by = "compound_name_corrected") %>%
dplyr::mutate(rank_samples_dif = eipaab_rank - combind_rank_samples, rank_samples_dif = if_else(is.na(eipaab_n),
NA, rank_samples_dif), rank_detection_dif = eipaab_rank - combind_rank_detection,
rank_detection_percent_dif = eipaab_rank - combind_rank_detection_percent) %>%
dplyr::arrange(desc(rank_detection_dif), desc(combind_detection)) %>%
dplyr::mutate(rank_detection_dif = 1:nrow(.)) %>%
dplyr::arrange(desc(rank_detection_percent_dif), desc(combind_percent_detection)) %>%
dplyr::mutate(rank_detection_percent_dif = 1:nrow(.))
Figuring out the maximum possible rank for total occurrences and for positive detections.
uba_rank_detection_max <- test_surfacewater %>%
dplyr::arrange(desc(combind_rank_detection)) %>%
dplyr::slice(1) %>%
dplyr::pull(combind_rank_detection)
uba_rank_detection_percent_max <- test_surfacewater %>%
dplyr::arrange(desc(combind_rank_detection_percent)) %>%
dplyr::slice(1) %>%
dplyr::pull(combind_rank_detection_percent)
print(paste0("The max rank in terms of the number of occurrences is ", uba_rank_detection_max,
". The max rank for the percentage of postive detections is ", uba_rank_detection_percent_max))
## [1] "The max rank in terms of the number of occurrences is 1650. The max rank for the percentage of postive detections is 1424"
The 10 most common compounds in EIPAAB versus their occurrence and detection frequency in environmental surface water samples.
table_top_surfacewater <- test_surfacewater %>%
arrange(desc(eipaab_n)) %>%
slice(1:10) %>%
mutate(compound_name = sentence_case(compound_name_corrected), combind_percent_detection = round(combind_percent_detection,
2)) %>%
select(compound_name, eipaab_n, combind_samples, combind_detection, combind_percent_detection,
eipaab_rank, combind_rank_detection, combind_rank_detection_percent) %>%
rename(Name = compound_name, `EIPAAB tests` = eipaab_n, `Environmental samples` = combind_samples,
`All detections` = combind_detection, `Percent single detections` = combind_percent_detection,
`EIPAAB rank (1–870)` = eipaab_rank, `Rank total detections (1–1650)` = combind_rank_detection,
`Rank percent detections (1–1424)*` = combind_rank_detection_percent) %>%
flextable() %>%
colformat_num(j = c("Environmental samples", "All detections"), digits = 0) %>%
align(align = "center", part = "all") %>%
autofit()
table_top_surfacewater
Name | EIPAAB tests | Environmental samples | All detections | Percent single detections | EIPAAB rank (1–870) | Rank total detections (1–1650) | Rank percent detections (1–1424)* |
---|---|---|---|---|---|---|---|
Fluoxetine | 122 | 6,642 | 197 | 2.92 | 1 | 185 | 367 |
17-alpha-ethinylestradiol | 56 | 18,120 | 626 | 3.04 | 2 | 113 | 361 |
Venlafaxine | 33 | 5,333 | 1,308 | 46.68 | 3 | 77 | 79 |
Citalopram | 32 | 3,471 | 697 | 27.62 | 4 | 108 | 125 |
Sertraline | 30 | 4,928 | 56 | 1.04 | 5 | 299 | 471 |
Carbamazepine | 27 | 56,084 | 14,637 | 54.90 | 6 | 16 | 65 |
Oxazepam | 26 | 18,858 | 8,704 | 49.00 | 7 | 23 | 74 |
Ibuprofen | 18 | 27,447 | 5,389 | 25.04 | 8 | 30 | 134 |
17-beta-trenbolone | 16 | 247 | 4 | 0.00 | 9 | 647 | 1,424 |
Diazepam | 15 | 7,192 | 399 | 6.68 | 10 | 143 | 261 |
Note: the rank for percentage detections does not range from 1 to the total number of compounds (it tops out at 1424 rather than 1650) because percentage detections were only calculated where the number of single samples was greater than ten
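To see why the maximum rank equals the number of compounds with a calculated value: rank() with ties.method = "max" and na.last = "keep" leaves NA entries unranked, so the largest possible rank is the non-NA count. A minimal sketch with toy values:
rank(-c(50, 10, 10, NA), ties.method = "max", na.last = "keep")
## [1]  1  3  3 NA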
Saving the table
save_as_docx(table_top_surfacewater, path = paste0(fig_wd, "./table-S7.docx"))
Here we will build a simple Bayesian negative binomial (log-link) regression model to look at the relationship between occurrence in the EIPAAB database and the number of environmental detections.
test_surfacewater %>%
dplyr::filter(!is.na(eipaab_n))
## # A tibble: 163 × 16
## compound_name_corrected combind_samples combind_rows combind_singles
## <chr> <int> <int> <int>
## 1 guanylurea 718 68 27
## 2 irbesartan 7084 6876 6840
## 3 lidocaine 7540 6189 6143
## 4 sulfapyridine 3069 1851 1714
## 5 desvenlafaxine 4045 2012 1918
## 6 metoprolol 16887 11528 11332
## 7 cetirizine 2012 1921 1912
## 8 genistein 143 143 143
## 9 nicotine 3620 3620 3620
## 10 diethyltoluamide 38383 38383 38383
## # ℹ 153 more rows
## # ℹ 12 more variables: combind_detection <int>, combind_single_detection <int>,
## # combind_percent_detection <dbl>, combind_rank_samples <int>,
## # combind_rank_detection <int>, combind_rank_detection_percent <int>,
## # eipaab_n <int>, eipaab_n_log <dbl>, eipaab_rank <int>,
## # rank_samples_dif <int>, rank_detection_dif <int>,
## # rank_detection_percent_dif <int>
detection_negbin_str <- bf(eipaab_n ~ combind_detection, family = negbinomial())
These are the default priors.
suppressWarnings(get_prior(detection_negbin_str, data = test_surfacewater, family = negbinomial()))
## prior class coef group resp dpar nlpar lb ub
## (flat) b
## (flat) b combind_detection
## student_t(3, 0, 2.5) Intercept
## inv_gamma(0.4, 0.3) shape 0
## source
## default
## (vectorized)
## default
## default
Run the model. The fitted model is saved to file via the file argument, so on later runs brms will load the saved model instead of re-fitting it.
options(brms.backend = "cmdstanr")
detection_negbin_mod <- brm(
detection_negbin_str,
data = test_surfacewater,
cores = 4,
chains = 4,
#prior = mod_priors, #use defaults
warmup = 1000,
seed = 20250418,
thin = 2,
iter = 8000,
control = list(max_treedepth = 20, adapt_delta = 0.95),
save_pars = save_pars(all = TRUE),
sample_prior = TRUE,
file = paste0(mods_wd, './detection_negbin_mod')
)
Reload the model if necessary.
detection_negbin_mod <- readRDS(file = paste0(mods_wd, "./detection_negbin_mod.rds"))
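As an aside: if the model structure or data change, a cached fit can become stale. Recent brms versions offer a file_refit argument that refits only when needed; a hedged sketch, kept hashed out here as with the saved models above:
# detection_negbin_mod <- brm(detection_negbin_str, data = test_surfacewater,
#     file = paste0(mods_wd, "./detection_negbin_mod"), file_refit = "on_change")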
The R2 value is very low, but should be fine for the purposes of this investigation.
performance::r2_bayes(detection_negbin_mod, robust = FALSE, ci = 0.95)
## # Bayesian R2 with Compatibility Interval
##
## Conditional R2: 0.077 (95% CI [1.251e-07, 0.254])
The model generally follows the observed data, but isn’t the best fit. Again, this should be fine for the purposes of this investigation.
color_scheme_set("red")
brms::pp_check(detection_negbin_mod, ndraws = 50, type = "ecdf_overlay")
Checking R-hat values to confirm model convergence, and effective sample sizes (ESS) for the estimates.
print(summary(detection_negbin_mod, prob = 0.95)) #All Rhat = 1 (good).
## Family: negbinomial
## Links: mu = log; shape = identity
## Formula: eipaab_n ~ combind_detection
## Data: test_surfacewater (Number of observations: 163)
## Draws: 4 chains, each with iter = 8000; warmup = 1000; thin = 2;
## total post-warmup draws = 14000
##
## Regression Coefficients:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept 1.35 0.10 1.15 1.56 1.00 8978 9012
## combind_detection 0.00 0.00 0.00 0.00 1.00 13850 12137
##
## Further Distributional Parameters:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## shape 0.81 0.09 0.65 1.01 1.00 8054 7759
##
## Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
Diagnostic plots look fine. A smaller shape value indicates greater overdispersion (variance is much larger than the mean).
color_scheme_set("red")
plot(detection_negbin_mod)
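For context, under the negative binomial parameterisation used by brms, Var(Y) = mu + mu^2/shape, so the fitted shape of ~0.81 implies a variance well above the Poisson case. A quick illustration with assumed values:
mu <- 4
shape <- 0.81
mu + mu^2/shape # ~23.8, versus a Poisson variance of 4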
Pulling the conditional effect estimates from the model. The estimate corresponds to the effect of the predictor on the log of the expected response variable. A 1-unit increase (i.e. 1 detection) in the number of environmental detections increases the expected occurrence in the EIPAAB database by 0.011% (i.e. (exp(0.0001105696) - 1) × 100 = 0.01105757). In other words, for each 1,000 detections we would expect roughly an 11.7% increase in tests in the EIPAAB database (exp(0.0001105696 × 1000) ≈ 1.117).
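As a quick sanity check of this back-transformation (coefficient value taken from the model summary above):
b <- 0.0001105696 # posterior mean for combind_detection
(exp(b) - 1) * 100 # ~0.011% per additional detection
(exp(b * 1000) - 1) * 100 # ~11.7% per additional 1,000 detections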
estimates_3 <- conditional_effects(detection_negbin_mod)
estimate_data_3 <- estimates_3$combind_detection
model_summary_3 <- as.data.frame(fixef(detection_negbin_mod, summary = TRUE)) %>%
rownames_to_column() %>%
clean_names()
estimate_3 <- model_summary_3$estimate[2]
L95CI_3 <- model_summary_3$q2_5[2]
H95CI_3 <- model_summary_3$q97_5[2]
estimate_percent_3 <- (exp(estimate_3) - 1) * 100
L95CI_percent_3 <- (exp(L95CI_3) - 1) * 100
H95CI_percent_3 <- (exp(H95CI_3) - 1) * 100
model_summary_3_ft <- model_summary_3 %>%
dplyr::mutate(across(where(is.double), ~round(.x, 5))) %>%
dplyr::rename(predictor = rowname) %>%
dplyr::mutate(CrI = paste0(q2_5, " to ", q97_5)) %>%
dplyr::select(predictor, estimate, CrI) %>%
flextable() %>%
set_header_labels(predictor = "Predictor", estimate = "Estimate", CrI = "95% CrI") %>%
autofit()
model_summary_3_ft
Predictor | Estimate | 95% CrI |
---|---|---|
Intercept | 1.35109 | 1.15232 to 1.55699 |
combind_detection | 0.00011 | 4e-05 to 0.00019 |
test_surfacewater %>%
filter(!is.na(eipaab_n)) %>%
dplyr::distinct(compound_name_corrected) %>%
dplyr::arrange(desc(compound_name_corrected))
## # A tibble: 163 × 1
## compound_name_corrected
## <chr>
## 1 yohimbine
## 2 xylazine
## 3 verapamil
## 4 venlafaxine
## 5 tylosin
## 6 tramadol
## 7 toltrazuril
## 8 tiamulin
## 9 thrimethoprim
## 10 tetracycline
## # ℹ 153 more rows
Each pharmaceutical occurrence in the behavioural database plotted against the number of positive detections in the environmental databases. The five compounds highlighted represent those that are currently underrepresented in the behaviour test database based on their rank detections (i.e. rank occurrence in behaviour test databases vs rank detections). The point size represents the total number of samples across all environmental databases.
text_coord <- test_surfacewater %>%
filter(!is.na(eipaab_n)) %>%
summarise(x = max(combind_detection, na.rm = TRUE)/2, y = max(eipaab_n, na.rm = TRUE))
y_coord <- text_coord %>%
pull(y)
x_coord <- text_coord %>%
pull(x)
fig_s3a <- ggplot() + geom_line(data = estimate_data_3, aes(x = combind_detection,
y = estimate__), color = "#E01F33", linewidth = 1) + geom_ribbon(data = estimate_data_3,
aes(x = combind_detection, ymin = lower__, ymax = upper__), fill = "#E01F33",
alpha = 0.1) + geom_point(data = test_surfacewater %>%
filter(!is.na(eipaab_n)), aes(x = combind_detection, y = eipaab_n, size = combind_samples),
alpha = 0.4, shape = 21, color = "#01080A", fill = "#E01F33") + annotate("text",
x = x_coord, y = y_coord, label = paste0("Bayesian negative binomial regression",
"\n", "Effect estimate: ", round(estimate_percent_3, 3), "\n", "95%CrI: ",
round(L95CI_percent_3, 3), " to ", round(H95CI_percent_3, 3)), hjust = 0,
vjust = 1, size = 4) + geom_text_repel(data = test_surfacewater %>%
filter(!is.na(eipaab_n), rank_detection_dif %in% 1:5), aes(x = combind_detection,
y = eipaab_n, label = compound_name_corrected), size = 3, box.padding = 0.7) +
# Adjust scales
scale_fill_gradientn(colours = colorspace::diverge_hcl(7)) + scale_colour_manual(values = c(`TRUE` = "#01080A",
`FALSE` = "#01080A"), guide = "none") + labs(title = "Test vs environmental detections",
x = "Total detections in environmental databases", y = "Occurrence in behavioural database",
fill = "Number of surface water samples", size = "Number of surface water samples") +
theme_clean() + theme(legend.position = "bottom", legend.justification = "center")
fig_s3a
This code saves the figure above as a PDF.
ggsave(filename = paste0(fig_wd, "./figure-s3a.pdf"),
plot = fig_s3a,
width = 210,
height = 297/1.5, # Specify the height of the plot
units = "mm")
Model structure.
percent_detection_negbin_str <- bf(eipaab_n ~ combind_percent_detection, family = negbinomial())
Run the model. As above, the fitted model is saved to file via the file argument, so brms will load the saved model instead of re-fitting it.
options(brms.backend = "cmdstanr")
percent_detection_negbin_mod <- brm(
percent_detection_negbin_str,
data = test_surfacewater,
cores = 4,
chains = 4,
#prior = mod_priors, #use defaults
warmup = 1000,
seed = 20250418,
thin = 2,
iter = 8000,
control = list(max_treedepth = 20, adapt_delta = 0.95),
save_pars = save_pars(all = TRUE),
sample_prior = TRUE,
file = paste0(mods_wd, './percent_detection_negbin_mod')
)
Reload the model if necessary.
percent_detection_negbin_mod <- readRDS(file = paste0(mods_wd, "./percent_detection_negbin_mod.rds"))
The R2 value is very low, but should be fine for the purposes of this investigation.
performance::r2_bayes(percent_detection_negbin_mod, robust = FALSE, ci = 0.95)
## # Bayesian R2 with Compatibility Interval
##
## Conditional R2: 0.020 (95% CI [1.900e-10, 0.065])
Check fit.
color_scheme_set("red")
brms::pp_check(percent_detection_negbin_mod, ndraws = 100, type = "ecdf_overlay")
Checking R-hat values to confirm model convergence.
print(summary(percent_detection_negbin_mod, prob = 0.95)) #All Rhat = 1 (good).
## Family: negbinomial
## Links: mu = log; shape = identity
## Formula: eipaab_n ~ combind_percent_detection
## Data: test_surfacewater (Number of observations: 156)
## Draws: 4 chains, each with iter = 8000; warmup = 1000; thin = 2;
## total post-warmup draws = 14000
##
## Regression Coefficients:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
## Intercept 1.37 0.12 1.13 1.62 1.00 13369
## combind_percent_detection 0.01 0.01 0.00 0.02 1.00 12856
## Tail_ESS
## Intercept 12015
## combind_percent_detection 12000
##
## Further Distributional Parameters:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## shape 0.78 0.09 0.62 0.97 1.00 13705 12182
##
## Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
Check chains and estimates.
color_scheme_set("red")
plot(percent_detection_negbin_mod)
Pulling the conditional effect estimates from the model. The estimate corresponds to the effect of the predictor on the log of the expected response variable. A 1-unit increase (i.e. 1%) in the percentage of environmental detections increases the expected occurrence in the EIPAAB database by 1.17% (i.e. (exp(0.01160477) - 1) × 100 = 1.167237).
estimates_4 <- conditional_effects(percent_detection_negbin_mod)
estimate_data_4 <- estimates_4$combind_percent_detection
model_summary_4 <- as.data.frame(fixef(percent_detection_negbin_mod, summary = TRUE)) %>%
rownames_to_column() %>%
clean_names()
estimate_4 <- model_summary_4$estimate[2]
L95CI_4 <- model_summary_4$q2_5[2]
H95CI_4 <- model_summary_4$q97_5[2]
estimate_percent_4 <- (exp(estimate_4) - 1) * 100
L95CI_percent_4 <- (exp(L95CI_4) - 1) * 100
H95CI_percent_4 <- (exp(H95CI_4) - 1) * 100
model_summary_4_ft <- model_summary_4 %>%
dplyr::mutate(across(where(is.double), ~round(.x, 3))) %>%
dplyr::rename(predictor = rowname) %>%
dplyr::mutate(CrI = paste0(q2_5, " to ", q97_5)) %>%
dplyr::select(predictor, estimate, CrI) %>%
flextable() %>%
set_header_labels(predictor = "Predictor", estimate = "Estimate", CrI = "95% CrI") %>%
autofit()
model_summary_4_ft
Predictor | Estimate | 95% CrI |
---|---|---|
Intercept | 1.366 | 1.127 to 1.615 |
combind_percent_detection | 0.012 | 0.001 to 0.023 |
Compounds with the most environmental detections that are not present in EIPAAB.
table_S5 <- test_surfacewater %>%
filter(is.na(eipaab_n)) %>%
arrange(desc(combind_detection)) %>%
slice(1:15) %>%
mutate(compound_name = sentence_case(compound_name_corrected)) %>%
select(compound_name, combind_samples, combind_detection, combind_percent_detection,
combind_rank_samples, combind_rank_detection) %>%
rename(Name = compound_name, Samples = combind_samples, Detections = combind_detection,
`Percentage detections` = combind_percent_detection, `Rank samples` = combind_rank_samples,
`Rank detections` = combind_rank_detection) %>%
flextable() %>%
colformat_num(j = c("Samples", "Detections"), digits = 0) %>%
align(align = "center", part = "all") %>%
autofit()
table_S5
Name | Samples | Detections | Percentage detections | Rank samples | Rank detections |
---|---|---|---|---|---|
Glyphosate | 79,682 | 33,956 | 42.614392 | 28 | 1 |
Atrazine | 121,709 | 29,480 | 24.221709 | 3 | 2 |
Bentazone | 98,363 | 28,315 | 28.786231 | 19 | 3 |
Carbendazim | 110,395 | 22,410 | 20.299832 | 9 | 4 |
Lead | 57,662 | 21,463 | 37.222087 | 56 | 5 |
Tebuconazole | 106,651 | 20,664 | 19.375346 | 12 | 6 |
Imidacloprid | 90,527 | 20,383 | 22.515934 | 21 | 7 |
Boscalid | 68,930 | 19,456 | 28.225736 | 39 | 8 |
Chlorotoluron | 114,605 | 18,155 | 15.841368 | 4 | 9 |
Diflufenican | 70,085 | 17,074 | 24.361846 | 38 | 11 |
Isoproturon | 128,238 | 16,944 | 13.212932 | 1 | 12 |
Azoxystrobin | 83,421 | 16,516 | 19.798372 | 25 | 13 |
Propyzamide | 105,962 | 16,302 | 15.384761 | 14 | 14 |
Chloridazon | 98,996 | 15,203 | 15.357186 | 17 | 15 |
Ethofumesate | 111,478 | 10,649 | 9.552557 | 8 | 19 |
Pharmaceutical occurrence in the behavioural test database plotted against the percentage of positive detections (i.e. positive detections relative to total samples for that compound) in environmental databases. The five compounds highlighted represent those that are currently underrepresented in the behaviour test database based on their rank percentage detections (i.e. rank occurrence in behaviour test databases vs rank percentage detections in environmental databases). The point size represents the total number of samples across all environmental databases.
text_coord <- test_surfacewater %>%
filter(!is.na(eipaab_n)) %>%
summarise(x = max(combind_percent_detection, na.rm = TRUE)/2, y = max(eipaab_n,
na.rm = TRUE))
y_coord <- text_coord %>%
pull(y)
x_coord <- text_coord %>%
pull(x)
fig_s3b <- ggplot() + geom_line(data = estimate_data_4, aes(x = combind_percent_detection,
y = estimate__), color = "#1c4595", linewidth = 1) + geom_ribbon(data = estimate_data_4,
aes(x = combind_percent_detection, ymin = lower__, ymax = upper__), fill = "#1c4595",
alpha = 0.1) + geom_point(data = test_surfacewater %>%
filter(!is.na(eipaab_n)) %>%
dplyr::arrange(rank_detection_percent_dif) %>%
mutate(is_top5 = rank_detection_percent_dif %in% 1:5), aes(x = combind_percent_detection,
y = eipaab_n, size = combind_singles), alpha = 0.4, shape = 21, color = "#01080A",
fill = "#1C4794") + annotate("text", x = x_coord, y = y_coord, label = paste0("Bayesian negative binomial regression",
"\n", "Effect estimate (%): ", round(estimate_percent_4, 3), "\n", "95%CrI: ",
round(L95CI_percent_4, 3), " to ", round(H95CI_percent_4, 3)), hjust = 0, vjust = 1,
size = 4) + geom_text_repel(data = test_surfacewater %>%
filter(!is.na(eipaab_n), rank_detection_percent_dif %in% 1:5), aes(x = combind_percent_detection,
y = eipaab_n, label = compound_name_corrected), size = 3, box.padding = 0.7) +
scale_fill_gradientn(colours = colorspace::diverge_hcl(7)) + scale_colour_manual(values = c(`TRUE` = "#01080A",
`FALSE` = "#01080A"), guide = "none") + labs(title = "Test vs percentage detections",
x = "Percentage detections in environmental databases", y = "Occurrence in behavioural database",
size = "Number of single samples") + theme_clean() + theme(legend.position = "bottom",
legend.justification = "center")
fig_s3b
Save the figure
ggsave(filename = paste0(fig_wd, "./figure-s3b.pdf"),
plot = fig_s3b,
width = 210,
height = 297/1.5, # Specify the height of the plot
units = "mm")
Here are compounds that are not in the EIPAAB database but have high positive detection rates in the environment.
gap_tb <- test_surfacewater %>%
filter(is.na(eipaab_n), !is.na(combind_percent_detection)) %>%
arrange(desc(combind_percent_detection)) %>%
slice(1:20) %>%
mutate(compound_name = sentence_case(compound_name_corrected), combind_percent_detection = round(combind_percent_detection,
2)) %>%
select(compound_name, combind_singles, combind_single_detection, combind_percent_detection) %>%
rename(Name = compound_name, `Single samples` = combind_singles, `Single detections` = combind_single_detection,
`Percent detections` = combind_percent_detection) %>%
flextable() %>%
align(align = "center", part = "all") %>%
autofit()
gap_tb
Name | Single samples | Single detections | Percent detections |
---|---|---|---|
Sulisobenzone | 101 | 101 | 100.00 |
10,11-dihydro-10,11-dihydroxy carbamazepine | 13 | 13 | 100.00 |
Alpha-hydroxy metoprolol | 24 | 24 | 100.00 |
Acetaminophen sulfate | 24 | 24 | 100.00 |
17-beta-estradiol-17-glucuronide | 14 | 14 | 100.00 |
Estriol-16-glucuronide | 14 | 14 | 100.00 |
2-hydroxy ibuprofen | 41 | 40 | 97.56 |
4-acetaminoantipyrine | 21 | 20 | 95.24 |
Methenamine | 811 | 768 | 94.70 |
1-hydroxy ibuprofen | 61 | 57 | 93.44 |
Carbamazepine-3oh | 26 | 24 | 92.31 |
Carboxyibuprofen | 39 | 36 | 92.31 |
O-desmethyltramadol | 34 | 31 | 91.18 |
Perindopril | 112 | 101 | 90.18 |
Lamotrigine | 1,555 | 1,378 | 88.62 |
4-formylaminoantipyrine | 40 | 35 | 87.50 |
Erythromycin-h2o | 168 | 143 | 85.12 |
10-hydroxy-amitriptyline | 13 | 11 | 84.62 |
10,11-dihydro-10,11-epoxycarbamazepine | 44 | 37 | 84.09 |
Carbamazepine-2oh | 31 | 25 | 80.65 |
Saving the table
save_as_docx(gap_tb, path = paste0(fig_wd, "./gap_tb.docx"))
Here we will look at general temporal trends using the EIPAAB, UBA and NORMAN data.
temporal_trends <- all_databases %>%
dplyr::filter(source != "wilkson") %>%
dplyr::mutate(group_type = if_else(source == "eipaab", "test", "environmental")) %>%
dplyr::group_by(group_type, year) %>%
dplyr::reframe(n_samples = sum(n_units_est, na.rm = TRUE), n_compounds = length(unique(compound_name_corrected)),
rel_compounds = n_compounds/n_samples, compound_list = list(unique(compound_name_corrected))) %>%
tidyr::complete(group_type, year) %>%
dplyr::filter(!is.na(year)) %>%
dplyr::group_by(group_type) %>%
dplyr::mutate(compound_list = purrr::map(compound_list, ~if (is.null(.x)) character(0) else .x),
cumulative_compounds = purrr::accumulate(compound_list, ~union(.x, .y)) %>%
purrr::map_int(length), n_samples_z = scale(n_samples)) %>%
dplyr::ungroup()
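The cumulative compound count works by letting purrr::accumulate() take a running union of each year’s compound list; a minimal sketch with toy (hypothetical) data:
yearly_compounds <- list(c("a", "b"), c("b", "c"), "d")
purrr::accumulate(yearly_compounds, union) %>%
    purrr::map_int(length) # unique compounds seen up to each year
## [1] 2 3 4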
Let’s look at the total number of samples per year in these combined databases
temporal_trends %>%
# dplyr::filter(year > 1999 & year < 2021) %>%
ggplot(aes(x = year, y = n_samples)) + geom_point(alpha = 0.5) + facet_wrap(~group_type,
scales = "free", labeller = labeller(group_type = c(environmental = "Environmental data",
test = "Behavioural ecotoxicology data"))) + labs(title = "Data over time",
x = "Year", y = "Sampling effort (tests or water samples)") + theme_clean() +
theme(legend.position = "bottom", legend.justification = "center")
For a clearer temporal trend, we look within the NORMAN and PHARMS-UBA databases separately.
Making a new dataset just for the NORMAN EMPODAT and PHARMS-UBA databases
uba_norman_temporal_trends <- all_databases %>%
dplyr::filter(source == "uba" | source == "norman") %>%
dplyr::group_by(source, year) %>%
dplyr::reframe(n_samples = sum(n_units_est, na.rm = TRUE), n_compounds = length(unique(compound_name_corrected)),
rel_compounds = n_compounds/n_samples, compound_list = list(unique(compound_name_corrected))) %>%
tidyr::complete(source, year) %>%
dplyr::filter(!is.na(year)) %>%
dplyr::group_by(source) %>%
dplyr::mutate(compound_list = purrr::map(compound_list, ~if (is.null(.x)) character(0) else .x),
cumulative_compounds = purrr::accumulate(compound_list, ~union(.x, .y)) %>%
purrr::map_int(length), n_samples_z = scale(n_samples)) %>%
dplyr::ungroup()
Closer look at UBA
uba_sample_year <- uba_norman_temporal_trends %>%
dplyr::filter(source == "uba") %>%
ggplot(aes(x = year, y = n_samples)) + geom_point(alpha = 0.5) + labs(title = "Data over time",
x = "Year", y = "Sampling effort") + theme_clean() + theme(legend.position = "bottom",
legend.justification = "center")
uba_sample_year
Save this figure for the MS
ggsave(filename = paste0(fig_wd, "./uba_sample_year.pdf"), plot = uba_sample_year,
width = 210, height = 297/1.5, units = "mm")
Now the cumulative number of compounds over time.
temporal_trends %>%
# dplyr::filter(year > 1999 & year < 2021) %>%
ggplot(aes(x = year, y = cumulative_compounds, size = n_samples_z)) + geom_point(alpha = 0.5) +
facet_wrap(~group_type, scales = "free", labeller = labeller(group_type = c(environmental = "Environmental data",
test = "Behavioural ecotoxicology data"))) + labs(title = "Cummulative compounds",
x = "Year", y = "Total number of compounds", size = "Sample effort (scaled)") +
theme_clean() + theme(legend.position = "bottom", legend.justification = "center")
With a closer look at UBA
uba_cummulative_fig <- uba_norman_temporal_trends %>%
dplyr::filter(source == "uba") %>%
ggplot(aes(x = year, y = cumulative_compounds, size = n_samples_z)) + geom_point(alpha = 0.5) +
# facet_wrap(~source, scales = 'free', labeller = labeller(source = c(
# 'uba' = 'PHARMS-UBA database', 'norman' = 'NORMAN EMPODAT database')) ) +
labs(title = "Cummulative compounds", x = "Year", y = "Total number of compounds",
size = "Sample effort (scaled)") + theme_clean() + theme(legend.position = "bottom",
legend.justification = "center")
uba_cummulative_fig
Save this figure for the MS
ggsave(filename = paste0(fig_wd, "./uba_cummulative_fig.pdf"), plot = uba_cummulative_fig,
width = 210, height = 297/1.5, units = "mm")
select_doses()
Generates a sequence of recommended doses
based on observed environmental concentration data, using a specified
measure of central tendency (mean, median, or mode) and geometric
spacing. The function selects up to max_doses values spaced above and/or
below the central value, optionally constrained to observed values and
limits of quantification (LOQ). It returns a dose table, a summary of
the distribution, and side-by-side plots showing dose placement on both
raw and log-scaled density curves. Designed to support transparent and
data-informed dose selection for ecotoxicological study design.
select_doses <- function(data, variable, central_tendency = "median", min_spacing = 3.2,
    max_doses = 5, LOQ = 1e-04, direction = "both", limit_to_observed = FALSE) {
    # Note: the 'direction' argument is not currently used in the function body
library(ggplot2)
library(dplyr)
library(patchwork)
values <- data[[variable]]
values <- values[!is.na(values) & values > 0]
if (length(values) < 3)
stop("Not enough data points.")
# Central tendency
center <- switch(tolower(central_tendency), mean = mean(values), median = median(values),
mode = {
d <- density(values)
d$x[which.max(d$y)]
}, stop("Invalid central_tendency"))
upper <- quantile(values, 0.95)
lower <- quantile(values, 0.05)
max_val <- max(values)
min_val <- min(values)
doses <- c(center)
labels <- c("D1")
# Generate above and below doses
above <- c()
below <- c()
# Above
val <- center
while (length(above) < max_doses) {
val <- val * min_spacing
if (limit_to_observed && val > max_val)
break
above <- c(above, val)
}
# Below
val <- center
while (length(below) < max_doses) {
val <- val/min_spacing
if (val <= LOQ)
break
if (limit_to_observed && val < min_val)
break
below <- c(below, val)
}
# Alternate selection
dose_list <- c(center)
i <- 1
while (length(dose_list) < max_doses) {
if (i <= length(above))
dose_list <- c(dose_list, above[i])
if (length(dose_list) == max_doses)
break
if (i <= length(below))
dose_list <- c(dose_list, below[i])
if (length(dose_list) == max_doses)
break
i <- i + 1
if (i > 20)
break
}
if (length(dose_list) < max_doses) {
warning("Unable to select enough doses within constraints.")
return(NA)
}
dose_list <- sort(dose_list)
dose_df <- data.frame(dose = paste0("D", seq_along(dose_list)), value = dose_list)
# Summary
summary_tbl <- data.frame(central = center, upper_95 = upper, lower_05 = lower,
max = max_val, min = min_val, LOQ = LOQ, row.names = NULL)
# Base density for grey fill
dens <- density(values)
dens_df <- data.frame(x = dens$x, y = dens$y)
dens_df <- dens_df %>%
mutate(in_cri = x >= lower & x <= upper)
caption_text <- paste("Density plots show the distribution of observed environmental data.\n",
"The blue dashed line indicates the selected measure of central tendency\n",
"(mean, median, or mode). The red dot-dash line marks the Limit of\n", "Quantification (LOQ), defult 0.0001. Black points represent the recommended doses\n",
"based on spacing from the central value. The shaded grey region denotes\n",
"the 95% credible interval (CrI) of the data distribution.", sep = "")
# Plot
base_plot <- ggplot() + geom_density(data = data, aes(x = .data[[variable]]), size = 1.2) +
geom_ribbon(data = dens_df, aes(x = x, ymin = 0, ymax = ifelse(in_cri, y,
NA)), fill = "grey80", alpha = 0.5) + geom_vline(xintercept = center,
color = "blue", linetype = "dashed", size = 1) + geom_vline(xintercept = LOQ,
color = "red", linetype = "dotdash") + geom_point(data = dose_df, aes(x = value,
y = 0), size = 3, color = "black") + geom_text(data = dose_df, aes(x = value,
y = 0.01, label = dose), vjust = -1) + ggtitle("Dose Selection (Raw Scale)") +
theme_minimal() + labs(caption = caption_text) + theme(plot.caption = element_text(size = 9,
hjust = 0, lineheight = 1.1))
dens_log <- density(log10(values))
dens_df_log <- data.frame(x = 10^dens_log$x, y = dens_log$y)
log_plot <- ggplot() + geom_line(data = dens_df_log, aes(x = x, y = y), size = 1.2) +
annotate("rect", xmin = lower, xmax = upper, ymin = 0, ymax = max(dens_df_log$y),
fill = "grey80", alpha = 0.4) + geom_vline(xintercept = center, color = "blue",
linetype = "dashed", size = 1) + geom_vline(xintercept = LOQ, color = "red",
linetype = "dotdash") + geom_point(data = dose_df, aes(x = value, y = 0),
size = 3, color = "black") + geom_text(data = dose_df, aes(x = value, y = 0.01 *
max(dens_df_log$y), label = dose), vjust = -1) + scale_x_log10() + ggtitle("Dose Selection (Log Scale)") +
theme_minimal()
plots <- print(base_plot + log_plot)
plots
covered_prop <- mean(values >= min(dose_list) & values <= max(dose_list))
summary_tbl$distribution_covered <- covered_prop
return(list(doses = dose_df, summary = summary_tbl, plot = plots))
}
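To make the geometric spacing concrete: doses alternate outwards from the central value by factors of min_spacing. For example, with a centre of 0.01 and min_spacing = 5 (assumed values):
centre <- 0.01
c(below = centre/5, centre = centre, above = centre * 5)
##  below centre  above
##  0.002  0.010  0.050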
Here’s an example data set based on one of the compounds in the dataset, fluoxetine
fluoxetine_surfacewaters <- all_databases %>%
dplyr::filter(matrix_group == "surfacewater" & compound_name == "fluoxetine") %>%
dplyr::arrange(value) %>%
dplyr::filter(value > 0)
Apply to concentration data
my_doses <- select_doses(data = fluoxetine_surfacewaters, variable = "value", central_tendency = "median",
    min_spacing = 5, max_doses = 3, LOQ = 1e-04, direction = "both", limit_to_observed = FALSE)
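The returned object is a list, so the dose table and distribution summary can be pulled out directly, for example:
my_doses$doses # data frame of the selected doses (D1, D2, ...)
my_doses$summary # central value, 95% CrI bounds, LOQ, and proportion of data covered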