Data Cleaning & Preparation for Analysis
Source:vignettes/data_cleaning_preparation.Rmd
library(ManyEcoEvo)
#> Loading required package: rmarkdown
#> Loading required package: bookdown
#> Registered S3 method overwritten by 'parsnip':
#> method from
#> print.nullmodel vegan
#> Registered S3 method overwritten by 'lava':
#> method from
#> print.estimate EnvStats
suppressPackageStartupMessages(library(tidyverse))
0.1 Data Cleaning
0.1.1 Anonymising Data
We have anonymised our public dataset, data(ManyEcoEvo), using anonymise_teams(), which takes a look-up table of new and old identifier names with which to replace each analysis identifier. The look-up table and the original non-anonymised data can be stored in a private repository or component (on the OSF, for example), while the anonymised dataset can be released publicly.
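The look-up approach can be illustrated with a minimal base-R sketch. It does not call anonymise_teams(), whose arguments are not shown here; the function anonymise_sketch(), the example data and the column name New_Identifier are all hypothetical.

```r
# Hypothetical analyst data; identifiers and values are illustrative only.
analyses <- data.frame(
  id_col = c("TeamA-1-1-1", "TeamA-1-1-2", "TeamB-1-1-1"),
  TeamIdentifier = c("TeamA", "TeamA", "TeamB"),
  beta_estimate = c(0.12, -0.30, 0.05),
  stringsAsFactors = FALSE
)

# Hypothetical look-up table mapping original to anonymised team names.
# In practice this table would be kept in a private location.
lookup <- data.frame(
  TeamIdentifier = c("TeamA", "TeamB"),
  New_Identifier = c("Ayr", "Bega"),
  stringsAsFactors = FALSE
)

anonymise_sketch <- function(df, lookup) {
  # Find each team's row in the look-up table; fail if a team is missing
  idx <- match(df$TeamIdentifier, lookup$TeamIdentifier)
  stopifnot(!anyNA(idx))
  new_team <- lookup$New_Identifier[idx]
  # Keep the "-submission-analysis-split" suffix, swap only the team prefix
  suffix <- substring(df$id_col, nchar(df$TeamIdentifier) + 1)
  df$id_col <- paste0(new_team, suffix)
  df$TeamIdentifier <- new_team
  df
}

anonymised <- anonymise_sketch(analyses, lookup)
anonymised
```

Stopping when a team is absent from the look-up table is a design choice in this sketch: it prevents a non-anonymised identifier from slipping into the public dataset unnoticed.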
0.2 Data Pre-processing for Meta-analysis
The meta-analysis requires all estimates to be on the same scale, because it assumes that the outcome measures are comparable. Note that the ManyAnalysts project uses two different outcomes for meta-analysis: standardised effect sizes, \(Z_r\), and out-of-sample predictions, \(y_i\). Alternative effect-size measures may be used instead.
We provide the function standardise_response() to standardise a data frame of analyst data.
data("ManyEcoEvo")
blue_tit_effect_sizes <-
  ManyEcoEvo %>%
  dplyr::filter(dataset == "blue tit") %>%
  pluck("data", 1) %>%
  slice(1:10) %>%
  select(contains("id"),
         -response_id_S2,
         contains("beta"),
         adjusted_df)
blue_tit_effect_sizes
#> # A tibble: 10 × 9
#> response_id submission_id analysis_id split_id TeamIdentifier id_col
#> <chr> <dbl> <dbl> <dbl> <chr> <chr>
#> 1 R_11787O3NmejXKAH 1 2 2 Ayr Ayr-1-2-2
#> 2 R_11787O3NmejXKAH 1 2 3 Ayr Ayr-1-2-3
#> 3 R_11787O3NmejXKAH 1 2 1 Ayr Ayr-1-2-1
#> 4 R_126erjKKuN3IwSJ 2 2 1 Bega Bega-2-2…
#> 5 R_126erjKKuN3IwSJ 2 2 2 Bega Bega-2-2…
#> 6 R_126erjKKuN3IwSJ 1 1 1 Bega Bega-1-1…
#> 7 R_126erjKKuN3IwSJ 1 1 2 Bega Bega-1-1…
#> 8 R_12cozGev3IOOBG2 4 4 1 Bell Bell-4-4…
#> 9 R_12cozGev3IOOBG2 3 3 1 Bell Bell-3-3…
#> 10 R_12cozGev3IOOBG2 1 1 1 Bell Bell-1-1…
#> # ℹ 3 more variables: beta_estimate <dbl>, beta_SE <dbl>, adjusted_df <dbl>
standardise_response(dat = blue_tit_effect_sizes,
                     estimate_type = "Zr",
                     param_table = NULL,
                     dataset = "blue tit") %>%
  select(id_col, contains("beta"), adjusted_df, Zr, VZr)
#>
#> ── Computing meta-analysis inputsfor `estimate_type` = "Zr" ────────────────────
#>
#> ── Computing standardised effect sizes `Zr` and variance `VZr` ──
#>
#> ✖ Required values for computing standardised effect sizes missing:
#> ! Returning "NA" for tupple:
#> 1. beta_estimate NA,
#> 2. beta_se NA,
#> 3. adjusted_df 484.0193.
#> ✖ Required values for computing standardised effect sizes missing:
#> ! Returning "NA" for tupple:
#> 1. beta_estimate NA,
#> 2. beta_se NA,
#> 3. adjusted_df 666.56874.
#> ✖ Required values for computing standardised effect sizes missing:
#> ! Returning "NA" for tupple:
#> 1. beta_estimate NA,
#> 2. beta_se NA,
#> 3. adjusted_df 590.18263.
#> ✖ Required values for computing standardised effect sizes missing:
#> ! Returning "NA" for tupple:
#> 1. beta_estimate NA,
#> 2. beta_se 0.006225,
#> 3. adjusted_df NA.
#> ✖ Required values for computing standardised effect sizes missing:
#> ! Returning "NA" for tupple:
#> 1. beta_estimate NA,
#> 2. beta_se 0.003996,
#> 3. adjusted_df NA.
#> ✖ Required values for computing standardised effect sizes missing:
#> ! Returning "NA" for tupple:
#> 1. beta_estimate NA,
#> 2. beta_se NA,
#> 3. adjusted_df NA.
#> # A tibble: 10 × 6
#> id_col beta_estimate beta_SE adjusted_df Zr VZr
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Ayr-1-2-2 NA NA 484. NA NA
#> 2 Ayr-1-2-3 NA NA 667. NA NA
#> 3 Ayr-1-2-1 NA NA 590. NA NA
#> 4 Bega-2-2-1 -4.05 2.11 389. -0.0972 0.00257
#> 5 Bega-2-2-2 -2.55 1.91 384. -0.0681 0.00260
#> 6 Bega-1-1-1 -9.2 2.45 388. -0.189 0.00257
#> 7 Bega-1-1-2 1.26 2.21 382. 0.0292 0.00262
#> 8 Bell-4-4-1 NA 0.00622 NA NA NA
#> 9 Bell-3-3-1 NA 0.00400 NA NA NA
#> 10 Bell-1-1-1 NA NA NA NA NA
Note that if any of beta_estimate, beta_SE or adjusted_df are missing, standardise_response() is unable to compute standardised correlation coefficients \(Z_r\) and the associated variance \(\text{VZ}_r\).
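To make the dependence on these three inputs concrete, here is a minimal sketch of one common route from a regression coefficient to \(Z_r\): convert the estimate to a t-statistic, the t-statistic to a correlation using the degrees of freedom, and the correlation to Fisher's Z. The function zr_sketch() is hypothetical and is not the package's implementation; the variance \(\text{VZ}_r\) is left out because the convention the package uses is not stated here.

```r
# Hypothetical helper: coefficient -> t -> r -> Fisher's z.
# This is an assumed conversion, not a copy of the package code.
zr_sketch <- function(beta_estimate, beta_SE, adjusted_df) {
  # Any missing input leaves the effect size undefined
  if (anyNA(c(beta_estimate, beta_SE, adjusted_df))) {
    return(NA_real_)
  }
  t_stat <- beta_estimate / beta_SE            # t-statistic
  r <- t_stat / sqrt(t_stat^2 + adjusted_df)   # t converted to a correlation
  atanh(r)                                     # Fisher's z-transformation
}

# Rounded inputs taken from the Bega-2-2-1 row printed above
zr_bega <- zr_sketch(-4.05, 2.11, 389)
zr_bega   # approximately -0.097

# A missing standard error propagates to a missing effect size
zr_missing <- zr_sketch(-4.05, NA, 389)
```

Because the inputs are the rounded values shown in the tibble, the result only agrees with the printed Zr to about three significant figures.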
Below we standardise a data frame of out-of-sample point-estimate predictions, which are stored in a list-column of data frames called augmented_data. Notice the additional console messages about back-transformations, as well as an additional step, "Transforming out of sample predictions from link to response scale". This is because standardise_response() implements a different workflow depending on which estimate_type is being standardised.
# ----- Create example blue tit dataset ----
data("ManyEcoEvo_yi")
blue_tit_predictions <-
  ManyEcoEvo_yi %>%
  dplyr::filter(dataset == "blue tit") %>%
  pluck("data", 1) %>%
  head()
# ----- back-transform analyst estimates to original response scale ----
blue_tit_back_transformed <-
  blue_tit_predictions %>%
  back_transform_response_vars_yi(estimate_type = "yi",
                                  dataset = "blue tit") %>%
  ungroup() %>%
  select(
    id_col,
    response_variable_name,
    contains("transformation"),
    augmented_data,
    back_transformed_data
  ) # TODO transformation column seems wrong, but output from convert_predictions() suggests the correct transformation occurred
#> ✔ Applied back-transformation for squared effect sizes or out-of-sample predictions.
#> ✔ Applied back-transformation for squared effect sizes or out-of-sample predictions.
#> ✔ Applied back-transformation for squared effect sizes or out-of-sample predictions.
#> ✔ Applied back-transformation for squared effect sizes or out-of-sample predictions.
#> ✔ Applied back-transformation for squared effect sizes or out-of-sample predictions.
#> ✔ Applied back-transformation for squared effect sizes or out-of-sample predictions.
#> ℹ No back-transformation required, identity link used.
#> ℹ No back-transformation required, identity link used.
#> ℹ No back-transformation required, identity link used.
#> ℹ No back-transformation required, identity link used.
#> ℹ No back-transformation required, identity link used.
#> ℹ No back-transformation required, identity link used.
#> ℹ No back-transformation required, identity link used.
#> ℹ No back-transformation required, identity link used.
#> ℹ No back-transformation required, identity link used.
#> ℹ No back-transformation required, identity link used.
#> ℹ No back-transformation required, identity link used.
#> ℹ No back-transformation required, identity link used.
blue_tit_back_transformed
#> # A tibble: 6 × 8
#> id_col response_variable_name response_transformat…¹ response_transformat…²
#> <chr> <chr> <chr> <chr>
#> 1 Bega-1-1… day_14_weight power2 square
#> 2 Bega-2-2… day_14_tarsus_length power2 square
#> 3 Bell-2-2… day_14_tarsus_length NA NA
#> 4 Berr-1-1… day_14_weight z.score identity
#> 5 Burr-1-1… day_14_tarsus_length NA NA
#> 6 Burr-2-2… day_14_weight NA NA
#> # ℹ abbreviated names: ¹response_transformation_description,
#> # ²response_transformation_status
#> # ℹ 4 more variables: transformation <chr>, transformation_type <chr>,
#> # augmented_data <named list>, back_transformed_data <named list>
# ----- standardize to Z scale ------
blue_tit_standardised <-
  blue_tit_back_transformed %>%
  standardise_response(
    estimate_type = "yi",
    param_table = ManyEcoEvo:::analysis_data_param_tables,
    dataset = "blue tit"
  ) %>%
  ungroup() %>%
  select(
    id_col,
    params,
    transformation,
    augmented_data,
    back_transformed_data
  )
#>
#> ── Computing meta-analysis inputsfor `estimate_type` = "yi" ────────────────────
#>
#> ── Standardising out-of-sample predictions ──
#>
blue_tit_standardised
#> # A tibble: 6 × 5
#> id_col params transformation augmented_data back_transformed_data
#> <chr> <list> <chr> <named list> <named list>
#> 1 Bega-1-1-1 <tibble> identity <gropd_df [3 × 5]> <tibble [3 × 3]>
#> 2 Bega-2-2-1 <tibble> identity <gropd_df [3 × 5]> <tibble [3 × 3]>
#> 3 Bell-2-2-1 <tibble> identity <gropd_df [3 × 5]> <tibble [3 × 3]>
#> 4 Berr-1-1-1 <tibble> identity <gropd_df [3 × 5]> <tibble [3 × 3]>
#> 5 Burr-1-1-1 <tibble> identity <gropd_df [3 × 5]> <tibble [3 × 3]>
#> 6 Burr-2-2-1 <tibble> identity <gropd_df [3 × 5]> <tibble [3 × 3]>
# ----- parameters ----
blue_tit_standardised %>% pluck("params", 1)
#> # A tibble: 2 × 4
#> variable parameter value dataset
#> <chr> <chr> <dbl> <chr>
#> 1 day_14_weight mean 10.3 blue tit
#> 2 day_14_weight sd 1.19 blue tit
blue_tit_standardised %>% pluck("params", 2) # gets a different set depending on the variable
#> # A tibble: 2 × 4
#> variable parameter value dataset
#> <chr> <chr> <dbl> <chr>
#> 1 day_14_tarsus_length mean 16.7 blue tit
#> 2 day_14_tarsus_length sd 0.684 blue tit
# ---- raw predictions data ----
blue_tit_back_transformed %>% pluck("augmented_data", 1)
#> # A tibble: 3 × 5
#> # Groups: scenario [3]
#> scenario estimate se.fit ci.low ci.hi
#> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 1 87.6 6.20 74.9 99.1
#> 2 2 115. 6.31 102. 126.
#> 3 3 124. 6.04 112. 135.
blue_tit_back_transformed %>% pluck("back_transformed_data", 1)
#> # A tibble: 3 × 5
#> scenario estimate se.fit ci.low ci.hi
#> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 1 9.35 0.00329 8.70 9.97
#> 2 2 10.7 0.00294 10.1 11.3
#> 3 3 11.1 0.00269 10.6 11.6
# ---- back-transformed & standardised predictions_data ----
blue_tit_standardised %>% pluck("back_transformed_data", 1)
#> # A tibble: 3 × 3
#> scenario Z VZ
#> <int> <dbl> <dbl>
#> 1 1 -0.778 0.00277
#> 2 2 0.360 0.00248
#> 3 3 0.708 0.00227
MA_data_yi <- blue_tit_standardised %>%
  select(id_col, back_transformed_data) %>%
  unnest(back_transformed_data) %>%
  pointblank::col_vals_between(columns = "Z", left = -3, right = 3, inclusive = TRUE)
0.2.1 Standardising effect-sizes to \(Z_r\)
Effect sizes are standardised to Fisher's Z (\(Z_r\)); however, other transformations could be applied using other packages if need be (Gurrindgi green meta-analysis handbook).
Coefficients
est_to_Zr()
0.2.2 Standardising out-of-sample predictions to \(Z_{y_i}\)
Before standardising out-of-sample predictions, we need to ensure that all estimates are on the same scale. For instance, some analysts may report estimates on the link scale, while others may report them on the response scale. ManyEcoEvo:: provides a suite of functions for back-transforming estimates prior to standardisation.
0.2.2.1 Cleaning response-transformation values and assigning a back-transformation
Analysts may report estimates on various scales: they may report values on the link or response scale, or they may have transformed the response variable prior to model fitting and reported effect sizes on the transformed scale rather than on the scale of the original variable.
In order to proceed with standardisation of effect sizes or out-of-sample estimates, we back-transform analysts' reported estimates to the original response scale in the datasets euc_data and blue_tit_data, rather than leaving them on the link or transformed scale.
1. assign_transformation_type() takes information about the response_transformation and the link_fun for a given analysis, and assigns the analysis to an appropriate back-transformation rule to be applied: one of "identity", the value of the link-function or response-transformation, "double.transformation", or NA if an appropriate transformation type cannot be assigned.
2. Next, the type of response transformation is cleaned using clean_response_transformation(), which maps any value returned by assign_transformation_type() in step 1 that is not in c("identity", "double.transformation", NA) to a value in a lookup-tibble that assigns the appropriate transformation to apply. Users can supply their own lookup table, or else use or modify the version supplied in ManyEcoEvo:::transformation_tbl.
3. The estimates are now ready for back-transformation (section 0.2.2.2) and/or standardisation (section 0.2.1).
#TODO demonstrate assign transformation and clean response transformation
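As a rough illustration of the assignment rule in step 1, the sketch below encodes one plausible reading of it. The function assign_rule_sketch() is hypothetical and is not assign_transformation_type(). It assumes that a missing or "identity" value means no transformation was applied, and it leaves out the NA outcome because the conditions under which the package returns NA are not described here.

```r
# Hypothetical decision rule; the handling of missing values is an assumption.
assign_rule_sketch <- function(response_transformation, link_fun) {
  # Treat NA or "identity" as "nothing to undo" (assumption)
  no_resp <- is.na(response_transformation) || response_transformation == "identity"
  no_link <- is.na(link_fun) || link_fun == "identity"

  if (no_resp && no_link) return("identity")       # nothing to back-transform
  if (no_resp) return(link_fun)                    # only the link needs undoing
  if (no_link) return(response_transformation)     # only the response transformation
  "double.transformation"                          # both were applied
}

assign_rule_sketch(NA, "log")            # "log"
assign_rule_sketch("square", "identity") # "square"
assign_rule_sketch("log", "logit")       # "double.transformation"
assign_rule_sketch(NA, NA)               # "identity"
```

Values other than "identity" and "double.transformation" returned by a rule like this are the ones the cleaning step would then map onto a lookup table of back-transformations.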
0.2.2.2 Back-transforming analysts’ reported out-of-sample predictions
Function Name | Description |
---|---|
log_back() | Back-transform beta estimates for models with log-link |
logit_back() | Back-transform beta estimates for models with logit-link |
probit_back() | Back-transform beta estimates for models with probit-link |
inverse_back() | Back-transform beta estimates for models with \(1/x\) link |
square_back() | Back-transform beta estimates for models with \(x^2\)-link |
cube_back() | Back-transform beta estimates for models with \(x^3\)-link |
identity_back() | Back-transform beta estimates for models with identity-link |
power_back() | Back-transform beta estimates for models with power-link |
divide_back() | Back-transform beta estimates or out-of-sample predictions from models whose response variable has been divided by some number, n |
square_root_back() | Back-transform beta estimates or out-of-sample predictions from models whose response variable has been transformed by the square root |

We provide the conversion() function, which applies the relevant back() function depending on the required transformation assigned to that analysis:
#TODO demonstrate conversion() with back functions
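The idea of dispatching on the assigned transformation can be sketched with a named list of inverse functions. This is not conversion() or any of the back() functions, whose arguments are not shown here and which also handle standard errors; back_transform_sketch(), inverse_rules and the rule names are hypothetical, and the sketch operates on point values only.

```r
# Hypothetical table of inverse functions, keyed by transformation name.
inverse_rules <- list(
  identity    = function(x) x,
  log         = exp,                                   # undo log
  logit       = plogis,                                # undo logit
  probit      = pnorm,                                 # undo probit
  inverse     = function(x) 1 / x,                     # undo 1/x
  square      = sqrt,                                  # undo x^2
  cube        = function(x) sign(x) * abs(x)^(1 / 3),  # undo x^3, sign-safe
  square_root = function(x) x^2                        # undo sqrt(x)
)

back_transform_sketch <- function(x, transformation) {
  rule <- inverse_rules[[transformation]]
  if (is.null(rule)) {
    stop("No back-transformation rule for: ", transformation)
  }
  rule(x)
}

# Squared-scale predictions, using the rounded estimates printed earlier
back_sq <- back_transform_sketch(c(87.6, 115, 124), "square")
back_sq
```

Keeping the rules in a list makes it easy to add or override a transformation without touching the dispatch function, which mirrors the role of a user-supplied lookup table.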
0.2.2.3 Standardising out-of-sample predictions
- pred_to_Z() (data frame level), Z_VZ_preds()
#TODO demonstrate application of V_Zr_preds() and or pred_to_z()
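As a minimal sketch of the standardisation step, the code below centres and scales predictions using the mean and standard deviation of the response variable, which is the usual Z-score. It does not call pred_to_Z() or Z_VZ_preds(); the function z_sketch() and the column name se_Z are hypothetical, and dividing the standard error by the standard deviation is an assumption about how uncertainty is carried onto the Z scale.

```r
# Rounded back-transformed predictions and parameters printed earlier
# for day_14_weight in the blue tit dataset.
preds <- data.frame(
  scenario = 1:3,
  estimate = c(9.35, 10.7, 11.1),
  se.fit   = c(0.00329, 0.00294, 0.00269)
)
mu    <- 10.3   # population mean of the response variable
sigma <- 1.19   # population standard deviation of the response variable

z_sketch <- function(preds, mu, sigma) {
  data.frame(
    scenario = preds$scenario,
    Z    = (preds$estimate - mu) / sigma,  # centre and scale the prediction
    se_Z = preds$se.fit / sigma            # rescale the standard error (assumption)
  )
}

z_preds <- z_sketch(preds, mu, sigma)
z_preds
```

Because the inputs are rounded, the Z values will differ slightly from those printed for the standardised data above.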
0.2.3 Calculating Sorensen similarity index
- apply_sorensen_calc()
- calculate_sorensen_diversity_index() (also needs to be renamed)
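For orientation, the Sørensen similarity between two sets \(A\) and \(B\) is \(2|A \cap B| / (|A| + |B|)\). The sketch below computes it in base R for two sets of variable names. It does not call the functions listed above, whose inputs are not shown here; sorensen_sketch() and the variable names are hypothetical, and it assumes the index compares the sets of variables used by two analyses.

```r
# Hypothetical helper: Sorensen similarity between two sets.
sorensen_sketch <- function(a, b) {
  a <- unique(a)
  b <- unique(b)
  # Twice the shared elements over the summed set sizes
  2 * length(intersect(a, b)) / (length(a) + length(b))
}

# Illustrative variable sets for two analyses
analysis_1 <- c("x1", "x2", "x3")
analysis_2 <- c("x2", "x3", "x4", "x5")

similarity <- sorensen_sketch(analysis_1, analysis_2)
similarity   # 4/7, about 0.571
```

The index is 1 for identical sets and 0 for sets with nothing in common; two empty sets give NaN in this sketch and would need separate handling.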