Data Cleaning & Preparation for Analysis • ManyEcoEvo

library(ManyEcoEvo)
#> Loading required package: rmarkdown
#> Loading required package: bookdown
#> Registered S3 method overwritten by 'parsnip':
#>   method          from 
#>   print.nullmodel vegan
#> Registered S3 method overwritten by 'lava':
#>   method         from    
#>   print.estimate EnvStats
suppressPackageStartupMessages(library(tidyverse))

0.1 Data Cleaning

0.1.1 Anonymising Data

We have anonymised our public dataset, data(ManyEcoEvo), using anonymise_teams(), which takes a look-up table of old and new identifier names and uses it to replace each analysis identifier. The look-up table and the original non-anonymised data can be stored in a private repository or component (on the OSF, for example), while the anonymised dataset can be released publicly.
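The look-up idea can be sketched in a few lines of base R. This is only an illustration of the approach, not the implementation of anonymise_teams(): the function name anonymise_sketch(), the column names, and the identifiers below are all hypothetical.

```r
# Hypothetical sketch: swap original identifiers for anonymised ones
# using a look-up table. Not the ManyEcoEvo implementation.
anonymise_sketch <- function(df, lookup) {
  # match() gives the row of `lookup` that holds each original identifier
  idx <- match(df$team_id, lookup$old_id)
  if (anyNA(idx)) {
    stop("Some identifiers are missing from the look-up table.")
  }
  df$team_id <- lookup$new_id[idx]
  df
}

# Made-up data, for illustration only
analyses <- data.frame(
  team_id = c("team_a", "team_b", "team_a"),
  estimate = c(0.12, -0.30, 0.05),
  stringsAsFactors = FALSE
)
lookup <- data.frame(
  old_id = c("team_a", "team_b"),
  new_id = c("Anon1", "Anon2"),
  stringsAsFactors = FALSE
)

anonymised <- anonymise_sketch(analyses, lookup)
```

Failing loudly when an identifier has no entry in the look-up table is a design choice of this sketch; it avoids releasing a partially anonymised dataset by accident.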

0.2 Data Pre-processing for Meta-analysis

The meta-analysis requires that all estimates are on the same scale, because it assumes that the outcome measures are comparable. Note that the ManyAnalysts project uses two different outcomes for meta-analysis: standardised effect sizes, \(Z_r\), and out-of-sample predictions, \(y_i\). Alternative effect-size measures may be used instead¹.

We provide the function standardise_response() to standardise a data-frame of analyst-data.


data("ManyEcoEvo")

blue_tit_effect_sizes <- 
  ManyEcoEvo %>% 
  dplyr::filter(dataset == "blue tit") %>% 
  pluck("data", 1) %>% 
  slice(1:10) %>% 
  select(contains("id"), 
         -response_id_S2,
         contains("beta"), 
         adjusted_df)

blue_tit_effect_sizes
#> # A tibble: 10 × 9
#>    response_id       submission_id analysis_id split_id TeamIdentifier id_col   
#>    <chr>                     <dbl>       <dbl>    <dbl> <chr>          <chr>    
#>  1 R_11787O3NmejXKAH             1           2        2 Ayr            Ayr-1-2-2
#>  2 R_11787O3NmejXKAH             1           2        3 Ayr            Ayr-1-2-3
#>  3 R_11787O3NmejXKAH             1           2        1 Ayr            Ayr-1-2-1
#>  4 R_126erjKKuN3IwSJ             2           2        1 Bega           Bega-2-2…
#>  5 R_126erjKKuN3IwSJ             2           2        2 Bega           Bega-2-2…
#>  6 R_126erjKKuN3IwSJ             1           1        1 Bega           Bega-1-1…
#>  7 R_126erjKKuN3IwSJ             1           1        2 Bega           Bega-1-1…
#>  8 R_12cozGev3IOOBG2             4           4        1 Bell           Bell-4-4…
#>  9 R_12cozGev3IOOBG2             3           3        1 Bell           Bell-3-3…
#> 10 R_12cozGev3IOOBG2             1           1        1 Bell           Bell-1-1…
#> # ℹ 3 more variables: beta_estimate <dbl>, beta_SE <dbl>, adjusted_df <dbl>

standardise_response(dat = blue_tit_effect_sizes, 
                     estimate_type = "Zr",
                     param_table = NULL, 
                     dataset = "blue tit") %>% 
  select(id_col, contains("beta"), adjusted_df, Zr, VZr )
#> 
#> ── Computing meta-analysis inputsfor `estimate_type` = "Zr" ────────────────────
#> 
#> ── Computing standardised effect sizes `Zr` and variance `VZr` ──
#> 
#> ✖ Required values for computing standardised effect sizes missing:
#> ! Returning "NA" for tupple:
#> 1. beta_estimate NA,
#> 2. beta_se NA,
#> 3. adjusted_df 484.0193.
#> ✖ Required values for computing standardised effect sizes missing:
#> ! Returning "NA" for tupple:
#> 1. beta_estimate NA,
#> 2. beta_se NA,
#> 3. adjusted_df 666.56874.
#> ✖ Required values for computing standardised effect sizes missing:
#> ! Returning "NA" for tupple:
#> 1. beta_estimate NA,
#> 2. beta_se NA,
#> 3. adjusted_df 590.18263.
#> ✖ Required values for computing standardised effect sizes missing:
#> ! Returning "NA" for tupple:
#> 1. beta_estimate NA,
#> 2. beta_se 0.006225,
#> 3. adjusted_df NA.
#> ✖ Required values for computing standardised effect sizes missing:
#> ! Returning "NA" for tupple:
#> 1. beta_estimate NA,
#> 2. beta_se 0.003996,
#> 3. adjusted_df NA.
#> ✖ Required values for computing standardised effect sizes missing:
#> ! Returning "NA" for tupple:
#> 1. beta_estimate NA,
#> 2. beta_se NA,
#> 3. adjusted_df NA.
#> # A tibble: 10 × 6
#>    id_col     beta_estimate  beta_SE adjusted_df      Zr      VZr
#>    <chr>              <dbl>    <dbl>       <dbl>   <dbl>    <dbl>
#>  1 Ayr-1-2-2          NA    NA              484. NA      NA      
#>  2 Ayr-1-2-3          NA    NA              667. NA      NA      
#>  3 Ayr-1-2-1          NA    NA              590. NA      NA      
#>  4 Bega-2-2-1         -4.05  2.11           389. -0.0972  0.00257
#>  5 Bega-2-2-2         -2.55  1.91           384. -0.0681  0.00260
#>  6 Bega-1-1-1         -9.2   2.45           388. -0.189   0.00257
#>  7 Bega-1-1-2          1.26  2.21           382.  0.0292  0.00262
#>  8 Bell-4-4-1         NA     0.00622         NA  NA      NA      
#>  9 Bell-3-3-1         NA     0.00400         NA  NA      NA      
#> 10 Bell-1-1-1         NA    NA               NA  NA      NA

Note that if any of beta_estimate, beta_SE or adjusted_df are missing, standardise_response() is unable to compute standardised correlation coefficients \(Z_r\) and the associated variance \(\text{VZ}_r\).
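The propagation of missing values can be seen in a minimal sketch of the usual route from a regression coefficient to Fisher's Z: the coefficient is converted to a t-statistic, then to a correlation, then transformed with atanh(). The helper zr_sketch() is hypothetical and is not the package's est_to_Zr(). The variance is taken here as 1 / adjusted_df, an assumption chosen only because it roughly reproduces the printed VZr values; the package may use a different formula.

```r
# Hypothetical helper, not the package's est_to_Zr()
zr_sketch <- function(beta_estimate, beta_SE, adjusted_df) {
  t_stat <- beta_estimate / beta_SE           # t-statistic of the coefficient
  r <- t_stat / sqrt(t_stat^2 + adjusted_df)  # correlation from t and df
  Zr <- atanh(r)                              # Fisher's Z: 0.5 * log((1 + r) / (1 - r))
  VZr <- 1 / adjusted_df                      # ASSUMPTION: illustrative variance only
  data.frame(Zr = Zr, VZr = VZr)
}

# Rounded values from row 4 of the printed output, plus an incomplete row
out <- zr_sketch(
  beta_estimate = c(-4.05, NA),
  beta_SE       = c(2.11, 0.00622),
  adjusted_df   = c(389, NA)
)
out
```

Because every step is plain arithmetic, an NA in any input yields NA for that row, which mirrors the behaviour described above.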

Below we standardise a data frame containing out-of-sample point-estimate predictions, which are stored in a list-column of data frames called augmented_data. Notice some additional console messages about back-transformations, as well as an additional step, "Transforming out of sample predictions from link to response scale". That's because standardise_response() implements a different workflow depending on which estimate_type is being standardised.

# ----- Create example blue tit dataset ----

data("ManyEcoEvo_yi")
blue_tit_predictions <- 
  ManyEcoEvo_yi %>% 
  dplyr::filter(dataset == "blue tit") %>% 
  pluck("data", 1) %>% 
  head()

# ----- back-transform analyst estimates to original response scale ----
blue_tit_back_transformed <- 
  blue_tit_predictions %>% 
  back_transform_response_vars_yi(estimate_type = "yi",
                                  dataset = "blue tit") %>% 
  ungroup %>% 
  select(
    id_col,
    response_variable_name,
    contains("transformation"),
    augmented_data, 
    back_transformed_data
  ) #TODO transformation column seems wrong! But output from convert_predictions() suggests the correct transformation occurred!
#> ✔ Applied back-transformation for squared effect sizes or out-of-sample predictions.
#> ✔ Applied back-transformation for squared effect sizes or out-of-sample predictions.
#> ✔ Applied back-transformation for squared effect sizes or out-of-sample predictions.
#> ✔ Applied back-transformation for squared effect sizes or out-of-sample predictions.
#> ✔ Applied back-transformation for squared effect sizes or out-of-sample predictions.
#> ✔ Applied back-transformation for squared effect sizes or out-of-sample predictions.
#> ℹ No back-transformation required, identity link used.
#> ℹ No back-transformation required, identity link used.
#> ℹ No back-transformation required, identity link used.
#> ℹ No back-transformation required, identity link used.
#> ℹ No back-transformation required, identity link used.
#> ℹ No back-transformation required, identity link used.
#> ℹ No back-transformation required, identity link used.
#> ℹ No back-transformation required, identity link used.
#> ℹ No back-transformation required, identity link used.
#> ℹ No back-transformation required, identity link used.
#> ℹ No back-transformation required, identity link used.
#> ℹ No back-transformation required, identity link used.

blue_tit_back_transformed
#> # A tibble: 6 × 8
#>   id_col    response_variable_name response_transformat…¹ response_transformat…²
#>   <chr>     <chr>                  <chr>                  <chr>                 
#> 1 Bega-1-1… day_14_weight          power2                 square                
#> 2 Bega-2-2… day_14_tarsus_length   power2                 square                
#> 3 Bell-2-2… day_14_tarsus_length   NA                     NA                    
#> 4 Berr-1-1… day_14_weight          z.score                identity              
#> 5 Burr-1-1… day_14_tarsus_length   NA                     NA                    
#> 6 Burr-2-2… day_14_weight          NA                     NA                    
#> # ℹ abbreviated names: ¹response_transformation_description,
#> #   ²response_transformation_status
#> # ℹ 4 more variables: transformation <chr>, transformation_type <chr>,
#> #   augmented_data <named list>, back_transformed_data <named list>

# ----- standardize to Z scale ------

blue_tit_standardised <- 
  blue_tit_back_transformed %>% 
  standardise_response(
    estimate_type = "yi" ,
    param_table = ManyEcoEvo:::analysis_data_param_tables, 
    dataset = "blue tit"
  ) %>% 
  ungroup %>% 
  select(
    id_col,
    params, 
    transformation,
    augmented_data, 
    back_transformed_data
  )
#> 
#> ── Computing meta-analysis inputsfor `estimate_type` = "yi" ────────────────────
#> 
#> ── Standardising out-of-sample predictions ──
#> 

blue_tit_standardised
#> # A tibble: 6 × 5
#>   id_col     params   transformation augmented_data     back_transformed_data
#>   <chr>      <list>   <chr>          <named list>       <named list>         
#> 1 Bega-1-1-1 <tibble> identity       <gropd_df [3 × 5]> <tibble [3 × 3]>     
#> 2 Bega-2-2-1 <tibble> identity       <gropd_df [3 × 5]> <tibble [3 × 3]>     
#> 3 Bell-2-2-1 <tibble> identity       <gropd_df [3 × 5]> <tibble [3 × 3]>     
#> 4 Berr-1-1-1 <tibble> identity       <gropd_df [3 × 5]> <tibble [3 × 3]>     
#> 5 Burr-1-1-1 <tibble> identity       <gropd_df [3 × 5]> <tibble [3 × 3]>     
#> 6 Burr-2-2-1 <tibble> identity       <gropd_df [3 × 5]> <tibble [3 × 3]>

# ----- parameters ---- 
blue_tit_standardised %>% pluck("params", 1) 
#> # A tibble: 2 × 4
#>   variable      parameter value dataset 
#>   <chr>         <chr>     <dbl> <chr>   
#> 1 day_14_weight mean      10.3  blue tit
#> 2 day_14_weight sd         1.19 blue tit
blue_tit_standardised %>% pluck("params", 2) # gets a different set depending on the variable
#> # A tibble: 2 × 4
#>   variable             parameter  value dataset 
#>   <chr>                <chr>      <dbl> <chr>   
#> 1 day_14_tarsus_length mean      16.7   blue tit
#> 2 day_14_tarsus_length sd         0.684 blue tit

# ---- raw predictions data ----
blue_tit_back_transformed %>% pluck("augmented_data", 1)
#> # A tibble: 3 × 5
#> # Groups:   scenario [3]
#>   scenario estimate se.fit ci.low ci.hi
#>      <int>    <dbl>  <dbl>  <dbl> <dbl>
#> 1        1     87.6   6.20   74.9  99.1
#> 2        2    115.    6.31  102.  126. 
#> 3        3    124.    6.04  112.  135.
blue_tit_back_transformed %>% pluck("back_transformed_data", 1)
#> # A tibble: 3 × 5
#>   scenario estimate  se.fit ci.low ci.hi
#>      <int>    <dbl>   <dbl>  <dbl> <dbl>
#> 1        1     9.35 0.00329   8.70  9.97
#> 2        2    10.7  0.00294  10.1  11.3 
#> 3        3    11.1  0.00269  10.6  11.6

# ---- back-transformed & standardised predictions_data ----
blue_tit_standardised %>% pluck("back_transformed_data", 1) 
#> # A tibble: 3 × 3
#>   scenario      Z      VZ
#>      <int>  <dbl>   <dbl>
#> 1        1 -0.778 0.00277
#> 2        2  0.360 0.00248
#> 3        3  0.708 0.00227

MA_data_yi <- blue_tit_standardised %>% 
  select(id_col, back_transformed_data) %>% 
  unnest(back_transformed_data) %>% 
  pointblank::col_vals_between(columns = "Z", left = -3, right = 3, inclusive = TRUE)

0.2.1 Standardising effect-sizes to \(Z_r\)

Effect sizes are standardised to Fisher's Z (\(Z_r\)); however, other transformations could be applied using other packages if need be (Gurrindgi green meta-analysis handbook).
Coefficients are converted with est_to_Zr().

0.2.2 Standardising out-of-sample predictions to \(Z_{y_i}\)

Before standardising out-of-sample predictions, we need to ensure that all estimates are on the same scale. For instance, some analysts may report estimates on the link scale, while others may report estimates on the response scale. ManyEcoEvo:: provides a suite of functions for back-transforming estimates prior to standardisation.

0.2.2.1 Cleaning response-transformation values and assigning a back-transformation

Analysts may report estimates on various scales. For example, they may report values on the link or response scale, or they may have transformed the response variable prior to model fitting and reported effect sizes on the transformed scale rather than on the scale of the original variable.

In order to proceed with standardisation of effect sizes or out-of-sample estimates, we back-transform analysts' reported estimates from the link or transformed scale to the original response scale in the datasets euc_data and blue_tit_data.

1. assign_transformation_type() takes information about the response_transformation and the link_fun for a given analysis and assigns the analysis an appropriate back-transformation rule: one of "identity", the value of the link function or response transformation, "double.transformation", or NA if an appropriate transformation type cannot be assigned.
2. Next, the type of response transformation is cleaned using clean_response_transformation(), which maps any value returned by assign_transformation_type() in step 1 that is not in c("identity", "double.transformation", NA) to a value in a lookup tibble that specifies the appropriate transformation to apply. Users can supply their own lookup table, or else use or modify the version supplied in ManyEcoEvo:::transformation_tbl.
3. The estimates are now ready for back-transformation (section 0.2.2.2) and/or standardisation (section 0.2.1).
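The assignment and cleaning steps can be sketched as follows. This is a hedged illustration of the kind of rule described, not the code of assign_transformation_type() or clean_response_transformation(). The function names, the exact decision rule (in particular, returning NA whenever either input is missing), and the contents of lookup_sketch are assumptions.

```r
# Hypothetical decision rule; not assign_transformation_type()
assign_type_sketch <- function(response_transformation, link_fun) {
  # ASSUMPTION: no type can be assigned if either input is missing
  if (is.na(response_transformation) || is.na(link_fun)) {
    return(NA_character_)
  }
  response_is_identity <- response_transformation == "identity"
  link_is_identity <- link_fun == "identity"
  if (response_is_identity && link_is_identity) {
    "identity"
  } else if (response_is_identity) {
    link_fun                   # only the link needs undoing
  } else if (link_is_identity) {
    response_transformation    # only the response transformation needs undoing
  } else {
    "double.transformation"    # both need undoing
  }
}

# Hypothetical lookup mapping raw values to a cleaned transformation label
lookup_sketch <- c(power2 = "square", log = "log", logit = "logit")

# Hypothetical cleaning step; not clean_response_transformation()
clean_type_sketch <- function(type, lookup = lookup_sketch) {
  keep <- c("identity", "double.transformation")
  if (is.na(type) || type %in% keep) {
    return(type)
  }
  unname(lookup[type])  # NA if the value is not in the lookup
}

raw_type <- assign_type_sketch("power2", "identity")
cleaned_type <- clean_type_sketch(raw_type)
```

Keeping the cleaning rules in a lookup rather than in code means new free-text transformation descriptions can be handled by adding a row, which is the same motivation as letting users supply their own table.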

#TODO demonstrate assign transformation and clean response transformation

0.2.2.2 Back-transforming analysts’ reported out-of-sample predictions

Function Name	Description
`log_back()`	Back-transform beta estimates for models with log-link
`logit_back()`	Back-transform beta estimates for models with logit-link
`probit_back()`	Back-transform beta estimates for models with probit-link
`inverse_back()`	Back-transform beta estimates for models with \(1/x\) link
`square_back()`	Back-transform beta estimates for models with \(x^2\)-link
`cube_back()`	Back-transform beta estimates for models with \(x^3\)-link
`identity_back()`	Back-transform beta estimates for models with identity-link
`power_back()`	Back-transform beta estimates for models with power-link
`divide_back()`	Back-transform beta estimates or out-of-sample predictions from models whose response variable has been divided by some number, `n`
`square_root_back()`	Back-transform beta estimates or out-of-sample predictions from models whose response variable has been transformed by the square root

We provide the conversion() function, which applies the relevant *_back() function depending on the transformation assigned to that analysis:
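A dispatch of this kind can be sketched with switch() and base R inverse functions. The helper back_transform_sketch() and its transformation labels are hypothetical, and it is not the package's conversion() function. It inverts point values only and does not propagate standard errors, which the package functions may handle differently.

```r
# Hypothetical dispatcher; not the package's conversion() function.
# Only point values are inverted; standard errors are not propagated.
back_transform_sketch <- function(x, transformation) {
  switch(
    transformation,
    identity    = x,
    log         = exp(x),                    # inverse of log
    logit       = plogis(x),                 # inverse of logit
    probit      = pnorm(x),                  # inverse of probit
    inverse     = 1 / x,                     # inverse of 1/x
    square      = sqrt(x),                   # inverse of x^2 (non-negative root)
    cube        = sign(x) * abs(x)^(1 / 3),  # real cube root
    square_root = x^2,                       # inverse of sqrt(x)
    stop("Unknown transformation: ", transformation)
  )
}

# Squared-scale predictions, rounded from the printed augmented_data
squared_predictions <- c(87.6, 115, 124)
back_transformed <- back_transform_sketch(squared_predictions, "square")
back_transformed
```

Applied to these rounded values, the result is close to the estimate column of the back_transformed_data printed earlier. Stopping on an unknown label, rather than silently returning the input, keeps an unrecognised transformation from being mistaken for an identity.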

#TODO demonstrate conversion() with back functions

0.2.2.3 Standardising out-of-sample predictions

pred_to_Z() (data frame level), Z_VZ_preds()
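As a rough sketch of what standardising a prediction involves, the estimate is centred on the mean of the response variable and divided by its standard deviation, using parameters such as those in the params tibble shown earlier. The helper z_sketch() is hypothetical and is not pred_to_Z() or Z_VZ_preds(); in particular, it returns the standard error rescaled to the Z scale, and how the package defines VZ is not documented in this text.

```r
# Hypothetical helper; not the package's pred_to_Z() or Z_VZ_preds()
z_sketch <- function(estimate, se, mean_y, sd_y) {
  data.frame(
    Z    = (estimate - mean_y) / sd_y,  # centre and scale the prediction
    se_Z = se / sd_y                    # standard error on the Z scale
  )
}

# Rounded values from the printed output for day_14_weight
z_out <- z_sketch(
  estimate = c(9.35, 10.7, 11.1),
  se       = c(0.00329, 0.00294, 0.00269),
  mean_y   = 10.3,
  sd_y     = 1.19
)
z_out
```

Because the inputs are rounded from the printed tibbles, the results will differ slightly from the Z values shown earlier.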

#TODO demonstrate application of V_Zr_preds() and or pred_to_z()

0.2.3 Calculating Sorensen similarity index

apply_sorensen_calc()
calculate_sorensen_diversity_index() (also needs to be renamed)
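For orientation, the Sørensen similarity index between two sets A and B is \(2|A \cap B| / (|A| + |B|)\). The sketch below computes it in base R, assuming it is used to compare the sets of variables included in two analyses; the text does not state what is compared, so that is an assumption, and sorensen_sketch() and the variable names are hypothetical.

```r
# Hypothetical helper; not calculate_sorensen_diversity_index()
sorensen_sketch <- function(a, b) {
  a <- unique(a)
  b <- unique(b)
  # Twice the number of shared elements over the summed set sizes
  2 * length(intersect(a, b)) / (length(a) + length(b))
}

# Made-up variable sets for two analyses
vars_1 <- c("x1", "x2", "x3")
vars_2 <- c("x2", "x3", "x4", "x5")
similarity <- sorensen_sketch(vars_1, vars_2)  # 2 * 2 / (3 + 4)
similarity
```

The index is 1 for identical sets and 0 for sets with nothing in common.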

0.2.4 Box-cox transforming deviation from meta-analytic mean

0.2.5 Excluding Data

exclude_extreme_VZ() - exclude extreme values of VZ

References

Nakagawa, Shinichi, Yefeng Yang, Erin L. Macartney, Rebecca Spake, and Malgorzata Lagisz. 2023. “Quantitative Evidence Synthesis: A Practical Guide on Meta-Analysis, Meta-Regression, and Publication Bias Tests for Environmental Sciences.” Environmental Evidence 12 (1): 8. https://doi.org/10.1186/s13750-023-00301-6.