Sample modifiers in pepr: derive

Michal Stolarczyk

2023-11-21

Learn derived attributes in pepr

This vignette will show you how and why to use the derived attributes functionality of the pepr package.

Problem/Goal

The example below demonstrates how to use derived attributes to flexibly define sample attributes, here the file_path column of the sample_table.csv file, so that they match the file names in your project. Consider the following table for reference:

sample_name  protocol  organism  time  file_path
pig_0h       RRBS      pig       0     data/lab/project/pig_0h.fastq
pig_1h       RRBS      pig       1     data/lab/project/pig_1h.fastq
frog_0h      RRBS      frog      0     data/lab/project/frog_0h.fastq
frog_1h      RRBS      frog      1     data/lab/project/frog_1h.fastq

Solution

As the name suggests, the values of the specified attributes (here: file_path) can be derived from other attributes. How this is done is stated explicitly in the project_config.yaml file (presented below). The name of the column to derive is set in the sample_modifiers.derive.attributes key-value pair, whereas the patterns used to construct its values are set in sample_modifiers.derive.sources. Note that the second-level key (here: source1) has to exactly match the values in the file_path column of the modified sample_table.csv (presented below).

   pep_version: 2.0.0
   sample_table: sample_table.csv
   output_dir: $HOME/hello_looper_results
   sample_modifiers:
      derive:
          attributes: file_path
          sources:
              source1: $HOME/data/lab/project/{organism}_{time}h.fastq
              source2: /path/from/collaborator/weirdNamingScheme_{external_id}.fastq
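
Each name in curly braces inside a source pattern refers to a sample attribute. As a minimal sketch in base R, the snippet below lists the attributes each pattern refers to. The helper template_attributes is hypothetical and is not part of pepr; it only illustrates the placeholder syntax.

```r
# Patterns copied from the sources section of project_config.yaml
sources <- list(
  source1 = "$HOME/data/lab/project/{organism}_{time}h.fastq",
  source2 = "/path/from/collaborator/weirdNamingScheme_{external_id}.fastq"
)

# Hypothetical helper (not a pepr function): return the attribute names
# that appear between curly braces in a pattern
template_attributes <- function(template) {
  matches <- regmatches(
    template,
    gregexpr("\\{[^{}]+\\}", template, perl = TRUE)
  )[[1]]
  # Strip the surrounding braces, keeping only the attribute names
  gsub("[{}]", "", matches)
}

needed <- lapply(sources, template_attributes)
print(needed)
```

Here source1 refers to organism and time, which are both columns of the sample table, while source2 refers to external_id, which is not a column in this example.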

Let’s modify the original sample_table.csv file so that the values in the derived column (file_path) name the appropriate data sources defined in project_config.yaml:

sample_name  protocol  organism  time  file_path
pig_0h       RRBS      pig       0     source1
pig_1h       RRBS      pig       1     source1
frog_0h      RRBS      frog      0     source1
frog_1h      RRBS      frog      1     source1
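
To make the mechanism concrete, here is a rough base R sketch of the derivation: look up each sample's source key, then fill the placeholders in the matching pattern with that sample's own values. The function derive_attribute is hypothetical and is not how pepr implements this. The sketch also assumes that values which are not source keys are left unchanged, and it does not expand environment variables such as $HOME.

```r
# Sample table as in the modified CSV: file_path holds source keys
samples <- data.frame(
  sample_name = c("pig_0h", "pig_1h", "frog_0h", "frog_1h"),
  protocol = "RRBS",
  organism = c("pig", "pig", "frog", "frog"),
  time = c(0, 1, 0, 1),
  file_path = "source1",
  stringsAsFactors = FALSE
)

# Pattern copied from the sources section of the config
sources <- list(
  source1 = "$HOME/data/lab/project/{organism}_{time}h.fastq"
)

# Hypothetical helper (not pepr's implementation): derive one attribute
derive_attribute <- function(samples, attribute, sources) {
  for (i in seq_len(nrow(samples))) {
    key <- samples[[attribute]][i]
    # Assumption: values that are not source keys are left untouched
    if (!key %in% names(sources)) next
    path <- sources[[key]]
    # Replace each {column} placeholder with this sample's value
    for (col in names(samples)) {
      path <- gsub(
        paste0("{", col, "}"),
        as.character(samples[[col]][i]),
        path,
        fixed = TRUE
      )
    }
    samples[[attribute]][i] <- path
  }
  samples
}

derived <- derive_attribute(samples, "file_path", sources)
print(derived$file_path)
```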

Code

Load pepr and read in the project metadata by specifying the path to the project_config.yaml:

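library(pepr)
# Branch of the bundled example PEPs ("master" in the output below)
branch = "master"
# Locate the example project config shipped with the package
projectConfig = system.file(
  "extdata",
  paste0("example_peps-", branch),
  "example_derive",
  "project_config.yaml",
  package = "pepr"
)
# Create the Project object from the config file
p = Project(projectConfig)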
#> Loading config file: /tmp/RtmpoymTo9/Rinstb3055bff7/pepr/extdata/example_peps-master/example_derive/project_config.yaml

And inspect it:

sampleTable(p)
#>    sample_name protocol organism time
#> 1:      pig_0h     RRBS      pig    0
#> 2:      pig_1h     RRBS      pig    1
#> 3:     frog_0h     RRBS     frog    0
#> 4:     frog_1h     RRBS     frog    1
#>                                      file_path
#> 1:  /home/nsheff/data/lab/project/pig_0h.fastq
#> 2:  /home/nsheff/data/lab/project/pig_1h.fastq
#> 3: /home/nsheff/data/lab/project/frog_0h.fastq
#> 4: /home/nsheff/data/lab/project/frog_1h.fastq

As you can see, the resulting samples are annotated the same way as if they had been read from the original, unwieldy annotation file.

The p object also contains all the information from the project config file (project_config.yaml). Run the following line to explore it:

config(p)
#> Config object. Class: Config
#>  pep_version: 2.0.0
#>  sample_table: 
#> /tmp/RtmpoymTo9/Rinstb3055bff7/pepr/extdata/example_peps-master/example_derive/sample_table.csv
#>  output_dir: /home/nsheff/hello_looper_results
#>  sample_modifiers:
#>     derive:
#>         attributes: file_path
#>         sources:
#>             source1: /home/nsheff/data/lab/project/{organism}_{time}h.fastq
#>             source2: 
#> /path/from/collaborator/weirdNamingScheme_{external_id}.fastq
#>  name: example_derive