Learn derived attributes in `pepr`

This vignette will show you how and why to use the derived attributes functionality of the pepr package.

Before you start, see the basic information about the PEP concept on the project website.
For a broader theoretical description, see the derived attributes documentation section.

Problem/Goal

The example below demonstrates how to use derived attributes to flexibly define sample attributes, here the file_path column of the sample_table.csv file, so that they match the file names in your project. Consider the following table for reference:

sample_name	protocol	organism	time	file_path
pig_0h	RRBS	pig	0	data/lab/project/pig_0h.fastq
pig_1h	RRBS	pig	1	data/lab/project/pig_1h.fastq
frog_0h	RRBS	frog	0	data/lab/project/frog_0h.fastq
frog_1h	RRBS	frog	1	data/lab/project/frog_1h.fastq

Solution

As the name suggests, the values of the specified attribute (here: file_path) can be derived from other attributes. How this is done is stated explicitly in the project_config.yaml file (presented below). The name of the column to derive is given in the sample_modifiers.derive.attributes key, whereas the pattern used to construct its values is given under sample_modifiers.derive.sources. Note that the second-level key (here: source1) has to exactly match the values in the file_path column of the modified sample_table.csv (presented below). Placeholders in curly braces, such as {organism} and {time}, are filled in with each sample's own attribute values.

   pep_version: 2.0.0
   sample_table: sample_table.csv
   output_dir: $HOME/hello_looper_results
   sample_modifiers:
      derive:
          attributes: file_path
          sources:
              source1: $HOME/data/lab/project/{organism}_{time}h.fastq
              source2: /path/from/collaborator/weirdNamingScheme_{external_id}.fastq

Let’s introduce a few modifications to the original sample_table.csv file, so that the file_path column of each sample names the appropriate data source from project_config.yaml:

sample_name	protocol	organism	time	file_path
pig_0h	RRBS	pig	0	source1
pig_1h	RRBS	pig	1	source1
frog_0h	RRBS	frog	0	source1
frog_1h	RRBS	frog	1	source1
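
To make the mechanism concrete, the substitution can be sketched in a few lines of base R. This is only an illustration of the idea, not how pepr implements it: fill_template and derive_column are hypothetical helpers, the table is a hand-typed copy of the one above, and the template is written without $HOME so that environment variable expansion stays out of the picture.

```r
# Hypothetical helper (not part of pepr): replace every {attribute}
# placeholder in a template with the value from one sample row.
fill_template = function(template, row) {
  for (attr in names(row)) {
    placeholder = paste0("{", attr, "}")
    template = gsub(placeholder, as.character(row[[attr]]), template, fixed = TRUE)
  }
  template
}

# Hypothetical helper (not part of pepr): derive one column of a sample table.
# Values that do not name a source are left unchanged.
derive_column = function(samples, attribute, sources) {
  vapply(seq_len(nrow(samples)), function(i) {
    row = samples[i, ]
    template = sources[[row[[attribute]]]]
    if (is.null(template)) row[[attribute]] else fill_template(template, row)
  }, character(1))
}

# Hand-typed copy of the modified sample table shown above
samples = data.frame(
  sample_name = c("pig_0h", "pig_1h", "frog_0h", "frog_1h"),
  protocol = "RRBS",
  organism = c("pig", "pig", "frog", "frog"),
  time = c(0, 1, 0, 1),
  file_path = "source1",
  stringsAsFactors = FALSE
)

# Assumption: same pattern as source1 in the config, minus the $HOME prefix
sources = list(source1 = "data/lab/project/{organism}_{time}h.fastq")

samples$file_path = derive_column(samples, "file_path", sources)
print(samples$file_path)
```

With these inputs the derived column reproduces the file_path values of the original table, which is the behaviour the derive modifier is meant to provide.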

Code

Load pepr and read in the project metadata by specifying the path to the project_config.yaml:

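library(pepr)
# Branch of the bundled example PEPs; "master" is assumed from the path in the output below
branch = "master"
projectConfig = system.file(
  "extdata",
  paste0("example_peps-", branch),
  "example_derive",
  "project_config.yaml",
  package = "pepr"
)
# Create the Project object from the config file
p = Project(projectConfig)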
#> Loading config file: /tmp/RtmpoymTo9/Rinstb3055bff7/pepr/extdata/example_peps-master/example_derive/project_config.yaml

And inspect it:

sampleTable(p)
#>    sample_name protocol organism time
#> 1:      pig_0h     RRBS      pig    0
#> 2:      pig_1h     RRBS      pig    1
#> 3:     frog_0h     RRBS     frog    0
#> 4:     frog_1h     RRBS     frog    1
#>                                      file_path
#> 1:  /home/nsheff/data/lab/project/pig_0h.fastq
#> 2:  /home/nsheff/data/lab/project/pig_1h.fastq
#> 3: /home/nsheff/data/lab/project/frog_0h.fastq
#> 4: /home/nsheff/data/lab/project/frog_1h.fastq

As you can see, the resulting samples are annotated as if they had been read from the original, unwieldy annotation file. The $HOME variable in the source template has also been expanded (here to /home/nsheff).

What is more, the p object contains all the information from the project config file (project_config.yaml). Run the following line to explore it:

config(p)
#> Config object. Class: Config
#>  pep_version: 2.0.0
#>  sample_table: 
#> /tmp/RtmpoymTo9/Rinstb3055bff7/pepr/extdata/example_peps-master/example_derive/sample_table.csv
#>  output_dir: /home/nsheff/hello_looper_results
#>  sample_modifiers:
#>     derive:
#>         attributes: file_path
#>         sources:
#>             source1: /home/nsheff/data/lab/project/{organism}_{time}h.fastq
#>             source2: 
#> /path/from/collaborator/weirdNamingScheme_{external_id}.fastq
#>  name: example_derive

Sample modifiers in pepr: derive

Michal Stolarczyk

2023-11-21
