QSAR Workflow

George Oche Ambrose

3/27/2024

library(rQSAR)
#> Loading required package: dplyr
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
#> Loading required package: corrplot
#> corrplot 0.92 loaded
#> Loading required package: tibble
#> Loading required package: gridExtra
#> 
#> Attaching package: 'gridExtra'
#> The following object is masked from 'package:dplyr':
#> 
#>     combine

Introduction

Quantitative Structure-Activity Relationship (QSAR) modeling is a valuable tool in computational chemistry and drug design, where it aims to predict the activity or property of chemical compounds based on their molecular structure. In this vignette, we present the rQSAR package, which provides functions for variable selection and QSAR modeling using Multiple Linear Regression (MLR), Partial Least Squares (PLS), and Random Forest algorithms.

Background

QSAR modeling relies on mathematical models to establish relationships between molecular descriptors (features representing chemical compounds) and their corresponding biological activities or properties. MLR, PLS, and Random Forest are commonly used algorithms for QSAR modeling, each with its own strengths and applications:

Multiple Linear Regression (MLR): MLR fits a linear equation to the data, allowing us to understand the linear relationship between predictors and the response variable. It is simple, interpretable, and provides insights into the importance of each predictor.

Partial Least Squares (PLS): PLS is a regression technique that combines the features of principal component analysis and MLR. It is particularly useful when dealing with multicollinearity and high-dimensional data, making it suitable for QSAR modeling with correlated predictors.

Random Forest: Random Forest is an ensemble learning method that builds multiple decision trees and combines their predictions to improve accuracy and reduce overfitting. It is robust to outliers and non-linear relationships, making it suitable for complex QSAR modeling tasks. # Basic Usage Generating Molecular Descriptors The generate_descriptors_from_sdf function can be used to generate molecular descriptors from an SDF file.

library(rQSAR)

# Path to the SDF file
sdf_file<-sdf_file <- system.file("sample.sdf", package = "rQSAR")

# Generate descriptors
descriptors <- generate_descriptors_from_sdf(sdf_file)
#> Warning in FUN(X[[i]], ...): The molecule should be of type IBioPolymer

#> Warning in FUN(X[[i]], ...): The molecule should be of type IBioPolymer

#> Warning in FUN(X[[i]], ...): The molecule should be of type IBioPolymer

#> Warning in FUN(X[[i]], ...): The molecule should be of type IBioPolymer

#> Warning in FUN(X[[i]], ...): The molecule should be of type IBioPolymer

#> Warning in FUN(X[[i]], ...): The molecule should be of type IBioPolymer

#> Warning in FUN(X[[i]], ...): The molecule should be of type IBioPolymer

#> Warning in FUN(X[[i]], ...): The molecule should be of type IBioPolymer

#> Warning in FUN(X[[i]], ...): The molecule should be of type IBioPolymer

#> Warning in FUN(X[[i]], ...): The molecule should be of type IBioPolymer

#> Warning in FUN(X[[i]], ...): The molecule should be of type IBioPolymer

#> Warning in FUN(X[[i]], ...): The molecule should be of type IBioPolymer

#> Warning in FUN(X[[i]], ...): The molecule should be of type IBioPolymer

#> Warning in FUN(X[[i]], ...): The molecule should be of type IBioPolymer

#> Warning in FUN(X[[i]], ...): The molecule should be of type IBioPolymer

#> Warning in FUN(X[[i]], ...): The molecule should be of type IBioPolymer

#> Warning in FUN(X[[i]], ...): The molecule should be of type IBioPolymer

#> Warning in FUN(X[[i]], ...): The molecule should be of type IBioPolymer

#> Warning in FUN(X[[i]], ...): The molecule should be of type IBioPolymer

#> Warning in FUN(X[[i]], ...): The molecule should be of type IBioPolymer

#> Warning in FUN(X[[i]], ...): The molecule should be of type IBioPolymer

#> Warning in FUN(X[[i]], ...): The molecule should be of type IBioPolymer

#> Warning in FUN(X[[i]], ...): The molecule should be of type IBioPolymer

#> Warning in FUN(X[[i]], ...): The molecule should be of type IBioPolymer

#> Warning in FUN(X[[i]], ...): The molecule should be of type IBioPolymer

#> Warning in FUN(X[[i]], ...): The molecule should be of type IBioPolymer

#> Warning in FUN(X[[i]], ...): The molecule should be of type IBioPolymer

#> Warning in FUN(X[[i]], ...): The molecule should be of type IBioPolymer

#> Warning in FUN(X[[i]], ...): The molecule should be of type IBioPolymer

#> Warning in FUN(X[[i]], ...): The molecule should be of type IBioPolymer

#> Warning in FUN(X[[i]], ...): The molecule should be of type IBioPolymer

#> Warning in FUN(X[[i]], ...): The molecule should be of type IBioPolymer

#> Warning in FUN(X[[i]], ...): The molecule should be of type IBioPolymer

#> Warning in FUN(X[[i]], ...): The molecule should be of type IBioPolymer

#> Warning in FUN(X[[i]], ...): The molecule should be of type IBioPolymer

#> Warning in FUN(X[[i]], ...): The molecule should be of type IBioPolymer

#> Warning in FUN(X[[i]], ...): The molecule should be of type IBioPolymer

#> Warning in FUN(X[[i]], ...): The molecule should be of type IBioPolymer

#> Warning in FUN(X[[i]], ...): The molecule should be of type IBioPolymer

#> Warning in FUN(X[[i]], ...): The molecule should be of type IBioPolymer

#> Warning in FUN(X[[i]], ...): The molecule should be of type IBioPolymer

#> Warning in FUN(X[[i]], ...): The molecule should be of type IBioPolymer

#> Warning in FUN(X[[i]], ...): The molecule should be of type IBioPolymer

#> Warning in FUN(X[[i]], ...): The molecule should be of type IBioPolymer

#> Warning in FUN(X[[i]], ...): The molecule should be of type IBioPolymer

#> Warning in FUN(X[[i]], ...): The molecule should be of type IBioPolymer

#> Warning in FUN(X[[i]], ...): The molecule should be of type IBioPolymer

#> Warning in FUN(X[[i]], ...): The molecule should be of type IBioPolymer

#> Warning in FUN(X[[i]], ...): The molecule should be of type IBioPolymer

Variable Selection

The perform_variable_selection function allows users to perform variable selection based on a given outcome column in a dataset. This step is crucial for identifying relevant predictors and improving the performance of QSAR models.

descriptors<-system.file("descriptor1.csv", package = "rQSAR")
selected_data <- perform_variable_selection(descriptors, "Outcome_column_name")
#> Warning in leaps.setup(x, y, wt = weights, nbest = nbest, nvmax = nvmax, : 273
#> linear dependencies found
#> Reordering variables and trying again:
#> Warning in leaps.exhaustive(a, really.big = really.big): XHAUST returned error
#> code -999
print(head(selected_data))
#>        Fsp3 Weta2.unity Outcome
#> 1 0.7755102   0.3936616    6.04
#> 2 0.7800000   0.3833549    7.15
#> 3 0.7843137   0.3896566    6.92
#> 4 0.7884615   0.4427484    8.44
#> 5 0.7924528   0.4253881    8.18
#> 6 0.7843137   0.4288922    5.27
write.csv(selected_data, "descriptor2.csv")

This function reads the data from the specified CSV file, extracts predictors and the outcome variable, performs variable selection using the ‘leaps’ package, and returns a dataframe containing the selected variables along with the outcome.

NB: The descriptors (independent variables) starts from the first column while the dependent variable should be placed in the last column. Adjust accordingly. Also, replace “D:/QSAR DATA/rQSAR/inst/descriptor1.csv” with directory where your file is located

Visualization

We can visualize the correlation between selected variables and the outcome using the ‘corrplot’ package:

# Load the selected data
selected_data<-system.file("descriptor2.csv", package = "rQSAR")
selected_data <- read.csv(selected_data, header = TRUE)

# Compute the correlation matrix
correlation_matrix <- cor(selected_data)

# Plot the heatmap
corrplot(correlation_matrix, method = "color", type = "lower", tl.col = "black", tl.srt = 45)

This heatmap provides insights into the correlation structure of the selected variables.

NB: Replace “D:/QSAR DATA/rQSAR/inst/descriptor2.csv” with directory where your file is located

QSAR Modeling

The ‘build_qsar_models’ function builds QSAR models using MLR, PLS, and Random Forest algorithms with k-fold cross-validation. This approach helps to assess the predictive performance of the models and generalize their performance to unseen data.

# Example usage:
data_file <- system.file("descriptor2.csv", package = "rQSAR")
model_results <- build_qsar_models(data_file)
print(model_results)
#> $MLR
#> $MLR$results
#> $MLR$results$predictions
#>        3       12       16       17       23       24       25       38 
#> 5.856441 6.562101 6.292082 6.186520 5.536905 6.075208 5.815405 6.572631 
#>        2        4       18       19       20       28       32       35 
#> 5.708786 6.094718 6.083246 6.002863 5.736472 5.864125 6.435836 6.616797 
#>       37       40       45       48        1       14       15       21 
#> 6.597596 6.607633 6.087807 6.588688 5.734606 6.543994 6.428450 6.448148 
#>       34       39       42       43       46        6        8       10 
#> 6.447825 6.526915 6.225151 6.219818 6.219700 6.083252 6.339170 6.377816 
#>       22       27       31       36       44       49        5        7 
#> 6.378533 6.234843 6.283524 6.584599 6.222332 6.399540 6.219362 6.080012 
#>        9       11       13       26       29       30       33       41 
#> 6.292861 6.419134 6.422461 6.159069 6.070043 6.267542 6.410233 6.464090 
#>       47 
#> 6.317965 
#> 
#> $MLR$results$actuals
#>  [1] 6.92 7.48 5.43 6.68 5.61 6.01 6.06 5.81 7.15 8.44 6.27 5.31 5.00 5.15 6.01
#> [16] 5.66 6.16 5.84 6.11 6.71 6.04 7.62 6.19 7.48 5.50 5.45 5.64 5.54 6.61 5.27
#> [31] 6.44 8.23 7.50 6.15 5.74 6.00 5.31 6.58 8.18 5.45 6.57 8.54 6.70 5.93 4.99
#> [46] 5.60 5.74 5.68 6.11
#> 
#> 
#> $MLR$formula
#> $MLR$formula$MLR
#> train_y ~ Fsp3 + Weta2.unity
#> <environment: 0x00000238b9b9ce08>
#> 
#> 
#> 
#> $PLS
#> $PLS$results
#> $PLS$results$predictions
#>  [1] 6.097143 6.354820 6.251680 6.259741 5.984494 6.674929 6.027939 6.372579
#>  [9] 6.095681 6.347985 6.317015 6.487035 6.170419 6.112523 6.192226 6.347489
#> [17] 6.295657 6.322752 6.329328 6.271611 5.807102 6.498281 6.471333 6.375410
#> [25] 6.376986 6.581448 6.215367 6.241339 6.241913 6.258182 6.386136 6.266630
#> [33] 6.189015 6.261069 6.234975 6.436048 6.246070 6.214198 6.249246 6.119952
#> [41] 6.231779 6.168656 6.352436 6.304892 6.096630 6.361955 6.147832 6.273822
#> [49] 6.107987
#> 
#> $PLS$results$actuals
#>  [1] 6.92 7.48 5.43 6.68 5.61 6.01 6.06 5.81 7.15 8.44 6.27 5.31 5.00 5.15 6.01
#> [16] 5.66 6.16 5.84 6.11 6.71 6.04 7.62 6.19 7.48 5.50 5.45 5.64 5.54 6.61 5.27
#> [31] 6.44 8.23 7.50 6.15 5.74 6.00 5.31 6.58 8.18 5.45 6.57 8.54 6.70 5.93 4.99
#> [46] 5.60 5.74 5.68 6.11
#> 
#> 
#> $PLS$formula
#> $PLS$formula$PLS
#> train_y ~ Fsp3 + Weta2.unity
#> attr(,"variables")
#> list(train_y, Fsp3, Weta2.unity)
#> attr(,"factors")
#>             Fsp3 Weta2.unity
#> train_y        0           0
#> Fsp3           1           0
#> Weta2.unity    0           1
#> attr(,"term.labels")
#> [1] "Fsp3"        "Weta2.unity"
#> attr(,"order")
#> [1] 1 1
#> attr(,"intercept")
#> [1] 1
#> attr(,"response")
#> [1] 1
#> attr(,"predvars")
#> list(train_y, Fsp3, Weta2.unity)
#> attr(,"dataClasses")
#>     train_y        Fsp3 Weta2.unity 
#>   "numeric"   "numeric"   "numeric" 
#> <environment: 0x00000238b9b9ce08>
#> 
#> 
#> 
#> $Random_Forest
#> $Random_Forest$results
#> $Random_Forest$results$predictions
#>        3       12       16       17       23       24       25       38 
#> 5.694584 6.709175 6.099719 5.988812 5.808241 5.800001 5.458266 6.798748 
#>        2        4       18       19       20       28       32       35 
#> 5.993947 5.904296 5.871323 5.944234 6.169401 5.957486 6.500019 6.944697 
#>       37       40       45       48        1       14       15       21 
#> 6.694396 6.609412 5.872967 6.760561 6.261787 6.408571 6.746487 6.713458 
#>       34       39       42       43       46        6        8       10 
#> 6.714597 6.193397 6.608745 5.976888 6.033985 6.147460 6.199636 6.326178 
#>       22       27       31       36       44       49        5        7 
#> 6.573173 6.917114 6.159476 6.515402 6.475051 6.350333 6.121623 6.426793 
#>        9       11       13       26       29       30       33       41 
#> 5.695258 6.393562 6.624859 7.102298 6.067309 6.424552 5.993967 6.337189 
#>       47 
#> 6.006727 
#> 
#> $Random_Forest$results$actuals
#>  [1] 6.92 7.48 5.43 6.68 5.61 6.01 6.06 5.81 7.15 8.44 6.27 5.31 5.00 5.15 6.01
#> [16] 5.66 6.16 5.84 6.11 6.71 6.04 7.62 6.19 7.48 5.50 5.45 5.64 5.54 6.61 5.27
#> [31] 6.44 8.23 7.50 6.15 5.74 6.00 5.31 6.58 8.18 5.45 6.57 8.54 6.70 5.93 4.99
#> [46] 5.60 5.74 5.68 6.11
#> 
#> 
#> $Random_Forest$formula
#> $Random_Forest$formula$Random_Forest
#> [1] "randomForest(x = train_X, y = train_y)"

This function reads the data from the specified CSV file, splits it into training and testing sets, builds QSAR models using MLR, PLS, and Random Forest, and returns the model predictions along with the actual values.

Visualization

We can visualize the performance of QSAR models using correlation plots and residual plots:

# Correlation plots
plots <- correlation_plots(model_results)
for (i in seq_along(plots)) {
  print(plots[[i]])
}
#> `geom_smooth()` using formula = 'y ~ x'

#> `geom_smooth()` using formula = 'y ~ x'

#> `geom_smooth()` using formula = 'y ~ x'


# Residual plots
plots <- residual_plots(model_results)
grid.arrange(grobs = plots, ncol = 2)

These plots help evaluate the predictive performance of QSAR models.

Conclusion

In this vignette, we introduced the rQSAR package for QSAR modeling using MLR, PLS, and Random Forest algorithms. By leveraging variable selection techniques and cross-validation, users can build robust QSAR models for predicting chemical properties or activities. For further details and examples, please refer to the package documentation.