Use of quasi-SMILES to build models based on quantitative results from experiments with nanomaterials

Quasi-SMILES


Introduction
Engineered nanoparticles (NPs) are defined as materials measuring 1-100 nm in at least one of their dimensions. NPs are used today in more than 5000 consumer products, with applications in biomedicine, energy conservation and generation, food additives and preservation, electronics (quantum components and sensors), chemical catalysis, construction materials (nanocomposites), and other fields (Abd Elkodous et al., 2019; Theerthagiri et al., 2019; Bhuyan et al., 2019; Kaur et al., 2018; Kumar et al., 2018). As a consequence, increased human exposure to NPs is expected (Westmeier et al., 2016). Their tiny size enables NPs to penetrate the cell membrane and affect biological functions, which potentially makes nanoparticles a global health risk factor. Because NPs are so diverse, no single health and safety assessment can cover all of them. Instead, each type of NP must be classified by composition, size, and other parameters. The importance of sustainable nanotechnology and the potential health risk justify the development of improved analysis methods to assess the interactions of NPs with cells and organs of plants, animals, and humans (EFSA, 2021). Computational methods are being investigated as a way to reduce the cost, time, and resources of nanotoxicology testing (Buglak et al., 2019). One such approach is based on quantitative structure-property/activity relationships (QSPR/QSAR), which assume that the activity of a substance is related to its physicochemical properties. However, QSPR/QSAR was originally intended to make use of the properties of small organic compounds (Cronin et al., 2019). Applying QSPR/QSAR to nanomaterials faces several challenges, which go beyond the difficulty of defining the 'structure' that serves as the source of descriptors.
Configurations of atoms and bonds are not enough to account for the full behavior of nanomaterials, because other observed (as well as latent) factors can influence their biological impact. In fact, it is well known that the biological response to NP exposure is strongly affected by a diverse, eclectic set of experimental parameters that need further attention in computational biology (Toropova and Toropov, 2022; Trinh et al., 2018; Toropov and Toropova, 2015). The wide differences between the experimental conditions of published nano-bio studies make it complicated to compare and generalize the biological impact of specific NPs using traditional SMILES-based QSPR/QSAR models. The task becomes essentially insurmountable when one considers variable parameters such as organism/organ/cell type, gender specificity, dose, NP physicochemical characteristics, media/temperature/ionic strength/pH, synergy with other NPs or contaminants, and exposure duration. For this reason, models that include all available experimental parameters are highly valuable and should increase the confidence, accuracy, and predictive power of models of nano-bio interactions. Traditional QSAR models represent the molecular structure via molecular graphs, vectors of physicochemical parameters, or the simplified molecular input-line entry system (SMILES). To include other experimental parameters, we and others have shown that SMILES can be extended with special symbols that represent an arbitrary number of diverse eclectic data (Toropov and Toropova, 2015; Toropova and Toropov, 2019). We term this methodology 'quasi-SMILES' (Toropov and Toropova, 2015; Toropova and Toropov, 2019). The quasi-SMILES approach also promotes closer collaboration and understanding between experimentalists and computational researchers.
In the present study, we demonstrate that quasi-SMILES is an efficient methodology to compare and build predictive models from data obtained under widely different experimental conditions. As a case study, we have analyzed the toxicity testing data of silver NPs obtained from fish and daphnia.

Data
The data used in this study are identical to those in Jung et al. (2021) and describe the acute ecological toxicity of AgNPs to daphnia and fish. Fig. 1 shows the general scheme for converting the experimental data into quasi-SMILES. The data were acquired for daphnia (Daphnia magna) and zebrafish (Danio rerio) and cover the size (nm), zeta potential (mV), and surface coating of the NPs (bare NPs, coated NPs, and coated NPs with coating material descriptors, following Jung et al., 2021). In total, the 170 experimental values from Jung et al. (2021) were converted into quasi-SMILES. The conversion produces duplicates, i.e., distinct experimental situations expressed by the same quasi-SMILES. After the duplicates are removed, the total set contains 102 quasi-SMILES, represented in Table 1. These quasi-SMILES were randomly split into the active training set (indicated by '+'), the passive training set ('-'), the calibration set ('#'), and the validation set ('*'). The active training set is used to build the model: features extracted from the quasi-SMILES of the active training set enter the Monte Carlo optimization, which provides the correlation weights for these features that give the maximal correlation coefficient between the descriptor and the endpoint on the active training set.
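The conversion, duplicate removal, and random split described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the bin widths, the code numbering, the rule of keeping the first endpoint of a duplicate, and the example records are all hypothetical and are not taken from Jung et al. (2021) or from the tables of this paper.

```python
import random

# Hypothetical binning, chosen only for illustration; the actual ranges
# are the ones tabulated in the paper.
SIZE_BIN_NM = 10.0   # assumed width of one size range
ZETA_MIN_MV = -60.0  # assumed lower edge of the first zeta range
ZETA_BIN_MV = 5.0    # assumed width of one zeta range


def to_quasi_smiles(status, organism, size_nm, zeta_mv):
    """Encode one experiment as a string of bracketed codes."""
    s = int(size_nm // SIZE_BIN_NM)
    z = int((zeta_mv - ZETA_MIN_MV) // ZETA_BIN_MV)
    return f"[{status}][{organism}][s%{s}][z%{z}]"


def deduplicate(records):
    """Keep the first endpoint seen for each quasi-SMILES (an assumption)."""
    unique = {}
    for status, organism, size_nm, zeta_mv, plc50 in records:
        key = to_quasi_smiles(status, organism, size_nm, zeta_mv)
        unique.setdefault(key, plc50)
    return unique


def split_sets(quasi_smiles, seed=0):
    """Randomly assign '+', '-', '#', '*' so each set holds about 25%."""
    shuffled = sorted(quasi_smiles)
    random.Random(seed).shuffle(shuffled)
    return {q: "+-#*"[i % 4] for i, q in enumerate(shuffled)}


# Made-up records: (status, organism, size in nm, zeta in mV, pLC50)
records = [
    ("bare", "Daph", 25.0, -31.0, 2.1),
    ("bare", "Daph", 27.0, -33.0, 2.3),  # same ranges, so a duplicate
    ("coat", "Fish", 48.0, -12.0, 0.9),
    ("cons", "Fish", 48.0, -12.0, 1.0),
    ("coat", "Daph", 61.0, -44.0, 2.8),
]
data = deduplicate(records)
labels = split_sets(data)
```

Two experiments that fall into the same size and zeta ranges collapse into one quasi-SMILES, which is why the number of quasi-SMILES is smaller than the number of experimental values.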
Since the passive training set contains data that were not included in the active training set, it serves to check whether the model obtained for the active training set is satisfactory for quasi-SMILES it has not seen. The calibration set is used to limit overtraining (overfitting) of the model. The overall workflow of model generation is as follows. At the beginning of the optimization, the correlation coefficients between the experimental values of the endpoint and the descriptor increase simultaneously for all sets. The correlation coefficient for the calibration set then reaches a maximum, which indicates the beginning of overtraining. At this point, the Monte Carlo optimization is stopped, and the validation set is used to assess the predictive potential of the obtained model. Tables 1 and 2 contain the ranges of NP size and zeta potential used to generate the quasi-SMILES examined here. Table 3 contains the list of quasi-SMILES used to build the models for pLC50.
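The workflow above, in which optimization stops once the calibration set no longer improves, can be sketched as a generic loop. This is a minimal sketch and not the software used in this study. The step size, the greedy acceptance rule, and the names `run_epoch`, `optimize`, `target`, and `calibration_score` are assumptions made for illustration, and the toy objective at the end merely stands in for the real correlation-based criteria.

```python
import random


def run_epoch(weights, target, rng, step=0.1):
    """One epoch: modify every (non-blocked) code once, in random order."""
    order = list(weights)
    rng.shuffle(order)
    for code in order:
        best = target(weights)
        keep = weights[code]
        # Try a step up and a step down; keep whichever raises the target.
        for trial in (keep + step, keep - step):
            weights[code] = trial
            score = target(weights)
            if score > best:
                best, keep = score, trial
        weights[code] = keep


def optimize(weights, target, calibration_score, max_epochs=50, seed=0):
    """Stop as soon as the calibration score stops increasing."""
    rng = random.Random(seed)
    best_cal = calibration_score(weights)
    best_weights = dict(weights)
    for _ in range(max_epochs):
        run_epoch(weights, target, rng)
        cal = calibration_score(weights)
        if cal <= best_cal:  # calibration maximum passed: overtraining begins
            break
        best_cal, best_weights = cal, dict(weights)
    return best_weights


# Toy objective standing in for the real target function and calibration set.
def toy(w):
    return -((w["a"] - 1.0) ** 2) - (w["b"] + 0.5) ** 2


result = optimize({"a": 0.0, "b": 0.0}, toy, toy)
```

In practice `target` would be computed on the active and passive training sets and `calibration_score` on the calibration set, so the two functions differ; the toy example uses the same function for both only to keep the sketch short.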

Optimal descriptors calculated with quasi-SMILES
The model of toxicity examined here is a mathematical function of four variables:

$$\mathrm{pLC50} = F(T, DF, S, Z) \qquad (1)$$

where T is the status of the NPs (bare, coat, cons); DF is the organism (Daphnia or Zebrafish); S is the size (nm); and Z is the zeta potential (mV). The function is based on the values of the optimal descriptor calculated with the so-called correlation weights:

$$DCW(T, N) = \sum_{k} CW(qS_k) \qquad (2)$$

The $qS_k$ are elements of quasi-SMILES from the list represented in Table 2. The $CW(qS_k)$ are correlation weights of the quasi-SMILES elements, i.e., special coefficients calculated by the Monte Carlo method. The model for pLC50 is a one-variable correlation:

$$\mathrm{pLC50} = C_0 + C_1 \times DCW(T, N) \qquad (3)$$

In DCW(T, N), T is the threshold for detecting rare codes (not to be confused with the NP status T in Eq. (1)). If the frequency of a code in the active training set is less than T, the code is blocked, i.e., its correlation weight is fixed at zero, and the code is not included further in the modeling process. N is the number of epochs of the Monte Carlo optimization. One epoch is the modification of all non-blocked codes. The sequence of the modifications is random.

Table footnote (a): '+' represents the active training set (≈25%); '-' represents the passive training set (≈25%); '#' represents the calibration set (≈25%); and '*' represents the validation set. Three forms of silver nanoparticles are possible: "bare" describes nanoparticles without any coating; "coat" (coating) describes nanoparticles with a shell; "cons" describes nanoparticles including coating material descriptors. Daphnia magna (Daph) and zebrafish (Fish) are the organisms examined; size is given in nm; zeta is the zeta potential of the nanoparticles (mV); pLC50 is the decimal logarithm of the concentration causing mortality in 50% of daphnia or fish.
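To make the role of the threshold and of the correlation weights concrete, the following sketch computes a descriptor and a prediction for one quasi-SMILES. It assumes that the descriptor is the sum of the correlation weights of the non-blocked codes and that the model is linear in that descriptor. The weights, the intercept and slope, and the helper names (`codes`, `code_frequencies`, `dcw`, `predict`) are hypothetical.

```python
import re
from collections import Counter


def codes(quasi_smiles):
    """Split a quasi-SMILES string into its bracketed codes."""
    return re.findall(r"\[[^\]]+\]", quasi_smiles)


def code_frequencies(active_training):
    """Count how often each code occurs in the active training set."""
    return Counter(c for q in active_training for c in codes(q))


def dcw(quasi_smiles, cw, freq, threshold):
    """Sum the correlation weights of codes that are not blocked.

    A code is blocked (weight treated as zero) when its frequency in the
    active training set is below the threshold.
    """
    return sum(
        cw.get(c, 0.0)
        for c in codes(quasi_smiles)
        if freq.get(c, 0) >= threshold
    )


def predict(quasi_smiles, cw, freq, threshold, c0, c1):
    """One-variable linear model in the descriptor."""
    return c0 + c1 * dcw(quasi_smiles, cw, freq, threshold)


# Made-up active training set and correlation weights.
active = ["[coat][Daph][s%2][z%5]", "[coat][Fish][s%2][z%4]"]
cw = {"[coat]": 0.5, "[Daph]": 1.0, "[s%2]": -0.25, "[z%5]": 0.75}
freq = code_frequencies(active)

descriptor = dcw(active[0], cw, freq, threshold=2)
estimate = predict(active[0], cw, freq, threshold=2, c0=1.0, c1=2.0)
```

With a threshold of 2, only `[coat]` and `[s%2]` occur often enough to contribute; `[Daph]` and `[z%5]` occur once and are blocked, so they add nothing to the descriptor.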

The Monte Carlo method
Eq. (2) requires numerical values of the above correlation weights, and Monte Carlo optimization is the tool used to calculate them. The target function for the Monte Carlo optimization is built from $r_{AT}$ and $r_{PT}$, the correlation coefficients between the observed and predicted endpoint for the active training set and the passive training set, respectively, and from $IIC_C$, the index of ideality of correlation.
$IIC_C$ is calculated on the calibration set from the observed and calculated values of the endpoint. Table 4 contains the correlation weights of the quasi-SMILES codes obtained by the Monte Carlo method. Table 5 contains an example of the calculation of DCW(1,15).
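For orientation, the sketch below computes an index of ideality of correlation and combines it with the two training correlations. Both formulas are assumptions based on the form usually reported for this family of models: the index is taken as the correlation coefficient multiplied by the ratio of the smaller to the larger of the mean absolute errors for negative and non-negative residuals, and the target function is a weighted combination. The weights `w_diff` and `w_iic` are hypothetical, not values from this study.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient; 0.0 if either series is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def iic(observed, calculated):
    """Assumed form: r * min(MAE-, MAE+) / max(MAE-, MAE+)."""
    r = pearson_r(observed, calculated)
    diffs = [o - c for o, c in zip(observed, calculated)]
    neg = [abs(d) for d in diffs if d < 0]
    pos = [abs(d) for d in diffs if d >= 0]
    mae_neg = sum(neg) / len(neg) if neg else 0.0
    mae_pos = sum(pos) / len(pos) if pos else 0.0
    largest = max(mae_neg, mae_pos)
    if largest == 0:  # every residual is zero
        return r
    return r * min(mae_neg, mae_pos) / largest


def target_function(r_at, r_pt, iic_c, w_diff=0.1, w_iic=0.5):
    """Assumed form: reward both correlations, penalise their gap, add IIC."""
    return r_at + r_pt - abs(r_at - r_pt) * w_diff + iic_c * w_iic


observed = [1.0, 2.0, 3.0, 4.0]
balanced = [1.5, 1.5, 3.5, 3.5]   # residuals are symmetric around zero
one_sided = [2.0, 3.0, 4.5, 5.0]  # every value is overestimated
```

The index rewards models whose over- and underestimates are balanced: with symmetric residuals it equals the correlation coefficient, while a model that overestimates every value scores zero regardless of its correlation.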

Models
Models for pLC50 were obtained for three random splits. Table 6 contains the statistical characteristics of the models for splits #1, #2, and #3. The statistical quality of these models is quite good. Fig. 2 shows the graphical representation of the model for pLC50 obtained for split 1.
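The external validation criteria listed with Table 6 can be computed as in the sketch below. The formulas are the ones usually stated in the QSAR literature for the three external Q² variants and are given here as an assumption about what the table reports; the function name `q2_external` and the numbers are made up.

```python
def q2_external(y_train, y_ext, y_ext_pred):
    """Return (Q2_F1, Q2_F2, Q2_F3) for an external set.

    All three compare the prediction error on the external set with a
    reference spread; they differ only in which spread is used.
    """
    press = sum((o - p) ** 2 for o, p in zip(y_ext, y_ext_pred))
    mean_train = sum(y_train) / len(y_train)
    mean_ext = sum(y_ext) / len(y_ext)

    # F1: external values around the training mean
    q2_f1 = 1 - press / sum((o - mean_train) ** 2 for o in y_ext)
    # F2: external values around the external mean
    q2_f2 = 1 - press / sum((o - mean_ext) ** 2 for o in y_ext)
    # F3: per-compound error against per-compound training variance
    train_var = sum((o - mean_train) ** 2 for o in y_train) / len(y_train)
    q2_f3 = 1 - (press / len(y_ext)) / train_var
    return q2_f1, q2_f2, q2_f3


# Made-up endpoint values.
f1, f2, f3 = q2_external(
    y_train=[1.0, 2.0, 3.0, 4.0, 5.0],
    y_ext=[2.0, 4.0],
    y_ext_pred=[2.5, 3.5],
)
```

The first two criteria coincide here because the training mean and the external mean happen to be equal; they diverge when the external set is centred differently from the training set.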

Applicability domain
Different codes have different statistical significance. The defect of a code is a measure of its statistical significance. The defects of the quasi-SMILES codes (Eq. (15)) are calculated from $P(qS_k)$ and $P'(qS_k)$, the probabilities of $qS_k$ in the active training and passive training sets, respectively, and from $N(qS_k)$ and $N'(qS_k)$, the frequencies of $qS_k$ in the active training and passive training sets, respectively. The statistical defect of a quasi-SMILES ($D_j$) is calculated as

$$D_j = \sum_{k=1}^{N} \mathrm{defect}(qS_k) \qquad (16)$$

where N is the number of non-blocked quasi-SMILES codes in the quasi-SMILES. A quasi-SMILES falls in the domain of applicability if

$$D_j < 2 \cdot D \qquad (17)$$

Fig. 3 represents the defects of codes that are not blocked.

Table 6 Statistical quality of models for pLC50 for three random splits. Q² is the leave-one-out cross-validated R² (Shayanfar and Shayanfar, 2014). Q²F1, Q²F2, and Q²F3 are the modifications of Q² (Chirico and Gramatica, 2011). a IIC is the index of ideality of correlation (Eq. (6)).
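The applicability-domain check can be illustrated with the sketch below. It rests on two assumptions that should be verified against the original equations: the defect of a code is taken as the absolute difference of its probabilities in the active and passive training sets divided by the sum of its frequencies in both sets, and D in the criterion is taken as the mean defect over the active training set. Blocking of rare codes is omitted for brevity, and the data are invented.

```python
import re
from collections import Counter


def codes(quasi_smiles):
    """Split a quasi-SMILES string into its bracketed codes."""
    return re.findall(r"\[[^\]]+\]", quasi_smiles)


def code_defects(active, passive):
    """Assumed form: |P - P'| / (N + N') for each code of the active set."""
    n_act = Counter(c for q in active for c in codes(q))
    n_pas = Counter(c for q in passive for c in codes(q))
    defects = {}
    for code, n in n_act.items():
        n2 = n_pas.get(code, 0)
        p, p2 = n / len(active), n2 / len(passive)
        defects[code] = abs(p - p2) / (n + n2)
    return defects


def quasi_smiles_defect(quasi_smiles, defects):
    """Sum the defects of the codes known from the active training set."""
    return sum(defects[c] for c in codes(quasi_smiles) if c in defects)


def in_domain(quasi_smiles, defects, mean_defect):
    """Domain criterion: defect below twice the reference defect."""
    return quasi_smiles_defect(quasi_smiles, defects) < 2 * mean_defect


# Made-up sets.
active = ["[bare][Daph]", "[coat][Daph]"]
passive = ["[bare][Fish]", "[bare][Daph]"]
defects = code_defects(active, passive)
# Assumption: D is the average defect over the active training set.
mean_defect = sum(quasi_smiles_defect(q, defects) for q in active) / len(active)
```

A code that is common in one training set but rare in the other receives a large defect, so quasi-SMILES built mostly from such codes are the ones most likely to fall outside the domain.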

Mechanistic interpretation
Given the correlation weights of quasi-SMILES codes obtained in several runs of the Monte Carlo optimization, one can identify codes whose correlation weights are positive in every run. Such a code can be interpreted as a promoter of an increase of the endpoint. If a code has solely negative values across the runs, it can be interpreted as a promoter of a decrease of the endpoint. If a code has both positive and negative values, its role is unclear. Table 7 contains a collection of promoters of the increase and decrease of pLC50 together with the statistical defects calculated by Eq. (15).
One can observe in Table 7 that coating (coat) and coating that considers the molecular features of the coating (cons) are quasi-SMILES codes that promote an increase in pLC50. With regard to animal species, the correlation weight of the code "Fish" is negative, so this code promotes a decrease in pLC50, while the opposite is true for "Daph". In other words, the impact of NPs on fish is smaller than that on daphnia.
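The sign-based rule used for this interpretation is simple enough to state as code. The sketch below is illustrative only: the function name `classify_codes` is hypothetical and the weights are made-up numbers, not the values from Table 7.

```python
def classify_codes(runs):
    """Label each code by the sign of its weight across several runs.

    runs: list of dicts mapping a code to its correlation weight,
    one dict per Monte Carlo run.
    """
    roles = {}
    for code in set().union(*runs):
        values = [run[code] for run in runs if code in run]
        if all(v > 0 for v in values):
            roles[code] = "increase"   # positive in every run
        elif all(v < 0 for v in values):
            roles[code] = "decrease"   # negative in every run
        else:
            roles[code] = "unclear"    # mixed signs (or a zero weight)
    return roles


# Made-up correlation weights from three runs.
runs = [
    {"[coat]": 0.8, "[Fish]": -0.4, "[s%3]": 0.2},
    {"[coat]": 1.1, "[Fish]": -0.7, "[s%3]": -0.1},
    {"[coat]": 0.5, "[Fish]": -0.2, "[s%3]": 0.3},
]
roles = classify_codes(runs)
```

Requiring the same sign in every run is a deliberately strict rule: a single run with the opposite sign is enough to mark the role of a code as unclear.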

Estimation of the influence of size and zeta potential on pLC50
To study the impact of different sizes and zeta potentials, one can compare the distributions displayed in Fig. 3. The largest size ranges (80-140 nm) were excluded from consideration because very few NPs fall in these ranges. This analysis shows that, of the two variables, zeta potential makes the greater contribution to the increase in pLC50. The range of zeta potential denoted by z%14 (i.e., from −46.210 to −42.780 mV) defines the global maximum of this component for both Fish and Daphnia. The NP size has a moderate impact, which reaches its maximum at [s%11] but also shows many local maxima of comparable magnitude (Fig. 4).
The ability to predict quantitative results is an attractive feature of a computational model. The quasi-SMILES model presented here offers that ability even when the data were obtained under different experimental conditions. Thus, this work is an attempt to bridge experimental data and predictive models. The gradual accumulation of experimental results in the field continually improves the reliability of the "experiment-model" system (or of the "experimentalist-model developer" system) and opens new research possibilities.
The Supplementary materials section contains the technical details of the models for the three random splits.

Conclusions
Nano-QSAR models were generated to predict pLC50 in two biological systems. Three random splits of the available experimental data into training and validation subsets confirm the robust predictive potential of these models. Quasi-SMILES is an approach for building models of a new quality: the descriptor becomes a mathematical function of structure and experimental conditions, or even a mathematical function of experimental conditions together with arbitrary circumstances that can affect the results of the experiment. In other words, the quasi-SMILES technique can open a new way forward in theoretical and practical chemistry, biochemistry, and toxicology.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.