Advanced Sampling
This module contains methods for subsampling very large datasets.
Theory/Introduction
Example
Assuming the configurations, forces (and possibly local energies) have already been extracted from a .xyz file, we can apply one of the methods in advanced_sampling to subsample a meaningful and representative training set.
We first load the configurations and forces previously extracted from the .xyz file:
import numpy as np
confs = np.load(configurations_file)
forces = np.load(forces_file)  # forces_file: path to the saved forces array
We then initialize the sampling class and separate ntest configurations for the test set:
from mff.advanced_sampling import Sampling
s = Sampling(confs=confs, forces=forces, sigma_2b=0.05, sigma_3b=0.1, sigma_mb=0.2, noise=0.001, r_cut=8.5, theta=0.5)
s.train_test_split(confs=confs, forces=forces, ntest=200)
Now we can subsample a training set using our preferred method, for example importance vector machine sampling on the variance of the force prediction (here ntr is the desired number of training points):
MAE, STD, RMSE, index, time = s.ivm_f(method='2b', ntrain=ntr, batchsize=1000)
or importance vector machine sampling on the measured error of the force prediction for a 3-body kernel:
MAE, STD, RMSE, index, time = s.ivm_f(method='3b', ntrain=ntr, batchsize=1000, use_pred_error=False)
Other methods include sampling based on the interatomic distance values present in each configuration:
MAE, STD, RMSE, index, time = s.grid(method='2b', nbins=1000)
-
class mff.advanced_sampling.Sampling(confs=None, energies=None, forces=None, sigma_2b=0.05, sigma_3b=0.1, sigma_mb=0.2, noise=0.001, r_cut=8.5, theta=0.5)
Class containing sampling methods to optimize the selection of the training database. The class is currently set up to work with local atomic energies, and is therefore intended for confined systems (nanoclusters, molecules). Some of the methods can also be applied to force training (ivm, random), or are independent of the training outputs (grid); these can be used on systems with PBCs, where a local energy is not well defined. The class also initializes two GP objects that are used by some of its methods.
Parameters: - confs (list of arrays) – List of the configurations as M*5 arrays
- energies (array) – Local atomic energies, one per configuration
- forces (array) – Forces acting on the central atoms of confs, one per configuration
- sigma_2b (float) – Lengthscale parameter of the 2-body kernels in Angstroms
- sigma_3b (float) – Lengthscale parameter of the 3-body kernels in Angstroms
- sigma_mb (float) – Lengthscale parameter of the many-body kernel in Angstroms
- noise (float) – Regularization parameter of the Gaussian process
- r_cut (float) – Cutoff radius for the Gaussian process
- theta (float) – Decay lengthscale of the cutoff function for the Gaussian process
-
elements (list) – List of the atomic numbers of the atoms present in the system
-
natoms (int) – Number of atoms in the system, used for nanoclusters
-
K2 (array) – Gram matrix for the energy-energy 2-body kernel using the full reduced dataset
-
K3 (array) – Gram matrix for the energy-energy 3-body kernel using the full reduced dataset
-
clean_dataset(randomized=True, shuffling=True)
Subsamples a complete trajectory so that only one atomic environment is kept per snapshot. This is necessary when training on energies of nanoclusters, in order to assign a unique energy value to every configuration and to avoid redundant information in the form of local atomic environments centered on different atoms of the same snapshot.
Parameters: - randomized (bool) – If True, a random atom is chosen for every snapshot; if False, the first atom in the configuration is always chosen to represent the snapshot.
- shuffling (bool) – If True, the dataset is shuffled randomly once it has been created, in order to avoid any bias during incremental training set optimization methods (e.g. rvm, cur, ivm).
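The selection logic described for clean_dataset can be illustrated with a small standalone sketch. It is not the mff implementation: the function name `one_environment_per_snapshot`, the `seed` argument and the nested-list data layout are assumptions made only for this example.

```python
import random


def one_environment_per_snapshot(snapshots, randomized=True, shuffling=True, seed=0):
    """Pick one local environment per snapshot.

    snapshots: list of snapshots, each a list of per-atom environments
    (hypothetical layout). Returns a list of (snapshot_index, atom_index).
    """
    rng = random.Random(seed)
    picks = []
    for s, envs in enumerate(snapshots):
        # Random atom if requested, otherwise always the first one.
        atom = rng.randrange(len(envs)) if randomized else 0
        picks.append((s, atom))
    if shuffling:
        # Remove the time ordering of the trajectory.
        rng.shuffle(picks)
    return picks


snapshots = [["a0", "a1", "a2"], ["b0", "b1", "b2"], ["c0", "c1", "c2"]]
first_atoms = one_environment_per_snapshot(snapshots, randomized=False, shuffling=False)
random_atoms = one_environment_per_snapshot(snapshots)
```

Each snapshot contributes exactly one entry, so every retained environment can be paired with a single energy value.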
-
cur(method='2b', ntrain=1000, batchsize=1000, error_metric='energy')
Sampling using the CUR decomposition technique. The complete dataset is first divided into batches, then the energy-energy Gram matrix is calculated for each batch. An SVD decomposition is subsequently applied to each Gram matrix, and a number of entries (columns) is selected based on their importance score. The method is calibrated so that the final number of training points selected is roughly equal to the input parameter ntrain.
Parameters: - method (str) – 2b or 3b, specifies which energy kernel to use to calculate the Gram matrix
- ntrain (int) – Number of training points to be selected from the whole dataset
- batchsize (int) – Number of data points to be used for each calculation of the Gram matrix. Lower values make the computation faster but the error might be higher.
- error_metric (str) – specifies whether the final error is calculated on energies or on forces
Returns: - MAE (float) – Mean absolute error made by the final iteration of the method on the test set
- SMAE (float) – Standard deviation of the absolute error made by the final iteration of the method on the test set
- RMSE (float) – Root mean squared error made by the final iteration of the method on the test set
- index (list) – List containing the indexes of all the selected training points
- total_time (float) – Execution time in seconds
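As an illustration of the column-selection step, the following sketch computes leverage scores from the SVD of a single Gram matrix and keeps the highest-scoring columns. It is a simplified, deterministic variant written under assumptions: the helper `cur_select`, the `rank` parameter and the toy squared-exponential Gram matrix are not part of mff, and the library's own importance score and batching may differ.

```python
import numpy as np


def cur_select(gram, n_select, rank):
    """Indices of the n_select columns with the largest rank-`rank` leverage scores."""
    _, _, vt = np.linalg.svd(gram)
    # Leverage score of column j: mean squared weight of j in the top right singular vectors.
    scores = np.sum(vt[:rank] ** 2, axis=0) / rank
    return np.argsort(scores)[::-1][:n_select]


# Toy data: 1-D inputs with a squared-exponential Gram matrix
# (a stand-in for the energy-energy kernel of one batch).
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 5.0, size=50)
gram = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / 0.5 ** 2)
selected = cur_select(gram, n_select=10, rank=5)
```

The classical CUR algorithm samples columns at random with probability proportional to these scores; taking the top scores directly keeps the example reproducible.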
-
grid(method='2b', nbins=100, error_metric='energy', return_error=True)
Grid sampling, based either on interatomic distances (2b) or on triplets of interatomic distances (3b). Training configurations are shuffled and are then included in the final database only if they contain a distance value (or a triplet of distance values) which is not yet present in the binned histogram of distance values (or triplets of distance values) of the final database. This method is very fast since it evaluates neither kernel functions nor Gram matrices.
Parameters: - method (str) – 2b or 3b, specifies which energy kernel to use to calculate the Gram matrix
- nbins (int) – Number of bins to use when building a histogram of interatomic distances. If method is 2b, this specifies the value only for distances from the central atom; if method is 3b, this specifies the value for triplets of distances.
- error_metric (str) – specifies whether the final error is calculated on energies or on forces
- return_error (bool) – if True, the error on the test set using the sampled database is returned
Returns: - MAE (float) – Mean absolute error made by the final iteration of the method on the test set
- SMAE (float) – Standard deviation of the absolute error made by the final iteration of the method on the test set
- RMSE (float) – Root mean squared error made by the final iteration of the method on the test set
- index (list) – List containing the indexes of all the selected training points
- total_time (float) – Execution time in seconds
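The 2b variant of the grid criterion can be sketched without any kernel evaluation. The code below is only an illustration under assumptions: the function `grid_sample_2b`, its `r_cut` and `seed` arguments, and the row layout (relative position in the first three entries) are hypothetical and not taken from mff.

```python
import math
import random


def grid_sample_2b(confs, nbins, r_cut, seed=0):
    """Select configurations that populate at least one new distance bin.

    confs: list of configurations, each a list of rows whose first three
    entries are the neighbour position relative to the central atom (assumed layout).
    """
    rng = random.Random(seed)
    order = list(range(len(confs)))
    rng.shuffle(order)
    occupied = set()  # histogram bins already covered by the selected set
    selected = []
    for i in order:
        bins = set()
        for row in confs[i]:
            d = math.sqrt(row[0] ** 2 + row[1] ** 2 + row[2] ** 2)
            if d < r_cut:
                bins.add(min(int(d / r_cut * nbins), nbins - 1))
        if bins - occupied:  # keep the configuration only if it adds a new bin
            selected.append(i)
            occupied |= bins
    return selected


# Toy dataset: 200 configurations with 8 random neighbours each.
toy_rng = random.Random(1)
confs = [[[toy_rng.uniform(-3, 3) for _ in range(3)] + [26, 26] for _ in range(8)]
         for _ in range(200)]
selected = grid_sample_2b(confs, nbins=20, r_cut=4.0)
```

Since every accepted configuration fills at least one previously empty bin, the selected set can never contain more than nbins configurations in this 2b sketch.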
-
ivm_e(method='2b', ntrain=500, batchsize=1000, use_pred_error=True, error_metric='energy')
Importance vector machine sampling for energies. This method uses a 2- or 3-body energy kernel and trains it on the energies of the partitioned training dataset. The algorithm starts from two configurations chosen at random. At each iteration, either the predicted variance or the observed error is calculated on batchsize configurations from the training set, and the configuration with the highest value is included in the final set. The method finishes when ntrain configurations are included in the final set.
Parameters: - method (str) – 2b or 3b, specifies which energy kernel to use to calculate the Gram matrix
- ntrain (int) – Number of training points to extract from the training dataset
- batchsize (int) – number of training points to use in each iteration of the error prediction
- use_pred_error (bool) – if True, the predicted variance is used as the metric of the ivm; if False, the observed error is used instead
- error_metric (str) – specifies whether the final error is calculated on energies or on forces
Returns: - MAE (float) – Mean absolute error made by the final iteration of the method on the test set
- SMAE (float) – Standard deviation of the absolute error made by the final iteration of the method on the test set
- RMSE (float) – Root mean squared error made by the final iteration of the method on the test set
- index (list) – List containing the indexes of all the selected training points
- total_time (float) – Execution time in seconds
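The greedy loop described above can be sketched for the predicted-variance case with a plain squared-exponential kernel on 1-D inputs. This is an illustrative stand-in, not the mff code: the functions `rbf` and `greedy_variance_selection`, the `seed` argument and the toy kernel are assumptions, and the observed-error variant (use_pred_error=False) is not covered.

```python
import numpy as np


def rbf(a, b, sigma):
    """Squared-exponential kernel between two sets of 1-D points."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / sigma ** 2)


def greedy_variance_selection(x, ntrain, batchsize, sigma=0.5, noise=1e-3, seed=0):
    """Grow a training set by adding, at each step, the candidate with the largest GP variance."""
    rng = np.random.default_rng(seed)
    # Start from two points chosen at random.
    selected = [int(i) for i in rng.choice(len(x), size=2, replace=False)]
    while len(selected) < ntrain:
        remaining = np.setdiff1d(np.arange(len(x)), selected)
        batch = rng.choice(remaining, size=min(batchsize, len(remaining)), replace=False)
        k_ss = rbf(x[selected], x[selected], sigma) + noise * np.eye(len(selected))
        k_bs = rbf(x[batch], x[selected], sigma)
        # Predictive variance k(x*, x*) - k*^T K^-1 k*, with k(x*, x*) = 1 for this kernel.
        var = 1.0 - np.sum(k_bs * np.linalg.solve(k_ss, k_bs.T).T, axis=1)
        selected.append(int(batch[np.argmax(var)]))
    return selected


x = np.linspace(0.0, 10.0, 200)
selected = greedy_variance_selection(x, ntrain=10, batchsize=50)
```

A smaller batchsize makes each iteration cheaper, at the cost of evaluating the variance on fewer candidates per step.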
-
ivm_f(method='2b', ntrain=500, batchsize=1000, use_pred_error=True, error_metric='energy')
Importance vector machine sampling for forces. This method uses a 2- or 3-body energy kernel and trains it on the energies of the partitioned training dataset. The algorithm starts from two configurations chosen at random. At each iteration, either the predicted variance or the observed error is calculated on batchsize configurations from the training set, and the configuration with the highest value is included in the final set. The method finishes when ntrain configurations are included in the final set.
Parameters: - method (str) – 2b or 3b, specifies which energy kernel to use to calculate the Gram matrix
- ntrain (int) – Number of training points to extract from the training dataset
- batchsize (int) – number of training points to use in each iteration of the error prediction
- use_pred_error (bool) – if True, the predicted variance is used as the metric of the ivm; if False, the observed error is used instead
- error_metric (str) – specifies whether the final error is calculated on energies or on forces
Returns: - MAE (float) – Mean absolute error made by the final iteration of the method on the test set
- SMAE (float) – Standard deviation of the absolute error made by the final iteration of the method on the test set
- RMSE (float) – Root mean squared error made by the final iteration of the method on the test set
- index (list) – List containing the indexes of all the selected training points
- total_time (float) – Execution time in seconds
-
random(method='2b', ntrain=500, error_metric='energy', return_error=True)
Random subsampling of training points from the larger training dataset.
Parameters: - method (str) – 2b or 3b, specifies which energy kernel to use to calculate the Gram matrix
- ntrain (int) – Number of points to include in the final dataset.
- error_metric (str) – specifies whether the final error is calculated on energies or on forces
- return_error (bool) – if True, train a GP and run a test
Returns: - MAE (float) – Mean absolute error made by the final iteration of the method on the test set
- SMAE (float) – Standard deviation of the absolute error made by the final iteration of the method on the test set
- RMSE (float) – Root mean squared error made by the final iteration of the method on the test set
- index (list) – List containing the indexes of all the selected training points
- total_time (float) – Execution time in seconds
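For reference, the three error values returned by the sampling methods can be computed from scalar predictions as in the sketch below, shown together with a plain random draw of training indexes. The helper `error_summary` is hypothetical, and the use of the population standard deviation is an assumption; for forces, the per-point error would be a vector norm rather than a scalar difference.

```python
import math
import random
import statistics


def error_summary(predicted, reference):
    """Return (MAE, SMAE, RMSE) for two equally long sequences of scalars."""
    abs_err = [abs(p - r) for p, r in zip(predicted, reference)]
    mae = statistics.fmean(abs_err)
    smae = statistics.pstdev(abs_err)  # population standard deviation (assumed convention)
    rmse = math.sqrt(statistics.fmean([e ** 2 for e in abs_err]))
    return mae, smae, rmse


# Random subsampling: draw ntrain distinct indexes out of the training set.
index = random.Random(0).sample(range(1000), 50)

# Absolute errors here are 0, 1 and 2.
mae, smae, rmse = error_summary([1.0, 2.0, 4.0], [1.0, 1.0, 2.0])
```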
-
rvm(method='2b', batchsize=1000)
Relevance vector machine sampling. This method trains a 2-, 3- or many-body kernel on the energies of the partitioned training dataset. The algorithm starts from a dataset containing batchsize training configurations extracted at random from the whole dataset. Subsequently, an RVM method is called and a variable number of configurations is selected. These are then included in the next batch, and the operation is repeated until every point in the training dataset has been included at least once. The function then returns the indexes of the points returned by the last call of the RVM method.
Parameters: - method (str) – 2b or 3b, specifies which energy kernel to use to calculate the Gram matrix
- batchsize (int) – number of training points to include in each iteration of the Gram matrix calculation
Returns: - MAE (float) – Mean absolute error made by the final iteration of the method on the test set
- SMAE (float) – Standard deviation of the absolute error made by the final iteration of the method on the test set
- RMSE (float) – Root mean squared error made by the final iteration of the method on the test set
- index (list) – List containing the indexes of all the selected training points
- total_time (float) – Execution time in seconds
-
test_forces(index, method='2b', sig_2b=0.2, sig_3b=0.8, noise=0.001)
Random subsampling of training points from the larger training dataset.
Parameters: - method (str) – 2b or 3b, specifies which energy kernel to use to calculate the Gram matrix
- ntrain (int) – Number of points to include in the final dataset.
- error_metric (str) – specifies whether the final error is calculated on energies or on forces
Returns: - MAE (float) – Mean absolute error made by the final iteration of the method on the test set
- SMAE (float) – Standard deviation of the absolute error made by the final iteration of the method on the test set
- RMSE (float) – Root mean squared error made by the final iteration of the method on the test set
- index (list) – List containing the indexes of all the selected training points
- total_time (float) – Execution time in seconds
-
train_test_split(confs, forces=None, energies=None, ntest=10)
Splits the data into a training and a test set: the test set is extracted at random and the remaining dataset is treated as the training set (from which we then subsample using the various methods).
Parameters: - confs (array or list) – List of the configurations as M*5 arrays
- energies (array) – Local atomic energies, one per configuration
- forces (array) – Forces acting on the central atoms of confs, one per configuration
- ntest (int) – Number of test points, if None, every point that is not a training point will be used as a test point
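The split itself amounts to drawing ntest indexes at random and keeping the rest for training. A minimal sketch follows; the helper `split_indices` and its `seed` argument are hypothetical and only illustrate the idea, not the mff implementation.

```python
import random


def split_indices(n_total, ntest, seed=0):
    """Return (train_idx, test_idx): ntest indexes drawn at random, the rest for training."""
    if not 0 <= ntest <= n_total:
        raise ValueError("ntest must be between 0 and n_total")
    rng = random.Random(seed)
    idx = list(range(n_total))
    rng.shuffle(idx)
    return sorted(idx[ntest:]), sorted(idx[:ntest])


train_idx, test_idx = split_indices(n_total=1000, ntest=200)
```

The same index lists can then be used to slice the configurations, forces and energies consistently.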