| Title: | Data Splitting Algorithms for Model Developments |
|---|---|
| Description: | Providing six different algorithms that can be used to split the available data into training, test and validation subsets with similar distribution for hydrological model developments. The dataSplit() function will help you divide the data according to specific requirements, and you can refer to the par.default() function to set the parameters for data splitting. The getAUC() function will help you measure the similarity of distribution features between the data subsets. For more information about the data splitting algorithms, please refer to: Chen et al. (2022) <doi:10.1016/j.jhydrol.2022.128340>, Zheng et al. (2022) <doi:10.1029/2021WR031818>. |
| Authors: | Feifei Zheng [aut, ths], Junyi Chen [aut, cre] |
| Maintainer: | Junyi Chen <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 1.0.2 |
| Built: | 2026-06-06 09:39:35 UTC |
| Source: | https://github.com/lark-max/dsam |
Built-in function: This function takes four arguments. The first contains the information of the original dataset as well as the three subsets, and the remaining three arguments are the maximum sample sizes for the training, test and validation subsets respectively.
checkFull(split.info, num.train, num.test, num.valid)

| Argument | Description |
|---|---|
| split.info | List type, which contains the original data set, three sampling subsets, termination signal and other relevant sampling information. |
| num.train | The number of training data points specified by the user. |
| num.test | The number of test data points specified by the user. |
| num.valid | The number of validation data points specified by the user. |

A list with sampling information.
'DSAM' interface function: The user needs to provide a parameter list before data-splitting.
These parameters have default values, with details given in the par.default function.
Conditioned on the parameter list, this function carries out the data-splitting based on the algorithm specified by the user.
The available algorithms include the traditional time-consecutive method (TIMECON), DUPLEX, MDUPLEX, SOMPLEX, SBSS.P and SS.
The algorithm details can be found in Chen et al. (2022). Note that this package deals with datasets that have multiple inputs but one output,
where this output is used to enable the application of various data-splitting algorithms.
dataSplit(data, control = list(), ...)

| Argument | Description |
|---|---|
| data | The dataset should be a matrix or data.frame, formatted as follows: Column 1 is a subscript vector used to mark each data point (each row is considered one data point); Columns 2 to N-1 are the input data, and Column N is the output data. |
| control | User-defined parameter list, where each parameter definition refers to the par.default function. |
| ... | A redundant argument list. |

Return the training, test and validation subsets. If the original data are required to be split into two subsets, the training and test subsets can be combined into a single calibration subset.
Feifei Zheng [email protected]
Junyi Chen [email protected]
Chen, J., Zheng, F., May, R., Guo, D., Gupta, H., and Maier, H. R. (2022). Improved data splitting methods for data-driven hydrological model development based on a large number of catchment samples. Journal of Hydrology, 613.
Zheng, F., Chen, J., Maier, H. R., and Gupta, H. (2022). Achieving Robust and Transferable Performance for Conservation‐Based Models of Dynamical Physical Systems. Water Resources Research, 58(5).
Zheng, F., Chen, J., Ma, Y., Chen, Q., Maier, H. R., and Gupta, H. (2023). A Robust Strategy to Account for Data Sampling Variability in the Development of Hydrological Models. Water Resources Research, 59(3).
data("DSAM_test_smallData")
res.sml = dataSplit(DSAM_test_smallData)
data("DSAM_test_modData")
res.mod = dataSplit(DSAM_test_modData, list(sel.alg = "SBSS.P"))
data("DSAM_test_largeData")
res.lag = dataSplit(DSAM_test_largeData, list(sel.alg = "SOMPLEX"))
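The data layout that dataSplit expects can be illustrated with a small synthetic dataset. This is a minimal sketch: the column names `I`, `X1`, `X2`, `Y` and the check that subscripts are unique are assumptions made for illustration, not requirements stated by the package.

```r
# Build a toy dataset in the documented layout:
# column 1 = subscript, columns 2..(N-1) = inputs, column N = output.
set.seed(1)
n <- 50
toy <- data.frame(
  I  = seq_len(n),   # subscript marking each data point
  X1 = runif(n),     # hypothetical input (e.g. rainfall)
  X2 = runif(n),     # hypothetical input
  Y  = runif(n)      # hypothetical output (e.g. runoff)
)

# Basic layout checks before handing the data to a splitting routine.
stopifnot(is.data.frame(toy) || is.matrix(toy))
stopifnot(ncol(toy) >= 3)              # subscript + at least one input + output
stopifnot(!anyDuplicated(toy[[1]]))    # assumed: subscripts are unique

# With the package installed, the call would then look like:
# res <- dataSplit(toy, list(sel.alg = "SS"))
```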
Built-in function: The initial sampling function of DUPLEX algorithm, aimed to obtain the two data points with the farthest Euclidean distance from the original data set and assign them to the corresponding sampling subset.
DP.initialSample(split.info, choice)

| Argument | Description |
|---|---|
| split.info | A list containing relevant sampling information such as the original dataset and three sample subsets. |
| choice | The variable must be one name of the three sample subsets contained in split.info, according to which the function assigns the current two data points to the specific sampling subset. |

Return the training, test and validation subsets. If the original data are required to be split into two subsets, the training and test subsets can be combined into a single calibration subset.
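The initial step described above, finding the two points with the largest Euclidean distance, can be sketched in base R. The helper name `farthest_pair` is hypothetical and this is not the package's internal code; it only illustrates the distance criterion.

```r
# Return the row indices of the two points that are farthest apart.
farthest_pair <- function(x) {
  d <- as.matrix(dist(x))                          # pairwise Euclidean distances
  idx <- which(d == max(d), arr.ind = TRUE)[1, ]   # first maximal (row, col) pair
  sort(unname(idx))
}

pts <- rbind(c(0, 0), c(1, 1), c(5, 5), c(2, 2))
farthest_pair(pts)   # rows 1 and 3 are the most distant pair
```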
Built-in function: The cyclic sampling function of DUPLEX algorithm that takes the two data points farthest from the current sampling set and assigns them to the corresponding sampling subset.
DP.reSample(split.info, choice)

| Argument | Description |
|---|---|
| split.info | A list containing relevant sampling information such as the original dataset and three sample subsets. |
| choice | The variable must be one name of the three sample subsets contained in split.info, according to which the function assigns the current two data points to the specific sampling subset. |

Return the training, test and validation subsets. If the original data are required to be split into two subsets, the training and test subsets can be combined into a single calibration subset.
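The cyclic step can be sketched in the same way. The sketch below assumes that "farthest from the current sampling set" means the largest distance to the nearest already-selected point, and that both points are chosen in one pass; the package may define this differently. The helper name `farthest_from_set` is hypothetical.

```r
# Pick the k remaining points that are farthest from the selected set.
farthest_from_set <- function(x, selected, k = 2) {
  remaining <- setdiff(seq_len(nrow(x)), selected)
  d <- as.matrix(dist(x))
  # Distance of each remaining point to its nearest already-selected point.
  min_d <- apply(d[remaining, selected, drop = FALSE], 1, min)
  remaining[order(min_d, decreasing = TRUE)][seq_len(min(k, length(remaining)))]
}

pts <- rbind(c(0, 0), c(1, 0), c(4, 0), c(10, 0), c(6, 0))
farthest_from_set(pts, selected = c(1, 2))   # rows 4 and 5
```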
A large dataset containing the rainfall and runoff time series used for testing data splitting algorithms
DSAM_test_largeData
A data frame with 3650 rows and 5 variables
Data subscript that marks the position of each data point
input vectors
input vectors
input vectors
The output vector, usually the runoff
...
A moderate dataset containing the rainfall and runoff time series used for testing data splitting algorithms
DSAM_test_modData
A data frame with 1000 rows and 5 variables
Data subscript that marks the position of each data point
input vectors
input vectors
input vectors
The output vector, usually the runoff
...
A small dataset containing the rainfall and runoff time series used for testing data splitting algorithms
DSAM_test_smallData
A data frame with 200 rows and 5 variables
Data subscript that marks the position of each data point
input vectors
input vectors
input vectors
The output vector, usually the runoff
...
The deterministic DUPLEX algorithm, with details given in Chen et al. (2022).
DUPLEX(data, control)

| Argument | Description |
|---|---|
| data | The dataset should be a matrix or data.frame, formatted as follows: Column 1 is a subscript vector used to mark each data point (each row is considered one data point); Columns 2 to N-1 are the input data, and Column N is the output data. |
| control | User-defined parameter list, where each parameter definition refers to the par.default function. |

Return the training, test and validation subsets. If the original data are required to be split into two subsets, the training and test subsets can be combined into a single calibration subset.
This function calls xgboost to train a classifier and then calculates the similarity between the two given datasets. The return value is an AUC index between 0 and 1; the closer the AUC is to 0.5, the more similar the two datasets are.
getAUC(data1, data2)

| Argument | Description |
|---|---|
| data1 | Dataset 1, the data type must be numeric, matrix or data.frame. |
| data2 | Dataset 2, the data type must be numeric, matrix or data.frame. |

Return the AUC value.
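The idea behind an AUC-based similarity measure can be sketched without the package. Note that this sketch swaps the xgboost classifier for a plain logistic regression (`glm`) and handles numeric vectors only; the helper names `auc_rank` and `similarity_auc` are hypothetical. A classifier is trained to tell the two datasets apart, and an AUC near 0.5 means it cannot do so.

```r
# Rank-based (Mann-Whitney) AUC for scores of class-1 versus class-0 cases.
auc_rank <- function(score, label) {
  r  <- rank(score)
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Label data1 as 0 and data2 as 1, fit a classifier, score it in-sample.
similarity_auc <- function(data1, data2) {
  d <- data.frame(x = c(data1, data2),
                  label = rep(c(0, 1), c(length(data1), length(data2))))
  fit <- suppressWarnings(glm(label ~ x, family = binomial(), data = d))
  auc_rank(fitted(fit), d$label)
}

set.seed(1)
a     <- rnorm(200)
b     <- rnorm(200)            # same distribution as a
c_far <- rnorm(200, mean = 3)  # clearly shifted distribution
auc_same <- similarity_auc(a, b)      # expected to be close to 0.5
auc_far  <- similarity_auc(a, c_far)  # expected to be close to 1
```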
This function returns the maximum of the runoff (output column).
getMax(data)

| Argument | Description |
|---|---|
| data | The original data set, the data type must be numeric, matrix or data.frame. |

Return the maximum value of the output column.
This function returns the mean and standard deviation of the runoff (output column).
getMean(data)

| Argument | Description |
|---|---|
| data | The original data set, the data type must be numeric, matrix or data.frame. |

Return a list with mean value and standard deviation.
This function returns the minimum of the runoff (output column).
getMin(data)

| Argument | Description |
|---|---|
| data | The original data set, the data type must be numeric, matrix or data.frame. |

Return the minimum value of the output column.
Built-in function: Calculates the maximum number of samples of each subset in each neuron within the SOM network based on the sampling ratio specified by the user.
getSnen(som.info, control)

| Argument | Description |
|---|---|
| som.info | The list contains information about the SOM network, including the total number of neurons, the number of rows, and the set of data points within each neuron. |
| control | User-defined parameter list, where each parameter definition refers to the par.default function. |

This function returns a list containing three vectors, Tr, Ts and Vd, each with the same length as the number of neurons. The Tr, Ts and Vd vectors record the number of data points to be drawn from each neuron for the training, test and validation subsets respectively.
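A per-neuron allocation of this kind can be sketched as follows. The neuron sizes, the variable names `prop_tr` and `prop_ts`, and the rounding rule are all assumptions for illustration; the package may round or cap the counts differently.

```r
# Hypothetical number of data points that fell into each of four neurons.
neuron_sizes <- c(10, 4, 0, 7)
prop_tr <- 0.6   # assumed training ratio
prop_ts <- 0.2   # assumed test ratio

# Allocate training and test counts per neuron by rounding the ratio,
# and give whatever remains in each neuron to validation.
alloc <- list(
  Tr = round(neuron_sizes * prop_tr),
  Ts = round(neuron_sizes * prop_ts)
)
alloc$Vd <- neuron_sizes - alloc$Tr - alloc$Ts
alloc
```

Each of the three vectors has one entry per neuron, matching the return structure described above.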
This is the modified DUPLEX (MDUPLEX) algorithm, which is also deterministic, with details given in Zheng et al. (2022).
MDUPLEX(data, control)

| Argument | Description |
|---|---|
| data | The dataset should be a matrix or data.frame, formatted as follows: Column 1 is a subscript vector used to mark each data point (each row is considered one data point); Columns 2 to N-1 are the input data, and Column N is the output data. |
| control | User-defined parameter list, where each parameter definition refers to the par.default function. |

Return the training, test and validation subsets. If the original data are required to be split into two subsets, the training and test subsets can be combined into a single calibration subset.
Chen, J., Zheng, F., May, R., Guo, D., Gupta, H., and Maier, H. R. (2022). Improved data splitting methods for data-driven hydrological model development based on a large number of catchment samples. Journal of Hydrology, 613.
Zheng, F., Chen, J., Maier, H. R., and Gupta, H. (2022). Achieving Robust and Transferable Performance for Conservation‐Based Models of Dynamical Physical Systems. Water Resources Research, 58(5).
The list of parameters needs to be set by the user, each with a default value.
Boolean variable that determines whether the input vectors should be included during the Euclidean distance calculation. The default is TRUE.
Random number seed. The default is 1000.
A string variable that represents the available data splitting algorithms including "SOMPLEX", "MDUPLEX", "DUPLEX", "SBSS.P", "SS" and "TIMECON". The default is "MDUPLEX".
The proportion of data allocated to the training subset, where the default is 0.6.
The proportion of data allocated to the test subset, where the default is 0.2.
A string variable representing the output file name for the training data subset. The default is "Train.txt".
A string variable representing the output file name for the test data subset. The default is "Test.txt".
A string variable representing the output file name for the validation data subset. The default is "Valid.txt".
Vector type: When sel.alg = "TIMECON", the program will select a continuous time-series data subset from the original data set, where the start and end positions are determined by this vector, with the first and the second value representing the start and end position in percentage of the original dataset. The default is c(0,0.6), implying that the algorithm selects the first 60% of the data from the original dataset.
Boolean variable that determines whether the data subsets need to be output or not. The default is FALSE.
Boolean variable that determines the level of user feedback. The default is FALSE.
par.default()
None
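How a user-supplied control list might override the defaults can be sketched with base R's `modifyList`. The `defaults` list below is a hypothetical subset that uses only parameter names quoted elsewhere in this manual (`include.inp`, `sel.alg`) with the default values listed above; whether dataSplit merges the lists in exactly this way is an assumption.

```r
# Hypothetical subset of the defaults described above.
defaults <- list(include.inp = TRUE, sel.alg = "MDUPLEX")

# The user only states what should differ from the defaults.
user <- list(sel.alg = "SOMPLEX")

# Entries in `user` replace the matching entries in `defaults`.
control <- modifyList(defaults, user)
str(control)
```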
SSsample
Built-in function: This function is used in the semi-deterministic SS algorithm. It takes two arguments, X and Y, both sorted in increasing order. All data points in X that do not appear in Y are recorded and returned.
remainUnsample(X, Y)

| Argument | Description |
|---|---|
| X | A vector that needs to be sampled. |
| Y | A vector with data samples from X. |

A vector containing the remaining data that are not in Y.
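The behaviour described above amounts to a set difference that preserves the order of X. A minimal base-R sketch, using the hypothetical helper name `remaining` rather than the package's own implementation:

```r
# Keep the elements of X that do not occur in Y, preserving the order of X.
remaining <- function(X, Y) X[!(X %in% Y)]

X <- 1:10
Y <- c(2, 5, 9)
remaining(X, Y)   # 1 3 4 6 7 8 10
```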
SBSS.P algorithm is a stochastic algorithm. It obtains data subsets through uniform sampling in each neuron after clustering through SOM neural network, with details given in May et al. (2010).
SBSS.P(data, control)

| Argument | Description |
|---|---|
| data | The dataset should be a matrix or data.frame, formatted as follows: Column 1 is a subscript vector used to mark each data point (each row is considered one data point); Columns 2 to N-1 are the input data, and Column N is the output data. |
| control | User-defined parameter list, where each parameter definition refers to the par.default function. |

Return the training, test and validation subsets. If the original data are required to be split into two subsets, the training and test subsets can be combined into a single calibration subset.
May, R. J., Maier H. R., and Dandy G. C.(2010), Data splitting for artificial neural networks using SOM-based stratified sampling, Neural Netw, 23(2), 283-294.
Built-in function: This function decides how to process the input dataset according to the parameter include.inp. If TRUE, this function removes Column 1 (the subscript vector) of the input dataset; otherwise, it returns only Column N (the output) of the dataset.
selectData(data, control)

| Argument | Description |
|---|---|
| data | The dataset should be a matrix or data.frame, formatted as follows: Column 1 is a subscript vector used to mark each data point (each row is considered one data point); Columns 2 to N-1 are the input data, and Column N is the output data. |
| control | User-defined parameter list, where each parameter definition refers to the par.default function. |

Returns a matrix for subsequent calculations.
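The column selection described above can be sketched in base R. The helper name `select_cols` is hypothetical, and it takes `include.inp` as a plain argument instead of reading it from a control list.

```r
# Drop the subscript column, or keep only the output column.
select_cols <- function(data, include.inp = TRUE) {
  data <- as.matrix(data)
  if (include.inp) {
    data[, -1, drop = FALSE]           # inputs and output, no subscript
  } else {
    data[, ncol(data), drop = FALSE]   # output column only
  }
}

m <- cbind(I = 1:5, X1 = runif(5), X2 = runif(5), Y = runif(5))
dim(select_cols(m, include.inp = TRUE))    # 5 rows, 3 columns
dim(select_cols(m, include.inp = FALSE))   # 5 rows, 1 column
```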
Built-in function: This function performs clustering for a given dataset by calling the som function from the "kohonen" package.
somCluster(data)

| Argument | Description |
|---|---|
| data | The dataset in matrix or data.frame, containing only input and output vectors, but with no subscript vector. |

Return a data list of clustering neurons in the SOM network.
The SOMPLEX algorithm is a stochastic algorithm, with details given in Chen et al. (2022) and Zheng et al. (2023).
SOMPLEX(data, control)

| Argument | Description |
|---|---|
| data | The dataset should be a matrix or data.frame, formatted as follows: Column 1 is a subscript vector used to mark each data point (each row is considered one data point); Columns 2 to N-1 are the input data, and Column N is the output data. |
| control | User-defined parameter list, where each parameter definition refers to the par.default function. |

Return the training, test and validation subsets. If the original data are required to be split into two subsets, the training and test subsets can be combined into a single calibration subset.
Chen, J., Zheng, F., May, R., Guo, D., Gupta, H., and Maier, H. R. (2022). Improved data splitting methods for data-driven hydrological model development based on a large number of catchment samples. Journal of Hydrology, 613.
The systematic stratified (SS) method is semi-deterministic, with details given in Zheng et al. (2018).
SS(data, control)

| Argument | Description |
|---|---|
| data | The dataset to be divided should be a matrix or data.frame, formatted as follows: Column 1 is a subscript vector used to mark each data point (each row is considered one data point); Columns 2 to N-1 are the input data, and Column N (the last) is the output data. |
| control | User-defined parameter list, where each parameter definition refers to the par.default function. |

Return the training, test and validation subsets. If the original data are required to be split into two subsets, the training and test subsets can be combined into a single calibration subset.
Zheng, F., Maier, H.R., Wu, W., Dandy, G.C., Gupta, H.V. and Zhang, T. (2018) On Lack of Robustness in Hydrological Model Development Due to Absence of Guidelines for Selecting Calibration and Evaluation Data: Demonstration for Data‐Driven Models. Water Resources Research 54(2), 1013-1030.
Built-in function: This function performs the SS algorithm.
SSsample(index, prop)

| Argument | Description |
|---|---|
| index | A vector of data-point subscripts, ordered so that the corresponding output values are sorted in ascending order. |
| prop | The sampling ratio, with the value ranging between 0 and 1. |

Return a vector containing the subscript of the sampled data points.
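Systematic sampling over output-sorted subscripts can be sketched as below. This is an illustration under assumptions, not the package's code: the helper name `ss_sample`, the random starting offset, and the use of `ceiling` to map positions to subscripts are all hypothetical choices.

```r
# Take roughly every (1/prop)-th subscript from an output-sorted index vector.
ss_sample <- function(index, prop) {
  n <- length(index)
  stopifnot(prop > 0, prop <= 1, n * prop >= 1)
  step  <- 1 / prop
  start <- runif(1, min = 0, max = step)   # random offset within the first interval
  pos   <- ceiling(seq(from = start, to = n, by = step))
  index[unique(pos[pos >= 1 & pos <= n])]
}

set.seed(42)
y <- runif(100)                 # hypothetical output values
sorted_index <- order(y)        # subscripts sorted by ascending output
picked <- ss_sample(sorted_index, prop = 0.2)
length(picked)                  # 20 of the 100 points
```

Because the positions are evenly spaced along the sorted output, the sample covers the full range of output values; only the starting offset is random.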
Built-in function: This function is used to standardize the data.
standardise(data)

| Argument | Description |
|---|---|
| data | The dataset should be of type matrix or data.frame and contain only the input and output vectors. |

Return a matrix with normalized data.
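One common form of standardization is the column-wise z-score, which base R provides through `scale`. Whether the package uses a z-score or another scaling (for example min-max) is not stated here, so the sketch below is an assumption for illustration only.

```r
# Toy matrix with inputs and output only (no subscript column).
m <- cbind(X1 = c(1, 2, 3, 4),
           X2 = c(10, 20, 30, 50),
           Y  = c(0.5, 0.1, 0.9, 0.3))

# Centre each column to mean 0 and scale it to standard deviation 1.
z <- scale(m)
round(colMeans(z), 10)    # all zero
apply(z, 2, sd)           # all one
```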
This function selects a time-consecutive block of data from the original data set as the calibration (training and test) subset, and the remaining data are taken as the evaluation subset.
TIMECON(data, control)

| Argument | Description |
|---|---|
| data | The dataset should be a matrix or data.frame, formatted as follows: Column 1 is a subscript vector used to mark each data point (each row is considered one data point); Columns 2 to N-1 are the input data, and Column N is the output data. |
| control | User-defined parameter list, where each parameter definition refers to the par.default function. |

Return the calibration and validation subsets.
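A time-consecutive split driven by a start and end position given as fractions of the record can be sketched as follows. The helper name `timecon_split`, the argument name `loc`, the rounding rule and the names of the returned list elements are assumptions for illustration.

```r
# Take rows between loc[1] and loc[2] (fractions of the record) as calibration.
timecon_split <- function(data, loc = c(0, 0.6)) {
  n <- nrow(data)
  first <- round(loc[1] * n) + 1
  last  <- round(loc[2] * n)
  calib_rows <- seq(first, last)
  list(Calibration = data[calib_rows, , drop = FALSE],
       Validation  = data[-calib_rows, , drop = FALSE])
}

toy <- data.frame(I = 1:10, X1 = runif(10), Y = runif(10))
parts <- timecon_split(toy, loc = c(0, 0.6))
nrow(parts$Calibration)   # first 6 rows
nrow(parts$Validation)    # remaining 4 rows
```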