dsframework.impl.experiments#

class Experiment[source]#

Implements full cycle to run experiment.

serialize_config()[source]#

Dumps the current instance config to a dict that can be used to reconstruct the class with build_experiment_class().

get_data(datetime_column: Optional[str] = None) Union[pandas.core.frame.DataFrame, onetick.py.core.source.Source][source]#

Loads external data (e.g. from OneTick or CSV). If the experiment is configured to use several data feeds, they are merged by index, or by datetime_column if it is specified. Before the result is returned, reset_index is applied to the dataframe, because in the current version of the framework preprocessing relies on a numerical index rather than on a datetime field.

Parameters

datetime_column (str, optional) – The field by which the data is merged if the experiment contains several data feeds.

Returns

result – Dataframe with the data used in the subsequent stages of the experiment pipeline (the prepare_data() function). Important! reset_index() has been applied to the data (see the function description above). The result is also stored on the Experiment instance as experiment.df.

Return type

pd.DataFrame or otp.Source
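The merge-then-reset behaviour described above can be illustrated with plain pandas. This is only a sketch under stated assumptions: the two feeds, their column names, the index name "Time", and the inner join are hypothetical, and the framework's actual merge logic is not shown in this reference.

```python
import pandas as pd

# Two hypothetical data feeds that share a datetime index (illustrative data only).
timestamps = pd.DatetimeIndex(["2024-01-01", "2024-01-02", "2024-01-03"], name="Time")
prices = pd.DataFrame({"price": [10.0, 10.5, 11.0]}, index=timestamps)
volumes = pd.DataFrame({"volume": [100, 150, 120]}, index=timestamps)

# Merge the feeds by index (assumed here to be an inner join).
merged = prices.merge(volumes, left_index=True, right_index=True)

# reset_index() moves the datetime index into a regular column and
# leaves a numerical 0..n-1 index, which later preprocessing can rely on.
df = merged.reset_index()
print(df)
```

After this step the datetime values are still available as an ordinary column, while row positions are addressed by integers.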

prepare_data(src: Optional[Union[pandas.core.frame.DataFrame, onetick.py.core.source.Source]] = None)[source]#

Prepares the data. This includes the following steps: feature calculation, sample splitting, and data preprocessing.

Parameters
  • src (pd.DataFrame or otp.Source, optional) – Data to prepare for feeding into a machine learning model. If not specified, the data obtained at the previous step of the experiment is used (the experiment.df attribute set by get_data()).

Results

The results are stored in the following Experiment properties, which are pandas slices of the full dataframe:

experiment.x_train, experiment.x_val, experiment.x_test, experiment.y_train, experiment.y_val, experiment.y_test

In addition, the following attribute is available:

experiment.y_unprocessed - dataframe with the original (not preprocessed) target columns, used to calculate metrics

init_fit(train_params: Optional[dict] = None, remote_mode: bool = False)[source]#

Trains the models. This includes model initialization and model fitting. Overfitting control, validation (including cross-validation), and hyperparameter tuning are also implemented at this stage.

Parameters
  • train_params (dict, optional) –

    loss: str, optional

    One of {‘MAE’, ‘RMSE’, ‘MAPE’, ‘MSLE’}. The loss function that is used to train the model. Default value is ‘RMSE’. Important! The loss function can also be set in the init_params of the model, in which case it overrides the loss parameter.

    overfitting: dict, optional

    Parameters to limit the overfitting of the model during training. It includes:

    eval_metric: str

    One of {‘MAE’, ‘RMSE’, ‘MAPE’, ‘R2’, ‘MSLE’}. The metric calculated on the validation set to control overfitting (determining the best iteration of model training, early stopping of model training). Default value is ‘MAE’.

    early_stopping_rounds: int

    The maximum number of training iterations allowed without an improvement in prediction quality on the validation sample. If eval_metric does not improve for early_stopping_rounds iterations, training stops. If early_stopping_rounds is not set or equals zero, early stopping is disabled and the model is trained for the full number of iterations. Default value is 0.

    search_cv: dict, optional

    Validation and hyperparameter tuning settings. It includes:

    val_type: str

    The type of validation, one of {‘Simple’, ‘Cross’, ‘WalkForward’}. Default is ‘Simple’.

    folds: int

    The number of folds (used only for ‘Cross’ and ‘WalkForward’ validation). Default is 5.

    eval_metric: str

    One of {‘MAE’, ‘RMSE’, ‘MAPE’, ‘R2’, ‘MSLE’}. The metric calculated on validation samples in order to determine the best set of hyperparameters when tuning them. In addition, when using cross and walk-forward validation, metric calculation results can be obtained using the experiment.gscv_model.cv_results_ attribute (for example, to estimate the error variance). Default is ‘MAE’.

    search_optimization: str

    Hyperparameter search method, one of {‘grid’, ‘random’, ‘bayesian’, ‘bohb’, ‘hyperopt’, ‘none’}. If the value is ‘none’, only the first combination of parameters is used to train the model. Default is ‘grid’.

    n_trials: int

    Number of search iterations, used for ‘random’, ‘bayesian’, ‘bohb’, ‘hyperopt’ types. Default is 10.

  • remote_mode (bool, optional) – If True, joblib with the Ray backend is used to parallelize processes on remote Ray workers or a Ray cluster. If False, joblib with the threading backend is used to parallelize processes locally. Default value is False.

Returns

result – The trained model. In addition, the following attributes are set on the experiment:

experiment.current_model_params: dict

the best model parameters found as a result of tuning hyperparameters (when tuning is disabled, the first element from the list of values is selected for each hyperparameter)

experiment.gscv_model: GridSearchCV/RandomizedSearchCV

Scikit-learn GridSearchCV/RandomizedSearchCV model object

experiment.model_result: object

result of GridSearchCV/RandomizedSearchCV model fit() function

experiment.dsf_model: object

DS Framework model object, the model attribute of this object stores the original trained ML model (e.g. XGBoost or sklearn model).

Return type

dsframework.impl.models.RegressorModel
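As an illustration, a train_params dictionary assembled from the keys documented above might look like the sketch below. All keys and allowed values come from the parameter description; the specific non-default choices (early_stopping_rounds of 50, walk-forward validation, random search) are arbitrary examples, and the commented-out call assumes an already configured Experiment instance named experiment.

```python
# Allowed values as listed in the parameter description above.
LOSSES = {"MAE", "RMSE", "MAPE", "MSLE"}
EVAL_METRICS = {"MAE", "RMSE", "MAPE", "R2", "MSLE"}
VAL_TYPES = {"Simple", "Cross", "WalkForward"}
SEARCH_METHODS = {"grid", "random", "bayesian", "bohb", "hyperopt", "none"}

train_params = {
    "loss": "RMSE",                      # documented default
    "overfitting": {
        "eval_metric": "MAE",            # documented default
        "early_stopping_rounds": 50,     # example value; 0 (default) disables early stopping
    },
    "search_cv": {
        "val_type": "WalkForward",       # example value; default is 'Simple'
        "folds": 5,                      # used only for 'Cross' and 'WalkForward'
        "eval_metric": "MAE",            # documented default
        "search_optimization": "random", # example value; default is 'grid'
        "n_trials": 10,                  # used by every method except 'grid' and 'none'
    },
}

# Light sanity checks against the documented value sets.
assert train_params["loss"] in LOSSES
assert train_params["overfitting"]["eval_metric"] in EVAL_METRICS
assert train_params["search_cv"]["val_type"] in VAL_TYPES
assert train_params["search_cv"]["search_optimization"] in SEARCH_METHODS

# experiment.init_fit(train_params=train_params)  # assumes a configured Experiment instance
```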

predict(x=None, model=None, preproc_reverse=None)[source]#

Function of predicting results using a trained model.

Parameters
  • x (pd.DataFrame, optional) – Data containing the features on which the prediction is based. If not specified, the data obtained at the previous steps of the experiment is used: experiment.x_processed (see the prepare_data function above).

  • model (object (depends on the model used), optional) – Trained model. If not specified, the model obtained at the previous step of the experiment is used: experiment.dsf_model (see the init_fit function above).

  • preproc_reverse (bool, optional) – Enable/disable reverse data processing. If True and processing was applied to the target values at the data preparation stage, reverse processing is applied to the predicted values. Default value is True.

Returns

result – Model prediction data. In addition, the results obtained are stored in the attributes of the experiment class:

experiment.prediction - the original prediction of the model

experiment.prediction_reverse_processed - the prediction of the model after reverse processing; None when preproc_reverse = False

Return type

pd.DataFrame

calc_metrics(y=None, prediction=None, group_by_column: bool = False)[source]#

Calculates metrics for the prediction against the target data.

Parameters
  • y (pd.DataFrame, optional) – Target data. If not specified, the data obtained at the previous steps of the experiment is used: experiment.y_unprocessed if preproc_reverse = True, or experiment.y_processed if preproc_reverse = False (both are set by prepare_data; preproc_reverse is described under the predict function above). Only the indices present in the prediction are selected from the target data. If group_by_column is True, y must contain the column specified in Experiment.group_column_name.

  • prediction (pd.DataFrame, optional) – Model prediction. If not specified, the data obtained at the previous steps of the experiment is used: experiment.prediction_reverse_processed if preproc_reverse = True, or experiment.prediction if preproc_reverse = False (see the predict function above).

  • group_by_column (bool, optional) – Apply grouping by column during metric calculation. If True, each metric is calculated for each group separately and the resulting metrics are returned as a dictionary of dictionaries. If False, the data is not grouped. Default value is False.

Returns

result – Calculated metric values as a dict. If group_by_column is True, the result is a dictionary of dictionaries.

Return type

dict or dict of dict
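To make the "dictionary of dictionaries" shape concrete, here is a minimal pure-Python sketch of per-group metric calculation. It is not the framework's implementation: the helper names, the sample data, the choice of MAE and RMSE, and the group-then-metric nesting order are all assumptions made for illustration.

```python
from collections import defaultdict
from math import sqrt


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def grouped_metrics(groups, y_true, y_pred):
    """Compute each metric separately for every group label."""
    buckets = defaultdict(lambda: ([], []))
    for g, t, p in zip(groups, y_true, y_pred):
        buckets[g][0].append(t)
        buckets[g][1].append(p)
    # Outer key: group label; inner key: metric name (assumed nesting order).
    return {g: {"MAE": mae(t, p), "RMSE": rmse(t, p)} for g, (t, p) in buckets.items()}


groups = ["AAPL", "AAPL", "MSFT", "MSFT"]  # hypothetical group column values
y_true = [10.0, 12.0, 20.0, 22.0]
y_pred = [11.0, 11.0, 20.0, 24.0]
result = grouped_metrics(groups, y_true, y_pred)
print(result)
```

With grouping disabled, the same metrics would be computed once over all rows and returned as a flat dict.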

calc_baseline(y: Optional[pandas.core.frame.DataFrame] = None, group_by_column: bool = False) dict[source]#

Builds a baseline model and calculates its metrics. The baseline is the naive forecast: the predicted value of the time series at each step equals the previous actual value.

Parameters
  • y (pd.DataFrame, optional) – Target data. If not specified, the data obtained at the previous steps of the experiment is used: experiment.y_unprocessed if preproc_reverse = True, or experiment.y_processed if preproc_reverse = False (both are set by prepare_data; preproc_reverse is described under the predict function above). Only the indices present in the prediction are selected from the target data.

  • group_by_column (bool, optional) – Apply grouping by column. If True, each metric is calculated for each group separately and the resulting metrics are returned as a dictionary of dictionaries. If False, the data is not grouped. Default value is False.

Returns

result – Calculated metric values of the baseline model. If group_by_column is True, the result is a dictionary of dictionaries.

Return type

dict
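The baseline rule described above (each prediction equals the previous actual value) can be sketched in a few lines of plain Python. The function name, the sample series, and the use of MAE as the only metric are illustrative assumptions, not the framework's code.

```python
def naive_baseline(y):
    """Return naive predictions for y[1:], each equal to the previous actual value."""
    return y[:-1]


y = [100.0, 102.0, 101.0, 105.0]   # hypothetical target series
baseline_pred = naive_baseline(y)  # predictions for the 2nd, 3rd and 4th points
actual = y[1:]                     # the first point has no previous value to predict from

# Mean absolute error of the baseline.
baseline_mae = sum(abs(a - p) for a, p in zip(actual, baseline_pred)) / len(actual)
print(baseline_mae)
```

A trained model is only useful if its metrics beat this kind of baseline on the same data.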

prediction_intervals_onestep(y: Optional[pandas.core.frame.DataFrame] = None, prediction: Optional[pandas.core.frame.DataFrame] = None, z_value: float = 1.96) pandas.core.frame.DataFrame[source]#

Calculates one-step prediction intervals using the standard deviation of the residuals. See: https://otexts.com/fpp3/prediction-intervals.html#one-step-prediction-intervals

Parameters
  • y (pd.DataFrame, optional) – Target data. If not specified, the data obtained at the previous steps of the experiment is used: experiment.y_unprocessed if preproc_reverse = True, or experiment.y_processed if preproc_reverse = False (both are set by prepare_data; preproc_reverse is described under the predict function above). Only the indices present in the prediction are selected from the target data.

  • prediction (pd.DataFrame, optional) – Model prediction. If not specified, the data obtained at the previous steps of the experiment is used: experiment.prediction_reverse_processed if preproc_reverse = True, or experiment.prediction if preproc_reverse = False (see the predict function above).

  • z_value (float, optional) – z-value (standard normal quantile) that sets the width of the interval. Default value is 1.96, which corresponds to a 95% interval under normally distributed residuals.

Returns

result – Calculated prediction intervals.

Return type

pd.DataFrame
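The interval construction can be sketched as prediction ± z_value · σ, where σ is the standard deviation of the residuals. The example below is a rough illustration only: the function name and data are hypothetical, and using the sample standard deviation from the statistics module is an assumption (the referenced text uses a degrees-of-freedom-adjusted estimate, and the framework's exact estimator is not documented here).

```python
from statistics import stdev


def one_step_intervals(y, prediction, z_value=1.96):
    """Return (lower, upper) bounds for each prediction from the residual spread."""
    residuals = [actual - pred for actual, pred in zip(y, prediction)]
    sigma = stdev(residuals)  # sample standard deviation (assumed estimator)
    return [(p - z_value * sigma, p + z_value * sigma) for p in prediction]


y = [10.0, 12.0, 11.0, 13.0]           # hypothetical actual values
prediction = [11.0, 11.0, 12.0, 12.0]  # hypothetical model output
intervals = one_step_intervals(y, prediction)
for pred, (lower, upper) in zip(prediction, intervals):
    print(f"{pred:.2f}: [{lower:.2f}, {upper:.2f}]")
```

Every interval has the same width because a single σ, estimated from all residuals, is applied to each one-step prediction.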

save_pyfunc_model(path: str)[source]#

Saves the model as an MLflow pyfunc model.

Parameters

path (str) – Path where the model is saved.

save_model(*args, **kwargs)[source]#

Saves Experiment.dsf_model to disk using the native save_model() of the underlying ML library. Any args and kwargs accepted by the native ML library can be passed.

load_model(dsf_model_class, *args, **kwargs)[source]#

Initializes and loads a model from disk using the native load_model() of the underlying ML library. Any args and kwargs accepted by the native ML library can be passed. The loaded model is set as the dsf_model attribute of the current experiment instance.

Returns

loaded model instance

Return type

any

run(x=None, y=None, remote_mode=False)[source]#

Runs the full experiment cycle, which consists of 3 stages:

  1. Data load and preprocessing

  2. Model training

  3. Evaluation

The trained model can be found in the Experiment.dsf_model attribute. The method returns a tuple containing the metrics and predictions calculated on the test set if test_size is greater than 0, or on the whole train set otherwise.

Parameters
  • x (pd.DataFrame, optional) – Input data, by default None

  • y (pd.DataFrame, optional) – Target data, by default None

  • remote_mode (bool, optional) – Flag indicating whether the experiment should be run remotely on Ray, by default False

Returns

metrics, predictions

Return type

tuple

save_mlflow_run()[source]#

Saves the last run to MLflow and returns the run ID.

Returns

MLflow run ID.

Return type

str

load_mlflow_model(run_id: str)[source]#

Loads a model from the MLflow Registry from the run with the specified run_id. The resulting model is set as the dsf_model attribute of the current experiment instance.

Returns

loaded model instance

Return type

model