dsframework.impl.data_pipelines

class CalcLags(columns: Optional[list], periods: Optional[list] = None, remove_first_rows: bool = False, group_by_column: bool = False, suffix: str = '_LAG_')

Class that calculates lag values for selected columns. Note: the first max(periods) rows of the dataframe do not have a complete set of lagged values; they are removed when remove_first_rows is True.

Parameters
  • columns (list) – List of columns for which lags are calculated.

  • periods (list, optional) – Lag values to use. When remove_first_rows is True, max(periods) determines how many leading rows are removed from the dataframe. Default: [1]

  • suffix (str, optional) – The names of columns with calculated lag values have a suffix of the form f"{suffix}{value_of_lag}". Default: "_LAG_"

  • remove_first_rows (bool, optional) – If True, remove the first max(periods) rows from the dataframe. Default: False
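The parameters above can be illustrated with a minimal pandas sketch. This is not the dsframework implementation: the helper `add_lags` and the sample data are hypothetical, and the sketch assumes that the new column name is the original name followed by f"{suffix}{value_of_lag}".

```python
import pandas as pd


def add_lags(df, columns, periods=(1,), suffix="_LAG_", remove_first_rows=False):
    """Hypothetical helper (not part of dsframework): add lagged copies of columns."""
    out = df.copy()
    for col in columns:
        for p in periods:
            # Assumed naming: original column name plus f"{suffix}{value_of_lag}".
            out[f"{col}{suffix}{p}"] = out[col].shift(p)
    if remove_first_rows:
        # The first max(periods) rows lack at least one lagged value.
        out = out.iloc[max(periods):]
    return out


prices = pd.DataFrame({"CLOSE": [10.0, 11.0, 12.0, 13.0, 14.0]})
lagged = add_lags(prices, ["CLOSE"], periods=[1, 2], remove_first_rows=True)
print(lagged)
```

With remove_first_rows=False, the same call would keep all five rows and leave NaN in the lag columns of the first two.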

transform_ot(src: onetick.py.core.source.Source) → onetick.py.core.source.Source

Calculates lags for the given columns inside OneTick pipeline.

Parameters
  • src (otp.Source) – Source to calculate lags for.

  • columns (list) – List of columns to calculate lags for.

Returns

Source with calculated lags in new columns.

Return type

otp.Source

class SelectFeatures(columns=None, override=False)

Class that selects the specified columns as features.

Parameters
  • columns (list) – List of columns that are defined as features.

  • override (bool) – If True, override the existing list of feature columns. If False, add the columns to the existing list of feature columns.

class SelectTargets(columns, override: bool = False, shift: bool = False)

Class that selects the specified columns as targets.

Parameters
  • columns (list) – List of columns that are defined as targets.

  • override (bool) – If True, override the existing list of target columns. If False, add the columns to the existing list of target columns.

  • shift (bool) – Do not use yet. If True, shift the resulting target columns by one row, so that each row holds the next value of the target column. This is useful when predicting the next value of a time series while using the current value as a feature. Default: False.
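The effect described for shift can be sketched in plain pandas. This only illustrates the idea of a "next value" target and assumes it corresponds to shifting the column up by one row; the column names are hypothetical and the library's actual behaviour is not confirmed here.

```python
import pandas as pd

df = pd.DataFrame({"CLOSE": [10.0, 11.0, 12.0, 13.0]})

# Assumption: the "next value" target is the same column shifted up by one row.
df["TARGET"] = df["CLOSE"].shift(-1)

# The last row has no next value, so its target is NaN.
print(df)
```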

class OneTickBarsDatafeed(**kwargs)

OneTick datafeed with bars (Open, High, Low, Close, Volume, Trade Count).

Parameters
  • db (str) – Name of the database to use. Default: ‘NYSE_TAQ_BARS’.

  • tick_type (str) – Tick type to load. Default: ‘TRD_1M’.

  • symbols (List[str]) – List of symbols to load. Default: [‘AAPL’].

  • start (otp.datetime) – Start datetime. Default: datetime(2022, 3, 1, 9, 30)

  • end (otp.datetime) – End datetime. Default: datetime(2022, 3, 10, 16, 0)

  • bucket (int) – Bucket size used to aggregate data (timeframe). Default: 600.

  • bucket_time (str) – Bucket time to use: start or end. Default: start.

  • timezone (str) – Timezone to use. Default: ‘EST5EDT’.

  • suffix (str) – Add a suffix to all columns of the result dataframe. Default: None.

  • apply_times_daily (bool) – Apply times daily to the data, skipping data outside of the specified times for all days. Default: True.

load(*args)

Main method used to load data.

Returns

result – Loaded data

Return type

pd.DataFrame

class LimitOutliers(*args, **kwargs)

Data preprocessing class that limits outliers by using standard deviations. The maximum and minimum allowable values are calculated using the formula: mean ± std_num * std, where mean and std are the mean value and standard deviation calculated on the training set.

Parameters
  • columns (list) – List of columns to which preprocessing will be applied

  • std_num (float) – The number of standard deviations used to limit outliers. Default: 3
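The formula mean ± std_num * std can be sketched as a fit step on the training set followed by a clipping step. The helpers `fit_limits` and `cap_outliers` are hypothetical, and the use of the pandas sample standard deviation (ddof=1) is an assumption; the library may compute it differently.

```python
import pandas as pd


def fit_limits(train, columns, std_num=3.0):
    """Hypothetical helper: compute (lower, upper) bounds from the training set."""
    limits = {}
    for col in columns:
        mean = train[col].mean()
        std = train[col].std()  # assumption: sample standard deviation (ddof=1)
        limits[col] = (mean - std_num * std, mean + std_num * std)
    return limits


def cap_outliers(df, limits):
    """Hypothetical helper: clip each column to its fitted bounds."""
    out = df.copy()
    for col, (lower, upper) in limits.items():
        out[col] = out[col].clip(lower=lower, upper=upper)
    return out


train = pd.DataFrame({"VOLUME": [100.0, 110.0, 90.0, 105.0, 95.0]})
new_data = pd.DataFrame({"VOLUME": [100.0, 1000.0, -500.0]})

limits = fit_limits(train, ["VOLUME"], std_num=3)
capped = cap_outliers(new_data, limits)
print(limits)
print(capped)
```

Fitting the bounds on the training set only keeps information from later data out of the limits.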

fit_pandas(df: pandas.core.frame.DataFrame)

Fits selected preprocessor on data without transforming it.

Parameters

df (pd.DataFrame) – Dataframe to be preprocessed.

transform_pandas(df: pandas.core.frame.DataFrame)

Process data by limiting (capping) outliers.

Parameters
  • df (pd.DataFrame) – Dataframe to be preprocessed.

  • columns (list, optional) – List of columns to which preprocessing will be applied. If set to None, then all columns will be used. By default None.

class LaggedDifferences(*args, **kwargs)

Data preprocessing class that calculates the difference between the current and lag values (of time series).

Note: this preprocessing leaves NaN values in the first lag rows. Filter them out before training.

Parameters
  • columns (list) – List of columns to which preprocessing will be applied.

  • lag (int) – Value of the lag to be used. This equals the number of rows at the beginning of the dataframe that have no lagged value. Default: 39
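A minimal pandas sketch of the lagged difference and its reversal follows. It is an illustration under the assumption that the difference is taken against the value lag rows earlier; the series and the variable names are hypothetical.

```python
import pandas as pd

lag = 2
series = pd.Series([10.0, 12.0, 15.0, 19.0, 24.0], name="CLOSE")

# Value observed `lag` rows earlier; the first `lag` entries are NaN.
lagged = series.shift(lag)

# Forward step: current value minus lagged value.
diffs = series - lagged

# Reverse step: add the lagged values back to recover the original scale.
restored = diffs + lagged

print(diffs.tolist())
print(restored.tolist())
```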

transform_ot(src: onetick.py.core.source.Source)

Calculates lagged differences for the given columns.

Parameters
  • src (otp.Source) – Source to calculate lagged differences for.

  • columns (list, optional) – List of columns to calculate lagged differences for.

Returns

Source with calculated lagged differences in new columns.

Return type

otp.Source

fit_pandas(df: pandas.core.frame.DataFrame)

Preprocess data by calculating the difference between the current and lag values (of time series).

Parameters

df (pd.DataFrame) – Dataframe to be preprocessed.

inverse_transform(prediction_df: Optional[pandas.core.frame.DataFrame] = None) → pandas.core.frame.DataFrame

Reverse process data by adding the lagged values to the corresponding prediction values.

Parameters

prediction_df (pd.DataFrame) – Dataframe to be deprocessed.

Returns

Deprocessed dataframe with the same shape as the input dataframe or None if prediction_df is None.

Return type

pd.DataFrame

class MinMaxScaler(*args, **kwargs)

Data preprocessing class that scales data to a given range.

Parameters
  • columns (list, optional) – List of columns to which preprocessing will be applied. By default None.

  • transformed_range (tuple) – Desired range of transformed data. Default: (0, 1)
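Scaling to a given range and reversing it can be sketched with the standard min-max formula. The helpers `fit_min_max`, `scale` and `unscale` are hypothetical and are not the class's own methods; the sketch assumes per-column minimum and maximum values.

```python
import pandas as pd


def fit_min_max(df, columns):
    """Hypothetical helper: remember the per-column minimum and maximum."""
    return {col: (df[col].min(), df[col].max()) for col in columns}


def scale(df, stats, transformed_range=(0, 1)):
    """Map each column linearly from [min, max] to transformed_range."""
    lo, hi = transformed_range
    out = df.copy()
    for col, (cmin, cmax) in stats.items():
        # Note: a constant column (cmax == cmin) would divide by zero here.
        out[col] = (out[col] - cmin) / (cmax - cmin) * (hi - lo) + lo
    return out


def unscale(df, stats, transformed_range=(0, 1)):
    """Reverse the mapping performed by scale()."""
    lo, hi = transformed_range
    out = df.copy()
    for col, (cmin, cmax) in stats.items():
        out[col] = (out[col] - lo) / (hi - lo) * (cmax - cmin) + cmin
    return out


data = pd.DataFrame({"CLOSE": [10.0, 15.0, 20.0]})
stats = fit_min_max(data, ["CLOSE"])
scaled = scale(data, stats, transformed_range=(-1, 1))
restored = unscale(scaled, stats, transformed_range=(-1, 1))
print(scaled["CLOSE"].tolist())
print(restored["CLOSE"].tolist())
```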

inverse_transform(prediction_df: Optional[pandas.core.frame.DataFrame] = None) → pandas.core.frame.DataFrame

Deprocess data using sklearn preprocessor.

Parameters

prediction_df (pd.DataFrame) – Dataframe to be deprocessed.

Returns

Deprocessed dataframe.

Return type

pd.DataFrame

class ApplyLog(*args, **kwargs)

Data preprocessing class that applies a logarithm to the data.

Parameters
  • columns (list) – List of columns to which preprocessing will be applied

  • base (float) – Base of the logarithm. Default: math.e

  • suffix (str) – Suffix to be added to the column name after preprocessing. Default is “”, which means that no new column is created and the logarithm is applied to the original column in place.
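The base and suffix parameters can be illustrated with a short numpy/pandas sketch. The helpers `apply_log` and `invert_log` are hypothetical; the change-of-base formula log_b(x) = ln(x) / ln(b) is standard, but whether the library computes the logarithm this way is an assumption.

```python
import math

import numpy as np
import pandas as pd


def apply_log(df, columns, base=math.e, suffix=""):
    """Hypothetical helper: logarithm of the given base for each column."""
    out = df.copy()
    for col in columns:
        # With an empty suffix the original column is overwritten in place.
        out[f"{col}{suffix}"] = np.log(out[col]) / math.log(base)
    return out


def invert_log(df, columns, base=math.e):
    """Hypothetical helper: reverse the logarithm by exponentiation."""
    out = df.copy()
    for col in columns:
        out[col] = np.power(base, out[col])
    return out


volumes = pd.DataFrame({"VOLUME": [1.0, 10.0, 100.0]})
logged = apply_log(volumes, ["VOLUME"], base=10)
restored = invert_log(logged, ["VOLUME"], base=10)
print(logged["VOLUME"].tolist())
print(restored["VOLUME"].tolist())
```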

transform_pandas(df: pandas.core.frame.DataFrame)

Preprocess data by applying logarithm.

Parameters

df (pd.DataFrame) – Dataframe to be preprocessed.

inverse_transform(prediction_df: Optional[pandas.core.frame.DataFrame] = None) → pandas.core.frame.DataFrame

Reverse process data by applying the exponential function.

Parameters

prediction_df (pd.DataFrame, optional) – Dataframe with predictions, by default None

Returns

Reverse processed dataframe with predictions or None if prediction_df is None.

Return type

pd.DataFrame or None

class PercentageSplitter(val_size=0, test_size=0.15, shuffle=False)

Class for splitting data into X (features) and Y (target) sets, as well as into train-test-validate subsets (samples are determined using percentage sizes for the validation and test subsets).

Parameters
  • val_size (float) – The size of the validation subset. Default: 0

  • test_size (float) – The size of the test subset. Default: 0.15

  • shuffle (bool) – Whether or not to shuffle the data before splitting. Default: False
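A percentage-based split can be sketched as follows. The helper `percentage_split` is hypothetical, as is its `random_state` argument. The sketch assumes that the sizes are fractions of the row count and that the subsets are taken in the order train, validation, test, with the test subset at the end; the library may order or round them differently.

```python
import pandas as pd


def percentage_split(df, val_size=0.0, test_size=0.15, shuffle=False, random_state=None):
    """Hypothetical helper: split rows into train, validation and test subsets."""
    if shuffle:
        # Shuffle all rows before splitting.
        df = df.sample(frac=1.0, random_state=random_state)
    n = len(df)
    n_test = int(round(n * test_size))
    n_val = int(round(n * val_size))
    n_train = n - n_val - n_test
    train = df.iloc[:n_train]
    val = df.iloc[n_train:n_train + n_val]
    test = df.iloc[n_train + n_val:]
    return train, val, test


df = pd.DataFrame({"X": range(20)})
train, val, test = percentage_split(df, val_size=0.1, test_size=0.2)
print(len(train), len(val), len(test))
```

Leaving shuffle at False keeps the row order, which matters for time series where the test subset should follow the training subset in time.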

class IndexSplitter(val_indexes=None, test_indexes=None)

Class for splitting data into X (features) and Y (target) sets, as well as into train-test-validate subsets (samples are determined using indexes for validation and test subsets).

Parameters
  • val_indexes (list) – The indexes of the validation subset. Default: []

  • test_indexes (list) – The indexes of the test subset. Default: []
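An index-based split can be sketched in plain pandas. The sketch assumes that the indexes are index labels of the dataframe and that the training subset is everything not listed; both points are assumptions, not confirmed behaviour of the class.

```python
import pandas as pd

df = pd.DataFrame({"X": range(10)})
val_indexes = [6, 7]
test_indexes = [8, 9]

# Select the validation and test rows by index label.
val = df.loc[val_indexes]
test = df.loc[test_indexes]

# Assumption: the training subset is the remainder.
train = df.drop(index=val_indexes + test_indexes)

print(len(train), len(val), len(test))
```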

class TimeSplitter(datetime_column='Time', val_time_range=(), test_time_range=())

Class for splitting data into X (features) and Y (target) sets, as well as into train-test-validate subsets (samples are determined using time ranges for validation and test subsets).

Parameters
  • datetime_column (str) – Name of the column that stores the datetime values. Set to ‘’ to use the index. Default: ‘Time’

  • val_time_range (tuple) – Tuple of the start and end datetimes for the validation subset. Both the start and end times are included in the range. Default: ()

  • test_time_range (tuple) – Tuple of the start and end datetimes for the test subset. Both the start and end times are included in the range. Default: ()
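The inclusive time ranges can be sketched with boolean masks. The helpers `get_times` and `in_range` are hypothetical, and treating the training subset as the rows outside both ranges is an assumption.

```python
import pandas as pd


def get_times(df, datetime_column="Time"):
    """Hypothetical helper: use the named column, or the index if the name is empty."""
    return df[datetime_column] if datetime_column else df.index.to_series()


def in_range(times, time_range):
    """Boolean mask of rows whose time lies in [start, end], both ends included."""
    if not time_range:
        # An empty tuple selects nothing.
        return pd.Series(False, index=times.index)
    start, end = time_range
    return times.between(start, end)


df = pd.DataFrame({
    "Time": pd.date_range("2022-03-01", periods=10, freq="D"),
    "X": range(10),
})
val_range = (pd.Timestamp("2022-03-07"), pd.Timestamp("2022-03-08"))
test_range = (pd.Timestamp("2022-03-09"), pd.Timestamp("2022-03-10"))

times = get_times(df, "Time")
val_mask = in_range(times, val_range)
test_mask = in_range(times, test_range)

val = df[val_mask]
test = df[test_mask]
# Assumption: everything outside both ranges is training data.
train = df[~(val_mask | test_mask)]
print(len(train), len(val), len(test))
```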

class CSVDatafeed(**kwargs)

Datafeed class for loading data from CSV.

Parameters
  • params (dict) – Arguments passed directly to the pd.read_csv() function.

  • suffix (str) – Add a suffix to all columns of the result dataframe. Default: None
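The two parameters can be illustrated directly with pandas. This sketch does not call CSVDatafeed itself: it assumes that params is unpacked into pd.read_csv() and that the suffix is appended to every column name, and it uses an in-memory buffer with made-up data in place of a real file.

```python
import io

import pandas as pd

# Made-up CSV content held in memory instead of a file on disk.
csv_text = "Time,CLOSE\n2022-03-01,10.0\n2022-03-02,11.0\n"

# Keyword arguments forwarded as-is to pd.read_csv().
params = {"filepath_or_buffer": io.StringIO(csv_text), "parse_dates": ["Time"]}
loaded = pd.read_csv(**params)

# Assumption: the suffix is appended to every column name.
loaded = loaded.add_suffix("_AAPL")
print(loaded.columns.tolist())
```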

class RandomBarsDatafeed(bars=5000, bucket=600, candle_body_baseval=0.01, candle_shadow_baseval=0.005, random_state=None, group_column=None)

Datafeed with randomly generated bars (Open, High, Low, Close, Volume, Time).

Parameters
  • bars (int) – Number of bars to generate for each ticker. Default: 5000.

  • bucket (int) – Timeframe of bars to be used (specified in seconds). Default: 600

  • candle_body_baseval (float) – Parameter that determines the average body size of candlestick bars. Default: 0.01

  • candle_shadow_baseval (float) – Parameter that determines the average upper/lower shadow size of candlestick bars. Default: 0.005

  • random_state (int) – Random state (seed). Default None.

  • group_column (str, optional) – Column name to be used for grouping (tickers, etc.) Default None.

class IntradayAveraging(*args, **kwargs)

Data preprocessing class that calculates the difference between the current value and the average value of the same intraday interval over the past N days.

Note: this preprocessing will remove the first bins * window_days rows from the dataframe.

Parameters
  • columns (list) – List of columns to which preprocessing will be applied.

  • window_days (int) – Number of days for averaging. Default: 5

  • bins (int) – Number of intraday intervals. If None is specified, then it is determined automatically by the number of unique hh:mm buckets. Default: None

  • datetime_column (str) – Name of the column that stores the datetime values. Default: ‘Time’

  • suffix (str) – Suffix to be added to the column names. Default: ‘’
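The intraday averaging idea can be sketched with a grouped rolling mean. The helper `intraday_diff` and the sample data are hypothetical. The sketch assumes that buckets are identified by their hh:mm time and that the average covers the previous window_days days, excluding the current one; the library may define the window differently.

```python
import pandas as pd


def intraday_diff(df, column, window_days, datetime_column="Time"):
    """Hypothetical helper: value minus the past average of the same hh:mm bucket."""
    bucket = df[datetime_column].dt.strftime("%H:%M").rename("bucket")
    # For each bucket, average the values of the previous `window_days` days.
    past_avg = df.groupby(bucket)[column].transform(
        lambda s: s.shift(1).rolling(window_days).mean()
    )
    return df[column] - past_avg, past_avg


df = pd.DataFrame({
    "Time": pd.to_datetime([
        "2022-03-01 09:30", "2022-03-01 10:30",
        "2022-03-02 09:30", "2022-03-02 10:30",
        "2022-03-03 09:30", "2022-03-03 10:30",
    ]),
    "VOLUME": [10.0, 20.0, 12.0, 22.0, 17.0, 30.0],
})

diffs, past_avg = intraday_diff(df, "VOLUME", window_days=2)

# Reverse step: add the bucket averages back to the differences.
restored = diffs + past_avg
print(diffs.tolist())
```

With 2 buckets per day and window_days=2, the first 4 rows have no complete history and come out as NaN, which matches the bins * window_days rows mentioned in the note.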

fit_pandas(df: pandas.core.frame.DataFrame)

Preprocess (timeseries) data by calculating the difference between the current value and the average value of the same intraday interval in the past period of time.

Parameters

df (pd.DataFrame) – Dataframe to be preprocessed.

inverse_transform(prediction_df: Optional[pandas.core.frame.DataFrame] = None) → pandas.core.frame.DataFrame

Reverse process data by adding the average value of the intraday intervals to the corresponding prediction values.

Parameters

prediction_df (pd.DataFrame) – Dataframe to be deprocessed.

Returns

Reverse processed dataframe with the same shape as the input dataframe or None if prediction_df is None.

Return type

pd.DataFrame

class GroupByColumn(*args, **kwargs)

get_init_params()

Override get_init_params in order to replace GroupByColumn with nested preprocessor and add group_by_column to its dict.

inverse_transform(prediction_df: Optional[pandas.core.frame.DataFrame] = None) → pandas.core.frame.DataFrame

Reverse process prediction dataframe.

Parameters

prediction_df (pd.DataFrame, optional) – Prediction dataframe to be deprocessed, by default None

Returns

Reverse processed prediction dataframe (if deprocessable)

Return type

pd.DataFrame

class SKLearnPreprocessor(*args, **kwargs)

Data preprocessing class that uses an sklearn preprocessor. Only preprocessors that do not change the number of columns can be used.

Parameters
  • preprocessor_class (sklearn.preprocessing class) – Class of sklearn preprocessor to be used.

  • columns (list, str) – List of columns to which preprocessing will be applied. String “__all__” means that all columns will be used. Default is “__all__”.

  • fit_on_train (bool) – If True, then preprocessor will be fitted only on train data. Default is False.

  • params (dict) – Keyword arguments to be passed to sklearn preprocessor constructor.

  • suffix (str) – Suffix to be added to column name after preprocessing. Default is None, which means that suffix will be equal to preprocessor class name in upper case.

fit_pandas(df: pandas.core.frame.DataFrame)

Fits selected preprocessor on data without transforming it.

Parameters
  • df (pd.DataFrame) – Dataframe to be preprocessed.

  • columns (list, optional) – List of columns to which preprocessing will be applied. If set to None, then all columns will be used. By default None.

transform_pandas(df: pandas.core.frame.DataFrame)

Transform data using selected sklearn preprocessor.

Parameters
  • df (pd.DataFrame) – Dataframe to be preprocessed.

  • columns (list, optional) – List of columns to which preprocessing will be applied. If set to None, then all columns will be used. By default None.

inverse_transform(prediction_df: Optional[pandas.core.frame.DataFrame] = None) → pandas.core.frame.DataFrame

Deprocess data using sklearn preprocessor.

Parameters

prediction_df (pd.DataFrame) – Dataframe to be deprocessed.

Returns

Deprocessed dataframe.

Return type

pd.DataFrame

class FilterValues(columns: Union[list, str] = '__all__', exclude_values: Union[list, str] = 'na', exclude_from_test_set_only: bool = False)