dsframework.impl.data_pipelines#

class CalcLags(columns: Optional[list], periods: Optional[list] = None, remove_first_rows: bool = False, group_by_column: bool = False, suffix: str = '_LAG_')[source]#

Class that calculates lag values for the selected columns. Note: when remove_first_rows is True, this processing removes the first max(periods) rows from the dataframe, as there are no lagged values for them.

Parameters

columns (list) – List of columns for which lags are calculated.
periods (list, optional) – Values of the lag to be used. Maximum value of periods is used to remove the first rows from the dataframe. Default: [1]
suffix (str, optional) – The names of columns with calculated lag values have a suffix of the form f"{suffix}{value_of_lag}". Default: “_LAG_”
remove_first_rows (bool, optional) – If True, remove the first max(periods) rows from the dataframe. Default: False
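
The behaviour described above can be sketched in plain pandas. This is only an illustration of the idea, not the framework's implementation: the helper name add_lags and the sample data are hypothetical, and the sketch assumes lag columns are named by appending f"{suffix}{lag}" to the source column name.

```python
import pandas as pd


def add_lags(df, columns, periods=(1,), suffix="_LAG_", remove_first_rows=False):
    """Hypothetical helper: add lagged copies of `columns` to a copy of `df`."""
    out = df.copy()
    for col in columns:
        for lag in periods:
            # Column name follows the documented pattern f"{suffix}{value_of_lag}".
            out[f"{col}{suffix}{lag}"] = out[col].shift(lag)
    if remove_first_rows:
        # The first max(periods) rows lack at least one lagged value.
        out = out.iloc[max(periods):]
    return out


df = pd.DataFrame({"CLOSE": [10.0, 11.0, 12.0, 13.0, 14.0]})
lagged = add_lags(df, ["CLOSE"], periods=[1, 2], remove_first_rows=True)
print(lagged)
```

With remove_first_rows left at False, the same call keeps all five rows and the first rows of the lag columns contain NaN.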

transform_ot(src: onetick.py.core.source.Source) → onetick.py.core.source.Source[source]#

Calculates lags for the given columns inside a OneTick pipeline.

Parameters

src (otp.Source) – Source to calculate lags for.
columns (list) – List of columns to calculate lags for.

Returns

Source with calculated lags in new columns.

Return type

otp.Source

class SelectFeatures(columns=None, override=False)[source]#

Class that selects the specified columns as features.

Parameters

columns (list) – List of columns that are defined as features.
override (bool) – If True, override the existing list of feature columns. If False, add the columns to the existing list of feature columns.

class SelectTargets(columns, override: bool = False, shift: bool = False)[source]#

Class that selects the specified columns as targets.

Parameters

columns (list) – List of columns that are defined as targets.
override (bool) – If True, override the existing list of target columns. If False, add the columns to the existing list of target columns.
shift (bool) – Do not use yet. If True, shift the resulting target columns by one row, so that each row holds the next value of the target column. This is useful when predicting the next value of a time series while using the current value as a feature. Default: False.

class OneTickBarsDatafeed(**kwargs)[source]#

OneTick datafeed with bars (Open, High, Low, Close, Volume, Trade Count).

Parameters

db (str) – Name of the database to use. Default: ‘NYSE_TAQ_BARS’.
tick_type (str) – Tick type to load. Default: ‘TRD_1M’.
symbols (List[str]) – List of symbols to load. Default: [‘AAPL’].
start (otp.datetime) – Start datetime. Default: datetime(2022, 3, 1, 9, 30)
end (otp.datetime) – End datetime. Default: datetime(2022, 3, 10, 16, 0)
bucket (int) – Bucket size used to aggregate data (timeframe). Default: 600.
bucket_time (str) – Bucket time to use: start or end. Default: start.
timezone (str) – Timezone to use. Default: ‘EST5EDT’.
suffix (str) – Add a suffix to all columns of the result dataframe. Default: None.
apply_times_daily (bool) – Apply times daily to the data, skipping data outside of the specified times for all days. Default: True.

load(*args)[source]#

Main method used to load data.

Returns: result – Loaded data
Return type: pd.DataFrame

class LimitOutliers(*args, **kwargs)[source]#

Data preprocessing class that limits outliers by using standard deviations. The maximum and minimum allowable values are calculated using the formula: mean ± std_num * std, where mean and std are the mean value and standard deviation calculated on the training set.

Parameters

columns (list) – List of columns to which preprocessing will be applied
std_num (float) – The number of standard deviations used to limit outliers. Default: 3
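
The formula mean ± std_num * std can be illustrated with a short pandas sketch. It is not the framework's code: the helper names fit_limits and apply_limits and the data are hypothetical, and the sketch assumes the sample standard deviation (pandas default, ddof=1), which the source does not specify.

```python
import pandas as pd


def fit_limits(train, columns, std_num=3.0):
    """Hypothetical helper: compute (lower, upper) limits per column on the training set."""
    limits = {}
    for col in columns:
        # Assumption: sample standard deviation (ddof=1), the pandas default.
        mean, std = train[col].mean(), train[col].std()
        limits[col] = (mean - std_num * std, mean + std_num * std)
    return limits


def apply_limits(df, limits):
    """Cap values outside the fitted limits; values inside them are unchanged."""
    out = df.copy()
    for col, (lower, upper) in limits.items():
        out[col] = out[col].clip(lower=lower, upper=upper)
    return out


train = pd.DataFrame({"VOLUME": [100.0, 110.0, 90.0, 105.0, 95.0]})
new_data = pd.DataFrame({"VOLUME": [102.0, 1000.0, -500.0]})

limits = fit_limits(train, ["VOLUME"], std_num=3.0)
capped = apply_limits(new_data, limits)
print(limits)
print(capped)
```

Fitting on the training set and applying the same limits elsewhere mirrors the split between fit_pandas and transform_pandas.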

fit_pandas(df: pandas.core.frame.DataFrame)[source]#

Fits selected preprocessor on data without transforming it.

Parameters: df (pd.DataFrame) – Dataframe to be preprocessed.

transform_pandas(df: pandas.core.frame.DataFrame)[source]#

Process data by limiting (capping) outliers.

Parameters

df (pd.DataFrame) – Dataframe to be preprocessed.
columns (list, optional) – List of columns to which preprocessing will be applied. If set to None, then all columns will be used. By default None.

class LaggedDifferences(*args, **kwargs)[source]#

Data preprocessing class that calculates the difference between the current and lag values (of time series).

Note: this preprocessing leaves NaN values in the first lag rows. Filter them out before training.

Parameters

columns (list) – List of columns to which preprocessing will be applied.
lag (int) – Value of the lag to be used. This is also the number of rows at the beginning of the dataframe that have no lagged value. Default: 39
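
A minimal pandas sketch of the forward and inverse steps, assuming the difference is taken as the current value minus the value lag rows earlier. The column name and data are hypothetical and the code is not taken from the framework.

```python
import pandas as pd

lag = 2
df = pd.DataFrame({"PRICE": [100.0, 101.0, 103.0, 106.0, 110.0]})

# Forward step: current value minus the value `lag` rows earlier.
lagged_values = df["PRICE"].shift(lag)
diffs = df["PRICE"] - lagged_values  # the first `lag` rows are NaN

# Rows without a lagged value are dropped before training.
train_ready = diffs.dropna()

# Inverse step: add the lagged values back to return to the original scale.
restored = diffs + lagged_values

print(diffs.tolist())
print(restored.dropna().tolist())
```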

transform_ot(src: onetick.py.core.source.Source)[source]#

Calculates lagged differences for the given columns.

Parameters

src (otp.Source) – Source to calculate lagged differences for.
columns (list, optional) – List of columns to calculate lagged differences for.

Returns

Source with calculated lagged differences in new columns.

Return type

otp.Source

fit_pandas(df: pandas.core.frame.DataFrame)[source]#

Preprocess data by calculating the difference between the current and lag values (of time series).

Parameters: df (pd.DataFrame) – Dataframe to be preprocessed.

inverse_transform(prediction_df: Optional[pandas.core.frame.DataFrame] = None) → pandas.core.frame.DataFrame[source]#

Reverse process data by adding the lagged values to the corresponding prediction values.

Parameters: prediction_df (pd.DataFrame) – Dataframe to be deprocessed.
Returns: Deprocessed dataframe with the same shape as the input dataframe or None if prediction_df is None.
Return type: pd.DataFrame

class MinMaxScaler(*args, **kwargs)[source]#

Data preprocessing class that scales data to a given range.

Parameters

columns (list, optional) – List of columns to which preprocessing will be applied. By default None.
transformed_range (tuple) – Desired range of transformed data. Default: (0, 1)
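
The standard min-max formula and its inverse can be written out directly in pandas. This is a sketch of the arithmetic only, with hypothetical data; the framework itself may delegate to an sklearn preprocessor.

```python
import pandas as pd

transformed_range = (0.0, 1.0)
low, high = transformed_range

df = pd.DataFrame({"PRICE": [10.0, 15.0, 20.0, 30.0]})
col_min, col_max = df["PRICE"].min(), df["PRICE"].max()

# Forward: map [col_min, col_max] linearly onto [low, high].
scaled = (df["PRICE"] - col_min) / (col_max - col_min) * (high - low) + low

# Inverse: undo the linear map to return to the original units.
restored = (scaled - low) / (high - low) * (col_max - col_min) + col_min

print(scaled.tolist())
print(restored.tolist())
```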

inverse_transform(prediction_df: Optional[pandas.core.frame.DataFrame] = None) → pandas.core.frame.DataFrame[source]#

Deprocess data using sklearn preprocessor.

Parameters: prediction_df (pd.DataFrame) – Dataframe to be deprocessed.
Returns: Deprocessed dataframe.
Return type: pd.DataFrame

class ApplyLog(*args, **kwargs)[source]#

Data preprocessing class that applies a logarithm to the data.

Parameters

columns (list) – List of columns to which preprocessing will be applied
base (float) – Base of the logarithm. Default: math.e
suffix (str) – Suffix to be added to the column name after preprocessing. Default is “”, which means that the preprocessing does not create a new column and applies the logarithm to the original column.
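
A short sketch of the logarithm with an arbitrary base and its inverse, using numpy and pandas. The data and column name are hypothetical, and the change-of-base approach is an assumption about how a non-natural base could be handled rather than a description of the framework's code.

```python
import math

import numpy as np
import pandas as pd

base = math.e
suffix = ""  # an empty suffix overwrites the original column

df = pd.DataFrame({"VOLUME": [1.0, 10.0, 100.0]})
target = f"VOLUME{suffix}"

# Forward: logarithm in an arbitrary base via the change-of-base rule.
logged = df.copy()
logged[target] = np.log(df["VOLUME"]) / math.log(base)

# Inverse: raise the base to the logged values.
restored = base ** logged[target]

print(logged)
print(restored.tolist())
```

With a non-empty suffix such as "_LOG", the same code would write the result to a new column VOLUME_LOG and leave VOLUME untouched.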

transform_pandas(df: pandas.core.frame.DataFrame)[source]#

Preprocess data by applying logarithm.

Parameters: df (pd.DataFrame) – Dataframe to be preprocessed.

inverse_transform(prediction_df: Optional[pandas.core.frame.DataFrame] = None) → pandas.core.frame.DataFrame[source]#

Reverse process data by applying the exponential function.

Parameters: prediction_df (pd.DataFrame, optional) – Dataframe with predictions, by default None
Returns: Reverse processed dataframe with predictions or None if prediction_df is None.
Return type: pd.DataFrame or None

class PercentageSplitter(val_size=0, test_size=0.15, shuffle=False)[source]#

Class for splitting data into X (features) and Y (target) sets, as well as into train-test-validate subsets (samples are determined using percentage size for validation and test subsets).

Parameters

val_size (float) – The size of the validation subset. Default: 0
test_size (float) – The size of the test subset. Default: 0.15
shuffle (bool) – Whether or not to shuffle the data before splitting. Default: False
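
A percentage-based split might look like the sketch below. The helper percentage_split is hypothetical. The source does not state where the subsets are taken from, so the sketch assumes that, without shuffling, the test rows come from the end of the dataframe and the validation rows from just before them.

```python
import pandas as pd


def percentage_split(df, val_size=0.0, test_size=0.15, shuffle=False, random_state=None):
    """Hypothetical helper: split rows into train/val/test by fractional sizes.

    Assumption: without shuffling, test rows are taken from the end and
    validation rows from just before them.
    """
    if shuffle:
        df = df.sample(frac=1.0, random_state=random_state)
    n = len(df)
    n_test = int(round(n * test_size))
    n_val = int(round(n * val_size))
    n_train = n - n_val - n_test
    train = df.iloc[:n_train]
    val = df.iloc[n_train:n_train + n_val]
    test = df.iloc[n_train + n_val:]
    return train, val, test


df = pd.DataFrame({"x": range(20), "y": range(20)})
train, val, test = percentage_split(df, val_size=0.1, test_size=0.15)
print(len(train), len(val), len(test))
```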

class IndexSplitter(val_indexes=None, test_indexes=None)[source]#

Class for splitting data into X (features) and Y (target) sets, as well as into train-test-validate subsets (samples are determined using indexes for validation and test subsets).

Parameters

val_indexes (list) – The indexes of the validation subset. Default: []
test_indexes (list) – The indexes of the test subset. Default: []

class TimeSplitter(datetime_column='Time', val_time_range=(), test_time_range=())[source]#

Class for splitting data into X (features) and Y (target) sets, as well as into train-test-validate subsets (samples are determined using time ranges for validation and test subsets).

Parameters

datetime_column (str) – Name of the column that stores the datetime values. Set to ‘’ to use the index. Default: ‘Time’
val_time_range (tuple) – Tuple of the start and end datetimes for the validation subset. Both the start and end times are included in the range. Default: ()
test_time_range (tuple) – Tuple of the start and end datetimes for the test subset. Both the start and end times are included in the range. Default: ()
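
The inclusive time ranges can be illustrated with pandas masks. The data below is hypothetical, and treating every row outside both ranges as training data is an assumption rather than something the source states.

```python
import pandas as pd

df = pd.DataFrame({
    "Time": pd.date_range("2022-03-01", periods=10, freq="D"),
    "y": range(10),
})

val_time_range = (pd.Timestamp("2022-03-06"), pd.Timestamp("2022-03-07"))
test_time_range = (pd.Timestamp("2022-03-08"), pd.Timestamp("2022-03-10"))

# Series.between includes both endpoints by default, matching the documented ranges.
val_mask = df["Time"].between(*val_time_range)
test_mask = df["Time"].between(*test_time_range)

val = df[val_mask]
test = df[test_mask]
# Assumption: everything outside both ranges is treated as training data.
train = df[~(val_mask | test_mask)]

print(len(train), len(val), len(test))
```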

class CSVDatafeed(**kwargs)[source]#

Datafeed class for loading data from CSV.

Parameters

params (dict) – Arguments passed directly to the pd.read_csv() function.
suffix (str) – Add a suffix to all columns of the result dataframe. Default: None
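
The two parameters can be pictured as a pd.read_csv() call followed by a column rename. The sketch below reads from an in-memory buffer so that it is self-contained; the CSV content, the params dict and the suffix value are hypothetical, and this is not the datafeed's actual code.

```python
import io

import pandas as pd

# Hypothetical CSV content; in practice `params` would usually hold a file path.
csv_text = (
    "Time,PRICE,VOLUME\n"
    "2022-03-01 09:30:00,100.5,200\n"
    "2022-03-01 09:40:00,101.0,150\n"
)

# `params` is forwarded unchanged to pd.read_csv().
params = {"filepath_or_buffer": io.StringIO(csv_text), "parse_dates": ["Time"]}
suffix = "_AAPL"

df = pd.read_csv(**params)
if suffix is not None:
    # Append the suffix to every column name.
    df = df.add_suffix(suffix)

print(df.columns.tolist())
```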

class RandomBarsDatafeed(bars=5000, bucket=600, candle_body_baseval=0.01, candle_shadow_baseval=0.005, random_state=None, group_column=None)[source]#

Datafeed with randomly generated bars (Open, High, Low, Close, Volume, Time).

Parameters

bars (int) – Number of bars to generate for each ticker. Default: 5000.
bucket (int) – Timeframe of bars to be used (specified in seconds). Default: 600
candle_body_baseval (float) – Parameter that determines the average body size of candlestick bars. Default: 0.01
candle_shadow_baseval (float) – Parameter that determines the average upper/lower shadow size of candlestick bars. Default: 0.005
random_state (int) – Random state (seed). Default None.
group_column (str, optional) – Column name to be used for grouping (tickers, etc.) Default None.

class IntradayAveraging(*args, **kwargs)[source]#

Data preprocessing class that calculates the difference between the current value and the average value of the same intraday interval over the past N days.

Note: this preprocessing will remove the first bins * window_days rows from the dataframe.

Parameters

columns (list) – List of columns to which preprocessing will be applied.
window_days (int) – Number of days for averaging. Default: 5
bins (int) – Number of intraday intervals. If None is specified, then it is determined automatically by the number of unique hh:mm buckets. Default: None
datetime_column (str) – Name of the column that stores the datetime values. Default: ‘Time’
suffix (str) – Suffix to be added to the column names. Default: ‘’
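
The idea of subtracting the average of the same intraday interval over the past N days can be sketched with a pandas groupby. The data is hypothetical, and the sketch assumes that the current day is excluded from its own average. It is an illustration, not the framework's implementation.

```python
import pandas as pd

window_days = 2
# Hypothetical data: 4 days with 3 intraday buckets per day.
times = [
    pd.Timestamp(f"2022-03-0{day} {hhmm}")
    for day in range(1, 5)
    for hhmm in ("09:30", "09:40", "09:50")
]
df = pd.DataFrame({"Time": times, "VOLUME": [float(v) for v in range(1, 13)]})

# Each unique hh:mm value identifies one intraday interval (bin).
bucket = df["Time"].dt.strftime("%H:%M")

# Average of the same hh:mm bucket over the previous `window_days` days.
# Assumption: shift(1) excludes the current day from its own average.
past_avg = df.groupby(bucket)["VOLUME"].transform(
    lambda s: s.shift(1).rolling(window_days).mean()
)

# Rows without a full window of past days are dropped.
adjusted = (df["VOLUME"] - past_avg).dropna()

print(len(df) - len(adjusted))
print(adjusted.tolist())
```

In this example the number of dropped rows equals bins * window_days (3 * 2), in line with the note above.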

fit_pandas(df: pandas.core.frame.DataFrame)[source]#

Preprocess (timeseries) data by calculating the difference between the current value and the average value of the same intraday interval in the past period of time.

Parameters: df (pd.DataFrame) – Dataframe to be preprocessed.

inverse_transform(prediction_df: Optional[pandas.core.frame.DataFrame] = None) → pandas.core.frame.DataFrame[source]#

Reverse process data by adding the average value of the intraday intervals to the corresponding prediction values.

Parameters: prediction_df (pd.DataFrame) – Dataframe to be deprocessed.
Returns: Reverse processed dataframe with the same shape as the input dataframe or None if prediction_df is None.
Return type: pd.DataFrame

class GroupByColumn(*args, **kwargs)[source]#

get_init_params()[source]#

Overrides get_init_params in order to replace GroupByColumn with the nested preprocessor and add group_by_column to its dict.

inverse_transform(prediction_df: Optional[pandas.core.frame.DataFrame] = None) → pandas.core.frame.DataFrame[source]#

Reverse process prediction dataframe.

Parameters: prediction_df (pd.DataFrame, optional) – Prediction dataframe to be deprocessed, by default None
Returns: Reverse processed prediction dataframe (if deprocessable)
Return type: pd.DataFrame

class SKLearnPreprocessor(*args, **kwargs)[source]#

Data preprocessing class that uses an sklearn preprocessor. Only preprocessors that do not change the number of columns can be used.

Parameters

preprocessor_class (sklearn.preprocessing class) – Class of sklearn preprocessor to be used.
columns (list, str) – List of columns to which preprocessing will be applied. String “__all__” means that all columns will be used. Default is “__all__”.
fit_on_train (bool) – If True, then preprocessor will be fitted only on train data. Default is False.
params (dict) – Keyword arguments to be passed to sklearn preprocessor constructor.
suffix (str) – Suffix to be added to column name after preprocessing. Default is None, which means that suffix will be equal to preprocessor class name in upper case.

fit_pandas(df: pandas.core.frame.DataFrame)[source]#

Fits selected preprocessor on data without transforming it.

Parameters

df (pd.DataFrame) – Dataframe to be preprocessed.
columns (list, optional) – List of columns to which preprocessing will be applied. If set to None, then all columns will be used. By default None.

transform_pandas(df: pandas.core.frame.DataFrame)[source]#

Transform data using selected sklearn preprocessor.

Parameters

df (pd.DataFrame) – Dataframe to be preprocessed.
columns (list, optional) – List of columns to which preprocessing will be applied. If set to None, then all columns will be used. By default None.

inverse_transform(prediction_df: Optional[pandas.core.frame.DataFrame] = None) → pandas.core.frame.DataFrame[source]#

Deprocess data using sklearn preprocessor.

Parameters: prediction_df (pd.DataFrame) – Dataframe to be deprocessed.
Returns: Deprocessed dataframe.
Return type: pd.DataFrame

class FilterValues(columns: Union[list, str] = '__all__', exclude_values: Union[list, str] = 'na', exclude_from_test_set_only: bool = False)[source]#
