onetick.ml.impl.data_pipelines#
- class CalcLags(columns: Optional[list], periods: Optional[list] = None, remove_first_rows: bool = False, group_by_column: bool = False, suffix: str = '_LAG_')[source]#
Class that calculates lag values for selected columns. Note: when remove_first_rows is True, this feature processing removes the first max(periods) rows from the dataframe, as there is no lagged value for them.
- Parameters
columns (list) – List of columns for which lags are calculated.
periods (list, optional) – Lag values to be used. The maximum value of periods determines how many leading rows are removed from the dataframe. Default: [1]
suffix (str, optional) – The names of columns with calculated lag values have a suffix of the form: f”{suffix}{value_of_lag}” Default: “_LAG_”
remove_first_rows (bool, optional) – If True, remove the first max(periods) rows from the dataframe. Default: False
- transform_ot(src: onetick.py.core.source.Source) onetick.py.core.source.Source [source]#
Calculates lags for the given columns inside OneTick pipeline.
- Parameters
src (otp.Source) – Source to calculate lags for.
columns (list) – List of columns to calculate lags for.
- Returns
Source with calculated lags in new columns.
- Return type
otp.Source
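The pandas sketch below only illustrates what the lag columns contain; the real class operates inside the pipeline (e.g. via transform_ot on an otp.Source), and the toy column name is an assumption.

import pandas as pd

df = pd.DataFrame({"CLOSE": [10.0, 10.5, 10.2, 10.8, 11.0]})
for lag in [1, 2]:                                   # periods=[1, 2]
    df[f"CLOSE_LAG_{lag}"] = df["CLOSE"].shift(lag)  # suffix="_LAG_"
df = df.iloc[max([1, 2]):]                           # remove_first_rows=True drops max(periods) rows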
- class SelectFeatures(columns=None, override=False)[source]#
Class that selects the specified columns as features.
- Parameters
columns (list) – List of columns that are defined as features.
override (bool) – If True, override the existing list of feature columns. If False, add the columns to the existing list of feature columns.
- class SelectTargets(columns, override: bool = False, shift: bool = False)[source]#
Class that selects the specified columns as targets.
- Parameters
columns (list) – List of columns that are defined as targets.
override (bool) – If True, override the existing list of target columns. If False, add the columns to the existing list of target columns.
shift (bool) – Do not use this yet. If True, shift the resulting target columns by one row, providing the next value of the target column for each row. This is useful when predicting the next value of a time series while using the current value as a feature. Default: False.
- class OneTickBarsDatafeed(**kwargs)[source]#
OneTick datafeed with bars (Open, High, Low, Close, Volume, Trade Count).
- Parameters
db (str) – Name for database to use. Default: ‘NYSE_TAQ_BARS’.
tick_type (str) – Tick type to load. Default: ‘TRD_1M’.
symbols (List[str]) – List of symbols to load. Default: [‘AAPL’].
start (otp.datetime) – Start datetime. Default: datetime(2022, 3, 1, 9, 30)
end (otp.datetime) – End datetime. Default: datetime(2022, 3, 10, 16, 0)
bucket (int) – Bucket size used to aggregate data (timeframe). Default: 600.
bucket_time (str) – Bucket time to use: start or end. Default: start.
timezone (str) – Timezone to use. Default: ‘EST5EDT’.
suffix (str) – Add a suffix to all columns of the result dataframe. Default: None.
apply_times_daily (bool) – If True, apply the start/end times of day to each day, skipping data outside of that intraday range. Default: True.
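A minimal construction sketch using only the documented keyword arguments; the import path is inferred from the module name, and the otp alias for onetick.py is an assumption.

import onetick.py as otp
from onetick.ml.impl.data_pipelines import OneTickBarsDatafeed  # import path assumed

datafeed = OneTickBarsDatafeed(
    db="NYSE_TAQ_BARS",
    tick_type="TRD_1M",
    symbols=["AAPL"],
    start=otp.datetime(2022, 3, 1, 9, 30),
    end=otp.datetime(2022, 3, 10, 16, 0),
    bucket=600,              # 10-minute bars
    bucket_time="start",
    timezone="EST5EDT",
)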
- class LimitOutliers(*args, **kwargs)[source]#
Data preprocessing class that limits outliers by using standard deviations. The maximum and minimum allowable values are calculated using the formula: mean ± std_num * std, where mean and std are the mean value and standard deviation calculated on the training set.
- Parameters
columns (list) – List of columns to which preprocessing will be applied.
std_num (float) – The number of standard deviations used to limit outliers. Default: 3
- fit_pandas(df: pandas.core.frame.DataFrame)[source]#
Fits selected preprocessor on data without transforming it.
- Parameters
df (pd.DataFrame) – Dataframe to be preprocessed.
- transform_pandas(df: pandas.core.frame.DataFrame)[source]#
Process data by limiting (capping) outliers.
- Parameters
df (pd.DataFrame) – Dataframe to be preprocessed.
columns (list, optional) – List of columns to which preprocessing will be applied. If set to None, then all columns will be used. By default None.
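A pandas sketch of the capping rule mean ± std_num * std; the real class computes the statistics in fit_pandas on the training set, which this toy example only imitates.

import pandas as pd

train = pd.DataFrame({"VOLUME": [100.0, 120.0, 95.0, 110.0, 5000.0]})
mean, std = train["VOLUME"].mean(), train["VOLUME"].std()
lower, upper = mean - 3 * std, mean + 3 * std          # std_num=3
train["VOLUME"] = train["VOLUME"].clip(lower=lower, upper=upper)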
- class LaggedDifferences(*args, **kwargs)[source]#
Data preprocessing class that calculates the difference between the current and lag values (of time series).
Note: this preprocessing leaves NaN values in the first lag rows. Filter them out before training.
- Parameters
columns (list) – List of columns to which preprocessing will be applied.
lag (int) – Value of the lag to be used. This value equals the number of rows that will be removed from the beginning of the dataframe. Default: 39
- transform_ot(src: onetick.py.core.source.Source)[source]#
Calculates lagged differences for the given columns.
- Parameters
src (otp.Source) – Source to calculate lagged differences for.
columns (list, optional) – List of columns to calculate lagged differences for.
- Returns
Source with calculated lagged differences in new columns.
- Return type
otp.Source
- fit_pandas(df: pandas.core.frame.DataFrame)[source]#
Preprocess data by calculating the difference between the current and lag values (of time series).
- Parameters
df (pd.DataFrame) – Dataframe to be preprocessed.
- inverse_transform(prediction_df: Optional[pandas.core.frame.DataFrame] = None) pandas.core.frame.DataFrame [source]#
Reverse process data by adding the lagged values to the corresponding prediction values.
- Parameters
prediction_df (pd.DataFrame) – Dataframe to be deprocessed.
- Returns
Deprocessed dataframe with the same shape as the input dataframe or None if prediction_df is None.
- Return type
pd.DataFrame
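A pandas sketch of the forward transform and its inverse; the exact bookkeeping of the real class (which lagged values it stores for inverse_transform) is an assumption.

import pandas as pd

lag = 39
df = pd.DataFrame({"CLOSE": [float(i) for i in range(120)]})
lagged = df["CLOSE"].shift(lag)             # values lag rows earlier, kept for the inverse step
df["CLOSE_DIFF"] = df["CLOSE"] - lagged     # forward transform; the first `lag` rows are NaN

predicted_diff = df["CLOSE_DIFF"]           # stand-in for model predictions of the difference
restored = predicted_diff + lagged          # inverse_transform: back to price levels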
- class MinMaxScaler(*args, **kwargs)[source]#
Data preprocessing class that scales data to a given range.
- Parameters
columns (list, optional) – List of columns to which preprocessing will be applied. By default None.
transformed_range (tuple) – Desired range of transformed data. Default: (0, 1)
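The scaling rule, written out as a small pandas sketch for the default transformed_range=(0, 1):

import pandas as pd

s = pd.Series([10.0, 12.0, 15.0, 20.0])
lo, hi = 0.0, 1.0                                              # transformed_range
scaled = (s - s.min()) / (s.max() - s.min()) * (hi - lo) + lo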
- class ApplyLog(*args, **kwargs)[source]#
Data preprocessing class that applies a logarithm to the data.
- Parameters
columns (list) – List of columns to which preprocessing will be applied.
base (float) – Base of the logarithm. Default: math.e
suffix (str) – Suffix to be added to the column name after preprocessing. Default is “”, which means that preprocessing will not create a new column and will apply the logarithm to the original column in place.
- transform_pandas(df: pandas.core.frame.DataFrame)[source]#
Preprocess data by applying logarithm.
- Parameters
df (pd.DataFrame) – Dataframe to be preprocessed.
- inverse_transform(prediction_df: Optional[pandas.core.frame.DataFrame] = None) pandas.core.frame.DataFrame [source]#
Reverse process data by applying the exponential function.
- Parameters
prediction_df (pd.DataFrame, optional) – Dataframe with predictions, by default None
- Returns
Reverse processed dataframe with predictions or None if prediction_df is None.
- Return type
pd.DataFrame or None
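A numpy sketch of the forward and inverse transforms for the default base math.e; for another base b the forward step is log(x) / log(b) and the inverse is b ** y.

import numpy as np
import pandas as pd

s = pd.Series([1.0, 10.0, 100.0])
logged = np.log(s)           # base=math.e; suffix="" overwrites the original column
restored = np.exp(logged)    # inverse_transform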
- class PercentageSplitter(val_size=0, test_size=0.15, shuffle=False)[source]#
Class for splitting data into X (features) and Y (target) sets, as well as into train-test-validate subsets (samples are determined using percentage sizes for the validation and test subsets).
- Parameters
val_size (float) – The size of the validation subset. Default: 0
test_size (float) – The size of the test subset. Default: 0.15
shuffle (bool) – Whether or not to shuffle the data before splitting. Default: False
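A sketch of an ordered split with shuffle=False; the placement of the validation and test slices at the end of the data and the rounding behavior are assumptions about the real splitter.

import pandas as pd

df = pd.DataFrame({"x": range(100)})
val_size, test_size = 0.0, 0.15
n_val, n_test = int(len(df) * val_size), int(len(df) * test_size)
train = df.iloc[: len(df) - n_val - n_test]
val = df.iloc[len(df) - n_val - n_test : len(df) - n_test]
test = df.iloc[len(df) - n_test :]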
- class IndexSplitter(val_indexes=None, test_indexes=None)[source]#
Class for splitting data into X (features) and Y (target) sets, as well as into train-test-validate subsets (samples are determined using indexes for validation and test subsets).
- Parameters
val_indexes (list) – The indexes of the validation subset. Default: []
test_indexes (list) – The indexes of the test subset. Default: []
- class TimeSplitter(datetime_column='Time', val_time_range=(), test_time_range=())[source]#
Class for splitting data into X (features) and Y (target) sets, as well as into train-test-validate subsets (samples are determined using time ranges for validation and test subsets).
- Parameters
datetime_column (str) – Name of the column that stores the datetime values. Set to ‘’ to use the index column. Default: ‘Time’
val_time_range (tuple) – Tuple of the start and end datetimes for the validation subset. Start time, end time are included in the range. Default: ()
test_time_range (tuple) – Tuple of the start and end datetimes for the test subset. Start time, end time are included in the range. Default: ()
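A pandas sketch of selecting a subset by an inclusive datetime range on the ‘Time’ column; how the real splitter assembles the remaining train subset is an assumption.

import pandas as pd

df = pd.DataFrame({
    "Time": pd.date_range("2022-03-01 09:30", periods=10, freq="10min"),
    "CLOSE": range(10),
})
val_start, val_end = "2022-03-01 10:00", "2022-03-01 10:30"
val_mask = df["Time"].between(val_start, val_end)   # both endpoints included
val, train = df[val_mask], df[~val_mask]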
- class CSVDatafeed(**kwargs)[source]#
Datafeed class for loading data from CSV.
- Parameters
params (dict) – Arguments passed directly to the pd.read_csv() function.
suffix (str) – Add a suffix to all columns of the result dataframe. Default: None
- class RandomBarsDatafeed(bars=5000, bucket=600, candle_body_baseval=0.01, candle_shadow_baseval=0.005, random_state=None, group_column=None)[source]#
Datafeed with randomly generated bars (Open, High, Low, Close, Volume, Time).
- Parameters
bars (int) – Number of bars to generate for each ticker. Default: 5000.
bucket (int) – Timeframe of bars to be used (specified in seconds). Default: 600
candle_body_baseval (float) – Parameter that determines the average body size of candlestick bars. Default: 0.01
candle_shadow_baseval (float) – Parameter that determines the average upper/lower shadow size of candlestick bars. Default: 0.005
random_state (int) – Random state (seed). Default None.
group_column (str, optional) – Column name to be used for grouping (tickers, etc.) Default None.
- class IntradayAveraging(*args, **kwargs)[source]#
Data preprocessing class that calculates the difference between the current value and the average value of the same intraday interval over the past N days.
Note: this preprocessing will remove the first bins * window_days rows from the dataframe.
- Parameters
columns (list) – List of columns to which preprocessing will be applied.
window_days (int) – Number of days for averaging. Default: 5
bins (int) – Number of intraday intervals. If None is specified, then it is determined automatically by the number of unique hh:mm buckets. Default: None
datetime_column (str) – Name of the column that stores the datetime values. Default: ‘Time’
suffix (str) – Suffix to be added to the column names. Default: ‘’
- fit_pandas(df: pandas.core.frame.DataFrame)[source]#
Preprocess (timeseries) data by calculating the difference between the current value and the average value of the same intraday interval in the past period of time.
- Parameters
df (pd.DataFrame) – Dataframe to be preprocessed.
- inverse_transform(prediction_df: Optional[pandas.core.frame.DataFrame] = None) pandas.core.frame.DataFrame [source]#
Reverse process data by adding the average value of the intraday intervals to the corresponding prediction values.
- Parameters
prediction_df (pd.DataFrame) – Dataframe to be deprocessed.
- Returns
Reverse processed dataframe with the same shape as the input dataframe or None if prediction_df is None.
- Return type
pd.DataFrame
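A pandas sketch of the idea: group rows by their hh:mm bucket, average the previous window_days occurrences of that bucket, and subtract. Whether the current day is excluded from the average is an assumption (shift(1) excludes it here), as is the output column name.

import pandas as pd

df = pd.DataFrame({
    "Time": pd.date_range("2022-03-01 09:30", periods=390, freq="10min"),
    "CLOSE": 1.0,
})
window_days = 5
bucket = df["Time"].dt.strftime("%H:%M")                       # intraday interval (bin)
past_avg = df.groupby(bucket)["CLOSE"].transform(
    lambda s: s.shift(1).rolling(window_days).mean()
)
df["CLOSE_AVG_DIFF"] = df["CLOSE"] - past_avg                  # NaN until enough history per bin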
- class GroupByColumn(*args, **kwargs)[source]#
- get_init_params()[source]#
Overrides get_init_params in order to replace GroupByColumn with the nested preprocessor and add group_by_column to its dict.
- inverse_transform(prediction_df: Optional[pandas.core.frame.DataFrame] = None) pandas.core.frame.DataFrame [source]#
Reverse process prediction dataframe.
- Parameters
prediction_df (pd.DataFrame, optional) – Prediction dataframe to be deprocessed, by default None
- Returns
Reverse processed prediction dataframe (if deprocessable)
- Return type
pd.DataFrame
- class SKLearnPreprocessor(*args, **kwargs)[source]#
Data preprocessing class that wraps an sklearn preprocessor. Only preprocessors that do not change the number of columns can be used.
- Parameters
preprocessor_class (sklearn.preprocessing class) – Class of sklearn preprocessor to be used.
columns (list, str) – List of columns to which preprocessing will be applied. String “__all__” means that all columns will be used. Default is “__all__”.
fit_on_train (bool) – If True, then preprocessor will be fitted only on train data. Default is False.
params (dict) – Keyword arguments to be passed to sklearn preprocessor constructor.
suffix (str) – Suffix to be added to column name after preprocessing. Default is None, which means that suffix will be equal to preprocessor class name in upper case.
- fit_pandas(df: pandas.core.frame.DataFrame)[source]#
Fits selected preprocessor on data without transforming it.
- Parameters
df (pd.DataFrame) – Dataframe to be preprocessed.
columns (list, optional) – List of columns to which preprocessing will be applied. If set to None, then all columns will be used. By default None.
- transform_pandas(df: pandas.core.frame.DataFrame)[source]#
Transform data using selected sklearn preprocessor.
- Parameters
df (pd.DataFrame) – Dataframe to be preprocessed.
columns (list, optional) – List of columns to which preprocessing will be applied. If set to None, then all columns will be used. By default None.
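A construction sketch using sklearn’s StandardScaler, which keeps the number of columns unchanged; the import path is inferred from the module name.

from sklearn.preprocessing import StandardScaler
from onetick.ml.impl.data_pipelines import SKLearnPreprocessor  # import path assumed

scaler = SKLearnPreprocessor(
    preprocessor_class=StandardScaler,
    columns="__all__",
    fit_on_train=True,                               # fit only on the train subset
    params={"with_mean": True, "with_std": True},
    suffix="_SCALED",
)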
- class FilterValues(columns: Union[list, str] = '__all__', exclude_values: Union[list, str] = 'na', exclude_from_test_set_only: bool = False)[source]#
- class BaseOnetickLoader(**kwargs)[source]#
- class OneTickBarsDatafeedOT(**kwargs)[source]#
OneTick datafeed with bars (Open, High, Low, Close, Volume, Trade Count).
- Parameters
db (str) – Name for database to use. Default: ‘NYSE_TAQ_BARS’.
tick_type (str) – Tick type to load. Default: ‘TRD_1M’.
symbols (List[str]) – List of symbols to load. Default: [‘AAPL’].
start (otp.datetime) – Start datetime. Default: datetime(2022, 3, 1, 9, 30)
end (otp.datetime) – End datetime. Default: datetime(2022, 3, 10, 16, 0)
bucket (int) – Bucket size used to aggregate data (timeframe). Default: 600.
bucket_time (str) – Bucket time to use: start or end. Default: start.
timezone (str) – Timezone to use. Default: ‘EST5EDT’.
columns (list) – List of columns to load.
apply_times_daily (bool) – If True, apply the start/end times of day to each day, skipping data outside of that intraday range. Default: True.
- class WindowFunction(columns: Optional[list] = None, suffix: str = '_WINDOW_', window_function: str = typing.Literal['mean', 'std', 'min', 'max'], window_size: int = 10)[source]#
- transform_ot(src: onetick.py.core.source.Source)[source]#
Calculates rolling window function for the given columns.
- Parameters
src (otp.Source) – Source to calculate the rolling window function for.
columns (list) – List of columns to calculate rolling window function for.
- Returns
Source with the calculated rolling window function in new columns.
- Return type
otp.Source
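A pandas sketch of the rolling computation for window_function='mean' and window_size=10; the output column naming follows the suffix pattern but is an assumption.

import pandas as pd

df = pd.DataFrame({"CLOSE": [float(i) for i in range(30)]})
window_size = 10
df["CLOSE_WINDOW_MEAN"] = df["CLOSE"].rolling(window_size).mean()  # first window_size - 1 rows are NaN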