Configuring Workloads
Anovos workloads can be described by a YAML configuration file.
Such a configuration file defines:
- the input dataset(s)
- the analyses and transformations to be performed on the data
- the output files and dataset(s)
- the reports to be generated
Defining workloads this way allows users to make full use of Anovos capabilities while maintaining an easy-to-grasp overview. Since each configuration file fully describes one workload, these files can be shared, versioned, and run across different compute environments.
In the following, we'll describe in detail each of the sections in an Anovos configuration file. If you'd rather see a full example right away, have a look at this example.
Note that each section of the configuration file maps to a module of Anovos. You'll find links to the respective sections of the API Documentation that provide much more detailed information on each module's capabilities than we can squeeze into this guide.
📑 input_dataset
This configuration block describes how the input dataset is loaded and prepared using the data_ingest.data_ingest module.
Each Anovos configuration file must contain exactly one input_dataset block.
Note that the subsequent operations are performed in the order given here: First, columns are deleted, then selected, then renamed, and then recast.
read_dataset
🔎 Corresponds to data_ingest.read_dataset
- file_path: The file (or directory) path to read the input dataset from. It can be a local path, an 📖 S3 path (when running on AWS), a path to a file resource on Google Colab (see 📖 this tutorial for an overview), or a path on the 📖 Databricks File System (when running on Azure).
- file_type: The file format of the input data. Currently, Anovos supports CSV (csv), Parquet (parquet), and Avro (avro). (Please note that if you're using Avro data sources, you need to add the external package org.apache.spark:spark-avro when submitting the Spark job.)
- file_configs (optional): Options to pass to the respective Spark file reader, e.g., delimiters, schemas, headers. In the case of a CSV file, this might look like:
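For instance, a CSV input could pass the following reader options (the values shown are illustrative, not prescribed):
file_configs:
  header: True
  delimiter: ","
  inferSchema: True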
For more information on available configuration options, see the following external documentation:
delete_column
🔎 Corresponds to data_ingest.delete_column
List of column names (list of strings or string of column names separated by |) to be deleted from the loaded input data.
🤓 Example:
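For instance, with placeholder column names:
delete_column: ['internal_id', 'raw_notes']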
select_column
🔎 Corresponds to data_ingest.select_column
List of column names (list of strings or string of column names separated by |) to be selected for further processing.
🤓 Example:
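For instance, with placeholder column names:
select_column: ['age', 'income', 'label']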
rename_column
🔎 Corresponds to data_ingest.rename_column
- list_of_cols: List of the names of columns (list of strings or string of column names separated by |) to be renamed.
- list_of_newcols: The new column names. The first element in list_of_cols will be renamed to the first name in list_of_newcols, and so on.
🤓 Example:
rename_column:
  list_of_cols: ['very_long_column_name', 'price']
  list_of_newcols: ['short_name', 'label']
This will rename the column very_long_column_name to short_name and the column price to label.
recast_column
🔎 Corresponds to data_ingest.recast_column
- list_of_cols: List of the names of columns (list of strings or string of column names separated by |) to be cast to a different type.
- list_of_dtypes: The new datatypes. The first element in list_of_cols will be recast to the first type in list_of_dtypes, and so on. See 📖 the Spark documentation for a list of valid datatypes. Note that this field is case-insensitive.
🤓 Example:
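For instance, with placeholder column names and Spark datatypes:
recast_column:
  list_of_cols: ['age', 'income']
  list_of_dtypes: ['int', 'double']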
📑 concatenate_dataset
🔎 Corresponds to data_ingest.concatenate_dataset
This configuration block describes how to combine multiple loaded dataframes into a single one.
method
There are two different methods to concatenate dataframes:
- index: Concatenate by column index, i.e., the first column of the first dataframe is matched with the first column of the second dataframe, and so forth.
- name: Concatenate by column name, i.e., columns of the same name are matched.
Note that in both cases, the first dataframe will define both the names and the order of the columns in the final dataframe.
If the subsequent dataframes have too few columns (index) or are missing named columns (name) for the concatenation to proceed, an error will be raised.
🤓 Example:
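For instance, to match columns by their names:
method: name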
dataset1
read_dataset
🔎 Corresponds to data_ingest.read_dataset
- file_path: The file (or directory) path to read this dataset (to be concatenated) from. It can be a local path, an 📖 S3 path (when running on AWS), a path to a file resource on Google Colab (see 📖 this tutorial for an overview), or a path on the 📖 Databricks File System (when running on Azure).
- file_type: The file format of this dataset. Currently, Anovos supports CSV (csv), Parquet (parquet), and Avro (avro). (Please note that if you're using Avro data sources, you need to add the external package org.apache.spark:spark-avro when submitting the Spark job.)
- file_configs (optional): Options to pass to the respective Spark file reader, e.g., delimiters, schemas, headers.
delete_column
🔎 Corresponds to data_ingest.delete_column
List of column names (list of strings or string of column names separated by |) to be deleted from the loaded input data.
select_column
🔎 Corresponds to data_ingest.select_column
List of column names (list of strings or string of column names separated by |) to be selected for further processing.
rename_column
🔎 Corresponds to data_ingest.rename_column
- list_of_cols: List of the names of columns (list of strings or string of column names separated by |) to be renamed.
- list_of_newcols: The new column names. The first element in list_of_cols will be renamed to the first name in list_of_newcols, and so on.
recast_column
🔎 Corresponds to data_ingest.recast_column
- list_of_cols: List of the names of columns (list of strings or string of column names separated by |) to be cast to a different type.
- list_of_dtypes: The new datatypes. The first element in list_of_cols will be recast to the first type in list_of_dtypes, and so on. See 📖 the Spark documentation for a list of valid datatypes. Note that this field is case-insensitive.
dataset2, dataset3, …
Additional datasets are configured in the same manner as dataset1.
📑 join_dataset
🔎 Corresponds to data_ingest.join_dataset
This configuration block describes how multiple dataframes are joined into a single one.
join_cols
The key of the column(s) to join on.
In the case that the key consists of multiple columns, they can be passed as a list of strings or as a single string where the column names are separated by |.
🤓 Example:
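For instance, joining on a single placeholder key column:
join_cols: id_column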
join_type
The type of join to perform: inner, full, left, right, left_semi, or left_anti.
For a general introduction to joins, see 📖 this tutorial.
🤓 Example:
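For instance:
join_type: inner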
dataset1
read_dataset
🔎 Corresponds to data_ingest.read_dataset
- file_path: The file (or directory) path to read the dataset to be joined from. It can be a local path, an 📖 S3 path (when running on AWS), a path to a file resource on Google Colab (see 📖 this tutorial for an overview), or a path on the 📖 Databricks File System (when running on Azure).
- file_type: The file format of the dataset to be joined. Currently, Anovos supports CSV (csv), Parquet (parquet), and Avro (avro). (Please note that if you're using Avro data sources, you need to add the external package org.apache.spark:spark-avro when submitting the Spark job.)
- file_configs (optional): Options to pass to the respective Spark file reader, e.g., delimiters, schemas, headers.
delete_column
🔎 Corresponds to data_ingest.delete_column
List of column names (list of strings or string of column names separated by |) to be deleted from the loaded input data.
select_column
🔎 Corresponds to data_ingest.select_column
List of column names (list of strings or string of column names separated by |) to be selected for further processing.
rename_column
🔎 Corresponds to data_ingest.rename_column
- list_of_cols: List of the names of columns (list of strings or string of column names separated by |) to be renamed.
- list_of_newcols: The new column names. The first element in list_of_cols will be renamed to the first name in list_of_newcols, and so on.
recast_column
🔎 Corresponds to data_ingest.recast_column
- list_of_cols: List of the names of columns (list of strings or string of column names separated by |) to be cast to a different type.
- list_of_dtypes: The new datatypes. The first element in list_of_cols will be recast to the first type in list_of_dtypes, and so on. See 📖 the Spark documentation for a list of valid datatypes. Note that this field is case-insensitive.
dataset2, dataset3, …
Additional datasets are configured in the same manner as dataset1.
📑 timeseries_analyzer
🔎 Corresponds to data_analyzer.ts_analyzer
Configuration for the time series analyzer.
- auto_detection: Can be set to True or False. If True, it attempts to automatically infer the date/timestamp format in the input dataset.
- id_col: Name of the ID column in the input dataset.
- tz_offset: The timezone offset of the timestamps in the input dataset. Can be set to either local, gmt, or utc. The default setting is local.
- inspection: Can be set to True or False. If True, the time series elements undergo an inspection.
- analysis_level: Can be set to daily, weekly, or hourly. The default setting is daily. If set to daily, the daily view is populated. If set to hourly, the view is shown at a day-part level. If set to weekly, the view is shown per individual weekday (1-7) as captured.
- max_days: Maximum number of days up to which the data will be aggregated. If the dataset contains a timestamp/date field with a very high number of unique dates (e.g., 20 years' worth of daily data), this option can be used to reduce the timespan that is analyzed.
🤓 Example:
timeseries_analyzer:
  auto_detection: True
  id_col: 'id_column'
  tz_offset: 'local'
  inspection: True
  analysis_level: 'daily'
  max_days: 3600
📑 anovos_basic_report
🔎 Corresponds to data_report.basic_report_generation
The basic report consists of a summary of the outputs of the stats_generator, quality_checker, and association_evaluator modules. See the 📖 documentation for data reports for more details.
The basic report can be customized using the following options:
basic_report
If True, a basic report is generated after completion of the data_analyzer modules.
If False, no report is generated.
Nevertheless, all the computed statistics and metrics will be available in the final report.
report_args
- id_col: The name of the ID column in the input dataset.
- label_col: The name of the label or target column in the input dataset.
- event_label: The value of the event (label 1/true) in the label column.
- output_path: Path where the basic report is saved. It can be a local path, an 📖 S3 path (when running on AWS), a path to a file resource on Google Colab (see 📖 this tutorial for an overview), or a path on the 📖 Databricks File System (when running on Azure).
🤓 Example:
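For instance, using the placeholder column names and class value that appear in other examples in this guide:
anovos_basic_report:
  basic_report: True
  report_args:
    id_col: id_column
    label_col: label_col
    event_label: 'class1'
    output_path: report_stats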
📑 stats_generator
🔎 Corresponds to data_analyzer.stats_generator
This module generates descriptive statistics of the ingested data. Descriptive statistics are split into different metric types. Each function corresponds to one metric type.
metric
List of metrics to calculate for the input dataset. Available options are:
- 📖 global_summary
- 📖 measures_of_counts
- 📖 measures_of_centralTendency
- 📖 measures_of_cardinality
- 📖 measures_of_dispersion
- 📖 measures_of_percentiles
- 📖 measures_of_shape
🤓 Example:
metric: ['global_summary', 'measures_of_counts', 'measures_of_cardinality', 'measures_of_dispersion']
metric_args
- list_of_cols: List of column names (list of strings or string of column names separated by |) to compute the metrics for. Alternatively, if set to "all", all columns are included.
- drop_cols: List of column names (list of strings or string of column names separated by |) to exclude from metrics computation. This option is especially useful if list_of_cols is set to "all", as it allows computing metrics for all except a few columns without having to specify a potentially very long list of column names to include.
🤓 Example:
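For instance, computing metrics for all columns except a placeholder ID column:
metric_args:
  list_of_cols: all
  drop_cols: ['id_column']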
📑 quality_checker
🔎 Corresponds to data_analyzer.quality_checker
This module assesses the data quality along different dimensions. Quality metrics are computed at both the row and column level. Further, the module includes appropriate treatment options to fix several common quality issues.
duplicate_detection
🔎 Corresponds to quality_checker.duplicate_detection
- list_of_cols: List of column names (list of strings or string of column names separated by |) to consider when searching for duplicates. Alternatively, if set to "all", all columns are included.
- drop_cols: List of column names (list of strings or string of column names separated by |) to be excluded from duplicate detection.
- treatment: If False, duplicates are detected and reported. If True, duplicate rows are removed from the input dataset.
🤓 Example:
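For instance, removing duplicate rows while ignoring a placeholder ID column:
duplicate_detection:
  list_of_cols: all
  drop_cols: ['id_column']
  treatment: True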
nullRows_detection
🔎 Corresponds to quality_checker.nullRows_detection
- list_of_cols: List of column names (list of strings or string of column names separated by |) to consider during null rows detection. Alternatively, if set to "all", all columns are included.
- drop_cols: List of column names (list of strings or string of column names separated by |) to exclude from null rows detection.
- treatment: If False, null rows are detected and reported. If True, rows where more than treatment_threshold columns are null are removed from the input dataset.
- treatment_threshold: It takes a value between 0 and 1 (default 0.8) that specifies which fraction of columns has to be null for a row to be considered a null row. If the threshold is 0, rows with any missing value will be flagged as null. If the threshold is 1, only rows where all values are missing will be flagged as null.
🤓 Example:
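For instance, removing rows in which more than 75% of the columns are null (the threshold value is illustrative):
nullRows_detection:
  list_of_cols: all
  drop_cols: []
  treatment: True
  treatment_threshold: 0.75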
invalidEntries_detection
🔎 Corresponds to quality_checker.invalidEntries_detection
- list_of_cols: List of column names (list of strings or string of column names separated by |) to be considered during invalid entries detection. Alternatively, if set to "all", all columns are included.
- drop_cols: List of column names (list of strings or string of column names separated by |) to exclude from invalid entries detection.
- treatment: If False, invalid entries are detected and reported. If True, invalid entries are replaced with null.
- output_mode: Can be either "replace" or "append". If set to "replace", the original columns will be replaced with the treated columns. If set to "append", the original columns will be kept and the treated columns will be appended to the dataset. The appended columns will be named as the original column with a suffix "_cleaned" (e.g., the column "cost_of_living_cleaned" corresponds to the original column "cost_of_living").
🤓 Example:
invalidEntries_detection:
  list_of_cols: all
  drop_cols: ['id_column']
  treatment: True
  output_mode: replace
IDness_detection
🔎 Corresponds to quality_checker.IDness_detection
- list_of_cols: List of column names (list of strings or string of column names separated by |) to be considered for IDness detection. Alternatively, if set to "all", all columns are included.
- drop_cols: List of column names (list of strings or string of column names separated by |) to exclude from IDness detection.
- treatment: If False, columns with high IDness are detected and reported. If True, columns with an IDness above treatment_threshold are removed.
- treatment_threshold: A value between 0 and 1 (default 1.0).
🤓 Example:
IDness_detection:
  list_of_cols: all
  drop_cols: ['id_column']
  treatment: True
  treatment_threshold: 0.9
biasedness_detection
🔎 Corresponds to quality_checker.biasedness_detection
- list_of_cols: List of column names (list of strings or string of column names separated by |) to be considered for biasedness detection. Alternatively, if set to "all", all columns are included.
- drop_cols: List of column names (list of strings or string of column names separated by |) to exclude from biasedness detection.
- treatment: If False, columns with a high bias are detected and reported. If True, columns with a bias above treatment_threshold are removed.
- treatment_threshold: A value between 0 and 1 (default 1.0).
🤓 Example:
biasedness_detection:
  list_of_cols: all
  drop_cols: ['label_col']
  treatment: True
  treatment_threshold: 0.98
outlier_detection
🔎 Corresponds to quality_checker.outlier_detection
- list_of_cols: List of column names (list of strings or string of column names separated by |) to be considered for outlier detection. Alternatively, if set to "all", all columns are included.
  ⚠ Note that any column that contains just a single value or only null values is not subjected to outlier detection, even if it is selected under this argument.
- drop_cols: List of column names (list of strings or string of column names separated by |) to exclude from outlier detection.
- detection_side: Whether outliers should be detected on the "upper", the "lower", or "both" sides.
- detection_configs: A map that defines the input parameters for the different outlier detection methods. Possible keys are:
  - pctile_lower (default 0.05)
  - pctile_upper (default 0.95)
  - stdev_lower (default 3.0)
  - stdev_upper (default 3.0)
  - IQR_lower (default 1.5)
  - IQR_upper (default 1.5)
  - min_validation (default 2)
  For details, see 📖 the outlier_detection API documentation.
- treatment: If False, outliers are detected and reported. If True, outliers are treated with the specified treatment_method.
- treatment_method: Specifies how outliers are treated. Possible options are "null_replacement", "row_removal", and "value_replacement".
- pre_existing_model: If True, the file specified under model_path with lower/upper bounds is loaded. If no such file exists, set to False (the default).
- model_path: The path to the file with lower/upper bounds. It can be a local path, an 📖 S3 path (when running on AWS), a path to a file resource on Google Colab (see 📖 this tutorial for an overview), or a path on the 📖 Databricks File System (when running on Azure). If pre_existing_model is True, the pre-saved bounds will be loaded from this location. If pre_existing_model is False, a file with lower/upper bounds will be saved at this location. By default, it is set to NA, indicating that there is neither a pre-saved file nor should such a file be generated.
- output_mode: Can be either "replace" or "append". If set to "replace", the original columns will be replaced with the treated columns. If set to "append", the original columns will be kept and the treated columns will be appended to the dataset. The appended columns will be named as the original column with a suffix "_outliered" (e.g., the column "cost_of_living_outliered" corresponds to the original column "cost_of_living").
🤓 Example:
outlier_detection:
  list_of_cols: all
  drop_cols: ['id_column', 'label_col']
  detection_side: upper
  detection_configs:
    pctile_lower: 0.05
    pctile_upper: 0.90
    stdev_lower: 3.0
    stdev_upper: 3.0
    IQR_lower: 1.5
    IQR_upper: 1.5
    min_validation: 2
  treatment: True
  treatment_method: value_replacement
  pre_existing_model: False
  model_path: NA
  output_mode: replace
nullColumns_detection
🔎 Corresponds to quality_checker.nullColumns_detection
- list_of_cols: List of column names (list of strings or string of column names separated by |) to be considered for null columns detection. Alternatively, if set to "all", all columns are included. If set to "missing" (the default), only columns with missing values are included. One of the use cases where "all" may be preferable over "missing" is when the user wants to save the imputation model for future use. This can be useful, for example, if a column may not have missing values in the training dataset but missing values are acceptable in the test dataset.
- drop_cols: List of column names (list of strings or string of column names separated by |) to be excluded from null columns detection.
- treatment: If False, null columns are detected and reported. If True, missing values are treated with the specified treatment_method.
- treatment_method: Specifies how null columns are treated. Possible values are "MMM", "row_removal", or "column_removal".
- treatment_configs: Additional parameters for the treatment_method. If treatment_method is "column_removal", the key treatment_threshold can be used to define the fraction of missing values above which a column is flagged as a null column and removed. If treatment_method is "MMM", possible keys are the parameters of the imputation_MMM function.
🤓 Example:
nullColumns_detection:
  list_of_cols: all
  drop_cols: ['id_column', 'label_col']
  treatment: True
  treatment_method: MMM
  treatment_configs:
    method_type: median
    pre_existing_model: False
    model_path: NA
    output_mode: replace
📑 association_evaluator
🔎 Corresponds to data_analyzer.association_evaluator
This block configures the association evaluator that focuses on understanding the interaction between different attributes or the relationship between an attribute and a binary target variable.
correlation_matrix
🔎 Corresponds to association_evaluator.correlation_matrix
- list_of_cols: List of column names (list of strings or string of column names separated by |) to include in the correlation matrix. Alternatively, when set to all, all columns are included.
- drop_cols: List of column names (list of strings or string of column names separated by |) to be excluded from the correlation matrix. This is especially useful when almost all columns should be included in the correlation matrix: set list_of_cols to all and drop the few excluded columns.
🤓 Example:
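For instance, including every column except a placeholder ID column:
correlation_matrix:
  list_of_cols: all
  drop_cols: ['id_column']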
IV_calculation
🔎 Corresponds to association_evaluator.IV_calculation
- list_of_cols: List of column names (list of strings or string of column names separated by |) to include in the IV calculation.
- drop_cols: List of column names (list of strings or string of column names separated by |) to exclude from the IV calculation.
- label_col: Name of the label or target column in the input dataset.
- event_label: Value of the event (label 1/true) in the label column.
- encoding_configs: Detailed configuration of the binning step:
  - bin_method: The binning method. Defaults to equal_frequency.
  - bin_size: The bin size. Defaults to 10.
  - monotonicity_check: If set to 1, dynamically computes the bin_size such that monotonicity is ensured. Can be a computationally expensive calculation. Defaults to 0.
🤓 Example:
IV_calculation:
  list_of_cols: all
  drop_cols: id_column
  label_col: label_col
  event_label: 'class1'
  encoding_configs:
    bin_method: equal_frequency
    bin_size: 10
    monotonicity_check: 0
IG_calculation
🔎 Corresponds to association_evaluator.IG_calculation
- list_of_cols: List of column names (list of strings or string of column names separated by |) to consider for the IG calculation.
- drop_cols: List of column names (list of strings or string of column names separated by |) to exclude from the IG calculation.
- label_col: Name of the label or target column in the input dataset.
- event_label: Value of the event (label 1/true) in the label column.
- encoding_configs: Detailed configuration of the binning step:
  - bin_method: The binning method. Defaults to equal_frequency.
  - bin_size: The bin size. Defaults to 10.
  - monotonicity_check: If set to 1, dynamically computes the bin_size such that monotonicity is ensured. Can be a computationally expensive calculation. Defaults to 0.
🤓 Example:
IG_calculation:
  list_of_cols: all
  drop_cols: id_column
  label_col: label_col
  event_label: 'class1'
  encoding_configs:
    bin_method: equal_frequency
    bin_size: 10
    monotonicity_check: 0
variable_clustering
🔎 Corresponds to association_evaluator.variable_clustering
- list_of_cols: List of column names (list of strings or string of column names separated by |) to include in variable clustering.
- drop_cols: List of column names (list of strings or string of column names separated by |) to exclude from variable clustering.
🤓 Example:
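For instance, clustering all attributes except placeholder ID and label columns:
variable_clustering:
  list_of_cols: all
  drop_cols: ['id_column', 'label_col']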
📑 drift_detector
🔎 Corresponds to drift_stability.drift_detector
This block configures the drift detector module that provides a range of methods to detect drift within and between datasets.
drift_statistics
🔎 Corresponds to drift_stability.drift_detector.statistics
configs
- list_of_cols: List of columns (list of strings or string of column names separated by |) to check for drift, i.e., to include in the drift statistics. Can be set to all to include all non-array columns (except those given in drop_cols).
- drop_cols: List of columns (list of strings or string of column names separated by |) to exclude from the drift statistics.
- method_type: Method(s) to apply to detect drift (list or string of methods separated by |). Possible values are PSI, JSD, HD, and KS. If set to all, all available metrics are calculated.
- threshold: Threshold above which attributes are flagged as exhibiting drift.
- bin_method: The binning method. Possible values are equal_frequency and equal_range.
- bin_size: The bin size. We recommend setting it to 10 to 20 for PSI and above 100 for all other metrics.
- pre_existing_source: Set to true if a pre-computed binning model as well as frequency counts and attributes are available, false otherwise.
- source_path: If pre_existing_source is true, this describes where the pre-computed data is loaded from; the drift_statistics folder at this location must contain the output from attribute_binning and frequency_counts. If pre_existing_source is False, this path can be used for saving the details. The default folder "NA" is used for saving the intermediate output.
🤓 Example:
configs:
  list_of_cols: all
  drop_cols: ['id_column', 'label_col']
  method_type: all
  threshold: 0.1
  bin_method: equal_range
  bin_size: 10
  pre_existing_source: False
  source_path: NA
source_dataset
The reference/baseline dataset.
read_dataset
- file_path: The file (or directory) path to read the source dataset from. It can be a local path, an 📖 S3 path (when running on AWS), a path to a file resource on Google Colab (see 📖 this tutorial for an overview), or a path on the 📖 Databricks File System (when running on Azure).
- file_type: The file format of the source data. Currently, Anovos supports CSV (csv), Parquet (parquet), and Avro (avro). (Please note that if you're using Avro data sources, you need to add the external package org.apache.spark:spark-avro when submitting the Spark job.)
- file_configs (optional): Options to pass to the respective Spark file reader, e.g., delimiters, schemas, headers.
delete_column
List of column names (list of strings or string of column names separated by |) to be deleted from the loaded input data.
select_column
List of column names (list of strings or string of column names separated by |) to be selected for further processing.
rename_column
- list_of_cols: List of the names of columns (list of strings or string of column names separated by |) to be renamed.
- list_of_newcols: The new column names. The first element in list_of_cols will be renamed to the first name in list_of_newcols, and so on.
recast_column
- list_of_cols: List of the names of columns (list of strings or string of column names separated by |) to be cast to a different type.
- list_of_dtypes: The new datatypes. The first element in list_of_cols will be recast to the first type in list_of_dtypes, and so on. See the 📖 Spark documentation for a list of valid datatypes. Note that this field is case-insensitive.
stability_index
🔎 Corresponds to drift_detector.stability_index_computation
configs
- metric_weightages: A dictionary where the keys are the metric names (mean, stdev, kurtosis) and the values are the weights of the metrics (between 0 and 1). All weights must sum to 1.
- existing_metric_path: Location of previously computed metrics of historical datasets (idx, attribute, mean, stdev, kurtosis, where idx is the index number of the historical datasets in chronological order).
- appended_metric_path: The path where the input dataframe's metrics are saved after they have been appended to the historical metrics.
- threshold: The threshold above which attributes are flagged as unstable.
🤓 Example:
configs:
  metric_weightages:
    mean: 0.5
    stddev: 0.3
    kurtosis: 0.2
  existing_metric_path: ''
  appended_metric_path: 'si_metrics'
  threshold: 2
dataset1
read_dataset
Corresponds to data_ingest.read_dataset
- file_path: The file (or directory) path to read this input dataset from. It can be a local path, an 📖 S3 path (when running on AWS), a path to a file resource on Google Colab (see 📖 this tutorial for an overview), or a path on the 📖 Databricks File System (when running on Azure).
- file_type: The file format of this input data. Currently, Anovos supports CSV (csv), Parquet (parquet), and Avro (avro). (Please note that if you're using Avro data sources, you need to add the external package org.apache.spark:spark-avro when submitting the Spark job.)
- file_configs (optional): Options to pass to the respective Spark file reader, e.g., delimiters, schemas, headers.
dataset2, dataset3, …
Additional datasets are configured in the same manner as dataset1.
📑 report_preprocessing
🔎 Corresponds to data_report.report_preprocessing
This configuration block describes the data pre-processing necessary for report generation.
master_path
The path where all outputs are saved.
🤓 Example:
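For instance, reusing the placeholder path that also appears in the report_generation example further below:
master_path: 'report_stats'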
charts_to_objects
🔎 Corresponds to report_preprocessing.charts_to_objects
This is the core function of the report preprocessing stage. It saves the chart data in the form of objects that are used by the subsequent report generation scripts.
See the intermediate report documentation for more details.
- list_of_cols: List of column names (list of strings or string of column names separated by |) to include in preprocessing.
- drop_cols: List of column names (list of strings or string of column names separated by |) to exclude from preprocessing.
- label_col: Name of the label or target column in the input dataset.
- event_label: Value of the event (label 1/true) in the label column.
- bin_method: The binning method. Possible values are equal_frequency and equal_range.
- bin_size: The bin size. We recommend setting it to 10 to 20 for PSI and above 100 for all other metrics.
- drift_detector: Indicates whether drift has already been analyzed. Defaults to False.
- outlier_charts: Indicates whether outlier charts should be included. Defaults to False.
- source_path: The source data path for drift analysis. If it has not been computed or is not required, set it to the default value "NA".
🤓 Example:
charts_to_objects:
  list_of_cols: all
  drop_cols: id_column
  label_col: label_col
  event_label: 'class1'
  bin_method: equal_frequency
  bin_size: 10
  drift_detector: True
  outlier_charts: False
  source_path: "NA"
📑 report_generation
🔎 Corresponds to data_report.report_generation
This configuration block controls the generation of the actual report, i.e., the data that is included and the layout. See the report generation documentation for more details.
- master_path: The path to the preprocessed data generated during the report_preprocessing step.
- id_col: The ID column present in the input dataset.
- label_col: Name of the label or target column in the input dataset.
- corr_threshold: The threshold above which attributes are considered to be correlated and thus redundant. Its value is between 0 and 1.
- iv_threshold: The threshold above which attributes are considered to be significant. Its value is between 0 and 1.

| Information Value | Variable's Predictiveness |
|---|---|
| <0.02 | Not useful for prediction |
| 0.02 to 0.1 | Weak predictive power |
| 0.1 to 0.3 | Medium predictive power |
| 0.3 to 0.5 | Strong predictive power |
| >0.5 | Suspicious predictive power |

- drift_threshold_model: The threshold above which an attribute is flagged as exhibiting drift. Its value is between 0 and 1.
- dataDict_path: The path to the data dictionary containing the exact names and definitions of the attributes. This information is used in the report to aid comprehensibility.
- metricDict_path: Path to the metric dictionary.
- final_report_path: The path where the final report will be saved. It can be a local path, an 📖 S3 path (when running on AWS), a path to a file resource on Google Colab (see 📖 this tutorial for an overview), or a path on the 📖 Databricks File System (when running on Azure).
🤓 Example:
report_generation:
  master_path: 'report_stats'
  id_col: 'id_column'
  label_col: 'label_col'
  corr_threshold: 0.4
  iv_threshold: 0.02
  drift_threshold_model: 0.1
  dataDict_path: 'data/income_dataset/data_dictionary.csv'
  metricDict_path: 'data/metric_dictionary.csv'
  final_report_path: 'report_stats'
📑 transformers
🔎 Corresponds to data_transformer.transformers
This block configures the data_transformer module that supports numerous pre-processing and transformation functions, such as binning, encoding, scaling, and imputation.
numerical_mathops
This group of functions is used to perform mathematical transformations of numerical attributes.
feature_transformation
🔎 Corresponds to transformers.feature_transformation
- list_of_cols: The numerical columns (list of strings or string of column names separated by |) to transform. Can be set to "all" to include all numerical columns.
- drop_cols: The numerical columns (list of strings or string of column names separated by |) to exclude from feature transformation.
- method_type: The method to use for the transformation. The default method is sqrt (\sqrt{x}). Possible values are:
  - ln
  - log10
  - log2
  - exp
  - powOf2 (2^x)
  - powOf10 (10^x)
  - powOfN (N^x)
  - sqrt (\sqrt{x})
  - cbrt (\sqrt[3]{x})
  - sq (x^2)
  - cb (x^3)
  - toPowerN (x^N)
  - sin, cos, tan, asin, acos, atan
  - radians
  - remainderDivByN (x % N)
  - factorial (x!)
  - mul_inv (1/x)
  - floor
  - ceil
  - roundN (round to N decimal places)
- N: None by default. If method_type is powOfN, toPowerN, remainderDivByN, or roundN, N will be used as the required constant.
🤓 Example 1:
feature_transformation:
  list_of_cols: ['capital-gain', 'capital-loss']
  drop_cols: []
  method_type: sqrt
🤓 Example 2:
feature_transformation:
  list_of_cols: ['age','education_num']
  drop_cols: []
  method_type: sq
boxcox_transformation
🔎 Corresponds to transformers.boxcox_transformation
- list_of_cols: The columns (list of strings or string of column names separated by |) to transform. Can be set to "all" to include all columns.
- drop_cols: The columns (list of strings or string of column names separated by |) to exclude from the Box-Cox transformation.
- boxcox_lambda: The \lambda value for the Box-Cox transformation. It can be given as
  - a list where each element represents the value of \lambda for a single attribute. The length of the list must be the same as the number of columns to transform.
  - a number that is used for all attributes.
  If no value is given (the default), a search for the best \lambda will be conducted among the following values: [1, -1, 0.5, -0.5, 2, -2, 0.25, -0.25, 3, -3, 4, -4, 5, -5]. The search is conducted independently for each column.
🤓 Example 1:
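A minimal sketch applying a single \lambda to all listed columns (the column names and the value 0.5 are illustrative):
boxcox_transformation:
  list_of_cols: num_feature1|num_feature2
  drop_cols: []
  boxcox_lambda: 0.5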
🤓 Example 2:
boxcox_transformation:
  list_of_cols: num_feature3|num_feature4
  drop_cols: []
  boxcox_lambda: [-2, -1]
numerical_binning
This group of functions is used to transform numerical attributes into discrete (integer or categorical) attributes.
attribute_binning
🔎 Corresponds to transformers.attribute_binning
- list_of_cols: The numerical columns (list of strings or string of column names separated by |) to transform. Can be set to "all" to include all numerical columns.
- drop_cols: The columns (list of strings or string of column names separated by |) to exclude from attribute binning.
- method_type: The binning method. Possible values are equal_frequency and equal_range. With equal_range, each bin is of equal size/width, and with equal_frequency, each bin contains an equal number of rows. Defaults to equal_range.
- bin_size: The number of bins. Defaults to 10.
- bin_dtype: The dtype of the transformed column. Possible values are numerical and categorical. With numerical, the values reflect the bin number (1, 2, …). With categorical, the values are a string that describes the minimal and maximal value of the bin. Defaults to numerical.
🤓 Example:
attribute_binning:
  list_of_cols: num_feature1|num_feature2
  drop_cols: []
  method_type: equal_frequency
  bin_size: 10
  bin_dtype: numerical
monotonic_binning
🔎 Corresponds to transformers.monotonic_binning
- list_of_cols: The numerical columns (list of strings or string of column names separated by |) to transform. Can be set to "all" to include all numerical columns.
- drop_cols: The columns (list of strings or string of column names separated by |) to exclude from monotonic binning.
- method_type: The binning method. Possible values are equal_frequency and equal_range. With equal_range, each bin is of equal size/width, and with equal_frequency, each bin contains an equal number of rows. Defaults to equal_range.
- bin_size: The number of bins. Defaults to 10.
- bin_dtype: The dtype of the transformed column. Possible values are numerical and categorical. With numerical, the values reflect the bin number (1, 2, …). With categorical, the values are a string that describes the minimal and maximal value of the bin. Defaults to numerical.
🤓 Example:
monotonic_binning:
  list_of_cols: num_feature1|num_feature2
  drop_cols: []
  label_col: ["label_col"]
  event_label: ["class1"]
  method_type: equal_frequency
  bin_size: 10
  bin_dtype: numerical
numerical_expression
expression_parser
🔎 Corresponds to transformers.expression_parser
This function can be used to evaluate a list of SQL expressions and output the result as new features. Columns used in the SQL expression must be available in the dataset.
- list_of_expr: List of expressions to evaluate as new features, e.g., ["expr1", "expr2"]. Alternatively, the expressions can be specified as a single string where the different expressions are separated by the pipe delimiter |, e.g., "expr1|expr2".
- postfix: Postfix for the new feature names. Naming convention: "f" + expression index + postfix, e.g., with a postfix of "new", the newly added features are named f0new, f1new, etc. (Default value = "").
🤓 Example 1:
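A sketch passing the two expressions described below as a list (the postfix is left at its default):
expression_parser:
  list_of_expr: ['log(age) + 1.5', 'sin(capital-gain)+cos(capital-loss)']
  postfix: ''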
🤓 Example 2:
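The same two expressions passed as a single pipe-delimited string:
expression_parser:
  list_of_expr: 'log(age) + 1.5|sin(capital-gain)+cos(capital-loss)'
  postfix: ''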
Both Example 1 and Example 2 generate 2 new features: log(age) + 1.5 and sin(capital-gain)+cos(capital-loss). The newly generated features will be appended to the dataframe as new columns: f0 and f1.
categorical_outliers
This function assigns less frequently seen values in a categorical column to a new category, others.
outlier_categories
🔎 Corresponds to transformers.outlier_categories
- list_of_cols: The categorical columns (list of strings or string of column names separated by |) to transform. Can be set to "all" to include all categorical columns.
- drop_cols: The columns (list of strings or string of column names separated by |) to exclude from the outlier categories transformation.
- coverage: The minimum fraction of rows that remain in their original category, given as a value between 0 and 1. For example, with a coverage of 0.8, the categories that 80% of the rows belong to remain and the more seldom occurring categories are mapped to others. The default value is 1.0, which means that no rows are changed to others.
- max_category: Even if the coverage allows for more, at most (max_category - 1) categories keep their actual names and the rest are mapped to others. A caveat: when multiple categories have the same rank, the number of retained categories can exceed max_category. Defaults to 50.
🤓 Example 1:
outlier_categories:
  list_of_cols: all
  drop_cols: ['id_column', 'label_col']
  coverage: 0.9
  max_category: 20
🤓 Example 2:
outlier_categories:
  list_of_cols: ["cat_feature1", "cat_feature2"]
  drop_cols: []
  coverage: 0.8
  max_category: 10
outlier_categories:
  list_of_cols: ["cat_feature3", "cat_feature4"]
  drop_cols: []
  coverage: 0.9
  max_category: 15
categorical_encoding
This group of transformer functions is used to convert a categorical attribute into numerical attribute(s).
cat_to_num_unsupervised
🔎 Corresponds to transformers.cat_to_num_unsupervised
- list_of_cols: The categorical columns (list of strings or string of column names separated by |) to encode. Can be set to "all" to include all categorical columns.
- drop_cols: The columns (list of strings or string of column names separated by |) to exclude from categorical encoding.
- method_type: The encoding method. Set to 1 for label encoding and to 0 for one-hot encoding. With label encoding, each categorical value is assigned a unique integer based on the ordering specified through index_order. With one-hot encoding, each categorical value will be represented by a binary column. Defaults to 1 (label encoding).
- index_order: The order assigned to the categorical values when method_type is set to 1 (label encoding). Possible values are:
  - frequencyDesc (default): Order by descending frequency.
  - frequencyAsc: Order by ascending frequency.
  - alphabetDesc: Order alphabetically (descending).
  - alphabetAsc: Order alphabetically (ascending).
- cardinality_threshold: Columns with a cardinality above this threshold are excluded from encoding. Defaults to 100.
🤓 Example 1:
cat_to_num_unsupervised:
  list_of_cols: all
  drop_cols: ['id_column']
  method_type: 0
  cardinality_threshold: 10
🤓 Example 2:
cat_to_num_unsupervised:
  list_of_cols: ["cat_feature1", "cat_feature2"]
  drop_cols: []
  method_type: 0
  cardinality_threshold: 10
cat_to_num_unsupervised:
  list_of_cols: ["cat_feature3", "cat_feature4"]
  drop_cols: []
  method_type: 1
cat_to_num_supervised
🔎 Corresponds to transformers.cat_to_num_supervised
- list_of_cols: The categorical columns (list of strings or string of column names separated by |) to encode. Can be set to "all" to include all categorical columns.
- drop_cols: The columns (list of strings or string of column names separated by |) to exclude from categorical encoding.
- label_col: The label/target column. Defaults to label.
- event_label: Value of the (positive) event (i.e., label 1/true). Defaults to 1.
🤓 Example:
cat_to_num_supervised:
  list_of_cols: cat_feature1 | cat_feature2
  drop_cols: ['id_column']
  label_col: income
  event_label: '>50K'
numerical_rescaling
Group of functions to rescale numerical attributes.
normalization
🔎 Corresponds to transformers.normalization
- list_of_cols: The numerical columns (list of strings or string of column names separated by |) to normalize. Can be set to "all" to include all numerical columns.
- drop_cols: The columns (list of strings or string of column names separated by |) to exclude from normalization.
🤓 Example:
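For instance, with placeholder column names:
normalization:
  list_of_cols: ["num_feature1", "num_feature2"]
  drop_cols: []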
z_standardization
🔎 Corresponds to transformers.z_standardization
- list_of_cols: The numerical columns (list of strings or string of column names separated by |) to standardize. Can be set to "all" to include all numerical columns.
- drop_cols: The columns (list of strings or string of column names separated by |) to exclude from standardization.
🤓 Example:
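For instance, standardizing all numerical columns except placeholder ID and label columns:
z_standardization:
  list_of_cols: all
  drop_cols: ['id_column', 'label_col']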
IQR_standardization
🔎 Corresponds to transformers.IQR_standardization
- list_of_cols: The numerical columns (list of strings or string of column names separated by |) to standardize. Can be set to "all" to include all numerical columns.
- drop_cols: The columns (list of strings or string of column names separated by |) to exclude from standardization.
🤓 Example:
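For instance, with placeholder column names:
IQR_standardization:
  list_of_cols: all
  drop_cols: ['id_column', 'label_col']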
numerical_latentFeatures
Group of functions to generate latent features to reduce the dimensionality of the input dataset.
PCA_latentFeatures
🔎 Corresponds to transformers.PCA_latentFeatures
- list_of_cols: The numerical columns (list of strings or string of column names separated by |) to include. Can be set to "all" to include all numerical columns.
- drop_cols: The columns (list of strings or string of column names separated by |) to exclude from the latent features computation.
- explained_variance_cutoff: The required explained variance cutoff. Determines the number of encoded columns in the output: if N is the smallest integer such that the top N encoded columns explain more than the given variance threshold, these N columns will be selected. Defaults to 0.95.
- standardization: If True (the default), standardization is applied. False otherwise.
- standardization_configs: The arguments for the z_standardization function in dictionary format. Defaults to {"pre_existing_model": False}.
- imputation: If True, imputation is applied. False (the default) otherwise.
- imputation_configs: Configuration for imputation in dictionary format. The name of the imputation function is specified with the key imputation_name (defaults to imputation_MMM). Arguments for the imputation function can be passed using additional keys.
🤓 Example 1:
PCA_latentFeatures:
  list_of_cols: ["num_feature1", "num_feature2", "num_feature3"]
  explained_variance_cutoff: 0.95
  standardization: False
  imputation: True
🤓 Example 2:
PCA_latentFeatures:
  list_of_cols: ["num_feature1", "num_feature2", "num_feature3"]
  explained_variance_cutoff: 0.8
  standardization: False
  imputation: True
PCA_latentFeatures:
  list_of_cols: ["num_feature4", "num_feature5", "num_feature6"]
  explained_variance_cutoff: 0.6
  standardization: True
  imputation: True
autoencoder_latentFeatures
🔎 Corresponds to transformers.autoencoder_latentFeatures
- list_of_cols: The numerical columns (list of strings or string of column names separated by |) to include. Can be set to "all" to include all numerical columns.
- drop_cols: The columns (list of strings or string of column names separated by |) to exclude from the latent features computation.
- reduction_params: Determines the number of resulting encoded features. If reduction_params is below 1, reduction_params * "number of columns" columns will be generated. Else, reduction_params columns will be generated. Defaults to 0.5, i.e., the number of columns in the result is half the number of columns in the input.
- sample_size: Maximum number of rows used for training the autoencoder model. Defaults to 500000 (5e5).
- epochs: The number of epochs to train the autoencoder model. Defaults to 100.
- batch_size: The batch size for autoencoder model training. Defaults to 256.
- standardization: If True (the default), standardization is applied. False otherwise.
- standardization_configs: The arguments for the z_standardization function in dictionary format. Defaults to {"pre_existing_model": False}.
- imputation: If True, imputation is applied. False (the default) otherwise.
- imputation_configs: Configuration for imputation in dictionary format. The name of the imputation function is specified with the key imputation_name (defaults to imputation_MMM). Arguments for the imputation function can be passed using additional keys.
🤓 Example 1:
autoencoder_latentFeatures:
  list_of_cols: ["num_feature1", "num_feature2", "num_feature3"]
  reduction_params: 0.5
  sample_size: 10000
  epochs: 20
  batch_size: 256
🤓 Example 2:
autoencoder_latentFeatures:
  list_of_cols: ["num_feature1", "num_feature2"]
  reduction_params: 0.5
  sample_size: 10000
  epochs: 20
  batch_size: 256
autoencoder_latentFeatures:
  list_of_cols: ["num_feature3", "num_feature4", "num_feature5", "num_feature6", "num_feature7"]
  reduction_params: 0.8
  sample_size: 10000
  epochs: 100
  batch_size: 256
📑 write_intermediate
- file_path: Path where intermediate datasets (after selecting, dropping, renaming, and recasting of columns) for the quality checker operations, join_dataset, and concatenate_dataset will be saved.
- file_type: The file format (csv, parquet, or avro) of the intermediate datasets.
- file_configs (optional): Any remaining valid writer configuration can be passed through this option, e.g., repartition, mode, compression, header, delimiter, inferSchema. This might look like:
🤓 Example:
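For instance (the path and option values are illustrative):
write_intermediate:
  file_path: "intermediate_data"
  file_type: csv
  file_configs:
    mode: overwrite
    header: True
    delimiter: ","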
For more information on available configuration options, see the following external documentation:
📑 write_main
- file_path: Path where the final cleaned input dataset will be saved.
- file_type: The file format (csv, parquet, or avro) of the final dataset.
- file_configs (optional): Any remaining valid writer configuration can be passed through this option, e.g., repartition, mode, compression, header, delimiter, inferSchema. This might look like:
🤓 Example:
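For instance (the path is a placeholder):
write_main:
  file_path: "output"
  file_type: parquet
  file_configs:
    mode: overwrite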
For more information on available configuration options, see the following external documentation:
📑 write_stats
- file_path: Path where all tables/stats of the Anovos modules (data drift & data analyzer) will be saved.
- file_type: The file format (csv, parquet, or avro) of the stats datasets.
- file_configs (optional): Any remaining valid writer configuration can be passed through this option, e.g., repartition, mode, compression, header, delimiter, inferSchema. This might look like:
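For instance (the path is a placeholder):
write_stats:
  file_path: "stats"
  file_type: parquet
  file_configs:
    mode: overwrite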
For more information on available configuration options, see the following external documentation:
📑 write_feast_features
🔎 Corresponds to feature_store/feast_exporter.generate_feature_description
📖 For details, see the Feature Store Integration documentation
- file_path: The path to the feast repository where the generated feature definitions will be stored.
- entity: The YAML block to configure the definition of a feast entity.
  - name: The name of the feast entity.
  - description: A human-readable description of the entity.
  - id_col: Defines the identifying column of the Anovos dataframe which will be used as an ID in feast.
- file_source: The YAML block to configure the definition of a feast file source.
  - description: A human-readable description of the file source.
  - owner: The email of the owner of this file source.
  - timestamp_col: The name of the logical timestamp at which the feature was observed.
  - create_timestamp_col: The name of the physical timestamp (wallclock time) of when the feature value was computed.
- feature_view: The YAML block to configure the definition of a feast feature view.
  - name: The name of the feature view.
  - owner: The email of the owner of this feature view.
  - ttl_in_seconds: The time to live in seconds for features in this view. Feast will use this value to look backwards when performing point-in-time joins.
- service_name (optional): The name of the feature service generated by the workflow.
🤓 Example:
write_feast_features:
  file_path: "../anovos_repo/"
  entity:
    name: "income"
    description: "this entity is a ...."
    id_col: 'ifa'
  file_source:
    description: 'data source description'
    owner: "me@business.com"
    timestamp_col: 'event_time'
    create_timestamp_col: 'create_time_col'
  feature_view:
    name: 'income_view'
    owner: 'view@owner.com'
    ttl_in_seconds: 36000000
  service_name: 'income_feature_service'