
Using Anovos at Scale

Anovos is built for feature engineering and data processing at scale. The library was built for and tested on Mobilewalla's mobile engagement data with the following attributes:

| Property | Value |
|---|---|
| Size | 50 GB |
| No. of Rows | 384,694,946 |
| No. of Columns | 35 |
| No. of Numerical Columns | 4 |
| No. of Categorical Columns | 31 |

⏱ Benchmark

To benchmark Anovos' performance, we ran a pipeline on this dataset.

The entire pipeline was optimized so that computed statistics could be reused by other functions as much as possible. For example, the modes (the most frequently seen values) computed by the measures_of_centralTendency function were also used for imputation when treating null values in a column with nullColumns_detection and for detecting a column's biasedness with biasedness_detection.

Hence, the time recorded for a function in the benchmark (Pipeline Mode) might differ significantly from the time taken by the same function when run in isolation (Standalone Mode).

In addition, Apache Spark applies its own optimizations to transformations under the hood when multiple functions run together, which further adds to the time difference.
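The reuse of computed statistics described above can be illustrated with a minimal plain-Python sketch. It does not use the Anovos API: the toy data and the functions `compute_modes`, `impute_nulls`, and `mode_share` are hypothetical stand-ins, chosen only to show how a statistic computed once can feed both null imputation and a biasedness check.

```python
from collections import Counter

# Hypothetical toy data: column name -> list of values (None marks a null).
DATA = {
    "device_os": ["android", "ios", "android", None, "android"],
    "country": ["US", None, "IN", "IN", "IN"],
}

# Counts how often each column's mode is actually computed.
mode_calls = Counter()


def compute_modes(data):
    """Compute the mode of every column once (the expensive step)."""
    modes = {}
    for col, values in data.items():
        mode_calls[col] += 1
        non_null = [v for v in values if v is not None]
        modes[col] = Counter(non_null).most_common(1)[0][0]
    return modes


def impute_nulls(data, modes):
    """Replace nulls with the precomputed mode instead of recomputing it."""
    return {
        col: [modes[col] if v is None else v for v in values]
        for col, values in data.items()
    }


def mode_share(data, modes):
    """Fraction of non-null rows equal to the mode (a simple biasedness measure)."""
    shares = {}
    for col, values in data.items():
        non_null = [v for v in values if v is not None]
        shares[col] = sum(v == modes[col] for v in non_null) / len(non_null)
    return shares


modes = compute_modes(DATA)          # computed once
imputed = impute_nulls(DATA, modes)  # reused for imputation
shares = mode_share(DATA, modes)     # reused for the biasedness check
```

In this sketch the mode of each column is computed exactly once, although two downstream steps depend on it. Run in isolation, each of those steps would have to compute the modes itself, which is the kind of overhead that separates Standalone Mode from Pipeline Mode.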

| Function | Pipeline Mode (mins) | Standalone Mode (mins) |
|---|---|---|
| global_summary | 1 | 1 |
| measures_of_counts | 5 | 5 |
| measures_of_centralTendency | 30 | 30 |
| measures_of_cardinality | 49 | 32 |
| measures_of_percentiles | 1 | 1 |
| measures_of_dispersion | 1 | 1 |
| measures_of_shape | 3 | 3 |
| duplicate_detection | 5 | 5 |
| nullRows_detection | 4 | 5 |
| invalidEntries_detection | 15 | 41 |
| IDness_detection | 2 | 8 |
| biasedness_detection | 2 | 28 |
| outlier_detection | 4 | 8 |
| nullColumns_detection | 2 | 63 |
| variable_clustering | 2 | 3 |
| IV_calculation* | 8 | 9 |
| IG_calculation* | 6 | 8 |

* A binary categorical column was selected as a target variable to test this function.

To check whether the library also handles a large number of attributes, we ran a horizontal scaling test on a different dataset with the following attributes:

| Property | Value |
|---|---|
| Size | 15 GB |
| No. of Rows | 40,507,005 |
| No. of Columns | 284 |
| No. of Numerical Columns | 252 |
| No. of Categorical Columns | 23 |

| Function | Time (mins) |
|---|---|
| global_summary | 0.2 |
| measures_of_counts | 3 |
| measures_of_centralTendency | 9 |
| measures_of_cardinality | 12 |
| measures_of_percentiles | 7 |
| measures_of_dispersion | 9 |
| measures_of_shape | 5 |
| duplicate_detection | 2 |
| nullRows_detection | 4 |
| invalidEntries_detection | 9 |
| IDness_detection | 2 |
| biasedness_detection | 2 |
| outlier_detection | 85 |
| nullColumns_detection | 3 |
| cat_to_num_unsupervised | 4 |
| cat_to_num_supervised | 2 |
| z_standardization | 6 |
| IQR_standardization | 3 |
| normalization | 6 |
| PCA_latentFeatures | 20 |

Limitations

For current performance limitations, see the dedicated overview of Anovos' limitations.