Using Anovos at Scale

Anovos is built for feature engineering and data processing at scale. The library was developed for and tested on Mobilewalla's mobile engagement data, which has the following attributes:

Property	Value
Size	50 GB
No. of Rows	384,694,946
No. of Columns	35
No. of Numerical Columns	4
No. of Categorical Columns	31

⏱ Benchmark

To benchmark Anovos' performance, we ran a pipeline on this dataset.

The entire pipeline was optimized so that computed statistics could be reused by other functions as much as possible. For example, the modes (the most frequently seen values) computed by the `measures_of_centralTendency` function were reused to impute null values in a column with `nullColumns_detection` or to detect a column's biasedness with `biasedness_detection`.

Hence, the time recorded for a function in the benchmark (Pipeline Mode) might differ significantly from the time taken by the same function when running in isolation (Standalone Mode).

In addition, Apache Spark applies its own optimizations to transformations under the hood when multiple functions run together, which adds to the time difference.
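The reuse of computed statistics described above can be illustrated with a minimal sketch in plain Python. This is not Anovos' API: the helper names (`compute_modes`, `impute_nulls`, `biased_columns`), the sample data, and the 0.8 threshold are assumptions made for illustration only. The point is that the modes are computed once and then passed to both downstream steps instead of being recomputed by each.

```python
from collections import Counter


def compute_modes(rows, columns):
    """Return the most frequent non-null value for each column."""
    modes = {}
    for col in columns:
        counts = Counter(row[col] for row in rows if row[col] is not None)
        modes[col] = counts.most_common(1)[0][0] if counts else None
    return modes


def impute_nulls(rows, columns, modes=None):
    """Replace nulls with the column mode; recompute modes only if none are given."""
    if modes is None:
        modes = compute_modes(rows, columns)
    return [
        {**row, **{c: modes[c] for c in columns if row[c] is None}}
        for row in rows
    ]


def biased_columns(rows, columns, modes=None, threshold=0.8):
    """Flag columns whose mode covers at least `threshold` of the non-null values."""
    if modes is None:
        modes = compute_modes(rows, columns)
    flagged = []
    for col in columns:
        values = [row[col] for row in rows if row[col] is not None]
        if values and values.count(modes[col]) / len(values) >= threshold:
            flagged.append(col)
    return flagged


# Hypothetical sample data, not taken from the benchmark dataset.
rows = [
    {"os": "android", "country": "US"},
    {"os": "android", "country": None},
    {"os": "android", "country": "IN"},
    {"os": None, "country": "US"},
    {"os": "android", "country": "US"},
]
columns = ["os", "country"]

modes = compute_modes(rows, columns)                  # computed once ...
imputed = impute_nulls(rows, columns, modes=modes)    # ... reused here
flagged = biased_columns(rows, columns, modes=modes)  # ... and here
```

Calling `impute_nulls` or `biased_columns` without the `modes` argument corresponds to running a function in isolation, where the statistic has to be computed from scratch each time.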

Function	Pipeline Mode (mins)	Standalone Mode (mins)
`global_summary`	1	1
`measures_of_counts`	5	5
`measures_of_centralTendency`	30	30
`measures_of_cardinality`	49	32
`measures_of_percentiles`	1	1
`measures_of_dispersion`	1	1
`measures_of_shape`	3	3
`duplicate_detection`	5	5
`nullRows_detection`	4	5
`invalidEntries_detection`	15	41
`IDness_detection`	2	8
`biasedness_detection`	2	28
`outlier_detection`	4	8
`nullColumns_detection`	2	63
`variable_clustering`	2	3
`IV_calculation`*	8	9
`IG_calculation`*	6	8
* A binary categorical column was selected as the target variable to test these functions.
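To put the two columns of the table above side by side, the short sketch below sums the recorded times and finds the function with the largest difference. The numbers are copied from the table. Summing them assumes the functions ran one after another, which the benchmark description does not state explicitly, so the totals are a rough comparison only.

```python
# (function, pipeline minutes, standalone minutes), copied from the table above
timings = [
    ("global_summary", 1, 1),
    ("measures_of_counts", 5, 5),
    ("measures_of_centralTendency", 30, 30),
    ("measures_of_cardinality", 49, 32),
    ("measures_of_percentiles", 1, 1),
    ("measures_of_dispersion", 1, 1),
    ("measures_of_shape", 3, 3),
    ("duplicate_detection", 5, 5),
    ("nullRows_detection", 4, 5),
    ("invalidEntries_detection", 15, 41),
    ("IDness_detection", 2, 8),
    ("biasedness_detection", 2, 28),
    ("outlier_detection", 4, 8),
    ("nullColumns_detection", 2, 63),
    ("variable_clustering", 2, 3),
    ("IV_calculation", 8, 9),
    ("IG_calculation", 6, 8),
]

pipeline_total = sum(pipeline for _, pipeline, _ in timings)
standalone_total = sum(standalone for _, _, standalone in timings)

# Function whose standalone time exceeds its pipeline time by the most
largest_saving = max(timings, key=lambda t: t[2] - t[1])

# Functions that were slower inside the pipeline than in isolation
slower_in_pipeline = [name for name, p, s in timings if p > s]

print(f"Pipeline total:   {pipeline_total} min")    # 140 min
print(f"Standalone total: {standalone_total} min")  # 251 min
print(f"Largest saving:   {largest_saving[0]}")     # nullColumns_detection
print(f"Slower in pipeline: {slower_in_pipeline}")  # ['measures_of_cardinality']
```

Most of the difference comes from the functions that can reuse previously computed statistics, such as `nullColumns_detection`, `biasedness_detection`, and `invalidEntries_detection`.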

To see whether the library works with a large number of attributes, we also tested horizontal scaling on a different dataset with the following attributes:

Property	Value
Size	15 GB
No. of Rows	40,507,005
No. of Columns	284
No. of Numerical Columns	252
No. of Categorical Columns	23

Function	Time (mins)
`global_summary`	0.2
`measures_of_counts`	3
`measures_of_centralTendency`	9
`measures_of_cardinality`	12
`measures_of_percentiles`	7
`measures_of_dispersion`	9
`measures_of_shape`	5
`duplicate_detection`	2
`nullRows_detection`	4
`invalidEntries_detection`	9
`IDness_detection`	2
`biasedness_detection`	2
`outlier_detection`	85
`nullColumns_detection`	3
`cat_to_num_unsupervised`	4
`cat_to_num_supervised`	2
`z_standardization`	6
`IQR_standardization`	3
`normalization`	6
`PCA_latentFeatures`	20

Limitations

For current performance limitations, see the dedicated overview of Anovos' limitations.