Using Anovos at Scale
Anovos is built for feature engineering and data processing at scale. The library was developed for and tested on Mobilewalla's mobile engagement data, a dataset with the following attributes:
Property | Value |
---|---|
Size | 50 GB |
No. of Rows | 384,694,946 |
No. of Columns | 35 |
No. of Numerical Columns | 4 |
No. of Categorical Columns | 31 |
⏱ Benchmark
To benchmark Anovos' performance, we ran a pipeline on this dataset.
The entire pipeline was optimized such that the computed statistics could be reused by other functions as much as possible.
For example, the modes (the most frequently seen values) computed by the measures_of_centralTendency function were also used for imputation while treating null values in a column with nullColumns_detection or detecting a column's biasedness using biasedness_detection.
Hence, the time recorded for a function in the benchmark (Pipeline Mode) might differ significantly from the time taken by the same function when run in isolation (Standalone Mode).
In addition, Apache Spark applies its own optimizations to transformations under the hood when multiple functions run together, which contributes to the time difference.
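The reuse idea described above can be illustrated with a minimal sketch in plain Python. This is not Anovos' implementation or API: the helper names (`compute_modes`, `impute_with_mode`, `biased_columns`), the sample rows, and the 0.8 threshold are all hypothetical and chosen only for illustration. The point is that the per-column mode is computed once and then shared by two downstream steps instead of being recomputed by each.

```python
from collections import Counter


def compute_modes(rows, columns):
    """Return {column: (mode, count)} from a single pass over the rows (hypothetical helper)."""
    counters = {col: Counter() for col in columns}
    for row in rows:
        for col in columns:
            value = row.get(col)
            if value is not None:  # nulls do not count towards the mode
                counters[col][value] += 1
    return {
        col: (counter.most_common(1)[0] if counter else (None, 0))
        for col, counter in counters.items()
    }


def impute_with_mode(rows, mode_stats):
    """Fill null values using the precomputed modes; nothing is recomputed here."""
    return [
        {
            col: mode_stats[col][0] if val is None and col in mode_stats else val
            for col, val in row.items()
        }
        for row in rows
    ]


def biased_columns(mode_stats, n_rows, threshold=0.8):
    """Flag columns whose mode covers at least `threshold` of all rows (assumed rule)."""
    return [
        col
        for col, (_, count) in mode_stats.items()
        if n_rows and count / n_rows >= threshold
    ]


# Made-up sample data, for illustration only
rows = [
    {"os": "android", "country": "US"},
    {"os": "android", "country": None},
    {"os": "android", "country": "IN"},
    {"os": "android", "country": "US"},
    {"os": "android", "country": "BR"},
    {"os": None, "country": "US"},
]

# The expensive statistic is computed once ...
modes = compute_modes(rows, ["os", "country"])

# ... and reused by both downstream steps.
imputed = impute_with_mode(rows, modes)
biased = biased_columns(modes, n_rows=len(rows))

print(modes)
print(biased)
```

In a Spark pipeline, the shared statistic would typically be a small cached or persisted result rather than an in-memory dictionary, but the time saving comes from the same place: the downstream functions skip a full pass over the data.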
Function | Pipeline Mode (mins) | Standalone Mode (mins) |
---|---|---|
global_summary | 1 | 1 |
measures_of_counts | 5 | 5 |
measures_of_centralTendency | 30 | 30 |
measures_of_cardinality | 49 | 32 |
measures_of_percentiles | 1 | 1 |
measures_of_dispersion | 1 | 1 |
measures_of_shape | 3 | 3 |
duplicate_detection | 5 | 5 |
nullRows_detection | 4 | 5 |
invalidEntries_detection | 15 | 41 |
IDness_detection | 2 | 8 |
biasedness_detection | 2 | 28 |
outlier_detection | 4 | 8 |
nullColumns_detection | 2 | 63 |
variable_clustering | 2 | 3 |
IV_calculation * | 8 | 9 |
IG_calculation * | 6 | 8 |

* A binary categorical column was selected as a target variable to test these functions.
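
To get a rough sense of the overall difference between the two modes, the figures from the table above can be aggregated with a few lines of Python. This is only a sketch: the timings are copied from the table, while the variable names and the choice of summary metrics are our own assumptions.

```python
# (function, pipeline_mins, standalone_mins), copied from the benchmark table above
BENCHMARK = [
    ("global_summary", 1, 1),
    ("measures_of_counts", 5, 5),
    ("measures_of_centralTendency", 30, 30),
    ("measures_of_cardinality", 49, 32),
    ("measures_of_percentiles", 1, 1),
    ("measures_of_dispersion", 1, 1),
    ("measures_of_shape", 3, 3),
    ("duplicate_detection", 5, 5),
    ("nullRows_detection", 4, 5),
    ("invalidEntries_detection", 15, 41),
    ("IDness_detection", 2, 8),
    ("biasedness_detection", 2, 28),
    ("outlier_detection", 4, 8),
    ("nullColumns_detection", 2, 63),
    ("variable_clustering", 2, 3),
    ("IV_calculation", 8, 9),
    ("IG_calculation", 6, 8),
]

total_pipeline = sum(pipeline for _, pipeline, _ in BENCHMARK)
total_standalone = sum(standalone for _, _, standalone in BENCHMARK)

# Minutes saved per function in Pipeline Mode (negative: slower in the pipeline)
savings = {name: standalone - pipeline for name, pipeline, standalone in BENCHMARK}
largest_gain = max(savings, key=savings.get)
slower_in_pipeline = [name for name, diff in savings.items() if diff < 0]

print(f"Pipeline Mode total:   {total_pipeline} mins")
print(f"Standalone Mode total: {total_standalone} mins")
print(f"Largest gain from reuse: {largest_gain} ({savings[largest_gain]} mins)")
print(f"Slower in the pipeline: {slower_in_pipeline}")
```

With the figures above, the totals come to 140 minutes in Pipeline Mode versus 251 minutes when the standalone timings are summed. The functions that depend on previously computed statistics, such as nullColumns_detection, account for most of the gap, while measures_of_cardinality is the one function that was recorded as slower in the pipeline.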
To see whether the library also works with a large number of attributes, we tested horizontal scaling on a different dataset with the following attributes:
Property | Value |
---|---|
Size | 15 GB |
No. of Rows | 40,507,005 |
No. of Columns | 284 |
No. of Numerical Columns | 252 |
No. of Categorical Columns | 23 |
Limitations
For current performance limitations, see the dedicated overview of Anovos' limitations.