Current Limitations of Anovos

The current 1.0 release of Anovos still has some limitations, which we will address in upcoming releases. To learn more about what's on the horizon, check out our roadmap.

🔣 Data

  • Anovos currently supports numerical, categorical, geospatial, and datetime/timestamp columns (at the cross-sectional and transactional level). We plan to add support for additional data types such as (struct) arrays in the future.

  • Anovos currently relies on Apache Spark's automatic schema detection. If numerical columns were deliberately saved as strings, they will show up as categorical columns when loaded into a DataFrame (except for CSV files).
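
One way to spot numerical columns that were stored as strings is to check whether a small sample of their values parses as numbers. The following is a minimal sketch in plain Python, not part of Anovos: the helper names `looks_numeric` and `numeric_string_columns`, the `min_fraction` parameter, and the list-of-dicts sample format are all assumptions made for illustration.

```python
def looks_numeric(values, min_fraction=1.0):
    """Return True if at least `min_fraction` of the non-empty values parse as float.

    Hypothetical helper for illustration. Note that Python's float() also
    accepts strings such as "nan" and "inf".
    """
    non_empty = [v for v in values if v is not None and str(v).strip() != ""]
    if not non_empty:
        return False
    parsed = 0
    for value in non_empty:
        try:
            float(value)
            parsed += 1
        except (TypeError, ValueError):
            pass
    return parsed / len(non_empty) >= min_fraction


def numeric_string_columns(sample_rows, min_fraction=1.0):
    """Return the sorted names of string-typed columns whose values look numeric.

    `sample_rows` is assumed to be a list of dicts mapping column name to value,
    for example a handful of rows collected from a DataFrame.
    """
    columns = {}
    for row in sample_rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return sorted(
        name
        for name, values in columns.items()
        # Only consider columns that hold strings (or nulls) throughout.
        if all(isinstance(v, str) or v is None for v in values)
        and looks_numeric(values, min_fraction)
    )


rows = [
    {"age": "42", "city": "Berlin", "score": "3.5"},
    {"age": "17", "city": "Pune", "score": None},
]
candidates = numeric_string_columns(rows)
print(candidates)
```

Columns flagged this way could then be cast to a numerical type in Spark before the data is passed to Anovos, for instance with `DataFrame.withColumn` and `Column.cast("double")`.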

🏎 Performance

  • Computing the mode and distinct value counts are the most expensive operations in Anovos. We aim to further optimize them in upcoming releases.

  • The correlation matrix currently supports only numerical data. Support for categorical data was removed due to performance concerns and will return in a later release.

  • The invalid entries detection may yield false positives. Hence, be cautious when using the inbuilt treatment option.

  • The categorical encoding functions cat_to_num_supervised and cat_to_num_unsupervised may exhibit poor performance and scaling behavior with very high-cardinality columns. We therefore recommend reducing the cardinality of such columns before encoding them, or specifying an appropriate threshold to drop them from the analysis during encoding.

  • The sample size for constructing the imputation models in imputation_sklearn or creating latent features through autoencoder_latentFeatures should be selected with caution, taking into consideration the dataset size and the number of columns. The sample is converted into a pandas DataFrame, and subsequent operations run on a single node (the driver). If the sample is too large to fit into the driver's memory, this will result in an out-of-memory error.
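
To get a feel for how large a sample the driver can hold, a back-of-the-envelope estimate can help. The sketch below is not part of Anovos; the function names, the 8 bytes per value (a float64 cell), and the 25% safety fraction are assumptions chosen for illustration. String columns and intermediate copies made during model fitting usually need considerably more memory.

```python
def estimate_pandas_bytes(n_rows, n_cols, bytes_per_value=8):
    """Rough size of a purely numerical pandas DataFrame (assumes 8-byte values)."""
    return n_rows * n_cols * bytes_per_value


def max_sample_rows(driver_memory_bytes, n_cols, bytes_per_value=8, safety_fraction=0.25):
    """Largest row count whose estimated size stays within a share of driver memory.

    `safety_fraction` is an assumed headroom factor, not a recommended value.
    """
    if n_cols <= 0:
        raise ValueError("n_cols must be positive")
    budget = driver_memory_bytes * safety_fraction
    return int(budget // (n_cols * bytes_per_value))


def sample_fraction(total_rows, max_rows):
    """Fraction of the full dataset to sample, capped at 1.0."""
    if total_rows <= 0:
        raise ValueError("total_rows must be positive")
    return min(1.0, max_rows / total_rows)


# Example: a driver with 4 GiB of memory and a dataset with 100 columns.
driver_memory = 4 * 1024**3
row_limit = max_sample_rows(driver_memory, n_cols=100)
fraction = sample_fraction(total_rows=10_000_000, max_rows=row_limit)
print(row_limit, round(fraction, 4))
```

Under these assumptions, a sample of roughly 1.3 million rows (about 13% of a 10-million-row dataset) would stay within a quarter of the driver's memory. Treat the result as an upper bound to start from, not as a guarantee.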

🔩 Other