Setting up Anovos locally
There are several ways to set up and run Anovos locally:
- 🐋 Running Anovos workloads through Docker
You can run Anovos workloads defined in a configuration file using the `anovos-worker` Docker image, without having to set up Spark or anything else. This is the recommended option for developing and testing Anovos workloads without access to a Spark cluster. If you're just starting with Anovos and don't have any special requirements, pick this option.
- 🐍 Using Anovos as a Python library
There are two ways to install Anovos to use it as a Python library in your own code:
- Installation through `pip`. If you need more fine-grained control than the configuration file offers, or you have a way to execute Spark jobs, this is likely the best option for you.
- Cloning the GitHub repository. This is recommended if you would like to get access to the latest development version. It is also the way to go if you would like to build custom wheel files.
🐋 Running Anovos workloads through Docker
💿 Software Prerequisites
Running Anovos workloads through Docker requires Python 3.x, a Bash-compatible shell, and Docker. (See the Docker documentation for instructions on how to install Docker on your machine.)
At the moment, you need to download two scripts from the Anovos GitHub repository. You can either download the scripts individually:
Or you can clone the entire Anovos GitHub repository, which will also give you access to example configurations:
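A minimal clone, assuming the repository's public location at `github.com/anovos/anovos`:

```shell
# Clone the Anovos repository (includes the local/ scripts and example configs)
git clone https://github.com/anovos/anovos.git
cd anovos
```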
In both cases, you will have a folder named `local` that contains the `run_workload.sh` shell script and the `rewrite_configuration.py` Python script.
Launching an Anovos run
To run an Anovos workload defined in a configuration file, execute `run_workload.sh` and pass the name of the configuration file as the first parameter.
All paths to input data in the configuration file have to be given relative to the directory you are calling `run_workload.sh` from. For example, the configuration for the basic demo workload expects its input data in the `data` directory. Hence, we need to ensure that the `data` directory is a subdirectory of the directory we are launching the workload from. Otherwise, the Anovos process inside the `anovos-worker` Docker container won't be able to access it.
The following command will run the basic demo workflow included in the Anovos repository on Spark 3.2.2:
```bash
# enter the root folder of the repository
cd anovos

# place the input data that is processed by the basic demo at the location
# expected by the configuration
mkdir data && cp ./examples/data/income_dataset ./data

# launch the anovos-worker container
./local/run_workload.sh config/configs_basic.yaml 3.2.2
```
Once processing has finished, you will find the output in a folder `output` within the directory you called `run_workload.sh` from.
💡 Note that the `anovos-worker` images provided through Docker Hub do not support
the Feast or MLflow integrations out of the box,
as they require interaction with third-party components, access to network communication,
and/or interaction with the file system beyond the pre-configured paths.
You can find the list of currently unsupported configuration blocks at the top of `rewrite_configuration.py`.
If you try to run an Anovos workload that uses unsupported features,
you will receive an error message and no `anovos-worker` container will be launched.
If you would like, you can build and configure a custom pre-processing and launch script by adapting
the files in anovos/local to your specific needs.
For example, for Feast you will likely want to configure an additional volume or bind mount in
run_workload.sh, whereas MLflow requires some network configuration.
Specifying the Spark and Anovos versions
You can optionally define the Anovos version by adding it as a third parameter:
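For example, passing the configuration file, the Spark version, and the Anovos version in that order:

```shell
# config file, Spark version, then (optionally) the Anovos version
./local/run_workload.sh config/configs_basic.yaml 3.2.2 1.0.1
```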
This will use Anovos 1.0.1 on Spark 3.2.2. If no version for Anovos is given, the latest release available for the specified Spark version will be used.
Please note that the corresponding
anovos-worker image has to be available on
Docker Hub for this to work out of the box.
If you need a specific configuration not available as a pre-built image,
you can follow the instructions here to build your own `anovos-worker` image.
In that case, you can then launch
run_workload.sh without specifying the Spark or Anovos version:
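With a locally built image, the invocation reduces to the configuration file alone:

```shell
# uses the locally built anovos-worker image
./local/run_workload.sh config/configs_basic.yaml
```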
🐍 Using Anovos as a Python library
💿 Software Prerequisites
Anovos requires Spark, Python, and Java to be set up. We test for and officially support the following combinations:
The following tutorials can be helpful in setting up Apache Spark:
💡 For the foreseeable future, Anovos will support Spark 3.1.x, 3.2.x, and 3.3.x. We will phase out 2.4.x over the course of the next releases. To see which precise combinations we're currently testing, see this workflow configuration.
To install Anovos, simply run
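Anovos is distributed on PyPI, so a standard `pip` install suffices:

```shell
pip install anovos
```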
You can select a specific version of Anovos by specifying the version:
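For example, to pin the 1.0.1 release (the version number here is only for illustration; pick the release you need):

```shell
pip install "anovos==1.0.1"
```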
For more information on specifying package versions, see the pip documentation.
Then, you can import Anovos as a module into your Python applications using standard `import` statements:
To trigger Spark workloads from Python, you have to ensure that the necessary external packages
are included in the `SparkSession`.
For this, you can either use the pre-configured
SparkSession provided by Anovos:
If you need to use your own custom
SparkSession, make sure to include the following dependencies:
Cloning the GitHub repository
Clone the Anovos repository to your local environment using the command:
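Assuming the repository's public GitHub location:

```shell
git clone https://github.com/anovos/anovos.git
```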
For production use, you'll always want to clone a specific version, e.g., by checking out the corresponding release tag.
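For example, to fetch only the code tagged for release 1.0.1 (the tag name `v1.0.1` is an assumption; check the repository's tags for the exact naming):

```shell
git clone --branch v1.0.1 --depth 1 https://github.com/anovos/anovos.git
```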
Afterwards, go to the newly created
anovos directory and execute the following command to clean and build
the latest modules:
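Assuming the Makefile targets shipped with the repository:

```shell
# clean previous build artifacts and rebuild the distributable modules
make clean build
```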
Next, install Anovos' dependencies by running
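Assuming the standard `requirements.txt` at the repository root:

```shell
pip install -r requirements.txt
```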
and go to the `dist/` folder. There, you should:
- Update the input and output paths in `configs.yaml` and configure the data set. You might also want to adapt the threshold settings to your needs.
- Check the `main.py` sample script. It demonstrates how different functions from Anovos can be stitched together to create a workflow.
- If necessary, update `spark-submit.sh`. This is the shell script used to run the Spark application via `spark-submit`.
Once everything is configured, you can start your workflow run using the aforementioned script:
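A minimal invocation from within the `dist/` folder:

```shell
./spark-submit.sh
```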
While the job is running, you can check the logs written by the script.
Once the run completes, the script will attempt to automatically open the final report
(`report_stats/ml_anovos_report.html`) in your web browser.