Setting up Anovos locally

There are several ways to set up and run Anovos locally:

  • 🐋 Running Anovos workloads through Docker: You can run Anovos workloads defined in a configuration file using the anovos-worker Docker image, without having to set up Spark or anything else. This is the recommended option for developing and testing Anovos workloads without access to a Spark cluster. If you're just starting with Anovos and don't have any special requirements, pick this option.
  • 🐍 Using Anovos as a Python library: There are two ways to install Anovos to use it as a Python library in your own code:
    • Installation through pip. If you need more fine-grained control than the configuration file offers, or you have a way to execute Spark jobs, this is likely the best option for you.
    • Cloning the GitHub repository. This is recommended if you would like to get access to the latest development version. It is also the way to go if you would like to build custom wheel files.

🐋 Running Anovos workloads through Docker

💿 Software Prerequisites

Running Anovos workloads through Docker requires Python 3.x, a Bash-compatible shell, and Docker. (See the Docker documentation for instructions on how to install Docker on your machine.)

At the moment, you need to download two scripts from the Anovos GitHub repository. You can either download the scripts individually:

mkdir local && cd local
wget https://raw.githubusercontent.com/anovos/anovos/main/local/rewrite_configuration.py
wget https://raw.githubusercontent.com/anovos/anovos/main/local/run_workload.sh
chmod +x run_workload.sh

Or you can clone the entire Anovos GitHub repository, which will also give you access to example configurations:

git clone https://github.com/anovos/anovos.git
cd anovos/local
chmod +x run_workload.sh

In both cases, you will have a folder named local that contains the run_workload.sh shell script and the rewrite_configuration.py Python script.

Launching an Anovos run

To run an Anovos workload defined in a configuration file, you need to execute run_workload.sh and pass the name of the configuration file as the first parameter.

All paths to input data in the configuration file have to be given relative to the directory you are calling run_workload.sh from. For example, the configuration for the basic demo workload (config/configs_basic.yaml in the repository) defines

input_dataset:
  read_dataset:
    file_path: "data/income_dataset/csv"

Hence, we need to ensure that the data directory is a subdirectory of the directory we are launching the workload from. Otherwise, the Anovos process inside the anovos-worker Docker container won't be able to access it.

The following command will run the basic demo workflow included in the Anovos repository on Spark 3.2.2:

# enter the root folder of the repository
cd anovos
# place the input data that is processed by the basic demo at the location
# expected by the configuration
mkdir -p data && cp -r ./examples/data/income_dataset ./data
# launch the anovos-worker container
./local/run_workload.sh config/configs_basic.yaml 3.2.2

Once processing has finished, you will find the output in a folder named output within the directory you called run_workload.sh from.

💡 Note that the anovos-worker images provided through Docker Hub do not support the Feast or MLflow integrations out of the box, as they require interaction with third-party components, access to network communication, and/or interaction with the file system beyond the pre-configured paths. You can find the list of currently unsupported configuration blocks at the top of rewrite_configuration.py. If you try to run an Anovos workload that uses unsupported features, you will receive an error message and no anovos-worker container will be launched. If you would like, you can build and configure a custom pre-processing and launch script by adapting the files in anovos/local to your specific needs. For example, for Feast you will likely want to configure an additional volume or bind mount in run_workload.sh, whereas MLflow requires some network configuration.

Specifying the Spark and Anovos versions

The Spark version is passed as the second parameter, as shown in the example above. You can optionally pin the Anovos version by adding it as a third parameter:

./local/run_workload.sh config.yaml 3.2.2 1.0.1

This will use Anovos 1.0.1 on Spark 3.2.2. If no Anovos version is given (e.g., ./local/run_workload.sh config.yaml 3.2.2), the latest release available for the specified Spark version will be used.

Please note that the corresponding anovos-worker image has to be available on Docker Hub for this to work out of the box.

If you need a specific configuration not available as a pre-built image, you can follow the instructions here to build your own anovos-worker image. In that case, you can then launch run_workload.sh without specifying the Spark or Anovos version:

./local/run_workload.sh config.yaml

🐍 Using Anovos as a Python library

💿 Software Prerequisites

Anovos requires Spark, Python, and Java to be set up. We test for and officially support the following combinations:

The following tutorials can be helpful in setting up Apache Spark:

💡 For the foreseeable future, Anovos will support Spark 3.1.x, 3.2.x, and 3.3.x. We will phase out support for Spark 2.4.x over the course of the next releases. To see which precise combinations we're currently testing, see this workflow configuration.

Installation through pip

To install Anovos, simply run

pip install anovos

You can install a specific version of Anovos by pinning it explicitly:

pip install anovos==1.0.1

For more information on specifying package versions, see the pip documentation.

Then, you can import Anovos as a module into your Python applications using

import anovos

To trigger Spark workloads from Python, you have to ensure that the necessary external packages are included in the SparkSession.

For this, you can either use the pre-configured SparkSession provided by Anovos:

from anovos.shared.spark import spark
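
For example, a quick check that everything is wired up might look like this. This is a minimal sketch rather than an excerpt from the Anovos documentation: the read_dataset signature and the file_configs keys are assumptions you should verify against the API reference for your version.

# minimal sketch: read the basic demo data using the pre-configured SparkSession
from anovos.shared.spark import spark
from anovos.data_ingest.data_ingest import read_dataset

# file_path is resolved relative to the directory you start Python from
idf = read_dataset(
    spark,
    file_path="data/income_dataset/csv",
    file_type="csv",
    file_configs={"header": "True", "delimiter": ",", "inferSchema": "True"},
)
idf.show(5)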

If you need to use your own custom SparkSession, make sure to include the external packages Anovos depends on when you construct it.
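
As a rough sketch of the mechanism only (the package coordinate below is a placeholder, not Anovos' actual dependency list; take the required coordinates from the Anovos documentation for your version):

# sketch only: build a custom SparkSession and pull in additional packages
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("anovos-workload")
    # placeholder -- replace with the comma-separated Maven coordinates Anovos needs
    .config("spark.jars.packages", "<group:artifact:version,...>")
    .getOrCreate()
)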

Cloning the GitHub repository

Clone the Anovos repository to your local environment using the command:

git clone https://github.com/anovos/anovos.git

For production use, you'll always want to clone a specific version, e.g.,

git clone -b v1.0.1 --depth 1 https://github.com/anovos/anovos

to just get the code for version 1.0.1.

Afterwards, go to the newly created anovos directory and execute the following command to clean and build the latest modules:

make clean build

Next, install Anovos' dependencies by running

pip install -r requirements.txt

and go to the dist/ folder. There, you should

  • Update the input and output paths in configs.yaml and configure the data set. You might also want to adapt the threshold settings to your needs.

  • Adapt the main.py sample script. It demonstrates how different functions from Anovos can be stitched together to create a workflow (a minimal sketch follows this list).

  • If necessary, update spark-submit.sh. This is the shell script used to run the Spark application via spark-submit.
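
For orientation, here is a minimal sketch of the shape such a script can take. It is not the shipped main.py (which is driven by configs.yaml and wires together many more modules), and the read_dataset and global_summary helpers are assumptions to check against the Anovos version you built:

# illustration only -- not the shipped main.py
from anovos.shared.spark import spark
from anovos.data_ingest.data_ingest import read_dataset
from anovos.data_analyzer.stats_generator import global_summary

# read the input data set (paths would normally come from configs.yaml)
df = read_dataset(spark, file_path="data/income_dataset/csv", file_type="csv")

# run one analysis step and print the result
global_summary(spark, df).show()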

Once everything is configured, you can start your workflow run using the aforementioned script:

nohup ./spark-submit.sh > run.txt &

While the job is running, you can check the logs written to stdout using

tail -f run.txt

Once the run completes, the script will attempt to automatically open the final report (report_stats/ml_anovos_report.html) in your web browser.