Setting up Anovos locally
There are several ways to set up and run Anovos locally:

- 🐋 Running Anovos workloads through Docker: You can run Anovos workloads defined in a configuration file using the `anovos-worker` Docker image, without having to set up Spark or anything else. This is the recommended option for developing and testing Anovos workloads without access to a Spark cluster. If you're just starting with Anovos and don't have any special requirements, pick this option.
- 🐍 Using Anovos as a Python library: There are two ways to install Anovos to use it as a Python library in your own code:
  - Installation through `pip`. If you need more fine-grained control than the configuration file offers, or you have a way to execute Spark jobs, this is likely the best option for you.
  - Cloning the GitHub repository. This is recommended if you would like to get access to the latest development version. It is also the way to go if you would like to build custom wheel files.
🐋 Running Anovos workloads through Docker
💿 Software Prerequisites
Running Anovos workloads through Docker requires Python 3.x, a Bash-compatible shell, and Docker. (See the Docker documentation for instructions on how to install Docker on your machine.)
At the moment, you need to download two scripts from the Anovos GitHub repository. You can either download the scripts individually:
```bash
mkdir local && cd local
wget https://raw.githubusercontent.com/anovos/anovos/main/local/rewrite_configuration.py
wget https://raw.githubusercontent.com/anovos/anovos/main/local/run_workload.sh
chmod +x run_workload.sh
```
Or you can clone the entire Anovos GitHub repository, which will also give you access to example configurations:
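```bash
# clone the repository (URL inferred from the raw.githubusercontent.com links above)
git clone https://github.com/anovos/anovos.git
```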
In both cases, you will have a folder named `local` that contains the `run_workload.sh` shell script and the `rewrite_configuration.py` Python script.
Launching an Anovos run
To run an Anovos workload defined in a configuration file, you need to execute `run_workload.sh` and pass the name of the configuration file as the first parameter. All paths to input data in the configuration file have to be given relative to the directory you are calling `run_workload.sh` from.
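In general, an invocation has the following shape (the placeholder names are illustrative; the optional version parameters are explained further below):

```bash
# <config.yaml>: workload configuration; <spark-version> and <anovos-version> are optional
./local/run_workload.sh <config.yaml> <spark-version> <anovos-version>
```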
For example, the configuration for the basic demo workload (available here) defines its input data paths relative to a `data` directory.
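For illustration, the relevant part of such a configuration might look roughly like this (the keys and path shown are an assumption; check configs_basic.yaml for the authoritative version):

```yaml
# illustrative excerpt only; consult configs_basic.yaml for the actual keys and values
input_dataset:
  read_dataset:
    file_path: "data/income_dataset/csv"   # relative to the directory you launch from
    file_type: csv
```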
Hence, we need to ensure that the `data` directory is a subdirectory of the directory we are launching the workload from. Otherwise, the Anovos process inside the `anovos-worker` Docker container won't be able to access it.
The following command will run the basic demo workflow included in the Anovos repository on Spark 3.2.2:
```bash
# enter the root folder of the repository
cd anovos
# place the input data that is processed by the basic demo at the location
# expected by the configuration
mkdir data && cp -r ./examples/data/income_dataset ./data
# launch the anovos-worker container
./local/run_workload.sh config/configs_basic.yaml 3.2.2
```
Once processing has finished, you will find the output in a folder named `output` within the directory you called `run_workload.sh` from.
💡 Note that the `anovos-worker` images provided through Docker Hub do not support the Feast or MLflow integrations out of the box, as they require interaction with third-party components, access to network communication, and/or interaction with the file system beyond the pre-configured paths. You can find the list of currently unsupported configuration blocks at the top of `rewrite_configuration.py`.
If you try to run an Anovos workload that uses unsupported features, you will receive an error message and no `anovos-worker` container will be launched.
If you would like, you can build and configure a custom pre-processing and launch script by adapting the files in `anovos/local` to your specific needs. For example, for Feast you will likely want to configure an additional volume or bind mount in `run_workload.sh`, whereas MLflow requires some network configuration.
Specifying the Spark and Anovos versions
You can optionally define the Anovos version by adding it as a third parameter:
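```bash
# same demo workload as above, pinned to Anovos 1.0.1 on Spark 3.2.2
./local/run_workload.sh config/configs_basic.yaml 3.2.2 1.0.1
```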
This will use Anovos 1.0.1 on Spark 3.2.2. If no version for Anovos is given, the latest release available for the specified Spark version will be used.
Please note that the corresponding `anovos-worker` image has to be available on Docker Hub for this to work out of the box. If you need a specific configuration not available as a pre-built image, you can follow the instructions here to build your own `anovos-worker` image.
In that case, you can then launch `run_workload.sh` without specifying the Spark or Anovos version:
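```bash
# the custom-built image is used, so no version parameters are needed
./local/run_workload.sh config/configs_basic.yaml
```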
🐍 Using Anovos as a Python library
💿 Software Prerequisites
Anovos requires Spark, Python, and Java to be set up. We test for and officially support the following combinations:
- Spark 2.4.8, Python 3.7, and Java 8
- Spark 3.1.3, Python 3.9, and Java 11
- Spark 3.2.2, Python 3.10, and Java 11
- Spark 3.3.0, Python 3.10, and Java 11
The following tutorials can be helpful in setting up Apache Spark:
💡 For the foreseeable future, Anovos will support Spark 3.1.x, 3.2.x, and 3.3.x. We will phase out 2.4.x over the course of the next releases. To see which precise combinations we're currently testing, see this workflow configuration.
Installation through pip
To install Anovos, simply run
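```bash
# install the latest released version from PyPI
pip install anovos
```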
You can select a specific version of Anovos by specifying the version:
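```bash
# e.g., pin to the 1.0.1 release
pip install anovos==1.0.1
```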
For more information on specifying package versions, see the `pip` documentation.
Then, you can import Anovos as a module into your Python applications using
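```python
import anovos
```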
To trigger Spark workloads from Python, you have to ensure that the necessary external packages are included in the `SparkSession`.
For this, you can either use the pre-configured `SparkSession` provided by Anovos:
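```python
# import path of Anovos' shared Spark session; verify it against the version you installed
from anovos.shared.spark import spark
```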
If you need to use your own custom `SparkSession`, make sure to include the following dependencies:
- io.github.histogrammar:histogrammar_2.11:1.0.20
- io.github.histogrammar:histogrammar-sparksql_2.11:1.0.20
- org.apache.spark:spark-avro_2.11:2.4.0
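A minimal sketch of building such a session via `spark.jars.packages` is shown below; the coordinates are taken from the list above (Scala 2.11 / Spark 2.4.x builds), so adjust them to match your Spark version:

```python
from pyspark.sql import SparkSession

# dependency coordinates from the list above (Scala 2.11 / Spark 2.4.x builds)
packages = [
    "io.github.histogrammar:histogrammar_2.11:1.0.20",
    "io.github.histogrammar:histogrammar-sparksql_2.11:1.0.20",
    "org.apache.spark:spark-avro_2.11:2.4.0",
]

# build a session that downloads and includes the packages on the driver and executors
spark = (
    SparkSession.builder.appName("anovos")
    .config("spark.jars.packages", ",".join(packages))
    .getOrCreate()
)
```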
Cloning the GitHub repository
Clone the Anovos repository to your local environment using the command:
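```bash
git clone https://github.com/anovos/anovos.git
```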
For production use, you'll always want to clone a specific version, e.g., the following to get just the code for version 1.0.1:
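```bash
# the exact tag name may differ; check the repository's tags for the release you need
git clone --depth 1 --branch v1.0.1 https://github.com/anovos/anovos.git
```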
Afterwards, go to the newly created `anovos` directory and execute the following command to clean and build the latest modules:
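```bash
# assumes the Makefile targets shipped with the repository
make clean build
```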
Next, install Anovos' dependencies by running
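```bash
# assumes the requirements file at the repository root
pip install -r requirements.txt
```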
Then, go to the `dist/` folder. There, you should:

- Update the input and output paths in `configs.yaml` and configure the data set. You might also want to adapt the threshold settings to your needs.
- Adapt the `main.py` sample script. It demonstrates how different functions from Anovos can be stitched together to create a workflow.
- If necessary, update `spark-submit.sh`. This is the shell script used to run the Spark application via `spark-submit`.
Once everything is configured, you can start your workflow run using the aforementioned script:
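```bash
# backgrounding the run and redirecting output to run.log (an illustrative name) lets you follow progress
nohup ./spark-submit.sh > run.log 2>&1 &
```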
While the job is running, you can check the logs written to stdout using
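```bash
# follow the log file from the previous step (run.log is the illustrative name used above)
tail -f run.log
```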
Once the run completes, the script will attempt to automatically open the final report (`report_stats/ml_anovos_report.html`) in your web browser.