
Setting up Anovos on AWS EMR

For large workloads, you can set up Anovos on AWS EMR.

Installing/Downloading Anovos

Clone the Anovos repository to your local environment using the following command:

git clone https://github.com/anovos/anovos.git

After cloning, go to the anovos directory and execute the following command to clean and build the latest modules into the dist folder:

make clean build
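
To verify the build output before uploading, you can list the dist folder (the expected contents match the files described in the next section):

ls dist/   # expect anovos.zip, main.py, configs.yaml, income_dataset, ...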

Copy all required files into an S3 bucket

Copy the following files to AWS S3:

  • dist/anovos.zip
    • This file contains all Anovos modules
    • The zipped version is mandatory for importing the modules via --py-files
  • dist/income_dataset (optional)
    • This folder contains our demo dataset
  • dist/main.py
    • This is a sample script that shows how different functions from the Anovos modules can be stitched together to create a workflow.
    • Users can create their own workflow script by importing the necessary functions.
    • This script takes its input from a YAML configuration file
  • dist/configs.yaml
    • This is a sample YAML configuration file that sets the arguments for all functions.
    • Update configs.yaml with all input and output S3 paths. All other changes depend on the dataset used.
  • bin/req_packages_anovos.sh
    • This shell script installs all packages required to run Anovos on EMR

AWS copy command: aws s3 cp --recursive <local file path> <s3 path> --profile <profile name>
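
For example, with a placeholder bucket s3://my-anovos-bucket and a placeholder profile my-profile, the uploads could look like this (the destination paths are illustrative, not fixed by Anovos):

aws s3 cp dist/anovos.zip s3://my-anovos-bucket/anovos.zip --profile my-profile
aws s3 cp --recursive dist/income_dataset s3://my-anovos-bucket/income_dataset --profile my-profile
aws s3 cp dist/main.py s3://my-anovos-bucket/main.py --profile my-profile
aws s3 cp dist/configs.yaml s3://my-anovos-bucket/configs.yaml --profile my-profile
aws s3 cp bin/req_packages_anovos.sh s3://my-anovos-bucket/scripts/req_packages_anovos.sh --profile my-profile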

Create a cluster

  • Software Configuration

    • emr-5.33.0
    • Hadoop 2.10.1
    • Spark 2.4.7
    • Hive 2.3.7
  • Spark Submit Details

    • Deploy mode: client
    • Spark-submit options: --num-executors 1000
      --executor-cores 4 --executor-memory 20g
      --driver-memory 20G
      --driver-cores 4
      --conf spark.driver.maxResultSize=15g
      --conf spark.yarn.am.memoryOverhead=1000m
      --conf spark.executor.memoryOverhead=2000m
      --conf spark.kryo.referenceTracking=false
      --conf spark.network.timeout=18000s
      --conf spark.executor.heartbeatInterval=12000s
      --conf spark.dynamicAllocation.executorIdleTimeout=12000s
      --conf spark.rpc.message.maxSize=1024
      --conf spark.yarn.maxAppAttempts=1
      --conf spark.speculation=false
      --conf spark.kryoserializer.buffer.max=1024
      --conf spark.executor.extraJavaOptions=-XX:+UseG1GC
      --conf spark.driver.extraJavaOptions=-XX:+UseG1GC
      --packages org.apache.spark:spark-avro_2.11:2.4.0
      --jars {s3_bucket}/jars/histogrammar-sparksql_2.11-1.0.20.jar,{s3_bucket}/jars/histogrammar_2.11-1.0.20.jar
      --py-files {s3_bucket}/{test_folder}/com.zip

      Application location*: S3 path of the main.py file

      Arguments: S3 path of the configs.yaml file that main.py reads

  • Bootstrap Actions

    • Script location: {bootstrap_shell_script_path}/{file_name.sh}
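
If you prefer the AWS CLI to the console, the setup above can also be expressed as a single create-cluster call. This is a minimal sketch: the bucket, key pair, subnet, and instance sizing are placeholders, and the remaining spark-submit options listed above can be appended to the step's Args list:

aws emr create-cluster \
  --name anovos-emr \
  --release-label emr-5.33.0 \
  --applications Name=Hadoop Name=Spark Name=Hive \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --ec2-attributes KeyName=my-key,SubnetId=subnet-0123456789abcdef0 \
  --bootstrap-actions Path=s3://my-anovos-bucket/scripts/req_packages_anovos.sh \
  --steps 'Type=Spark,Name=anovos-workflow,ActionOnFailure=CONTINUE,Args=[--deploy-mode,client,--py-files,s3://my-anovos-bucket/anovos.zip,s3://my-anovos-bucket/main.py,s3://my-anovos-bucket/configs.yaml]' \
  --use-default-roles \
  --profile my-profile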