Setting up Anovos on AWS EMR
For large workloads, you can set up Anovos on AWS EMR.
Installing/Downloading Anovos
Clone the Anovos repository to your local environment using the following command:
git clone https://github.com/anovos/anovos.git
After cloning, change into the anovos directory and run the following command to clean and build the latest modules into the dist folder:
make clean build
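After the build completes, it is worth sanity-checking the output before uploading anything; the file names below are the ones referenced in the next section, though exact contents may vary by release:

```sh
# Verify the build artifacts that the next section uploads to S3:
ls dist/
# expected (per this guide): anovos.zip  configs.yaml  main.py  data/
```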
Copy all required files into an S3 bucket
Copy the following files to AWS S3:
/anovos/dist/anovos.zip
- This file contains all Anovos modules.
- The zipped version is mandatory for importing the modules via --py-files.
/anovos/dist/data/income_dataset
- (Optional) This folder contains our demo dataset, the income dataset.
/anovos/dist/main.py
- This is a sample script that shows how different functions from the Anovos modules can be stitched together to create a workflow.
- Users can create their own workflow script by importing the necessary functions.
- This script takes its input from a YAML configuration file.
/anovos/dist/configs.yaml
- This is the sample YAML configuration file that sets the arguments for all functions.
- Update configs.yaml with all input & output S3 paths. Typically these are the variables with _path in their name, such as final_report_path, file_path, appended_metric_path, output_path, etc. (see the snippet below for a quick way to locate them).
- All other changes depend on the dataset used and the modules to be run.
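Before editing, a quick way to locate every such key (a minimal sketch; the exact key names depend on the modules configured in your configs.yaml):

```sh
# List every *_path argument that should point at S3 rather than local storage:
grep -n "_path" configs.yaml
```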
/anovos/bin/aws_bootstrap_files/setup_on_aws_emr_5_30_and_above.sh
- This shell script is used to install all required packages to run Anovos on EMR (bootstrapping).
- Also copy the requirements.txt from the repo to an S3 bucket that the cluster can access.
- Edit the first line in setup_on_aws_emr_5_30_and_above.sh (see the assumed sketch below) before using it as a bootstrap file for AWS EMR.
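This guide does not show the script's contents, but based on the instruction above, the first line most plausibly points at the S3 location of requirements.txt. A hedged sketch of what the edited line might look like (the path and destination are assumptions, not the script's actual contents):

```sh
# Assumed first line: fetch the requirements.txt you uploaded to S3.
aws s3 cp s3://<s3-bucket>/requirements.txt .
```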
/anovos/data/metric_dictionary.csv
- This contains a static dictionary file for the different metrics generated by Anovos in the report.
- This helps generate the wiki tab describing the metrics of the different modules/submodules in the final Anovos report.
/anovos/jars/histogrammar*.jar
- These jars let Anovos make use of external dependent libraries.
- Specify and use the correct version, 2.12 or 2.11, based on the Scala version in your environment (a quick version check follows this list).
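To confirm which suffix applies, you can check the Scala build of your Spark installation. Note that EMR 5.x ships Spark 2.4.x built against Scala 2.11, which is why the commands below use the *_2.11 artifacts:

```sh
# Prints the Spark version along with the Scala version it was built with,
# e.g. "Using Scala version 2.11.12":
spark-submit --version
```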
Instructions for installing the AWS CLI can be found in the AWS documentation.
AWS copy command:
aws s3 cp --recursive <local file path> <s3 path> --profile <profile name>
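For example, the artifacts listed above could be uploaded as follows (bucket, key prefixes, and profile name are placeholders; the commands assume you run them from the root of the cloned repository and that requirements.txt sits at the repo root):

```sh
aws s3 cp dist/anovos.zip s3://<s3-bucket>/anovos.zip --profile <profile-name>
aws s3 cp dist/main.py s3://<s3-bucket>/main.py --profile <profile-name>
aws s3 cp dist/configs.yaml s3://<s3-bucket>/configs.yaml --profile <profile-name>
aws s3 cp requirements.txt s3://<s3-bucket>/requirements.txt --profile <profile-name>
aws s3 cp bin/aws_bootstrap_files/setup_on_aws_emr_5_30_and_above.sh s3://<s3-bucket>/setup_on_aws_emr_5_30_and_above.sh --profile <profile-name>
aws s3 cp data/metric_dictionary.csv s3://<s3-bucket>/metric_dictionary.csv --profile <profile-name>
aws s3 cp --recursive jars/ s3://<s3-bucket>/jars/ --profile <profile-name>
aws s3 cp --recursive dist/data/income_dataset s3://<s3-bucket>/income_dataset --profile <profile-name>
```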
Create a cluster
- Software Configuration
  - emr-5.33.0
  - Hadoop 2.10.1
  - Spark 2.4.7
  - Hive 2.3.7
  - TensorFlow 2.4.1
- Spark Submit Details
  - Deploy mode: client
  - Spark-submit options:

    --num-executors 1000
    --executor-cores 4
    --executor-memory 20g
    --driver-memory 20G
    --driver-cores 4
    --conf spark.driver.maxResultSize=15g
    --conf spark.yarn.am.memoryOverhead=1000m
    --conf spark.executor.memoryOverhead=2000m
    --conf spark.kryo.referenceTracking=false
    --conf spark.network.timeout=18000s
    --conf spark.executor.heartbeatInterval=12000s
    --conf spark.dynamicAllocation.executorIdleTimeout=12000s
    --conf spark.rpc.message.maxSize=1024
    --conf spark.yarn.maxAppAttempts=1
    --conf spark.speculation=false
    --conf spark.kryoserializer.buffer.max=1024
    --conf spark.executor.extraJavaOptions=-XX:+UseG1GC
    --conf spark.driver.extraJavaOptions=-XX:+UseG1GC
    --packages org.apache.spark:spark-avro_2.11:2.4.0
    --jars s3://<s3-bucket>/jars/histogrammar-sparksql_2.11-1.0.20.jar,s3://<s3-bucket>/jars/histogrammar_2.11-1.0.20.jar
    --py-files s3://<s3-bucket>/anovos.zip
- When launching the cluster, make sure the driver/worker node memory is higher than the total configured memory. For example, with the above config there should be at least 20 + 2 = 22 GB per worker or driver, so select machine types with around 26 GB of RAM or more. Example machine: m3.2xlarge (8 cores, 30 GB).
- The histogrammar jar versions and the avro package version above must follow the Scala version (see the check earlier in this guide).
- Application location: the S3 path of the main.py file
- Arguments for the main.py file: s3://<s3-bucket>/configs.yaml emr
- Final spark-submit command example:
spark-submit --deploy-mode client --num-executors 1000 --executor-cores 4 --executor-memory 20g --driver-memory 20G --driver-cores 4 --conf spark.driver.maxResultSize=15g --conf spark.yarn.am.memoryOverhead=1000m --conf spark.executor.memoryOverhead=2000m --conf spark.kryo.referenceTracking=false --conf spark.network.timeout=18000s --conf spark.executor.heartbeatInterval=12000s --conf spark.dynamicAllocation.executorIdleTimeout=12000s --conf spark.rpc.message.maxSize=1024 --conf spark.yarn.maxAppAttempts=1 --conf spark.speculation=false --conf spark.kryoserializer.buffer.max=1024 --conf spark.executor.extraJavaOptions=-XX:+UseG1GC --conf spark.driver.extraJavaOptions=-XX:+UseG1GC --packages org.apache.spark:spark-avro_2.11:2.4.0 --jars s3://<s3-bucket>/jars/histogrammar-sparksql_2.11-1.0.20.jar,s3://<s3-bucket>/jars/histogrammar_2.11-1.0.20.jar --py-files s3://<s3-bucket>/anovos.zip s3://<s3-bucket>/main.py s3://<s3-bucket>/configs.yaml emr
- Bootstrap Actions
- Script location: specify <bootstrap_shell_script_s3_path>/setup_on_aws_emr_5_30_and_above.sh (a consolidated CLI sketch follows).
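For reference, here is a hypothetical aws CLI invocation that mirrors the console settings above (the release label, applications, and bootstrap action are taken from this guide; the cluster name, instance type/count, roles, and key pair are placeholders to adapt):

```sh
aws emr create-cluster \
  --name "anovos-emr" \
  --release-label emr-5.33.0 \
  --applications Name=Hadoop Name=Spark Name=Hive Name=TensorFlow \
  --instance-type m3.2xlarge \
  --instance-count 3 \
  --bootstrap-actions Path=s3://<s3-bucket>/setup_on_aws_emr_5_30_and_above.sh \
  --use-default-roles \
  --ec2-attributes KeyName=<your-key-pair>
```

Once the cluster is up, the spark-submit step can be added through the console as described above, or programmatically via aws emr add-steps with Type=Spark.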