
Setting up Anovos on AWS EMR

For large workloads, you can set up Anovos on AWS EMR.

Installing/Downloading Anovos

Clone the Anovos repository to your local environment using the following command:

git clone https://github.com/anovos/anovos.git

After cloning, go to the anovos directory and execute the following command to clean and build the latest modules in the dist folder:

make clean build
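
Putting these together, the full sequence with a quick check of the build output looks like this (a minimal sketch; the files listed in the comment are the ones referenced in the next section):

git clone https://github.com/anovos/anovos.git
cd anovos
make clean build
# dist/ should now contain anovos.zip, main.py, configs.yaml, and data/income_dataset
ls dist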

Copy all required files into an S3 bucket

Copy the following files to AWS S3:

  • /anovos/dist/anovos.zip
    • This file contains all Anovos modules
    • The zipped version is required for importing the modules via --py-files
  • /anovos/dist/data/income_dataset (optional)
    • This folder contains our demo dataset, the income dataset.
  • /anovos/dist/main.py
    • This is a sample script that shows how different functions from the Anovos modules can be stitched together to create a workflow.
    • Users can create their own workflow script by importing the necessary functions.
    • This script takes its input from a YAML configuration file.
  • /anovos/dist/configs.yaml
    • This is the sample YAML configuration file that sets the arguments for all functions.
    • Update configs.yaml with all input and output S3 paths. These are typically the variables with _path in their names, such as final_report_path, file_path, appended_metric_path, and output_path (see the illustrative fragment after this list).
    • All other changes depend on the dataset used and modules to be run.
  • /anovos/bin/aws_bootstrap_files/setup_on_aws_emr_5_30_and_above.sh
    • This shell script is used to install all required packages to run Anovos on EMR (bootstrapping).
    • Also copy the requirements.txt from the repository to an accessible S3 bucket.
    • Edit the first line of setup_on_aws_emr_5_30_and_above.sh accordingly before using it as a bootstrap file for AWS EMR.
  • /anovos/data/metric_dictionary.csv
    • This contains a static dictionary file for the different metrics generated by Anovos in the report.
    • This helps generate the wiki tab in the final Anovos report that describes the metrics of the different modules/submodules.
  • /anovos/jars/histogrammar*.jar
    • These JARs provide the external libraries that Anovos depends on.
    • Use the correct version (2.11 or 2.12) based on the Scala version in your environment.
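
As an illustration of the configs.yaml path updates mentioned above, here is a minimal hedged fragment. The key names are the ones listed above, shown flat for illustration; the exact nesting and the remaining keys depend on your version of configs.yaml, and the bucket name is a placeholder:

file_path: s3://<s3-bucket>/income_dataset
appended_metric_path: s3://<s3-bucket>/metrics
output_path: s3://<s3-bucket>/output
final_report_path: s3://<s3-bucket>/report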

AWS CLI installation instructions can be found in the AWS documentation.

AWS copy command: aws s3 cp --recursive <local file path> <s3 path> --profile <profile name>
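
For example, assuming your bucket is named <s3-bucket> and you run the commands from the anovos repository root, the uploads could look like this (the default profile is used, and the key layout within the bucket is an illustrative choice):

aws s3 cp dist/anovos.zip s3://<s3-bucket>/anovos.zip
aws s3 cp --recursive dist/data/income_dataset s3://<s3-bucket>/income_dataset
aws s3 cp dist/main.py s3://<s3-bucket>/main.py
aws s3 cp dist/configs.yaml s3://<s3-bucket>/configs.yaml
aws s3 cp requirements.txt s3://<s3-bucket>/requirements.txt
aws s3 cp bin/aws_bootstrap_files/setup_on_aws_emr_5_30_and_above.sh s3://<s3-bucket>/setup_on_aws_emr_5_30_and_above.sh
aws s3 cp data/metric_dictionary.csv s3://<s3-bucket>/metric_dictionary.csv
aws s3 cp --recursive jars s3://<s3-bucket>/jars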

Create a cluster

  • Software Configuration

    • emr-5.33.0
    • Hadoop 2.10.1
    • Spark 2.4.7
    • Hive 2.3.7
    • TensorFlow 2.4.1
  • Spark Submit Details

    • Deploy mode: client
    • Spark-submit options: --num-executors 1000
      --executor-cores 4 --executor-memory 20g
      --driver-memory 20G
      --driver-cores 4
      --conf spark.driver.maxResultSize=15g
      --conf spark.yarn.am.memoryOverhead=1000m
      --conf spark.executor.memoryOverhead=2000m
      --conf spark.kryo.referenceTracking=false
      --conf spark.network.timeout=18000s
      --conf spark.executor.heartbeatInterval=12000s
      --conf spark.dynamicAllocation.executorIdleTimeout=12000s
      --conf spark.rpc.message.maxSize=1024
      --conf spark.yarn.maxAppAttempts=1
      --conf spark.speculation=false
      --conf spark.kryoserializer.buffer.max=1024
      --conf spark.executor.extraJavaOptions=-XX:+UseG1GC
      --conf spark.driver.extraJavaOptions=-XX:+UseG1GC
      --packages org.apache.spark:spark-avro_2.11:2.4.0
      --jars /anovos/jars/histogrammar-sparksql_2.11-1.0.20.jar,/anovos/jars/histogrammar_2.11-1.0.20.jar
      --py-files {s3_bucket}/anovos.zip

      • When launching the cluster, make sure each driver/worker node has more memory than the total configured above. For the above configuration, at least 20 + 2 = 22 GB (executor/driver memory plus memory overhead) is needed per worker or driver, so select machine types with around 26 GB of RAM or more, e.g. m3.2xlarge (8 cores, 30 GB).

      • The histogrammar JAR versions and the Avro package version above must match the Scala version of your Spark build (see the version check at the end of this section).

    • Application location: the S3 path of the main.py file

    • Arguments for the main.py file: s3://<s3-bucket>/configs.yaml emr

    • Final spark-submit command example:

      spark-submit --deploy-mode client --num-executors 1000 --executor-cores 4 --executor-memory 20g --driver-memory 20G --driver-cores 4 --conf spark.driver.maxResultSize=15g --conf spark.yarn.am.memoryOverhead=1000m --conf spark.executor.memoryOverhead=2000m --conf spark.kryo.referenceTracking=false --conf spark.network.timeout=18000s --conf spark.executor.heartbeatInterval=12000s --conf spark.dynamicAllocation.executorIdleTimeout=12000s --conf spark.rpc.message.maxSize=1024 --conf spark.yarn.maxAppAttempts=1 --conf spark.speculation=false --conf spark.kryoserializer.buffer.max=1024 --conf spark.executor.extraJavaOptions=-XX:+UseG1GC --conf spark.driver.extraJavaOptions=-XX:+UseG1GC --packages org.apache.spark:spark-avro_2.11:2.4.0 --jars s3://<s3-bucket>/jars/histogrammar-sparksql_2.11-1.0.20.jar,s3://<s3-bucket>/jars/histogrammar_2.11-1.0.20.jar --py-files s3://<s3-bucket>/anovos.zip s3://<s3-bucket>/main.py s3://<s3-bucket>/configs.yaml emr
      

    • Bootstrap Actions
      • Script location: specify bootstrap_shell_script_s3_path/setup_on_aws_emr_5_30_and_above.sh
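
To confirm which Scala version your Spark build uses, and therefore which histogrammar and spark-avro artifacts to reference, you can check the Spark version banner on the cluster:

spark-submit --version 2>&1 | grep -i scala
# EMR 5.x ships Spark 2.4.x built against Scala 2.11, which matches the _2.11 artifacts used above.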
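
Alternatively, the cluster itself can be created from the AWS CLI. Below is a hedged sketch matching the configuration above; the cluster name, instance type, instance count, key pair, and bucket are placeholders, and --use-default-roles assumes the default EMR roles exist in your account:

aws emr create-cluster \
  --name "anovos-emr" \
  --release-label emr-5.33.0 \
  --applications Name=Hadoop Name=Spark Name=Hive Name=TensorFlow \
  --instance-type m3.2xlarge \
  --instance-count 3 \
  --ec2-attributes KeyName=<key-pair-name> \
  --use-default-roles \
  --bootstrap-actions Path=s3://<s3-bucket>/setup_on_aws_emr_5_30_and_above.sh

The Spark step (the final spark-submit command above) can then be added from the EMR console as described, or via aws emr add-steps.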