
Setting up Anovos on AWS EMR

For large workloads, you can set up Anovos on AWS EMR.

Installing/Downloading Anovos

Clone the Anovos repository to your local environment using the following command:

git clone https://github.com/anovos/anovos.git

After cloning, go to the anovos directory and execute the following command to clean and build the latest modules into the dist folder:

make clean build
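
To verify the build output before uploading, you can list the dist folder (the expected contents match the files described in the next section):

ls dist/   # expect anovos.zip, main.py, configs.yaml, income_dataset, ...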

Copy all required files into an S3 bucket

Copy the following files to AWS S3:

  • dist/anovos.zip
    • This file contains all Anovos modules
    • The zipped version is mandatory for importing the modules via --py-files
  • dist/income_dataset (optional)
    • This folder contains our demo dataset
  • dist/main.py
    • This is a sample script that shows how different functions from the Anovos modules can be stitched together to create a workflow.
    • Users can create their own workflow script by importing the necessary functions.
    • This script takes its input from a YAML configuration file
  • dist/configs.yaml
    • This is a sample YAML configuration file that sets the arguments for all functions.
    • Update configs.yaml with all input and output S3 paths. All other changes depend on the dataset used.
  • bin/req_packages_anovos.sh
    • This shell script installs all packages required to run Anovos on EMR

AWS copy command: aws s3 cp --recursive <local file path> <s3 path> --profile <profile name>
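
For example, with a placeholder bucket s3://my-anovos-bucket and a placeholder profile my-profile, the uploads could look like this (the destination paths are illustrative, not fixed by Anovos):

aws s3 cp dist/anovos.zip s3://my-anovos-bucket/anovos.zip --profile my-profile
aws s3 cp --recursive dist/income_dataset s3://my-anovos-bucket/income_dataset --profile my-profile
aws s3 cp dist/main.py s3://my-anovos-bucket/main.py --profile my-profile
aws s3 cp dist/configs.yaml s3://my-anovos-bucket/configs.yaml --profile my-profile
aws s3 cp bin/req_packages_anovos.sh s3://my-anovos-bucket/scripts/req_packages_anovos.sh --profile my-profile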

Create a cluster

  • Software Configuration

    • emr-5.33.0
    • Hadoop 2.10.1
    • Spark 2.4.7
    • Hive 2.3.7
  • Spark Submit Details

    • Deploy mode: client
    • Spark-submit options: --num-executors 1000
      --executor-cores 4 --executor-memory 20g
      --driver-memory 20G
      --driver-cores 4
      --conf spark.driver.maxResultSize=15g
      --conf spark.yarn.am.memoryOverhead=1000m
      --conf spark.executor.memoryOverhead=2000m
      --conf spark.kryo.referenceTracking=false
      --conf spark.network.timeout=18000s
      --conf spark.executor.heartbeatInterval=12000s
      --conf spark.dynamicAllocation.executorIdleTimeout=12000s
      --conf spark.rpc.message.maxSize=1024
      --conf spark.yarn.maxAppAttempts=1
      --conf spark.speculation=false
      --conf spark.kryoserializer.buffer.max=1024
      --conf spark.executor.extraJavaOptions=-XX:+UseG1GC
      --conf spark.driver.extraJavaOptions=-XX:+UseG1GC
      --packages org.apache.spark:spark-avro_2.11:2.4.0
      --jars {s3_bucket}/jars/histogrammar-sparksql_2.11-1.0.20.jar,{s3_bucket}/jars/histogrammar_2.11-1.0.20.jar
      --py-files {s3_bucket}/{test_folder}/com.zip

      Application location*: S3 path of the main.py file

      Arguments: S3 path of the configs.yaml file that main.py reads

  • Bootstrap Actions

    • Script location: {bootstrap_shell_script_path}/{file_name.sh}
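
If you prefer the AWS CLI to the console, the setup above can also be expressed as a single create-cluster call. This is a minimal sketch: the bucket, key pair, subnet, and instance sizing are placeholders, and the remaining spark-submit options listed above can be appended to the step's Args list:

aws emr create-cluster \
  --name anovos-emr \
  --release-label emr-5.33.0 \
  --applications Name=Hadoop Name=Spark Name=Hive \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --ec2-attributes KeyName=my-key,SubnetId=subnet-0123456789abcdef0 \
  --bootstrap-actions Path=s3://my-anovos-bucket/scripts/req_packages_anovos.sh \
  --steps 'Type=Spark,Name=anovos-workflow,ActionOnFailure=CONTINUE,Args=[--deploy-mode,client,--py-files,s3://my-anovos-bucket/anovos.zip,s3://my-anovos-bucket/main.py,s3://my-anovos-bucket/configs.yaml]' \
  --use-default-roles \
  --profile my-profile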