
geo-spatial-data-analysis-using-spark

###########################################################################################
                            Instructions to run the script (run_experiments.sh)
###########################################################################################

Steps:
1. Set up the cluster with one master node and three worker nodes. To replicate our results, use AWS t3.medium instances (2 vCPUs, 4 GB RAM each).
    - The configuration files we used are reproduced below under "Configuration Files".
2. Start HDFS and configure it to distribute data over the 3 worker nodes with a replication factor of 2.
3. Upload the following files to the "input" directory in HDFS (see the consolidated command sketch after this list):
    - arealm10000.csv
    - zcta10000.csv
    - point-hotzone.csv
    - zone-hotzone.csv
    - yellow_tripdata_2009-01_point.csv
    - yellow_tripdata_2009-01_point_half.csv # contains half the rows of yellow_tripdata_2009-01_point.csv
4. Configure YARN and Spark according to the instances' memory specifications. (Please refer to the report for the values we used; our configuration files are reproduced below.)
5. Start HDFS, YARN, and the Spark history server. (In our case Spark logs history on port 18080; this can be configured in spark-defaults.conf.)
    - start-dfs.sh
    - start-yarn.sh
    - spark-history-server.sh
6. Run "run_experiments.sh" script to run all the experiments.
7. Once the script has finished, check the CPU, Network In, and Network Out metrics in AWS CloudWatch for each instance.
8. For Spark and YARN history, check the following URLs:
    - YARN : http://<master-url>:8088
    - SPARK HISTORY LOG : http://<master-url>:18080
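
The commands below are a minimal sketch of steps 2-6 as run on the master node. The HDFS paths ("input", "/spark-logs") come from this README and spark-defaults.conf; the assumption of a standard Hadoop/Spark layout with the sbin scripts on the PATH is ours.

    # Step 2: format the NameNode (first run only) and start HDFS;
    # the replication factor of 2 is set in hdfs-site.xml below.
    hdfs namenode -format
    start-dfs.sh

    # Step 3: create the input directory and upload the datasets.
    hdfs dfs -mkdir -p input
    hdfs dfs -put arealm10000.csv zcta10000.csv point-hotzone.csv \
        zone-hotzone.csv yellow_tripdata_2009-01_point.csv \
        yellow_tripdata_2009-01_point_half.csv input/

    # Step 5: the event-log directory from spark-defaults.conf must exist
    # before the history server starts or jobs are submitted.
    hdfs dfs -mkdir -p /spark-logs
    start-yarn.sh
    $SPARK_HOME/sbin/start-history-server.sh   # invoked as spark-history-server.sh above

    # Step 6: run all the experiments.
    bash run_experiments.sh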


############################################################################################
                                    Configuration Files
############################################################################################
# -----------------------------------core-site.xml -----------------------------------------

<configuration>
	<property>
		<name>fs.default.name</name>
		<value>hdfs://172.31.5.191:9000</value>
	</property>
</configuration>
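
Note: fs.default.name is the deprecated alias of fs.defaultFS; current Hadoop versions accept both names, so this configuration works as-is.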

# -----------------------------------hdfs-site.xml -----------------------------------------
<configuration>
    <property>
            <name>dfs.namenode.name.dir</name>
            <value>/home/ubuntu/data/nameNode</value>
    </property>

    <property>
            <name>dfs.datanode.data.dir</name>
            <value>/home/ubuntu/data/dataNode</value>
    </property>

    <property>
            <name>dfs.replication</name>
            <value>2</value>
    </property>
</configuration>
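
With three DataNodes and dfs.replication set to 2, each HDFS block is stored on two of the three workers, matching step 2 above.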

# -----------------------------------yarn-site.xml -----------------------------------------
<configuration>
    <property>
            <name>yarn.acl.enable</name>
            <value>0</value>
    </property>

    <property>
            <name>yarn.resourcemanager.hostname</name>
            <value>172.31.5.191</value> <!-- master node's private IP -->
    </property>

    <property>
            <name>yarn.nodemanager.aux-services</name>
            <value>mapreduce_shuffle</value>
    </property>

    <property>
   	    <name>yarn.nodemanager.resource.memory-mb</name>
            <value>2560</value>
    </property>

    <property>
            <name>yarn.scheduler.maximum-allocation-mb</name>
            <value>2560</value>
    </property>

    <property>
            <name>yarn.scheduler.minimum-allocation-mb</name>
            <value>256</value>
    </property> 

    <property>
            <name>yarn.nodemanager.vmem-check-enabled</name>
            <value>false</value>
    </property>
</configuration>
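
Disabling yarn.nodemanager.vmem-check-enabled stops YARN from killing containers whose virtual memory exceeds the default virtual-to-physical ratio, a common cause of spurious container failures for JVM workloads on small instances.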

# --------------------------------------- slaves -------------------------------------------------
# Private IPs of the worker nodes
172.31.6.202
172.31.5.240
172.31.1.180

# -----------------------------------spark-defaults.conf -----------------------------------------
spark.master yarn
spark.driver.memory 1024m
spark.executor.memory 1024m

spark.eventLog.enabled            true
spark.eventLog.dir                hdfs://172.31.5.191:9000/spark-logs
spark.history.provider            org.apache.spark.deploy.history.FsHistoryProvider 
spark.history.fs.logDirectory     hdfs://172.31.5.191:9000/spark-logs 
spark.history.fs.update.interval  10s 
spark.history.ui.port             18080
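
These settings interact with yarn-site.xml above: each executor requests spark.executor.memory (1024 MB) plus Spark's default overhead of max(384 MB, 10%) = 384 MB, and YARN rounds the resulting 1408 MB up to 1536 MB (the next multiple of yarn.scheduler.minimum-allocation-mb), so one executor fits per 2560 MB NodeManager. A hypothetical submission consistent with these defaults; the application file name is a placeholder, not from this repo:

    spark-submit \
        --master yarn \
        --num-executors 3 \
        --executor-memory 1024m \
        --driver-memory 1024m \
        geospatial_job.py   # placeholder application name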

##################################################################################################
