
spark-perf's Introduction

Spark Performance Tests


This is a performance testing framework for Apache Spark 1.0+.

Features

  • Suites of performance tests for Spark, PySpark, Spark Streaming, and MLlib.
  • Parameterized test configurations:
    • Sweeps sets of parameters to test against multiple Spark and test configurations.
  • Automatically downloads and builds Spark:
    • Maintains a cache of successful builds to enable rapid testing against multiple Spark versions.
  • [...]

For questions, bug reports, or feature requests, please open an issue on GitHub.

Coverage

  • Spark Core RDD
    • list coming soon
  • SQL and DataFrames
    • coming soon
  • Machine Learning
    • glm-regression: Generalized Linear Regression Model
    • glm-classification: Generalized Linear Classification Model
    • naive-bayes: Naive Bayes
    • naive-bayes-bernoulli: Bernoulli Naive Bayes
    • decision-tree: Decision Tree
    • als: Alternating Least Squares
    • kmeans: K-Means clustering
    • gmm: Gaussian Mixture Model
    • svd: Singular Value Decomposition
    • pca: Principal Component Analysis
    • summary-statistics: Summary Statistics (min, max, ...)
    • block-matrix-mult: Matrix Multiplication
    • pearson: Pearson's Correlation
    • spearman: Spearman's Correlation
    • chi-sq-feature/gof/mat: Chi-square Tests
  • word2vec: Word2Vec distributed representation of words
    • fp-growth: FP-growth frequent item sets
    • python-glm-classification: Generalized Linear Classification Model
    • python-glm-regression: Generalized Linear Regression Model
    • python-naive-bayes: Naive Bayes
    • python-als: Alternating Least Squares
    • python-kmeans: K-Means clustering
    • python-pearson: Pearson's Correlation
    • python-spearman: Spearman's Correlation

Dependencies

The spark-perf scripts require Python 2.7+. If you're using an earlier version of Python, you may need to install the argparse library using easy_install argparse.

Support for automatically building Spark requires Maven. On spark-ec2 clusters, this can be installed using the ./bin/spark-ec2/install-maven script from this project.

Configuration

To configure spark-perf, copy config/config.py.template to config/config.py and edit that file. See config.py.template for detailed configuration instructions. After editing config.py, execute ./bin/run to run performance tests. You can pass the --config option to use a custom configuration file.
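
As a rough illustration of how parameter sweeps fit into config.py, an edited file might look like the sketch below. This is not copied from the template: the values are placeholders, and JavaOptionSet (already imported by config.py.template, from which your config.py is copied) takes a list of candidate values per option, which the framework appears to sweep.

    import socket

    # Illustrative sketch of an edited config/config.py -- start from config.py.template.
    SPARK_HOME_DIR = "/path/to/your/spark"
    SPARK_CLUSTER_URL = "spark://%s:7077" % socket.gethostname()
    SCALE_FACTOR = 0.05                  # shrink generated datasets for a small cluster
    SPARK_DRIVER_MEMORY = "1g"

    COMMON_JAVA_OPTS = [
        # Each option set lists the values to test; listing several values sweeps them.
        JavaOptionSet("spark.executor.memory", ["2g", "4g"]),
        JavaOptionSet("spark.storage.memoryFraction", [0.66]),
    ]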

The following sections describe some additional settings to change for certain test environments:

Running locally

  1. Set up local SSH server/keys such that ssh localhost works on your machine without a password.

  2. Set config.py options that are friendly for local execution:

    SPARK_HOME_DIR = "/path/to/your/spark"
    SPARK_CLUSTER_URL = "spark://%s:7077" % socket.gethostname()
    SCALE_FACTOR = 0.05
    SPARK_DRIVER_MEMORY = "512m"
    spark.executor.memory = "2g"  # set via JavaOptionSet in COMMON_JAVA_OPTS
    
  3. Uncomment at least one SPARK_TESTS entry.

Running on an existing Spark cluster

  1. SSH into the machine hosting the standalone master

  2. Set config.py options:

    SPARK_HOME_DIR = "/path/to/your/spark/install"
    SPARK_CLUSTER_URL = "spark://<your-master-hostname>:7077"
    SCALE_FACTOR = <depends on your hardware>
    SPARK_DRIVER_MEMORY = <depends on your hardware>
    spark.executor.memory = <depends on your hardware>  # set via JavaOptionSet in COMMON_JAVA_OPTS
    
  3. Uncomment at least one SPARK_TESTS entry.

Running on a spark-ec2 cluster with a custom Spark version

  1. Launch an EC2 cluster with Spark's EC2 scripts.

  2. Set config.py options:

    USE_CLUSTER_SPARK = False
    SPARK_COMMIT_ID = <the Spark commit you want to test>
    SCALE_FACTOR = <depends on your hardware>
    SPARK_DRIVER_MEMORY = <depends on your hardware>
    spark.executor.memory = <depends on your hardware>  # set via JavaOptionSet in COMMON_JAVA_OPTS
    
  3. Uncomment at least one SPARK_TESTS entry.

License

This project is licensed under the Apache 2.0 License. See LICENSE for full license text.

spark-perf's People

Contributors

aarondav, alig, andrewor14, brkyvz, davies, earne, harveyfeng, hhbyyh, holdenk, jkbradley, joshrosen, marmbrus, mateiz, mengxr, nchammas, petro-rudenko, pwendell, rxin, sboeschhuawei, shivaram, tdas, tnachen


spark-perf's Issues

state of the project?

From the commit history in master it has been almost a year since any activity, and some PRs have been un-merged for a very long time.

Who are the main contributors and can we add some folks with write privs so that some PRs can be addressed?

python_mllib_perf: java.lang.OutOfMemoryError: Java heap space

Java options: -Dspark.storage.memoryFraction=0.66 -Dspark.serializer=org.apache.spark.serializer.JavaSerializer -Dspark.executor.memory=16g -Dspark.locality.wait=60000000 -Dspark.shuffle.manager=SORT
Options: GLMClassificationTest --num-trials=10 --inter-trial-wait=3 --num-partitions=128 --random-seed=5 --num-examples=1000000 --num-features=10000 --num-iterations=20 --step-size=0.001 --reg-type=l2 --reg-param=0.1 --optimizer=sgd --per-negative=0.3 --scale-factor=1.0 --loss=logistic

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/12/01 15:06:11 INFO SparkContext: Running Spark version 1.5.1
15/12/01 15:06:11 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/12/01 15:06:11 INFO SecurityManager: Changing view acls to: root
15/12/01 15:06:11 INFO SecurityManager: Changing modify acls to: root
15/12/01 15:06:11 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
15/12/01 15:06:12 INFO Slf4jLogger: Slf4jLogger started
15/12/01 15:06:12 INFO Remoting: Starting remoting
15/12/01 15:06:12 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://[email protected]:56882]
15/12/01 15:06:12 INFO Utils: Successfully started service 'sparkDriver' on port 56882.
15/12/01 15:06:12 INFO SparkEnv: Registering MapOutputTracker
15/12/01 15:06:12 INFO SparkEnv: Registering BlockManagerMaster
15/12/01 15:06:12 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-528106e5-c257-4af4-9bf6-276bc5a64f00
15/12/01 15:06:12 INFO MemoryStore: MemoryStore started with capacity 10.8 GB
15/12/01 15:06:12 INFO HttpFileServer: HTTP File server directory is /tmp/spark-ddb4b717-dfbe-4167-9ece-02f36c77c8a3/httpd-53fa3c9e-df28-476b-9bdc-b5a2fd07602b
15/12/01 15:06:12 INFO HttpServer: Starting HTTP Server
15/12/01 15:06:12 INFO Utils: Successfully started service 'HTTP file server' on port 56904.
15/12/01 15:06:12 INFO SparkEnv: Registering OutputCommitCoordinator
15/12/01 15:06:13 INFO Utils: Successfully started service 'SparkUI' on port 4040.
15/12/01 15:06:13 INFO SparkUI: Started SparkUI at http://10.88.67.113:4040
15/12/01 15:06:13 INFO Utils: Copying /home/test/spark-perf-master/pyspark-tests/mllib_tests.py to /tmp/spark-ddb4b717-dfbe-4167-9ece-02f36c77c8a3/userFiles-228e8f4d-b978-457a-b499-2d1a13e8153c/mllib_tests.py
15/12/01 15:06:13 INFO SparkContext: Added file file:/home/test/spark-perf-master/pyspark-tests/mllib_tests.py at http://10.88.67.113:56904/files/mllib_tests.py with timestamp 1448962573242
15/12/01 15:06:13 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
15/12/01 15:06:13 INFO AppClient$ClientEndpoint: Connecting to master spark://pts00450-vm8:7077...
15/12/01 15:06:13 INFO SparkDeploySchedulerBackend: Connected to Spark cluster with app ID app-20151201150613-0000
15/12/01 15:06:13 INFO AppClient$ClientEndpoint: Executor added: app-20151201150613-0000/0 on worker-20151201150600-10.88.67.113-54856 (10.88.67.113:54856) with 2 cores
15/12/01 15:06:13 INFO SparkDeploySchedulerBackend: Granted executor ID app-20151201150613-0000/0 on hostPort 10.88.67.113:54856 with 2 cores, 9.0 GB RAM
15/12/01 15:06:13 INFO AppClient$ClientEndpoint: Executor updated: app-20151201150613-0000/0 is now RUNNING
15/12/01 15:06:13 INFO AppClient$ClientEndpoint: Executor updated: app-20151201150613-0000/0 is now LOADING
15/12/01 15:06:13 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 46515.
15/12/01 15:06:13 INFO NettyBlockTransferService: Server created on 46515
15/12/01 15:06:13 INFO BlockManagerMaster: Trying to register BlockManager
15/12/01 15:06:13 INFO BlockManagerMasterEndpoint: Registering block manager 10.88.67.113:46515 with 10.8 GB RAM, BlockManagerId(driver, 10.88.67.113, 46515)
15/12/01 15:06:13 INFO BlockManagerMaster: Registered BlockManager
15/12/01 15:06:13 INFO SparkDeploySchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
15/12/01 15:06:14 INFO SparkContext: Starting job: count at /home/test/spark-perf-master/pyspark-tests/mllib_tests.py:73
15/12/01 15:06:14 INFO DAGScheduler: Got job 0 (count at /home/test/spark-perf-master/pyspark-tests/mllib_tests.py:73) with 128 output partitions
15/12/01 15:06:14 INFO DAGScheduler: Final stage: ResultStage 0(count at /home/test/spark-perf-master/pyspark-tests/mllib_tests.py:73)
15/12/01 15:06:14 INFO DAGScheduler: Parents of final stage: List()
15/12/01 15:06:14 INFO DAGScheduler: Missing parents: List()
15/12/01 15:06:14 INFO DAGScheduler: Submitting ResultStage 0 (PythonRDD[3] at count at /home/test/spark-perf-master/pyspark-tests/mllib_tests.py:73), which has no missing parents
15/12/01 15:06:14 INFO MemoryStore: ensureFreeSpace(86504) called with curMem=0, maxMem=11596411699
15/12/01 15:06:14 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 84.5 KB, free 10.8 GB)
15/12/01 15:06:14 INFO MemoryStore: ensureFreeSpace(85000) called with curMem=86504, maxMem=11596411699
15/12/01 15:06:14 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 83.0 KB, free 10.8 GB)
15/12/01 15:06:14 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.88.67.113:46515 (size: 83.0 KB, free: 10.8 GB)
15/12/01 15:06:14 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:861
15/12/01 15:06:14 INFO DAGScheduler: Submitting 128 missing tasks from ResultStage 0 (PythonRDD[3] at count at /home/test/spark-perf-master/pyspark-tests/mllib_tests.py:73)
15/12/01 15:06:14 INFO TaskSchedulerImpl: Adding task set 0.0 with 128 tasks
15/12/01 15:06:17 INFO SparkDeploySchedulerBackend: Registered executor: AkkaRpcEndpointRef(Actor[akka.tcp://[email protected]:52848/user/Executor#-940891447]) with ID 0
15/12/01 15:06:17 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 10.88.67.113, PROCESS_LOCAL, 2042 bytes)
15/12/01 15:06:17 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, 10.88.67.113, PROCESS_LOCAL, 2042 bytes)
15/12/01 15:06:17 INFO BlockManagerMasterEndpoint: Registering block manager 10.88.67.113:51505 with 8.1 GB RAM, BlockManagerId(0, 10.88.67.113, 51505)
15/12/01 15:06:17 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.88.67.113:51505 (size: 83.0 KB, free: 8.1 GB)
15/12/01 15:06:23 INFO BlockManagerInfo: Added rdd_2_0 in memory on 10.88.67.113:51505 (size: 597.0 MB, free: 7.5 GB)
15/12/01 15:06:23 INFO BlockManagerInfo: Added rdd_2_1 in memory on 10.88.67.113:51505 (size: 597.1 MB, free: 6.9 GB)
15/12/01 15:06:24 INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, 10.88.67.113, PROCESS_LOCAL, 2042 bytes)
15/12/01 15:06:24 INFO TaskSetManager: Starting task 3.0 in stage 0.0 (TID 3, 10.88.67.113, PROCESS_LOCAL, 2042 bytes)
15/12/01 15:06:24 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 6987 ms on 10.88.67.113 (1/128)
15/12/01 15:06:24 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 6931 ms on 10.88.67.113 (2/128)
15/12/01 15:06:28 INFO BlockManagerInfo: Added rdd_2_2 in memory on 10.88.67.113:51505 (size: 597.0 MB, free: 6.4 GB)
15/12/01 15:06:29 INFO BlockManagerInfo: Added rdd_2_3 in memory on 10.88.67.113:51505 (size: 597.1 MB, free: 5.8 GB)
15/12/01 15:06:30 INFO TaskSetManager: Starting task 4.0 in stage 0.0 (TID 4, 10.88.67.113, PROCESS_LOCAL, 2042 bytes)
15/12/01 15:06:30 INFO TaskSetManager: Finished task 2.0 in stage 0.0 (TID 2) in 5991 ms on 10.88.67.113 (3/128)
15/12/01 15:06:30 INFO TaskSetManager: Starting task 5.0 in stage 0.0 (TID 5, 10.88.67.113, PROCESS_LOCAL, 2042 bytes)
15/12/01 15:06:30 INFO TaskSetManager: Finished task 3.0 in stage 0.0 (TID 3) in 6197 ms on 10.88.67.113 (4/128)
15/12/01 15:06:36 INFO BlockManagerInfo: Added rdd_2_5 in memory on 10.88.67.113:51505 (size: 597.1 MB, free: 5.2 GB)
15/12/01 15:06:36 INFO BlockManagerInfo: Added rdd_2_4 in memory on 10.88.67.113:51505 (size: 597.0 MB, free: 4.6 GB)
15/12/01 15:06:36 INFO TaskSetManager: Starting task 6.0 in stage 0.0 (TID 6, 10.88.67.113, PROCESS_LOCAL, 2042 bytes)
15/12/01 15:06:36 INFO TaskSetManager: Starting task 7.0 in stage 0.0 (TID 7, 10.88.67.113, PROCESS_LOCAL, 2042 bytes)
15/12/01 15:06:36 INFO TaskSetManager: Finished task 5.0 in stage 0.0 (TID 5) in 6511 ms on 10.88.67.113 (5/128)
15/12/01 15:06:36 INFO TaskSetManager: Finished task 4.0 in stage 0.0 (TID 4) in 6768 ms on 10.88.67.113 (6/128)
15/12/01 15:06:41 INFO BlockManagerInfo: Added rdd_2_6 in memory on 10.88.67.113:51505 (size: 597.0 MB, free: 4.0 GB)
15/12/01 15:06:42 INFO BlockManagerInfo: Added rdd_2_7 in memory on 10.88.67.113:51505 (size: 597.1 MB, free: 3.4 GB)
15/12/01 15:06:42 INFO TaskSetManager: Starting task 8.0 in stage 0.0 (TID 8, 10.88.67.113, PROCESS_LOCAL, 2042 bytes)
15/12/01 15:06:42 INFO TaskSetManager: Finished task 6.0 in stage 0.0 (TID 6) in 5899 ms on 10.88.67.113 (7/128)
15/12/01 15:06:42 INFO TaskSetManager: Starting task 9.0 in stage 0.0 (TID 9, 10.88.67.113, PROCESS_LOCAL, 2042 bytes)
15/12/01 15:06:43 INFO TaskSetManager: Finished task 7.0 in stage 0.0 (TID 7) in 6308 ms on 10.88.67.113 (8/128)
15/12/01 15:08:24 INFO TaskSetManager: Starting task 10.0 in stage 0.0 (TID 10, 10.88.67.113, PROCESS_LOCAL, 2042 bytes)
15/12/01 15:08:24 WARN TaskSetManager: Lost task 8.0 in stage 0.0 (TID 8, 10.88.67.113): java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2479)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:130)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:105)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:165)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:134)
at org.xerial.snappy.SnappyOutputStream.dumpOutput(SnappyOutputStream.java:294)
at org.xerial.snappy.SnappyOutputStream.compressInput(SnappyOutputStream.java:306)
at org.xerial.snappy.SnappyOutputStream.rawWrite(SnappyOutputStream.java:245)
at org.xerial.snappy.SnappyOutputStream.write(SnappyOutputStream.java:107)
at org.apache.spark.io.SnappyOutputStreamWrapper.write(CompressionCodec.scala:189)
at com.esotericsoftware.kryo.io.Output.flush(Output.java:155)
at com.esotericsoftware.kryo.io.Output.require(Output.java:135)
at com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:220)
at com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:206)
at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.write(DefaultArraySerializers.java:29)
at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.write(DefaultArraySerializers.java:18)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
at org.apache.spark.serializer.KryoSerializationStream.writeObject(KryoSerializer.scala:158)
at org.apache.spark.serializer.SerializationStream.writeAll(Serializer.scala:153)
at org.apache.spark.storage.BlockManager.dataSerializeStream(BlockManager.scala:1190)
at org.apache.spark.storage.BlockManager.dataSerialize(BlockManager.scala:1199)
at org.apache.spark.storage.MemoryStore.putArray(MemoryStore.scala:132)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:793)
at org.apache.spark.storage.BlockManager.putArray(BlockManager.scala:669)
at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:175)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1157)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:627)
at java.lang.Thread.run(Thread.java:809)

15/12/01 15:08:24 ERROR TaskSchedulerImpl: Lost executor 0 on 10.88.67.113: remote Rpc client disassociated

Interpreting Spark-perf results

I just started using spark-perf (master) and I am running the PySpark tests only. After the run, it prints output in the results folder, but I don't clearly understand what those numbers mean. For example,

python-scheduling-throughput, SchedulerThroughputTest --num-tasks=5000 --num-trials=10 --inter-trial-wait=3, 2.505, 0.145, 2.383, 2.789, 2.460

python-agg-by-key, AggregateByKey --num-trials=10 --inter-trial-wait=3 --num-partitions=400 --reduce-tasks=400 --random-seed=5 --persistent-type=memory --num-records=200000000 --unique-keys=20000 --key-length=10 --unique-values=1000000 --value-length=10 , 28.7235, 0.203, 28.461, 29.106, 28.537

What do the numbers 2.505, 0.145, etc. mean for the first PySpark job, and 28.7235, 0.203, etc. for the second job?

Spark-perf data size for benchmark

I am running PySpark tests with spark-perf. I get out-of-memory errors, so I was wondering what data size the package handles during the benchmark.

S3-backed build cache

Performing large bisections would be even faster if we used an S3-backed build cache that stored the compressed build archives.

Move configuration cross-product logic out of the test runners

Currently, the logic for taking a cross-product of configuration options is mixed into the test suite runner. I believe that this complexity should live at higher layers of the stack: a test runner should accept the configuration necessary for running one test and parameter sweeps should be implemented in terms of multiple calls to the lower-level test runner.

This is a first step towards enabling fully-scriptable configuration.
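
A minimal sketch of that separation, with hypothetical names, assuming the lower-level runner accepts exactly one fully-resolved configuration:

    import itertools

    def run_single_test(test_name, config):
        """Hypothetical lower-level runner: executes one test with one fixed configuration."""
        print("running %s with %s" % (test_name, config))

    def sweep(test_name, option_values):
        """Expand the cross-product of option values; call the runner once per combination."""
        keys = sorted(option_values)
        for combo in itertools.product(*(option_values[k] for k in keys)):
            run_single_test(test_name, dict(zip(keys, combo)))

    sweep("scheduling-throughput", {"spark.executor.memory": ["2g", "4g"],
                                    "num-tasks": [5000, 10000]})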

running two instances of spark-perf ?

I have been running one instance of the spark-perf benchmark, using the core tests and the MLlib tests, on a smallish cluster (only 4 nodes, but quite powerful ones -- 64 GB RAM, 8 cores each) with scale-factor=1.
The benchmark occupies just 4 executors (1 core each).

Now I've tried to launch a second, scaled-down (0.1) configuration of the benchmark suite at the same time, using only the core tests. Although both memory and executors/cores are available, the benchmark fails to start! More precisely, it fails to engage workers, giving me the following message:
"Spark is still running on some slaves ... sleeping for 10 seconds". That is even though I have set USE_CLUSTER_SPARK = True and RESTART_SPARK_CLUSTER = False -- so I guess it tries to use my existing cluster.

On the other hand, if I start a spark-shell or another application, it seems to get admitted just fine!

Any ideas what this means? Given the very sparse information about what the benchmark does and uses, it is rather difficult to know where to start looking.

TIA

Manolis.

glm classification error

On Spark 1.2.0

15/01/26 14:15:07 INFO scheduler.DAGScheduler: Job 0 finished: count at MLAlgorithmTests.scala:204, took 122.066741 s
Exception in thread "main" java.lang.IllegalArgumentException: GLMClassificationTest given incompatible (loss, regType) = (logistic, l2). Note the set of supported combinations increases in later Spark versions.
        at mllib.perf.onepointtwo.GLMClassificationTest.runTest(MLAlgorithmTests.scala:238)
        at mllib.perf.onepointtwo.GLMClassificationTest.runTest(MLAlgorithmTests.scala:167)
        at mllib.perf.onepointtwo.RegressionAndClassificationTests.run(MLAlgorithmTests.scala:38)
        at mllib.perf.onepointtwo.TestRunner$$anonfun$2.apply(TestRunner.scala:55)
        at mllib.perf.onepointtwo.TestRunner$$anonfun$2.apply(TestRunner.scala:53)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.immutable.Range.foreach(Range.scala:141)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
        at scala.collection.AbstractTraversable.map(Traversable.scala:105)
        at mllib.perf.onepointtwo.TestRunner$.main(TestRunner.scala:53)
        at mllib.perf.onepointtwo.TestRunner.main(TestRunner.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

chi-sq-matrix exception in MLlib 1.1

06/10/26 22:45:27 INFO cluster.SparkDeploySchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
Exception in thread "main" scala.MatchError: chi-sq-matrix (of class java.lang.String)
        at mllib.perf.onepointone.TestRunner$.main(TestRunner.scala:19)
        at mllib.perf.onepointone.TestRunner.main(TestRunner.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

How to read the test result

I ran the spark-perf test, but how do I interpret the result? I see an array of time estimates; how do I interpret them?

results:{  
   "testName":"scheduling-throughput",
   "options":{  
      "num-tasks":"10000",
      "num-trials":"10",
      "random-seed":"5",
      "num-jobs":"1",
      "inter-trial-wait":"3",
      "closure-size":"0"
   },
   "sparkConf":{  
      "spark.serializer":"org.apache.spark.serializer.JavaSerializer",
      "spark.driver.host":"172.16.30.5",
      "spark.driver.port":"40017",
      "spark.jars":"file:/home/zeus/temp/spark-perf/spark-tests/target/spark-perf-tests-assembly.jar",
      "spark.app.name":"TestRunner: scheduling-throughput",
      "spark.storage.memoryFraction":"0.66",
      "spark.locality.wait":"60000000",
      "spark.driver.memory":"512m",
      "spark.executor.id":"driver",
      "spark.submit.deployMode":"client",
      "spark.master":"spark://testbed5.jvn.edu.vn:7077",
      "spark.fileserver.uri":"http://172.16.30.5:49201",
      "spark.externalBlockStore.folderName":"spark-2211ef9a-d22b-41a2-a5a4-310443980c95",
      "spark.app.id":"app-20160427111045-0000"
   },
   "sparkVersion":"1.5.2",
   "systemProperties":{  
      "java.io.tmpdir":"/tmp",
      "spark.serializer":"org.apache.spark.serializer.JavaSerializer",
      "line.separator":"\n",
      "path.separator":":",
      "sun.management.compiler":"HotSpot 64-Bit Tiered Compilers",
      "SPARK_SUBMIT":"true",
      "sun.cpu.endian":"little",
      "java.specification.version":"1.8",
      "java.vm.specification.name":"Java Virtual Machine Specification",
      "java.vendor":"Oracle Corporation",
      "java.vm.specification.version":"1.8",
      "user.home":"/home/zeus",
      "file.encoding.pkg":"sun.io",
      "sun.nio.ch.bugLevel":"",
      "sun.arch.data.model":"64",
      "sun.boot.library.path":"/usr/lib/jvm/java-8-oracle/jre/lib/amd64",
      "user.dir":"/home/zeus/temp/spark-perf",
      "spark.jars":"file:/home/zeus/temp/spark-perf/spark-tests/target/spark-perf-tests-assembly.jar",
      "java.library.path":"/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib",
      "sun.cpu.isalist":"",
      "os.arch":"amd64",
      "java.vm.version":"25.66-b17",
      "spark.app.name":"spark.perf.TestRunner",
      "java.endorsed.dirs":"/usr/lib/jvm/java-8-oracle/jre/lib/endorsed",
      "java.runtime.version":"1.8.0_66-b17",
      "java.vm.info":"mixed mode",
      "sparkperf.commitSHA":"unknown",
      "java.ext.dirs":"/usr/lib/jvm/java-8-oracle/jre/lib/ext:/usr/java/packages/lib/ext",
      "spark.storage.memoryFraction":"0.66",
      "java.runtime.name":"Java(TM) SE Runtime Environment",
      "spark.locality.wait":"60000000",
      "spark.driver.memory":"512m",
      "file.separator":"/",
      "java.class.version":"52.0",
      "java.specification.name":"Java Platform API Specification",
      "sun.boot.class.path":"/usr/lib/jvm/java-8-oracle/jre/lib/resources.jar:/usr/lib/jvm/java-8-oracle/jre/lib/rt.jar:/usr/lib/jvm/java-8-oracle/jre/lib/sunrsasign.jar:/usr/lib/jvm/java-8-oracle/jre/lib/jsse.jar:/usr/lib/jvm/java-8-oracle/jre/lib/jce.jar:/usr/lib/jvm/java-8-oracle/jre/lib/charsets.jar:/usr/lib/jvm/java-8-oracle/jre/lib/jfr.jar:/usr/lib/jvm/java-8-oracle/jre/classes",
      "file.encoding":"UTF-8",
      "user.timezone":"Asia/Ho_Chi_Minh",
      "java.specification.vendor":"Oracle Corporation",
      "sun.java.launcher":"SUN_STANDARD",
      "os.version":"3.19.0-25-generic",
      "sun.os.patch.level":"unknown",
      "spark.submit.deployMode":"client",
      "java.vm.specification.vendor":"Oracle Corporation",
      "spark.master":"spark://testbed5.jvn.edu.vn:7077",
      "user.country":"US",
      "sun.jnu.encoding":"UTF-8",
      "user.language":"en",
      "java.vendor.url":"http://java.oracle.com/",
      "java.awt.printerjob":"sun.print.PSPrinterJob",
      "java.awt.graphicsenv":"sun.awt.X11GraphicsEnvironment",
      "awt.toolkit":"sun.awt.X11.XToolkit",
      "java.class.path":"/home/zeus/temp/spark-1.5.2-bin-hadoop2.6/conf/:/home/zeus/temp/spark-1.5.2-bin-hadoop2.6/lib/spark-assembly-1.5.2-hadoop2.6.0.jar:/home/zeus/temp/spark-1.5.2-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/home/zeus/temp/spark-1.5.2-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/home/zeus/temp/spark-1.5.2-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar",
      "os.name":"Linux",
      "java.vm.vendor":"Oracle Corporation",
      "java.vendor.url.bug":"http://bugreport.sun.com/bugreport/",
      "user.name":"zeus",
      "java.vm.name":"Java HotSpot(TM) 64-Bit Server VM",
      "sun.java.command":"org.apache.spark.deploy.SparkSubmit --master spark://testbed5.jvn.edu.vn:7077 --conf spark.driver.memory=512m --class spark.perf.TestRunner /home/zeus/temp/spark-perf/spark-tests/target/spark-perf-tests-assembly.jar scheduling-throughput --num-trials=10 --inter-trial-wait=3 --num-tasks=10000 --num-jobs=1 --closure-size=0 --random-seed=5",
      "java.home":"/usr/lib/jvm/java-8-oracle/jre",
      "java.version":"1.8.0_66",
      "sun.io.unicode.encoding":"UnicodeLittle"
   },
   "results":[  
      {  
         "time":5.715
      },
      {  
         "time":2.052
      },
      {  
         "time":1.531
      },
      {  
         "time":1.449
      },
      {  
         "time":1.285
      },
      {  
         "time":1.46
      },
      {  
         "time":1.33
      },
      {  
         "time":1.274
      },
      {  
         "time":1.232
      },
      {  
         "time":1.246
      }
   ]
}
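
Since the per-trial times are all present in the "results" array, one way to make sense of a run is to recompute summary statistics yourself. A small sketch, assuming the .out file contains the JSON object on a single line prefixed with results: (the path below is illustrative):

    import json

    def parse_results_json(path):
        """Return the JSON object from the line prefixed with 'results:' in a .out file."""
        with open(path) as f:
            for line in f:
                if line.startswith("results:"):
                    return json.loads(line[len("results:"):])
        raise ValueError("no 'results:' line found in %s" % path)

    report = parse_results_json("results/my_run_logs/scheduling-throughput.out")  # illustrative path
    times = sorted(r["time"] for r in report["results"])
    mean = sum(times) / len(times)
    print("trials=%d min=%.3f rough-median=%.3f mean=%.3f max=%.3f"
          % (len(times), times[0], times[len(times) // 2], mean, times[-1]))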

Traceback in hdfs-recovery

--------------------------------------------------------------------
Java options: -Dspark.storage.memoryFraction=0.66 -Dspark.serializer=org.apache.spark.serializer.JavaSerializer -Dspark.executor.memory=9g -Dspark.executor.extraJavaOptions= -XX:+UseConcMarkSweepGC 
Options: hdfs-recovery --total-duration=60 --hdfs-url=hdfs://numaq1-1/test --batch-duration=5000 --records-per-file=10000 --file-cleaner-delay=300
--------------------------------------------------------------------
<snip>
14/10/16 10:45:26 INFO scheduler.JobGenerator: Started JobGenerator at 1413470730000 ms
14/10/16 10:45:26 INFO scheduler.JobScheduler: Started JobScheduler
Exception in thread "main" java.lang.AbstractMethodError
        at org.apache.spark.Logging$class.log(Logging.scala:52)
        at streaming.perf.util.FileGenerator.log(FileGenerator.scala:15)
        at org.apache.spark.Logging$class.logInfo(Logging.scala:59)
        at streaming.perf.util.FileGenerator.logInfo(FileGenerator.scala:15)
        at streaming.perf.util.FileGenerator.start(FileGenerator.scala:51)
        at streaming.perf.HdfsRecoveryTest.run(HdfsRecoveryTest.scala:61)
        at streaming.perf.TestRunner$.main(TestRunner.scala:18)
        at streaming.perf.TestRunner.main(TestRunner.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Exception in thread "Thread-101" java.lang.AbstractMethodError
        at org.apache.spark.Logging$class.log(Logging.scala:52)
        at streaming.perf.util.FileGenerator.log(FileGenerator.scala:15)
        at org.apache.spark.Logging$class.logInfo(Logging.scala:59)
        at streaming.perf.util.FileGenerator.logInfo(FileGenerator.scala:15)
        at streaming.perf.util.FileGenerator.streaming$perf$util$FileGenerator$$copyFile(FileGenerator.scala:108)
        at streaming.perf.util.FileGenerator$$anonfun$streaming$perf$util$FileGenerator$$generateFiles$1$$anonfun$apply$mcVI$sp$1.apply$mcVJ$sp(FileGenerator.scala:82)
        at streaming.perf.util.FileGenerator$$anonfun$streaming$perf$util$FileGenerator$$generateFiles$1$$anonfun$apply$mcVI$sp$1.apply(FileGenerator.scala:75)
        at streaming.perf.util.FileGenerator$$anonfun$streaming$perf$util$FileGenerator$$generateFiles$1$$anonfun$apply$mcVI$sp$1.apply(FileGenerator.scala:75)
        at scala.collection.immutable.NumericRange.foreach(NumericRange.scala:74)
        at streaming.perf.util.FileGenerator$$anonfun$streaming$perf$util$FileGenerator$$generateFiles$1.apply$mcVI$sp(FileGenerator.scala:75)
        at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
        at streaming.perf.util.FileGenerator.streaming$perf$util$FileGenerator$$generateFiles(FileGenerator.scala:73)
        at streaming.perf.util.FileGenerator$$anon$1.run(FileGenerator.scala:28)

Building mllib test problem

I was testing against Spark 1.1.0, so I had to change Build.scala:

# git diff project/Build.scala 
diff --git a/mllib-tests/project/Build.scala b/mllib-tests/project/Build.scala
index 6f5434e..c3f7394 100644
--- a/mllib-tests/project/Build.scala
+++ b/mllib-tests/project/Build.scala
@@ -20,12 +20,12 @@ object MyBuild extends Build {
     lazy val onepointone = project
         .settings(
           libraryDependencies ++= deps,
-          libraryDependencies += "org.apache.spark" %% "spark-mllib" % "1.1.1-SNAPSHOT" % "provided"
+          libraryDependencies += "org.apache.spark" %% "spark-mllib" % "1.1.0" % "provided"
         )

     lazy val onepointoh = project
         .settings(        //should be set to 1.0.0 or higher
-          libraryDependencies += "org.apache.spark" %% "spark-mllib" % "1.0.0" % "provided",
+          libraryDependencies += "org.apache.spark" %% "spark-mllib" % "1.1.0" % "provided",
           libraryDependencies ++= deps
         )
 }

After I got the dependency problem resolved, there are now compilation errors:

[info] Compiling 5 Scala sources to /net/home/ltsai/a/spark-perf/mllib-tests/onepointone/target/scala-2.10/classes...
[error] /net/home/ltsai/a/spark-perf/mllib-tests/onepointoh/src/main/scala/mllib/perf/onepointoh/MLAlgorithmTests.scala:523: overloaded method value train with alternatives:
[error]   (input: org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint],algo: org.apache.spark.mllib.tree.configuration.Algo.Algo,impurity: org.apache.spark.mllib.tree.impurity.Impurity,maxDepth: Int,numClassesForClassification: Int,maxBins: Int,quantileCalculationStrategy: org.apache.spark.mllib.tree.configuration.QuantileStrategy.QuantileStrategy,categoricalFeaturesInfo: Map[Int,Int])org.apache.spark.mllib.tree.model.DecisionTreeModel <and>
[error]   (input: org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint],algo: org.apache.spark.mllib.tree.configuration.Algo.Algo,impurity: org.apache.spark.mllib.tree.impurity.Impurity,maxDepth: Int,numClassesForClassification: Int)org.apache.spark.mllib.tree.model.DecisionTreeModel <and>
[error]   (input: org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint],algo: org.apache.spark.mllib.tree.configuration.Algo.Algo,impurity: org.apache.spark.mllib.tree.impurity.Impurity,maxDepth: Int)org.apache.spark.mllib.tree.model.DecisionTreeModel <and>
[error]   (input: org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint],strategy: org.apache.spark.mllib.tree.configuration.Strategy)org.apache.spark.mllib.tree.model.DecisionTreeModel
[error]  cannot be applied to (org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint], org.apache.spark.mllib.tree.configuration.Algo.Value, org.apache.spark.mllib.tree.impurity.Variance.type, Int, Int, org.apache.spark.mllib.tree.configuration.QuantileStrategy.Value, Map[Int,Int])
[error]       DecisionTree.train(rdd, Regression, Variance, treeDepth, maxBins, QuantileStrategy.Sort,
[error]                    ^
[error] /net/home/ltsai/a/spark-perf/mllib-tests/onepointoh/src/main/scala/mllib/perf/onepointoh/MLAlgorithmTests.scala:527: overloaded method value train with alternatives:
[error]   (input: org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint],algo: org.apache.spark.mllib.tree.configuration.Algo.Algo,impurity: org.apache.spark.mllib.tree.impurity.Impurity,maxDepth: Int,numClassesForClassification: Int,maxBins: Int,quantileCalculationStrategy: org.apache.spark.mllib.tree.configuration.QuantileStrategy.QuantileStrategy,categoricalFeaturesInfo: Map[Int,Int])org.apache.spark.mllib.tree.model.DecisionTreeModel <and>
[error]   (input: org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint],algo: org.apache.spark.mllib.tree.configuration.Algo.Algo,impurity: org.apache.spark.mllib.tree.impurity.Impurity,maxDepth: Int,numClassesForClassification: Int)org.apache.spark.mllib.tree.model.DecisionTreeModel <and>
[error]   (input: org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint],algo: org.apache.spark.mllib.tree.configuration.Algo.Algo,impurity: org.apache.spark.mllib.tree.impurity.Impurity,maxDepth: Int)org.apache.spark.mllib.tree.model.DecisionTreeModel <and>
[error]   (input: org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint],strategy: org.apache.spark.mllib.tree.configuration.Strategy)org.apache.spark.mllib.tree.model.DecisionTreeModel
[error]  cannot be applied to (org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint], org.apache.spark.mllib.tree.configuration.Algo.Value, org.apache.spark.mllib.tree.impurity.Gini.type, Int, Int, org.apache.spark.mllib.tree.configuration.QuantileStrategy.Value, Map[Int,Int])
[error]       DecisionTree.train(rdd, Classification, Gini, treeDepth,
[error]                    ^
[error] two errors found
[error] (onepointoh/compile:compile) Compilation failed
[error] Total time: 379 s, completed Oct 16, 2014 10:06:07 PM

MLlib TODO items

  • Change Scala testName to match Python test names: "glm-regression" -> GLMRegressionTest
  • Make parameter names match across all tests. (num-examples, num-rows, etc.)
  • Refactor correlation tests so pearson/spearman is a parameter.
  • Better data generation in Python

Error when running tests: java.lang.NoSuchMethodError: joptsimple.ArgumentAcceptingOptionSpec.required()

I get the following error when running tests. Proper classpath is not being set. Any suggestions are appreciated.

Exception in thread "main" java.lang.NoSuchMethodError: joptsimple.ArgumentAcceptingOptionSpec.required()Ljoptsimple/ArgumentAcceptingOptionSpec;
    at mllib.perf.PerfTest$$anonfun$addOptionsToParser$1.apply(PerfTest.scala:80)
    at mllib.perf.PerfTest$$anonfun$addOptionsToParser$1.apply(PerfTest.scala:79)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
Exception in thread "main" java.lang.IllegalArgumentException: int is not a value type
    at joptsimple.internal.Reflection.findConverter(Reflection.java:66)
    at joptsimple.ArgumentAcceptingOptionSpec.ofType(ArgumentAcceptingOptionSpec.java:111)
    at mllib.perf.PerfTest$$anonfun$addOptionsToParser$3.apply(PerfTest.scala:86)

Thank you

Memory Problems with Spark-perf

I am running PySpark tests on a cluster with 12 nodes, 20 cores and 60 GB of memory per node. I get output from the first few tests (sort, agg, count, etc.), but when it reaches broadcast, the job terminates. Judging from the .err file in the results folder, I assume it is due to lack of memory: ensureFreeSpace(4194304) called with curMem=610484012, maxMem=611642769. How can I increase the maxMem value? This is my config/config.py file content.

COMMON_JAVA_OPTS = [
# Fraction of JVM memory used for caching RDDs.
JavaOptionSet("spark.storage.memoryFraction", [0.66]),
JavaOptionSet("spark.serializer", ["org.apache.spark.serializer.JavaSerializer"]),
JavaOptionSet("spark.executor.memory", ["9g"]),
and

Set driver memory here

SPARK_DRIVER_MEMORY = "20g"

It shows the running command as follows.

Setting env var SPARK_SUBMIT_OPTS: -Dspark.storage.memoryFraction=0.66 -Dspark.serializer=org.apache.spark.serializer.JavaSerializer -Dspark.executor.memory=9g -Dspark.locality.wait=60000000 -Dsparkperf.commitSHA=unknown
Running command: /nfs/15/soottikkal/local/spark-1.5.2-bin-hadoop2.6//bin/spark-submit --master spark://r0111.ten.osc.edu:7077 pyspark-tests/core_tests.py BroadcastWithBytes --num-trials=10 --inter-trial-wait=3 --num-partitions=400 --reduce-tasks=400 --random-seed=5 --persistent-type=memory --num-records=200000000 --unique-keys=20000 --key-length=10 --unique-values=1000000 --value-length=10 --broadcast-size=209715200 1>> results/python_perf_output__2016-01-28_23-35-54_logs/python-broadcast-w-bytes.out 2>> results/python_perf_output__2016-01-28_23-35-54_logs/python-broadcast-w-bytes.err

Is the spark-submit command picking up the memory settings from config.py here? maxMem is only about 611 MB, which looks like 0.66 * 1 GB of Spark's default memory setting. Changing spark.executor.memory or SPARK_DRIVER_MEMORY in config/config.py has no effect on maxMem, but changing spark.storage.memoryFraction from 0.66 to 0.88 increases maxMem. How can I control maxMem so the tests use the large amount of memory already available in the cluster?

Input Data File Location

Hello,

I am working on a Spark-on-YARN setup and running the k-means algorithm. I want to know the location of the input data file generated by spark-perf, or is it kept in memory only?

Thanks

make-distribution.sh with --skip-java-test fails against master

Attempting to run spark-perf against master fails:

./make-distribution.sh --skip-java-test
The following shell command finished with a non-zero returncode (1): ./make-distribution.sh --skip-java-test

The issue is with the --skip-java-test flag, which was recently removed in apache/spark@6cf51a7. It's still used in Spark 1.4- (https://github.com/apache/spark/blob/branch-1.4/make-distribution.sh#L146)

There are 2 potential fixes:

  • Amend spark-perf to check if --skip-java-test exists in make-distribution. Something like:
...
    with cd(target_dir):
        logger.info("Building spark at version %s; This may take a while...\n" % commit_id)
        # Spark version 1.5+ no longer uses the --skip-java-test flag in make-distribution.sh
        skip_java_test_code = run_cmd('cat make-distribution.sh | grep "skip-java-test"', exit_on_fail=False)
        skip_java_test_str =  "--skip-java-test" if skip_java_test_code == 0 else ""
        # According to the SPARK-1520 JIRA, building with Java 7+ will only cause problems when
        # running PySpark on YARN or when running on Java 6.  Since we'll be building and running
        # Spark on the same machines and using standalone mode, it should be safe to
        # disable this warning:
        if is_yarn_mode:
            run_cmd("./make-distribution.sh %s -Pyarn %s" % (skip_java_test_str, additional_make_distribution_args))
        else:
            run_cmd("./make-distribution.sh %s %s" % (skip_java_test_str, additional_make_distribution_args))
...
  • Amend make-distribution.sh to give a warning such as suggested in the commit.
    e.g.
echo "Warning: '--skip-java-test' is deprecated and has no effect."
;;

I can submit a PR for option 1 or 2 as required, let me know which you prefer.

Use Long instead of Int for numeric options

We should use Long instead of Int for numeric options in order to support larger values (e.g. huge numbers of records), and verify that the data generators work for huge datasets.

The current code fails in the options parsing phase if it attempts to parse numbers bigger than IntMax:

Exception in thread "main" joptsimple.OptionArgumentConversionException: Cannot convert argument '4000000000' of option ['num-records'] to class java.lang.Integer
  at joptsimple.AbstractOptionSpec.convertWith(AbstractOptionSpec.java:94)
  at joptsimple.ArgumentAcceptingOptionSpec.convert(ArgumentAcceptingOptionSpec.java:276)
  at joptsimple.OptionSet.valuesOf(OptionSet.java:222)
  at joptsimple.OptionSet.valueOf(OptionSet.java:171)
  at joptsimple.OptionSet.valueOf(OptionSet.java:152)
  at spark.perf.KVDataTest.createInputData(KVDataTest.scala:62)
  at spark.perf.TestRunner$.main(TestRunner.scala:29)
  at spark.perf.TestRunner.main(TestRunner.scala)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:606)
  at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:314)
  at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:73)
  at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: joptsimple.internal.ReflectionException: java.lang.NumberFormatException: For input string: "4000000000"
  at joptsimple.internal.Reflection.reflectionException(Reflection.java:140)
  at joptsimple.internal.Reflection.invoke(Reflection.java:122)
  at joptsimple.internal.MethodInvokingValueConverter.convert(MethodInvokingValueConverter.java:48)
  at joptsimple.internal.Reflection.convertWith(Reflection.java:128)
  at joptsimple.AbstractOptionSpec.convertWith(AbstractOptionSpec.java:91)
  ... 14 more

How many executors, cores per executor

Hi,

I am having trouble figuring out the number of executors that I need to set. I am running the glm-classification on Spark 1.5/Yarn/CDH-5.5, and it is only using one executor (and I believe one thread).

I've tried setting spark.executor.instances and spark.executor.cores to 6 and 8 respectively (using a cluster of 6 machines, each with 8 cores), but the classification is still only using one executor and one core.

Any thoughts are appreciated!

problem fine tuning stand-alone mode worker instances

I am running spark-perf on a scale-up node with v1.6.0 in stand-alone mode, and I'm having trouble fine tuning the runs. The problem I am currently facing is that I'd like to run multiple workers (smaller JVM heaps) on a single node. But, no matter how I am configuring spark-perf I see:

  • 1 master, N worker instances
  • N coarse-grained executors
  • 1 JVM running the actual test

So, I am able to get multiple worker instances fired up, but I only see one JVM running an actual test (e.g. glm-regression). Am I messing up or missing a configuration option? The other issue I'm curious about is the coarse-grained executors. I have not seen them running before any time that I've used stand-alone mode.

Thanks.

Python mllib tests failing

All tests are failing because of the random seed.

Number of failed tests: 8, failed tests: python-glm-classification,python-glm-classification,python-glm-regression,python-naive-bayes,python-als,python-kmeans,python-pearson,python-spearman

15/01/23 23:17:32 WARN scheduler.TaskSetManager: Lost task 11.0 in stage 0.0 (TID 11, numaq1-1): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/opt/spark-1.2.0-bin-hadoop2.4/python/pyspark/worker.py", line 107, in main
    process()
  File "/opt/spark-1.2.0-bin-hadoop2.4/python/pyspark/worker.py", line 98, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/opt/spark-1.2.0-bin-hadoop2.4/python/pyspark/serializers.py", line 227, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/net/home/ltsai/a/spark-perf.new/pyspark-tests/mllib_data.py", line 21, in gen
    rng = numpy.random.RandomState(hash(str(seed ^ index)))
  File "mtrand.pyx", line 613, in mtrand.RandomState.__init__ (numpy/random/mtrand/mtrand.c:7402)
  File "mtrand.pyx", line 649, in mtrand.RandomState.seed (numpy/random/mtrand/mtrand.c:7702)
ValueError: Seed must be between 0 and 4294967295

        at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:137)
        at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:174)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:96)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
        at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
        at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:242)
        at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
        at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
        at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1460)
        at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:203)

config.py.template error

Hi, Adam,

How can you compare the value MLLIB_SPARK_VERSION = 2.0.0 -- which is not a decimal value, I suppose -- in all the config.py lines like (e.g.) if MLLIB_SPARK_VERSION >= 1.1:?

Thanks for a response.
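
One hypothetical workaround (not an existing spark-perf feature) is to represent the version as a tuple rather than a float, so that a three-component version like 2.0.0 can be expressed and compared:

    # Hypothetical sketch: version comparisons that work for "2.0.0"-style strings.
    def version_tuple(version_str):
        """Turn '2.0.0' into (2, 0, 0) so versions compare component-by-component."""
        return tuple(int(part) for part in version_str.split("."))

    MLLIB_SPARK_VERSION = version_tuple("2.0.0")

    if MLLIB_SPARK_VERSION >= version_tuple("1.1"):
        pass  # enable the tests that require MLlib 1.1 or later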

Spark-perf over mesos

Hi, What is the correct way to submit the test over a Mesos cluster?
I have tried to set the SPARK_CLUSTER_URL variable in two different ways.

SPARK_CLUSTER_URL = "mesos://zk://mesosmaster1:2181,mesosmaster2:2181,mesosmaster3:2181/mesos"

and

SPARK_CLUSTER_URL = "mesos://mesosmaster1:5050"

Neither of them seems to work.

Thanks and Kind Regards
Alessio

Documentation of output files

It would be very helpful to have a description of the output results. For instance there is a file in the root of the results directory that seems to summarize the results:

$ cat results/mllib_perf_output__2016-05-02_23-03-39
glm-regression, glm-regression --num-trials=10 --inter-trial-wait=3 --num-partitions=1 --random-seed=5 --num-examples=1000 --feature-noise=1.0 --num-features=10000 --num-iterations=20 --step-size=0.001 --reg-type=l2 --reg-param=0.1 --elastic-net-param=0.0 --optimizer=sgd --intercept=0.0 --label-noise=0.1 --loss=l2
Training time: 1.6775, 0.101, 1.596, 1.888, 1.596
Test time: 0.0925, 0.009, 0.074, 0.106, 0.074
Training Set Metric: 31.6423691314, 0.780, 30.9980164059, 33.4544170619, 31.4295866559
Test Set Metric: 33.5374436306, 0.830, 31.7722799916, 33.9670505732, 32.7795578401
glm-classification, glm-classification --num-trials=10 --inter-trial-wait=3 --num-partitions=1 --random-seed=5 --num-examples=1000 --feature-noise=1.0 --num-features=10000 --num-iterations=20 --step-size=0.001 --reg-type=l2 --reg-param=0.1 --elastic-net-param=0.0 --per-negative=0.3 --optimizer=sgd --loss=logistic
Training time: 1.596, 0.099, 1.51, 1.839, 1.51
Test time: 0.079, 0.007, 0.064, 0.085, 0.064
Training Set Metric: 94.1936507186, 0.454, 93.6170212766, 93.6170212766, 94.8275862069
Test Set Metric: 85.1563662824, 2.205, 81.3008130081, 86.3117870722, 85.9848484848

There is also the results JSON string for each experiment in the *_logs directory. I'm assuming the above is a summary, and an explanation of it would be very useful.

Mllib test failed

There are many failed workloads in the MLlib tests, such as glm-regression. The following is the error log. I use Spark 1.3.0. Can anyone help me find the reason?

16/07/23 10:50:43 INFO SparkDeploySchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
Exception in thread "main" java.lang.NoClassDefFoundError: org.apache.spark.ml.attribute.Attribute
at mllib.perf.GLMRegressionTest.createInputData(MLAlgorithmTests.scala:119)
at mllib.perf.TestRunner$$anonfun$2.apply(TestRunner.scala:67)
at mllib.perf.TestRunner$$anonfun$2.apply(TestRunner.scala:66)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.Range.foreach(Range.scala:141)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at mllib.perf.TestRunner$.main(TestRunner.scala:66)
at mllib.perf.TestRunner.main(TestRunner.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:95)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:56)
at java.lang.reflect.Method.invoke(Method.java:620)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.ml.attribute.Attribute
at java.net.URLClassLoader.findClass(URLClassLoader.java:600)
at java.lang.ClassLoader.loadClassHelper(ClassLoader.java:786)
at java.lang.ClassLoader.loadClass(ClassLoader.java:764)
at java.lang.ClassLoader.loadClass(ClassLoader.java:741)
... 19 more

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties

Java options: -Dspark.storage.memoryFraction=0.66 -Dspark.serializer=org.apache.spark.serializer.JavaSerializer -Dspark.locality.wait=60000000 -Dspark.shuffle.manager=SORT
Options: glm-regression --num-trials=10 --inter-trial-wait=3 --num-partitions=64 --random-seed=5 --num-examples=500000 --num-iterations=20 --optimizer=auto --reg-type=elastic-net --elastic-net-param=0.0 --reg-param=0.01 --feature-noise=1.0 --step-size=0.0 --label-noise=0.1 --intercept=0.2 --loss=l2 --num-features=10000

SBT get doesn't work

Looks like sbt has moved?

wget http://typesafe.artifactoryonline.com/typesafe/ivy-releases/org.scala-sbt/sbt-launch/0.13.6/sbt-launch.jar

leads to...

--2016-07-11 22:06:13--  http://typesafe.artifactoryonline.com/typesafe/ivy-releases/org.scala-sbt/sbt-launch/0.13.6/sbt-launch.jar
Resolving typesafe.artifactoryonline.com (typesafe.artifactoryonline.com)... 52.22.208.20, 54.85.27.32
Connecting to typesafe.artifactoryonline.com (typesafe.artifactoryonline.com)|52.22.208.20|:80... connected.
HTTP request sent, awaiting response... 404 Not Found
2016-07-11 22:06:13 ERROR 404: Not Found.

spearman test hung on MLlib 1.1

Attached screen shot.


06/10/27 04:40:17 INFO scheduler.DAGScheduler: Stage 1 (sortByKey at SpearmanCorrelation.scala:55) finished in 18.166 s
06/10/27 04:40:17 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool 
06/10/27 04:40:17 INFO spark.SparkContext: Job finished: sortByKey at SpearmanCorrelation.scala:55, took 18.552112971 s
06/10/27 04:40:18 INFO spark.SparkContext: Starting job: first at RowMatrix.scala:58
06/10/27 04:40:18 INFO spark.SparkContext: Starting job: apply at Option.scala:120

Test did not produce expected results. Output was: spark-perf

Hi,
I am using it with the spark-2.0.0-bin-hadoop2.7 version, but it is not working and shows the following output.

Setting env var SPARK_SUBMIT_OPTS: -Dspark.storage.memoryFraction=0.66 -Dspark.serializer=org.apache.spark.serializer.JavaSerializer -Dspark.locality.wait=60000000 -Dsparkperf.commitSHA=unknown
Running command: /home/shuja/Desktop/data/spark-2.0.0-bin-hadoop2.7/bin/spark-submit --class spark.perf.TestRunner --master spark://shuja:7077 --driver-memory 1g /home/shuja/Desktop/data/spark-perf-master/spark-tests/target/spark-perf-tests-assembly.jar scheduling-throughput --num-trials=10 --inter-trial-wait=3 --num-tasks=10000 --num-jobs=1 --closure-size=0 --random-seed=5 1>> results/spark_perf_output__2016-08-03_11-22-03_logs/scheduling-throughput.out 2>> results/spark_perf_output__2016-08-03_11-22-03_logs/scheduling-throughput.err

Test did not produce expected results. Output was:

Java options: -Dspark.storage.memoryFraction=0.66 -Dspark.serializer=org.apache.spark.serializer.JavaSerializer -Dspark.locality.wait=60000000
Options: scheduling-throughput --num-trials=10 --inter-trial-wait=3 --num-tasks=10000 --num-jobs=1 --closure-size=0 --random-seed=5

Randomize order of tests; run multiple spaced trials of each configuration

When we observe small performance regressions, it would be helpful to have a way to automatically run multiple trials of each test in some random order; this would help us to gain confidence that the performance regression was actually caused by a change of settings or Spark versions rather than a transient performance issue in EC2.
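
A rough sketch of what such a harness loop could look like (the runner call is a placeholder):

    import random

    def randomized_schedule(tests, trials_per_test=3, seed=42):
        """Build a shuffled schedule in which each (test, config) pair appears several times."""
        schedule = [t for t in tests for _ in range(trials_per_test)]
        random.Random(seed).shuffle(schedule)
        return schedule

    tests = [("scheduling-throughput", {"num-tasks": 10000}),
             ("agg-by-key", {"num-partitions": 400})]
    for name, config in randomized_schedule(tests):
        print("would run %s with %s" % (name, config))  # placeholder for the real runner call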

Programmatically setting spark-perf configurations is difficult

The configuration of a given spark-perf run is defined by config.py, which is just a regular Python script.

In addition to defining "base" configurations (example), this script also handles generating derived configurations (example)--that is, configurations that are generated automatically from other base or derived configurations.

The combination of configuration-as-Python and derived configurations makes it difficult to programmatically set configurations for a spark-perf run.

What I'm doing now is using sed to set the base configurations I want in config.py. For example:

sed -i -r \
  -e '0,/^(USE_CLUSTER_SPARK = )(True)/s//\1False/' \
  -e '0,/^(DISK_WARMUP = )(False)/s//\1True/' \
  -e 's/\# (JavaOptionSet\("spark.executor.memory", )\["9g"\]\)/\1\["56g"\]\)/' \
  config/config.py

This is cumbersome. Is there a better way?

If not, does it make sense to split up config.py somehow into a set of base configs and another set of derived configs?

If they are split up like that, one could easily append full Python assignments to the end of the base config file without having to worry about the derived configs missing those new assignments.

Even better, the base configs could be converted to a different format like YAML, which would allow for direct programmatic manipulation of the configs. Simply load the YAML, manipulate it in Python, and then write it back out to YAML.
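
A minimal sketch of that workflow, assuming the base configs lived in a hypothetical config/base_config.yaml (this file does not exist in spark-perf today):

import yaml

# Load the hypothetical base config, tweak a few settings, and write it back.
with open("config/base_config.yaml") as f:
    conf = yaml.safe_load(f)

conf["USE_CLUSTER_SPARK"] = False
conf["DISK_WARMUP"] = True
conf["spark.executor.memory"] = "56g"

with open("config/base_config.yaml", "w") as f:
    yaml.safe_dump(conf, f, default_flow_style=False)

The derived configurations could then be computed in Python from the loaded dictionary, keeping programmatic edits and derivation cleanly separated.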

Results should have the same format across the Scala and Python APIs

The format of the results in the various .out files produced by spark-perf varies in small ways across the Scala and Python APIs. This makes it annoying to write code that analyses these results across the various language APIs.

In Scala:

  1. JSON results in the .out files are prefixed with results:.

  2. The "results" key inside the JSON object maps to a list of dictionaries with a "time" key like this:

    "results":[{"time":0.263},{"time":0.146},{"time":0.142},{"time":0.156},{"time":0.142},{"time":0.123},{"time":0.12},{"time":0.103},{"time":0.093},{"time":0.101}]
    
  3. The key is storage-location under options.

In Python:

  1. JSON results in the .out files are prefixed with jsonResults:. There is an additional results: line that has CSV results.

  2. The "results" key inside the JSON object maps to a flat list of test times like this:

    "results":[0.53191494941711426,0.52017116546630859,0.46260595321655273,0.46292901039123535,0.47744202613830566,0.44221711158752441,0.53686809539794922,0.44035696983337402,0.45516180992126465,0.42807292938232422]
    
  3. The key is storage_location under options (underscore vs. dash).
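
Until the formats are unified, here is a hedged sketch of one way to normalize the per-trial times when post-processing the .out files (based only on the differences listed above):

import json

def parse_results_line(line):
    """Return a flat list of per-trial times from a Scala or Python JSON results line."""
    for prefix in ("jsonResults: ", "results: "):
        if line.startswith(prefix):
            try:
                payload = json.loads(line[len(prefix):])
            except ValueError:
                return None  # e.g. the Python CSV `results:` line is not JSON
            results = payload.get("results", [])
            # Scala emits [{"time": t}, ...]; Python emits a flat list of floats.
            return [r["time"] if isinstance(r, dict) else r for r in results]
    return None

A similar normalization (replacing dashes with underscores, or vice versa) would handle the storage-location vs. storage_location key difference.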

scale_factor setting question

I'm confused about the SCALE_FACTOR setting in config.py.template.

# The default values configured below are appropriate for approximately 20 m1.xlarge nodes,
# in which each node has 15 GB of memory. Use this variable to scale the values (e.g.
# number of records in a generated dataset) if you are running the tests with more
# or fewer nodes. When developing new test suites, you might want to set this to a small
# value suitable for a single machine, such as 0.001.
SCALE_FACTOR = 1.0

SCALE_FACTOR = 1.0 is for 20 m1.xlarge nodes (15 GB memory each), so why 0.001 for a single machine?
What about c3.xlarge nodes (7.5 GB memory) or c3.2xlarge nodes (4 vCPU)?

Thanks!
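
For what it's worth, a back-of-the-envelope sketch, under the unconfirmed assumption that generated dataset sizes should scale roughly with total cluster memory relative to the documented 20 x m1.xlarge baseline (0.001 is just a deliberately tiny value for developing test suites on a single machine):

# Reference cluster from the config comment: 20 nodes x 15 GB = 300 GB.
baseline_total_mem_gb = 20 * 15.0
# Example: a 20-node c3.xlarge cluster with 7.5 GB per node.
my_total_mem_gb = 20 * 7.5
SCALE_FACTOR = my_total_mem_gb / baseline_total_mem_gb  # -> 0.5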

Complains 'spark-mllib_2.10;1.1.1-SNAPSHOT: not found' when compiling mllib-tests

[info] Resolving org.apache.spark#spark-mllib_2.10;1.1.1-SNAPSHOT ...
[warn] module not found: org.apache.spark#spark-mllib_2.10;1.1.1-SNAPSHOT
[warn] ==== local: tried
[warn] /root/.ivy2/local/org.apache.spark/spark-mllib_2.10/1.1.1-SNAPSHOT/ivys/ivy.xml
[warn] ==== public: tried
[warn] http://repo1.maven.org/maven2/org/apache/spark/spark-mllib_2.10/1.1.1-SNAPSHOT/spark-mllib_2.10-1.1.1-SNAPSHOT.pom
[info] Resolving org.fusesource.jansi#jansi;1.4 ...
[warn] ::::::::::::::::::::::::::::::::::::::::::::::
[warn] :: UNRESOLVED DEPENDENCIES ::
[warn] ::::::::::::::::::::::::::::::::::::::::::::::
[warn] :: org.apache.spark#spark-mllib_2.10;1.1.1-SNAPSHOT: not found
[warn] ::::::::::::::::::::::::::::::::::::::::::::::
[info] Resolving org.fusesource.jansi#jansi;1.4 ...
[info] Done updating.
sbt.ResolveException: unresolved dependency: org.apache.spark#spark-mllib_2.10;1.1.1-SNAPSHOT: not found
at sbt.IvyActions$.sbt$IvyActions$$resolve(IvyActions.scala:217)
at sbt.IvyActions$$anonfun$update$1.apply(IvyActions.scala:126)
at sbt.IvyActions$$anonfun$update$1.apply(IvyActions.scala:125)
at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:116)
at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:116)
at sbt.IvySbt$$anonfun$withIvy$1.apply(Ivy.scala:104)
at sbt.IvySbt.sbt$IvySbt$$action$1(Ivy.scala:51)
at sbt.IvySbt$$anon$3.call(Ivy.scala:60)
at xsbt.boot.Locks$GlobalLock.withChannel$1(Locks.scala:98)
at xsbt.boot.Locks$GlobalLock.xsbt$boot$Locks$GlobalLock$$withChannelRetries$1(Locks.scala:81)
at xsbt.boot.Locks$GlobalLock$$anonfun$withFileLock$1.apply(Locks.scala:102)
at xsbt.boot.Using$.withResource(Using.scala:11)
at xsbt.boot.Using$.apply(Using.scala:10)
at xsbt.boot.Locks$GlobalLock.ignoringDeadlockAvoided(Locks.scala:62)
at xsbt.boot.Locks$GlobalLock.withLock(Locks.scala:52)
at xsbt.boot.Locks$.apply0(Locks.scala:31)
at xsbt.boot.Locks$.apply(Locks.scala:28)
at sbt.IvySbt.withDefaultLogger(Ivy.scala:60)
at sbt.IvySbt.withIvy(Ivy.scala:101)
at sbt.IvySbt.withIvy(Ivy.scala:97)
at sbt.IvySbt$Module.withModule(Ivy.scala:116)
at sbt.IvyActions$.update(IvyActions.scala:125)
at sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1170)
at sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1168)
at sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$73.apply(Defaults.scala:1191)
at sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$73.apply(Defaults.scala:1189)
at sbt.Tracked$$anonfun$lastOutput$1.apply(Tracked.scala:35)
at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1193)
at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1188)
at sbt.Tracked$$anonfun$inputChanged$1.apply(Tracked.scala:45)
at sbt.Classpaths$.cachedUpdate(Defaults.scala:1196)
at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1161)
at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1139)
at scala.Function1$$anonfun$compose$1.apply(Function1.scala:47)
at sbt.$tilde$greater$$anonfun$$u2219$1.apply(TypeFunctions.scala:42)
at sbt.std.Transform$$anon$4.work(System.scala:64)
at sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:237)
at sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:237)
at sbt.ErrorHandling$.wideConvert(ErrorHandling.scala:18)
at sbt.Execute.work(Execute.scala:244)
at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:237)
at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:237)
at sbt.ConcurrentRestrictions$$anon$4$$anonfun$1.apply(ConcurrentRestrictions.scala:160)
at sbt.CompletionService$$anon$2.call(CompletionService.scala:30)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
[error] sbt.ResolveException: unresolved dependency: org.apache.spark#spark-mllib_2.10;1.1.1-SNAPSHOT: not found
[error] Total time: 10 s, completed Sep 25, 2014 11:50:27 AM

Use of DefaultCodec for HDFS DataGenerator is really bad

When DefaultCodec = ZLib, we end up serializing all data generation, because the native implementation of ZLib actually locks on a static class object during deserialization.

Changing to SnappyCodec enables actual parallelism.
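
The project's DataGenerator is Scala, but the point can be illustrated with a minimal PySpark sketch that writes generated records with an explicit Snappy codec instead of the default (the output path here is hypothetical):

from pyspark import SparkContext

sc = SparkContext(appName="snappy-datagen-sketch")
# Generate some toy records and write them compressed with Snappy, so the
# compression step can actually run in parallel across tasks.
records = sc.parallelize(["row-%d" % i for i in range(100000)], numSlices=16)
records.saveAsTextFile(
    "hdfs:///tmp/generated-data",
    compressionCodecClass="org.apache.hadoop.io.compress.SnappyCodec")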

Spark 2.0.0 support

I'm working on this and will submit a pull request once done. We face NoSuchMethodError problems as soon as you try to run anything other than scheduling-throughput.

The fix is to modify spark-tests/project/SparkTestsBuild.scala: use 2.0.0-preview for the org.apache.spark dependency version and Scala 2.11.8. Specifically, this resolves

NoSuchMethodError: org/apache/spark/SparkContext.rddToPairRDDFunctions(Lorg/apache/spark/rdd/RDD;Lscala/reflect/ClassTag;Lscala/reflect/ClassTag;Lscala/math/Ordering;)Lorg/apache/spark/rdd/PairRDDFunctions; at spark.perf.AggregateByKey.runTest(KVDataTest.scala:137) 

which is triggered by

class AggregateByKey(sc: SparkContext) extends KVDataTest(sc) {
  override def runTest(rdd: RDD[_], reduceTasks: Int) {
    // reduceByKey is only available through the rddToPairRDDFunctions implicit
    // conversion, which is what fails to link when the assembly is built against
    // Spark 1.x but run on a 2.0 cluster.
    rdd.asInstanceOf[RDD[(String, String)]]
      .map{case (k, v) => (k, v.toInt)}.reduceByKey(_ + _, reduceTasks).count()
  }
}

Same numbers are reported for all test runs in the top-level results summary file when performing parameter sweeps

@davies noticed an interesting bug in the test result summaries that we print to the test output file (not the *_logs/*.out files, but the text files in the /results directory):

The numbers reported are the same for all tests!

The problem is that the process_output function is called on the same *.out file for all tests run in a particular invocation of the ./bin/run script, and that function parses the first results: line it finds rather than the last one:

            for java_opt_list in itertools.product(*java_opt_set_arrays):
                for opt_list in itertools.product(*opt_set_arrays):
                    [...]
                    print("Running command: %s\n" % cmd)
                    Popen(cmd, shell=True, env=test_env).wait()
                    result_string = cls.process_output(config, short_name, opt_list,
                                                       stdout_filename, stderr_filename)

[...]

    def process_output(cls, config, short_name, opt_list, stdout_filename, stderr_filename):
        with open(stdout_filename, "r") as stdout_file:
            output = stdout_file.read()
        results_token = "results: "
        if results_token not in output:
            print("Test did not produce expected results. Output was:")
            print(output)
            sys.exit(1)
        result_line = filter(lambda x: results_token in x, output.split("\n"))[0]

This bug has been there for a long time; if we go back to 2013, it's still there in

result_line = filter(lambda x: results_token in x, output.split("\n"))[0]

Note that the correct numbers are logged in *.out; the bug only affects the automatic processing of them in the single combined results summary file.

This should be easy to fix: just index by -1 instead of 0.
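
In other words, the last line of the quoted process_output snippet would become (Python 2, where filter returns a list):

result_line = filter(lambda x: results_token in x, output.split("\n"))[-1]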
