spotify / spydra
Ephemeral Hadoop clusters using Google Compute Platform
License: Apache License 2.0
Cloud Dataproc now natively supports autoscaling. Dataproc's autoscaling seems to be a superset of the functionality in Spydra's autoscaler. If you're interested, I'd be happy to take a stab at moving Spydra to Dataproc's autoscaler and getting rid of the init action.
The one major difference is that the minimum cooldown period (scaling interval) in Dataproc is 10 minutes, while Spydra's README suggests 2 minutes. Are folks at Spotify using scaling intervals that short?
PoolingSubmitter looks at all clusters in a project. This should be segmented by a user-provided value, allowing multiple pools to coexist in a single project.
We create Kubernetes pods to run Spydra, which submits a job to Dataproc. Sometimes our pods are removed and we automatically recreate the pod (and Spydra), which submits the same job again. As a result, duplicate jobs end up running in Dataproc. Those jobs may take hours, which costs a lot.
I think we can add an optional mechanism to avoid this by labeling jobs: when creating a job, check whether any job with that label already has status DONE; if so, skip the submission and throw an exception.
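A minimal sketch of the pre-submit check described above. The `should_submit` helper and the `run-id` label name are hypothetical; in practice the list of completed jobs would come from something like `gcloud dataproc jobs list --filter='labels.run-id = <id> AND status.state = DONE'`, which is stubbed here with a plain string.

```shell
# Hypothetical dedup check: skip submission if a job carrying the same
# label has already completed. `done_jobs` stands in for the output of a
# `gcloud dataproc jobs list` call filtered on the label and DONE state.
should_submit() {
  local run_id="$1" done_jobs="$2"
  if grep -qx "$run_id" <<<"$done_jobs"; then
    echo "no"    # a job with this label already finished; don't resubmit
  else
    echo "yes"   # safe to submit
  fi
}

# Example: run-42 is not among the completed jobs, so submission proceeds.
should_submit "run-42" "run-41
run-40"
```

The check is advisory, not atomic: two pods racing to submit could still both pass it, so it reduces rather than eliminates duplicates.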
Hello.
I just started following the build process. I set up the authorization, created buckets, and ran the build with Maven, but it fails with an error.
Caused by: org.eclipse.aether.transfer.ArtifactTransferException: Could not transfer artifact com.spotify.data.spydra:spydra-parent:pom:0.3.4-20170726.094703-3 from/to ossrh (https://oss.sonatype.org/content/repositories/snapshots): Failed to transfer file: https://oss.sonatype.org/content/repositories/snapshots/com/spotify/data/spydra/spydra-parent/0.3.4-SNAPSHOT/spydra-parent-0.3.4-20170726.094703-3.pom. Return code is: 401, ReasonPhrase: Unauthorized.
at org.eclipse.aether.connector.basic.ArtifactTransportListener.transferFailed(ArtifactTransportListener.java:43)
at org.eclipse.aether.connector.basic.BasicRepositoryConnector$TaskRunner.run(BasicRepositoryConnector.java:355)
at org.eclipse.aether.connector.basic.BasicRepositoryConnector.put(BasicRepositoryConnector.java:274)
at org.eclipse.aether.internal.impl.DefaultDeployer.deploy(DefaultDeployer.java:311)
... 27 more
Caused by: org.apache.maven.wagon.TransferFailedException: Failed to transfer file: https://oss.sonatype.org/content/repositories/snapshots/com/spotify/data/spydra/spydra-parent/0.3.4-SNAPSHOT/spydra-parent-0.3.4-20170726.094703-3.pom. Return code is: 401, ReasonPhrase: Unauthorized.
at org.apache.maven.wagon.providers.http.AbstractHttpClientWagon.put(AbstractHttpClientWagon.java:627)
at org.apache.maven.wagon.providers.http.AbstractHttpClientWagon.put(AbstractHttpClientWagon.java:541)
at org.apache.maven.wagon.providers.http.AbstractHttpClientWagon.put(AbstractHttpClientWagon.java:523)
at org.apache.maven.wagon.providers.http.AbstractHttpClientWagon.put(AbstractHttpClientWagon.java:517)
at org.apache.maven.wagon.providers.http.AbstractHttpClientWagon.put(AbstractHttpClientWagon.java:497)
at org.eclipse.aether.transport.wagon.WagonTransporter$PutTaskRunner.run(WagonTransporter.java:644)
at org.eclipse.aether.transport.wagon.WagonTransporter.execute(WagonTransporter.java:427)
at org.eclipse.aether.transport.wagon.WagonTransporter.put(WagonTransporter.java:410)
at org.eclipse.aether.connector.basic.BasicRepositoryConnector$PutTaskRunner.runTask(BasicRepositoryConnector.java:510)
at org.eclipse.aether.connector.basic.BasicRepositoryConnector$TaskRunner.run(BasicRepositoryConnector.java:350)
... 29 more
[ERROR]
[ERROR]
How can I figure out what caused the error from this log?
Thanks.
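A hedged reading of the log: the 401 Unauthorized happens while uploading a snapshot POM to oss.sonatype.org, which suggests the `deploy` phase is running. Deploying needs Sonatype credentials that only project maintainers have; a local build should stop before that phase. The `needs_upload` helper below is just an illustration of which goal triggers the remote upload.

```shell
# Building locally, without touching the remote repository:
#   mvn clean package    # compile and package only
#   mvn clean install    # also install into the local ~/.m2 repo
#
# Illustrative helper: only the `deploy` goal uploads artifacts, and
# therefore only it can produce the 401 from ossrh.
needs_upload() {
  case "$1" in
    deploy) echo "yes" ;;   # uploads to the remote snapshot repo
    *)      echo "no"  ;;   # package/install/verify stay local
  esac
}

needs_upload package
```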
Currently Spydra requires a service account key as a json file for authenticating with GCP. It would be convenient to instead rely on the Application Default Credentials strategy implemented in Google's libs and command line clients (https://cloud.google.com/docs/authentication/production).
This would allow users to continue using the service account key file like today. However, when running Spydra itself on GCP VMs, they can instead use the default credentials supplied by the metadata service.
I believe this would be good both for simplicity and security, as one would no longer have to distribute key files.
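A sketch of the Application Default Credentials lookup order that Google's libraries implement, to show why key files become optional. The `adc_source` function and its output strings are illustrative only; the real resolution happens inside the Google auth libraries.

```shell
# ADC resolution order (sketch): explicit key file, then gcloud user
# credentials, then the GCE metadata server.
adc_source() {
  if [[ -n "${GOOGLE_APPLICATION_CREDENTIALS:-}" ]]; then
    echo "key-file"          # today's Spydra behaviour, still supported
  elif [[ -f "$HOME/.config/gcloud/application_default_credentials.json" ]]; then
    echo "gcloud"            # from `gcloud auth application-default login`
  else
    echo "metadata-server"   # on GCP VMs, no key file to distribute
  fi
}

adc_source
```

On a GCE VM the final branch applies, which is exactly the "no key files to distribute" benefit described above.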
I want to submit PySpark jobs using Spydra. Either the documentation is lacking here, or the current implementation cannot handle this job type (I'd say the latter, after skimming the code that generates the gcloud dataproc jobs submit pyspark command).
I get the following error message:
ERROR: (gcloud.dataproc.jobs.submit.pyspark) argument PY_FILE is required
Usage: gcloud dataproc jobs submit pyspark PY_FILE --cluster=CLUSTER [optional flags] [-- JOB_ARGS ...]
The problem is that I'm unable to insert the filename into the command line using the options and job_args fields.
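Per the usage message above, `PY_FILE` is a positional argument that must come before the flags. The helper below just assembles the command string to show the required ordering; the bucket and cluster names are placeholders, not values from the original report.

```shell
# Sketch: build a `gcloud dataproc jobs submit pyspark` invocation with
# PY_FILE as the first positional argument, followed by flags, with
# job arguments after the `--` separator.
build_submit_cmd() {
  local py_file="$1" cluster="$2"; shift 2
  echo "gcloud dataproc jobs submit pyspark $py_file --cluster=$cluster -- $*"
}

build_submit_cmd gs://my-bucket/job.py my-cluster arg1 arg2
```

So whatever Spydra generates would need to emit the script path in that positional slot rather than as an option or a job argument.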
The spydra/init-actions/self-destruct.sh script installs the self-destruct cron job on both the master and worker 0. However, only the master receives the heartbeat updates, so worker 0 will always kill the cluster once the collector timeout is reached.
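One way to fix this would be to guard the cron installation on the node's role. Dataproc init actions can read the role from the instance metadata (`/usr/share/google/get_metadata_value attributes/dataproc-role`); the `install_if_master` helper below is a hypothetical sketch of that guard, taking the role as a parameter so the logic is visible.

```shell
# Hypothetical guard for self-destruct.sh: only the master should get the
# self-destruct cron job, since only it receives heartbeat updates.
# On a real node the role would come from:
#   ROLE="$(/usr/share/google/get_metadata_value attributes/dataproc-role)"
install_if_master() {
  local role="$1"
  if [[ "$role" == "Master" ]]; then
    echo "installing self-destruct cron"
  else
    echo "skipping: only the master receives heartbeats"
  fi
}

install_if_master "Master"
```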
When running the embedded JobHistoryServer, it never seems to refresh the jobs from GCS. Jobs logged to GCS after it has started don't appear until it is stopped and started again.