spotify / spydra

Ephemeral Hadoop clusters using Google Cloud Platform

License: Apache License 2.0

Languages: Java 95.66%, Python 2.32%, Shell 1.01%, Makefile 0.44%, Dockerfile 0.57%
Topics: hadoop, dataproc, google-cloud

spydra's People

Contributors

cy6erbr4in, freben, jstck, kant, karth295, kiarash-rezahanjani, krisss85, lndbrg, medb, ochienggot, perploug, psobot, rustedbones, sisidra, varjoranta, xafilox, xeago


spydra's Issues

Autoscaler should use Dataproc autoscaling

Cloud Dataproc now natively supports autoscaling. Dataproc's autoscaling seems to be a superset of the functionality in Spydra's autoscaler. If you're interested, I'd be happy to take a stab at moving Spydra to Dataproc's autoscaler and getting rid of the init action.

The one major difference is that the minimum cooldown period (scaling interval) in Dataproc is 10 minutes, while Spydra's README suggests 2 minutes. Are folks at Spotify using scaling intervals that short?
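
For illustration, here is a minimal sketch (using the Dataproc v1 Java client, not Spydra's current code) of what an equivalent autoscaling policy could look like. The policy id and instance bounds are hypothetical; the 600-second cooldown reflects the 10-minute minimum mentioned above.

import com.google.cloud.dataproc.v1.AutoscalingPolicy;
import com.google.cloud.dataproc.v1.BasicAutoscalingAlgorithm;
import com.google.cloud.dataproc.v1.BasicYarnAutoscalingConfig;
import com.google.cloud.dataproc.v1.InstanceGroupAutoscalingPolicyConfig;
import com.google.protobuf.Duration;

public class AutoscalingPolicySketch {

  // Builds an example policy; it would be created with AutoscalingPolicyServiceClient
  // and referenced from the cluster's AutoscalingConfig.
  static AutoscalingPolicy examplePolicy() {
    BasicYarnAutoscalingConfig yarnConfig = BasicYarnAutoscalingConfig.newBuilder()
        .setScaleUpFactor(1.0)
        .setScaleDownFactor(1.0)
        .setGracefulDecommissionTimeout(Duration.newBuilder().setSeconds(3600).build())
        .build();
    return AutoscalingPolicy.newBuilder()
        .setId("spydra-example-policy") // hypothetical policy id
        .setBasicAlgorithm(BasicAutoscalingAlgorithm.newBuilder()
            .setYarnConfig(yarnConfig)
            // 600s = the 10-minute minimum cooldown discussed above.
            .setCooldownPeriod(Duration.newBuilder().setSeconds(600).build()))
        .setWorkerConfig(InstanceGroupAutoscalingPolicyConfig.newBuilder()
            .setMinInstances(2)    // illustrative bounds
            .setMaxInstances(100))
        .build();
  }
}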

Create an optional mechanism to avoid duplicate jobs

We create Kubernetes pods that run Spydra, and Spydra submits a job to Dataproc. Sometimes a pod is removed and we automatically recreate it, and the new pod submits the same job again. This leaves duplicate jobs running in Dataproc; those jobs can take hours, which costs a lot.

I think we could add an optional mechanism to avoid this situation by labeling jobs: when we are about to submit a job, check whether any job with that label already has status DONE, and if so, skip the submission and throw an exception instead. A sketch of such a check is below.
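
For illustration, a minimal sketch of such a check, assuming the Dataproc v1 Java client and a hypothetical label key spydra-job-id (class and label names are made up, not Spydra's current API):

import com.google.cloud.dataproc.v1.Job;
import com.google.cloud.dataproc.v1.JobControllerClient;
import com.google.cloud.dataproc.v1.JobControllerSettings;
import com.google.cloud.dataproc.v1.JobStatus;
import com.google.cloud.dataproc.v1.ListJobsRequest;

public class DuplicateJobGuard {

  // Throws if a Dataproc job carrying the given label value has already finished successfully.
  static void failIfAlreadyDone(String projectId, String region, String jobLabel) throws Exception {
    JobControllerSettings settings = JobControllerSettings.newBuilder()
        .setEndpoint(region + "-dataproc.googleapis.com:443")
        .build();
    try (JobControllerClient client = JobControllerClient.create(settings)) {
      ListJobsRequest request = ListJobsRequest.newBuilder()
          .setProjectId(projectId)
          .setRegion(region)
          // Server-side filter on the hypothetical label; the label would be set at submission time.
          .setFilter("labels.spydra-job-id = " + jobLabel)
          .build();
      for (Job job : client.listJobs(request).iterateAll()) {
        if (job.getStatus().getState() == JobStatus.State.DONE) {
          throw new IllegalStateException("A job labeled spydra-job-id=" + jobLabel
              + " already completed: " + job.getReference().getJobId());
        }
      }
    }
  }
}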

Unauthorized issue

Hello.
I just started following the build process. I set up authorization, created the buckets, and ran the build with Maven, but it fails with the error below.

Caused by: org.eclipse.aether.transfer.ArtifactTransferException: Could not transfer artifact com.spotify.data.spydra:spydra-parent:pom:0.3.4-20170726.094703-3 from/to ossrh (https://oss.sonatype.org/content/repositories/snapshots): Failed to transfer file: https://oss.sonatype.org/content/repositories/snapshots/com/spotify/data/spydra/spydra-parent/0.3.4-SNAPSHOT/spydra-parent-0.3.4-20170726.094703-3.pom. Return code is: 401, ReasonPhrase: Unauthorized.
	at org.eclipse.aether.connector.basic.ArtifactTransportListener.transferFailed(ArtifactTransportListener.java:43)
	at org.eclipse.aether.connector.basic.BasicRepositoryConnector$TaskRunner.run(BasicRepositoryConnector.java:355)
	at org.eclipse.aether.connector.basic.BasicRepositoryConnector.put(BasicRepositoryConnector.java:274)
	at org.eclipse.aether.internal.impl.DefaultDeployer.deploy(DefaultDeployer.java:311)
	... 27 more
Caused by: org.apache.maven.wagon.TransferFailedException: Failed to transfer file: https://oss.sonatype.org/content/repositories/snapshots/com/spotify/data/spydra/spydra-parent/0.3.4-SNAPSHOT/spydra-parent-0.3.4-20170726.094703-3.pom. Return code is: 401, ReasonPhrase: Unauthorized.
	at org.apache.maven.wagon.providers.http.AbstractHttpClientWagon.put(AbstractHttpClientWagon.java:627)
	at org.apache.maven.wagon.providers.http.AbstractHttpClientWagon.put(AbstractHttpClientWagon.java:541)
	at org.apache.maven.wagon.providers.http.AbstractHttpClientWagon.put(AbstractHttpClientWagon.java:523)
	at org.apache.maven.wagon.providers.http.AbstractHttpClientWagon.put(AbstractHttpClientWagon.java:517)
	at org.apache.maven.wagon.providers.http.AbstractHttpClientWagon.put(AbstractHttpClientWagon.java:497)
	at org.eclipse.aether.transport.wagon.WagonTransporter$PutTaskRunner.run(WagonTransporter.java:644)
	at org.eclipse.aether.transport.wagon.WagonTransporter.execute(WagonTransporter.java:427)
	at org.eclipse.aether.transport.wagon.WagonTransporter.put(WagonTransporter.java:410)
	at org.eclipse.aether.connector.basic.BasicRepositoryConnector$PutTaskRunner.runTask(BasicRepositoryConnector.java:510)
	at org.eclipse.aether.connector.basic.BasicRepositoryConnector$TaskRunner.run(BasicRepositoryConnector.java:350)
	... 29 more
[ERROR] 
[ERROR] 

How can I figure out from this log what caused the error?
Thanks.

Use application default credentials

Currently Spydra requires a service account key as a JSON file for authenticating with GCP. It would be convenient to instead rely on the Application Default Credentials strategy implemented in Google's libraries and command-line clients (https://cloud.google.com/docs/authentication/production).

This would allow users to continue using a service account key file as they do today. However, when Spydra itself runs on GCP VMs, it could instead use the default credentials supplied by the metadata service.

I believe this would be good both for simplicity and security, as one would no longer have to distribute key files.
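
For illustration, a minimal sketch of resolving credentials this way (assuming the com.google.auth library, which Google's client libraries accept; the fallback shape is hypothetical, not Spydra's current code):

import com.google.auth.oauth2.GoogleCredentials;
import java.io.FileInputStream;
import java.util.List;

public class CredentialsResolver {

  // Prefer Application Default Credentials (env var, gcloud config, or the VM
  // metadata service); fall back to an explicit key file only if one is given.
  static GoogleCredentials resolve(String optionalKeyFilePath) throws Exception {
    GoogleCredentials credentials = optionalKeyFilePath == null
        ? GoogleCredentials.getApplicationDefault()
        : GoogleCredentials.fromStream(new FileInputStream(optionalKeyFilePath));
    return credentials.createScoped(List.of("https://www.googleapis.com/auth/cloud-platform"));
  }
}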

Submission of pyspark jobs

I want to submit pyspark jobs using Spydra. Either the documentation is lacking on this point, or the current implementation cannot handle this job type (I would say the latter, judging from a skim of the code that generates the gcloud dataproc jobs submit pyspark command).

I get the following error message:

ERROR: (gcloud.dataproc.jobs.submit.pyspark) argument PY_FILE is required
Usage: gcloud dataproc jobs submit pyspark PY_FILE --cluster=CLUSTER [optional flags] [-- JOB_ARGS ...]

The problem is that neither the options nor the job_args let me insert the main Python file name into the command line.
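
For what it's worth, the underlying Dataproc API models this as a PySparkJob whose main Python file is a dedicated field, separate from the regular args. A minimal sketch with the v1 Java client (bucket, file and cluster names are made up) of the information a pyspark submission would need to carry:

import com.google.cloud.dataproc.v1.Job;
import com.google.cloud.dataproc.v1.JobPlacement;
import com.google.cloud.dataproc.v1.PySparkJob;

public class PySparkJobSketch {

  // The main Python file lives in its own field, which is the piece that
  // Spydra's options/job_args currently cannot express.
  static Job examplePySparkJob() {
    PySparkJob pyspark = PySparkJob.newBuilder()
        .setMainPythonFileUri("gs://my-bucket/jobs/main.py")  // hypothetical GCS path
        .addPythonFileUris("gs://my-bucket/jobs/helpers.py")  // extra dependency, like --py-files
        .addArgs("--date=2024-01-01")                         // regular job argument
        .build();
    return Job.newBuilder()
        .setPlacement(JobPlacement.newBuilder().setClusterName("my-cluster"))
        .setPysparkJob(pyspark)
        .build();
  }
}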

Cluster always self-destructs after 30 minutes

The spydra/init-actions/self-destruct.sh script installs the self-destruct cron job on both the master and the 0th worker node. However, only the master receives heartbeat updates, so the 0th worker always kills the cluster once the collector timeout is reached.

New logs don't appear in HistoryServer

When running the embedded JobHistoryServer, it never seems to refresh jobs from GCS. Jobs that are logged to GCS after it has started do not appear until it is stopped and started again.
