spotify / spydra
Ephemeral Hadoop clusters using Google Compute Platform
License: Apache License 2.0
Cloud Dataproc now natively supports autoscaling. Dataproc's autoscaling seems to be a superset of the functionality in Spydra's autoscaler. If you're interested, I'd be happy to take a stab at moving Spydra to Dataproc's autoscaler and getting rid of the init action.
The one major difference is that the minimum cooldown period (scaling interval) in Dataproc is 10 minutes, while Spydra's README suggests 2 minutes. Are folks at Spotify using scaling intervals that short?
PoolingSubmitter looks at all clusters in a project. This should be segmented by a user-provided value, allowing multiple pools to coexist in a single project.
We create Kubernetes pods to run Spydra, which submits a job to Dataproc. Sometimes our pods are removed and we automatically recreate the pod (and Spydra), which submits the same job again. As a result, duplicate jobs end up running in Dataproc. Those jobs may take hours, which costs a lot.
I think we can add an optional mechanism to avoid this by labeling jobs: when creating a job, check whether any job with that label already has status DONE; if so, skip the submission and throw an exception.
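A minimal sketch of the pre-submit check described above. The `should_submit` helper and the `run-id` label name are hypothetical; in practice the list of completed jobs would come from something like `gcloud dataproc jobs list --filter='labels.run-id = <id> AND status.state = DONE'`, which is stubbed here with a plain string.

```shell
# Hypothetical dedup check: skip submission if a job carrying the same
# label has already completed. `done_jobs` stands in for the output of a
# `gcloud dataproc jobs list` call filtered on the label and DONE state.
should_submit() {
  local run_id="$1" done_jobs="$2"
  if grep -qx "$run_id" <<<"$done_jobs"; then
    echo "no"    # a job with this label already finished; don't resubmit
  else
    echo "yes"   # safe to submit
  fi
}

# Example: run-42 is not among the completed jobs, so submission proceeds.
should_submit "run-42" "run-41
run-40"
```

The check is advisory, not atomic: two pods racing to submit could still both pass it, so it reduces rather than eliminates duplicates.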
Hello.
I just started following the build process. I set up the authorization, created buckets, and ran the build with Maven, but it fails with an error.
Caused by: org.eclipse.aether.transfer.ArtifactTransferException: Could not transfer artifact com.spotify.data.spydra:spydra-parent:pom:0.3.4-20170726.094703-3 from/to ossrh (https://oss.sonatype.org/content/repositories/snapshots): Failed to transfer file: https://oss.sonatype.org/content/repositories/snapshots/com/spotify/data/spydra/spydra-parent/0.3.4-SNAPSHOT/spydra-parent-0.3.4-20170726.094703-3.pom. Return code is: 401, ReasonPhrase: Unauthorized.
at org.eclipse.aether.connector.basic.ArtifactTransportListener.transferFailed(ArtifactTransportListener.java:43)
at org.eclipse.aether.connector.basic.BasicRepositoryConnector$TaskRunner.run(BasicRepositoryConnector.java:355)
at org.eclipse.aether.connector.basic.BasicRepositoryConnector.put(BasicRepositoryConnector.java:274)
at org.eclipse.aether.internal.impl.DefaultDeployer.deploy(DefaultDeployer.java:311)
... 27 more
Caused by: org.apache.maven.wagon.TransferFailedException: Failed to transfer file: https://oss.sonatype.org/content/repositories/snapshots/com/spotify/data/spydra/spydra-parent/0.3.4-SNAPSHOT/spydra-parent-0.3.4-20170726.094703-3.pom. Return code is: 401, ReasonPhrase: Unauthorized.
at org.apache.maven.wagon.providers.http.AbstractHttpClientWagon.put(AbstractHttpClientWagon.java:627)
at org.apache.maven.wagon.providers.http.AbstractHttpClientWagon.put(AbstractHttpClientWagon.java:541)
at org.apache.maven.wagon.providers.http.AbstractHttpClientWagon.put(AbstractHttpClientWagon.java:523)
at org.apache.maven.wagon.providers.http.AbstractHttpClientWagon.put(AbstractHttpClientWagon.java:517)
at org.apache.maven.wagon.providers.http.AbstractHttpClientWagon.put(AbstractHttpClientWagon.java:497)
at org.eclipse.aether.transport.wagon.WagonTransporter$PutTaskRunner.run(WagonTransporter.java:644)
at org.eclipse.aether.transport.wagon.WagonTransporter.execute(WagonTransporter.java:427)
at org.eclipse.aether.transport.wagon.WagonTransporter.put(WagonTransporter.java:410)
at org.eclipse.aether.connector.basic.BasicRepositoryConnector$PutTaskRunner.runTask(BasicRepositoryConnector.java:510)
at org.eclipse.aether.connector.basic.BasicRepositoryConnector$TaskRunner.run(BasicRepositoryConnector.java:350)
... 29 more
[ERROR]
[ERROR]
How can I figure out what caused the error from this log?
Thanks.
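A hedged reading of the log: the 401 Unauthorized happens while uploading a snapshot POM to oss.sonatype.org, which suggests the `deploy` phase is running. Deploying needs Sonatype credentials that only project maintainers have; a local build should stop before that phase. The `needs_upload` helper below is just an illustration of which goal triggers the remote upload.

```shell
# Building locally, without touching the remote repository:
#   mvn clean package    # compile and package only
#   mvn clean install    # also install into the local ~/.m2 repo
#
# Illustrative helper: only the `deploy` goal uploads artifacts, and
# therefore only it can produce the 401 from ossrh.
needs_upload() {
  case "$1" in
    deploy) echo "yes" ;;   # uploads to the remote snapshot repo
    *)      echo "no"  ;;   # package/install/verify stay local
  esac
}

needs_upload package
```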
Currently Spydra requires a service account key as a json file for authenticating with GCP. It would be convenient to instead rely on the Application Default Credentials strategy implemented in Google's libs and command line clients (https://cloud.google.com/docs/authentication/production).
This would allow users to continue using the service account key file like today. However, when running Spydra itself on GCP VMs, they can instead use the default credentials supplied by the metadata service.
I believe this would be good both for simplicity and security, as one would no longer have to distribute key files.
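A sketch of the Application Default Credentials lookup order that Google's libraries implement, to show why key files become optional. The `adc_source` function and its output strings are illustrative only; the real resolution happens inside the Google auth libraries.

```shell
# ADC resolution order (sketch): explicit key file, then gcloud user
# credentials, then the GCE metadata server.
adc_source() {
  if [[ -n "${GOOGLE_APPLICATION_CREDENTIALS:-}" ]]; then
    echo "key-file"          # today's Spydra behaviour, still supported
  elif [[ -f "$HOME/.config/gcloud/application_default_credentials.json" ]]; then
    echo "gcloud"            # from `gcloud auth application-default login`
  else
    echo "metadata-server"   # on GCP VMs, no key file to distribute
  fi
}

adc_source
```

On a GCE VM the final branch applies, which is exactly the "no key files to distribute" benefit described above.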
I want to submit PySpark jobs using Spydra. Either the documentation is lacking here, or the current implementation cannot handle this job type (I'd say the latter, after skimming the code that generates the gcloud dataproc jobs submit pyspark command).
I get the following error message:
ERROR: (gcloud.dataproc.jobs.submit.pyspark) argument PY_FILE is required
Usage: gcloud dataproc jobs submit pyspark PY_FILE --cluster=CLUSTER [optional flags] [-- JOB_ARGS ...]
The problem is that I'm unable to insert the filename into the command line using the options and job_args fields.
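Per the usage message above, `PY_FILE` is a positional argument that must come before the flags. The helper below just assembles the command string to show the required ordering; the bucket and cluster names are placeholders, not values from the original report.

```shell
# Sketch: build a `gcloud dataproc jobs submit pyspark` invocation with
# PY_FILE as the first positional argument, followed by flags, with
# job arguments after the `--` separator.
build_submit_cmd() {
  local py_file="$1" cluster="$2"; shift 2
  echo "gcloud dataproc jobs submit pyspark $py_file --cluster=$cluster -- $*"
}

build_submit_cmd gs://my-bucket/job.py my-cluster arg1 arg2
```

So whatever Spydra generates would need to emit the script path in that positional slot rather than as an option or a job argument.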
The spydra/init-actions/self-destruct.sh script installs the self-destruct cron job on both the master and worker 0. However, only the master receives the heartbeat updates, so worker 0 will always kill the cluster once the collector timeout is reached.
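One way to fix this would be to guard the cron installation on the node's role. Dataproc init actions can read the role from the instance metadata (`/usr/share/google/get_metadata_value attributes/dataproc-role`); the `install_if_master` helper below is a hypothetical sketch of that guard, taking the role as a parameter so the logic is visible.

```shell
# Hypothetical guard for self-destruct.sh: only the master should get the
# self-destruct cron job, since only it receives heartbeat updates.
# On a real node the role would come from:
#   ROLE="$(/usr/share/google/get_metadata_value attributes/dataproc-role)"
install_if_master() {
  local role="$1"
  if [[ "$role" == "Master" ]]; then
    echo "installing self-destruct cron"
  else
    echo "skipping: only the master receives heartbeats"
  fi
}

install_if_master "Master"
```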
When running the embedded JobHistoryServer, it never seems to refresh the jobs from GCS. Jobs logged to GCS after it has started don't appear until it is stopped and started again.