dataproc-java-dependencies's Introduction

This repository contains a simple demo Spark application that translates words using Google's Translation API and runs on Cloud Dataproc.

  1. Record the project ID in an environment variable for later use:

    export PROJECT=$(gcloud info --format='value(config.project)')
    
  2. Enable the translate and dataproc APIs:

    gcloud services enable translate.googleapis.com dataproc.googleapis.com
    
  3. Compile the JAR (this may take a few minutes):

  • Option 1: with Maven
    cd maven
    mvn package
    
  • Option 2: with SBT
    cd sbt
    sbt assembly
    mv target/scala-2.11/translate-example-assembly-1.0.jar target/translate-example-1.0.jar
    
  4. Create a bucket:

    gsutil mb gs://$PROJECT-bucket
    
  5. Upload words.txt to the bucket:

    gsutil cp ../words.txt gs://$PROJECT-bucket
    

    The file words.txt contains the following:

    cat
    dog
    fish
    
  6. Create a Cloud Dataproc cluster:

    gcloud dataproc clusters create demo-cluster \
    --zone=us-central1-a \
    --scopes=cloud-platform \
    --image-version=1.3
    
  7. Submit the Spark job to translate the words to French:

    gcloud dataproc jobs submit spark \
    --cluster demo-cluster \
    --jar target/translate-example-1.0.jar \
    -- fr gs://$PROJECT-bucket words.txt translated-fr
    
  8. Verify that the words have been translated:

    gsutil cat gs://$PROJECT-bucket/translated-fr/part-*
    

    The output is:

    chat
    chien
    poisson
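
The arguments after `--` in the submit command above appear to be, in order, the target language, the bucket URI, the input file name, and the output directory name. The sketch below prints the input and output locations those arguments imply, which can help catch a wrong path before submitting. The function name `job_paths` is hypothetical, and how the application joins these arguments internally is an assumption inferred from the upload and verification commands above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the input and output locations implied by
# the job arguments <bucket-uri> <input-file> <output-dir>.
# Assumption: the application reads <bucket-uri>/<input-file> and writes
# to <bucket-uri>/<output-dir>, as the gsutil commands above suggest.
job_paths() {
  local bucket="$1" input="$2" output="$3"
  # An unset PROJECT would expand to the bucket "gs://-bucket"; reject it early.
  if [ -z "$bucket" ] || [ "$bucket" = "gs://-bucket" ]; then
    echo "error: bucket URI is empty or PROJECT is unset" >&2
    return 1
  fi
  # Strip a trailing slash from the bucket URI before joining.
  echo "input:  ${bucket%/}/${input}"
  echo "output: ${bucket%/}/${output}"
}
```

For example, `job_paths "gs://$PROJECT-bucket" words.txt translated-fr` prints the two locations that the upload and verification steps refer to.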
    

dataproc-java-dependencies's People

Contributors

jphalip

dataproc-java-dependencies's Issues

404 error

I am getting the following error when the job attempts to read the text file in the bucket:

Exception in thread "main" com.google.api.client.http.HttpResponseException: 404 Not Found
Not Found
	at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1070)
	at com.google.api.client.googleapis.batch.BatchRequest.execute(BatchRequest.java:241)
	at com.google.cloud.hadoop.gcsio.BatchHelper.flushIfPossible(BatchHelper.java:121)
	at com.google.cloud.hadoop.gcsio.BatchHelper.flush(BatchHelper.java:134)
	at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.getItemInfos(GoogleCloudStorageImpl.java:1639)
	at com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.getFileInfosRaw(GoogleCloudStorageFileSystem.java:1303)
	at com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.listFileInfo(GoogleCloudStorageFileSystem.java:1082)
	at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.listStatus(GoogleHadoopFileSystemBase.java:1356)
	at org.apache.spark.sql.execution.datasources.InMemoryFileIndex$.org$apache$spark$sql$execution$datasources$InMemoryFileIndex$$listLeafFiles(InMemoryFileIndex.scala:261)
	at org.apache.spark.sql.execution.datasources.InMemoryFileIndex$$anonfun$org$apache$spark$sql$execution$datasources$InMemoryFileIndex$$bulkListLeafFiles$1.apply(InMemoryFileIndex.scala:172)
	at org.apache.spark.sql.execution.datasources.InMemoryFileIndex$$anonfun$org$apache$spark$sql$execution$datasources$InMemoryFileIndex$$bulkListLeafFiles$1.apply(InMemoryFileIndex.scala:171)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
	at scala.collection.AbstractTraversable.map(Traversable.scala:104)
	at org.apache.spark.sql.execution.datasources.InMemoryFileIndex$.org$apache$spark$sql$execution$datasources$InMemoryFileIndex$$bulkListLeafFiles(InMemoryFileIndex.scala:171)
	at org.apache.spark.sql.execution.datasources.InMemoryFileIndex.listLeafFiles(InMemoryFileIndex.scala:124)
	at org.apache.spark.sql.execution.datasources.InMemoryFileIndex.refresh0(InMemoryFileIndex.scala:90)
	at org.apache.spark.sql.execution.datasources.InMemoryFileIndex.<init>(InMemoryFileIndex.scala:66)
	at org.apache.spark.sql.execution.datasources.DataSource.tempFileIndex$lzycompute$1(DataSource.scala:128)
	at org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$tempFileIndex$1(DataSource.scala:119)
	at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:133)
	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:366)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
	at org.apache.spark.sql.DataFrameReader.text(DataFrameReader.scala:623)
	at org.apache.spark.sql.DataFrameReader.textFile(DataFrameReader.scala:657)
	at org.apache.spark.sql.DataFrameReader.textFile(DataFrameReader.scala:632)
	at demo.TranslateExample$.main(TranslateExample.scala:40)
	at demo.TranslateExample.main(TranslateExample.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:755)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
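
The trace shows the exception surfacing from `DataFrameReader.textFile` while the GCS connector lists the input path, so one thing worth ruling out is that the input object is missing at the path the job was given. The sketch below is a diagnostic aid under that assumption, not a confirmed fix for this issue. The function name `check_input` is hypothetical; it relies on `gsutil -q stat`, which signals whether an object exists through its exit status.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report whether the job's input object exists.
# Usage: check_input <bucket-uri> <input-file>
check_input() {
  # Join bucket URI and file name, dropping any trailing slash on the bucket.
  local uri="${1%/}/$2"
  # `gsutil -q stat` prints nothing and exits non-zero if the object is absent.
  if gsutil -q stat "$uri"; then
    echo "found: $uri"
  else
    echo "missing: $uri" >&2
    return 1
  fi
}
```

Running `check_input "gs://$PROJECT-bucket" words.txt` before submitting the job indicates whether the upload step succeeded and whether `PROJECT` expands to the expected bucket name. If the object is reported as found and the error persists, the cause lies elsewhere.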
