
spark-bigquery's Introduction

MAINTENANCE MODE

THIS PROJECT IS IN MAINTENANCE MODE BECAUSE IT IS NOT WIDELY USED WITHIN SPOTIFY. WE WILL PROVIDE BEST-EFFORT SUPPORT FOR ISSUES AND PULL REQUESTS, BUT EXPECT DELAYS IN RESPONSES.

spark-bigquery

Build Status GitHub license Maven Central

Google BigQuery support for Spark, SQL, and DataFrames.

spark-bigquery version   Spark version   Comment
0.2.x                    2.x.y           Active development
0.1.x                    1.x.y           Development halted

To use the package in a Google Cloud Dataproc cluster:

Install org.apache.avro_avro-ipc-1.7.7.jar into ~/.ivy2/jars, then launch:

spark-shell --packages com.spotify:spark-bigquery_2.10:0.2.2

To use it in a local SBT console:

import com.spotify.spark.bigquery._

// Set up GCP credentials
sqlContext.setGcpJsonKeyFile("<JSON_KEY_FILE>")

// Set up BigQuery project and bucket
sqlContext.setBigQueryProjectId("<BILLING_PROJECT>")
sqlContext.setBigQueryGcsBucket("<GCS_BUCKET>")

// Set up BigQuery dataset location, default is US
sqlContext.setBigQueryDatasetLocation("<DATASET_LOCATION>")

Usage:

// Load everything from a table
val table = sqlContext.bigQueryTable("bigquery-public-data:samples.shakespeare")

// Load results from a SQL query
// Only legacy SQL dialect is supported for now
val df = sqlContext.bigQuerySelect(
  "SELECT word, word_count FROM [bigquery-public-data:samples.shakespeare]")

// Save data to a table
df.saveAsBigQueryTable("my-project:my_dataset.my_table")
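A write disposition can also be passed when saving. The following is a minimal sketch based on the call shown in one of the issues further down; WriteDisposition.WRITE_TRUNCATE replaces the destination table's contents instead of appending:

// Overwrite the destination table instead of appending (sketch)
df.saveAsBigQueryTable("my-project:my_dataset.my_table", WriteDisposition.WRITE_TRUNCATE)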

If you'd like to write nested records to BigQuery, be sure to specify an Avro Namespace. BigQuery is unable to load Avro Namespaces with a leading dot (.nestedColumn) on nested records.

// BigQuery is able to load fields with namespace 'myNamespace.nestedColumn'
df.saveAsBigQueryTable("my-project:my_dataset.my_table", tmpWriteOptions = Map("recordNamespace" -> "myNamespace"))

See also Loading Avro Data from Google Cloud Storage for data type mappings and limitations. For example, loading arrays of arrays is not supported.

License

Copyright 2016 Spotify AB.

Licensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0

spark-bigquery's People

Contributors

draa, edwardcapriolo, jobegrabber, karlhigley, lambiase, martinstuder, mrksmb, nevillelyh, pcejrowski, richwhitjr, samelamin, yu-iskw


spark-bigquery's Issues

Read table rows directly for small tables

Right now we read tables via AvroBigQueryInputFormat, which exports the table to Avro files on GCS first.
We can probably fetch rows directly from BQ using the BQ client for small tables.
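A rough sketch of the idea, assuming the Google API Java client for BigQuery that the connector already depends on (the helper below is hypothetical and only illustrates the call path):

import com.google.api.services.bigquery.Bigquery
import com.google.api.services.bigquery.model.TableRow

// Hypothetical helper: list rows of a small table straight from the BigQuery API
// instead of exporting the table to Avro files on GCS first.
def listSmallTableRows(bq: Bigquery, project: String, dataset: String, table: String): java.util.List[TableRow] =
  bq.tabledata().list(project, dataset, table).execute().getRows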

Use with Databricks

After a quick call with the folks at Databricks, it seems we can't set environment variables because it is a managed service and the environment variables can't be edited.

Is there a way to pass in the credentials? Even just a path, without having the variables themselves set up?
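For what it's worth, a minimal sketch using only the setters documented in the README above (the key-file path is hypothetical, and sqlContext is assumed to be available in the notebook): the connector accepts the JSON key as a file path, so no environment variable is required.

import com.spotify.spark.bigquery._

// Point the connector at a JSON key file by path instead of an environment variable
sqlContext.setGcpJsonKeyFile("/dbfs/keys/service-account.json")
sqlContext.setBigQueryProjectId("<BILLING_PROJECT>")
sqlContext.setBigQueryGcsBucket("<GCS_BUCKET>")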

JSON parsing failed when using saveAsBigQueryTable

I was trying to load data into BigQuery using the sample code below:
import org.apache.spark.{SparkConf, SparkContext}

val conf1 = new SparkConf().setAppName("App").setMaster("local[2]")
val sc = new SparkContext(conf1)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
sqlContext.setGcpJsonKeyFile("gcskey.json")

// Set up BigQuery project and bucket
sqlContext.setBigQueryProjectId("proj_name")
sqlContext.setBigQueryGcsBucket("gcsbucket")

// Set up BigQuery dataset location, default is US
sqlContext.setBigQueryDatasetLocation("US")
Usage:

// Load everything from a table
val table = sqlContext.bigQueryTable("bigquery-public-data:samples.shakespeare")

// Load results from a SQL query
// Only legacy SQL dialect is supported for now
val df = sqlContext.bigQuerySelect(
"SELECT word, word_count FROM [bigquery-public-data:samples.shakespeare]")

// Save data to a table
df.saveAsBigQueryTable("my-project:my_dataset.my_table")

While executing the code, I got the error below when it tried to execute the last statement:

165037 [main] ERROR org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation - Aborting job.
java.io.IOException: Failed to parse JSON: Unexpected token; Parser terminated before end of string

Can somebody help me to resolve this issue?

cleanup from dataproc bucket

Thanks for an excellent library, it saved me a ton of work!

It seems that running saveAsBigQueryTable creates many files in the Dataproc bucket. These look like temporary files from the process, but they do not get cleaned up when the cluster is deleted, and there is no mention of a cleanup action; I only found them by accident while manually digging around the bucket.

Just a tiny sample (there were many, many thousands):
/hadoop/tmp/spark-bigquery/spark-bigquery-1470089007094=105594505/_temporary/0/_temporary/attempt_201608012203_0012_m_000931_0/#1470089014836000...
/hadoop/tmp/spark-bigquery/spark-bigquery-1470142403992=2044817433/part-r-01462-db5b41ac-a0ac-4a52-9e8b-f2b368c97cd6.avro#1470142479116000...

These might also be left over from aborted contexts, it's hard to know; in that case you're off the hook 😄 but I would appreciate a mention of this in the docs.
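In case it helps anyone hitting the same thing, here is a hedged cleanup sketch (the prefix is assumed from the sample paths above, and sc is an active SparkContext); it simply removes the connector's temporary export directory after a run:

import org.apache.hadoop.fs.{FileSystem, Path}

// Recursively delete the temporary Avro export directory in the staging bucket
val tmpDir = new Path("gs://<GCS_BUCKET>/hadoop/tmp/spark-bigquery/")
val fs = tmpDir.getFileSystem(sc.hadoopConfiguration)
fs.delete(tmpDir, true)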

Update Schema with jobs

Hey guys

I want to update the schema with each write, but setting the job configuration to ALLOW_FIELD_ADDITION doesn't seem to work.

I even tried setting writeDisposition to WRITE_APPEND, but as far as I can tell it is set to that by default.

Any ideas?
See here

404 Not found exception when creating gcs directories

Hi everyone,

I encountered a weird issue while trying your library. When saving the temp file to GCS, it called the storage API with a strange address: http://google.api.address/null.

I tried debugging through the code to find what was causing the problem and I did not find it, however I solved the issue accidentally.

I wanted to test creating a directory with the service account to see if it was a permission problem, so I added google-cloud-storage to my dependencies because I couldn't import com.google.cloud.storage.StorageOptions, and this accidentally solved the issue...

Is there a way to make this error more explicit? Is it a problem global to the Google libraries rather than specific to this one?

Here is the build.sbt to reproduce the error:

libraryDependencies ++= Seq(
  "com.typesafe.scala-logging" %% "scala-logging" % "3.7.2",
  "org.apache.spark" %% "spark-core" % "2.2.0",
  "org.apache.spark" %% "spark-sql" % "2.2.0",
  "com.google.cloud" % "google-cloud-bigquery" % "0.32.0-beta",
  "com.google.cloud.bigdataoss" % "gcs-connector" % "1.6.2-hadoop2",
  "com.spotify" % "spark-bigquery_2.11" % "0.2.2",
  "org.apache.parquet" % "parquet-avro" % "1.9.0"
)

And the code:

import org.apache.spark.sql.SparkSession
import com.spotify.spark.bigquery._

object Main extends App {
  implicit val spark = SparkSession
    .builder()
    .appName("Name")
    .master("local[*]")
    .config("google.cloud.auth.service.account.json.keyfile", "/path")
    .config("fs.gs.project.id", "project-id")
    .getOrCreate()

  // tableName holds the fully qualified table being read
  spark.sqlContext.bigQuerySelect(s"SELECT * FROM ${tableName} LIMIT 10")
}

And here is the exception:

Exception in thread "main" com.google.api.client.http.HttpResponseException: 404 Not Found
Not Found
	at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1070)
	at com.google.api.client.googleapis.batch.BatchRequest.execute(BatchRequest.java:241)
	at com.google.cloud.hadoop.gcsio.BatchHelper.flushIfPossible(BatchHelper.java:118)
	at com.google.cloud.hadoop.gcsio.BatchHelper.flush(BatchHelper.java:132)
	at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.getItemInfos(GoogleCloudStorageImpl.java:1493)
	at com.google.cloud.hadoop.gcsio.ForwardingGoogleCloudStorage.getItemInfos(ForwardingGoogleCloudStorage.java:221)
	at com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.getFileInfos(GoogleCloudStorageFileSystem.java:1159)
	at com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.mkdirs(GoogleCloudStorageFileSystem.java:530)
	at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.mkdirs(GoogleHadoopFileSystemBase.java:1382)
	at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:1819)
	at com.google.cloud.hadoop.io.bigquery.AbstractExportToCloudStorage.prepare(AbstractExportToCloudStorage.java:59)
	at com.google.cloud.hadoop.io.bigquery.AbstractBigQueryInputFormat.getSplits(AbstractBigQueryInputFormat.java:123)
	at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:125)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
	at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
	at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
	at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1333)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
	at org.apache.spark.rdd.RDD.take(RDD.scala:1327)
	at org.apache.spark.rdd.RDD$$anonfun$first$1.apply(RDD.scala:1368)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
	at org.apache.spark.rdd.RDD.first(RDD.scala:1367)
	at com.spotify.spark.bigquery.BigQuerySQLContext.bigQueryTable(BigQuerySQLContext.scala:112)
	at com.spotify.spark.bigquery.BigQuerySQLContext.bigQuerySelect(BigQuerySQLContext.scala:93)
	at com.powerspace.bigquery.BigQueryExporter.read(BigQueryExporter.scala:24)

Cheers

Fail to run Zeppelin on Dataproc after including spark-bigquery

I have tried including spark-bigquery in Zeppelin on Google's Dataproc cluster, but it causes Spark to fail to launch. This happens regardless of which version of spark-bigquery I use; spark-bigquery is loaded through Zeppelin's dependency management.

java.lang.NullPointerException
	at org.apache.zeppelin.spark.Utils.invokeMethod(Utils.java:38)
	at org.apache.zeppelin.spark.Utils.invokeMethod(Utils.java:33)
	at org.apache.zeppelin.spark.SparkInterpreter.createSparkContext_2(SparkInterpreter.java:391)
	at org.apache.zeppelin.spark.SparkInterpreter.createSparkContext(SparkInterpreter.java:380)
	at org.apache.zeppelin.spark.SparkInterpreter.getSparkContext(SparkInterpreter.java:146)
	at org.apache.zeppelin.spark.SparkInterpreter.open(SparkInterpreter.java:828)
	at org.apache.zeppelin.interpreter.LazyOpenInterpreter.open(LazyOpenInterpreter.java:70)
	at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:483)
	at org.apache.zeppelin.scheduler.Job.run(Job.java:175)
	at org.apache.zeppelin.scheduler.FIFOScheduler$1.run(FIFOScheduler.java:139)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:748)

Clear Cache from Select

Is there a way to clear the cache from the select?

I have a job that calls the select multiple times, and I keep getting the old value.

I need it because I am writing updates to a table and my source is S3, so I basically need to know when the table was last updated, and I don't want to store state in the code.

Is that possible?

spark-bigquery 0.2.1 with Spark 2.2.0

I'm trying to use spark-bigquery 0.2.1 with a locally running Spark 2.2.0.

When trying to run through the spark-shell via spark-shell --packages com.spotify:spark-bigquery_2.11:0.2.1 I run into the following exception due to incompatible versions of guava:

java.lang.NoSuchMethodError: com.google.common.base.Splitter.splitToList(Ljava/lang/CharSequence;)Ljava/util/List;
  at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase$ParentTimestampUpdateIncludePredicate.create(GoogleHadoopFileSystemBase.java:572)
  at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.createOptionsBuilderFromConfig(GoogleHadoopFileSystemBase.java:1890)
  at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.configure(GoogleHadoopFileSystemBase.java:1587)
  at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.initialize(GoogleHadoopFileSystemBase.java:793)
  at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.initialize(GoogleHadoopFileSystemBase.java:756)
...

What is the recommended way of working around this?

In another scenario I built a Spark package based on spark-bigquery 0.2.1 that attempts to connect to Google BigQuery (again from a locally running Spark 2.2.0 cluster). In that case, I always run into the following issue:

Error: java.io.IOException: Error getting access token from metadata server at: http://metadata/computeMetadata/v1/instance/service-accounts/default/token
	at com.google.cloud.hadoop.util.CredentialFactory.getCredentialFromMetadataServiceAccount(CredentialFactory.java:209)
	at com.google.cloud.hadoop.util.CredentialConfiguration.getCredential(CredentialConfiguration.java:72)
	at com.google.cloud.hadoop.io.bigquery.BigQueryFactory.createBigQueryCredential(BigQueryFactory.java:81)
	at com.google.cloud.hadoop.io.bigquery.BigQueryFactory.getBigQuery(BigQueryFactory.java:101)
	at com.google.cloud.hadoop.io.bigquery.BigQueryFactory.getBigQueryHelper(BigQueryFactory.java:89)
	at com.google.cloud.hadoop.io.bigquery.AbstractBigQueryInputFormat.getBigQueryHelper(AbstractBigQueryInputFormat.java:362)
	at com.google.cloud.hadoop.io.bigquery.AbstractBigQueryInputFormat.getSplits(AbstractBigQueryInputFormat.java:101)
	at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:125)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
	at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
	at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
	at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1333)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
	at org.apache.spark.rdd.RDD.take(RDD.scala:1327)
	at org.apache.spark.rdd.RDD$$anonfun$first$1.apply(RDD.scala:1368)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
	at org.apache.spark.rdd.RDD.first(RDD.scala:1367)
	at com.spotify.spark.bigquery.BigQuerySQLContext.bigQueryTable(BigQuerySQLContext.scala:112)
	at com.spotify.spark.bigquery.BigQuerySQLContext.bigQueryTable(BigQuerySQLContext.scala:125)
	at com.mirai.sbbhistory.DefaultSource.createRelation(DefaultSource.scala:14)
	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:306)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:146)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at sparklyr.Invoke$.invoke(invoke.scala:102)
	at sparklyr.StreamHandler$.handleMethodCall(stream.scala:97)
	at sparklyr.StreamHandler$.read(stream.scala:62)
	at sparklyr.BackendHandler.channelRead0(handler.scala:52)
	at sparklyr.BackendHandler.channelRead0(handler.scala:14)
	at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)
	at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)
	at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:293)
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:267)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1294)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:911)
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:643)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442)
	at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
	at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.UnknownHostException: metadata
	at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:184)
	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
	at java.net.Socket.connect(Socket.java:589)
	at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
	at sun.net.www.http.HttpClient.openServer(HttpClient.java:463)
	at sun.net.www.http.HttpClient.openServer(HttpClient.java:558)
	at sun.net.www.http.HttpClient.<init>(HttpClient.java:242)
	at sun.net.www.http.HttpClient.New(HttpClient.java:339)
	at sun.net.www.http.HttpClient.New(HttpClient.java:357)
	at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1202)
	at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1138)
	at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1032)
	at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:966)
	at com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:93)
	at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:972)
	at com.google.cloud.hadoop.util.CredentialFactory$ComputeCredentialWithRetry.executeRefreshToken(CredentialFactory.java:160)
	at com.google.api.client.auth.oauth2.Credential.refreshToken(Credential.java:489)
	at com.google.cloud.hadoop.util.CredentialFactory.getCredentialFromMetadataServiceAccount(CredentialFactory.java:207)
	... 71 more

I tried several of the following combinations but to no avail:

  • setting all relevant options (setGcpJsonKeyFile, setBigQueryProjectId, setBigQueryGcsBucket, setBigQueryDatasetLocation)
  • setting the environment variable GOOGLE_APPLICATION_CREDENTIALS to a valid JSON key file (before starting the local Spark cluster, in spark-env.sh and in spark-defaults.conf; setting GOOGLE_APPLICATION_CREDENTIALS seems to work with https://github.com/samelamin/spark-bigquery though)
  • setting various other mapred.bq.*, fs.gs.* and google.cloud.* options
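One mitigation sometimes used for this kind of Guava clash when building an application jar against the connector is to shade Guava with sbt-assembly. This is only a sketch, not verified against this setup (spark-bigquery 0.2.2 appears to ship its own shaded Guava, judging by the shaded_guavaz frames in a later stack trace):

// build.sbt (requires the sbt-assembly plugin): relocate Guava so the connector's
// expected Splitter.splitToList is not shadowed by Spark's older Guava.
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("com.google.common.**" -> "shaded.guava.@1").inAll
)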

spark 2.0 support?

Is there any reason why you haven't released the connector with Spark 2.0 support?

I know that you suggested rebuilding from source with the 0.2 snapshot, and I am trying that now.

But I figured there is no harm in asking if there is a reason; it might save me some yak shaving 😃 😆

spark-bigquery jar is unusable

I'm trying to use spark-bigquery as a dependency by either:

scalaVersion  := "2.11.8"
val sparkVersion = "2.0.0"

resolvers += "bintray-spark-packages" at "https://dl.bintray.com/spark-packages/maven/"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion,
  "org.apache.spark" %% "spark-sql" % sparkVersion,
  "spotify" % "spark-bigquery" % "0.1.2-s_2.11"
  )

or

val sparkVersion = "2.0.0"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion,
  "org.apache.spark" %% "spark-sql" % sparkVersion,
  "com.spotify" %% "spark-bigquery" % "0.1.2"
  )

(Both resolved the dependency just fine.)

and when I try to use the classes/methods provided by spark-bigquery I get either:

  • class not found:
Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.DataFrame
  at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
  • method not found:
java.lang.NoSuchMethodError: com.spotify.spark.bigquery.package$BigQuerySQLContext.bigQueryTable(Lcom/google/api/services/bigquery/model/TableReference;)Lorg/apache/spark/sql/Dataset;

which is a manifestation of the same problem. org.apache.spark.sql.DataFrame is not a class, but an alias:

type DataFrame = org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]

Unfortunately the jar provided by maven/spark-packages does not translate the alias and instead uses org.apache.spark.sql.DataFrame in the compiled code. Therefore my project gets confused and can't load org.apache.spark.sql.DataFrame, because it does not exist (and should not).

There seems to be a problem in the way spark-packages is building/publishing jars. If I package spark-bigquery locally, and decompile it (for example BigQuerySQLContext), we can see that locally compiled code is in fact using org.apache.spark.sql.Dataset:

➜  jar xf spark-bigquery_2.11-0.2.0-SNAPSHOT.jar
➜  javap -v com/spotify/spark/bigquery/package\$BigQuerySQLContext.class | grep bigQueryTable | grep NameAndType
   #87 = NameAndType        #85:#86       // bigQueryTable:(Lcom/google/api/services/bigquery/model/TableReference;)Lorg/apache/spark/sql/Dataset;

while jar from maven/spark-packages is still using org.apache.spark.sql.DataFrame in the compiled code:

➜  jar -xf spark-bigquery-0.1.1-s_2.11.jar
➜  javap -v com/spotify/spark/bigquery/package\$BigQuerySQLContext.class | grep bigQueryTable | grep NameAndType
   #79 = NameAndType        #77:#78       // bigQueryTable:(Lcom/google/api/services/bigquery/model/TableReference;)Lorg/apache/spark/sql/DataFrame;

At this point org.apache.spark.sql.DataFrame is expected to be a class, so the classloader gets confused and throws an error.

One way to solve it might be to use Dataset explicitly, but honestly this seems like something that should be solved in spark-packages.

timeout?

Is there a way to set a timeout?

It would throw an exception if exceeded.

saveAsBigQueryTable() method returns NoSuchMethodError

I am running my spark-shell with Scala on version 2.2.1

spark-shell --packages com.spotify:spark-bigquery_2.10:0.2.0

scala> sqlContext.setGcpJsonKeyFile(file_path)

scala> sqlContext.setBigQueryProjectId("proj")

scala> sqlContext.setBigQueryGcsBucket("dummy_bucket")

scala> sqlContext.setBigQueryDatasetLocation("US")

I am trying to load some data into BigQuery, which returns an error as shown below:

scala> val df = Seq((1,1,1), (2,2,2)).toDF("A","B","C")

scala> df.show
+---+---+---+
|  A|  B|  C|
+---+---+---+
|  1|  1|  1|
|  2|  2|  2|
+---+---+---+


scala> df.saveAsBigQueryTable("proj:dataset_name.table_name")
java.lang.NoSuchMethodError: com.google.common.base.Splitter.splitToList(Ljava/lang/CharSequence;)Ljava/util/List;
  at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase$ParentTimestampUpdateIncludePredicate.create(GoogleHadoopFileSystemBase.java:572)
  at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.createOptionsBuilderFromConfig(GoogleHadoopFileSystemBase.java:1890)
  at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.configure(GoogleHadoopFileSystemBase.java:1587)
  at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.initialize(GoogleHadoopFileSystemBase.java:793)
  at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.initialize(GoogleHadoopFileSystemBase.java:756)
  at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
  at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
  at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
  at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
  at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
  at org.apache.spark.sql.execution.datasources.DataSource.writeInFileFormat(DataSource.scala:394)
  at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:471)
  at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:50)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
  at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
  at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
  at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:609)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:217)
  at com.databricks.spark.avro.package$AvroDataFrameWriter$$anonfun$avro$1.apply(package.scala:26)
  at com.databricks.spark.avro.package$AvroDataFrameWriter$$anonfun$avro$1.apply(package.scala:26)
  at com.spotify.spark.bigquery.package$BigQueryDataFrame.saveAsBigQueryTable(package.scala:159)
  at com.spotify.spark.bigquery.package$BigQueryDataFrame.saveAsBigQueryTable(package.scala:171)
  ... 50 elided


Kindly help !!

Implement Spark's DataSource API

This ticket is to discuss whether there is interest in migrating this project to provide a more "native" integration via Spark's DataSource API. From looking at spark-avro this looks doable, and I think it would make it easier to deal with tickets #2 and #3.

This would also make it possible to provide different ways of loading data, for example via JSON or Avro, using BQ partitions etc. based on user preference. We are currently using JSON internally to work around what seems to be a BigQuery Avro bug: https://code.google.com/p/google-bigquery/issues/detail?id=549.

Performance tune df.saveAsBigQueryTable

Hi

I am trying to load a zipped CSV file from Google Cloud Storage into BQ. The file size is 100 GB, but the load is taking a lot of time.
Is there a way to tune the df.saveAsBigQueryTable command to speed up the load?

 val rowData = input.map(x => Row(x(0), x(1), x(2), x(3).toLong, x(4), x(5), x(6), x(7), x(8), x(9), x(10), x(11), x(12), x(13), x(14), x(15), x(16), x(17)))
      val df = sqlContext.createDataFrame(rowData, schemaTraffic)
      df.saveAsBigQueryTable(bqTrafficTable + partitionDate)
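Not a connector-specific answer, but a general Spark-side knob worth trying, sketched under the assumption that the Avro export to GCS is the bottleneck: repartition the DataFrame before saving so the export is written in parallel across more tasks.

// The partition count (200) is purely illustrative
df.repartition(200).saveAsBigQueryTable(bqTrafficTable + partitionDate)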

Temporary table creation fails when location contains "-"

I would like to fix the logic that generates datasetId in BigQueryClient#stagingTable:

val datasetId = prefix + location.toLowerCase

Is the following modification possible?

val datasetId = prefix + location.toLowerCase.replaceAll("[^a-z0-9_]+", "_")

Let me explain the situation.
I am considering using the connector to transfer data from BigQuery to an application on Dataproc.

The data location we use is "asia-northeast1".
This is a string containing "-".

As a result, table creation seems to fail when the connector creates a temporary table, as in the following log.

{"loglevel": "INFO", "time": "2019-06-13 11:21:46.591", "appname": "job-executor", "function": "com.spotify.spark.bigquery.BigQueryClient.stagingDataset:148", "message": "Creating staging dataset repx-dev-jp-fiot-mgr:spark_bigquery_staging_asia-northeast1"}
java.util.concurrent.ExecutionException: com.google.api.client.googleapis.json.GoogleJsonResponseException: 400 Bad Request
{
  "code": 400,
  "errors": [{
    "domain": "global",
    "message": "Invalid dataset ID \"spark_bigquery_staging_asia-northeast1\". Dataset IDs must be alphanumeric (plus underscores) and must be at most 1024 characters long.",
    "reason": "invalid"
  }],
  "message": "Invalid dataset ID \"spark_bigquery_staging_asia-northeast1\". Dataset IDs must be alphanumeric (plus underscores) and must be at most 1024 characters long.",
  "status": "INVALID_ARGUMENT"
}
at com.google.common.util.concurrent.AbstractFuture.getDoneValue(AbstractFuture.java:500)
at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:459)
at com.google.common.util.concurrent.AbstractFuture$TrustedFuture.get(AbstractFuture.java:76)
at com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:142)
at com.google.common.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2373)
at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2337)
at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2295)
at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2208)
at com.google.common.cache.LocalCache.get(LocalCache.java:4053)
at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4057)
at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4986)
at com.spotify.spark.bigquery.BigQueryClient.query(BigQueryClient.scala:105)
at com.spotify.spark.bigquery.BigQuerySQLContext.bigQuerySelect(BigQuerySQLContext.scala:93)

Query DATE_RANGE or table wildcards

I want to use the BigQuery connector to load data from multiple tables at a time. I tried to use DATE_RANGE functions and table wildcards, but neither works and I get the following error: Invalid datasetAndTableString '<dataset-id>.<some-table-name-with-wildcard>'; must match regex '[a-zA-Z0-9_]+\.[a-zA-Z0-9_]+'.

I want to load the raw data, so I can't use sqlContext.bigQuerySelect() and provide a SQL statement, because then I would have to unnest the nested data.

tl;dr: Is it possible to load raw data from multiple tables at once with something like the DATE_RANGE or wildcard functions that BigQuery itself provides?

Partition table by date

Hi guys

Really awesome support on this connector, I appreciate it

I was wondering if there is a flag to enable partitioning by date. I see google analytics use that

Since Big Query charges per query you would ideally want your dataset small to keep your costs low

It would be very useful if we can partition tables by dates. Google docs here has more details

Thoughts?

Incorrect BigQuery schema for decimal, timestamp and date schema on save

Currently, the BigQuery schema seems to be inferred from the loaded Avro schema on write. Because the spark-avro implementation lacks support for the Avro spec's logical type annotations, DecimalType gets loaded into BigQuery as String, TimestampType is loaded as Integer, and DateType is not recognized at all.

Should the BigQuery schema be explicitly specified rather than relying on the mismatched Avro types?

Make use of parquet format instead of avro

Hi @nevillelyh

We use the Avro format to save a dataframe to GCS before loading the Avro files into BigQuery. One of the biggest advantages of the Avro format is that BigQuery can read the schema from the Avro metadata.
However, Avro doesn't support a timestamp type, so we need a workaround to store a dataframe that includes timestamp columns.

I guess Parquet supports a timestamp type. Moreover, if BigQuery can load Parquet files on GCS without an explicit schema, it would be better to use the Parquet format. What do you think?

https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-parquet

Test if this works with DataProc & bdutil

Spark clusters created with Dataproc or bdutil should have the GCS connector and GCP credentials already set up. A user should be able to just load the package and read/write BQ tables.

Suggested replacements to this project?

The README suggests that this project is no longer used widely within Spotify. Are there suggested replacements for transferring data from BigQuery into Spark for big processing jobs? Or is this workflow generally outdated? So far this is the best piece of OSS I've found for the purpose.

Does `spark-bigquery` work in Spark 2.0?

Thank you for the great package.

When I tried to execute saveAsBigQueryTable in the Spark shell and Apache Zeppelin, I got the error below. I guess BigQueryDataFrame doesn't work well. Does the package work in Spark 2.0?

Otherwise, is there anything I need to do, such as configuration, for the spark-bigquery package?

> df.saveAsBigQueryTable(destinationName, WriteDisposition.WRITE_TRUNCATE)
com.spotify.spark.bigquery.package$.BigQueryDataFrame(Lorg/apache/spark/sql/Dataset;)Lcom/spotify/spark/bigquery/package$BigQueryDataFrame;
  ... 56 elided

Unable to read table .. can write .. Dataproc scala

I am new to this architecture and have read many articles about various dependencies and such not working. Can someone point out where I might have gone wrong?
There was a lot of trial and error in this build.sbt.
Spark 2.2.0
Scala 2.11.8

build.sbt:

version := "2.0.2"
scalaVersion := "2.11.8"
artifactName := { (sv: ScalaVersion, module: ModuleID, artifact: Artifact) =>
  artifact.name + "." + artifact.extension
}

resolvers += "bintray-spark-packages" at "https://dl.bintray.com/spark-packages/maven/"
resolvers += "jitpack" at "https://jitpack.io"

// https://mvnrepository.com/artifact/org.apache.spark/spark-sql_2.11
//TRYlibraryDependencies += "org.apache.spark" % "spark-sql_2.11" % "2.2.0" % "provided"
libraryDependencies += "org.apache.spark" % "spark-sql_2.11" % "2.0.0"

// https://mvnrepository.com/artifact/org.apache.spark/spark-core_2.11/2.2.0
libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "2.0.0"


// https://spark-packages.org/package/spotify/spark-bigquery
libraryDependencies += "com.spotify" % "spark-bigquery_2.11" % "0.2.1"

// https://mvnrepository.com/artifact/com.databricks/spark-avro_2.11
// https://github.com/databricks/spark-avro
//V1 //libraryDependencies += "com.databricks" % "spark-avro_2.11" % "3.0.0"
//V2 libraryDependencies += "com.github.databricks" % "spark-avro" % "204864b6cf"
libraryDependencies += "com.databricks" %% "spark-avro" % "3.0.0" 

The Error:

        at com.spotify.spark.bigquery.BigQuerySQLContext.bigQueryTable(BigQuerySQLContext.scala:116)
        at com.spotify.spark.bigquery.BigQuerySQLContext.bigQuerySelect(BigQuerySQLContext.scala:93)
        at Query$.main(myquery.scala:19)
        at Query.main(myquery.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:736)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
17/09/09 03:02:55 INFO org.spark_project.jetty.server.ServerConnector: Stopped ServerConnector@30c31dd7{HTTP/1.1}{0.0.0.0:4040}
ERROR: (gcloud.dataproc.jobs.submit.spark) Job [4947fcc7-1bb4-4db0-9d33-9b9d67d88db8] entered state [ERROR] while waiting for [DONE].

I am also using this init code when Dataproc builds the cluster, to replace the Avro jars:

rm -rf /usr/lib/h{adoop,ive}*/{,lib/}*avro*.jar
# Consider staging these jars in GCS to avoid being throttled & be nice to Maven Central.
gsutil cp gs://prod-bigdata/jar/avro-1.7.7.jar /usr/lib/hadoop/lib
gsutil cp gs://prod-bigdata/jar/avro-mapred-1.7.7-hadoop2.jar /usr/lib/hadoop-mapreduce

Sample script I am attempting to run (some cutting and pasting here).
I have tried both the direct table call and the bigQuerySelect call. The save DOES work.

import org.apache.spark.sql.SparkSession
import com.databricks.spark.avro._

object Query {
  def main(args: Array[String]) {
    val spark = SparkSession.builder.appName("Simple Application").getOrCreate()
    import com.spotify.spark.bigquery._
    val table2 = spark.sqlContext.bigQuerySelect("SELECT * FROM [prod:data_analytics_poc.REGIONS]").limit(100)
    table2.show(20)
  }
}

How does spotify transfer data from RDB to bigquery now?

@nevillelyh

I'm just curious: how does Spotify transfer data from an RDB like MySQL to BigQuery now? I know spark-bigquery is in maintenance mode; that is, you might have found another, better way to do it.

We centralize all of our data in BigQuery, and we are still transferring data from MySQL to BigQuery with scheduled jobs. As you know, transferring a huge table can be very tough without a distributed processing framework such as Spark. If you have any better way to transfer data, would you please tell me about it?

I would really appreciate it if you could answer my question, as far as you can share on GitHub.

Failing to save dataframe to bigquery

Hi folks

I am launching a new cluster using Google's Dataproc. Although bigQuerySelect works and returns results, when I try to save those results to a BigQuery table, it fails with the stack trace below.

Also, as a side note, I am packaging my Spark application as an "uber" (fat) jar with this connector added as a dependency.

For some reason, when I launch a job with "--packages com.spotify:spark-bigquery_2.10:0.1.0" as an argument, it can't find the dependency when it tries to import the package in the code base.

Should I pass the --packages somewhere else?
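One alternative worth trying, offered as a sketch rather than a confirmed fix: since the application is already packaged as a fat jar, the connector can be bundled into it via build.sbt instead of being passed with --packages (coordinates as published on Maven Central; pick the artifact matching your Scala and Spark versions).

// build.sbt: bundle the connector into the uber jar
libraryDependencies += "com.spotify" %% "spark-bigquery" % "0.2.2"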

my code:

    val df = sqlContext.bigQuerySelect( "SELECT *  FROM [bigquery-public-data:hacker_news.comments] LIMIT 1000")
    df.limit(10).show()

    df.limit(10).saveAsBigQueryTable("projectid:tablename")
16/08/10 13:00:35 INFO com.databricks.spark.avro.AvroRelation: using deflate: -1 for Avro output

[Stage 5:>                                                          (0 + 1) / 2]
[Stage 5:=============================>                             (1 + 1) / 2]16/08/10 13:00:39 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in stage 6.0 (TID 71, cluster-1-w-0.c.justeat-datalake.internal): java.lang.NoSuchMethodError: org.apache.avro.generic.GenericData.createDatumWriter(Lorg/apache/avro/Schema;)Lorg/apache/avro/io/DatumWriter;
    at org.apache.avro.mapreduce.AvroKeyRecordWriter.<init>(AvroKeyRecordWriter.java:55)
    at org.apache.avro.mapreduce.AvroKeyOutputFormat$RecordWriterFactory.create(AvroKeyOutputFormat.java:79)
    at org.apache.avro.mapreduce.AvroKeyOutputFormat.getRecordWriter(AvroKeyOutputFormat.java:105)
    at com.databricks.spark.avro.AvroOutputWriter.<init>(AvroOutputWriter.scala:82)
    at com.databricks.spark.avro.AvroOutputWriterFactory.newInstance(AvroOutputWriterFactory.scala:31)
    at org.apache.spark.sql.execution.datasources.BaseWriterContainer.newOutputWriter(WriterContainer.scala:129)
    at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:255)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:148)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:148)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

16/08/10 13:00:40 ERROR org.apache.spark.scheduler.cluster.YarnScheduler: Lost executor 1 on cluster-1-w-0.c.justeat-datalake.internal: Container marked as failed: container_1470833273562_0003_01_000002 on host: cluster-1-w-0.c.justeat-datalake.internal. Exit status: 50. Diagnostics: Exception from container-launch.
Container id: container_1470833273562_0003_01_000002
Exit code: 50
Stack trace: ExitCodeException exitCode=50: 
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
    at org.apache.hadoop.util.Shell.run(Shell.java:456)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
    at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)


Container exited with a non-zero exit code 50

IoException with bigQuerySelect

For some reason I get an IOException when I use bigQuerySelect(). However, starting with bigQueryTable() and doing the equivalent select works fine. I tried multiple tables.

Using 0.2.2-s_2.11


17/12/20 21:15:05 ERROR ApplicationMaster: User class threw exception: java.util.concurrent.ExecutionException: java.io.IOException: Encountered " "-" "- "" at line 1, column 20.
Was expecting:
    <EOF> 
    
java.util.concurrent.ExecutionException: java.io.IOException: Encountered " "-" "- "" at line 1, column 20.
Was expecting:
    <EOF> 
    
	at shaded_guavaz.util.concurrent.AbstractFuture.getDoneValue(AbstractFuture.java:500)
	at shaded_guavaz.util.concurrent.AbstractFuture.get(AbstractFuture.java:459)
	at shaded_guavaz.util.concurrent.AbstractFuture$TrustedFuture.get(AbstractFuture.java:76)
	at shaded_guavaz.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:142)
	at shaded_guavaz.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2373)
	at shaded_guavaz.cache.LocalCache$Segment.loadSync(LocalCache.java:2337)
	at shaded_guavaz.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2295)
	at shaded_guavaz.cache.LocalCache$Segment.get(LocalCache.java:2208)
	at shaded_guavaz.cache.LocalCache.get(LocalCache.java:4053)
	at shaded_guavaz.cache.LocalCache.getOrLoad(LocalCache.java:4057)
	at shaded_guavaz.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4986)
	at com.spotify.spark.bigquery.BigQueryClient.query(BigQueryClient.scala:105)
	at com.spotify.spark.bigquery.BigQuerySQLContext.bigQuerySelect(BigQuerySQLContext.scala:93)
	at com.zulily.utils.data.BqTest$.foo(BqTest.scala:186)
	at com.zulily.utils.data.BqTest$.main(BqTest.scala:46)
	at com.zulily.utils.data.BqTest.main(BqTest.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:637)
Caused by: java.io.IOException: Encountered " "-" "- "" at line 1, column 20.
Was expecting:
    <EOF> 
    
	at com.google.cloud.hadoop.io.bigquery.BigQueryUtils.waitForJobCompletion(BigQueryUtils.java:95)
	at com.spotify.spark.bigquery.BigQueryClient.com$spotify$spark$bigquery$BigQueryClient$$waitForJob(BigQueryClient.scala:134)
	at com.spotify.spark.bigquery.BigQueryClient$$anon$1.load(BigQueryClient.scala:90)
	at com.spotify.spark.bigquery.BigQueryClient$$anon$1.load(BigQueryClient.scala:79)
	at shaded_guavaz.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3628)
	at shaded_guavaz.cache.LocalCache$Segment.loadSync(LocalCache.java:2336)
	... 15 more

Exception when saving dataframe back to BigQuery

Hi folks.

I'm trying to save a DF on BigQuery but I'm getting this error. My current environment is Spark 2.2.0 running on a Google DataProc cluster.

It seems that an abstract method is getting called.

Any help will be appreciated. Thanks in advance!

scala> df.saveAsBigQueryTable("DATA_SET.NEW_TABLE")
[Stage 6:>                                                          (0 + 0) / 2]17/12/02 20:29:55 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in stage 6.0 (TID 13, dotz-dataproc-m.c.dotzcloud-datalabs-dev.internal, executor 1): org.apache.spark.SparkException: Task failed while writing rows
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:191)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:190)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:108)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.AbstractMethodError: org.apache.spark.sql.execution.datasources.OutputWriterFactory.getFileExtension(Lorg/apache/hadoop/mapreduce/TaskAttemptContext;)Ljava/lang/String;
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.newOutputWriter(FileFormatWriter.scala:299)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:314)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:258)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:256)
        at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1375)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:261)
        ... 8 more

[Stage 6:>                                                          (0 + 2) / 2]17/12/02 20:29:56 ERROR org.apache.spark.scheduler.TaskSetManager: Task 0 in stage 6.0 failed 4 times; aborting job
17/12/02 20:29:56 ERROR org.apache.spark.sql.execution.datasources.FileFormatWriter: Aborting job null.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 6.0 failed 4 times, most recent failure: Lost task 0.3 in stage 6.0 (TID 20, dotz-dataproc-m.c.dotzcloud-datalabs-dev.internal, executor 1): org.apache.spark.SparkException: Task failed while writing rows
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:191)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:190)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:108)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.AbstractMethodError: org.apache.spark.sql.execution.datasources.OutputWriterFactory.getFileExtension(Lorg/apache/hadoop/mapreduce/TaskAttemptContext;)Ljava/lang/String;
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.newOutputWriter(FileFormatWriter.scala:299)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:314)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:258)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:256)
        at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1375)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:261)
        ... 8 more

Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1499)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1487)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1486)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1486)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
        at scala.Option.foreach(Option.scala:257)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1714)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
        at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:188)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:173)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:173)
        at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:173)
        at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:145)
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
        at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
        at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
        at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
        at org.apache.spark.sql.execution.datasources.DataSource.writeInFileFormat(DataSource.scala:438)
        at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:474)
        at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:48)
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
        at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
        at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
        at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
        at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:610)
        at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233)
        at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:217)
        at com.databricks.spark.avro.package$AvroDataFrameWriter$$anonfun$avro$1.apply(package.scala:26)
        at com.databricks.spark.avro.package$AvroDataFrameWriter$$anonfun$avro$1.apply(package.scala:26)
        at com.spotify.spark.bigquery.package$BigQueryDataFrame.saveAsBigQueryTable(package.scala:159)
        at com.spotify.spark.bigquery.package$BigQueryDataFrame.saveAsBigQueryTable(package.scala:171)
        at $line40.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:31)
        at $line40.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:36)
        at $line40.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:38)
        at $line40.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:40)
        at $line40.$read$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:42)
        at $line40.$read$$iw$$iw$$iw$$iw$$iw.<init>(<console>:44)
        at $line40.$read$$iw$$iw$$iw$$iw.<init>(<console>:46)
        at $line40.$read$$iw$$iw$$iw.<init>(<console>:48)
        at $line40.$read$$iw$$iw.<init>(<console>:50)
        at $line40.$read$$iw.<init>(<console>:52)
        at $line40.$read.<init>(<console>:54)
        at $line40.$read$.<init>(<console>:58)
        at $line40.$read$.<clinit>(<console>)
        at $line40.$eval$.$print$lzycompute(<console>:7)
        at $line40.$eval$.$print(<console>:6)
        at $line40.$eval.$print(<console>)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:786)
        at scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1047)
        at scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:638)
        at scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:637)
        at scala.reflect.internal.util.ScalaClassLoader$class.asContext(ScalaClassLoader.scala:31)
        at scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scala:19)
        at scala.tools.nsc.interpreter.IMain$WrappedRequest.loadAndRunReq(IMain.scala:637)
        at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:569)
        at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:565)
        at scala.tools.nsc.interpreter.ILoop.interpretStartingWith(ILoop.scala:807)
        at scala.tools.nsc.interpreter.ILoop.command(ILoop.scala:681)
        at scala.tools.nsc.interpreter.ILoop.processLine(ILoop.scala:395)
        at scala.tools.nsc.interpreter.ILoop.loop(ILoop.scala:415)
        at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply$mcZ$sp(ILoop.scala:923)
        at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:909)
        at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:909)
        at scala.reflect.internal.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:97)
        at scala.tools.nsc.interpreter.ILoop.process(ILoop.scala:909)
        at org.apache.spark.repl.Main$.doMain(Main.scala:70)
        at org.apache.spark.repl.Main$.main(Main.scala:53)
        at org.apache.spark.repl.Main.main(Main.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:755)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: org.apache.spark.SparkException: Task failed while writing rows
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:191)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:190)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:108)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.AbstractMethodError: org.apache.spark.sql.execution.datasources.OutputWriterFactory.getFileExtension(Lorg/apache/hadoop/mapreduce/TaskAttemptContext;)Ljava/lang/String;
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.newOutputWriter(FileFormatWriter.scala:299)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:314)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:258)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:256)
        at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1375)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:261)
        ... 8 more
17/12/02 20:29:56 WARN org.apache.spark.ExecutorAllocationManager: No stages are running, but numRunningTasks != 0
org.apache.spark.SparkException: Job aborted.
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:215)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:173)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:173)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:173)
  at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:145)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
  at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
  at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
  at org.apache.spark.sql.execution.datasources.DataSource.writeInFileFormat(DataSource.scala:438)
  at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:474)
  at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:48)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
  at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
  at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
  at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:610)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:217)
  at com.databricks.spark.avro.package$AvroDataFrameWriter$$anonfun$avro$1.apply(package.scala:26)
  at com.databricks.spark.avro.package$AvroDataFrameWriter$$anonfun$avro$1.apply(package.scala:26)
  at com.spotify.spark.bigquery.package$BigQueryDataFrame.saveAsBigQueryTable(package.scala:159)
  at com.spotify.spark.bigquery.package$BigQueryDataFrame.saveAsBigQueryTable(package.scala:171)
  ... 50 elided
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 6.0 failed 4 times, most recent failure: Lost task 0.3 in stage 6.0 (TID 20, dotz-dataproc-m.c.dotzcloud-datalabs-dev.internal, executor 1): org.apache.spark.SparkException: Task failed while writing rows
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:191)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:190)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:108)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.AbstractMethodError: org.apache.spark.sql.execution.datasources.OutputWriterFactory.getFileExtension(Lorg/apache/hadoop/mapreduce/TaskAttemptContext;)Ljava/lang/String;
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.newOutputWriter(FileFormatWriter.scala:299)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:314)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:258)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:256)
        at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1375)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:261)
        ... 8 more

Driver stacktrace:
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1499)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1487)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1486)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1486)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
  at scala.Option.foreach(Option.scala:257)
  at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1714)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:188)
  ... 87 more
Caused by: org.apache.spark.SparkException: Task failed while writing rows
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:191)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:190)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
  at org.apache.spark.scheduler.Task.run(Task.scala:108)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.AbstractMethodError: org.apache.spark.sql.execution.datasources.OutputWriterFactory.getFileExtension(Lorg/apache/hadoop/mapreduce/TaskAttemptContext;)Ljava/lang/String;
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.newOutputWriter(FileFormatWriter.scala:299)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:314)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:258)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:256)
  at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1375)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:261)
  ... 8 more

The Apache Avro library failed to parse the header

Spark version: 2.2.0
Spotify/spark-bigquery version: 0.2.2

Hi,

I am trying to use the saveAsBigQueryTable function to write a schema that has an array of structs as a field. However, I am getting the following error:

The Apache Avro library failed to parse the header with the following error: Invalid namespace: .topic_scores

The offending field is:


{
            "type": [
                {
                    "items": [
                        {
                            "namespace": ".topic_scores",
                            "type": "record",
                            "name": "topic_scores",
                            "fields": [
                                {
                                    "type": "int",
                                    "name": "index"
                                },
                                {
                                    "type": "float",
                                    "name": "score"
                                }
                            ]
                        },
                        "null"
                    ],
                    "type": "array"
                },
                "null"
            ],
            "name": "topic_scores"
        }

You can see that the namespace field begins with a dot. My guess is that the issue stems from https://github.com/spotify/spark-bigquery/blob/master/src/main/scala/com/databricks/spark/avro/SchemaConverters.scala#L342-L346
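
For illustration only, here is a minimal Scala sketch of how an empty default namespace produces the leading dot; the variable names are made up and this is not the library's actual code, just the concatenation pattern the linked lines appear to use:

// When the record namespace defaults to the empty string, prefixing the field
// name with "$recordNamespace." yields a namespace that starts with a dot
// (".topic_scores"), which BigQuery rejects when loading the Avro files.
val recordNamespace = ""                              // default when no option is passed
val fieldName = "topic_scores"
val childNamespace = s"$recordNamespace.$fieldName"   // evaluates to ".topic_scores"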

I can't find a way to configure the recordNamespace value. According to the spark-avro documentation:

You can specify the record name and namespace like this:

import com.databricks.spark.avro._
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local").getOrCreate()
val df = spark.read.avro("src/test/resources/episodes.avro")

val name = "AvroTest"
val namespace = "com.databricks.spark.avro"
val parameters = Map("recordName" -> name, "recordNamespace" -> namespace)

df.write.options(parameters).avro("/tmp/output")

I think this is the line that reads that option, and sets the value to an empty string if not provided: https://github.com/databricks/spark-avro/blob/branch-4.0/src/main/scala/com/databricks/spark/avro/DefaultSource.scala#L114

These options are not parameterized anywhere in the Spotify library. Has anyone seen this issue, or does anyone have a workaround? Thanks!

Support configurable query job priority

Query jobs are currently submitted with a priority that depends on whether the job is submitted from an interactive context. If a job is submitted from the Scala REPL, it is submitted as an interactive BigQuery job; otherwise it is submitted as a BigQuery batch job.

Rather than pre-determining the query job priority, it would be nice if it could be configured from the outside - possibly keeping the current logic as a default.
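
To make the request concrete, here is a minimal sketch of how a caller-supplied priority could be passed to the BigQuery query job configuration. The property name bigquery.query.priority is hypothetical and does not exist in spark-bigquery today; JobConfiguration and JobConfigurationQuery are the standard google-api-services-bigquery model classes.

import com.google.api.services.bigquery.model.{JobConfiguration, JobConfigurationQuery}

// Hypothetical, caller-supplied setting; spark-bigquery has no such option today.
val priority = sys.props.getOrElse("bigquery.query.priority", "BATCH") // "INTERACTIVE" or "BATCH"

// Build the query job configuration with the supplied priority instead of the
// REPL-derived one.
val queryConfig = new JobConfigurationQuery()
  .setQuery("SELECT 1")
  .setPriority(priority)

val jobConfig = new JobConfiguration().setQuery(queryConfig)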

Sometimes loading Avro files into a BigQuery table fails

Sometimes loading Avro files into a BigQuery table fails because a _temporary directory on GCS doesn't exist.

16/12/02 23:46:47 INFO com.spotify.spark.bigquery.BigQueryClient: Loading gs://spark-helper-us-region/hadoop/tmp/spark-bigquery/spark-bigquery-1480722354001=863039906 into sage-shard-740:analytics_us.activities_20160903
Exception in thread "main" java.io.IOException: Not found: Uri gs://spark-helper-us-region/hadoop/tmp/spark-bigquery/spark-bigquery-1480722354001=863039906/_temporary/0/_temporary/attempt_201612022346_0113_m_000065_0/part-r-00065-0f64344d-0f7e-4677-a28b-56e79a287e41.avro
	at com.google.cloud.hadoop.io.bigquery.BigQueryUtils.waitForJobCompletion(BigQueryUtils.java:95)
	at com.spotify.spark.bigquery.BigQueryClient.com$spotify$spark$bigquery$BigQueryClient$$waitForJob(BigQueryClient.scala:134)
	at com.spotify.spark.bigquery.BigQueryClient.load(BigQueryClient.scala:130)
	at com.spotify.spark.bigquery.package$BigQueryDataFrame.saveAsBigQueryTable(package.scala:150)
	at com.spotify.spark.bigquery.package$BigQueryDataFrame.saveAsBigQueryTable(package.scala:159)
	at com.mercari.spark.sql.SparkBigQueryHelper$.saveBigQueryTableByDataFrame(SparkBigQueryHelper.scala:229)
	at com.mercari.spark.sql.SparkBigQueryHelper.saveBigQueryTableByDataFrame(SparkBigQueryHelper.scala:66)
	at com.mercari.spark.batch.ActivitiesTableCreator$.apply(ActivitiesTableCreator.scala:226)
	at com.mercari.spark.batch.ActivitiesTableCreator$.main(ActivitiesTableCreator.scala:210)
	at com.mercari.spark.batch.ActivitiesTableCreator.main(ActivitiesTableCreator.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
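
Not a confirmed fix, but one hedged observation: the missing path is under _temporary/.../attempt_..., which usually points at a retried or speculative task attempt whose output was never committed on GCS. The sketch below shows two standard Spark configuration keys that may avoid the race; whether they apply here is an assumption.

import org.apache.spark.SparkConf

// A sketch, assuming the failure is caused by speculative or retried task attempts
// racing with the commit of output files on GCS.
val conf = new SparkConf()
  .set("spark.speculation", "false")                                         // avoid duplicate task attempts
  .set("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")  // commit task output directly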

Google cloud dataproc not recognizing 'sqlContext' command

I am a newbie in Spark & Scala. I ran this package in a local SBT console and everything worked fine.
But now I am trying to make the connection on Google Cloud Dataproc and ran the following command:

spark-shell --packages com.spotify:spark-bigquery_2.10:0.2.0

But when I then try to run the following command at the Scala prompt:
val table = sqlContext.bigQueryTable("bigquery-public-data:samples.shakespeare")
the shell does not recognize 'sqlContext'.
Am I missing something?
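
If the cluster's Spark 2.x shell only exposes the spark session variable, a SQLContext can be derived from it before importing the package. This is a minimal sketch assuming a standard spark-shell session; note also that the _2.10 artifact may not match a Scala 2.11 Spark build on newer Dataproc images.

// In spark-shell on Spark 2.x the entry point is the SparkSession named `spark`;
// the implicit BigQuery methods hang off a SQLContext, which can be obtained from it.
val sqlContext = spark.sqlContext

import com.spotify.spark.bigquery._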

Is the environment variable required?

Locally sqlContext.setGcpJsonKeyFile("<JSON_KEY_FILE>") works, but on a Spark-on-YARN 2.0.1 cluster I'm getting: Caused by: java.io.IOException: The Application Default Credentials are not available. They are available if running in Google Compute Engine. Otherwise, the environment variable GOOGLE_APPLICATION_CREDENTIALS must be defined pointing to a file defining the credentials. See https://developers.google.com/accounts/docs/application-default-credentials for more information.

Is the env variable required for use in the cluster or is it not finding the file and using some kind of fallback mechanism?
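
A hedged sketch of one way to make the variable visible on YARN, assuming the JSON key has already been copied to the same path on every node (the path below is hypothetical). spark.executorEnv.* and spark.yarn.appMasterEnv.* are standard Spark configuration prefixes.

import org.apache.spark.SparkConf

// Hypothetical key file path; it must exist at this location on every YARN node.
val keyFile = "/etc/keys/my-service-account.json"

// Export GOOGLE_APPLICATION_CREDENTIALS to the executors and the YARN application master.
val conf = new SparkConf()
  .set("spark.executorEnv.GOOGLE_APPLICATION_CREDENTIALS", keyFile)
  .set("spark.yarn.appMasterEnv.GOOGLE_APPLICATION_CREDENTIALS", keyFile)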

Unable to call the big query implicit methods in the spark shell

I'm testing this package in the spark shell using these settings:

Apache spark 1.6.1
Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_112)

I am able to import the main package with import com.spotify.spark.bigquery._, but afterwards I'm unable to set the Spark context settings. For example:

sc.setGcpJsonKeyFile("test")

I get:

error: value setGcpJsonKeyFile is not a member of org.apache.spark.SparkContext

I also tried to import sqlContext.implicits._, but this didn't help. I get the same error when I try to call the same method on the sqlContext.

Not sure why I'm unable to use the package.

Thanks
Bogdan

Not able to load more than 1 million records into BigQuery from Dataproc

I am trying to load a file which has more than 30 million records, but I get an error saying ArrayIndexOutOfBoundsException. Please suggest a solution ASAP.

    at com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:302)
    at com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:431)
    at org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:148)
    at org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:131)
    at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:389)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:147)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
    at org.apache.spark.scheduler.Task.run(Task.scala:86)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Caused by: java.lang.ArrayIndexOutOfBoundsException
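
The stack trace points at the CSV parser (univocity) rather than at the BigQuery write, so the failing row is most likely malformed or wider than the parser's limits. Below is a hedged sketch of reader options that may help; the option names are standard Spark 2.x CSV options, and the input path is hypothetical.

// A sketch, assuming the ArrayIndexOutOfBoundsException is raised while parsing
// malformed or oversized rows in the CSV input, in a Spark 2.x session.
val df = spark.read
  .option("header", "true")
  .option("mode", "DROPMALFORMED")         // drop rows that do not fit the schema
  .option("maxColumns", "100000")          // raise the parser's column limit
  .option("maxCharsPerColumn", "10000000") // raise the per-column character limit
  .csv("gs://my-bucket/input/*.csv")       // hypothetical input path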

spark-bigquery support in pyspark?

I'm trying to use the spark-bigquery connector from the pyspark shell with pyspark --package com.spotify:spark-bigquery_2.10:0.1.0. When I try to import the main package with import com.spotify.spark.bigquery I get an ImportError. Is there no way to use this connector in Python?

The Application Default Credentials are not available.

Hi Folks

I am trying to use your connector but keep getting this error

The Application Default Credentials are not available. They are available if running in Google Compute Engine. Otherwise, the environment variable GOOGLE_APPLICATION_CREDENTIALS must be defined pointing to a file defining the credentials. See https://developers.google.com/accounts/docs/application-default-credentials for more information.

All I am trying to do is set the path of the GCP JSON key file:

import com.spotify.spark.bigquery._

// Set up GCP credentials
sqlContext.setGcpJsonKeyFile("/path/to/jsonfile")

Am I missing something completely obvious?

Error: java.io.IOException: Too many tables and views for query: Max: 1000

Hi all,

I have been getting the error below since 4 May 2017.
It seems that inserting data from Spark into a BigQuery temporary table failed.
I guess the error was caused by some change on the BigQuery side.

I will report the error on the BigQuery issue tracker later as well.

My environment is as follows:

  • Spark: 2.0.2
  • spark-bigquery: 0.2.0
  • Google Dataproc: 1.1

Best,

Error Message

17/05/04 13:00:30 INFO com.spotify.spark.bigquery.BigQueryClient: Destination table: {datasetId=XXXXXXXXXXX, projectId=XXXXXXXX, tableId=spark_bigquery_20170504130030_1067308100}
Exception in thread "main" java.util.concurrent.ExecutionException: java.io.IOException: Too many tables and views for query: Max: 1000
	at com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:289)
	at com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:276)
	at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:111)
	at com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:132)
	at com.google.common.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2381)
	at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2351)
	at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2313)
	at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2228)
	at com.google.common.cache.LocalCache.get(LocalCache.java:3965)
	at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3969)
	at com.google.common.cache.LocalCache$LocalManualCache.get(LocalCache.java:4829)
	at com.spotify.spark.bigquery.BigQueryClient.query(BigQueryClient.scala:105)
	at com.spotify.spark.bigquery.package$BigQuerySQLContext.bigQuerySelect(package.scala:93)
	at com.mercari.spark.sql.SparkBigQueryHelper.selectBigQueryTable(SparkBigQueryHelper.scala:110)
	at com.mercari.spark.batch.UserProfilesTableCreator.fetchUUID(UserProfilesTableCreator.scala:148)
	at com.mercari.spark.batch.UserProfilesTableCreator.fetch(UserProfilesTableCreator.scala:45)
	at com.mercari.spark.batch.UserProfilesTableCreator.run(UserProfilesTableCreator.scala:24)
	at com.mercari.spark.batch.AbstractBatch.runWithRetry(AbstractBatch.scala:28)
	at com.mercari.spark.batch.AbstractBatch.runWithRetry(AbstractBatch.scala:34)
	at com.mercari.spark.batch.AbstractBatch.runWithRetry(AbstractBatch.scala:34)
	at com.mercari.spark.batch.AbstractBatch.runWithRetry(AbstractBatch.scala:34)
	at com.mercari.spark.batch.AbstractBatch.runWithRetry(AbstractBatch.scala:34)
	at com.mercari.spark.batch.AbstractBatch.runWithRetry(AbstractBatch.scala:34)
	at com.mercari.spark.batch.AbstractBatch.runWithRetry(AbstractBatch.scala:34)
	at com.mercari.spark.batch.AbstractBatch.runWithRetry(AbstractBatch.scala:34)
	at com.mercari.spark.batch.AbstractBatch.runWithRetry(AbstractBatch.scala:34)
	at com.mercari.spark.batch.AbstractBatch.runWithRetry(AbstractBatch.scala:34)
	at com.mercari.spark.batch.AbstractBatch.runWithRetry(AbstractBatch.scala:34)
	at com.mercari.spark.batch.AbstractBatch.runWithRetry(AbstractBatch.scala:34)
	at com.mercari.spark.batch.AbstractBatch.runWithRetry(AbstractBatch.scala:34)
	at com.mercari.spark.batch.AbstractBatch.runWithRetry(AbstractBatch.scala:34)
	at com.mercari.spark.batch.AbstractBatch.runWithRetry(AbstractBatch.scala:34)
	at com.mercari.spark.batch.AbstractBatch.runWithRetry(AbstractBatch.scala:34)
	at com.mercari.spark.batch.AbstractBatch.runWithRetry(AbstractBatch.scala:34)
	at com.mercari.spark.batch.AbstractBatch.runWithRetry(AbstractBatch.scala:34)
	at com.mercari.spark.batch.AbstractBatch.runWithRetry(AbstractBatch.scala:34)
	at com.mercari.spark.batch.AbstractBatch.runWithRetry(AbstractBatch.scala:34)
	at com.mercari.spark.batch.AbstractBatch.runWithRetry(AbstractBatch.scala:34)
	at com.mercari.spark.batch.AbstractBatch.runWithRetry(AbstractBatch.scala:25)
	at com.mercari.spark.batch.UserProfilesTableCreator$.main(UserProfilesTableCreator.scala:239)
	at com.mercari.spark.batch.UserProfilesTableCreator.main(UserProfilesTableCreator.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.IOException: Too many tables and views for query: Max: 1000
	at com.google.cloud.hadoop.io.bigquery.BigQueryUtils.waitForJobCompletion(BigQueryUtils.java:95)
	at com.spotify.spark.bigquery.BigQueryClient.com$spotify$spark$bigquery$BigQueryClient$$waitForJob(BigQueryClient.scala:134)
	at com.spotify.spark.bigquery.BigQueryClient$$anon$1.load(BigQueryClient.scala:90)
	at com.spotify.spark.bigquery.BigQueryClient$$anon$1.load(BigQueryClient.scala:79)
	at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3568)
	at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2350)
	... 44 more
17/05/04 13:00:31 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
17/05/04 13:00:31 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.

Unable to get a sqlContext

I don't have a sqlContext when I fire up the spark shell as instructed. I also encounter the following error; is this a permissions/firewall issue?

16/10/24 14:53:31 WARN org.apache.hadoop.hdfs.DFSClient: Caught exception 
java.lang.InterruptedException
        at java.lang.Object.wait(Native Method)
        at java.lang.Thread.join(Thread.java:1245)
        at java.lang.Thread.join(Thread.java:1319)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:609)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutputStream.java:370)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:546)

(I get 5 of these on startup)

Thanks,
Mike S
