nvidia / spark-rapids-examples

A repo for Apache Spark examples using the RAPIDS Accelerator, including ETL, ML/DL, and more.

License: Apache License 2.0

Languages: Jupyter Notebook 97.68%, Python 1.58%, Shell 0.38%, Dockerfile 0.36%

spark-rapids-examples's Introduction

This is the examples repo for the RAPIDS Accelerator for Apache Spark. The RAPIDS Accelerator for Apache Spark accelerates Spark applications with no code changes. You can download the latest version of the RAPIDS Accelerator here. This repo contains examples and applications that showcase the performance and benefits of using the RAPIDS Accelerator in data processing and machine learning pipelines. There are broadly five categories of examples in this repo:

  1. SQL/Dataframe
  2. Spark XGBoost
  3. Deep Learning/Machine Learning
  4. RAPIDS UDF
  5. Databricks Tools demo notebooks

For more information on each example, please look into the respective category.
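Because the Accelerator requires no code changes, enabling it is purely a matter of Spark configuration. Below is a minimal sketch (assuming the rapids-4-spark jar is already on the classpath and a GPU is available to each executor; the app name is illustrative):

import org.apache.spark.sql.SparkSession

// Minimal sketch: enabling the Accelerator is configuration-only.
// Assumes the rapids-4-spark jar is on the classpath; no application code changes.
val spark = SparkSession.builder()
  .appName("rapids-accelerated-app")
  .config("spark.plugins", "com.nvidia.spark.SQLPlugin") // load the RAPIDS plugin
  .config("spark.rapids.sql.enabled", "true")            // enable GPU SQL execution
  .getOrCreate()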

Here is the list of notebooks in this repo:

| # | Category | Notebook Name | Description |
|---|----------|---------------|-------------|
| 1 | SQL/DF | Microbenchmark | Spark SQL operations such as expand, hash aggregate, windowing, and cross joins, with up to 20x performance benefits |
| 2 | SQL/DF | Customer Churn | Data federation for modeling customer churn with sample telco customer data |
| 3 | XGBoost | Agaricus (Scala) | Uses the XGBoost classifier to build a model that accurately differentiates edible from poisonous mushrooms with the agaricus dataset |
| 4 | XGBoost | Mortgage (Scala) | End-to-end ETL + XGBoost example to predict mortgage default with the Fannie Mae Single-Family Loan Performance data |
| 5 | XGBoost | Taxi (Scala) | End-to-end ETL + XGBoost example to predict taxi trip fare amounts with the NYC taxi trips dataset |
| 6 | ML/DL | Criteo Training | ETL and deep learning training of the Criteo 1TB Click Logs dataset |
| 7 | ML/DL | PCA | End-to-end Spark MLlib-based PCA example to train and transform with a synthetic dataset |
| 8 | UDF | cuSpatial - Point in Polygon | Spark cuSpatial example for the point-in-polygon function using the NYC Taxi pickup location dataset |

Here is the list of Apache Spark applications (Scala and PySpark) in this repo that can be built to run on GPUs with the RAPIDS Accelerator:

| # | Category | Application Name | Description |
|---|----------|------------------|-------------|
| 1 | XGBoost | Agaricus (Scala) | Uses the XGBoost classifier to build a model that accurately differentiates edible from poisonous mushrooms with the agaricus dataset |
| 2 | XGBoost | Mortgage (Scala) | End-to-end ETL + XGBoost example to predict mortgage default with the Fannie Mae Single-Family Loan Performance data |
| 3 | XGBoost | Taxi (Scala) | End-to-end ETL + XGBoost example to predict taxi trip fare amounts with the NYC taxi trips dataset |
| 4 | ML/DL | PCA | End-to-end Spark MLlib-based PCA example to train and transform with a synthetic dataset |
| 5 | UDF | cuSpatial - Point in Polygon | Spark cuSpatial example for the point-in-polygon function using the NYC Taxi pickup location dataset |
| 6 | UDF | URL Decode | Decodes URL-encoded strings using the Java APIs of RAPIDS cudf |
| 7 | UDF | URL Encode | URL-encodes strings using the Java APIs of RAPIDS cudf |
| 8 | UDF | CosineSimilarity | Computes the cosine similarity between two float vectors using native code |
| 9 | UDF | StringWordCount | Implements a Hive simple UDF using native code to count words in strings |
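As a flavor of what the XGBoost applications above share, here is a minimal sketch of the GPU training pattern they use (parameter values mirror the repro further down this page; the column names "label", "f1", and "f2" are placeholders):

import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier

// Sketch of the GPU pattern used by these examples: pass feature column names
// directly via setFeaturesCols (a GPU-specific plural API) instead of
// assembling them into a single vector column.
val classifier = new XGBoostClassifier(Map(
    "tree_method" -> "gpu_hist",          // run training on the GPU
    "objective"   -> "binary:logistic",
    "num_round"   -> 100))
  .setLabelCol("label")
  .setFeaturesCols(Array("f1", "f2"))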

spark-rapids-examples's People

Contributors

eordentlich, firestarman, garyshen2008, gerashegalov, jlowe, leewyang, mattahrens, nvauto, nvliyuan, nvtimliu, parthosa, pxli, res-life, rongou, sauravdev, surajaralihalli, tgravescs, wbo4958, wjxiz1992, yanxuanliu


spark-rapids-examples's Issues

Xgboost training fails if input dataframe has vector type

If the input dataframe contains a vector-typed column, XGBoost training fails with the error below:

22/02/16 14:36:29 ERROR GpuXGBoostSpark: The job was aborted due to
java.lang.reflect.InvocationTargetException
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at ml.dmlc.xgboost4j.scala.spark.rapids.GpuUtils$.toColumnarRdd(GpuUtils.scala:49)
	at ml.dmlc.xgboost4j.scala.spark.rapids.GpuXGBoost$.trainOnGpuInternal(GpuXGBoost.scala:240)
	at ml.dmlc.xgboost4j.scala.spark.rapids.GpuXGBoost$.trainDistributedOnGpu(GpuXGBoost.scala:186)
	at ml.dmlc.xgboost4j.scala.spark.rapids.GpuXGBoost$.trainOnGpu(GpuXGBoost.scala:91)
	at ml.dmlc.xgboost4j.scala.spark.rapids.GpuXGBoost$.fitOnGpu(GpuXGBoost.scala:52)
	at ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier.fit(XGBoostClassifier.scala:170)
	at $line43.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:51)
	at $line43.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:56)
	at $line43.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:58)
	at $line43.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:60)
	at $line43.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:62)
	at $line43.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:64)
	at $line43.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:66)
	at $line43.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:68)
	at $line43.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:70)
	at $line43.$read$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:72)
	at $line43.$read$$iw$$iw$$iw$$iw$$iw.<init>(<console>:74)
	at $line43.$read$$iw$$iw$$iw$$iw.<init>(<console>:76)
	at $line43.$read$$iw$$iw$$iw.<init>(<console>:78)
	at $line43.$read$$iw$$iw.<init>(<console>:80)
	at $line43.$read$$iw.<init>(<console>:82)
	at $line43.$read.<init>(<console>:84)
	at $line43.$read$.<init>(<console>:88)
	at $line43.$read$.<clinit>(<console>)
	at $line43.$eval$.$print$lzycompute(<console>:7)
	at $line43.$eval$.$print(<console>:6)
	at $line43.$eval.$print(<console>)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:745)
	at scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1021)
	at scala.tools.nsc.interpreter.IMain.$anonfun$interpret$1(IMain.scala:574)
	at scala.reflect.internal.util.ScalaClassLoader.asContext(ScalaClassLoader.scala:41)
	at scala.reflect.internal.util.ScalaClassLoader.asContext$(ScalaClassLoader.scala:37)
	at scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scala:41)
	at scala.tools.nsc.interpreter.IMain.loadAndRunReq$1(IMain.scala:573)
	at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:600)
	at sun.reflect.GeneratedMethodAccessor42.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.reflect.InvocationTargetException
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at com.nvidia.spark.rapids.ColumnarRdd$.convert(ColumnarRdd.scala:52)
	at com.nvidia.spark.rapids.ColumnarRdd.convert(ColumnarRdd.scala)
	... 53 more
Caused by: java.lang.IllegalArgumentException: Cannot convert [label: float, feature: float ... 1 more field] to GPU columnar Set(org.apache.spark.mllib.linalg.VectorUDT@f71b0bce) are not currently supported data types for columnar.
	at org.apache.spark.sql.rapids.execution.InternalColumnarRddConverter$.extractRDDColumnarBatch(InternalColumnarRddConverter.scala:665)
	at org.apache.spark.sql.rapids.execution.InternalColumnarRddConverter$.convert(InternalColumnarRddConverter.scala:718)
	at org.apache.spark.sql.rapids.execution.InternalColumnarRddConverter.convert(InternalColumnarRddConverter.scala)
	... 59 more

Below is a minimal reproduction notebook in Scala:

import org.apache.spark.sql.SparkSession
sc.stop()

// Build the spark session and data reader as usual
val spark = SparkSession.builder.appName("xgboost_vector_test").getOrCreate
val reader = spark.read

import ml.dmlc.xgboost4j.scala.spark.{XGBoostClassifier, XGBoostClassificationModel}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.sql.types.{FloatType, IntegerType, StructField, StructType}
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row

val trainPath = "/home/xxx/data/xgboost_vector_test"

// with Vector
val rows = spark.sparkContext.parallelize(
  List(
    Row(0.0, 1.2, org.apache.spark.mllib.linalg.Vectors.dense(1.0, 2.0))
  )
)

val schema = List(
  StructField("label", DoubleType, true),
  StructField("feature", DoubleType, true),
  StructField("a_vector", new org.apache.spark.mllib.linalg.VectorUDT, true)
)

val df = spark.createDataFrame(
  rows,
  StructType(schema)
)

df.show()
df.printSchema
df.write.format("parquet").mode("overwrite").save(trainPath)

val trainSet = reader.parquet(trainPath)
trainSet.printSchema

val labelColName = "label"
val featureNames = Array("feature")

val commParamMap = Map(
  "eta" -> 0.1,
  "gamma" -> 0.1,
  "missing" -> 0.0,
  "max_depth" -> 10,
  "max_leaves" -> 256,
  "objective" -> "binary:logistic",
  "grow_policy" -> "depthwise",
  "min_child_weight" -> 30,
  "lambda" -> 1,
  "scale_pos_weight" -> 2,
  "subsample" -> 1,
  "nthread" -> 1,
  "num_round" -> 100)

val xgbParamFinal = commParamMap ++ Map("tree_method" -> "gpu_hist", "num_workers" -> 1)

val xgbClassifier = new XGBoostClassifier(xgbParamFinal)
      .setLabelCol(labelColName)
      // === diff ===
      .setFeaturesCols(featureNames)

xgbClassifier.fit(trainSet)

Test env:
Standalone Spark cluster
Spark 3.1.1
22.02 snapshot rapids-spark and cudf jars
xgboost4j_3.0-1.4.2-0.2.0.jar
xgboost4j-spark_3.0-1.4.2-0.2.0.jar

This is a customer-blocking issue.
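A possible interim workaround (my assumption from the error message, not a confirmed fix):

// Hypothetical workaround: the failure comes from ColumnarRdd refusing to
// convert VectorUDT columns to GPU columnar data, so dropping the vector
// column before fit avoids the conversion ("a_vector" is from the repro above).
val trainSetNoVector = trainSet.drop("a_vector")
xgbClassifier.fit(trainSetNoVector)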

CrossValidation fails with "Check failed: n_uniques == world (1 vs. 2) : Multiple processes within communication group running on same CUDA device is not supported"


Env:
Databricks 9.1 ML GPU
2-node cluster
22.02 GA jars
xgboost4j_3.0-1.4.2-0.2.0.jar
xgboost4j-spark_3.0-1.4.2-0.2.0.jar

Sample code:

import time
import os
from pyspark import broadcast
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.sql.window import Window

from ml.dmlc.xgboost4j.scala.spark.rapids import CrossValidator
from pyspark.ml.tuning import ParamGridBuilder
from pyspark.ml.evaluation import BinaryClassificationEvaluator

from ml.dmlc.xgboost4j.scala.spark import XGBoostClassificationModel, XGBoostClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
reader = spark.read

trainPath = "/xxx/mortgage_train/"

label = "delinquency_12"
schema = StructType([
    StructField("orig_channel", FloatType()),
    StructField("first_home_buyer", FloatType()),
    StructField("loan_purpose", FloatType()),
    StructField("property_type", FloatType()),
    StructField("occupancy_status", FloatType()),
    StructField("property_state", FloatType()),
    StructField("product_type", FloatType()),
    StructField("relocation_mortgage_indicator", FloatType()),
    StructField("seller_name", FloatType()),
    StructField("mod_flag", FloatType()),
    StructField("orig_interest_rate", FloatType()),
    StructField("orig_upb", IntegerType()),
    StructField("orig_loan_term", IntegerType()),
    StructField("orig_ltv", FloatType()),
    StructField("orig_cltv", FloatType()),
    StructField("num_borrowers", FloatType()),
    StructField("dti", FloatType()),
    StructField("borrower_credit_score", FloatType()),
    StructField("num_units", IntegerType()),
    StructField("zip", IntegerType()),
    StructField("mortgage_insurance_percent", FloatType()),
    StructField("current_loan_delinquency_status", IntegerType()),
    StructField("current_actual_upb", FloatType()),
    StructField("interest_rate", FloatType()),
    StructField("loan_age", FloatType()),
    StructField("msa", FloatType()),
    StructField("non_interest_bearing_upb", FloatType()),
    StructField(label, IntegerType()),
])
features = [ x.name for x in schema if x.name != label ]

# load dataset from file
train_data = reader.parquet(trainPath)

classifier = XGBoostClassifier(**params).setLabelCol(label).setFeaturesCols(features)
evaluator = BinaryClassificationEvaluator(labelCol=label, metricName='areaUnderROC')
param_grid = (ParamGridBuilder()
    .addGrid(classifier.maxDepth, [x, x])
    .addGrid(classifier.numRound, [x, x])
    .addGrid(classifier.eta, [x.xx, x.xx, x.xx, x.x])
    .addGrid(classifier.gamma, [x.xx, x.xx, x.xx, x.x])
    .addGrid(classifier.subsample, [x.xx, x.xx, x.xx, x.x])
    .build())

cross_validator = (CrossValidator()
    .setEstimator(classifier)
    .setEvaluator(evaluator)
    .setEstimatorParamMaps(param_grid)
    .setNumFolds(x))

model = cross_validator.fit(train_data).bestModel

The executor log shows the error below, and the executor crashed:

[00:57:37] task 0 got new rank 1
22/03/19 00:57:37 ERROR GpuXGBoostSpark: XGBooster worker 1 has failed 0 times due to
ml.dmlc.xgboost4j.java.XGBoostError: [00:57:37] /home/jenkins/agent/workspace/xgboost-release@2/src/common/device_helpers.cu:64: Check failed: n_uniques == world (1 vs. 2) : Multiple processes within communication group running on same CUDA device is not supported
Stack trace:
  [bt] (0) /local_disk0/tmp/libxgboost4j4349310341251842705.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x6e) [0x7f46931d221e]
  [bt] (1) /local_disk0/tmp/libxgboost4j4349310341251842705.so(dh::AllReducer::Init(int)+0x365) [0x7f46934c1285]
  [bt] (2) /local_disk0/tmp/libxgboost4j4349310341251842705.so(xgboost::common::SketchContainer::AllReduce()+0x781) [0x7f469351a791]
  [bt] (3) /local_disk0/tmp/libxgboost4j4349310341251842705.so(xgboost::common::SketchContainer::MakeCuts(xgboost::common::HistogramCuts*)+0xa6) [0x7f469351b2e6]
  [bt] (4) /local_disk0/tmp/libxgboost4j4349310341251842705.so(xgboost::data::IterativeDeviceDMatrix::Initialize(void*, float, int)+0xb36) [0x7f4693548ab6]
  [bt] (5) /local_disk0/tmp/libxgboost4j4349310341251842705.so(xgboost::DMatrix* xgboost::DMatrix::Create<void*, void*, void (void*), int (void*)>(void*, void*, void (*)(void*), int (*)(void*), float, int, int)+0xb0) [0x7f46932bfa20]
  [bt] (6) /local_disk0/tmp/libxgboost4j4349310341251842705.so(XGDeviceQuantileDMatrixCreateFromCallback+0xd) [0x7f4693212c7d]
  [bt] (7) /local_disk0/tmp/libxgboost4j4349310341251842705.so(xgboost::spark::XGDeviceQuantileDMatrixCreateFromCallbackImpl(JNIEnv_*, _jclass*, _jobject*, float, int, int, _jlongArray*)+0x1d9) [0x7f46931f5bd9]
  [bt] (8) [0x7f4bd4018527]
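The check fails because two XGBoost workers landed on the same CUDA device. Below is a hedged configuration sketch (an assumption based on Spark's GPU resource-scheduling configs, not a verified fix for this cluster):

import org.apache.spark.sql.SparkSession

// Sketch: give each task a whole GPU so no two XGBoost workers share a device.
// num_workers should then not exceed the number of GPUs in the cluster.
val spark = SparkSession.builder()
  .config("spark.executor.resource.gpu.amount", "1") // one GPU per executor
  .config("spark.task.resource.gpu.amount", "1")     // one task per GPU
  .getOrCreate()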

update cudf api from legacy to stable

We should revert the cudf API links from legacy to stable. When the Java API of cudf is updated, the hyperlinks will break for a few days; this is normal.

undefined symbol issue in cuSpatial benchmark built by docker

Got some undefined symbols in the dynamic library libspatialudfjni.so:

lab@dgxstation-s7:~/johnny/sparkRapidsTest/logs/cuspatial/amd64/Linux$ ldd -r libspatialudfjni.so
        linux-vdso.so.1 (0x00007fffb016a000)
        libcudf.so => not found
        libcuspatial.so => not found
        libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f0dc7b2f000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f0dc7917000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f0dc7526000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f0dc7188000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f0dc828d000)
undefined symbol: cudaGetDevice (./libspatialudfjni.so)
undefined symbol: cudaGetErrorName      (./libspatialudfjni.so)

Cannot run CPU based version of Rapids XGBoost examples of Taxi notebooks

Describe the bug
If you follow the comments in the GPU examples of the NY Taxi notebooks (Scala, Python) from https://github.com/NVIDIA/spark-rapids-examples/tree/branch-21.08/examples/taxi/notebooks, the notebooks always fail.
For a GPU/CPU comparison, we need a working version of the notebooks.

Steps/Code to reproduce bug
Take a notebook from https://github.com/NVIDIA/spark-rapids-examples/tree/branch-21.08/examples/taxi/notebooks and substitute the GPU code with the CPU code (from the comments). The code fails on the fit method.

Expected behavior
The CPU version of the notebooks should run.

Environment details (please complete the following information)
Synapse notebooks; should run in a notebook environment.

Nvtabular jars update

Describe the bug
The NVTabular jar needs to change from v21.06 to v21.10 when spark-rapids v21.10 is released.

MortgageETL+XGBoost.ipynb notebook fail running on CPU

We hit an exception because of a schema mismatch:

Tracker started, with env={DMLC_NUM_SERVER=0, DMLC_TRACKER_URI=XXXX, DMLC_TRACKER_PORT=XXXX, DMLC_NUM_WORKER=1024}
XXXARN TaskSetManager: Lost task 92.0 in stage 62.0 (TID 10523) (10.XX executor 254): org.apache.spark.sql.execution.QueryExecutionException: Parquet column cannot be converted in file hdfs://XXX:9000/datXXXXn/part-0XXXX711607e3ba72-c000.snappy.parquet. Column: [loan_age], Expected: float, Found: DOUBLE
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:179)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
	at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:503)
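A possible fix sketch (my reading of the error, not a change validated against the notebook; df stands for the dataframe being read):

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.FloatType

// Hypothetical sketch: the Parquet files store loan_age as DOUBLE while the
// schema expects float, so cast it explicitly before the CPU run.
val fixed = df.withColumn("loan_age", col("loan_age").cast(FloatType))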

Update 22.06 examples to not rely on cudfjni

Describe the bug
Most example code is still based on 22.04, which requires cudfjni as a dependency, e.g.:
https://github.com/NVIDIA/spark-rapids-examples/blob/branch-22.06/examples/UDF-Examples/RAPIDS-accelerated-UDFs/pom.xml#L56-L61
https://github.com/NVIDIA/spark-rapids-examples/blob/branch-22.06/examples/UDF-Examples/Spark-cuSpatial/pom.xml#L43-L47

We need to walk through all pom files to check whether we can exclude the cudfjni dependency, since the 22.06 plugin is self-contained.

Update preparing_datasets.md

Since we don't need examples/taxi/notebooks/python/Taxi_ETL.ipynb anymore, we should update preparing_datasets.md.

When reading perf/acq csv files, the reader should not use "option("header", true)"


https://github.com/NVIDIA/spark-rapids-examples/blob/branch-22.06/examples/XGBoost-Examples/mortgage/notebooks/scala/mortgage-ETL.ipynb is one example.

val reader = sparkSession.read.option("header", true).schema(performanceSchema)

val optionsMap = Map("header" -> "true")

The reason is that the underlying CSV files you download do not have a header row. If you leave the header option set to true, you end up reading one row less.

We need to set it to false everywhere.
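For example, the corrected versions of the snippets above would read:

// The perf/acq CSV files have no header row, so disable header parsing;
// otherwise the first data row is silently consumed as a header.
val reader = sparkSession.read.option("header", false).schema(performanceSchema)

val optionsMap = Map("header" -> "false")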

Illegal Argument Exception: features does not exist...

Describe the bug
When trying to train an XGBoost classifier on GPUs, it produces the following error:

IllegalArgumentException: features does not exist

Steps/Code to reproduce bug
Calling the fit method as follows:

val xgbClassifier = new XGBoostClassifier(paramMap)
  .setLabelCol(labelName)
  .setFeaturesCols(featureCols)
xgbClassifier.fit(trainDF)

Expected behavior
I expected the model to train successfully when running on GPUs.

Environment details (please complete the following information)

Running a Spark job on GCP Dataproc (YARN) with an NVIDIA Tesla T4 GPU.

The following JARs are on the /usr/lib/spark/jars/ classpath:

Rapids-4-Spark: rapids-4-spark_2.12-21.08.0.jar
XGBoost4J: xgboost4j_3.0-1.4.2-0.1.0.jar
XGBoost4J-Spark: xgboost4j-spark_3.0-1.4.2-0.1.0.jar
cuDF: cudf-21.08.2-cuda11.jar
Using the following Dataproc initialization actions to install the GPU driver and the RAPIDS Accelerator:

goog-dataproc-initialization-actions-us-central1/gpu/install_gpu_driver.sh
goog-dataproc-initialization-actions-us-central1/rapids/rapids.sh

Using the following Spark parameter configurations:
"spark.executor.resource.gpu.amount": "1"
"spark.task.resource.gpu.amount": "1"
"spark.rapids.sql.explain": "ALL"
"spark.rapids.sql.concurrentGpuTasks": "2"
"spark.rapids.memory.pinnedPool.size": "2G"
"spark.executor.extraJavaOptions": "-Dai.rapids.cudf.prefer-pinned=true"
"spark.locality.wait": "0s"
"spark.plugins": "com.nvidia.spark.SQLPlugin"
"spark.rapids.sql.hasNans": "false"
"spark.rapids.sql.batchSizeBytes": "512M"
"spark.rapids.sql.reader.batchSizeBytes": "768M"
"spark.rapids.sql.variableFloatAgg.enabled": "true"
"spark.rapids.sql.decimalType.enabled": "true"
"spark.rapids.memory.gpu.pooling.enabled": "false"
"spark.executor.resource.gpu.discoveryScript": "/usr/lib/spark/scripts/gpu/getGpusResources.sh"

Failed to build docker image of examples/Spark-cuML/pca/Dockerfile

When building examples/Spark-cuML/pca/Dockerfile, I got the error below:

PackagesNotFoundError: The following packages are not available from current channels:

  • cudf=21.12

Current channels:

To search for alternate channels that may provide the conda package you're
looking for, navigate to

https://anaconda.org

and use the search bar at the top of the page.

ERROR conda.cli.main_run:execute(33): Subprocess for 'conda run ['/bin/bash', '-c', 'conda install -c rapidsai-nightly -c nvidia -c conda-forge cudf=21.12 python=3.8 cudatoolkit=11.2 -y']' command failed. (See above for error)

Need to declare jdk8 is recommended while setting up the spark cluster

For plugin versions after v21.08, running the Scala notebooks fails with a no-such-method exception if the Spark cluster is set up with JDK 11:

Unable to create executor due to No such method: addURL() on object: jdk.internal.loader.ClassLoaders$AppClassLoader

So we need to note that JDK 8 is recommended.
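A quick way to verify the cluster's JDK before digging further (a hypothetical check, not from the original report):

// Sanity-check sketch: print the JVM version on the driver and an executor.
// The addURL failure above is typical of JDK 9+ class loaders, so expect 1.8.0_xxx.
println(System.getProperty("java.version"))
spark.range(1).foreach(_ => println(System.getProperty("java.version"))) // appears in executor logs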

sample_xgboost_apps-0.2.2-jar-with-dependencies.jar doesn't have cuda11.2 with libxgboost4j.so

Describe the bug
The XGBoost Spark training job fails on CUDA 11.2 because it cannot find libxgboost4j.so.

Steps/Code to reproduce bug
If you follow https://github.com/NVIDIA/spark-xgboost-examples/blob/spark-3/getting-started-guides/on-prem-cluster/standalone-scala.md
and create sample_xgboost_apps-0.2.2-jar-with-dependencies.jar (from https://github.com/NVIDIA/spark-rapids-examples/blob/branch-21.08/docs/get-started/xgboost-examples/building-sample-apps/scala.md), the jar does not contain the path /lib/cuda11.2/, so a Spark job on a CUDA 11.2 server cannot find libxgboost4j.so. It only contains /lib/cuda11.

Environment details (please complete the following information)
Spark standalone cluster with CUDA 11.2

update databricks init script

If there are already XGBoost jars in DBFS, we will run into issues; we need to remove them from the init script.
