nvidia / spark-rapids-examples

A repo for Apache Spark examples using the RAPIDS Accelerator, including ETL, ML/DL, and more.

License: Apache License 2.0

Languages: Jupyter Notebook 97.68%, Python 1.58%, Shell 0.38%, Dockerfile 0.36%

spark-rapids-examples's Introduction

This is the examples repo for the RAPIDS Accelerator for Apache Spark. The RAPIDS Accelerator for Apache Spark accelerates Spark applications with no code changes. You can download the latest version of the RAPIDS Accelerator here. This repo contains examples and applications that showcase the performance and benefits of using the RAPIDS Accelerator in data processing and machine learning pipelines. There are broadly five categories of examples in this repo:

  1. SQL/Dataframe
  2. Spark XGBoost
  3. Deep Learning/Machine Learning
  4. RAPIDS UDF
  5. Databricks Tools demo notebooks

For more information on each example, please look into the respective category.
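Because the Accelerator requires no code changes, enabling it is purely a matter of Spark configuration. Below is a minimal sketch (assuming the rapids-4-spark jar is already on the classpath and a GPU is available to each executor; the app name is illustrative):

import org.apache.spark.sql.SparkSession

// Minimal sketch: enabling the Accelerator is configuration-only.
// Assumes the rapids-4-spark jar is on the classpath; no application code changes.
val spark = SparkSession.builder()
  .appName("rapids-accelerated-app")
  .config("spark.plugins", "com.nvidia.spark.SQLPlugin") // load the RAPIDS plugin
  .config("spark.rapids.sql.enabled", "true")            // enable GPU SQL execution
  .getOrCreate()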

Here is the list of notebooks in this repo:

| # | Category | Notebook Name | Description |
|---|----------|---------------|-------------|
| 1 | SQL/DF | Microbenchmark | Spark SQL operations such as expand, hash aggregate, windowing, and cross joins, with up to 20x performance benefits |
| 2 | SQL/DF | Customer Churn | Data federation for modeling customer churn with sample telco customer data |
| 3 | XGBoost | Agaricus (Scala) | Uses the XGBoost classifier to build a model that accurately differentiates edible from poisonous mushrooms with the agaricus dataset |
| 4 | XGBoost | Mortgage (Scala) | End-to-end ETL + XGBoost example to predict mortgage default with the Fannie Mae Single-Family Loan Performance data |
| 5 | XGBoost | Taxi (Scala) | End-to-end ETL + XGBoost example to predict taxi trip fare amounts with the NYC taxi trips dataset |
| 6 | ML/DL | Criteo Training | ETL and deep learning training of the Criteo 1TB Click Logs dataset |
| 7 | ML/DL | PCA | End-to-end Spark MLlib-based PCA example to train and transform with a synthetic dataset |
| 8 | UDF | cuSpatial - Point in Polygon | Spark cuSpatial example for the point-in-polygon function using the NYC Taxi pickup location dataset |

Here is the list of Apache Spark applications (Scala and PySpark) in this repo that can be built to run on GPUs with the RAPIDS Accelerator:

| # | Category | Application Name | Description |
|---|----------|------------------|-------------|
| 1 | XGBoost | Agaricus (Scala) | Uses the XGBoost classifier to build a model that accurately differentiates edible from poisonous mushrooms with the agaricus dataset |
| 2 | XGBoost | Mortgage (Scala) | End-to-end ETL + XGBoost example to predict mortgage default with the Fannie Mae Single-Family Loan Performance data |
| 3 | XGBoost | Taxi (Scala) | End-to-end ETL + XGBoost example to predict taxi trip fare amounts with the NYC taxi trips dataset |
| 4 | ML/DL | PCA | End-to-end Spark MLlib-based PCA example to train and transform with a synthetic dataset |
| 5 | UDF | cuSpatial - Point in Polygon | Spark cuSpatial example for the point-in-polygon function using the NYC Taxi pickup location dataset |
| 6 | UDF | URL Decode | Decodes URL-encoded strings using the Java APIs of RAPIDS cudf |
| 7 | UDF | URL Encode | URL-encodes strings using the Java APIs of RAPIDS cudf |
| 8 | UDF | CosineSimilarity | Computes the cosine similarity between two float vectors using native code |
| 9 | UDF | StringWordCount | Implements a Hive simple UDF using native code to count words in strings |
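As a flavor of what the XGBoost applications above share, here is a minimal sketch of the GPU training pattern they use (parameter values mirror the repro further down this page; the column names "label", "f1", and "f2" are placeholders):

import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier

// Sketch of the GPU pattern used by these examples: pass feature column names
// directly via setFeaturesCols (a GPU-specific plural API) instead of
// assembling them into a single vector column.
val classifier = new XGBoostClassifier(Map(
    "tree_method" -> "gpu_hist",          // run training on the GPU
    "objective"   -> "binary:logistic",
    "num_round"   -> 100))
  .setLabelCol("label")
  .setFeaturesCols(Array("f1", "f2"))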

spark-rapids-examples's People

Contributors

eordentlich, firestarman, garyshen2008, gerashegalov, jlowe, leewyang, mattahrens, nvauto, nvliyuan, nvtimliu, parthosa, pxli, res-life, rongou, sauravdev, surajaralihalli, tgravescs, wbo4958, wjxiz1992, yanxuanliu


spark-rapids-examples's Issues

Xgboost training fails if input dataframe has vector type

If the input dataframe contains a vector-typed column, XGBoost training fails with the error below:

22/02/16 14:36:29 ERROR GpuXGBoostSpark: The job was aborted due to
java.lang.reflect.InvocationTargetException
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at ml.dmlc.xgboost4j.scala.spark.rapids.GpuUtils$.toColumnarRdd(GpuUtils.scala:49)
	at ml.dmlc.xgboost4j.scala.spark.rapids.GpuXGBoost$.trainOnGpuInternal(GpuXGBoost.scala:240)
	at ml.dmlc.xgboost4j.scala.spark.rapids.GpuXGBoost$.trainDistributedOnGpu(GpuXGBoost.scala:186)
	at ml.dmlc.xgboost4j.scala.spark.rapids.GpuXGBoost$.trainOnGpu(GpuXGBoost.scala:91)
	at ml.dmlc.xgboost4j.scala.spark.rapids.GpuXGBoost$.fitOnGpu(GpuXGBoost.scala:52)
	at ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier.fit(XGBoostClassifier.scala:170)
	at $line43.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:51)
	at $line43.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:56)
	at $line43.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:58)
	at $line43.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:60)
	at $line43.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:62)
	at $line43.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:64)
	at $line43.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:66)
	at $line43.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:68)
	at $line43.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:70)
	at $line43.$read$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:72)
	at $line43.$read$$iw$$iw$$iw$$iw$$iw.<init>(<console>:74)
	at $line43.$read$$iw$$iw$$iw$$iw.<init>(<console>:76)
	at $line43.$read$$iw$$iw$$iw.<init>(<console>:78)
	at $line43.$read$$iw$$iw.<init>(<console>:80)
	at $line43.$read$$iw.<init>(<console>:82)
	at $line43.$read.<init>(<console>:84)
	at $line43.$read$.<init>(<console>:88)
	at $line43.$read$.<clinit>(<console>)
	at $line43.$eval$.$print$lzycompute(<console>:7)
	at $line43.$eval$.$print(<console>:6)
	at $line43.$eval.$print(<console>)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:745)
	at scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1021)
	at scala.tools.nsc.interpreter.IMain.$anonfun$interpret$1(IMain.scala:574)
	at scala.reflect.internal.util.ScalaClassLoader.asContext(ScalaClassLoader.scala:41)
	at scala.reflect.internal.util.ScalaClassLoader.asContext$(ScalaClassLoader.scala:37)
	at scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scala:41)
	at scala.tools.nsc.interpreter.IMain.loadAndRunReq$1(IMain.scala:573)
	at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:600)
	at sun.reflect.GeneratedMethodAccessor42.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.reflect.InvocationTargetException
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at com.nvidia.spark.rapids.ColumnarRdd$.convert(ColumnarRdd.scala:52)
	at com.nvidia.spark.rapids.ColumnarRdd.convert(ColumnarRdd.scala)
	... 53 more
Caused by: java.lang.IllegalArgumentException: Cannot convert [label: float, feature: float ... 1 more field] to GPU columnar Set(org.apache.spark.mllib.linalg.VectorUDT@f71b0bce) are not currently supported data types for columnar.
	at org.apache.spark.sql.rapids.execution.InternalColumnarRddConverter$.extractRDDColumnarBatch(InternalColumnarRddConverter.scala:665)
	at org.apache.spark.sql.rapids.execution.InternalColumnarRddConverter$.convert(InternalColumnarRddConverter.scala:718)
	at org.apache.spark.sql.rapids.execution.InternalColumnarRddConverter.convert(InternalColumnarRddConverter.scala)
	... 59 more

Below is a minimal reproduction notebook in Scala:

import org.apache.spark.sql.SparkSession
sc.stop()

// Build the spark session and data reader as usual
val spark = SparkSession.builder.appName("xgboost_vector_test").getOrCreate
val reader = spark.read

import ml.dmlc.xgboost4j.scala.spark.{XGBoostClassifier, XGBoostClassificationModel}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.sql.types.{FloatType, IntegerType, StructField, StructType}
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row

val trainPath = "/home/xxx/data/xgboost_vector_test"

// with Vector
val rows = spark.sparkContext.parallelize(
  List(
    Row(0.0, 1.2, org.apache.spark.mllib.linalg.Vectors.dense(1.0, 2.0))
  )
)

val schema = List(
  StructField("label", DoubleType, true),
  StructField("feature", DoubleType, true),
  StructField("a_vector", new org.apache.spark.mllib.linalg.VectorUDT, true)
)

val df = spark.createDataFrame(
  rows,
  StructType(schema)
)

df.show()
df.printSchema
df.write.format("parquet").mode("overwrite").save(trainPath)

val trainSet = reader.parquet(trainPath)
trainSet.printSchema

val labelColName = "label"
val featureNames = Array("feature")

val commParamMap = Map(
  "eta" -> 0.1,
  "gamma" -> 0.1,
  "missing" -> 0.0,
  "max_depth" -> 10,
  "max_leaves" -> 256,
  "objective" -> "binary:logistic",
  "grow_policy" -> "depthwise",
  "min_child_weight" -> 30,
  "lambda" -> 1,
  "scale_pos_weight" -> 2,
  "subsample" -> 1,
  "nthread" -> 1,
  "num_round" -> 100)

val xgbParamFinal = commParamMap ++ Map("tree_method" -> "gpu_hist", "num_workers" -> 1)

val xgbClassifier = new XGBoostClassifier(xgbParamFinal)
      .setLabelCol(labelColName)
      // === diff ===
      .setFeaturesCols(featureNames)

xgbClassifier.fit(trainSet)

Test env:
Standalone Spark cluster
Spark 3.1.1
22.02 snapshot rapids-spark and cudf jars
xgboost4j_3.0-1.4.2-0.2.0.jar
xgboost4j-spark_3.0-1.4.2-0.2.0.jar

This is a customer-blocking issue.
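A possible interim workaround (my assumption from the error message, not a confirmed fix):

// Hypothetical workaround: the failure comes from ColumnarRdd refusing to
// convert VectorUDT columns to GPU columnar data, so dropping the vector
// column before fit avoids the conversion ("a_vector" is from the repro above).
val trainSetNoVector = trainSet.drop("a_vector")
xgbClassifier.fit(trainSetNoVector)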

CrossValidation fails with "Check failed: n_uniques == world (1 vs. 2) : Multiple processes within communication group running on same CUDA device is not supported"


Env:
Databricks 9.1 ML GPU
2-node cluster
22.02 GA jars
xgboost4j_3.0-1.4.2-0.2.0.jar
xgboost4j-spark_3.0-1.4.2-0.2.0.jar

Sample code:

import time
import os
from pyspark import broadcast
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.sql.window import Window

from ml.dmlc.xgboost4j.scala.spark.rapids import CrossValidator
from pyspark.ml.tuning import ParamGridBuilder
from pyspark.ml.evaluation import BinaryClassificationEvaluator

from ml.dmlc.xgboost4j.scala.spark import XGBoostClassificationModel, XGBoostClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
reader = spark.read

trainPath = "/xxx/mortgage_train/"

label = "delinquency_12"
schema = StructType([
    StructField("orig_channel", FloatType()),
    StructField("first_home_buyer", FloatType()),
    StructField("loan_purpose", FloatType()),
    StructField("property_type", FloatType()),
    StructField("occupancy_status", FloatType()),
    StructField("property_state", FloatType()),
    StructField("product_type", FloatType()),
    StructField("relocation_mortgage_indicator", FloatType()),
    StructField("seller_name", FloatType()),
    StructField("mod_flag", FloatType()),
    StructField("orig_interest_rate", FloatType()),
    StructField("orig_upb", IntegerType()),
    StructField("orig_loan_term", IntegerType()),
    StructField("orig_ltv", FloatType()),
    StructField("orig_cltv", FloatType()),
    StructField("num_borrowers", FloatType()),
    StructField("dti", FloatType()),
    StructField("borrower_credit_score", FloatType()),
    StructField("num_units", IntegerType()),
    StructField("zip", IntegerType()),
    StructField("mortgage_insurance_percent", FloatType()),
    StructField("current_loan_delinquency_status", IntegerType()),
    StructField("current_actual_upb", FloatType()),
    StructField("interest_rate", FloatType()),
    StructField("loan_age", FloatType()),
    StructField("msa", FloatType()),
    StructField("non_interest_bearing_upb", FloatType()),
    StructField(label, IntegerType()),
])
features = [ x.name for x in schema if x.name != label ]

# load dataset from file
train_data = reader.parquet(trainPath)

classifier = XGBoostClassifier(**params).setLabelCol(label).setFeaturesCols(features)
evaluator = BinaryClassificationEvaluator(labelCol=label, metricName='areaUnderROC')
param_grid = (ParamGridBuilder()
    .addGrid(classifier.maxDepth, [x, x])
    .addGrid(classifier.numRound, [x, x])
    .addGrid(classifier.eta, [x.xx, x.xx, x.xx, x.x])
    .addGrid(classifier.gamma, [x.xx, x.xx, x.xx, x.x])
    .addGrid(classifier.subsample, [x.xx, x.xx, x.xx, x.x])
    .build())

cross_validator = (CrossValidator()
    .setEstimator(classifier)
    .setEvaluator(evaluator)
    .setEstimatorParamMaps(param_grid)
    .setNumFolds(x))

model = cross_validator.fit(train_data).bestModel

The executor log shows the error below, and the executor crashed:

[00:57:37] task 0 got new rank 1
22/03/19 00:57:37 ERROR GpuXGBoostSpark: XGBooster worker 1 has failed 0 times due to
ml.dmlc.xgboost4j.java.XGBoostError: [00:57:37] /home/jenkins/agent/workspace/xgboost-release@2/src/common/device_helpers.cu:64: Check failed: n_uniques == world (1 vs. 2) : Multiple processes within communication group running on same CUDA device is not supported
Stack trace:
  [bt] (0) /local_disk0/tmp/libxgboost4j4349310341251842705.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x6e) [0x7f46931d221e]
  [bt] (1) /local_disk0/tmp/libxgboost4j4349310341251842705.so(dh::AllReducer::Init(int)+0x365) [0x7f46934c1285]
  [bt] (2) /local_disk0/tmp/libxgboost4j4349310341251842705.so(xgboost::common::SketchContainer::AllReduce()+0x781) [0x7f469351a791]
  [bt] (3) /local_disk0/tmp/libxgboost4j4349310341251842705.so(xgboost::common::SketchContainer::MakeCuts(xgboost::common::HistogramCuts*)+0xa6) [0x7f469351b2e6]
  [bt] (4) /local_disk0/tmp/libxgboost4j4349310341251842705.so(xgboost::data::IterativeDeviceDMatrix::Initialize(void*, float, int)+0xb36) [0x7f4693548ab6]
  [bt] (5) /local_disk0/tmp/libxgboost4j4349310341251842705.so(xgboost::DMatrix* xgboost::DMatrix::Create<void*, void*, void (void*), int (void*)>(void*, void*, void (*)(void*), int (*)(void*), float, int, int)+0xb0) [0x7f46932bfa20]
  [bt] (6) /local_disk0/tmp/libxgboost4j4349310341251842705.so(XGDeviceQuantileDMatrixCreateFromCallback+0xd) [0x7f4693212c7d]
  [bt] (7) /local_disk0/tmp/libxgboost4j4349310341251842705.so(xgboost::spark::XGDeviceQuantileDMatrixCreateFromCallbackImpl(JNIEnv_*, _jclass*, _jobject*, float, int, int, _jlongArray*)+0x1d9) [0x7f46931f5bd9]
  [bt] (8) [0x7f4bd4018527]
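The check fails because two XGBoost workers landed on the same CUDA device. Below is a hedged configuration sketch (an assumption based on Spark's GPU resource-scheduling configs, not a verified fix for this cluster):

import org.apache.spark.sql.SparkSession

// Sketch: give each task a whole GPU so no two XGBoost workers share a device.
// num_workers should then not exceed the number of GPUs in the cluster.
val spark = SparkSession.builder()
  .config("spark.executor.resource.gpu.amount", "1") // one GPU per executor
  .config("spark.task.resource.gpu.amount", "1")     // one task per GPU
  .getOrCreate()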

update cudf api from legacy to stable

We should revert the cudf API links from legacy to stable. When the Java API of cudf is updated, the hyperlinks will break for a few days; this is normal.

undefined symbol issue in cuSpatial benchmark built by docker

Got some undefined symbols in the dynamic library libspatialudfjni.so:

lab@dgxstation-s7:~/johnny/sparkRapidsTest/logs/cuspatial/amd64/Linux$ ldd -r libspatialudfjni.so
        linux-vdso.so.1 (0x00007fffb016a000)
        libcudf.so => not found
        libcuspatial.so => not found
        libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f0dc7b2f000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f0dc7917000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f0dc7526000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f0dc7188000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f0dc828d000)
undefined symbol: cudaGetDevice (./libspatialudfjni.so)
undefined symbol: cudaGetErrorName      (./libspatialudfjni.so)

Cannot run CPU based version of Rapids XGBoost examples of Taxi notebooks

Describe the bug
If you follow the comments in the GPU examples of the NY Taxi notebooks (Scala, Python) from https://github.com/NVIDIA/spark-rapids-examples/tree/branch-21.08/examples/taxi/notebooks, the notebooks always fail.
For a GPU/CPU comparison, we need a working version of the notebooks.

Steps/Code to reproduce bug
Take a notebook from https://github.com/NVIDIA/spark-rapids-examples/tree/branch-21.08/examples/taxi/notebooks and substitute the GPU code with the CPU code (from the comments). The code fails on the fit method.

Expected behavior
The CPU version of the notebooks should run.

Environment details (please complete the following information)
Synapse notebooks; should run in a notebook environment.

Nvtabular jars update

Describe the bug
The NVTabular jar needs to change from v21.06 to v21.10 when spark-rapids v21.10 is released.

MortgageETL+XGBoost.ipynb notebook fail running on CPU

We hit an exception because of a schema mismatch:

Tracker started, with env={DMLC_NUM_SERVER=0, DMLC_TRACKER_URI=XXXX, DMLC_TRACKER_PORT=XXXX, DMLC_NUM_WORKER=1024}
XXXARN TaskSetManager: Lost task 92.0 in stage 62.0 (TID 10523) (10.XX executor 254): org.apache.spark.sql.execution.QueryExecutionException: Parquet column cannot be converted in file hdfs://XXX:9000/datXXXXn/part-0XXXX711607e3ba72-c000.snappy.parquet. Column: [loan_age], Expected: float, Found: DOUBLE
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:179)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
	at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:503)
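A possible fix sketch (my reading of the error, not a change validated against the notebook; df stands for the dataframe being read):

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.FloatType

// Hypothetical sketch: the Parquet files store loan_age as DOUBLE while the
// schema expects float, so cast it explicitly before the CPU run.
val fixed = df.withColumn("loan_age", col("loan_age").cast(FloatType))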

Update 22.06 examples to not rely on cudfjni

Describe the bug
Most example code is still based on 22.04, which requires cudfjni as a dependency, e.g.:
https://github.com/NVIDIA/spark-rapids-examples/blob/branch-22.06/examples/UDF-Examples/RAPIDS-accelerated-UDFs/pom.xml#L56-L61
https://github.com/NVIDIA/spark-rapids-examples/blob/branch-22.06/examples/UDF-Examples/Spark-cuSpatial/pom.xml#L43-L47

We need to walk through all pom files to check whether we can exclude the cudfjni dependency, since the 22.06 plugin is self-contained.

Update preparing_datasets.md

Since we don't need examples/taxi/notebooks/python/Taxi_ETL.ipynb anymore, we should update preparing_datasets.md.

When reading perf/acq csv files, the reader should not use "option("header", true)"


https://github.com/NVIDIA/spark-rapids-examples/blob/branch-22.06/examples/XGBoost-Examples/mortgage/notebooks/scala/mortgage-ETL.ipynb is one example.

val reader = sparkSession.read.option("header", true).schema(performanceSchema)

val optionsMap = Map("header" -> "true")

The reason is that the underlying CSV files you download do not have a header row. If you leave the header option set to true, you end up reading one row less.

We need to set it to false everywhere.
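For example, the corrected versions of the snippets above would read:

// The perf/acq CSV files have no header row, so disable header parsing;
// otherwise the first data row is silently consumed as a header.
val reader = sparkSession.read.option("header", false).schema(performanceSchema)

val optionsMap = Map("header" -> "false")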

Illegal Argument Exception: features does not exist...

Describe the bug
When trying to train an XGBoost classifier on GPUs, it produces the following error:

IllegalArgumentException: features does not exist

Steps/Code to reproduce bug
Calling the fit method as follows:

val xgbClassifier = new XGBoostClassifier(paramMap)
  .setLabelCol(labelName)
  .setFeaturesCols(featureCols)
xgbClassifier.fit(trainDF)

Expected behavior
I expected the model to train successfully when running on GPUs.

Environment details (please complete the following information)

Running a Spark job on GCP Dataproc (YARN) with an NVIDIA Tesla T4 GPU.

The following JARs are on the /usr/lib/spark/jars/ classpath:

Rapids-4-Spark: rapids-4-spark_2.12-21.08.0.jar
XGBoost4J: xgboost4j_3.0-1.4.2-0.1.0.jar
XGBoost4J-Spark: xgboost4j-spark_3.0-1.4.2-0.1.0.jar
cuDF: cudf-21.08.2-cuda11.jar
Using the following Dataproc initialization actions to install the GPU driver and the RAPIDS Accelerator:

goog-dataproc-initialization-actions-us-central1/gpu/install_gpu_driver.sh
goog-dataproc-initialization-actions-us-central1/rapids/rapids.sh

Using the following Spark parameter configurations:
"spark.executor.resource.gpu.amount": "1"
"spark.task.resource.gpu.amount": "1"
"spark.rapids.sql.explain": "ALL"
"spark.rapids.sql.concurrentGpuTasks": "2"
"spark.rapids.memory.pinnedPool.size": "2G"
"spark.executor.extraJavaOptions": "-Dai.rapids.cudf.prefer-pinned=true"
"spark.locality.wait": "0s"
"spark.plugins": "com.nvidia.spark.SQLPlugin"
"spark.rapids.sql.hasNans": "false"
"spark.rapids.sql.batchSizeBytes": "512M"
"spark.rapids.sql.reader.batchSizeBytes": "768M"
"spark.rapids.sql.variableFloatAgg.enabled": "true"
"spark.rapids.sql.decimalType.enabled": "true"
"spark.rapids.memory.gpu.pooling.enabled": "false"
"spark.executor.resource.gpu.discoveryScript": "/usr/lib/spark/scripts/gpu/getGpusResources.sh"

Failed to build docker image of examples/Spark-cuML/pca/Dockerfile

When building examples/Spark-cuML/pca/Dockerfile, I got the error below:

PackagesNotFoundError: The following packages are not available from current channels:

  • cudf=21.12

Current channels:

To search for alternate channels that may provide the conda package you're
looking for, navigate to

https://anaconda.org

and use the search bar at the top of the page.

ERROR conda.cli.main_run:execute(33): Subprocess for 'conda run ['/bin/bash', '-c', 'conda install -c rapidsai-nightly -c nvidia -c conda-forge cudf=21.12 python=3.8 cudatoolkit=11.2 -y']' command failed. (See above for error)

Need to declare jdk8 is recommended while setting up the spark cluster

For plugin versions after v21.08, running the Scala notebooks fails with a no-such-method exception if the Spark cluster is set up with JDK 11:

Unable to create executor due to No such method: addURL() on object: jdk.internal.loader.ClassLoaders$AppClassLoader

So we need to note that JDK 8 is recommended.
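A quick way to verify the cluster's JDK before digging further (a hypothetical check, not from the original report):

// Sanity-check sketch: print the JVM version on the driver and an executor.
// The addURL failure above is typical of JDK 9+ class loaders, so expect 1.8.0_xxx.
println(System.getProperty("java.version"))
spark.range(1).foreach(_ => println(System.getProperty("java.version"))) // appears in executor logs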

sample_xgboost_apps-0.2.2-jar-with-dependencies.jar doesn't have cuda11.2 with libxgboost4j.so

Describe the bug
The XGBoost Spark training job fails on CUDA 11.2 because it cannot find libxgboost4j.so.

Steps/Code to reproduce bug
If you follow https://github.com/NVIDIA/spark-xgboost-examples/blob/spark-3/getting-started-guides/on-prem-cluster/standalone-scala.md
and create sample_xgboost_apps-0.2.2-jar-with-dependencies.jar (from https://github.com/NVIDIA/spark-rapids-examples/blob/branch-21.08/docs/get-started/xgboost-examples/building-sample-apps/scala.md), the jar does not contain the path /lib/cuda11.2/, so a Spark job on a CUDA 11.2 server cannot find libxgboost4j.so. It only contains /lib/cuda11.

Environment details (please complete the following information)
Spark standalone cluster with CUDA 11.2

update databricks init script

If there are already XGBoost jars in DBFS, we will run into issues; we need to remove them from the init script.
