spark-rapids-container's Introduction

spark-rapids-container's People

Contributors

garyshen2008, nvnavkumar, pxli, res-life, tgravescs, wjxiz1992, yanxuanliu

spark-rapids-container's Issues

[BUG] ANTLR Tool version 4.8 used for code generation does not match the current runtime version 4.9.3

Describe the bug
After following the step-by-step installation guide on Azure Databricks, I hit this error: "ANTLR Tool version 4.8 used for code generation does not match the current runtime version 4.9.3". I can confirm that the cluster version is 10.4 LTS with worker type NC6_v3, and that the Docker image is exactly the rapids-4-spark-databricks:23.02.0 from the repo. Is there an undocumented step?

Steps/Code to reproduce bug
Follow the tutorial on running rapids with Docker on Databricks

Expected behavior
A running container

Environment details (please complete the following information)

  • cloud : Azure
  • Spark : 10.4 LTS
  • Autoscaling : disabled
  • Multi-node
  • Photon : disabled
  • auto-shutdown : disabled
  • init script provided : file:/opt/spark-rapids/init.sh (from repo)
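
The message means that parser code generated with ANTLR 4.8 is being loaded alongside a 4.9.3 runtime. One way to narrow this down may be to check whether more than one ANTLR runtime jar is visible on the cluster. The sketch below is a hypothetical helper, not part of the repo: it takes a list of jar file names (which directory to list is an assumption that depends on the image) and groups any ANTLR runtime jars by the version in their file name.

```python
import re
from collections import defaultdict

# Matches file names such as "antlr4-runtime-4.8.jar" or
# "antlr4-runtime-4.9.3.jar". The naming pattern is an assumption;
# adjust it if the jars on your cluster are named differently.
_ANTLR_JAR = re.compile(r"antlr4-runtime[-_]+(\d+(?:\.\d+)*)\.jar$")


def find_antlr_runtime_versions(jar_names):
    """Group ANTLR runtime jar names by the version in the file name."""
    versions = defaultdict(list)
    for name in jar_names:
        match = _ANTLR_JAR.search(name)
        if match:
            versions[match.group(1)].append(name)
    return dict(versions)


def has_conflict(jar_names):
    """Return True when more than one ANTLR runtime version is present."""
    return len(find_antlr_runtime_versions(jar_names)) > 1
```

The jar names could come from something like os.listdir() on the directory that holds the cluster's jars. A single version in the result would suggest the mismatch comes from elsewhere, for example from a jar that bundles its own generated parser.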

[BUG] Alluxio command reports error

Describe the bug

  • The Alluxio command reports a "logger not configured" error.
  • The Alluxio command reports a ClassNotFoundException for PrometheusMetricsServlet.
    This does not affect Alluxio's functionality, but the errors appear when we log in to the Web Terminal and run alluxio commands.
    Since end users do not see these errors, this issue can be low priority.

./alluxio fsadmin report

root@0224-020451-pugqs81d-10-59-182-117:/opt/alluxio/bin# ./alluxio fsadmin report
ERROR StatusLogger Reconfiguration failed: No configuration found for '7d4991ad' at 'null' in 'null'
06:44:21.573 [main] ERROR alluxio.metrics.MetricsSystem - Sink class alluxio.metrics.sink.PrometheusMetricsServlet cannot be instantiated
java.lang.ClassNotFoundException: alluxio.metrics.sink.PrometheusMetricsServlet
        at java.net.URLClassLoader.findClass(URLClassLoader.java:387) ~[?:1.8.0_362]
        at java.lang.ClassLoader.loadClass(ClassLoader.java:418) ~[?:1.8.0_362]
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352) ~[?:1.8.0_362]
        at java.lang.ClassLoader.loadClass(ClassLoader.java:351) ~[?:1.8.0_362]
        at java.lang.Class.forName0(Native Method) ~[?:1.8.0_362]
        at java.lang.Class.forName(Class.java:264) ~[?:1.8.0_362]
        at alluxio.metrics.MetricsSystem.startSinksFromConfig(MetricsSystem.java:273) ~[alluxio-client-2.9.0.jar:?]
        at alluxio.metrics.MetricsSystem.startSinks(MetricsSystem.java:210) ~[alluxio-client-2.9.0.jar:?]
        at alluxio.client.file.FileSystemContext.initContext(FileSystemContext.java:316) ~[alluxio-client-2.9.0.jar:?]
        at alluxio.client.file.FileSystemContext.init(FileSystemContext.java:305) ~[alluxio-client-2.9.0.jar:?]
        at alluxio.client.file.FileSystemContext.create(FileSystemContext.java:256) ~[alluxio-client-2.9.0.jar:?]
        at alluxio.client.file.FileSystemContext.create(FileSystemContext.java:225) ~[alluxio-client-2.9.0.jar:?]
        at alluxio.cli.fsadmin.FileSystemAdminShellUtils.checkMasterClientService(FileSystemAdminShellUtils.java:60) ~[alluxio-client-2.9.0.jar:?]
        at alluxio.cli.fsadmin.command.ReportCommand.run(ReportCommand.java:107) ~[alluxio-client-2.9.0.jar:?]
        at alluxio.cli.AbstractShell.run(AbstractShell.java:134) ~[alluxio-client-2.9.0.jar:?]
        at alluxio.cli.fsadmin.FileSystemAdminShell.main(FileSystemAdminShell.java:72) ~[alluxio-client-2.9.0.jar:?]


====================== The above is error msg================
====================== The below is output of this command ======

Alluxio cluster summary: 
    Master Address: 10.59.182.117:19998
    Web Port: 19999
    Rpc Port: 19998
    Started: 04-28-2023 06:39:15:645
    Uptime: 0 day(s), 0 hour(s), 5 minute(s), and 6 second(s)
    Version: 2.9.0
    Safe Mode: false
    Zookeeper Enabled: false
    Raft-based Journal: true
    Raft Journal Addresses: 
        10.59.182.117:19200
    Live Workers: 1
    Lost Workers: 0
    Total Capacity: 68.73GB
        Tier: SSD  Size: 68.73GB
    Used Capacity: 0B
        Tier: SSD  Size: 0B
    Free Capacity: 68.73GB

Steps/Code to reproduce bug
Create a Databricks cluster with:
Environment variables:
PROMETHEUS_COPY_DATA_PATH=/dbfs/chongg/dblogs-prometheus
ENABLE_ALLUXIO=1
Docker image URL: gaochong365/rapids-4-spark-databricks:23.04.0-rc1

After the cluster has started, log in to the Web Terminal via the Apps tab.
cd /opt/alluxio/bin
./alluxio fsadmin report
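
Until the logging errors are fixed, anyone scripting against this command may want to separate the log noise from the actual report. The following is a minimal sketch, assuming the output layout shown above (a "Alluxio cluster summary:" header followed by fields indented by four spaces); the function names are hypothetical and not part of Alluxio.

```python
def split_report(output, marker="Alluxio cluster summary:"):
    """Split command output into (log noise lines, report lines)."""
    lines = output.splitlines()
    for i, line in enumerate(lines):
        if line.strip().startswith(marker):
            return lines[:i], lines[i:]
    # Marker not found: treat everything as noise.
    return lines, []


def parse_summary(report_lines, indent=4):
    """Collect top-level "Key: value" fields from the report.

    Only lines indented by exactly `indent` spaces are read, so nested
    entries such as the per-tier sizes are skipped.
    """
    fields = {}
    for line in report_lines[1:]:
        leading = len(line) - len(line.lstrip(" "))
        if leading != indent:
            continue
        key, sep, value = line.strip().partition(": ")
        if sep and value:
            fields[key] = value
    return fields
```

The text passed to split_report would be the combined stdout and stderr of ./alluxio fsadmin report, captured for example with subprocess.run.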

[FEA] Alluxio home settings should be pinned in Databricks Docker container

Is your feature request related to a problem? Please describe.
When using Alluxio in the Databricks Docker container, we require the user to set both the ALLUXIO_HOME environment variable and the Spark configuration spark.rapids.alluxio.home. These should be unnecessary, as the version of Alluxio is fixed inside the Docker container image.

Describe the solution you'd like
The ALLUXIO_HOME environment variable should be set in the Docker container's built-in init script. This can be done during the Docker build process using a sed command inside the Dockerfile. The documentation should be updated to not require this setting. For spark.rapids.alluxio.home, either the code itself can read ALLUXIO_HOME as a default value (in which case we can file a corresponding issue in https://github.com/NVIDIA/spark-rapids), or the setting can be added to 00-custom-spark-driver-defaults.conf via a sed command in the Dockerfile. The user should then not need to know the exact Alluxio paths; it should just work if ENABLE_ALLUXIO=1.
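
The fallback order proposed for spark.rapids.alluxio.home can be illustrated with a short sketch. This is Python for illustration only and is not the plugin's code; the function name is hypothetical, and the built-in default of /opt/alluxio is an assumption based on the path used elsewhere in these issues.

```python
import os

# Assumed install location inside the container image.
DEFAULT_ALLUXIO_HOME = "/opt/alluxio"


def resolve_alluxio_home(spark_conf, env=None):
    """Pick the Alluxio home directory.

    Order: explicit Spark setting, then the ALLUXIO_HOME environment
    variable, then the assumed built-in default.
    """
    env = os.environ if env is None else env
    explicit = spark_conf.get("spark.rapids.alluxio.home")
    if explicit:
        return explicit
    return env.get("ALLUXIO_HOME") or DEFAULT_ALLUXIO_HOME
```

With this order, a user who sets nothing still gets a working path, while an explicit Spark setting continues to override everything else.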

[BUG] databricks docker missing numpy, pandas, pyarrow, and psycopg2

Describe the bug
Our Databricks Docker container is missing the basic Python libraries numpy, pandas, pyarrow, and psycopg2. Since many users rely on these, we should include them. One customer tried our container and hit failures because these libraries were missing.

Specific versions in 10.4 and, I believe, 11.3:

numpy==1.20.1 \
pandas==1.2.4 \
pyarrow==4.0.0 \
psycopg2==2.8.5 \
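
Whether an image carries these libraries could be checked with a short script run inside the container. The sketch below is a hypothetical helper, not something shipped in the repo; it uses importlib.metadata from the standard library (Python 3.8+) and looks packages up by distribution name, so a package installed under another distribution name (for example psycopg2-binary) would be reported as missing.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions taken from the list above.
PINNED = {
    "numpy": "1.20.1",
    "pandas": "1.2.4",
    "pyarrow": "4.0.0",
    "psycopg2": "2.8.5",
}


def check_packages(pinned, get_version=version):
    """Return a dict of package name -> problem for anything off-spec."""
    problems = {}
    for name, wanted in pinned.items():
        try:
            found = get_version(name)
        except PackageNotFoundError:
            problems[name] = "missing"
            continue
        if found != wanted:
            problems[name] = f"found {found}, expected {wanted}"
    return problems


if __name__ == "__main__":
    for name, problem in check_packages(PINNED).items():
        print(f"{name}: {problem}")
```

An empty result means every pinned package is installed at the expected version.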

[DOC] we need release docs

Report incorrect documentation

We need release docs that point to released versions so people can find our images, for example:

nvcr.io/nvidia/rapids-4-spark-databricks:22.10.0
