Comments (8)

jlowe commented on August 16, 2024

From the stderr, "symbol lookup error: /tmp/udfexamplesjni413332660788026425.so: undefined symbol: _ZN4cudf12copy_bitmaskERKNS_11column_viewEN3rmm16cuda_stream_viewEPNS3_2mr22device_memory_resourceE", I think it failed to find that symbol in the libcudf.so it loaded.

This might be a case of a mismatch between the libcudf.so used to build the examples and the one that was found at runtime. The libcudf API is not backwards compatible from version to version, so it's important to build the UDF example against the same version of cudf that is being used at runtime by the RAPIDS Accelerator.

Note that the latest code on spark-rapids-examples main will build against libcudf from 23.12.0, but the version found at runtime appears to be 23.10.0. I verified that the symbol in question is present in libcudf 23.12.0 but is missing from 23.10.0.

@LIN-Yu-Ting you will need to ensure the libcudf version the examples are built against matches the version found at runtime, either by rolling the examples back to build against libcudf 23.10.0 or by moving the RAPIDS Accelerator jar forward to 23.12.0.
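If it helps to verify this kind of mismatch directly, the mangled symbol can be demangled and checked against a given libcudf build; a minimal shell sketch (the library path is a placeholder):

```bash
# Demangle the symbol from the error message to see which cudf API is missing
echo '_ZN4cudf12copy_bitmaskERKNS_11column_viewEN3rmm16cuda_stream_viewEPNS3_2mr22device_memory_resourceE' | c++filt

# Check whether a particular libcudf.so actually exports that symbol
nm -D --defined-only /path/to/libcudf.so | grep copy_bitmask
```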

LIN-Yu-Ting commented on August 16, 2024

@GaryShen2008 @jlowe
I recompiled rapids-4-spark-udf-examples_2.12-23.10.0-SNAPSHOT.jar against Spark RAPIDS 23.10.0, reran the previously failing commands, and the errors no longer occur.
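For anyone reproducing the fix, a minimal sketch of such a rebuild (the branch name and module path follow the usual spark-rapids-examples layout but are assumptions here, so adjust to your checkout):

```bash
# Check out the examples branch matching the RAPIDS Accelerator in use (23.10 here)
git clone https://github.com/NVIDIA/spark-rapids-examples.git
cd spark-rapids-examples
git checkout branch-23.10                          # assumed branch naming

# Rebuild the UDF examples jar so its native code links against libcudf 23.10.0
cd examples/UDF-Examples/RAPIDS-accelerated-UDFs   # assumed module path
mvn clean package -DskipTests
```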

GaryShen2008 commented on August 16, 2024

Hi @LIN-Yu-Ting,

Thanks for reporting this issue.
Is it possible to check the executors' logs in your standalone cluster? The stderr or stdout logs under SPARK_HOME/work/<app-id>/<executor-id>/?
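In a standalone cluster, each worker keeps one directory per application with one subdirectory per executor; a minimal sketch for finding the logs (IDs are placeholders):

```bash
# On the worker node: one directory per application, one subdirectory per executor
ls "$SPARK_HOME"/work/
cat "$SPARK_HOME"/work/<app-id>/<executor-id>/stderr
cat "$SPARK_HOME"/work/<app-id>/<executor-id>/stdout
```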

LIN-Yu-Ting commented on August 16, 2024

@GaryShen2008 This is the stderr of one of the executors:

root@7ae88f7da20c49e48c9ddbf89e51dbe6000000:/home/spark-current/work/app-20240116060142-0001# cat 0/stderr 
Spark Executor Command: "/usr/lib/jvm/java-1.8.0-openjdk-amd64/bin/java" "-cp" "/home/spark-current/jars/rapids-4-spark_2.12-23.10.0-cuda12.jar:/home/spark-current/conf/:/home/spark-current/jars/*:/usr/local/hadoop-3.3.0/etc/hadoop/:/usr/local/hadoop-3.3.0/share/hadoop/common/lib/*:/usr/local/hadoop-3.3.0/share/hadoop/common/*:/usr/local/hadoop-3.3.0/share/hadoop/hdfs/:/usr/local/hadoop-3.3.0/share/hadoop/hdfs/lib/*:/usr/local/hadoop-3.3.0/share/hadoop/hdfs/*:/usr/local/hadoop-3.3.0/share/hadoop/yarn/:/usr/local/hadoop-3.3.0/share/hadoop/yarn/lib/*:/usr/local/hadoop-3.3.0/share/hadoop/yarn/*:/usr/local/hadoop-3.3.0/share/hadoop/mapreduce/lib/*:/usr/local/hadoop-3.3.0/share/hadoop/mapreduce/*:/contrib/capacity-scheduler/*.jar" "-Xmx20480M" "-Dspark.driver.port=34209" "-XX:+IgnoreUnrecognizedVMOptions" "--add-opens=java.base/java.lang=ALL-UNNAMED" "--add-opens=java.base/java.lang.invoke=ALL-UNNAMED" "--add-opens=java.base/java.lang.reflect=ALL-UNNAMED" "--add-opens=java.base/java.io=ALL-UNNAMED" "--add-opens=java.base/java.net=ALL-UNNAMED" "--add-opens=java.base/java.nio=ALL-UNNAMED" "--add-opens=java.base/java.util=ALL-UNNAMED" "--add-opens=java.base/java.util.concurrent=ALL-UNNAMED" "--add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED" "--add-opens=java.base/sun.nio.ch=ALL-UNNAMED" "--add-opens=java.base/sun.nio.cs=ALL-UNNAMED" "--add-opens=java.base/sun.security.action=ALL-UNNAMED" "--add-opens=java.base/sun.util.calendar=ALL-UNNAMED" "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED" "-XX:+UseG1GC" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@7ae88f7da20c49e48c9ddbf89e51dbe6000000.internal.cloudapp.net:34209" "--executor-id" "0" "--hostname" "10.0.0.11" "--cores" "4" "--app-id" "app-20240116060142-0001" "--worker-url" "spark://[email protected]:37895" "--resourcesFile" "/home/spark-current/work/app-20240116060142-0001/0/resource-executor-3543028932059117698.json"
========================================

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/mnt/spark-current/jars/atgenomix-plugin.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/mnt/spark-current/jars/log4j-slf4j-impl-2.17.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/hadoop-3.3.0/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [ch.qos.logback.classic.util.ContextSelectorStaticBinder]
/usr/lib/jvm/java-1.8.0-openjdk-amd64/bin/java: symbol lookup error: /tmp/udfexamplesjni413332660788026425.so: undefined symbol: _ZN4cudf12copy_bitmaskERKNS_11column_viewEN3rmm16cuda_stream_viewEPNS3_2mr22device_memory_resourceE

and for stdout:

root@7ae88f7da20c49e48c9ddbf89e51dbe6000000:/home/spark-current/work/app-20240116060142-0001# cat 0/stdout
connected to null
connected to null
06:01:44,686 |-INFO in ch.qos.logback.classic.LoggerContext[default] - Could NOT find resource [logback-test.xml]
06:01:44,686 |-INFO in ch.qos.logback.classic.LoggerContext[default] - Could NOT find resource [logback.groovy]
06:01:44,687 |-INFO in ch.qos.logback.classic.LoggerContext[default] - Found resource [logback.xml] at [file:/mnt/spark-current/conf/logback.xml]
06:01:44,688 |-WARN in ch.qos.logback.classic.LoggerContext[default] - Resource [logback.xml] occurs multiple times on the classpath.
06:01:44,688 |-WARN in ch.qos.logback.classic.LoggerContext[default] - Resource [logback.xml] occurs at [jar:file:/mnt/spark-current/jars/atgenomix-plugin.jar!/logback.xml]
06:01:44,688 |-WARN in ch.qos.logback.classic.LoggerContext[default] - Resource [logback.xml] occurs at [file:/mnt/spark-current/conf/logback.xml]
06:01:44,727 |-INFO in ch.qos.logback.classic.joran.action.ConfigurationAction - debug attribute not set
06:01:44,727 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - About to instantiate appender of type [ch.qos.logback.core.ConsoleAppender]
06:01:44,732 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - Naming appender as [STDOUT]
06:01:44,736 |-INFO in ch.qos.logback.core.joran.action.NestedComplexPropertyIA - Assuming default type [ch.qos.logback.classic.encoder.PatternLayoutEncoder] for [encoder] property
06:01:44,763 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - About to instantiate appender of type [ch.qos.logback.core.FileAppender]
06:01:44,765 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - Naming appender as [AUDIT_FILE]
06:01:44,766 |-INFO in ch.qos.logback.core.joran.action.NestedComplexPropertyIA - Assuming default type [ch.qos.logback.classic.encoder.PatternLayoutEncoder] for [encoder] property
06:01:44,766 |-INFO in ch.qos.logback.core.FileAppender[AUDIT_FILE] - File property is set to [audit.log]
06:01:44,767 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - About to instantiate appender of type [ch.qos.logback.core.FileAppender]
06:01:44,767 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - Naming appender as [DEBUG_FILE]
06:01:44,767 |-INFO in ch.qos.logback.core.joran.action.NestedComplexPropertyIA - Assuming default type [ch.qos.logback.classic.encoder.PatternLayoutEncoder] for [encoder] property
06:01:44,768 |-INFO in ch.qos.logback.core.FileAppender[DEBUG_FILE] - File property is set to [debug.log]
06:01:44,768 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - About to instantiate appender of type [com.atgenomix.seqslab.piper.common.log.UnixSocketSyslogAppender]
06:01:44,769 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - Naming appender as [SYSLOG]
06:01:44,966 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - About to instantiate appender of type [com.atgenomix.seqslab.piper.common.log.UnixSocketSyslogAppender]
06:01:44,966 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - Naming appender as [SECURITY]
06:01:44,968 |-INFO in ch.qos.logback.classic.joran.action.LoggerAction - Setting level of logger [audit] to INFO
06:01:44,968 |-INFO in ch.qos.logback.core.joran.action.AppenderRefAction - Attaching appender named [AUDIT_FILE] to Logger[audit]
06:01:44,968 |-INFO in ch.qos.logback.classic.joran.action.LoggerAction - Setting level of logger [dbg] to DEBUG
06:01:44,968 |-INFO in ch.qos.logback.core.joran.action.AppenderRefAction - Attaching appender named [DEBUG_FILE] to Logger[dbg]
06:01:44,968 |-INFO in ch.qos.logback.classic.joran.action.LoggerAction - Setting level of logger [syslog] to INFO
06:01:44,968 |-INFO in ch.qos.logback.core.joran.action.AppenderRefAction - Attaching appender named [SYSLOG] to Logger[syslog]
06:01:44,968 |-INFO in ch.qos.logback.classic.joran.action.LoggerAction - Setting level of logger [security] to INFO
06:01:44,968 |-INFO in ch.qos.logback.core.joran.action.AppenderRefAction - Attaching appender named [SECURITY] to Logger[security]
06:01:44,968 |-INFO in ch.qos.logback.classic.joran.action.LoggerAction - Setting level of logger [org.apache.spark] to INFO
06:01:44,968 |-INFO in ch.qos.logback.classic.joran.action.LoggerAction - Setting level of logger [org.apache.spark.repl.Main] to WARN
06:01:44,969 |-INFO in ch.qos.logback.classic.joran.action.LoggerAction - Setting level of logger [org.sparkproject.jetty] to WARN
06:01:44,969 |-INFO in ch.qos.logback.classic.joran.action.LoggerAction - Setting level of logger [org.sparkproject.jetty.util.component.AbstractLifeCycle] to ERROR
06:01:44,969 |-INFO in ch.qos.logback.classic.joran.action.LoggerAction - Setting level of logger [org.apache.spark.repl.SparkIMain$exprTyper] to INFO
06:01:44,969 |-INFO in ch.qos.logback.classic.joran.action.LoggerAction - Setting level of logger [org.apache.spark.repl.SparkILoop$SparkILoopInterpreter] to INFO
06:01:44,969 |-INFO in ch.qos.logback.classic.joran.action.LoggerAction - Setting level of logger [org.apache.parquet] to ERROR
06:01:44,969 |-INFO in ch.qos.logback.classic.joran.action.LoggerAction - Setting level of logger [parquet] to ERROR
06:01:44,969 |-INFO in ch.qos.logback.classic.joran.action.RootLoggerAction - Setting level of ROOT logger to WARN
06:01:44,969 |-INFO in ch.qos.logback.core.joran.action.AppenderRefAction - Attaching appender named [STDOUT] to Logger[ROOT]
06:01:44,969 |-INFO in ch.qos.logback.classic.joran.action.ConfigurationAction - End of configuration.
06:01:44,969 |-INFO in ch.qos.logback.classic.joran.JoranConfigurator@35b74c5c - Registering current configuration as safe fallback point

24/01/16 06:01:44.975 [main] INFO  o.a.s.e.CoarseGrainedExecutorBackend - Started daemon with process name: 19577@7ae88f7da20c49e48c9ddbf89e51dbe6000000
24/01/16 06:01:44.980 [main] INFO  org.apache.spark.util.SignalUtils - Registering signal handler for TERM
24/01/16 06:01:44.981 [main] INFO  org.apache.spark.util.SignalUtils - Registering signal handler for HUP
24/01/16 06:01:44.981 [main] INFO  org.apache.spark.util.SignalUtils - Registering signal handler for INT
24/01/16 06:01:45.376 [main] INFO  org.apache.spark.SecurityManager - Changing view acls to: root
24/01/16 06:01:45.376 [main] INFO  org.apache.spark.SecurityManager - Changing modify acls to: root
24/01/16 06:01:45.377 [main] INFO  org.apache.spark.SecurityManager - Changing view acls groups to: 
24/01/16 06:01:45.377 [main] INFO  org.apache.spark.SecurityManager - Changing modify acls groups to: 
24/01/16 06:01:45.377 [main] INFO  org.apache.spark.SecurityManager - SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
24/01/16 06:01:45.597 [netty-rpc-connection-0] INFO  o.a.s.n.c.TransportClientFactory - Successfully created connection to 7ae88f7da20c49e48c9ddbf89e51dbe6000000.internal.cloudapp.net/10.0.0.11:34209 after 49 ms (0 ms spent in bootstraps)
24/01/16 06:01:45.688 [main] INFO  org.apache.spark.SecurityManager - Changing view acls to: root
24/01/16 06:01:45.688 [main] INFO  org.apache.spark.SecurityManager - Changing modify acls to: root
24/01/16 06:01:45.688 [main] INFO  org.apache.spark.SecurityManager - Changing view acls groups to: 
24/01/16 06:01:45.688 [main] INFO  org.apache.spark.SecurityManager - Changing modify acls groups to: 
24/01/16 06:01:45.688 [main] INFO  org.apache.spark.SecurityManager - SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
24/01/16 06:01:45.727 [netty-rpc-connection-0] INFO  o.a.s.n.c.TransportClientFactory - Successfully created connection to 7ae88f7da20c49e48c9ddbf89e51dbe6000000.internal.cloudapp.net/10.0.0.11:34209 after 1 ms (0 ms spent in bootstraps)
24/01/16 06:01:45.790 [main] INFO  o.a.spark.storage.DiskBlockManager - Created local directory at /mnt/spark-deec7eff-062f-49c3-9faf-0d023590611c/executor-1ca20b0d-4a78-4ec3-b514-85ccf268b23c/blockmgr-79e96be6-6c7d-46e8-8566-b113ee1dcdd4
24/01/16 06:01:45.819 [main] INFO  o.a.spark.storage.memory.MemoryStore - MemoryStore started with capacity 11.8 GiB
24/01/16 06:01:45.989 [dispatcher-Executor] INFO  o.a.s.e.CoarseGrainedExecutorBackend - Connecting to driver: spark://CoarseGrainedScheduler@7ae88f7da20c49e48c9ddbf89e51dbe6000000.internal.cloudapp.net:34209
24/01/16 06:01:45.990 [main] INFO  o.a.s.deploy.worker.WorkerWatcher - Connecting to worker spark://[email protected]:37895
24/01/16 06:01:46.038 [netty-rpc-connection-1] INFO  o.a.s.n.c.TransportClientFactory - Successfully created connection to /10.0.0.11:37895 after 23 ms (0 ms spent in bootstraps)
24/01/16 06:01:46.145 [dispatcher-Executor] INFO  o.a.spark.resource.ResourceUtils - ==============================================================
24/01/16 06:01:46.145 [dispatcher-Executor] INFO  o.a.spark.resource.ResourceUtils - Custom resources for spark.executor:
gpu -> [name: gpu, addresses: 0]
24/01/16 06:01:46.146 [dispatcher-Executor] INFO  o.a.spark.resource.ResourceUtils - ==============================================================
24/01/16 06:01:46.183 [dispatcher-Executor] INFO  o.a.s.e.CoarseGrainedExecutorBackend - Successfully registered with driver
24/01/16 06:01:46.186 [dispatcher-Executor] INFO  org.apache.spark.executor.Executor - Starting executor ID 0 on host 10.0.0.11
24/01/16 06:01:46.265 [dispatcher-Executor] INFO  org.apache.spark.util.Utils - Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 33733.
24/01/16 06:01:46.266 [dispatcher-Executor] INFO  o.a.s.n.n.NettyBlockTransferService - Server created on 10.0.0.11:33733
24/01/16 06:01:46.267 [dispatcher-Executor] INFO  o.apache.spark.storage.BlockManager - Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
24/01/16 06:01:46.272 [dispatcher-Executor] INFO  o.a.spark.storage.BlockManagerMaster - Registering BlockManager BlockManagerId(0, 10.0.0.11, 33733, None)
24/01/16 06:01:46.280 [dispatcher-Executor] INFO  o.a.spark.storage.BlockManagerMaster - Registered BlockManager BlockManagerId(0, 10.0.0.11, 33733, None)
24/01/16 06:01:46.280 [dispatcher-Executor] INFO  o.apache.spark.storage.BlockManager - external shuffle service port = 7337
24/01/16 06:01:46.281 [dispatcher-Executor] INFO  o.apache.spark.storage.BlockManager - Registering executor with local external shuffle service.
24/01/16 06:01:46.287 [dispatcher-Executor] INFO  o.a.s.n.c.TransportClientFactory - Successfully created connection to /10.0.0.11:7337 after 1 ms (0 ms spent in bootstraps)
24/01/16 06:01:46.295 [dispatcher-Executor] INFO  o.apache.spark.storage.BlockManager - Initialized BlockManager: BlockManagerId(0, 10.0.0.11, 33733, None)
24/01/16 06:01:46.300 [dispatcher-Executor] INFO  org.apache.spark.executor.Executor - Starting executor with user classpath (userClassPathFirst = false): ''
24/01/16 06:02:07.647 [dispatcher-Executor] WARN  c.n.spark.rapids.RapidsPluginUtils - RAPIDS Accelerator 23.10.0 using cudf 23.10.0.
24/01/16 06:02:08.400 [dispatcher-Executor] INFO  o.a.s.r.s.s.ShimDiskBlockManager - Created local directory at /mnt/spark-deec7eff-062f-49c3-9faf-0d023590611c/executor-1ca20b0d-4a78-4ec3-b514-85ccf268b23c/blockmgr-4e8986c2-0337-437b-84c2-9063c7704401
24/01/16 06:02:09.006 [dispatcher-Executor] INFO  o.a.s.i.p.ExecutorPluginContainer - Initialized executor component for plugin com.nvidia.spark.SQLPlugin.
24/01/16 06:02:13.924 [dispatcher-Executor] INFO  o.a.s.e.CoarseGrainedExecutorBackend - Got assigned task 0
24/01/16 06:02:13.927 [dispatcher-Executor] INFO  o.a.s.e.CoarseGrainedExecutorBackend - Got assigned task 1
24/01/16 06:02:13.930 [Executor task launch worker for task 1.0 in stage 0.0 (TID 1)] INFO  org.apache.spark.executor.Executor - Running task 1.0 in stage 0.0 (TID 1)
24/01/16 06:02:13.931 [Executor task launch worker for task 0.0 in stage 0.0 (TID 0)] INFO  org.apache.spark.executor.Executor - Running task 0.0 in stage 0.0 (TID 0)
24/01/16 06:02:14.247 [Executor task launch worker for task 0.0 in stage 0.0 (TID 0)] INFO  o.a.spark.broadcast.TorrentBroadcast - Started reading broadcast variable 0 with 1 pieces (estimated total size 4.0 MiB)
24/01/16 06:02:14.278 [Executor task launch worker for task 0.0 in stage 0.0 (TID 0)] INFO  o.a.s.n.c.TransportClientFactory - Successfully created connection to 7ae88f7da20c49e48c9ddbf89e51dbe6000000.internal.cloudapp.net/10.0.0.11:42275 after 1 ms (0 ms spent in bootstraps)
24/01/16 06:02:14.315 [Executor task launch worker for task 0.0 in stage 0.0 (TID 0)] INFO  o.a.spark.storage.memory.MemoryStore - Block broadcast_0_piece0 stored as bytes in memory (estimated size 14.5 KiB, free 11.8 GiB)
24/01/16 06:02:14.323 [Executor task launch worker for task 0.0 in stage 0.0 (TID 0)] INFO  o.a.spark.broadcast.TorrentBroadcast - Reading broadcast variable 0 took 76 ms
24/01/16 06:02:14.353 [Executor task launch worker for task 0.0 in stage 0.0 (TID 0)] INFO  o.a.spark.storage.memory.MemoryStore - Block broadcast_0 stored as values in memory (estimated size 30.3 KiB, free 11.8 GiB)
24/01/16 06:02:15.332 [Executor task launch worker for task 0.0 in stage 0.0 (TID 0)] INFO  o.a.s.s.c.e.codegen.CodeGenerator - Code generated in 143.751593 ms
24/01/16 06:02:15.388 [Executor task launch worker for task 1.0 in stage 0.0 (TID 1)] INFO  o.a.spark.api.python.PythonRunner - Times: total = 383, boot = 327, init = 56, finish = 0
24/01/16 06:02:15.388 [Executor task launch worker for task 0.0 in stage 0.0 (TID 0)] INFO  o.a.spark.api.python.PythonRunner - Times: total = 383, boot = 330, init = 53, finish = 0
24/01/16 06:02:15.533 [Executor task launch worker for task 1.0 in stage 0.0 (TID 1)] INFO  org.apache.spark.executor.Executor - Finished task 1.0 in stage 0.0 (TID 1). 4391 bytes result sent to driver
24/01/16 06:02:15.533 [Executor task launch worker for task 0.0 in stage 0.0 (TID 0)] INFO  org.apache.spark.executor.Executor - Finished task 0.0 in stage 0.0 (TID 0). 4391 bytes result sent to driver
24/01/16 06:02:15.693 [dispatcher-Executor] INFO  o.a.s.e.CoarseGrainedExecutorBackend - Got assigned task 2
24/01/16 06:02:15.693 [Executor task launch worker for task 0.0 in stage 2.0 (TID 2)] INFO  org.apache.spark.executor.Executor - Running task 0.0 in stage 2.0 (TID 2)
24/01/16 06:02:15.729 [Executor task launch worker for task 0.0 in stage 2.0 (TID 2)] INFO  o.a.spark.MapOutputTrackerWorker - Updating epoch to 1 and clearing cache
24/01/16 06:02:15.731 [Executor task launch worker for task 0.0 in stage 2.0 (TID 2)] INFO  o.a.spark.broadcast.TorrentBroadcast - Started reading broadcast variable 1 with 1 pieces (estimated total size 4.0 MiB)
24/01/16 06:02:15.748 [Executor task launch worker for task 0.0 in stage 2.0 (TID 2)] INFO  o.a.spark.storage.memory.MemoryStore - Block broadcast_1_piece0 stored as bytes in memory (estimated size 11.5 KiB, free 11.8 GiB)
24/01/16 06:02:15.750 [Executor task launch worker for task 0.0 in stage 2.0 (TID 2)] INFO  o.a.spark.broadcast.TorrentBroadcast - Reading broadcast variable 1 took 19 ms
24/01/16 06:02:15.752 [Executor task launch worker for task 0.0 in stage 2.0 (TID 2)] INFO  o.a.spark.storage.memory.MemoryStore - Block broadcast_1 stored as values in memory (estimated size 24.2 KiB, free 11.8 GiB)
24/01/16 06:02:15.861 [Executor task launch worker for task 0.0 in stage 2.0 (TID 2)] INFO  o.a.spark.MapOutputTrackerWorker - Don't have map outputs for shuffle 0, fetching them
24/01/16 06:02:15.862 [Executor task launch worker for task 0.0 in stage 2.0 (TID 2)] INFO  o.a.spark.MapOutputTrackerWorker - Doing the fetch; tracker endpoint = NettyRpcEndpointRef(spark://MapOutputTracker@7ae88f7da20c49e48c9ddbf89e51dbe6000000.internal.cloudapp.net:34209)
24/01/16 06:02:15.905 [Executor task launch worker for task 0.0 in stage 2.0 (TID 2)] INFO  o.a.spark.MapOutputTrackerWorker - Got the map output locations
24/01/16 06:02:15.935 [Executor task launch worker for task 0.0 in stage 2.0 (TID 2)] INFO  o.a.s.s.ShuffleBlockFetcherIterator - Getting 2 (352.0 B) non-empty blocks including 2 (352.0 B) local and 0 (0.0 B) host-local and 0 (0.0 B) push-merged-local and 0 (0.0 B) remote blocks
24/01/16 06:02:15.937 [Executor task launch worker for task 0.0 in stage 2.0 (TID 2)] INFO  o.a.s.s.ShuffleBlockFetcherIterator - Started 0 remote fetches in 10 ms
24/01/16 06:02:16.000 [Executor task launch worker for task 0.0 in stage 2.0 (TID 2)] INFO  o.a.s.s.c.e.codegen.CodeGenerator - Code generated in 10.078387 ms

GaryShen2008 commented on August 16, 2024

Hi @LIN-Yu-Ting,

From the stderr, "symbol lookup error: /tmp/udfexamplesjni413332660788026425.so: undefined symbol: _ZN4cudf12copy_bitmaskERKNS_11column_viewEN3rmm16cuda_stream_viewEPNS3_2mr22device_memory_resourceE", I think it failed to find that symbol in the libcudf.so it loaded.

Did you set up the Spark standalone cluster on a single node or on multiple nodes?
If it's a single node (driver and executors run on the same node), it's strange that one query succeeded while another failed.
If it's multiple nodes, I wonder whether the rapids-4-spark_2.12-23.10.0-cuda12.jar wasn't copied to the executor's node, so it failed to find the library. A sketch of that check follows.
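A minimal shell sketch for the multi-node case (hostnames are placeholders; the jar path is taken from the executor command line above):

```bash
# Verify the plugin jar is present at the same path on each worker
for host in worker1 worker2; do
  ssh "$host" "ls -l /home/spark-current/jars/rapids-4-spark_2.12-23.10.0-cuda12.jar"
done

# Or let Spark distribute the jar to the executors itself
spark-submit --jars /home/spark-current/jars/rapids-4-spark_2.12-23.10.0-cuda12.jar ...
```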

GaryShen2008 commented on August 16, 2024

Or it may be the same problem as #344.

LIN-Yu-Ting commented on August 16, 2024

> Hi @LIN-Yu-Ting,
>
> From the stderr, "symbol lookup error: /tmp/udfexamplesjni413332660788026425.so: undefined symbol: _ZN4cudf12copy_bitmaskERKNS_11column_viewEN3rmm16cuda_stream_viewEPNS3_2mr22device_memory_resourceE", I think it failed to find that symbol in the libcudf.so it loaded.
>
> Did you set up the Spark standalone cluster on a single node or on multiple nodes? If it's a single node (driver and executors run on the same node), it's strange that one query succeeded while another failed. If it's multiple nodes, I wonder whether the rapids-4-spark_2.12-23.10.0-cuda12.jar wasn't copied to the executor's node, so it failed to find the library.

@GaryShen2008 My Spark environment is a single node (driver and executors run on the same node).

GaryShen2008 commented on August 16, 2024

Please let us know whether the issue is resolved after aligning the cudf versions (e.g., both on 23.12.0).
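One quick way to confirm: the plugin logs its accelerator and cudf versions at startup (see the "RAPIDS Accelerator 23.10.0 using cudf 23.10.0" line in the stdout above), so a grep over the executor logs shows what was actually loaded:

```bash
# Confirm which plugin/cudf versions each executor picked up
grep "RAPIDS Accelerator" "$SPARK_HOME"/work/*/*/stderr "$SPARK_HOME"/work/*/*/stdout
```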
