Coder Social home page Coder Social logo

Comments (18)

galderz avatar galderz commented on August 17, 2024 1

and we can't control when Arc.INSTANCE and any references from that will become image heap objects, at which stage the verifications for invalid objects is carried out.

That doesn't sound right to me. Isn't the image heap supposed to get checked/verified after all iterations have been done? I.e. when the heap is in a stable state and there is no build-time code being run anymore?

That's what I thought initially would happen, but having looked at the code it used object reachable callbacks to detect thread leaks and those callbacks can happen at any time. Have you seen something in the code that makes you think that? Let's stop going around in circles about this.

Furthermore, here's output on the main thread being checked to see if it's reachable. There's no indication that this is happening at the end of the points-to analysis and indeed it pops even before the first points-to iteration has finished (which is signalled showing the '*' char). To reiterate, I don't see any guarantees on when this check will happen, so it can indeed, through timing such that main thread is marked reacheable while another thread is in the middle of the lock acquire/release block that I mentioned above. This is consistent with the behaviour of the bug, which happens randomly and therefore the right timing needs to happen.

Build resources:
 - 24.18GB of memory (75.6% of 32.00GB system memory, determined at start)
 - 10 thread(s) (100.0% of 10 available processor(s), determined at start)
Check for thread to be reachable: Thread[#38,main,5,main]
Stacktrace:
java.lang.Exception
	at org.graalvm.nativeimage.builder/com.oracle.svm.core.image.DisallowedImageHeapObjects.onThreadReachable(DisallowedImageHeapObjects.java:122)
	at org.graalvm.nativeimage.builder/com.oracle.svm.hosted.image.DisallowedImageHeapObjectFeature.lambda$duringSetup$4(DisallowedImageHeapObjectFeature.java:79)
	at org.graalvm.nativeimage.pointsto/com.oracle.graal.pointsto.meta.AnalysisType.lambda$notifyObjectReachable$12(AnalysisType.java:659)
	at java.base/java.util.concurrent.ConcurrentHashMap$KeySetView.forEach(ConcurrentHashMap.java:4714)
	at org.graalvm.nativeimage.pointsto/com.oracle.graal.pointsto.util.ConcurrentLightHashSet.forEach(ConcurrentLightHashSet.java:146)
	at org.graalvm.nativeimage.pointsto/com.oracle.graal.pointsto.meta.AnalysisType.notifyObjectReachable(AnalysisType.java:659)
	at org.graalvm.nativeimage.pointsto/com.oracle.graal.pointsto.heap.ImageHeapScanner.onObjectReachable(ImageHeapScanner.java:586)
	at org.graalvm.nativeimage.builder/com.oracle.svm.hosted.heap.SVMImageHeapScanner.onObjectReachable(SVMImageHeapScanner.java:125)
	at org.graalvm.nativeimage.pointsto/com.oracle.graal.pointsto.heap.ImageHeapScanner.lambda$markReachable$10(ImageHeapScanner.java:572)
	at org.graalvm.nativeimage.pointsto/com.oracle.graal.pointsto.util.CompletionExecutor.executeCommand(CompletionExecutor.java:166)
	at org.graalvm.nativeimage.pointsto/com.oracle.graal.pointsto.util.CompletionExecutor.lambda$executeService$0(CompletionExecutor.java:152)
	at java.base/java.util.concurrent.ForkJoinTask$RunnableExecuteAction.compute(ForkJoinTask.java:1726)
	at java.base/java.util.concurrent.ForkJoinTask$RunnableExecuteAction.compute(ForkJoinTask.java:1717)
	at java.base/java.util.concurrent.ForkJoinTask$InterruptibleTask.exec(ForkJoinTask.java:1641)
	at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:507)
	at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1489)
	at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:2071)
	at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:2033)
	at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:189)
12:15:03,625 WARN  [io.net.res.dns.DnsServerAddressStreamProviders] Can not find io.netty.resolver.dns.macos.MacOSDnsServerAddressStreamProvider in the classpath, fallback to system defaults. This may result in incorrect DNS resolutions on MacOS. Check whether you have a dependency on 'io.netty:netty-resolver-dns-native-macos'
[2/8] Performing analysis...  [*****]

I have tried to introduced a small delay inside the LazyValye.get() acquire/release section but have not yet reproduced the issue. I'll try to increase it and see if it breaks.

from quarkus.

zakkak avatar zakkak commented on August 17, 2024 1

That's what I thought initially would happen, but having looked at the code it used object reachable callbacks to detect thread leaks and those callbacks can happen at any time. Have you seen something in the code that makes you think that?

I had a look and did some testing. So what GraalVM ensures is that sequential class initialization will complete before proceeding with the analysis. If the class initialization ends up deadlocking or taking way too long GraalVM's wathdog should detect it and fail the build, which doesn't appear to be the case here.

Now if we add concurrency in class initialization there is no way for GraalVM to detect whether all threads have completed or not. That said if a class initializer spawns a thread and returns, the spawned thread can potentially run in parallel with the analysis (e.g. the reachable callbacks) and image building in general which can be problematic...

What's interesting in this specific case though is that the thread holding the lock appears to be main, i.e. the sequential part of the class initializer has still not released the lock.

Another case where class initialization happens during Analysis, and might thus result in what we see above, is when a type not already initialized is found to be reachable by the analysis. In this case class initialization is triggered by whichever thread happened to detect the reachable type, and the initialization runs in parallel with other callbacks which could lead to the observed behavior...

So to sum up I believe that we need to detect which class initializer is triggering the LazyValue instance "initialization" (i.e. acquires the lock) and ensure that it is either no longer run or it runs before the analysis phase.

I will try to have another look the next days.

from quarkus.

galderz avatar galderz commented on August 17, 2024 1

@zakkak I didn't reproduce this myself, @gsmet found it in the failures in https://github.com/quarkiverse/quarkus-github-app/actions/runs/9153770878/job/25163260450 It was using Mandrel-23.1.3.1-Final

from quarkus.

quarkus-bot avatar quarkus-bot commented on August 17, 2024

/cc @Ladicek (arc), @manovotn (arc), @mkouba (arc)

from quarkus.

mkouba avatar mkouba commented on August 17, 2024

Hm, do we know why it only happens occasionally?

An easier fix here might be to make sure that all the locks inside the LazyValue instances are released.

All the locks should be released. If they're not then we should try to find out why...

We use the same strategy (CHM + LazyValue) for the application context BTW.

FTR previously, the LazyValue used volatile + synchronized but we switched to Lock to become virtual threads-friendly.

from quarkus.

galderz avatar galderz commented on August 17, 2024

If you need locking, I would recommend trying to use a lock that does not keep a strong reference to a Thread. Could something be constructed around thread ids instead? /cc @franz1981

The exception above shows that the lock has not been released:

    java.util.concurrent.locks.ReentrantLock$NonfairSync@69557381: java.util.concurrent.locks.ReentrantLock$NonfairSync@69557381[State = 1, empty q...

The State = 1... part means that a lock has been acquired but it has not yet been released. If it had been released, state would be 0 as per the ReentrantLock/AbstractQueuedSynchronizer implementations.

How is it possible that this happens? Each points-to-analysis iteration discovers objects that are referenced from static objects, e.g. from Arc.INSTANCE, it scans its fields to find new objects..., and during this process an image objects will be created for all these objects. One of the things done when creating image objects is checking for disallowed image heap objects, e.g. Thread.

The chances of the bug here happening are small but it can happen. For the issue to appear, points-to-analysis needs to be try to create image object for the lock inside the LazyValue. The window is quite small since it needs to happen within the window that the lock is held in LazyValue.get():

public T get() {
    // ...

    lock.lock();
    // Between here...
    try {
        if (value == null) {
            value = supplier.get();
        }
        return value;
    } finally {
        lock.unlock();
        // And here
    }
}

I can't think of anything that really can be done at the Quarkus level to protect against this kind of race condition. Avoiding strong references to disallowed instances is the best option I can think of /cc @zakkak

from quarkus.

mkouba avatar mkouba commented on August 17, 2024

I can't think of anything that really can be done at the Quarkus level to protect against this kind of race condition. Avoiding strong references to disallowed instances is the best option I can think of /cc @zakkak

Are you saying that any extension that makes use of ReentrantLock during the STATIC_INIT can break native image compilation? 👀

from quarkus.

dmlloyd avatar dmlloyd commented on August 17, 2024

Locking during static init should be fine, however the lock should be definitely released using finally and should not have some chance to be re-acquired during the "steady state" at the end of native image build. I would think that if this invariant is respected, you should have no trouble.

from quarkus.

galderz avatar galderz commented on August 17, 2024

@mkouba @dmlloyd I don't think the issue is necessarily about the reentrant lock acquire/release, but rather the fact that the reentrant lock is part of the image heap state (via Arc.INSTANCE), and we can't control when Arc.INSTANCE and any references from that will become image heap objects, at which stage the verifications for invalid objects is carried out.

from quarkus.

zakkak avatar zakkak commented on August 17, 2024

and we can't control when Arc.INSTANCE and any references from that will become image heap objects, at which stage the verifications for invalid objects is carried out.

That doesn't sound right to me. Isn't the image heap supposed to get checked/verified after all iterations have been done? I.e. when the heap is in a stable state and there is no build-time code being run anymore?

from quarkus.

dmlloyd avatar dmlloyd commented on August 17, 2024

@galderz is there any way you can get a look at the fields of the exact object that has the strong reference to the thread?

from quarkus.

dmlloyd avatar dmlloyd commented on August 17, 2024

If it is in fact java.util.concurrent.locks.AbstractOwnableSynchronizer.exclusiveOwnerThread then the lock was simply not released for some reason. That field is nulled out on release AFAICT.

from quarkus.

mkouba avatar mkouba commented on August 17, 2024

IOW, the cache in SingletonContext should probably become an immutable hashmap of values that have no lock references.

@galderz This cannot be done. The singleton context is not prepopulated during the STATIC_INIT. Beans are instantiated lazily when first used, i.e. at any time after the application starts.

An easier fix here might be to make sure that all the locks inside the LazyValue instances are released.

"released" as in calling Lock#unlock(), or?

and we can't control when Arc.INSTANCE and any references from that will become image heap objects, at which stage the verifications for invalid objects is carried out.

That doesn't sound right to me. Isn't the image heap supposed to get checked/verified after all iterations have been done? I.e. when the heap is in a stable state and there is no build-time code being run anymore?

@zakkak I would expect this behavior as well.

If it is in fact java.util.concurrent.locks.AbstractOwnableSynchronizer.exclusiveOwnerThread then the lock was simply not released for some reason. That field is nulled out on release AFAICT.

@dmlloyd Yes, it seems to be the java.util.concurrent.locks.AbstractOwnableSynchronizer.exclusiveOwnerThread (at least that's what I read in the original description). And yes, it shoud be nulled out in ReentrantLock.Sync.tryRelease(int).

from quarkus.

galderz avatar galderz commented on August 17, 2024

Thanks for the investigation @zakkak

So to sum up I believe that we need to detect which class initializer is triggering the LazyValue instance "initialization" (i.e. acquires the lock) and ensure that it is either no longer run or it runs before the analysis phase.

I'm not sure how feasible this is, but looking at how objects are added to ComputingCacheContextInstances, these contexts are added as a result of creating CDI beans. In the original exception:

    java.util.concurrent.ConcurrentHashMap$Node@77437480: 9SDuiV8v87J2ZUi6goJuuKNan1k=io.quarkus.arc.impl.LazyValue@57f1e065

9SDuiV8v87J2ZUi6goJuuKNan1k is the bean ID. If the way these beans are added to context can be somehow controlled to avoid the situation you explain, then maybe it can work. But if these beans are added to context as user types are discovered, then it might be hard to do. Since beans are discovered through annotations and we have the jandex for it, in theory all bean discovery should be doable early on and in a single pass.

Using something other than ReentrantLock in LazyValue might be easier to implement/achieve, but the Thread issue here could manifest in some other way with say, another forbidden type, so I can see the value of exploring this further.

from quarkus.

zakkak avatar zakkak commented on August 17, 2024

So far I see that:

  1. io.quarkus.runner.ApplicationImpl is registered for reflection
  2. This triggers an onTypeReachable callback which triggers the class initialization which ends up acquiring the lock
The stack trace at this point looks like this:
get:31, LazyValue (io.quarkus.arc.impl)
beanManager:404, ArcContainerImpl (io.quarkus.arc.impl)
<init>:40, ArcCDIProvider$ArcCDI (io.quarkus.arc.impl)
<init>:23, ArcCDIProvider (io.quarkus.arc.impl)
init:273, ArcContainerImpl (io.quarkus.arc.impl)
initialize:40, Arc (io.quarkus.arc)
initContainer:49, ArcRecorder (io.quarkus.arc.runtime)
deploy_0:-1, ArcProcessor$initializeContainer1770303700 (io.quarkus.deployment.steps)
deploy:-1, ArcProcessor$initializeContainer1770303700 (io.quarkus.deployment.steps)
<clinit>:-1, ApplicationImpl (io.quarkus.runner)
ensureClassInitialized0:-1, Unsafe (jdk.internal.misc)
ensureClassInitialized:1160, Unsafe (jdk.internal.misc)
ensureClassInitialized:177, ClassInitializationSupport (com.oracle.svm.hosted.classinitialization)
computeInitKindAndMaybeInitializeClass:191, AllowAllHostedUsagesClassInitializationSupport (com.oracle.svm.hosted.classinitialization)
computeInitKindAndMaybeInitializeClass:129, AllowAllHostedUsagesClassInitializationSupport (com.oracle.svm.hosted.classinitialization)
maybeInitializeAtBuildTime:161, ClassInitializationSupport (com.oracle.svm.hosted.classinitialization)
maybeInitializeAtBuildTime:150, ClassInitializationSupport (com.oracle.svm.hosted.classinitialization)
onTypeReachable:310, SVMHost (com.oracle.svm.hosted)
onTypeReachable:699, AnalysisUniverse (com.oracle.graal.pointsto.meta)
lambda$new$0:310, AnalysisType (com.oracle.graal.pointsto.meta)
run:-1, AnalysisType$$Lambda/0x0000000801102dd0 (com.oracle.graal.pointsto.meta)
call:572, Executors$RunnableAdapter (java.util.concurrent)
run:317, FutureTask (java.util.concurrent)
ensureDone:63, AnalysisFuture (com.oracle.graal.pointsto.util)
ensureOnTypeReachableTaskDone:696, AnalysisType (com.oracle.graal.pointsto.meta)
onReachable:590, AnalysisType (com.oracle.graal.pointsto.meta)
run:-1, AnalysisType$$Lambda/0x0000000801103660 (com.oracle.graal.pointsto.meta)
atomicSetAndRun:49, AtomicUtils (com.oracle.graal.pointsto.util)
lambda$registerAsReachable$8:562, AnalysisType (com.oracle.graal.pointsto.meta)
accept:-1, AnalysisType$$Lambda/0x0000000801103430 (com.oracle.graal.pointsto.meta)
forAllSuperTypes:676, AnalysisType (com.oracle.graal.pointsto.meta)
forAllSuperTypes:659, AnalysisType (com.oracle.graal.pointsto.meta)
forAllSuperTypes:655, AnalysisType (com.oracle.graal.pointsto.meta)
registerAsReachable:562, AnalysisType (com.oracle.graal.pointsto.meta)
registerClass:202, ReflectionDataBuilder (com.oracle.svm.hosted.reflect)
lambda$register$1:171, ReflectionDataBuilder (com.oracle.svm.hosted.reflect)
run:-1, ReflectionDataBuilder$$Lambda/0x0000000801133b70 (com.oracle.svm.hosted.reflect)
executeCommand:187, CompletionExecutor (com.oracle.graal.pointsto.util)
lambda$executeService$0:171, CompletionExecutor (com.oracle.graal.pointsto.util)
run:-1, CompletionExecutor$$Lambda/0x00000008011b1168 (com.oracle.graal.pointsto.util)
exec:1423, ForkJoinTask$RunnableExecuteAction (java.util.concurrent)
doExec:387, ForkJoinTask (java.util.concurrent)
externalHelpQuiesce:2260, ForkJoinPool (java.util.concurrent)
helpQuiescePool:2297, ForkJoinPool (java.util.concurrent)
awaitQuiescence:3565, ForkJoinPool (java.util.concurrent)
complete:237, CompletionExecutor (com.oracle.graal.pointsto.util)
doTypeflow:529, PointsToAnalysis (com.oracle.graal.pointsto)
finish:517, PointsToAnalysis (com.oracle.graal.pointsto)
runAnalysis:162, AbstractAnalysisEngine (com.oracle.graal.pointsto)
runPointsToAnalysis:783, NativeImageGenerator (com.oracle.svm.hosted)
doRun:592, NativeImageGenerator (com.oracle.svm.hosted)
run:550, NativeImageGenerator (com.oracle.svm.hosted)
buildImage:538, NativeImageGeneratorRunner (com.oracle.svm.hosted)
build:720, NativeImageGeneratorRunner (com.oracle.svm.hosted)
start:142, NativeImageGeneratorRunner (com.oracle.svm.hosted)
main:97, NativeImageGeneratorRunner (com.oracle.svm.hosted)
Note that I even managed to get the breakpoint to hit while "main" was indeed holding the lock:
lock = {ReentrantLock@21401} "java.util.concurrent.locks.ReentrantLock@575d8c53[Locked by thread main]"
 sync = {ReentrantLock$NonfairSync@22299} "java.util.concurrent.locks.ReentrantLock$NonfairSync@31c5a544[State = 1, empty queue]"
  head = null
  tail = null
  state = 1
  exclusiveOwnerThread = {Thread@1} "Thread[#1,main,5,main]"
   eetop = 140043113163408
   tid = 1
   name = "main"
   interrupted = false
   contextClassLoader = {NativeImageClassLoader@22303} 
   inheritedAccessControlContext = {AccessControlContext@22304} 
   holder = {Thread$FieldHolder@22305} 
   threadLocals = {ThreadLocal$ThreadLocalMap@22306} 
   inheritableThreadLocals = null
   scopedValueBindings = {Class@885} "class java.lang.Thread"
   interruptLock = {Object@22307} 
   parkBlocker = null
   nioBlocker = null
   cont = null
   uncaughtExceptionHandler = null
   threadLocalRandomSeed = 6170759035374141833
   threadLocalRandomProbe = -1463423431
   threadLocalRandomSecondarySeed = 1232312558
   container = null
   headStackableScopes = null

Now at this point, if io.quarkus.arc.Arc.container happens to get parsed (by another thread) while main still has the lock it could lead to the observed error.

@Galder what Mandrel/GraalVM version were you able to reproduce this with? If you can only reproduce with master builds, it could be oracle/graal@9d52cb9 causing this.

from quarkus.

zakkak avatar zakkak commented on August 17, 2024

OK, but the trace in #40737 (comment) is not from Mandrel-23.1.3.1-Final, right? How did you get that trace? Did you use a synthetic "reproducer"?

from quarkus.

zakkak avatar zakkak commented on August 17, 2024

Another case where class initialization happens during Analysis, and might thus result in what we see above, is when a type not already initialized is found to be reachable by the analysis. In this case class initialization is triggered by whichever thread happened to detect the reachable type, and the initialization runs in parallel with other callbacks which could lead to the observed behavior...

To add some more info to this...

Although a class initialization might get triggered this way GraalVM tries to ensure that it won't have any side effects before doing so. According to these checks if a class initializer accesses any static field from a different class the class initialization is postponed for run-time, so accessing the reentrant lock should not be possible this way, unless there is some bug in the checks and the code escapes them (but we can't tell with the current data).

Update: the above seems to hold only for classes not marked explicitly for build time initialization so it doesn't apply in our case...

Interestingly, when registering all classes for build time initialization no class initializers appear to run at the "Initializing" phase, they run in the "Analysis" phase instead when marked as reachable...

from quarkus.

galderz avatar galderz commented on August 17, 2024

OK, but the trace in #40737 (comment) is not from Mandrel-23.1.3.1-Final, right? How did you get that trace? Did you use a synthetic "reproducer"?

Ah yes, this particular comment was produced on master, not 23.1.3.1.

from quarkus.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.