Skip to content

Deadlock on LazyStorage when code inside ClassLoader is instrumented and the initialization of LazyStorage is happening on a different thread #8434

@edwardxia

Description

@edwardxia

Describe the bug

We have a custom instrumentation for X509 certificates object, and this instrumentation happened to be loaded during ClassLoader.

Once the lock inside ClassLoader is taken on main thread, the instrumentation attempts to create a span, which access LazyStorage. However, at the same time, on the Notification Thread where otel is handling gc notifications, it's trying to create a metrics that is initializing the LazyStorage, however, because the main thread already taken the ClassLoader lock, LazyStorage cannot be initialized.

In other words:

  • main holds lock, and wait for LazyStorage to initialize.
  • Notification Thread is initializing LazyStorage, but it's blocked by the lock that held by main thread.

Please see the thread dump attached at the bottom for more detail.

Steps to reproduce
It's a timing issue so it's not always guaranteed to happen. As described above, to trigger the problem it requires an instrumentation that happens inside the ClassLoader code path after the lock on ZipFile is taken.

What did you expect to see?
It should work.

What did you see instead?
It got into a deadlock.

What version and what artifacts are you using?
Artifacts: opentelemetry-api/sdk/javaagent/custom extensions
Version: 1.62.0/2.28.1

Environment
Compiler: OpenJDK 21
OS: Linux

Additional context

Here is a sample thread dump of the deadlock:

"main" #1 prio=5 os_prio=0 cpu=2374.90ms elapsed=1121.98s tid=0x00007f4288032830 nid=0xd5 in Object.wait()  [0x00007f429238b000]
   java.lang.Thread.State: RUNNABLE
	at io.opentelemetry.javaagent.shaded.io.opentelemetry.context.ContextStorage.get(ContextStorage.java:72)
	- waiting on the Class initialization monitor for io.opentelemetry.javaagent.shaded.io.opentelemetry.context.LazyStorage
	at io.opentelemetry.javaagent.shaded.io.opentelemetry.context.Context.root(Context.java:105)
	at io.opentelemetry.sdk.trace.SdkSpanBuilder.setNoParent(SdkSpanBuilder.java:82)
	at io.opentelemetry.sdk.trace.ExtendedSdkSpanBuilder.setNoParent(ExtendedSdkSpanBuilder.java:46)
	at io.opentelemetry.sdk.trace.ExtendedSdkSpanBuilder.setNoParent(ExtendedSdkSpanBuilder.java:28)
	at sun.security.x509.X509CertImpl.<init>(java.base@17.0.17/X509CertImpl.java:176)
	at sun.security.provider.X509Factory.engineGenerateCertificate(java.base@17.0.17/X509Factory.java:102)
	at java.security.cert.CertificateFactory.generateCertificate(java.base@17.0.17/CertificateFactory.java:355)
	at sun.security.pkcs.PKCS7.parseSignedData(java.base@17.0.17/PKCS7.java:305)
	at sun.security.pkcs.PKCS7.parse(java.base@17.0.17/PKCS7.java:162)
	at sun.security.pkcs.PKCS7.parse(java.base@17.0.17/PKCS7.java:126)
	at sun.security.pkcs.PKCS7.<init>(java.base@17.0.17/PKCS7.java:108)
	at sun.security.util.SignatureFileVerifier.<init>(java.base@17.0.17/SignatureFileVerifier.java:125)
	at java.util.jar.JarVerifier.processEntry(java.base@17.0.17/JarVerifier.java:311)
	at java.util.jar.JarVerifier.update(java.base@17.0.17/JarVerifier.java:242)
	at java.util.jar.JarFile.initializeVerifier(java.base@17.0.17/JarFile.java:762)
	at java.util.jar.JarFile.ensureInitialization(java.base@17.0.17/JarFile.java:1038)
	- locked <0x0000000700e3c760> (a java.util.jar.JarFile)
	at java.util.jar.JavaUtilJarAccessImpl.ensureInitialization(java.base@17.0.17/JavaUtilJarAccessImpl.java:72)
	at jdk.internal.loader.URLClassPath$JarLoader$2.getManifest(java.base@17.0.17/URLClassPath.java:885)
	at jdk.internal.loader.BuiltinClassLoader.defineClass(java.base@17.0.17/BuiltinClassLoader.java:848)
	at jdk.internal.loader.BuiltinClassLoader.findClassOnClassPathOrNull(java.base@17.0.17/BuiltinClassLoader.java:760)
	at jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(java.base@17.0.17/BuiltinClassLoader.java:681)
	- locked <0x00000007f42a5478> (a java.lang.Object)
	at jdk.internal.loader.BuiltinClassLoader.loadClass(java.base@17.0.17/BuiltinClassLoader.java:639)
	at jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(java.base@17.0.17/ClassLoaders.java:188)
	at java.lang.ClassLoader.loadClass(java.base@17.0.17/ClassLoader.java:525)
	at java.lang.Class.getDeclaredMethods0(java.base@17.0.17/Native Method)
	at java.lang.Class.privateGetDeclaredMethods(java.base@17.0.17/Class.java:3404)
	at java.lang.Class.getMethodsRecursive(java.base@17.0.17/Class.java:3545)
	at java.lang.Class.getMethod0(java.base@17.0.17/Class.java:3531)
	at java.lang.Class.getMethod(java.base@17.0.17/Class.java:2227)
	at sun.launcher.LauncherHelper.validateMainClass(java.base@17.0.17/LauncherHelper.java:867)
	at sun.launcher.LauncherHelper.checkAndLoadMain(java.base@17.0.17/LauncherHelper.java:726)

"Notification Thread" #18 daemon prio=9 os_prio=0 cpu=40.11ms elapsed=1112.49s tid=0x00007f428b3de4b0 nid=0x151 waiting for monitor entry  [0x00007f42203d7000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at java.util.zip.ZipFile.getEntry(java.base@17.0.17/ZipFile.java:337)
	- waiting to lock <0x0000000700e3c760> (a java.util.jar.JarFile)
	at java.util.jar.JarFile.getEntry(java.base@17.0.17/JarFile.java:518)
	at java.util.jar.JarFile.getJarEntry(java.base@17.0.17/JarFile.java:473)
	at jdk.internal.loader.URLClassPath$JarLoader.getResource(java.base@17.0.17/URLClassPath.java:956)
	at jdk.internal.loader.URLClassPath$JarLoader.findResource(java.base@17.0.17/URLClassPath.java:939)
	at jdk.internal.loader.URLClassPath$1.next(java.base@17.0.17/URLClassPath.java:348)
	at jdk.internal.loader.URLClassPath$1.hasMoreElements(java.base@17.0.17/URLClassPath.java:358)
	at jdk.internal.loader.BuiltinClassLoader$1.hasNext(java.base@17.0.17/BuiltinClassLoader.java:408)
	at jdk.internal.loader.BuiltinClassLoader$1.hasMoreElements(java.base@17.0.17/BuiltinClassLoader.java:416)
	at java.lang.CompoundEnumeration.next(java.base@17.0.17/ClassLoader.java:2730)
	at java.lang.CompoundEnumeration.hasMoreElements(java.base@17.0.17/ClassLoader.java:2739)
	at java.util.ServiceLoader$LazyClassPathLookupIterator.nextProviderClass(java.base@17.0.17/ServiceLoader.java:1210)
	at java.util.ServiceLoader$LazyClassPathLookupIterator.hasNextService(java.base@17.0.17/ServiceLoader.java:1228)
	at java.util.ServiceLoader$LazyClassPathLookupIterator.hasNext(java.base@17.0.17/ServiceLoader.java:1273)
	at java.util.ServiceLoader$2.hasNext(java.base@17.0.17/ServiceLoader.java:1309)
	at java.util.ServiceLoader$3.hasNext(java.base@17.0.17/ServiceLoader.java:1393)
	at io.opentelemetry.javaagent.shaded.io.opentelemetry.context.LazyStorage.createStorage(LazyStorage.java:109)
	at io.opentelemetry.javaagent.shaded.io.opentelemetry.context.LazyStorage.<clinit>(LazyStorage.java:80)
	at io.opentelemetry.javaagent.shaded.io.opentelemetry.context.ContextStorage.get(ContextStorage.java:72)
	at io.opentelemetry.javaagent.shaded.io.opentelemetry.context.Context.current(Context.java:92)
	at io.opentelemetry.sdk.metrics.SdkDoubleHistogram.record(SdkDoubleHistogram.java:56)
	at io.opentelemetry.javaagent.shaded.instrumentation.runtimetelemetry.internal.GarbageCollector$GcNotificationListener.handleNotification(GarbageCollector.java:136)
	at sun.management.NotificationEmitterSupport.sendNotification(java.management@17.0.17/NotificationEmitterSupport.java:155)
	at com.sun.management.internal.GarbageCollectorExtImpl.createGCNotification(jdk.management@17.0.17/GarbageCollectorExtImpl.java:115)

Note: the above was thread dump for 1.60.0, I was able to workaround the issue by not using setNoParent to bypass access to Context.

Now with 1.62.0 it's locked at a different location due to a recent change that makes a call into Context.root(): 74964c6

With this recent change, our workaround no longer works and it stuck at Context.root():

	io.opentelemetry.javaagent.shaded.io.opentelemetry.context.ContextStorage.get(ContextStorage.java:72)
	io.opentelemetry.javaagent.shaded.io.opentelemetry.context.Context.root(Context.java:105)
	io.opentelemetry.sdk.trace.SdkSpanBuilder.<clinit>(SdkSpanBuilder.java:50)
	io.opentelemetry.sdk.trace.IncubatingUtil.createExtendedSpanBuilder(IncubatingUtil.java:32)
	io.opentelemetry.sdk.trace.SdkTracer.spanBuilder(SdkTracer.java:75)
	io.opentelemetry.sdk.trace.ExtendedSdkTracer.spanBuilder(ExtendedSdkTracer.java:25)
	io.opentelemetry.sdk.trace.ExtendedSdkTracer.spanBuilder(ExtendedSdkTracer.java:14)
	sun.security.x509.X509CertImpl.<init>(X509CertImpl.java:176)

Metadata

Metadata

Assignees

No one assigned

    Labels

    BugSomething isn't working

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions