Describe the bug
We have a custom instrumentation for X509 certificates object, and this instrumentation happened to be loaded during ClassLoader.
Once the lock inside ClassLoader is taken on main thread, the instrumentation attempts to create a span, which access LazyStorage. However, at the same time, on the Notification Thread where otel is handling gc notifications, it's trying to create a metrics that is initializing the LazyStorage, however, because the main thread already taken the ClassLoader lock, LazyStorage cannot be initialized.
In other words:
main holds lock, and wait for LazyStorage to initialize.
Notification Thread is initializing LazyStorage, but it's blocked by the lock that held by main thread.
Please see the thread dump attached at the bottom for more detail.
Steps to reproduce
It's a timing issue so it's not always guaranteed to happen. As described above, to trigger the problem it requires an instrumentation that happens inside the ClassLoader code path after the lock on ZipFile is taken.
What did you expect to see?
It should work.
What did you see instead?
It got into a deadlock.
What version and what artifacts are you using?
Artifacts: opentelemetry-api/sdk/javaagent/custom extensions
Version: 1.62.0/2.28.1
Environment
Compiler: OpenJDK 21
OS: Linux
Additional context
Here is a sample thread dump of the deadlock:
"main" #1 prio=5 os_prio=0 cpu=2374.90ms elapsed=1121.98s tid=0x00007f4288032830 nid=0xd5 in Object.wait() [0x00007f429238b000]
java.lang.Thread.State: RUNNABLE
at io.opentelemetry.javaagent.shaded.io.opentelemetry.context.ContextStorage.get(ContextStorage.java:72)
- waiting on the Class initialization monitor for io.opentelemetry.javaagent.shaded.io.opentelemetry.context.LazyStorage
at io.opentelemetry.javaagent.shaded.io.opentelemetry.context.Context.root(Context.java:105)
at io.opentelemetry.sdk.trace.SdkSpanBuilder.setNoParent(SdkSpanBuilder.java:82)
at io.opentelemetry.sdk.trace.ExtendedSdkSpanBuilder.setNoParent(ExtendedSdkSpanBuilder.java:46)
at io.opentelemetry.sdk.trace.ExtendedSdkSpanBuilder.setNoParent(ExtendedSdkSpanBuilder.java:28)
at sun.security.x509.X509CertImpl.<init>(java.base@17.0.17/X509CertImpl.java:176)
at sun.security.provider.X509Factory.engineGenerateCertificate(java.base@17.0.17/X509Factory.java:102)
at java.security.cert.CertificateFactory.generateCertificate(java.base@17.0.17/CertificateFactory.java:355)
at sun.security.pkcs.PKCS7.parseSignedData(java.base@17.0.17/PKCS7.java:305)
at sun.security.pkcs.PKCS7.parse(java.base@17.0.17/PKCS7.java:162)
at sun.security.pkcs.PKCS7.parse(java.base@17.0.17/PKCS7.java:126)
at sun.security.pkcs.PKCS7.<init>(java.base@17.0.17/PKCS7.java:108)
at sun.security.util.SignatureFileVerifier.<init>(java.base@17.0.17/SignatureFileVerifier.java:125)
at java.util.jar.JarVerifier.processEntry(java.base@17.0.17/JarVerifier.java:311)
at java.util.jar.JarVerifier.update(java.base@17.0.17/JarVerifier.java:242)
at java.util.jar.JarFile.initializeVerifier(java.base@17.0.17/JarFile.java:762)
at java.util.jar.JarFile.ensureInitialization(java.base@17.0.17/JarFile.java:1038)
- locked <0x0000000700e3c760> (a java.util.jar.JarFile)
at java.util.jar.JavaUtilJarAccessImpl.ensureInitialization(java.base@17.0.17/JavaUtilJarAccessImpl.java:72)
at jdk.internal.loader.URLClassPath$JarLoader$2.getManifest(java.base@17.0.17/URLClassPath.java:885)
at jdk.internal.loader.BuiltinClassLoader.defineClass(java.base@17.0.17/BuiltinClassLoader.java:848)
at jdk.internal.loader.BuiltinClassLoader.findClassOnClassPathOrNull(java.base@17.0.17/BuiltinClassLoader.java:760)
at jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(java.base@17.0.17/BuiltinClassLoader.java:681)
- locked <0x00000007f42a5478> (a java.lang.Object)
at jdk.internal.loader.BuiltinClassLoader.loadClass(java.base@17.0.17/BuiltinClassLoader.java:639)
at jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(java.base@17.0.17/ClassLoaders.java:188)
at java.lang.ClassLoader.loadClass(java.base@17.0.17/ClassLoader.java:525)
at java.lang.Class.getDeclaredMethods0(java.base@17.0.17/Native Method)
at java.lang.Class.privateGetDeclaredMethods(java.base@17.0.17/Class.java:3404)
at java.lang.Class.getMethodsRecursive(java.base@17.0.17/Class.java:3545)
at java.lang.Class.getMethod0(java.base@17.0.17/Class.java:3531)
at java.lang.Class.getMethod(java.base@17.0.17/Class.java:2227)
at sun.launcher.LauncherHelper.validateMainClass(java.base@17.0.17/LauncherHelper.java:867)
at sun.launcher.LauncherHelper.checkAndLoadMain(java.base@17.0.17/LauncherHelper.java:726)
"Notification Thread" #18 daemon prio=9 os_prio=0 cpu=40.11ms elapsed=1112.49s tid=0x00007f428b3de4b0 nid=0x151 waiting for monitor entry [0x00007f42203d7000]
java.lang.Thread.State: BLOCKED (on object monitor)
at java.util.zip.ZipFile.getEntry(java.base@17.0.17/ZipFile.java:337)
- waiting to lock <0x0000000700e3c760> (a java.util.jar.JarFile)
at java.util.jar.JarFile.getEntry(java.base@17.0.17/JarFile.java:518)
at java.util.jar.JarFile.getJarEntry(java.base@17.0.17/JarFile.java:473)
at jdk.internal.loader.URLClassPath$JarLoader.getResource(java.base@17.0.17/URLClassPath.java:956)
at jdk.internal.loader.URLClassPath$JarLoader.findResource(java.base@17.0.17/URLClassPath.java:939)
at jdk.internal.loader.URLClassPath$1.next(java.base@17.0.17/URLClassPath.java:348)
at jdk.internal.loader.URLClassPath$1.hasMoreElements(java.base@17.0.17/URLClassPath.java:358)
at jdk.internal.loader.BuiltinClassLoader$1.hasNext(java.base@17.0.17/BuiltinClassLoader.java:408)
at jdk.internal.loader.BuiltinClassLoader$1.hasMoreElements(java.base@17.0.17/BuiltinClassLoader.java:416)
at java.lang.CompoundEnumeration.next(java.base@17.0.17/ClassLoader.java:2730)
at java.lang.CompoundEnumeration.hasMoreElements(java.base@17.0.17/ClassLoader.java:2739)
at java.util.ServiceLoader$LazyClassPathLookupIterator.nextProviderClass(java.base@17.0.17/ServiceLoader.java:1210)
at java.util.ServiceLoader$LazyClassPathLookupIterator.hasNextService(java.base@17.0.17/ServiceLoader.java:1228)
at java.util.ServiceLoader$LazyClassPathLookupIterator.hasNext(java.base@17.0.17/ServiceLoader.java:1273)
at java.util.ServiceLoader$2.hasNext(java.base@17.0.17/ServiceLoader.java:1309)
at java.util.ServiceLoader$3.hasNext(java.base@17.0.17/ServiceLoader.java:1393)
at io.opentelemetry.javaagent.shaded.io.opentelemetry.context.LazyStorage.createStorage(LazyStorage.java:109)
at io.opentelemetry.javaagent.shaded.io.opentelemetry.context.LazyStorage.<clinit>(LazyStorage.java:80)
at io.opentelemetry.javaagent.shaded.io.opentelemetry.context.ContextStorage.get(ContextStorage.java:72)
at io.opentelemetry.javaagent.shaded.io.opentelemetry.context.Context.current(Context.java:92)
at io.opentelemetry.sdk.metrics.SdkDoubleHistogram.record(SdkDoubleHistogram.java:56)
at io.opentelemetry.javaagent.shaded.instrumentation.runtimetelemetry.internal.GarbageCollector$GcNotificationListener.handleNotification(GarbageCollector.java:136)
at sun.management.NotificationEmitterSupport.sendNotification(java.management@17.0.17/NotificationEmitterSupport.java:155)
at com.sun.management.internal.GarbageCollectorExtImpl.createGCNotification(jdk.management@17.0.17/GarbageCollectorExtImpl.java:115)
Note: the above was thread dump for 1.60.0, I was able to workaround the issue by not using setNoParent to bypass access to Context.
Now with 1.62.0 it's locked at a different location due to a recent change that makes a call into Context.root(): 74964c6
With this recent change, our workaround no longer works and it stuck at Context.root():
io.opentelemetry.javaagent.shaded.io.opentelemetry.context.ContextStorage.get(ContextStorage.java:72)
io.opentelemetry.javaagent.shaded.io.opentelemetry.context.Context.root(Context.java:105)
io.opentelemetry.sdk.trace.SdkSpanBuilder.<clinit>(SdkSpanBuilder.java:50)
io.opentelemetry.sdk.trace.IncubatingUtil.createExtendedSpanBuilder(IncubatingUtil.java:32)
io.opentelemetry.sdk.trace.SdkTracer.spanBuilder(SdkTracer.java:75)
io.opentelemetry.sdk.trace.ExtendedSdkTracer.spanBuilder(ExtendedSdkTracer.java:25)
io.opentelemetry.sdk.trace.ExtendedSdkTracer.spanBuilder(ExtendedSdkTracer.java:14)
sun.security.x509.X509CertImpl.<init>(X509CertImpl.java:176)
Describe the bug
We have a custom instrumentation for X509 certificates object, and this instrumentation happened to be loaded during ClassLoader.
Once the lock inside ClassLoader is taken on
mainthread, the instrumentation attempts to create a span, which accessLazyStorage. However, at the same time, on theNotification Threadwhere otel is handling gc notifications, it's trying to create a metrics that is initializing theLazyStorage, however, because the main thread already taken the ClassLoader lock,LazyStoragecannot be initialized.In other words:
mainholds lock, and wait forLazyStorageto initialize.Notification Threadis initializingLazyStorage, but it's blocked by the lock that held bymainthread.Please see the thread dump attached at the bottom for more detail.
Steps to reproduce
It's a timing issue so it's not always guaranteed to happen. As described above, to trigger the problem it requires an instrumentation that happens inside the ClassLoader code path after the lock on ZipFile is taken.
What did you expect to see?
It should work.
What did you see instead?
It got into a deadlock.
What version and what artifacts are you using?
Artifacts: opentelemetry-api/sdk/javaagent/custom extensions
Version: 1.62.0/2.28.1
Environment
Compiler: OpenJDK 21
OS: Linux
Additional context
Here is a sample thread dump of the deadlock:
Note: the above was thread dump for 1.60.0, I was able to workaround the issue by not using
setNoParentto bypass access to Context.Now with 1.62.0 it's locked at a different location due to a recent change that makes a call into
Context.root(): 74964c6With this recent change, our workaround no longer works and it stuck at
Context.root():