java2graph Architecture & Design

java2graph is a parallelized command-line interface (CLI) tool that parses Java source code and its compiled dependencies into a rich, queryable representation of the codebase. It extracts structural information (classes, interfaces, methods, lambdas) and behavioral relationships (inheritance, method definitions, method calls) and exports them both to CSV files and to LadybugDB, an embedded columnar graph database.

This document outlines the system architecture, the data processing pipeline, the graph schema, and how to interact with the generated dataset.


1. High-Level Architecture

The CLI is built using a Pass-Based, Multi-Threaded Pipeline Architecture. Because parsing and resolving large codebases can be extremely computationally expensive, the work is divided into distinct, isolated passes. The passes operate on a shared context object (GraphContext) using concurrent data structures (ConcurrentHashMap, ConcurrentLinkedQueue) to ensure safe parallel execution.
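As a rough illustration of this design, the sketch below runs a list of passes in order against one shared, thread-safe context. The names GraphContextSketch, PassSketch, and PipelineSketch, and the use of plain strings for nodes and edges, are assumptions made for the example and are not taken from the java2graph source.

```java
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

/** Shared state handed from pass to pass; field names are illustrative only. */
final class GraphContextSketch {
    // Keyed by fully qualified name so parallel workers can insert without explicit locks.
    final Map<String, String> classNodes = new ConcurrentHashMap<>();
    // Edges are append-only, so a lock-free queue is sufficient.
    final Queue<String> edges = new ConcurrentLinkedQueue<>();
}

/** One stage of the pipeline (hypothetical interface). */
interface PassSketch {
    void run(GraphContextSketch ctx);
}

final class PipelineSketch {
    /** Runs the passes one after another; each pass may parallelize internally. */
    static void runAll(GraphContextSketch ctx, List<PassSketch> passes) {
        for (PassSketch pass : passes) {
            pass.run(ctx);
        }
    }

    public static void main(String[] args) {
        GraphContextSketch ctx = new GraphContextSketch();
        // First pass writes nodes from several threads at once.
        PassSketch collect = c -> List.of("a.Foo", "a.Bar").parallelStream()
                .forEach(fqn -> c.classNodes.put(fqn, "class"));
        // Second pass only starts after the first has finished.
        PassSketch link = c -> c.edges.add("a.Foo EXTENDS a.Bar");
        runAll(ctx, List.of(collect, link));
        System.out.println(ctx.classNodes.size() + " nodes, " + ctx.edges.size() + " edges");
    }
}
```

Keeping the passes sequential while allowing parallelism inside each one means a later pass can rely on everything an earlier pass wrote to the context.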

The core technologies driving the tool are:

  • JavaParser & JavaSymbolSolver: For constructing Abstract Syntax Trees (ASTs) and performing semantic resolution (binding method calls to actual definitions, determining lambda signatures, and resolving inheritance across source files and JARs).
  • Lombok (delombok): For dynamic preprocessing of annotations to ensure accurate structural extraction without requiring the user to pre-compile the source with specific plugins.
  • LadybugDB: An embedded, serverless, columnar graph database optimized for complex analytical graph queries.
  • Picocli: For a rich, POSIX-compliant command-line interface.

2. The Processing Pipeline

The execution flow of the application is orchestrated in the Main class, which initializes the Java2GraphConfig and GraphContext and then sequentially executes an array of passes:

Pass 1: DelombokPass

Purpose: Preprocess source code to expand Lombok annotations (e.g., @Data, @Getter, @Builder) into standard Java boilerplate (getters, setters, constructors).

  • Why? JavaParser operates on the raw AST. Without delomboking, implicit methods generated by Lombok are invisible to the parser, leading to unresolvable method calls downstream.
  • How: The pass invokes the lombok.launch.Main engine via reflection (to bypass Java 11+ module encapsulation rules) and writes the expanded source code to a temporary directory. The application configuration's source directory is dynamically updated to point to this temporary directory for all subsequent passes.
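The two mechanics described here, calling a tool's main method reflectively and repointing the configuration at a temporary output directory, can be sketched with the JDK alone. Everything below is hypothetical: Config, EchoTool, and the method names are invented for the example, EchoTool merely stands in for the Lombok entry point so the code runs without Lombok on the classpath, and the argument layout follows Lombok's command-line form (delombok, source directory, -d, output directory), which may differ from what the actual pass sends.

```java
import java.io.IOException;
import java.lang.reflect.Method;
import java.nio.file.Files;
import java.nio.file.Path;

final class DelombokPassSketch {
    /** Hypothetical mutable config holding the directory that later passes read from. */
    static final class Config {
        Path sourceDir;
        Config(Path sourceDir) { this.sourceDir = sourceDir; }
    }

    /** Stand-in for the real tool; it only records the arguments it receives. */
    public static final class EchoTool {
        static String[] lastArgs;
        public static void main(String[] args) { lastArgs = args; }
    }

    /** Looks up a class by name at run time and calls its static main(String[]). */
    static void invokeMain(String className, String... args) throws ReflectiveOperationException {
        Method main = Class.forName(className).getMethod("main", String[].class);
        main.invoke(null, (Object) args); // the cast passes args as a single String[] parameter
    }

    /** Runs the preprocessor into a fresh temp directory, then repoints the config at it. */
    static void run(Config config, String toolClass) throws IOException, ReflectiveOperationException {
        Path out = Files.createTempDirectory("delombok-");
        invokeMain(toolClass, "delombok", config.sourceDir.toString(), "-d", out.toString());
        config.sourceDir = out; // subsequent passes now read the expanded sources
    }

    public static void main(String[] args) throws Exception {
        Config config = new Config(Path.of("src/main/java"));
        run(config, EchoTool.class.getName());
        System.out.println("sources now read from " + config.sourceDir);
    }
}
```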

Pass 2: ParsePass

Purpose: Convert raw Java source files into Abstract Syntax Trees (ASTs).

  • How: It configures a CombinedTypeSolver that combines:
    • ReflectionTypeSolver (for standard Java library classes).
    • JavaParserTypeSolver (for the project's source code).
    • JarTypeSolver (dynamically added for every .jar file provided via the CLI arguments).
  • Parallelism: Uses Files.walk to find all .java files and processes them using a parallelStream(). The resulting CompilationUnit objects are stored in the concurrent GraphContext.
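A minimal sketch of the file discovery and parallel processing step follows, using only the JDK. The parse method is a placeholder that reads the file text instead of building a real CompilationUnit, and the class and method names are assumptions made for the example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class ParsePassSketch {
    /** Placeholder for a real parser call: the "AST" here is just the file's text. */
    static String parse(Path file) {
        try {
            return Files.readString(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Finds every .java file under root and processes the files in parallel. */
    static Map<Path, String> parseAll(Path root) throws IOException {
        List<Path> javaFiles;
        try (Stream<Path> walk = Files.walk(root)) { // close directory handles promptly
            javaFiles = walk.filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString().endsWith(".java"))
                    .collect(Collectors.toList());
        }
        Map<Path, String> units = new ConcurrentHashMap<>(); // safe for concurrent put
        javaFiles.parallelStream().forEach(f -> units.put(f, parse(f)));
        return units;
    }

    public static void main(String[] args) throws IOException {
        Path root = Path.of(args.length > 0 ? args[0] : ".");
        System.out.println(parseAll(root).size() + " source files parsed");
    }
}
```

Collecting the paths into a list before going parallel gives the stream a known size to split evenly, which a lazily walked directory stream cannot offer.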

Pass 3: ResolvePass

Purpose: Traverse the ASTs to extract structural nodes and establish semantic edges (relationships) between them.

  • How: It utilizes the Visitor Pattern (VoidVisitorAdapter) to traverse every CompilationUnit. For every relevant node (Classes, Methods, Object Creations, Method Calls), it invokes the JavaSymbolSolver to resolve the Fully Qualified Name (FQN).
  • Key extractions:
    • Classes/Interfaces: Captures FQN, name, and raw declaration code. Detects EXTENDS and IMPLEMENTS edges by resolving extended/implemented types.
    • Methods & Constructors: Captures signature, source code, and links them to their containing class.
    • Lambdas: Dynamically generates unique IDs (based on line numbers and containing scopes) for lambda expressions, treating them as first-class methods in the graph.
    • Method Calls: Resolves the caller's context and the target method's exact signature (even across JAR boundaries) to create a MethodCallEdge.
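Because lambdas have no name of their own, the ID has to be synthesized. The sketch below shows one way to derive an ID from the enclosing scope and the line number, with an ordinal to keep two lambdas on the same line distinct. The ID format and the class name LambdaIdSketch are hypothetical; the document does not specify the scheme java2graph actually uses.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

final class LambdaIdSketch {
    // Counts lambdas already seen per (scope, line) so two on one line stay distinct.
    private final Map<String, AtomicInteger> seen = new ConcurrentHashMap<>();

    /** Builds an ID from the enclosing method's FQN and the lambda's line (format is hypothetical). */
    String idFor(String enclosingMethodFqn, int line) {
        String base = enclosingMethodFqn + "$lambda@L" + line;
        int ordinal = seen.computeIfAbsent(base, k -> new AtomicInteger()).getAndIncrement();
        return base + "#" + ordinal;
    }

    public static void main(String[] args) {
        LambdaIdSketch ids = new LambdaIdSketch();
        // Two lambdas on line 42 of the same method receive ordinals 0 and 1.
        System.out.println(ids.idFor("com.acme.Foo.bar()", 42));
        System.out.println(ids.idFor("com.acme.Foo.bar()", 42));
    }
}
```

The ordinals are only reproducible between runs if the lambdas of a given method are visited in a fixed order, for example by one visitor walking each compilation unit sequentially.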

Pass 4: ExportPass

Purpose: Persist the in-memory graph structures to disk.

  • CSV Export: Uses Apache Commons CSV to quickly dump the raw nodes and edges into relational .csv files for traditional data processing.
  • LadybugDB Export: Initializes an embedded Ladybug database. It defines a strict property graph schema (Node Tables and Rel Tables) and uses PreparedStatements to efficiently batch-insert the nodes and edges from the GraphContext directly into the columnar storage engine.
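Fields such as sourceCode and declarationCode contain commas, quotes, and line breaks, so the CSV writer has to quote them correctly. The sketch below shows the quoting rule in plain JDK code purely for illustration; it is not the Apache Commons CSV API the tool uses, and the class and method names are invented for the example.

```java
import java.util.List;
import java.util.stream.Collectors;

final class CsvExportSketch {
    /** Quotes a field when it contains a comma, quote, or line break (RFC 4180 style). */
    static String escape(String field) {
        boolean needsQuotes = field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r");
        if (!needsQuotes) {
            return field;
        }
        // Embedded quotes are doubled, then the whole field is wrapped in quotes.
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    /** Joins one record's fields into a single CSV line. */
    static String toLine(List<String> fields) {
        return fields.stream().map(CsvExportSketch::escape).collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        System.out.println(toLine(List.of("id", "name", "sourceCode")));
        System.out.println(toLine(List.of("a.Foo.bar()", "bar", "void bar() { log(\"a, b\"); }")));
    }
}
```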

3. Data Models & Graph Schema

The GraphContext relies on intermediate POJOs (ClassNode, MethodNode, InheritanceEdge, MethodCallEdge). When persisted to LadybugDB, these map directly to the following Cypher Schema:

Node Tables

1. Class Node: Represents a Java class or interface.

  • id (STRING) - Primary Key: The Fully Qualified Name (FQN).
  • fqn (STRING): The Fully Qualified Name.
  • name (STRING): The short name of the class.
  • isInterface (BOOLEAN): True if it is an interface.
  • declarationCode (STRING): The source code of the class declaration block.

2. Method Node: Represents a standard method, constructor, or lambda expression.

  • id (STRING) - Primary Key: The unique signature/FQN of the method.
  • fqn (STRING): The unique signature/FQN.
  • name (STRING): The short name of the method (or "lambda").
  • signature (STRING): The method signature parameters.
  • sourceCode (STRING): The full source code block of the method body.
  • isLambda (BOOLEAN): True if the method is an extracted lambda expression.

Relationship (Edge) Tables

  • Extends (Class -> Class): Indicates that the source Class extends the target Class.
  • Implements (Class -> Class): Indicates that the source Class implements the target interface, which is itself stored as a Class node with isInterface set to true.
  • Defines (Class -> Method): Indicates that a Class encapsulates a specific Method.
  • Calls (Method -> Method): Indicates that the source Method's body contains an invocation of the target Method.
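The intermediate objects that back these tables might look roughly like the Java records below. The node fields follow the schema listed above; the edge field names, the factory method, and the rule for deriving the short name from the FQN are assumptions made for this sketch, and the real POJOs may be ordinary classes rather than records.

```java
final class SchemaSketch {
    /** Mirrors the Class node table; id and fqn hold the same value. */
    record ClassNode(String id, String fqn, String name, boolean isInterface, String declarationCode) {
        /** Hypothetical factory: derives the short name from the text after the last dot. */
        static ClassNode of(String fqn, boolean isInterface, String declarationCode) {
            String shortName = fqn.substring(fqn.lastIndexOf('.') + 1);
            return new ClassNode(fqn, fqn, shortName, isInterface, declarationCode);
        }
    }

    /** Mirrors the Method node table, including extracted lambdas. */
    record MethodNode(String id, String fqn, String name, String signature,
                      String sourceCode, boolean isLambda) {}

    /** Backs both Extends and Implements (field names are hypothetical). */
    record InheritanceEdge(String sourceClassId, String targetClassId, boolean implementsInterface) {}

    /** Backs the Calls relationship (field names are hypothetical). */
    record MethodCallEdge(String callerMethodId, String calleeMethodId) {}

    public static void main(String[] args) {
        ClassNode repo = ClassNode.of("com.acme.UserRepository", false, "public class UserRepository");
        InheritanceEdge edge = new InheritanceEdge(repo.id(), "java.io.Serializable", true);
        System.out.println(repo.name() + " -> " + edge.targetClassId());
    }
}
```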

4. LadybugDB Example Cypher Queries

Once the database is generated (e.g., in a directory named my-graph.db), you can connect to it using LadybugDB's CLI or client libraries to perform deep architectural analysis.

Here are a few examples of what you can discover:

1. Find the largest classes (God Object candidates): Find the classes that define the most methods.

MATCH (c:Class)-[:Defines]->(m:Method)
RETURN c.name, COUNT(m) AS methodCount
ORDER BY methodCount DESC
LIMIT 10;

2. Impact Analysis (Reverse Call Graph): If the saveUser method in UserRepository changes, which methods call it directly or transitively? The variable-length path follows up to three Calls hops; raise the upper bound to look further.

MATCH (:Class {name: 'UserRepository'})-[:Defines]->(target:Method {name: 'saveUser'}),
      (caller:Method)-[:Calls*1..3]->(target)
RETURN DISTINCT caller.fqn, target.fqn;

3. Find Unused / Dead Code (Orphan Methods): Find methods that are defined, are not constructors or lambdas, and are never called by any method in the analyzed codebase.

MATCH (c:Class)-[:Defines]->(m:Method)
WHERE NOT ()-[:Calls]->(m) AND m.name <> c.name AND m.isLambda = false
RETURN m.fqn;

4. Interface Implementation Discovery: Find all classes that implement java.io.Serializable and count how many methods they define. Matching on fqn rather than name avoids picking up an unrelated type that happens to be called Serializable.

MATCH (c:Class)-[:Implements]->(i:Class {fqn: 'java.io.Serializable'}), (c)-[:Defines]->(m:Method)
RETURN c.name, COUNT(m) AS definedMethods;

5. Build, Distribution, and Standalone Execution

Because users may not have a compatible JVM installed, java2graph is distributed as a zero-dependency, self-contained application that bundles its own minimal Java runtime.

How it works:

  1. Maven Fat Jar: The project is compiled into a single jar-with-dependencies containing all libraries (JavaParser, LadybugDB native bindings, Picocli).
  2. jdeps: Statically analyzes the Fat Jar to determine which JDK modules (e.g., java.base, java.compiler, java.sql) it requires.
  3. jlink: Assembles a custom, compressed, minimal Java runtime image that contains only those modules and leaves out the rest of the JDK.
  4. jpackage: Bundles the Fat Jar and the custom runtime into a standalone application image with a native launcher (e.g., a .app bundle on macOS, an .exe launcher on Windows, or an ELF launcher on Linux).

CI/CD (GitHub Actions)

The repository contains a .github/workflows/release.yml workflow. Upon pushing to the main branch, GitHub Actions will:

  • Spin up instances of Ubuntu, macOS, and Windows.
  • Build the Maven project.
  • Execute jlink and jpackage natively on each OS.
  • Zip and upload the platform-specific application bundles as build artifacts.

This lets a user download java2graph-Windows.zip or java2graph-macos.tar.gz, extract it, and run the tool immediately from a terminal without installing Java.