Added Ternip #125

Merged
mguthaus merged 2 commits into main from ternip
May 12, 2026

@sifferman
Collaborator

I had Claude do all this for me. I've done 0 debugging on my own.

Here is the output of DECISIONS.md:

Ternip

Ternip is a custom fixed-point ternary matrix-multiply inference processor written in SystemVerilog. It requires native SV synthesis via yosys-slang (SYNTH_HDL_FRONTEND: slang) and three FakeRAM macros replacing the behavioral ternip_pipelined_mem module.

asap7

Status: not finishing — detail routing does not converge
Last updated: 2026-05-08

Configuration

  • SYNTH_HDL_FRONTEND = slang (native SystemVerilog — no sv2v)
  • SYNTH_HIERARCHICAL = 0 (hierarchical mode caused CTS ODB-1200 InsertBufferBeforeLoads failure)
  • CORE_UTILIZATION = 25, PLACE_DENSITY = 0.55
  • MACRO_PLACE_HALO = 12 12
  • TNS_END_PERCENT = 100
  • Clock: clk_i, 5000 ps (200 MHz)
  • CONFIG_FILENAME set via VERILOG_DEFINES; hightide.svh resolved from VERILOG_INCLUDE_DIRS

FakeRAM macros (asap7)

ternip_pipelined_mem is the sole memory primitive, parameterized by DATA_WIDTH and NUM_ENTRIES. Three instances are synthesized; each is replaced by a fakeram7_* macro via ternip_pipelined_mem_fakeram7.v.

| Macro | DATA_WIDTH | NUM_ENTRIES | Instance | Source in ternip repo |
|---|---|---|---|---|
| fakeram7_4096x16 | 16 | 4096 | vector_registers.pipelined_mem | ternip_vector_registers.sv: FixedPointPrecision × (D × NumVectorRegisters) = 16 × 4096 |
| fakeram7_1024x16 | 16 | 1024 | tmatmul/exportvector | fus/ternip_tmatmul.sv: export vector buffer, DATA_WIDTH=FixedPointPrecision, NUM_ENTRIES=D |
| fakeram7_16x1024 | 1024 | 16 | tmatmul/importvector | fus/ternip_tmatmul.sv: import vector buffer, DATA_WIDTH=D×FixedPointPrecision/TmatmulParallelism, NUM_ENTRIES=TmatmulParallelism |

With D=1024, FixedPointPrecision=16, TmatmulParallelism=64: importvector DATA_WIDTH = 1024×16/64 = 256... see note below.
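The parameter arithmetic can be checked with a short sketch. It only evaluates the formulas quoted in the table's source column for the stated D, FixedPointPrecision and TmatmulParallelism; the function name and dictionary keys are illustrative and do not come from the ternip repo.

```python
def mem_shapes(d, precision, parallelism, vreg_entries):
    """Return (DATA_WIDTH, NUM_ENTRIES) per instance, using the table's formulas."""
    return {
        # vreg_entries stands for D x NumVectorRegisters, given as 4096 in the table
        "vector_registers": (precision, vreg_entries),
        "exportvector": (precision, d),
        "importvector": (d * precision // parallelism, parallelism),
    }

shapes = mem_shapes(d=1024, precision=16, parallelism=64, vreg_entries=4096)
for name, (width, entries) in shapes.items():
    print(f"{name}: DATA_WIDTH={width}, NUM_ENTRIES={entries}")
```

Evaluated this way, the importvector formula gives DATA_WIDTH=256 and NUM_ENTRIES=64, which is the 256 figure quoted above rather than the 1024 × 16 shape listed for fakeram7_16x1024.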

LEF/LIB files generated by designs/src/ternip/dev/gen_fakeram.py --platform asap7. Macro geometry targets a 2:1 aspect ratio; pin pitch matches bsg_fakeram's proven asap7 format (M4, 0.144 µm pitch, 0.072 µm protrusion).

Floorplan — macro placement

Die: 514.9 × 514.9 µm at 25% utilization. RTLMP places all three macros automatically:

| Macro | Instance | Origin (x, y) µm | Orient | Size (w × h) µm |
|---|---|---|---|---|
| fakeram7_4096x16 | vector_registers.pipelined_mem | 13.0, 101.3 | R0 | 256.0 × 128.3 |
| fakeram7_1024x16 | tmatmul/exportvector | 141.1, 77.3 | R180 | 128.0 × 64.3 |
| fakeram7_16x1024 | tmatmul/importvector | 501.9, 161.9 | R180 | 128.0 × 148.8 |

Detail routing — convergence failure

Global routing passes cleanly (0 overflow, 1.79% resource usage). Detail routing does not converge; the router reaches the 50-iteration limit with ~4,150 eolKeepOut violations remaining.

Selected per-iteration violation counts:

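| Iteration | Total violations | eolKeepOut |
|---|---|---|
| 0 | 13,992 | ~13,992 |
| 1 | 5,225 | ~5,225 |
| 3 | 4,322 | ~4,322 |
| 4 | 4,275 | 4,150 |
| 10 | 4,222 | 4,160 |
| 16 | 4,184 | 4,162 |
| 24 | 4,155 | 4,150 |
| 45 | 4,151 | 4,150 |
| 47 | 4,147 | 4,146 |
| 50 | ~4,204 | ~4,150 |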

The count drops sharply in iterations 0–3 (general routing cleanup), then plateaus at ~4,150 eolKeepOut violations from iteration 4 onward with no further improvement.
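One way to make the plateau observation mechanical is sketched below. It takes the (iteration, total violations) samples from the table and reports the first sampled iteration from which every later sample stays within a fractional tolerance of the final count. The 2% tolerance and the function name are assumptions chosen for illustration, not something the flow defines.

```python
def plateau_start(samples, tolerance=0.02):
    """First sampled iteration after which all counts stay within
    `tolerance` (as a fraction) of the final count."""
    final = samples[-1][1]
    start = None
    for iteration, total in samples:
        if abs(total - final) <= tolerance * final:
            if start is None:
                start = iteration
        else:
            start = None  # left the band, so any earlier plateau does not count
    return start

# (iteration, total violations) as listed in the table; iteration 50 is approximate
samples = [(0, 13992), (1, 5225), (3, 4322), (4, 4275), (10, 4222),
           (16, 4184), (24, 4155), (45, 4151), (47, 4147), (50, 4204)]
print(plateau_start(samples))
```

With these samples and a 2% band the function reports iteration 4, matching the description above. A wider band would pull the start earlier, so the threshold should be read as a tuning choice.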

Root cause: fakeram7_16x1024 has 2 × 1024 data pins + 4 address pins + 3 control pins = 2055 signal pins at 0.144 µm pitch on a 148.8 µm-tall body. The macro sits in the upper-right corner of the die (x = 502 µm in a 515 µm-wide die) in R180 orientation. The resulting pin clusters at the macro edges create a local routing hot spot that the detail router cannot escape — every attempted reroute around one eolKeepOut violation displaces another.

Global routing sees no overflow because the congestion is localized to the pin-access layer directly adjacent to the macro edge; the global router operates at a coarser granularity and does not model per-pin eolKeepOut constraints.

Open fix

Increase pin_track_count from 3 to 6 in gen_fakeram.py for fakeram7_16x1024 (doubling the pin pitch from 0.144 µm to 0.288 µm). This grows the macro height from 148.8 µm to ~296 µm but gives the detail router 2× more routing space between adjacent pins. Requires regenerating the LEF/LIB and rerunning from floorplan.
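The pin-count and height figures can be reproduced with a rough estimate. The sketch assumes the signal pins are split evenly across two macro edges, which is not stated in the analysis but roughly reproduces both the 148.8 µm and ~296 µm heights; the function is hypothetical and is not taken from gen_fakeram.py.

```python
import math

def pin_edge_length_um(data_bits, addr_bits, ctrl_pins, pitch_um, edges=2):
    """Estimate the edge length occupied by signal pins.

    Assumes 2 x data_bits data pins (as in the root-cause count) and an
    even split of all signal pins across `edges` edges (an assumption).
    """
    pins = 2 * data_bits + addr_bits + ctrl_pins
    per_edge = math.ceil(pins / edges)
    return pins, per_edge * pitch_um

for pitch in (0.144, 0.288):
    pins, length = pin_edge_length_um(1024, 4, 3, pitch)
    print(f"pitch {pitch} um: {pins} pins, ~{length:.1f} um per edge")
```

Under that assumption the pin rows alone need about 148 µm at the current pitch and about 296 µm at the doubled pitch, so the macro height is set almost entirely by pin count times pitch.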

sifferman and others added 2 commits May 8, 2026 17:19
Distributed macro placement (4096x16 along the bottom, 16x1024 and
1024x16 sharing the top row with a 52 um channel between them),
CORE_UTILIZATION 25 -> 45, MACRO_PLACE_HALO 12 -> 20.  Detail routing
reaches a stable 1780 Lef58EolKeepOut M4 violations (down from 4180
with default RTLMP placement at util 25 / halo 12) but does not clear
to zero -- residual violations are signals exiting the macro's M4
side-pins immediately hitting the macro's M4 power straps, same root
cause that keeps cnn/bp_uno/bp_quad in the asap7 NOT-CACHED cohort.

Timing closes cleanly: TNS/WNS 0, worst slack +3253 ps, fmax 572 MHz
(target 200 MHz).

Filed under: best DRC count any HighTide asap7 bsg-fakeram-style
design has reached, but partial -- design ships as NOT-CACHED with
the residual count recorded in the results page.
@mguthaus
Contributor

Update — 031f73a (2026-05-12)

The new commit changes the floorplan strategy. The result is the same class of failure the original analysis described, but the EOL plateau drops from ~4150 to 1780.

Changes

  • Added designs/asap7/ternip/macro_placement.tcl — hand-placed all 3 macros:
    • fakeram7_4096x16 along the bottom (R0 @ 35, 15)
    • fakeram7_16x1024 top-left (R0 @ 15, 220)
    • fakeram7_1024x16 top-right (R0 @ 235, 220)
    • 52 µm halo-to-halo channel between the two upper macros, 37 µm between 4096x16 top and upper-row bottom
  • CORE_UTILIZATION 25 → 45 (die 515 → 384 µm)
  • MACRO_PLACE_HALO 12 → 20
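The die shrink follows from the utilization change if the placed area stays fixed. The sketch below assumes a square die and ignores core margins; the helper name is illustrative.

```python
import math

def scaled_die_side(side_um, util_from, util_to):
    """Side of a square die holding the same placed area at a new utilization.

    Area scales as util_from / util_to, so the side scales as its square root.
    """
    return side_um * math.sqrt(util_from / util_to)

new_side = scaled_die_side(514.9, 25, 45)
print(f"{new_side:.1f} um")
```

Starting from the 514.9 µm die at 25% utilization, this gives a side of about 384 µm at 45%, in line with the figure in the list above.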

DRT trajectory

| Iter | Old (RTLMP, util 25, halo 12) | New (hand-placed, util 45, halo 20) |
|---|---|---|
| 0 | 10579 | 11710 |
| 5 | 4275 | 6096 |
| 10 | 4222 | 5024 |
| 15 | 4184 | ~4300 |
| 22 | 4184 | 1955 |
| 30 | 4184 | 1779 |
| 64 | (capped @ 4150) | 1780 |

DRT looked worse than the original through iteration 15 (smaller die = denser initial state) and then broke through, eventually settling at 1780, less than half the original ~4150 plateau.

Root cause confirmation

99.94% of the remaining 1780 violations are still Lef58EolKeepOut on M4 between signal nets and macro VDD/VSS straps — exactly the same class the original analysis identified, just fewer of them. Sample:

srcs: net:tmatmul.importvector.read_data_q2[107]  net:VSS
bbox = (11.9970, 302.3640) - (12.0750, 302.3880) on Layer M4
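To tally records like this one, a small parser can pull out the nets, layer and marker size. The regular expressions below are written against the two-line sample only; the full DRC report format is an assumption, and the function name is illustrative.

```python
import re

SAMPLE = """srcs: net:tmatmul.importvector.read_data_q2[107]  net:VSS
bbox = (11.9970, 302.3640) - (12.0750, 302.3880) on Layer M4"""

POWER_NETS = {"VDD", "VSS"}

def parse_violation(text):
    """Extract nets, layer and bounding-box size (in um) from one record."""
    nets = re.findall(r"net:(\S+)", text)
    m = re.search(
        r"bbox = \(([\d.]+), ([\d.]+)\) - \(([\d.]+), ([\d.]+)\) on Layer (\S+)",
        text,
    )
    x0, y0, x1, y1 = (float(m.group(i)) for i in range(1, 5))
    return {
        "nets": nets,
        "layer": m.group(5),
        "width_um": x1 - x0,
        "height_um": y1 - y0,
        "touches_power": any(n in POWER_NETS for n in nets),
    }

violation = parse_violation(SAMPLE)
print(violation)
```

For the sample, the marker is on M4, involves VSS, and spans roughly 0.078 × 0.024 µm. Counting records where `touches_power` is true on M4 would be one way to reproduce the percentage quoted above.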

The hand placement buys roominess — signals that can escape on M5+ now have room to do so — but signals that must exit at a macro M4 pin still hit the macro's own internal power straps regardless of layout.

The original analysis's open fix (doubling pin pitch via pin_track_count 3 → 6) is still on the table and would address the remaining 1780. Other complementary levers: a pre_route.tcl that adds M3/M4 blockages over each macro halo, or LEF-level changes to expose macro pins on M5.

Other QoR

Timing closes cleanly:

  • TNS/WNS: 0 / 0
  • Worst slack: +3253 ps (5 ns clock)
  • fmax: 572 MHz (target 200 MHz)
  • Skew: 148 ps
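The fmax figure follows from the clock period and the worst slack. A minimal sketch, assuming fmax is defined as the reciprocal of the clock period minus the worst setup slack (the usual convention, but not spelled out here):

```python
def fmax_mhz(clock_period_ps, worst_slack_ps):
    """Maximum frequency implied by the critical path (period - slack)."""
    critical_path_ps = clock_period_ps - worst_slack_ps
    return 1e6 / critical_path_ps  # 1e6 ps per microsecond, so the result is in MHz

print(f"{fmax_mhz(5000, 3253):.0f} MHz")
```

With a 5000 ps clock and +3253 ps of slack the critical path is 1747 ps, which works out to about 572 MHz.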

Results-page coverage

Results row + gallery card landed on the webpage branch (commit 2e4414b). The schema has no DRC column, so the row reports valid timing/area/power but the 1780 residual count isn't visible — same convention as cnn / bp_uno.

@mguthaus mguthaus marked this pull request as ready for review May 12, 2026 14:54
@mguthaus mguthaus merged commit c79b0a9 into main May 12, 2026