
Add DeepSeek Engram layer #3010

Open
shuningjin wants to merge 1 commit into main from shuningjin-engram

Conversation


shuningjin (Collaborator) commented Jan 26, 2026

Description

Background

What this PR does

Add Engram layer: engram.py

  • compressed tokenizer (non-parametric)
  • n-gram hash mapping (non-parametric)
  • multi-head embedding
  • short convolution (multi-branch)
  • engram (multi-branch)

Add unit test: tests.unit.engram_vs_reference_test

  • verifies that each module matches the output of the reference code

Implementation Notes

Relationship of components

  • n-gram hash mapping: encompasses the compressed tokenizer.
  • Engram: encompasses multi-head embedding and short convolution.
  • n-gram hash mapping and Engram
    • n-gram hash mapping converts vanilla token-ids to hashed n-gram token-ids, which Engram consumes for embedding lookup (see the sketch after this list)
    • Future: n-gram hash mapping will need to be inserted into the data input pipeline during integration
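
A minimal sketch of this hand-off, assuming hypothetical shapes and a simple per-head vmap lookup; the actual class names and API in engram.py may differ:

```python
import jax
import jax.numpy as jnp

# Hypothetical sizes for illustration only.
batch, seq, num_heads, head_dim, prime_vocab = 2, 8, 4, 32, 10007
key = jax.random.PRNGKey(0)

# n-gram hash mapping (non-parametric): vanilla token-ids -> hashed n-gram ids,
# one id per head, each bounded by that head's prime vocabulary size.
hashed_ids = jax.random.randint(key, (batch, seq, num_heads), 0, prime_vocab)

# Engram side: multi-head embedding lookup over the hashed ids.
# Each head owns its own table; vmap gathers per head and stacks on a head axis.
emb_tables = jax.random.normal(key, (num_heads, prime_vocab, head_dim))
embedded = jax.vmap(lambda table, ids: table[ids], in_axes=(0, 2), out_axes=2)(
    emb_tables, hashed_ids
)
print(embedded.shape)  # (batch, seq, num_heads, head_dim) == (2, 8, 4, 32)
```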

Multi-branch

  • engram and shortconv handle multi-branch input and multi-branch output (when hc_mult > 1), vectorized with nnx.vmap; see the sketch after this list
  • Future: to be integrated into a multi-branch backbone such as mHC.
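
A minimal, self-contained sketch of the multi-branch pattern with nnx.split_rngs and nnx.vmap. The module (nnx.LayerNorm), sizes, and branch-major (G, B, S, D) layout are illustrative assumptions, not the actual engram.py code; note that axis 0 here is the branch axis G, not the batch axis.

```python
import jax.numpy as jnp
from flax import nnx

hc_mult, features = 4, 16  # hypothetical branch count and feature size

# Create hc_mult independent LayerNorms (stacked weights, one set per branch).
@nnx.split_rngs(splits=hc_mult)
@nnx.vmap(in_axes=0, out_axes=0)
def create_branch_norms(rngs: nnx.Rngs):
  return nnx.LayerNorm(features, rngs=rngs)

norms = create_branch_norms(nnx.Rngs(0))

# Apply branch g's weights to branch g's slice of a branch-major input.
@nnx.vmap(in_axes=0, out_axes=0)
def apply_branch_norms(norm, x):
  return norm(x)

x = jnp.ones((hc_mult, 2, 8, features))  # (G, B, S, D)
y = apply_branch_norms(norms, x)         # (G, B, S, D), independent weights per branch
print(y.shape)
```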

Tests

Unit test against the reference implementation:

python3 -m pytest -v --pyargs tests.unit.engram_vs_reference_test -rP -s

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.


codecov bot commented Jan 26, 2026

Codecov Report

❌ Patch coverage is 0% with 209 lines in your changes missing coverage. Please review.

Files with missing lines | Patch % | Lines
src/MaxText/layers/engram.py | 0.00% | 209 Missing ⚠️


shuningjin changed the title from "[DRAFT] do no merge" to "[DRAFT] engram" on Jan 29, 2026
shuningjin force-pushed the shuningjin-engram branch 2 times, most recently from 93458cf to 21cec5f on January 30, 2026, 17:52
shuningjin changed the title from "[DRAFT] engram" to "Add DeepSeek Engram layer" on Feb 4, 2026
shuningjin marked this pull request as ready for review on February 4, 2026, 21:48
shuningjin (Collaborator, Author) commented:

@gemini-cli /review

RissyRan (Collaborator) left a comment:


@gemini-cli /review

@github-actions
Copy link

github-actions bot commented Feb 5, 2026

🤖 Hi @RissyRan, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.


github-actions bot left a comment:


📋 Review Summary

This pull request introduces a JAX implementation of the DeepSeek Engram layer, along with comprehensive unit tests that validate its behavior against a PyTorch reference. The code is well-structured and the implementation appears to be correct and thorough. The core logic is sound, and the use of vectorization with nnx.vmap is a good practice for performance.

🔍 General Feedback

  • Good Testing: The inclusion of unit tests comparing the JAX implementation to a PyTorch reference is excellent. This provides high confidence in the correctness of the implementation.
  • Clear Implementation: The code in engram.py is well-commented and organized, making it easy to follow the logic from the original paper.
  • TODOs: I've commented on the TODOs left in the code. Addressing them will improve the clarity and robustness of the implementation.

# Structure: {layer_id: [[2gram_head1, ..., 2gram_headH], ..., [Ngram_head1, ..., Ngram_headH]]}
self.vocab_size_across_layers = self._calculate_vocab_size_across_layers()

def _calculate_multipliers_across_layers(self, seed: int):

🟡 Your comment here is valid and raises a good question. Using `tokenizer.pad_id` directly would be more robust and less prone to configuration errors. If there's a specific reason to pass `pad_id` separately and then look it up in the `lookup_table`, it would be beneficial to document that reasoning here. Otherwise, I'd recommend simplifying this to use the tokenizer's padding ID directly.

kernel_axes=("engram_dim", "embed"),
dtype=self.dtype,
weight_dtype=self.weight_dtype,
quant=self.quant,

🟡 You've correctly identified a point of potential confusion regarding the logical axes for sharding. For clarity and to ensure correct behavior in a distributed environment, it would be best to confirm the intended logical axis names. This will help maintainers and future contributors understand the sharding strategy.

# Value Projection (shared): Retrieved memory -> Value
self.value_proj = DenseGeneral(
in_features_shape=self.engram_dim,
out_features_shape=config.base_emb_dim,

🟡 This is another good catch regarding the logical axis names. Explicitly defining these based on your sharding plan will improve the code's readability and prevent potential issues with model parallelism. Please verify the correct logical axis names to be used here.


github-actions bot commented Feb 5, 2026

🤖 I'm sorry @RissyRan, but I was unable to process your request. Please see the logs for more details.

RissyRan (Collaborator) left a comment:


I reviewed the test and CompressedTokenizer. Will continue to review the rest tomorrow.


"""
DeepSeek-AI, `Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models
<https://arxiv.org/pdf/2601.07372>`_, 2026

is this extra "_" on purpose?

"""

def __init__(self, tokenizer: HFTokenizer):
# TODO(shuningjin): maybe don't need to hold tokenizer, if we only use the lookup table as bridge

What's the consequence if we remove it here?

engram_head_dim: int = 32
engram_num_heads: int = 8 # num heads per n-gram
# Hashing
engram_pad_id: int = 2 # TODO(shuningjin): not the same as tokenizer.pad_id?

Does this need to be defined by users?

RissyRan (Collaborator) left a comment:


Thanks for the change! I left some initial comments, and may need to go over the multi-head embedding and conv parts. It should be quick.

n-grams into fixed integer IDs. To handle the large combinatorial space, it uses:
1. Unique Prime Vocabularies: Per-head prime moduli to minimize collision overlap.
2. Sliding Window: Efficient shifting to generate n-gram views.
3. Lightweight Hashing: A multiplicative-XOR function (Rabin-Karp variant).

Nice! I may have missed this part in the reference implementation. Did you add this optimization?
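
For reference, one way such a multiplicative-XOR sliding-window hash can look. This is an illustrative sketch only; the exact multipliers, padding, and mixing in engram.py may differ.

```python
import numpy as np

def hash_ngrams(token_ids, n, multipliers, prime_vocab):
  """token_ids: (seq,) int64 -> (seq,) hashed n-gram ids in [0, prime_vocab)."""
  seq = token_ids.shape[0]
  # Sliding window: position t sees the n tokens ending at t (left-padded with 0).
  padded = np.concatenate([np.zeros(n - 1, dtype=np.int64), token_ids])
  windows = np.stack([padded[i : i + seq] for i in range(n)], axis=-1)  # (seq, n)
  # Multiplicative-XOR mix (Rabin-Karp flavour), then reduce by a prime modulus.
  mixed = windows * multipliers  # (seq, n)
  h = mixed[:, 0]
  for k in range(1, n):
    h = np.bitwise_xor(h, mixed[:, k])
  return h % prime_vocab

ids = np.array([5, 17, 42, 9, 3, 8], dtype=np.int64)
print(hash_ngrams(ids, n=2, multipliers=np.array([31, 131], dtype=np.int64), prime_vocab=10007))
```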

A dictionary mapping layer_id to a list of `max_ngram_size` multipliers.
"""
# Pre-calculate bounds for random generation
max_long = np.iinfo(np.int64).max

Could you cross-check whether we could update all np to jnp?

LAYER_PRIME_OFFSET = 10007

layer_multipliers = {}
for layer_id in self.layer_ids:

Do you think we could update this block using vectorized operations? The dim will depend on len(layer_ids), which is fixed at compile time.

quant: Optional[Quant] = None,
kernel_init: NdInitializer = nd_dense_init(1.0, "fan_in", "normal"),
*,
hc_mult: int = 4,

Shall we put params with default values at the very end?

axis=-1,
kernel_init=self.kernel_init,
# TODO(shuningjin): this needs to be actual logical axis? @reviewer
kernel_axes=("engram_dim", "embed"),

You could add the sharding constraint into base.yml.

logical_axis_rules: [

I see it is a smaller dim compared to embed, so we could shard it on tensor as a starting point. I see embed usually sharded on fsdp, sequence, context, etc.

Shape annotation:
B: Batch Size
S: Sequence Length
G: hc_mult, Number of Branches

Do you plan to separate this config or treat it the same as mhc_expansion_rate?

# Norms (vectorized)
# Independent weights per branch, Branched input
@nnx.split_rngs(splits=hc_mult)
@nnx.vmap(in_axes=0, out_axes=0)

Are you sharding on the batch dimension? Why is that? Similar comment for the other in_axes=0 vmap ops.

# Vectorized broadcast: apply each of the G key_projs to the SAME embeddings.
# in_axes: (0, None) -> 0 splits the Dense layers, None broadcasts embeddings
# out_axes: 2 -> Stack the results at axis 2 to get (B, S, G, D)
@nnx.vmap(in_axes=(0, None), out_axes=2)

Do you know if this in_axes is working properly? I see your unit test has a b=2 setup. When I integrated flash attn with sparse attn, I had to change the unit test from b=2 to b=4 when sharding on fsdp; otherwise, it would fail on a v5p-8 local machine.
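
For what it's worth, the in_axes=(0, None) broadcast can be shape-checked in isolation. This is a standalone sketch with made-up sizes, independent of engram.py and of any mesh/fsdp sharding (batch divisibility under fsdp is a separate question):

```python
import jax.numpy as jnp
from flax import nnx

G, B, S, D_in, D_out = 4, 2, 8, 16, 32  # hypothetical sizes

# Create G stacked Linear layers (one set of weights per branch).
@nnx.split_rngs(splits=G)
@nnx.vmap(in_axes=0, out_axes=0)
def make_key_projs(rngs: nnx.Rngs):
  return nnx.Linear(D_in, D_out, rngs=rngs)

key_projs = make_key_projs(nnx.Rngs(0))

# 0 splits the stacked layers, None broadcasts the same embeddings to each.
@nnx.vmap(in_axes=(0, None), out_axes=2)
def broadcast_apply(proj, x):
  return proj(x)  # each call sees x of shape (B, S, D_in)

x = jnp.ones((B, S, D_in))
y = broadcast_apply(key_projs, x)
print(y.shape)  # (2, 8, 4, 32) == (B, S, G, D_out)
```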

max_ngram_size: int,
engram_num_heads: int,
layer_ids: List[int],
tokenizer: HFTokenizer,

When you say you would like to put the lookup table into the data pipeline, is this beneficial for structure or for performance? When we call the engram from the decoder layer, we need to pass this tokenizer. So are you thinking this engram module will call/depend on the data pipeline for the lookup?

self.backbone_config = BackBoneConfig(self.config)
tokenizer = AutoTokenizer.from_pretrained(self.config.tokenizer_path, trust_remote_code=True)
# input
batch, seq_len = 2, 3

Could we set up a longer sequence, like 8, to test the overlap of 2/3-grams?


Labels: None yet
Projects: None yet
3 participants