Commit 66219d8
chore(medcat-gliner): CU-869c3bvm0 Migrate gliner implementation to public (#328)
* CU-869bepjj9 Add gliner for medcat (#165)
* CU-869bepjj9: Add gliner based NER
* CU-869bepjj9: Depend on MedCAT v2.5 and onwards for lazy registration
* CU-869bepjj9: Shuffle some code around to do registration separately
* CU-869bepjj9: Fix minor logic issue
* CU-869bepjj9: Add small registration test
* CU-869bepjj9: Add a simple test for creator validation
* CU-869bepjj9: Add workflow for gliner
* CU-869bepjj9: Pin dependency to 2.0 and higher
* CU-869bepjj9: Fix typing issue regarding new method in v2.5
* CU-869bepjj9: Add missing dunder init file
* CU-869bepjj9: Update tests to do a manual import before the medcat 2.5.0 release
* CU-869bepjj9: Fix minor typing issue to do with version check
* CU-869bepjj9: Fix another minor typing issue to do with version check
* CU-869bepjj9: Add a workaround for improper names in component classes to tests
* CU-869bepjj9: Adopt rename of external (to MedCAT) projects as plugins (#171)
* Add NER recall comparison section to README (#196): adds a section comparing NER recall between the vocab based NER and the GliNER implementation using the 2023 SNOMED CT Linking Challenge dataset
* CU-869c3bvm0: Add workflow for gliner
* CU-869c3bvm0: Add release workflow for gliner
* CU-869c3bvm0: Only do lazy registration for gliner
* CU-869c3bvm0: Update pyproject.toml to only support medcat 2.5 and up (for lazy registration)
* CU-869c3bvm0: Add publish to TestPyPI to workflow
* CU-869c3bvm0: Update version scheme in pyproject.toml
* CU-869c3bvm0: Update pyproject.toml with root / git path for gliner
* CU-869c3bvm0: Update pyproject.toml with git describe command
* CU-869c3bvm0: Set up dev version before dep install
* CU-869c3bvm0: Update some of the actions
* CU-869c36ruk: Update tag regex
* CU-869c3bvm0: [TODO: REMOVE] Add debug output regarding tags to main workflow
* CU-869c3bvm0: Fix workflow actions version issue
* Revert "CU-869c3bvm0: [TODO: REMOVE] Add debug output regarding tags to main workflow" (this reverts commit 2c15273)
* CU-869c3bvm0: Fix linting issue
* CU-869c3bvm0: Fix permissions issue with workflow PyPI push
* CU-869c3bvm0: Update gliner plugin details in plugin catalog
* CU-869c3bvm0: Fix small issue with plugin catalog
* CU-869c3bvm0: Update docstring for clarity in gliner_ner.py
* CU-869c3bvm0: Rename workflow file
* CU-869c3bvm0: Centralise workflow file
* CU-869c3bvm0: Remove unnecessary line
* CU-869c3bvm0: Move to uv in workflows
* CU-869c3bvm0: Add transitive dependency (with description) to pyproject.toml
1 parent 584e342 commit 66219d8

9 files changed

Lines changed: 467 additions & 5 deletions

Lines changed: 121 additions & 0 deletions (new file: GitHub Actions CI workflow)

```yaml
name: MedCAT-Gliner-CI (test / publish)

on:
  push:
    branches: [ main ]
    paths:
      - 'medcat-plugins/medcat-gliner/**'
      - '.github/workflows/medcat-gliner**'
    tags:
      - 'medcat-gliner/v*.*.*'
  pull_request:
    paths:
      - 'medcat-plugins/medcat-gliner/**'
      - '.github/workflows/medcat-gliner**'

permissions:
  id-token: write

defaults:
  run:
    working-directory: ./medcat-plugins/medcat-gliner

jobs:
  tests:

    runs-on: ubuntu-latest
    timeout-minutes: 30
    strategy:
      matrix:
        python-version: [ '3.10', '3.11', '3.12', '3.13' ]
      max-parallel: 4

    steps:
      - uses: actions/checkout@v6

      - name: Install uv for Python ${{ matrix.python-version }}
        uses: astral-sh/setup-uv@v7
        with:
          python-version: ${{ matrix.python-version }}
          enable-cache: true

      - name: Install dependencies
        run: |
          df -h # check space before
          uv run python -m ensurepip
          uv pip install -e ".[dev]" --extra-index-url https://download.pytorch.org/whl/cpu/
          df -h # check space after

      - name: Check types
        run: |
          uv run mypy --follow-imports=normal src/medcat_gliner --follow-untyped-imports

      - name: Lint
        run: |
          uv run ruff check src/medcat_gliner --preview

      - name: Test
        run: |
          uv run python -m unittest discover

  publish-to-test-PyPI:
    runs-on: ubuntu-latest
    needs: tests
    steps:
      - name: Checkout main
        uses: actions/checkout@v6
        with:
          fetch-depth: 0 # fetch all history
          fetch-tags: true # fetch tags explicitly

      - name: Install uv for Python 3.10
        uses: astral-sh/setup-uv@v7
        with:
          python-version: '3.10'
          enable-cache: true

      - name: Set timestamp-based dev version
        run: |
          TS=$(date -u +"%Y%m%d%H%M%S")
          echo "SETUPTOOLS_SCM_PRETEND_VERSION_FOR_MEDCAT_GLINER=0.2.2.dev${TS}" >> $GITHUB_ENV

      - name: Install dependencies
        run: |
          uv run python -m ensurepip

      - name: Build package
        run: |
          uv build

      - name: Publish distribution to TestPyPI
        uses: pypa/gh-action-pypi-publish@release/v1
        with:
          repository_url: https://test.pypi.org/legacy/
          packages_dir: medcat-plugins/medcat-gliner/dist

  test-and-publish-to-PyPI:
    runs-on: ubuntu-latest
    if: startsWith(github.ref, 'refs/tags/')
    needs: tests
    steps:
      - name: Checkout main
        uses: actions/checkout@v6

      - name: Install uv for Python 3.10
        uses: astral-sh/setup-uv@v7
        with:
          python-version: '3.10'
          enable-cache: true

      - name: Install dependencies
        run: |
          uv run python -m ensurepip

      - name: Build client package
        run: |
          uv build

      - name: Publish production distribution to PyPI
        uses: pypa/gh-action-pypi-publish@release/v1
        with:
          packages_dir: medcat-plugins/medcat-gliner/dist
```
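The `Set timestamp-based dev version` step above builds a PEP 440 dev version by appending a UTC timestamp to the base version `0.2.2` (the base version comes from the workflow; the helper name below is illustrative). A minimal Python sketch of the same scheme:

```python
from datetime import datetime, timezone


def timestamp_dev_version(base: str) -> str:
    """Mirror the shell step: <base>.dev<UTC timestamp YYYYmmddHHMMSS>."""
    ts = datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S")
    return f"{base}.dev{ts}"


version = timestamp_dev_version("0.2.2")
print(version)  # e.g. 0.2.2.dev20240101120000
```

Because the timestamp is strictly increasing, every push to `main` produces a TestPyPI version that sorts after the previous one.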
Lines changed: 38 additions & 0 deletions (new file: README.md)

# MedCAT-gliner

This provides a [gliner](https://github.com/urchade/GLiNER) based NER step for the MedCAT core library.

# Usage

First install from PyPI, e.g.:

```bash
pip install medcat-gliner
```

Subsequently, if you have an existing model, you should be able to just change the NER component:

```python
from medcat.cat import CAT
from medcat_gliner import GLiNERConfig

cat = CAT.load_model_pack("path/to/existing/model")
# change component
cat.config.components.ner.comp_name = "gliner_ner"
cat.config.components.ner.custom_cnf = GLiNERConfig()
# recreate pipe with new NER component
cat._recreate_pipe()
# use as needed
```

## NER recall comparison (linkable SNOMED entities)

The following results compare the existing NER implementation (vocab based NER with spell checking) with the gliner implementation when used as the NER component within MedCAT.
Evaluation was performed on the **2023 SNOMED CT Linking Challenge** dataset.

> **Important caveat**
> This is **not a measure of general NER quality**.
> Recall is computed only with respect to annotated, linkable SNOMED CT entities present in the linking dataset.
> Mentions outside the annotation scope are treated as false positives by construction, so precision is not meaningful here.

| Implementation        | True Positives | False Negatives | Recall | Runtime |
| --------------------- | -------------- | --------------- | ------ | ------- |
| Vocab based NER       | 10,545         | 3,917           | 0.729  | ~5m 50s |
| GliNER implementation | 7,971          | 6,491           | 0.551  | ~34m    |

As we can see, for this dataset, GliNER is significantly slower and performs worse than the standard vocab based implementation. This is likely because the vocab based NER step has been configured and tuned to work best within the MedCAT pipeline. With additional tuning, the GliNER implementation could likely perform as well as or better than the vocab based NER does.
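The recall figures in the table follow directly from recall = TP / (TP + FN); a quick check against the reported values:

```python
def recall(tp: int, fn: int) -> float:
    """Recall over annotated, linkable mentions only (see caveat above)."""
    return tp / (tp + fn)


# values from the comparison table
vocab_recall = round(recall(10_545, 3_917), 3)
gliner_recall = round(recall(7_971, 6_491), 3)
print(vocab_recall, gliner_recall)  # 0.729 0.551
```

Both implementations were evaluated on the same 14,462 annotated mentions (TP + FN is identical in both rows), so the recall values are directly comparable.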
Lines changed: 64 additions & 0 deletions (new file: `pyproject.toml`)

```toml
[build-system]
requires = ["setuptools>=61.0", "wheel", "setuptools_scm>=8"]
build-backend = "setuptools.build_meta"

[project]
name = "medcat_gliner"
dynamic = ["version"]
description = ""
readme = "README.md"
license = { text = "Apache-2.0" }
authors = [
    { name="Mart Ratas", email="mart.ratas@kcl.ac.uk" }
]
requires-python = ">=3.10"

keywords = ["NLP", "NER", "medical", "MedCAT", "gliner"]

classifiers = [
    "Development Status :: 3 - Alpha",
    "Intended Audience :: Science/Research",
    "Topic :: Scientific/Engineering :: Artificial Intelligence",
    "Programming Language :: Python :: 3",
    "Programming Language :: Python :: 3.10",
    "Programming Language :: Python :: 3.11",
    "Programming Language :: Python :: 3.12",
    "Programming Language :: Python :: 3.13",
    "License :: OSI Approved :: Apache Software License"
]

dependencies = [
    "medcat>=2.5",
    "gliner",
    # transitive dependency of gliner that `uv` pins to and
    # that doesn't work in python 3.10
    "onnxruntime<1.24; python_version < '3.11'",
]

[project.optional-dependencies]
dev = [
    "ruff",
    "mypy",
]

# entry-points to add onto medcat
[project.entry-points."medcat.plugins"]
ner_gliner = "medcat_gliner"

[project.urls]
Homepage = "https://github.com/CogStack/medcat-ops/tree/main/medcat-gliner"
Repository = "https://github.com/CogStack/medcat-ops/tree/main/medcat-gliner"
Issues = "https://github.com/CogStack/medcat-ops/issues"

[tool.setuptools_scm]
root = "../.."
tag_regex = "^medcat-gliner/v(?P<version>\\d+(?:\\.\\d+)*)(?:[ab]\\d+|rc\\d+)?$"
version_scheme = "post-release"
local_scheme = "no-local-version"
git_describe_command = "git describe --dirty --tags --long --match 'medcat-gliner/v*'"

[tool.setuptools.packages.find]
where = ["src"]

[tool.setuptools.package-data]
"medcat_gliner" = ["py.typed"]
```
Lines changed: 3 additions & 0 deletions (new file: package `__init__.py`)

```python
from .registration import do_registration as __register

__register()
```
Lines changed: 166 additions & 0 deletions (new file: `gliner_ner.py`)

```python
from typing import Iterator
import logging

from gliner import GLiNER

from medcat.components.types import AbstractEntityProvidingComponent
from medcat.components.types import CoreComponentType
from medcat.cdb import CDB
from medcat.vocab import Vocab
from medcat.config.config import ComponentConfig, SerialisableBaseModel
from medcat.tokenizing.tokenizers import BaseTokenizer
from medcat.tokenizing.tokens import MutableDocument, MutableEntity
from medcat.tokenizing.tokens import MutableToken
from medcat.components.ner.vocab_based_annotator import maybe_annotate_name


logger = logging.getLogger(__name__)


class GliNERConfig(SerialisableBaseModel):
    model_name: str = "urchade/gliner_base"
    """The model to use.

    See options:
    https://huggingface.co/models?library=gliner&sort=trending
    """
    threshold: float = 0.5
    """The threshold for the prediction.

    Higher will probably mean more false positives, while lower will
    probably mean missing some true positives (i.e. more false negatives).
    """
    chunking_overlap_tokens: int = 10
    """If we're chunking the text for the HF model, what is the overlap we use.

    For text longer than 384 words (which the GLiNER authors say is
    equivalent to 512 tokens) the text needs to be chunked in order to be
    processed by the HF model. However, if the chunking has no overlap,
    some of the context will be lost and performance will probably
    deteriorate. At the same time, larger overlaps mean more data will be
    processed, which leads to lower throughput.
    """


# NOTE: They allow up to 384 WORDs. They say
# that's equivalent to 512 tokens, see:
# https://github.com/urchade/GLiNER/issues/183#issuecomment-2330882600
# However, it's easier for us to just count the tokens.
# Also, they say you can increase it at the expense of performance
# so I'd rather do the splitting manually at that point.
# I tried 500 tokens, and it was consistently too many
# so went down to 400
MAX_TOKENS_2_GLINER_AT_ONCE = 400


class GlinerNER(AbstractEntityProvidingComponent):
    name = 'gliner_ner'

    def __init__(self, tokenizer: BaseTokenizer,
                 cdb: CDB) -> None:
        super().__init__()
        self.tokenizer = tokenizer
        self.cdb = cdb
        self.config = self.cdb.config
        self._validate_cnf()
        self._init_model()

    def _validate_cnf(self):
        cnf = self.config.components.ner.custom_cnf
        if not isinstance(cnf, GliNERConfig):
            logger.warning("No GliNERConfig was set - using default")
            cnf = self.config.components.ner.custom_cnf = GliNERConfig()
        self.gliner_cnf = cnf

    def _init_model(self):
        logger.info("Init model for %s", self.gliner_cnf.model_name)
        self.model = GLiNER.from_pretrained(self.gliner_cnf.model_name)
        # init labels from type id
        self.labels = [
            tid.name for tid in
            self.cdb.type_id2info.values()]
        logger.info("Using labels: %s", self.labels)

    def get_type(self) -> CoreComponentType:
        return CoreComponentType.ner

    def predict_entities(self, doc: MutableDocument,
                         ents: list[MutableEntity] | None = None
                         ) -> list[MutableEntity]:
        """Detect candidates for concepts - the linker will then be able
        to do the rest. It adds `entities` to doc.entities, and each
        entity can have entity.link_candidates that the linker
        will resolve.

        Args:
            doc (MutableDocument):
                Document to be annotated with named entities.
            ents (list[MutableEntity] | None):
                The entities given. This should be None.

        Returns:
            list[MutableEntity]:
                The NER'ed entities.
        """
        if ents is not None:
            raise ValueError(f"Unexpected entities sent to NER: {ents}")
        if self.cdb.has_changed_names:
            self.cdb._reset_subnames()
            self._init_model()
        text = doc.base.text.lower()
        all_tkns = list(doc)
        if len(all_tkns) > MAX_TOKENS_2_GLINER_AT_ONCE:
            return self._split_and_predict(doc, text, all_tkns)
        return self._predict(doc, text, 0)

    def _create_splits(self, all_tkns: list[MutableToken],
                       full_text: str) -> Iterator[tuple[str, int]]:
        overlap_tkns = self.gliner_cnf.chunking_overlap_tokens
        leftover_tokens = list(all_tkns)
        while leftover_tokens:
            cur_tokens = leftover_tokens[:MAX_TOKENS_2_GLINER_AT_ONCE]
            # keep overlap number of tokens
            leftover_tokens = leftover_tokens[
                MAX_TOKENS_2_GLINER_AT_ONCE - overlap_tkns:]
            start_char = cur_tokens[0].base.char_index
            end_char = cur_tokens[-1].base.char_index + len(
                cur_tokens[-1].base.text)
            cur_text = full_text[start_char:end_char]
            yield cur_text, start_char

    def _split_and_predict(self, doc: MutableDocument,
                           full_text: str,
                           all_tkns: list[MutableToken]
                           ) -> list[MutableEntity]:
        all_out: list[MutableEntity] = []
        for cur_text, offset in self._create_splits(all_tkns, full_text):
            all_out.extend(self._predict(doc, cur_text, offset))
        return all_out

    def _predict(self, doc: MutableDocument, text: str,
                 char_offset: int) -> list[MutableEntity]:
        ner_ents: list[MutableEntity] = []
        for ent_dict in self.model.predict_entities(
                text, self.labels, threshold=self.gliner_cnf.threshold):
            start_char = ent_dict["start"] + char_offset
            end_char = ent_dict["end"] + char_offset
            # TODO: check the "text"?
            # value = ent_dict["text"]
            tokens = doc.get_tokens(start_char, end_char - 1)
            tokens_str = [tkn.base.lower for tkn in tokens]
            preprocessed_name = self.config.general.separator.join(tokens_str)
            if preprocessed_name not in self.cdb.name2info:
                continue
            ent = maybe_annotate_name(
                self.tokenizer, preprocessed_name, tokens,
                doc, self.cdb, self.config, len(ner_ents))
            if ent:
                ner_ents.append(ent)
        return ner_ents

    @classmethod
    def create_new_component(
            cls, cnf: ComponentConfig, tokenizer: BaseTokenizer,
            cdb: CDB, vocab: Vocab, model_load_path: str | None
    ) -> 'GlinerNER':
        return cls(tokenizer, cdb)
```
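The overlapping-window chunking in `_create_splits` can be illustrated in isolation. A minimal sketch with integers standing in for tokens and deliberately small limits (`make_splits`, `max_tokens=5`, and `overlap=2` are illustrative; the real values are 400 and 10, and the real method also maps each window back to character offsets in the full text):

```python
def make_splits(tokens, max_tokens=5, overlap=2):
    """Yield windows of at most max_tokens, repeating the last
    `overlap` tokens of each window at the start of the next."""
    leftover = list(tokens)
    while leftover:
        yield leftover[:max_tokens]
        # drop everything except the trailing overlap
        leftover = leftover[max_tokens - overlap:]


chunks = list(make_splits(range(8)))
print(chunks)  # [[0, 1, 2, 3, 4], [3, 4, 5, 6, 7], [6, 7]]
```

Each consecutive pair of windows shares `overlap` tokens, so entity mentions that would otherwise be cut at a chunk boundary still appear whole in at least one window.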
