Commit 66219d8
chore(medcat-gliner): CU-869c3bvm0 Migrate gliner implementation to public (#328)
* CU-869bepjj9 Add gliner for medcat (#165)
* CU-869bepjj9: Add gliner based NER
* CU-869bepjj9: Depend on MedCAT v2.5 and onwards for lazy registration
* CU-869bepjj9: Shuffle some code around to do registration separately
* CU-869bepjj9: Fix minor logic issue
* CU-869bepjj9: Add small registration test
* CU-869bepjj9: Add a simple test for creator validation
* CU-869bepjj9: Add workflow for gliner
* CU-869bepjj9: Pin dependency to 2.0 and higher
* CU-869bepjj9: Fix typing issue regarding new method in v2.5
* CU-869bepjj9: Add missing dunder init file
* CU-869bepjj9: Update tests to do a manual import before the medcat 2.5.0 release
* CU-869bepjj9: Fix minor typing issue to do with version check
* CU-869bepjj9: Fix another minor typing issue to do with version check
* CU-869bepjj9: Add a workaround for improper names in component classes to tests
* CU-869bepjj9: Adopt rename of external (to MedCAT) projects as plugins (#171)
* Add NER recall comparison section to README (#196): adds a section comparing NER recall between the vocab based NER and the GliNER implementation using the 2023 SNOMED CT Linking Challenge dataset
* CU-869c3bvm0: Add workflow for gliner
* CU-869c3bvm0: Add release workflow for gliner
* CU-869c3bvm0: Only do lazy registration for gliner
* CU-869c3bvm0: Update pyproject.toml to only support medcat 2.5 and up (for lazy registration)
* CU-869c3bvm0: Add publish to TestPyPI to workflow
* CU-869c3bvm0: Update version scheme in pyproject.toml
* CU-869c3bvm0: Update pyproject.toml with root / git path for gliner
* CU-869c3bvm0: Update pyproject.toml with git describe command
* CU-869c3bvm0: Set up dev version before dep install
* CU-869c3bvm0: Update some of the actions
* CU-869c36ruk: Update tag regex
* CU-869c3bvm0: [TODO: REMOVE] Add debug output regarding tags to main workflow
* CU-869c3bvm0: Fix workflow actions version issue
* Revert "CU-869c3bvm0: [TODO: REMOVE] Add debug output regarding tags to main workflow" (this reverts commit 2c15273)
* CU-869c3bvm0: Fix linting issue
* CU-869c3bvm0: Fix permissions issue with workflow PyPI push
* CU-869c3bvm0: Update gliner plugin details in plugin catalog
* CU-869c3bvm0: Fix small issue with plugin catalog
* CU-869c3bvm0: Update docstring for clarity in gliner_ner.py
* CU-869c3bvm0: Rename workflow file
* CU-869c3bvm0: Centralise workflow file
* CU-869c3bvm0: Remove unnecessary line
* CU-869c3bvm0: Move to uv in workflows
* CU-869c3bvm0: Add transitive dependency (with description) to pyproject.toml
1 parent 584e342 commit 66219d8

9 files changed

Lines changed: 467 additions & 5 deletions

Lines changed: 121 additions & 0 deletions (new file: GitHub Actions CI workflow)

```yaml
name: MedCAT-Gliner-CI (test / publish)

on:
  push:
    branches: [ main ]
    paths:
      - 'medcat-plugins/medcat-gliner/**'
      - '.github/workflows/medcat-gliner**'
    tags:
      - 'medcat-gliner/v*.*.*'
  pull_request:
    paths:
      - 'medcat-plugins/medcat-gliner/**'
      - '.github/workflows/medcat-gliner**'

permissions:
  id-token: write

defaults:
  run:
    working-directory: ./medcat-plugins/medcat-gliner

jobs:
  tests:

    runs-on: ubuntu-latest
    timeout-minutes: 30
    strategy:
      matrix:
        python-version: [ '3.10', '3.11', '3.12', '3.13' ]
      max-parallel: 4

    steps:
      - uses: actions/checkout@v6

      - name: Install uv for Python ${{ matrix.python-version }}
        uses: astral-sh/setup-uv@v7
        with:
          python-version: ${{ matrix.python-version }}
          enable-cache: true

      - name: Install dependencies
        run: |
          df -h # check space before
          uv run python -m ensurepip
          uv pip install -e ".[dev]" --extra-index-url https://download.pytorch.org/whl/cpu/
          df -h # check space after

      - name: Check types
        run: |
          uv run mypy --follow-imports=normal src/medcat_gliner --follow-untyped-imports

      - name: Lint
        run: |
          uv run ruff check src/medcat_gliner --preview

      - name: Test
        run: |
          uv run python -m unittest discover

  publish-to-test-PyPI:
    runs-on: ubuntu-latest
    needs: tests
    steps:
      - name: Checkout main
        uses: actions/checkout@v6
        with:
          fetch-depth: 0 # fetch all history
          fetch-tags: true # fetch tags explicitly

      - name: Install uv for Python 3.10
        uses: astral-sh/setup-uv@v7
        with:
          python-version: '3.10'
          enable-cache: true

      - name: Set timestamp-based dev version
        run: |
          TS=$(date -u +"%Y%m%d%H%M%S")
          echo "SETUPTOOLS_SCM_PRETEND_VERSION_FOR_MEDCAT_GLINER=0.2.2.dev${TS}" >> $GITHUB_ENV

      - name: Install dependencies
        run: |
          uv run python -m ensurepip

      - name: Build package
        run: |
          uv build

      - name: Publish distribution to TestPyPI
        uses: pypa/gh-action-pypi-publish@release/v1
        with:
          repository_url: https://test.pypi.org/legacy/
          packages_dir: medcat-plugins/medcat-gliner/dist

  test-and-publish-to-PyPI:
    runs-on: ubuntu-latest
    if: startsWith(github.ref, 'refs/tags/')
    needs: tests
    steps:
      - name: Checkout main
        uses: actions/checkout@v6

      - name: Install uv for Python 3.10
        uses: astral-sh/setup-uv@v7
        with:
          python-version: '3.10'
          enable-cache: true

      - name: Install dependencies
        run: |
          uv run python -m ensurepip

      - name: Build client package
        run: |
          uv build

      - name: Publish production distribution to PyPI
        uses: pypa/gh-action-pypi-publish@release/v1
        with:
          packages_dir: medcat-plugins/medcat-gliner/dist
```
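The `Set timestamp-based dev version` step above builds a PEP 440 dev version by appending a UTC timestamp to the base version `0.2.2` (the base version comes from the workflow; the helper name below is illustrative). A minimal Python sketch of the same scheme:

```python
from datetime import datetime, timezone


def timestamp_dev_version(base: str) -> str:
    """Mirror the shell step: <base>.dev<UTC timestamp YYYYmmddHHMMSS>."""
    ts = datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S")
    return f"{base}.dev{ts}"


version = timestamp_dev_version("0.2.2")
print(version)  # e.g. 0.2.2.dev20240101120000
```

Because the timestamp is strictly increasing, every push to `main` produces a TestPyPI version that sorts after the previous one.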
Lines changed: 38 additions & 0 deletions (new file: README.md)

# MedCAT-gliner

This provides a [gliner](https://github.com/urchade/GLiNER) based NER step for the MedCAT core library.

# Usage

First install from PyPI, e.g.:

```bash
pip install medcat-gliner
```

Subsequently, if you have an existing model, you should be able to just change the NER component:

```python
from medcat.cat import CAT
from medcat_gliner import GLiNERConfig

cat = CAT.load_model_pack("path/to/existing/model")
# change component
cat.config.components.ner.comp_name = "gliner_ner"
cat.config.components.ner.custom_cnf = GLiNERConfig()
# recreate pipe with new NER component
cat._recreate_pipe()
# use as needed
```

## NER recall comparison (linkable SNOMED entities)

The following results compare the existing NER implementation (vocab based NER with spell checking) with the gliner implementation when used as the NER component within MedCAT.
Evaluation was performed on the **2023 SNOMED CT Linking Challenge** dataset.

> **Important caveat**
> This is **not a measure of general NER quality**.
> Recall is computed only with respect to annotated, linkable SNOMED CT entities present in the linking dataset.
> Mentions outside the annotation scope are treated as false positives by construction, so precision is not meaningful here.

| Implementation        | True Positives | False Negatives | Recall | Runtime |
| --------------------- | -------------- | --------------- | ------ | ------- |
| Vocab based NER       | 10,545         | 3,917           | 0.729  | ~5m 50s |
| GliNER implementation | 7,971          | 6,491           | 0.551  | ~34m    |

As we can see, for this dataset, GliNER is significantly slower and performs worse than the standard vocab based implementation. This is likely because the vocab based NER step has been configured and tuned to work best within the MedCAT pipeline. With additional tuning, the GliNER implementation could likely perform as well as or better than the vocab based NER does.
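The recall figures in the table follow directly from recall = TP / (TP + FN); a quick check against the reported values:

```python
def recall(tp: int, fn: int) -> float:
    """Recall over annotated, linkable mentions only (see caveat above)."""
    return tp / (tp + fn)


# values from the comparison table
vocab_recall = round(recall(10_545, 3_917), 3)
gliner_recall = round(recall(7_971, 6_491), 3)
print(vocab_recall, gliner_recall)  # 0.729 0.551
```

Both implementations were evaluated on the same 14,462 annotated mentions (TP + FN is identical in both rows), so the recall values are directly comparable.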
Lines changed: 64 additions & 0 deletions (new file: `pyproject.toml`)

```toml
[build-system]
requires = ["setuptools>=61.0", "wheel", "setuptools_scm>=8"]
build-backend = "setuptools.build_meta"

[project]
name = "medcat_gliner"
dynamic = ["version"]
description = ""
readme = "README.md"
license = { text = "Apache-2.0" }
authors = [
    { name="Mart Ratas", email="mart.ratas@kcl.ac.uk" }
]
requires-python = ">=3.10"

keywords = ["NLP", "NER", "medical", "MedCAT", "gliner"]

classifiers = [
    "Development Status :: 3 - Alpha",
    "Intended Audience :: Science/Research",
    "Topic :: Scientific/Engineering :: Artificial Intelligence",
    "Programming Language :: Python :: 3",
    "Programming Language :: Python :: 3.10",
    "Programming Language :: Python :: 3.11",
    "Programming Language :: Python :: 3.12",
    "Programming Language :: Python :: 3.13",
    "License :: OSI Approved :: Apache Software License"
]

dependencies = [
    "medcat>=2.5",
    "gliner",
    # transitive dependency of gliner that `uv` pins to and
    # that doesn't work in python 3.10
    "onnxruntime<1.24; python_version < '3.11'",
]

[project.optional-dependencies]
dev = [
    "ruff",
    "mypy",
]

# entry-points to add onto medcat
[project.entry-points."medcat.plugins"]
ner_gliner = "medcat_gliner"

[project.urls]
Homepage = "https://github.com/CogStack/medcat-ops/tree/main/medcat-gliner"
Repository = "https://github.com/CogStack/medcat-ops/tree/main/medcat-gliner"
Issues = "https://github.com/CogStack/medcat-ops/issues"

[tool.setuptools_scm]
root = "../.."
tag_regex = "^medcat-gliner/v(?P<version>\\d+(?:\\.\\d+)*)(?:[ab]\\d+|rc\\d+)?$"
version_scheme = "post-release"
local_scheme = "no-local-version"
git_describe_command = "git describe --dirty --tags --long --match 'medcat-gliner/v*'"

[tool.setuptools.packages.find]
where = ["src"]

[tool.setuptools.package-data]
"medcat_gliner" = ["py.typed"]
```
Lines changed: 3 additions & 0 deletions (new file: package `__init__.py`)

```python
from .registration import do_registration as __register

__register()
```
Lines changed: 166 additions & 0 deletions (new file: `gliner_ner.py`)

```python
from typing import Iterator
import logging

from gliner import GLiNER

from medcat.components.types import AbstractEntityProvidingComponent
from medcat.components.types import CoreComponentType
from medcat.cdb import CDB
from medcat.vocab import Vocab
from medcat.config.config import ComponentConfig, SerialisableBaseModel
from medcat.tokenizing.tokenizers import BaseTokenizer
from medcat.tokenizing.tokens import MutableDocument, MutableEntity
from medcat.tokenizing.tokens import MutableToken
from medcat.components.ner.vocab_based_annotator import maybe_annotate_name


logger = logging.getLogger(__name__)


class GliNERConfig(SerialisableBaseModel):
    model_name: str = "urchade/gliner_base"
    """The model to use.

    See options:
    https://huggingface.co/models?library=gliner&sort=trending
    """
    threshold: float = 0.5
    """The threshold for the prediction.

    Higher will probably mean more false positives, while lower will
    probably mean missing some true positives (i.e. more false negatives).
    """
    chunking_overlap_tokens: int = 10
    """If we're chunking the text for the HF model, what is the overlap we use.

    For text longer than 384 words (which the GLiNER authors say is
    equivalent to 512 tokens) the text needs to be chunked in order to be
    processed by the HF model. However, if the chunking has no overlap,
    some of the context will be lost and performance will probably
    deteriorate. At the same time, larger overlaps mean more data will be
    processed, which leads to lower throughput.
    """


# NOTE: They allow up to 384 WORDs. They say
# that's equivalent to 512 tokens, see:
# https://github.com/urchade/GLiNER/issues/183#issuecomment-2330882600
# However, it's easier for us to just count the tokens.
# Also, they say you can increase it at the expense of performance
# so I'd rather do the splitting manually at that point.
# I tried 500 tokens, and it was consistently too many
# so went down to 400
MAX_TOKENS_2_GLINER_AT_ONCE = 400


class GlinerNER(AbstractEntityProvidingComponent):
    name = 'gliner_ner'

    def __init__(self, tokenizer: BaseTokenizer,
                 cdb: CDB) -> None:
        super().__init__()
        self.tokenizer = tokenizer
        self.cdb = cdb
        self.config = self.cdb.config
        self._validate_cnf()
        self._init_model()

    def _validate_cnf(self):
        cnf = self.config.components.ner.custom_cnf
        if not isinstance(cnf, GliNERConfig):
            logger.warning("No GliNERConfig was set - using default")
            cnf = self.config.components.ner.custom_cnf = GliNERConfig()
        self.gliner_cnf = cnf

    def _init_model(self):
        logger.info("Init model for %s", self.gliner_cnf.model_name)
        self.model = GLiNER.from_pretrained(self.gliner_cnf.model_name)
        # init labels from type id
        self.labels = [
            tid.name for tid in
            self.cdb.type_id2info.values()]
        logger.info("Using labels: %s", self.labels)

    def get_type(self) -> CoreComponentType:
        return CoreComponentType.ner

    def predict_entities(self, doc: MutableDocument,
                         ents: list[MutableEntity] | None = None
                         ) -> list[MutableEntity]:
        """Detect candidates for concepts - the linker will then be able
        to do the rest. It adds `entities` to doc.entities, and each
        entity can have entity.link_candidates that the linker
        will resolve.

        Args:
            doc (MutableDocument):
                Document to be annotated with named entities.
            ents (list[MutableEntity] | None):
                The entities given. This should be None.

        Returns:
            list[MutableEntity]:
                The NER'ed entities.
        """
        if ents is not None:
            raise ValueError(f"Unexpected entities sent to NER: {ents}")
        if self.cdb.has_changed_names:
            self.cdb._reset_subnames()
            self._init_model()
        text = doc.base.text.lower()
        all_tkns = list(doc)
        if len(all_tkns) > MAX_TOKENS_2_GLINER_AT_ONCE:
            return self._split_and_predict(doc, text, all_tkns)
        return self._predict(doc, text, 0)

    def _create_splits(self, all_tkns: list[MutableToken],
                       full_text: str) -> Iterator[tuple[str, int]]:
        overlap_tkns = self.gliner_cnf.chunking_overlap_tokens
        leftover_tokens = list(all_tkns)
        while leftover_tokens:
            cur_tokens = leftover_tokens[:MAX_TOKENS_2_GLINER_AT_ONCE]
            # keep overlap number of tokens
            leftover_tokens = leftover_tokens[
                MAX_TOKENS_2_GLINER_AT_ONCE - overlap_tkns:]
            start_char = cur_tokens[0].base.char_index
            end_char = cur_tokens[-1].base.char_index + len(
                cur_tokens[-1].base.text)
            cur_text = full_text[start_char:end_char]
            yield cur_text, start_char

    def _split_and_predict(self, doc: MutableDocument,
                           full_text: str,
                           all_tkns: list[MutableToken]
                           ) -> list[MutableEntity]:
        all_out: list[MutableEntity] = []
        for cur_text, offset in self._create_splits(all_tkns, full_text):
            all_out.extend(self._predict(doc, cur_text, offset))
        return all_out

    def _predict(self, doc: MutableDocument, text: str,
                 char_offset: int) -> list[MutableEntity]:
        ner_ents: list[MutableEntity] = []
        for ent_dict in self.model.predict_entities(
                text, self.labels, threshold=self.gliner_cnf.threshold):
            start_char = ent_dict["start"] + char_offset
            end_char = ent_dict["end"] + char_offset
            # TODO: check the "text"?
            # value = ent_dict["text"]
            tokens = doc.get_tokens(start_char, end_char - 1)
            tokens_str = [tkn.base.lower for tkn in tokens]
            preprocessed_name = self.config.general.separator.join(tokens_str)
            if preprocessed_name not in self.cdb.name2info:
                continue
            ent = maybe_annotate_name(
                self.tokenizer, preprocessed_name, tokens,
                doc, self.cdb, self.config, len(ner_ents))
            if ent:
                ner_ents.append(ent)
        return ner_ents

    @classmethod
    def create_new_component(
            cls, cnf: ComponentConfig, tokenizer: BaseTokenizer,
            cdb: CDB, vocab: Vocab, model_load_path: str | None
    ) -> 'GlinerNER':
        return cls(tokenizer, cdb)
```
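The overlapping-window chunking in `_create_splits` can be illustrated in isolation. A minimal sketch with integers standing in for tokens and deliberately small limits (`make_splits`, `max_tokens=5`, and `overlap=2` are illustrative; the real values are 400 and 10, and the real method also maps each window back to character offsets in the full text):

```python
def make_splits(tokens, max_tokens=5, overlap=2):
    """Yield windows of at most max_tokens, repeating the last
    `overlap` tokens of each window at the start of the next."""
    leftover = list(tokens)
    while leftover:
        yield leftover[:max_tokens]
        # drop everything except the trailing overlap
        leftover = leftover[max_tokens - overlap:]


chunks = list(make_splits(range(8)))
print(chunks)  # [[0, 1, 2, 3, 4], [3, 4, 5, 6, 7], [6, 7]]
```

Each consecutive pair of windows shares `overlap` tokens, so entity mentions that would otherwise be cut at a chunk boundary still appear whole in at least one window.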
