match_dict.json format
melisa-qordoba edited this page Sep 23, 2020 · 1 revision
Here is a minimal match_dict.json:
{
  "extract-revenge": {
    "patterns": [
      {
        "LEMMA": "extract",
        "TEMPLATE_ID": 1
      }
    ],
    "suggestions": [
      [
        {
          "TEXT": "exact",
          "FROM_TEMPLATE_ID": 1
        }
      ]
    ],
    "match_hook": [
      {
        "name": "succeeded_by_phrase",
        "args": "revenge",
        "match_if_predicate_is": true
      }
    ],
    "test": {
      "positive": [
        "And at the same time extract revenge on those he so despises?",
        "Watch as Tampa Bay extracts revenge against his former Los Angeles Rams team."
      ],
      "negative": ["Mother flavours her custards with lemon extract."]
    }
  }
}
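Each top-level key of the file names one rule. Python's `json` module silently keeps only the last value when a key is repeated, so a duplicated rule name would overwrite an earlier rule without any warning. The following is a minimal sketch of a loader that refuses duplicates, using only the standard library. `load_match_dict` and `reject_duplicates` are illustrative names, not part of replaCy.

```python
import json


def reject_duplicates(pairs):
    """object_pairs_hook: build a dict, raising if any key repeats."""
    result = {}
    for key, value in pairs:
        if key in result:
            raise ValueError(f"duplicate key in match dict: {key!r}")
        result[key] = value
    return result


def load_match_dict(text):
    """Parse match_dict JSON text, refusing duplicate keys at any nesting level."""
    return json.loads(text, object_pairs_hook=reject_duplicates)


good = '{"extract-revenge": {"patterns": []}, "other-rule": {"patterns": []}}'
bad = '{"extract-revenge": {"patterns": []}, "extract-revenge": {"patterns": []}}'

rules = load_match_dict(good)
print(sorted(rules))  # the rule names found in the file

try:
    load_match_dict(bad)
except ValueError as err:
    print(err)  # reports the repeated rule name
```

Because the hook runs for every JSON object, the same check also catches repeated inner keys within a single rule.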
The top-level key, `extract-revenge`, must be unique (as must any dictionary key). The name is used as a unique identifier but is never shown.
The inner keys are as follows:
- `patterns`: a list of spaCy Matcher patterns (actually, a superset of the spaCy Matcher pattern syntax), which may look like e.g. `[{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}]`. The added syntax that makes it a superset is the option to add `"TEMPLATE_ID": int` to some of the dicts. This labels that part of the match as a template to be inflected, such as a verb to conjugate or a noun to pluralize. In the example above, the lemma `extract` is labeled with a `TEMPLATE_ID` of `1`.
- `suggestions`: a list of lists of dicts. Each dict has one or two keys:
  - just `"TEXT"` (str), which is used as-is in the suggestion;
  - just `"PATTERN_REF"` (int), which copies the `PATTERN_REF`'s token from the matched text;
  - both `"TEXT": "sometext"` and `"FROM_TEMPLATE_ID": int`, which applies the conjugation or pluralization of the token whose `TEMPLATE_ID` is `int` to `"TEXT"`. In the example above, `suggestions` is `[[{"TEXT": "exact", "FROM_TEMPLATE_ID": 1}]]`, which means `exact` is inflected to match the matched form of `extract` from the step above (so a matched `extracts` suggests `exacts`);
  - both `"PATTERN_REF"` (int) and `"INFLECTION"` (str), where `"INFLECTION"` is an explicit POS tag. Use this when you want to reference the `PATTERN_REF`'s token from the pattern but conjugate it to a different form (so far I have only seen this used for grammar rules). Example: `{"PATTERN_REF": 1, "INFLECTION": "VBN"}` takes the second token from the matched pattern and conjugates it into the past participle.
- `match_hook`: (despite the singular name) a list of "match hooks". These are Python functions that refine matches. See the following section.
- `test`: has `positive` and `negative` keys. `positive` is a list of strings that this rule SHOULD match; `negative` is a list of strings that it SHOULD NOT match. Used only for testing now, but we plan to infer rules from this section.
- (optional) `comment`: a string for other humans to read; ignored by replaCy.
- (optional) `anything`: you can add any extra structure here, and replaCy will attempt to tag matching spans with this information using the spaCy custom extension attributes namespace `span._` (spaCy docs). For example, you can add the key `oogly` with value `"boogly"` for the match `"LOWER": "secret password"`. Then, if you call `span = rmatcher("This is the secret password.")[0]`, `span._.oogly == "boogly"`. replaCy tries to pick sensible default values for user-defined extensions. If you have a match with the key-value pair `"coolness": 10`, replaCy will infer that `coolness` is an `int`. When it adds `coolness` to all spaCy spans, it makes `span._.coolness` default to `0`. This way, you can check any span with `if span._.coolness > THRESHOLD` without causing an `AttributeError`. You can change the default the way you would change any spaCy custom attribute, e.g.
    from spacy.tokens import Span
    # force=True overwrites the extension if it has already been registered
    Span.set_extension("coolness", default=9000, force=True)
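The default-value behaviour described in the last item can be pictured with a small sketch. This is not replaCy's actual code: `infer_default`, `extension_defaults`, and `RESERVED_KEYS` are hypothetical names, and only the `int` to `0` mapping comes from this page. The defaults chosen for the other types are assumptions made for illustration.

```python
# Keys with a documented meaning on this page; anything else is treated as user-defined.
RESERVED_KEYS = {"patterns", "suggestions", "match_hook", "test", "comment"}


def infer_default(value):
    """Return a neutral default of the same type as `value` (assumed behaviour)."""
    # bool is checked before int because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return False
    if isinstance(value, int):
        return 0  # the one case stated on this page
    if isinstance(value, float):
        return 0.0
    if isinstance(value, str):
        return ""
    if isinstance(value, list):
        return []
    if isinstance(value, dict):
        return {}
    return None


def extension_defaults(rule):
    """Map each user-defined key in a rule to its inferred default."""
    return {
        key: infer_default(value)
        for key, value in rule.items()
        if key not in RESERVED_KEYS
    }


rule = {
    "patterns": [{"LOWER": "secret"}, {"LOWER": "password"}],
    "suggestions": [],
    "oogly": "boogly",
    "coolness": 10,
}
defaults = extension_defaults(rule)
print(defaults)
```

With defaults like these registered on every span, a comparison such as `span._.coolness > THRESHOLD` is safe even on spans that no rule has tagged.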
Between match hooks and custom span attributes, replaCy lets you control much of your NLP application's behavior from a single JSON file.
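The key descriptions above can be condensed into a quick structural check to run before loading a file. This is a sketch under stated assumptions: `check_rule` and `ALLOWED_SUGGESTION_KEYS` are hypothetical names, not part of replaCy, and the check only looks at the shape of keys that are present. Whether replaCy itself requires a given key is not something this page states.

```python
# The four key combinations this page lists for a suggestion dict.
ALLOWED_SUGGESTION_KEYS = [
    {"TEXT"},
    {"PATTERN_REF"},
    {"TEXT", "FROM_TEMPLATE_ID"},
    {"PATTERN_REF", "INFLECTION"},
]


def check_rule(name, rule):
    """Return a list of problems found in one rule (empty if none)."""
    problems = []
    for key in ("patterns", "suggestions", "match_hook"):
        if key in rule and not isinstance(rule[key], list):
            problems.append(f"{name}: {key!r} should be a list")
    suggestions = rule.get("suggestions")
    if isinstance(suggestions, list):
        for i, suggestion in enumerate(suggestions):
            if not isinstance(suggestion, list):
                problems.append(f"{name}: suggestions[{i}] should be a list of dicts")
                continue
            for j, part in enumerate(suggestion):
                if not isinstance(part, dict) or set(part) not in ALLOWED_SUGGESTION_KEYS:
                    problems.append(
                        f"{name}: suggestions[{i}][{j}] has an unexpected key combination"
                    )
    if "test" in rule:
        test = rule["test"]
        for key in ("positive", "negative"):
            if not (isinstance(test, dict) and isinstance(test.get(key), list)):
                problems.append(f"{name}: test[{key!r}] should be a list of strings")
    return problems


match_dict = {
    "extract-revenge": {
        "patterns": [{"LEMMA": "extract", "TEMPLATE_ID": 1}],
        "suggestions": [[{"TEXT": "exact", "FROM_TEMPLATE_ID": 1}]],
        "match_hook": [
            {"name": "succeeded_by_phrase", "args": "revenge", "match_if_predicate_is": True}
        ],
        "test": {
            "positive": ["And at the same time extract revenge on those he so despises?"],
            "negative": ["Mother flavours her custards with lemon extract."],
        },
    },
    # Deliberately malformed: "TEXT" with "INFLECTION" is not one of the listed combinations.
    "broken-rule": {"suggestions": [[{"TEXT": "x", "INFLECTION": "VBN"}]]},
}

report = {name: check_rule(name, rule) for name, rule in match_dict.items()}
for name, problems in report.items():
    print(name, problems or "ok")
```

Returning a list of messages rather than raising on the first problem makes it easy to report every issue in a large file in one pass.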