Commit 4bc2bcd

BM25RestrictedContextRetriever and reproducibility instructions
1 parent 729a7cb commit 4bc2bcd

2 files changed

Lines changed: 173 additions & 0 deletions

README.md

Lines changed: 128 additions & 0 deletions
@@ -1 +1,129 @@
 # Conivel: CONtext In noVELs
+
+## Installing dependencies
+
+Use `poetry install` to install dependencies. You can then use `poetry shell` to obtain a shell with the created virtual environment activated.
+
+
+# The Role of Global and Local Context in Named Entity Recognition
+
+## Reproducing Results
+
+For all the scripts presented below, results can be found under the `runs` directory. To reproduce the plots for Figures 1, 2 and 4, results must be placed under `runs/short/` and named correctly.
+
+### No Retrieval
+
+The `no retrieval` baseline for the experiments in Figures 1, 2 and 4
+can be reproduced with the following bash script:
+
+```sh
+#!/bin/bash
+
+python xp_bare.py with \
+    k=5 \
+    shuffle_kfolds_seed=0 \
+    batch_size=8 \
+    save_models=False \
+    runs_nb=3 \
+    ner_epochs_nb=2 \
+    ner_lr=2e-5 \
+    dataset_name="dekker"
+```
+
+
+### Retrieval Heuristics
+
+
+To reproduce the experiments presented in Figure 1, one can use:
+
+```sh
+#!/bin/bash
+
+for heuristic in "left" "right" "neighbors" "random" "bm25" "samenoun"; do
+
+    sents_nb_list="[1, 2, 3, 4, 5, 6]"
+    if [[ "${heuristic}" = "neighbors" ]]; then
+        sents_nb_list="[2, 4, 6]"
+    fi
+
+    python xp_kfolds.py with \
+        k=5 \
+        shuffle_kfolds_seed=0 \
+        batch_size=8 \
+        save_models=False \
+        runs_nb=3 \
+        context_retriever="${heuristic}" \
+        context_retriever_kwargs='{}' \
+        sents_nb_list="${sents_nb_list}" \
+        ner_epochs_nb=2 \
+        ner_lr=2e-5 \
+        dataset_name="dekker"
+
+done
+```
+
+The `plot_mean_test_f1.py` script can then be used to plot the curves found
+in the paper.
+
+
+### Oracle Versions of Retrieval Heuristics
+
+Experiments found in Figure 2 can be reproduced with the following code:
+
+```sh
+#!/bin/bash
+
+for heuristic in "left" "right" "neighbors" "random" "bm25" "samenoun"; do
+
+    sents_nb_list="[1, 2, 3, 4, 5, 6]"
+    if [[ "${heuristic}" = "neighbors" ]]; then
+        sents_nb_list="[2, 4, 6]"
+    fi
+
+    python xp_ideal_neural_retriever.py with \
+        k=5 \
+        shuffle_kfolds_seed=0 \
+        batch_size=8 \
+        save_models=False \
+        runs_nb=3 \
+        retrieval_heuristic="${heuristic}" \
+        retrieval_heuristic_inference_kwargs='{"sents_nb": 16}' \
+        sents_nb_list="${sents_nb_list}" \
+        ner_epochs_nb=2 \
+        ner_lr=2e-5
+
+done
+```
+
+`plot_mean_test_f1.py -r` can be used to reproduce Figure 2.
+
+
+### Retrieved Sentences Distance Distribution
+
+Experiments in Figure 3 can be reproduced using `xp_dist.py -r -o dists.json`, and the plot in the paper can then be reproduced with `plot_dist.py -i dists.json`.
+
+
+### Restricted BM25 Heuristic
+
+To reproduce the experiment found in Figure 4, use:
+
+```sh
+python xp_ideal_neural_retriever.py with \
+    k=5 \
+    shuffle_kfolds_seed=0 \
+    batch_size=8 \
+    save_models=False \
+    runs_nb=3 \
+    retrieval_heuristic="bm25_restricted" \
+    retrieval_heuristic_inference_kwargs='{"sents_nb": 16}' \
+    sents_nb_list='[1,2,3,4,5,6]' \
+    ner_epochs_nb=2 \
+    ner_lr=2e-5
+```
+
+The plot can be reproduced with `plot_mean_test_f1.py -e`.
+
+
+### Appendix: Dataset Details
+
+Figure 5 can be reproduced using `plot_dekker_books_len.py`.
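
The bash loop in the Retrieval Heuristics section gives `neighbors` only even context sizes (`[2, 4, 6]`), while every other heuristic sweeps 1 to 6. As a rough sketch, the same command lines can be assembled in Python. `sents_nb_list_for` and `build_command` are hypothetical helper names that are not part of the repository, and the script only prints the commands rather than running them.

```python
import json
import shlex
from typing import List

HEURISTICS = ["left", "right", "neighbors", "random", "bm25", "samenoun"]


def sents_nb_list_for(heuristic: str) -> List[int]:
    """Context sizes swept for a heuristic, mirroring the README's bash loop."""
    if heuristic == "neighbors":
        return [2, 4, 6]
    return [1, 2, 3, 4, 5, 6]


def build_command(heuristic: str) -> List[str]:
    """Argument vector equivalent to one iteration of the README's loop."""
    return [
        "python", "xp_kfolds.py", "with",
        "k=5",
        "shuffle_kfolds_seed=0",
        "batch_size=8",
        "save_models=False",
        "runs_nb=3",
        f"context_retriever={heuristic}",
        "context_retriever_kwargs={}",
        # json.dumps renders [2, 4, 6] as "[2, 4, 6]", the same text as in the bash script
        f"sents_nb_list={json.dumps(sents_nb_list_for(heuristic))}",
        "ner_epochs_nb=2",
        "ner_lr=2e-5",
        "dataset_name=dekker",
    ]


if __name__ == "__main__":
    for heuristic in HEURISTICS:
        # shlex.join quotes each argument so the line can be pasted into a shell
        print(shlex.join(build_command(heuristic)))
```

Printing first makes it easy to compare the arguments against the bash version before launching any training run.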

conivel/datas/context.py

Lines changed: 45 additions & 0 deletions
@@ -266,6 +266,50 @@ def retrieve(
         ]
 
 
+class BM25RestrictedContextRetriever(ContextRetriever):
+    """A BM25 context selector that skips the query sentence and its 6 neighbors on each side."""
+
+    def __init__(self, sents_nb: Union[int, List[int]]) -> None:
+        """
+        :param sents_nb: number of context sentences to select. If a
+            list, the number of context sentences to select will be
+            picked randomly among this list at call time.
+        """
+        super().__init__(sents_nb)
+
+    @staticmethod
+    def _get_bm25_model(document: List[NERSentence]) -> BM25Okapi:
+        return BM25Okapi([sent.tokens for sent in document])
+
+    def retrieve(
+        self, sent_idx: int, document: List[NERSentence]
+    ) -> List[ContextRetrievalMatch]:
+        if isinstance((sents_nb := self.sents_nb), list):
+            sents_nb = random.choice(sents_nb)
+
+        bm25_model = self._get_bm25_model(document)
+        query = document[sent_idx].tokens
+        sent_scores = bm25_model.get_scores(query)
+        sent_scores[sent_idx] = -1  # don't retrieve self
+        # HACK: exclude close sentences (up to 6 on each side of the query)
+        for i in range(1, 7):
+            try:
+                sent_scores[sent_idx + i] = -1
+            except IndexError:
+                pass
+            if sent_idx - i >= 0:  # index 0 counts as a neighbor too
+                sent_scores[sent_idx - i] = -1
+        topk_values, topk_indexs = torch.topk(
+            torch.tensor(sent_scores), k=min(sents_nb, len(sent_scores)), dim=0
+        )
+        return [
+            ContextRetrievalMatch(
+                document[index], index, "left" if index < sent_idx else "right", value
+            )
+            for value, index in zip(topk_values.tolist(), topk_indexs.tolist())
+        ]
+
+
 @dataclass(frozen=True)
 class ContextRetrievalExample:
     """A context selection example, to be used for training a context selector."""
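
The exclusion window in `BM25RestrictedContextRetriever.retrieve` can be illustrated without `rank_bm25` or `torch`. The sketch below is a simplified stand-in, not the repository's code: `restricted_topk` is a hypothetical helper that takes precomputed scores, masks the query sentence and every index within `window` positions of it (clamped to the document bounds, so index 0 is masked when it falls inside the window), and returns the `k` best remaining indices.

```python
from typing import List, Tuple

MASKED = -1.0  # same sentinel value the retriever assigns to excluded sentences


def restricted_topk(
    scores: List[float], sent_idx: int, k: int, window: int = 6
) -> List[Tuple[int, float]]:
    """Return up to k (index, score) pairs, best first, after masking the query's neighborhood."""
    masked = list(scores)  # copy so the caller's scores are left untouched
    lo = max(0, sent_idx - window)
    hi = min(len(masked) - 1, sent_idx + window)
    for i in range(lo, hi + 1):  # covers the query itself and both sides
        masked[i] = MASKED
    # Masked entries sink to the bottom of the ranking, but they can still be
    # returned when fewer than k unmasked candidates remain.
    ranked = sorted(range(len(masked)), key=lambda i: masked[i], reverse=True)
    return [(i, masked[i]) for i in ranked[:k]]


if __name__ == "__main__":
    toy_scores = [float(i) for i in range(20)]  # made-up scores, not real BM25 output
    print(restricted_topk(toy_scores, sent_idx=10, k=4))
    # → [(19, 19.0), (18, 18.0), (17, 17.0), (3, 3.0)]
```

With the query at index 10, indices 4 through 16 are masked, so the fourth pick falls back to index 3 on the left side. In short documents every sentence may be masked; a caller that cares could filter out pairs whose score equals the sentinel.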
@@ -918,6 +962,7 @@ def retrieve(
     "left": LeftContextRetriever,
     "right": RightContextRetriever,
     "bm25": BM25ContextRetriever,
+    "bm25_restricted": BM25RestrictedContextRetriever,
     "samenoun": SameNounRetriever,
     "random": RandomContextRetriever,
 }
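
The second hunk registers the new class under the key `bm25_restricted`, which is the value passed as `retrieval_heuristic` in the README command for Figure 4. How the repository resolves that string is not shown in this diff. The sketch below shows one common way such a name-to-class mapping is consumed; the stub classes, `RETRIEVERS`, and `make_retriever` are all hypothetical.

```python
from typing import Dict, Type


class ContextRetrieverStub:
    """Hypothetical stand-in for the repository's ContextRetriever base class."""

    def __init__(self, sents_nb: int) -> None:
        self.sents_nb = sents_nb


class BM25Stub(ContextRetrieverStub):
    """Hypothetical stand-in for BM25ContextRetriever."""


class BM25RestrictedStub(ContextRetrieverStub):
    """Hypothetical stand-in for BM25RestrictedContextRetriever."""


# Name-to-class registry, shaped like the dict extended in the hunk above.
RETRIEVERS: Dict[str, Type[ContextRetrieverStub]] = {
    "bm25": BM25Stub,
    "bm25_restricted": BM25RestrictedStub,
}


def make_retriever(name: str, **kwargs) -> ContextRetrieverStub:
    """Instantiate the retriever registered under `name`, failing loudly on typos."""
    try:
        cls = RETRIEVERS[name]
    except KeyError:
        raise ValueError(
            f"unknown retriever {name!r}; expected one of {sorted(RETRIEVERS)}"
        ) from None
    return cls(**kwargs)


if __name__ == "__main__":
    retriever = make_retriever("bm25_restricted", sents_nb=16)
    print(type(retriever).__name__, retriever.sents_nb)
```

Keeping the lookup in one function means a misspelled heuristic name fails before any training starts, instead of surfacing as a bare `KeyError` deep in an experiment script.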
