litsearchr-tutorial/litsearchr_tutorial.Rmd at master · luketudge/litsearchr-tutorial · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
---
title: "litsearchr tutorial"
output:
  html_document:
    toc: TRUE
    toc_float: TRUE
    df_print: "paged"
---

The R package `litsearchr` provides various functions to help with planning a systematic search of the scientific literature on a given topic. This tutorial gives an example of how to use `litsearchr`, along with some brief explanations of its workings.

`litsearchr` was created by the amazing [Eliza Grames](https://elizagrames.github.io), and a lot of this tutorial is an elaboration of the existing [vignette](https://elizagrames.github.io/litsearchr/litsearchr_vignette_v041.html) for the package.

## Setup

As well as `litsearchr` itself, we will use a few other R packages in the tutorial. Let's load them.

```{r, message=FALSE}
library(dplyr)
library(ggplot2)
library(ggraph)
library(igraph)
library(readr)
```

## Installation

To use the `litsearchr` package, we need to install it first.

`litsearchr` isn't (yet) stored in the package repository of [CRAN](https://cran.r-project.org/web/packages/index.html), the Comprehensive R Archive Network. This means that the usual `install.packages()` function or the Install button in RStudio won't find it. Instead, we can install the package from Eliza's GitHub repository.

The `devtools` package provides a function for installing R packages from GitHub instead of from CRAN. So we can use this.

```{r, eval=FALSE}
library(devtools)
install_github("elizagrames/litsearchr", ref="main")
```

Now that we have installed `litsearchr` we can load it.

```{r}
library(litsearchr)
```

`litsearchr` is a new package and is currently in development. So we should keep track of which version we are using, in case we later work with a newer version and find that the examples in this tutorial no longer work.

```{r}
packageVersion("litsearchr")
```

## Example

Let's imagine that we want to retrieve journal articles about treating phobias with a combination of medication and cognitive-behavioral therapy. Our overall topic consists of the three sub-topics *medication*, *CBT* and *phobia*, and we want to get articles that are about all three.

The starting point in `litsearchr` is an existing search that we have already carried out. We first conduct a quick, cursory search for our topic of interest. This is known as the 'naive search'. `litsearchr` will then take the results of our naive search and suggest improvements that might capture more articles that are relevant to our topic (and hopefully fewer that are not relevant).

We start by going to the [PubMed search page](https://pubmed.ncbi.nlm.nih.gov/advanced) and enter the search in the **Query box** there. For this example, I entered the following three search terms, combined with AND, like this:

> `(medication) AND (CBT) AND (phobia)`

If you click on the **Send to** button above the PubMed search results, and then select to send them to **Citation manager**, you will be prompted to download all the results to a file that `litsearchr` can read. I have downloaded the example results to a file called *pubmed-medication-set.nbib*.

## Loading results

We can load results from a file using the `litsearchr` function `import_results()`. We give the file name as the `file` argument.

```{r}
naive_results <- import_results(file="pubmed-medication-set.nbib")
```

`import_results()` gives us a dataframe in which each result is a row. We can see from the number of rows how many results our search got.

```{r}
nrow(naive_results)
```

We can take a look at the first few.

```{r}
naive_results
```

There are columns for the title, authors, date, abstract, and so on. We can check the names of all the columns to see all the information we have on each search result.

```{r}
colnames(naive_results)
```

And just as a check, let's take a look at the title of the first result.

```{r}
naive_results[1, "title"]
```

## Getting potential search terms

The next step is to analyze the search results that we already have and look to see whether there are more search terms in them that might be related to our topic. If so, we can use these additional terms in a new search to get more relevant results.

There are two different ways of searching for new terms.

### Keywords

Many authors or journals already provide a list of keywords attached to the article. So the simplest way to get new search terms is just to look at what keywords were already provided in the articles we already found.

As an example, we can take a look at the keywords for the first article in our results.

```{r}
naive_results[1, "keywords"]
```

The keywords are missing (`NA`) from this article. In fact, they are missing from quite a few. The fourth article in our results is the first that has any keywords supplied.

```{r}
naive_results[4, "keywords"]
```

How many articles are missing keywords? We can count up the number of `NA` values to find out.

```{r}
sum(is.na(naive_results[, "keywords"]))
```

More than half of them. So relying on the provided keywords might not always be such a great approach. But for the purposes of demonstration let's see how we could use them in `litsearchr`.

`litsearchr` has a function `extract_terms()` that can gather the keywords from this column of our search results. The `keywords` argument is where we put the column of keywords from our results dataframe. The `method="tagged"` argument lets `extract_terms()` know that we are getting keywords that article authors themselves have provided (or 'tagged' the article with).

```{r}
extract_terms(keywords=naive_results[, "keywords"], method="tagged")
```

We seem to have got only multi-word phrases, no single words. And we didn't get very many. As is often the case in programming, we need to check the documentation for `extract_terms()` to see what its default behavior is. (We can get the documentation in RStudio by typing `? extract_terms` at the console).

We see that `extract_terms()` has a few default arguments:

* `min_freq=2`. Only get keywords that appear at least twice in the full set of results. This is good for making sure that we are only getting keywords that are related to more than just one article in our field of interest. But it might also miss out some important extra suggestions.
* `min_n=2`. Only get keywords that consist of at least two words. This is why we only see multi-word phrases in the keywords we just got.
* `max_n=5`. Get keywords up to five words long. Maybe this is longer than we need.

We can experiment with changing some of these arguments. For example, let's try including single words. This time, let's store the result in a variable, as we will use it later on.

```{r}
keywords <- extract_terms(keywords=naive_results[, "keywords"], method="tagged", min_n=1)

keywords
```

This gets us more search terms. Some of these might be useful new terms to include in our literature search. But others are clearly too broad, for example *outcome*, or are from tangential topics, for example *virtual reality*.

You could try narrowing the terms down a bit, for example by asking only for those that are provided for at least three of the articles in our search. Where there are additional arguments to a function like this, it is a good idea to try out a few variations to get some idea of how they work. You can give this a go. For now, let's move on to a different method of getting search terms from our naive search.

### Titles and abstracts

We saw above that not all articles even provide keywords. And maybe the authors of the articles themselves can't always be relied upon to tag their articles with all the keywords that might link their article to relevant topics. So an alternative (or supplementary) method of finding new relevant search terms is instead to search in the titles and/or abstracts of the articles.

Here we will search in the titles only, to keep the example manageable. The article abstracts are much longer and contain many additional irrelevant words. Filtering these out would take a lot of work.

Titles are fairly short and should contain mostly relevant terms. But we will still need to filter them down a bit to get only the 'interesting' words. Somebody has already invented a method for doing this, called Rapid Automatic Keyword Extraction (RAKE). `litsearchr` can apply this method if we tell it to (the argument for doing so has the curious name `"fakerake"`, because in fact `litsearchr` uses a slightly simplified version of the full RAKE technique).

Just as when we searched among the author-provided keywords, here too we can give arguments `min_n`, and `max_n` to choose whether we include single words or multi-word phrases, and we can give `min_freq` to exclude words that occur in too few of the titles in our original search.

```{r}
extract_terms(text=naive_results[, "title"], method="fakerake", min_freq=3, min_n=2)
```

Searching in the titles gets us more search terms than we got from just the author-provided keywords. In part this is because every article has a title whereas not every article supplies keywords. And in part it is because the titles also contain additional irrelevant words, whereas the keywords have been specially chosen for their relevance. The title search probably gets us many more terms than we need.

Some of the phrases that appear in a title are general science or data analysis terms and are not related to the specific topic of the article. In language analysis, such frequently-occurring but uninformative words are often called 'stopwords'. If we are going to work with `litsearchr` a lot and will often need to filter out the same set of stopwords, then it can be handy to keep these words in a text file. We can then read this file into R when we need them, and we can add to the file each time we encounter more words that we know are irrelevant.

I have prepared a (non-exhaustive) list of generic clinical psychology words that occurred in this set of articles. They are stored in the file **clin_psy_stopwords.txt**. Let's read it in and take a look.

```{r}
clinpsy_stopwords <- read_lines("clin_psy_stopwords.txt")

clinpsy_stopwords
```

The `extract_terms()` function provides a `stopwords` argument that we can use to filter out words. `litsearchr` also provides a general list of stopwords for English (and for some other languages) via the `get_stopwords()` function, so we can also add this to our own more specific stopwords.

```{r}
all_stopwords <- c(get_stopwords("English"), clinpsy_stopwords)
```

Now let's use this big list of stopwords to have another go at extracting only relevant search terms.

```{r}
title_terms <- extract_terms(
  text=naive_results[, "title"],
  method="fakerake",
  min_freq=3, min_n=2,
  stopwords=all_stopwords
)

title_terms
```

This looks a bit better. We should be careful not to accidentally exclude any potentially relevant words at this stage, and as usual exploration is important. Try out different lists of stopwords and different values of `min_freq`, `min_n`, and `max_n` to get a clear impression of the articles in the naive search.

Let's finish by adding together the search terms we got from the article titles and those we got from the keywords earlier, removing duplicates.

```{r}
terms <- unique(c(keywords, title_terms))
```

## Network analysis

Our list of new search terms is looking fairly good. But there are probably still some in there that are unrelated to the others and to our topic of interest. These perhaps only occur in a small number of articles that do not mention many of the other search terms. We would like some systematic way of identifying these 'isolated' search terms.

One way to do this is to analyze the search terms as a network. The idea behind this is that terms are linked to each other by virtue of appearing in the same articles. If we can find out which terms tend to occur together in the same article, we can pick out groups of terms that are probably all referring to the same topic (because they occur in the same subset of articles), and we can filter out terms that do not often occur together with any of the main groups of terms.

We don't have the full texts of the articles, and in any case checking for search terms in a large number of full texts would be a big task for my puny computer. So we will take the title and abstract of each article as the 'content' of that article, and count a term as having occurred in that article if it is to be found in either the title or abstract. For this, we need to join the title of each article to its abstract.

```{r}
docs <- paste(naive_results[, "title"], naive_results[, "abstract"])
```

Let's just check the first one to make sure we did this right.

```{r}
docs[1]
```

We now create a matrix that records which terms appear in which articles. The `litsearchr` function `create_dfm()` does this. 'DFM' stands for 'document-feature matrix', where the 'documents' are our articles and the 'features' are the search terms. The `elements` argument is the list of documents. The `features` argument is the list of terms whose relationships we want to analyze within that set of documents.

```{r}
dfm <- create_dfm(elements=docs, features=terms)
```

The rows of our matrix represent the articles (their titles and abstracts), and the columns represent the search terms. Each entry in the matrix records how many times that article contains that term. For example, if we look at the first three articles we see that *adherence* does not occur in any of them, *adolescents* occurs in the third, *antidepressant* occurs in the first two, and *anxiety* occurs in all of them.

```{r}
dfm[1:3, 1:4]
```

We can then turn this matrix into a network of linked search terms, using the `litsearchr` function `create_network()`. This function has an argument `min_studies` that excludes terms that occur in fewer than a given number of articles.

```{r}
g <- create_network(dfm, min_studies=3)
```

### ggraph

We should try to make a picture of our network of terms to get a better idea of its structure. This isn't easy, as we need to show a large number of terms plus all the links between them. The `ggraph` package provides some great tools for drawing networks. There is a lot to learn and discover about drawing networks, and you can look at the [tutorials for `ggraph`](https://www.data-imaginist.com/2017/ggraph-introduction-layouts) if you want to learn more. But here network visualization is just an aside, so we will keep it simple.

The `ggraph()` function takes the network that we got from `create_network()` as its argument and draws it as a graph. In addition, we can specify a layout for the graph. Here we use the 'Kamada and Kawai' layout. To sum it up very simply, this layout draws terms that are closely linked close together, and those that are less closely linked further away from each other. We add some labels showing what the actual terms are, using `geom_node_text()`. Since there are far too many terms to show all of them without completely cluttering the figure, we show just an arbitrary subset of them that do not overlap on the figure, using the argument `check_overlap=TRUE`. Finally, we also add lines linking the terms, using `geom_edge_link()`. We color these lines so there are more solid lines linking terms that appear in more articles together (this is called the 'weight' of the link).

```{r}
ggraph(g, layout="stress") +
  coord_fixed() +
  expand_limits(x=c(-3, 3)) +
  geom_edge_link(aes(alpha=weight)) +
  geom_node_point(shape="circle filled", fill="white") +
  geom_node_text(aes(label=name), hjust="outward", check_overlap=TRUE) +
  guides(edge_alpha=FALSE)
```

Given the way in which we drew the graph, it tells us a few things about our search terms.

Terms that appear near the center of the graph and that are linked to each other by darker lines are probably more important for our overall topic. Here these include for example *cbt*, *phobia*, and *behavioral therapy*.

Terms that appear at the periphery of the graph and linked to it only by faint lines are not closely related to any other terms. These are mostly tangential terms that are related to, but not part of, our main topic, for example *functional magnetic resonance imaging* and *emotion regulation*.

Interestingly, we see some anomalies at the periphery of the graph. The clearly very relevant terms *cognitive behavior therapy* and *cognitive behavioural therapy* appear far from the center despite being variants of one of our main naive search terms. This is probably because they are minority variations on the terminology. One involves the British spelling of *behaviour*, and the other uses the word *behavior* where most authors prefer *behavioral*. We will need to treat the British spelling issue as a special case later on, and this is a good illustration of the importance of looking carefully at our results.

### Pruning

Now let's use the network to rank our search terms by importance, with the aim of pruning away some of the least important ones.

The 'strength' of each term in the network is the number of other terms that it appears together with. We can get this information from our network using the `strength()` function from the `igraph` package (behind the scenes, `litsearchr` uses `igraph` for some of the workings of its network analyses). If we then arrange the terms in ascending order of strength we see those that might be the least important.

```{r}
strengths <- strength(g)

data.frame(term=names(strengths), strength=strengths, row.names=NULL) %>%
  mutate(rank=rank(strength, ties.method="min")) %>%
  arrange(strength) ->
  term_strengths

term_strengths
```

At the top are the terms that are most weakly linked to the others. For some of them you can compare this with their positions on the graph visualization above, where they appear near the margins of the figure. In most cases, terms like these are completely irrelevant and have occurred in a few of the articles in the naive search for arbitrary reasons, for example *virtual reality*. Some are perhaps still relevant but are just very rarely used.

We would like to discard some of the terms that only rarely occur together with the others. What rule can we use to make a decision about which to discard? To get an idea of how we might approach this question, let's visualize the strengths of the terms.

```{r}
cutoff_fig <- ggplot(term_strengths, aes(x=rank, y=strength, label=term)) +
  geom_line() +
  geom_point() +
  geom_text(data=filter(term_strengths, rank>5), hjust="right", nudge_y=20, check_overlap=TRUE)

cutoff_fig
```

The figure shows the terms in ascending order of strength from left to right. Again, only an arbitrary subset of the terms are labeled, so as not to clutter the figure. We can see for example that there are five terms (starting with *behavioral therapy*) that have much higher strengths than the others.

Let's use this figure to visualize two methods that `litsearchr` offers for pruning away the search terms least closely linked to the others. Both of these involve finding a cutoff value for term strength, such that we discard terms with a strength below that value. The `find_cutoff()` function implements these methods.

#### Cumulatively

One simple way to decide on a cutoff is to choose to retain a certain proportion of the total strength of the network of search terms, for example 80%. If we supply the argument `method="cumulative"` to the `find_cutoff()` function, we get the cutoff strength value according to this method. The `percent` argument specifies what proportion of the total strength we would like to retain.

```{r}
cutoff_cum <- find_cutoff(g, method="cumulative", percent=0.8)

cutoff_cum
```

Let's see this on our figure.

```{r}
cutoff_fig +
  geom_hline(yintercept=cutoff_cum, linetype="dashed")
```

Once we have found a cutoff value, the `reduce_graph()` function applies it and prunes away the terms with low strength. The arguments are the original network and the cutoff. The `get_keywords()` function then gets the remaining terms from the reduced network.

```{r}
get_keywords(reduce_graph(g, cutoff_cum))
```

#### Changepoints

Looking at the figure above, another method of pruning away terms suggests itself. There are certain points along the ranking of terms where the strength of the next strongest term is much greater than that of the previous one (places where the ascending line 'jumps up'). We could use these places as cutoffs, since the terms below them have much lower strength than those above. There may of course be more than one place where term strength jumps up like this, so we will have multiple candidates for cutoffs. The same `find_cutoff()` function with the argument `method="changepoint"` will find these cutoffs. The `knot_num` argument specifies how many 'knots' we wish to slice the keywords into.

```{r}
cutoff_change <- find_cutoff(g, method="changepoint", knot_num=3)

cutoff_change
```

This time we get several suggested cutoffs. Let's put them on our figure, where we can see that they cut off the search terms just before large increases in term strength.

```{r}
cutoff_fig +
  geom_hline(yintercept=cutoff_change, linetype="dashed")
```

After doing this, we can apply the same `reduce_graph()` function that we did for the cumulative strength method. The only difference is that we have to pick one of the cutoffs in our vector.

```{r}
g_redux <- reduce_graph(g, cutoff_change[1])
selected_terms <- get_keywords(g_redux)

selected_terms
```

Remember the issue with the British spelling of *behaviour*. We should add this term back in. It is also probably a good idea to add back in the terms of our original naive search if they did not turn up in our final set of terms.

```{r}
extra_terms <- c(
  "medication",
  "cognitive-behavioural therapy",
  "cognitive behavioural therapy"
)

selected_terms <- c(selected_terms, extra_terms)

selected_terms
```

## Grouping

Now that we have got a revised list of search terms from the results of our naive search, we want to turn them into a new search query that we can use to get more articles relevant to the same topic. For this new, hopefully more rigorous, search we will need a combination of `OR` and `AND` operators. The `OR` operator should combine search terms that are all about the same subtopic, so that we get articles that contain any one of them. The `AND` operator should combine these groups of search terms so that we get only articles that mention at least one term from each of the subtopics that we are interested in.

In our starting example we had three subtopics, medication, cognitive-behavioral therapy, and phobias. So the first step is to take the terms that we have found and group them into clusters related to each of these topics. (Or it may be the case that our work with `litsearchr` has led us to rethink our subtopics or to add new ones.)

There are methods for automatically grouping networks into clusters, but these are not always so reliable; computers don't understand what words mean. Currently the `litsearchr` documentation recommends doing this step manually. We look at our search terms, and put them into a list of separate vectors, one for each subtopic.

```{r}
grouped_terms <-list(
  medication=selected_terms[c(14, 28)],
  cbt=selected_terms[c(4, 6, 7, 17, 24, 25, 26, 29, 30)],
  phobia=selected_terms[c(2, 3, 9, 12, 15, 19, 20, 21, 23, 27)]
)

grouped_terms
```

## Writing a new search

The `write_search()` function takes our list of grouped search terms and writes the text of a new search. There are quite a few arguments to take care of for this function:

* `languages` provides a list of languages to translate the search into, in case we want to get articles in multiple languages.
* `exactphrase` controls whether terms that consist of more than one word should be matched exactly rather than as two separate words. If we have phrases that are only relevant as a whole phrase, then we should set this to `TRUE`, so that for example *social phobia*  will not also catch all the articles containing the word *social*.
* `stemming` controls whether words are stripped down to the smallest meaningful part of the word (its 'stem') so that we make sure to catch all variants of the word, for example catching both *behavior* and *behavioral*.
* `closure` controls whether partial matches are matched at the left end of a word (`"left"`), at the right (`"right"`), only as exact matches (`"full"`) or as any word containing a term (`"none"`).
* `writesearch` controls whether we would like to write the search text to a file.

```{r, eval=FALSE}
write_search(
  grouped_terms,
  languages="English",
  exactphrase=TRUE,
  stemming=FALSE,
  closure="left",
  writesearch=TRUE
)
```

Let's read in the contents of the text file that we just wrote, to see what our search text looks like.

```{r}
cat(read_file("search-inEnglish.txt"))
```

We can now go back to the search site and copy the contents of this text file into the search field to conduct a new search.

## Checking the new search

I ran our new search and downloaded the results to a file called *pubmed-pharmacoth-set.nbib*. Let's load this file with `litsearchr`. This may take a moment, as this file contains a lot more results than the naive search.

```{r}
new_results <- import_results(file="pubmed-pharmacoth-set.nbib")
```

How many did we get?

```{r}
nrow(new_results)
```

We now need to check whether the new results seem to be relevant to our chosen topic. There are a few basic things that we can check.

### Against the naive search

We can first check whether all of the results of the naive search are in the new search. Since we conducted the naive search using the most important terms that occurred to us for our topic, and since we included these same terms or very similar in our new search, we ought to get the same articles back among our new results, at least if they were really relevant.

```{r}
naive_results %>%
  mutate(in_new_results=title %in% new_results[, "title"]) ->
  naive_results

naive_results %>%
  filter(!in_new_results) %>%
  select(title, keywords)
```

We didn't miss any of the articles from the naive search. Good.

### Against gold standard results

If we have started reviewing the literature on our chosen topic, we may already know the titles of some important articles that have been written on the subject. So another way of checking our new search is to check whether it includes specific important results. `litsearchr` provides a function called `check_recall()` for searching for specific titles. We do not have to get the titles of our desired articles exactly right. If the capitalization or punctuation is slightly different from the version in our results dataframe, `check_recall()` will find the closest match.

Let's try it.

```{r}
important_titles <- c(
  "Efficacy of treatments for anxiety disorder: A meta-analysis",
  "Cognitive behaviour therapy for health anxiety: A systematic review and meta-analysis",
  "A systematic review and meta-analysis of treatments for agrophobia"
)

data.frame(check_recall(important_titles, new_results[, "title"]))
```

All of the best matches in the output table are clearly the articles that we were looking for, so our new search has found these.

That's the end of our tutorial. You can try to refine the example search further, or try one of your own.