---
title: "ORCID API Example"
format:
  html:
    theme: cosmo
    toc: true
    code-block-background: true
    df-print: kable
    css: styles.css
editor: source
---
Greg Janée, March 2024
Example of querying the [ORCID Public API](https://info.orcid.org/documentation/features/public-api/) from R using the [rorcid](https://cran.r-project.org/web/packages/rorcid/index.html) package. In this example our goal is to find all ORCID IDs belonging to people who are currently employed at UCSB, and to do some rudimentary analysis.
```{r message=FALSE, warning=FALSE}
library(tidyverse)
library(rorcid)
library(stringdist)
```
# Getting access
Log in to [ORCID](https://orcid.org) and, in the menu under your name, visit "Developer tools" to create a client ID. The web form is intended for developers registering OAuth applications. For the purpose of one-off API access it doesn't seem to matter what values you enter.
There's no need to stash the returned client ID and client secret anywhere; they can always be viewed on that page.
The next step is to obtain a token that will allow querying (and only querying) against the public API. Presumably writing to ORCID profiles requires a different kind of token. Run this curl command from the Bash command line:
```{.bash exec=FALSE}
curl -H 'Accept: application/json' \
  -d grant_type=client_credentials \
  -d scope=/read-public \
  -d client_id=... \
  -d client_secret=... \
  https://orcid.org/oauth/token
```
A JSON document is returned. Copy the value of the `access_token` element and record it in a `.Renviron` file, located either in your home directory or in your R project's root directory, like so:
```{.bash exec=FALSE}
ORCID_TOKEN="..."
```
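Restart R for it to take effect. The client ID page mentions no expiration date. The token response, however, should include an `expires_in` value in seconds, and ORCID documents `/read-public` tokens as long-lived (on the order of 20 years), so in practice the token should not need regular renewal.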
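After restarting, it may be worth confirming that R can actually see the token before issuing any queries. A minimal sketch, assuming the variable is named `ORCID_TOKEN` as above; the helper name `orcid_token_is_set` is made up for this example and is not part of rorcid:

```{r}
# Hypothetical helper: TRUE if the named environment variable is non-empty.
orcid_token_is_set <- function(name="ORCID_TOKEN") {
  nzchar(Sys.getenv(name))
}

if (!orcid_token_is_set()) {
  message("ORCID_TOKEN is not set; check .Renviron and restart R")
}
```

The check only looks at whether the variable is present; it says nothing about whether ORCID will accept the token.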
# Query process overview
The basic idea is to supply a query in the form of a string expression and get back a 3-column dataframe. For example:
```{r}
result <- orcid("email:*@ucsb.edu")
head(result)
```
As can be seen, two of the columns are redundant. Let's define a function to return a cleaner result.
```{r}
orcid_query <- function(query, start=NULL, rows=NULL) {
  orcid(query, start, rows) %>%
    as_tibble %>%
    select(id=`orcid-identifier.path`)
}
result <- orcid_query("email:*@ucsb.edu")
head(result)
```
The number of rows returned may be explicitly limited (that's the purpose of the `rows` argument above; more on pagination later), but even if not, ORCID limits the rows returned to 1,000 per call. Furthermore, there's an overall limit of 10,000 rows maximum for any given query. A nice feature is that, regardless of how many rows are returned, the total number is always returned as the `found` attribute:
```{r}
attr(result, "found")
```
(Only `r attr(result, "found")` ORCID IDs that have a `@ucsb.edu` email address?! Remember we're using the public API, so only information that people have made public is visible to us. Clearly most people are not making their email addresses public via ORCID.)
# Formulating the query
ORCID internally maintains structured records, and in principle its database could be queried for people whose current employment is UCSB. But that level of granularity is not exposed through the API. The only queries supported are freetext search over entire profiles and search over a handful of [named fields](https://info.orcid.org/ufaqs/which-fields-does-the-orcid-search-api-support/) such as `email`, as shown previously. For our purpose, `affiliation-org-name` is the relevant field. Note that this field aggregates all affiliations, including employment, education, and perhaps other types.

It is also possible to search by various types of organization identifiers (GRID, ROR, Ringgold), but these are unlikely to be entered by laypeople. If they appear, they will have been auto-populated by publishers or institutional integrations (the latter not applicable to UCSB), or by ORCID itself when an organization is selected from the menu that pops up when somebody starts typing some text. So, we stick with searching over the `affiliation-org-name` text field.
Here are the names for UCSB we'll look for:
```{r}
ucsb_names <- c(
  "University of California, Santa Barbara",  # official
  "University of California at Santa Barbara",
  "University of California Santa Barbara",
  "UC Santa Barbara",
  "UCSB"
)
```
Just out of curiosity, how many IDs are returned for each name?
```{r}
tibble(
  name=ucsb_names,
  count=map_int(
    ucsb_names,
    function(name) {
      query <- paste('affiliation-org-name:"', name, '"', sep="")
      attr(orcid_query(query), "found")
    }
  )
)
```
# Results pagination
Query results are paginated. To retrieve all rows we write a function that requests 200 rows at a time and concatenates them into a single dataframe.
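```{r}
orcid_query_all_results <- function(query) {
  num_results <- attr(orcid_query(query, rows=1), "found")
  # Zero-based offset of each page. Stopping at num_results - 1 avoids
  # requesting an empty trailing page when the total is a multiple of 200.
  offsets <- seq(0, max(num_results - 1, 0), by=200)
  reduce(
    map(
      offsets,
      function(offset) {
        orcid_query(query, offset, 200)
      }
    ),
    bind_rows
  )
}
```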
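The offsets that a paging loop like this needs can be worked out without touching the network. A small sketch, assuming zero-based offsets and the 10,000-row ceiling mentioned earlier; `page_offsets` and its arguments are illustrative names, not part of rorcid:

```{r}
# Illustrative helper: zero-based start offsets needed to fetch `total`
# rows in pages of `page_size`, never starting at or beyond `max_rows`.
page_offsets <- function(total, page_size=200, max_rows=10000) {
  reachable <- min(total, max_rows)
  if (reachable <= 0) {
    return(integer(0))
  }
  seq(0, reachable - 1, by=page_size)
}

page_offsets(450)            # 0 200 400
page_offsets(400)            # 0 200 (no empty trailing page)
length(page_offsets(25000))  # 50 pages of 200 rows
```

Capping at `max_rows` means a query with more than 10,000 matches is silently truncated, so comparing the number of rows retrieved against `found` is a sensible sanity check.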
# Final query
Here's our final query. We look for IDs that have an affiliation that matches any of our UCSB names. This query takes only a few seconds, but for good netiquette we cache the results.
```{r}
cache_file <- "ids.RData"
if (file.exists(cache_file)) {
  load(cache_file)
} else {
  ucsb_ids <- orcid_query_all_results(
    paste(
      paste('affiliation-org-name:"', ucsb_names, '"', sep=""),
      collapse=" OR "
    )
  )
  save(ucsb_ids, file=cache_file)
}
```
How many IDs did we get?
```{r}
attr(ucsb_ids, "found")
```
Again, these are ORCID IDs that contain any kind of affiliation with UCSB, not necessarily employment, and not necessarily current employment.
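To make the query concrete, the string sent to ORCID can be built and inspected on its own. A minimal sketch; the helper name `or_query` is made up for this example, and only two of the names are used to keep the output short:

```{r}
# Illustrative helper: build `field:"value"` clauses joined by OR.
or_query <- function(field, values) {
  paste0(field, ':"', values, '"', collapse=" OR ")
}

cat(or_query("affiliation-org-name", c("UC Santa Barbara", "UCSB")))
```

Each name is wrapped in double quotes so that it is matched as a phrase rather than as separate words.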
# Employment data
Getting employment data is super easy: just pass the entire list of ORCID IDs in one batch. This takes a while, on the order of 10 minutes, so we cache the 80 MB of data received.
```{r}
cache_file <- "employment.RData"
if (file.exists(cache_file)) {
  load(cache_file)
} else {
  employment_data <- orcid_employments(ucsb_ids$id)
  save(employment_data, file=cache_file)
}
```
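The same load-or-compute pattern appears twice in this document and could be factored into a helper. A sketch under stated assumptions: `with_cache` is a hypothetical name, and it uses `saveRDS`/`readRDS` rather than the `save`/`load` pair used above, so the cached object is returned as a value instead of being restored under its original variable name.

```{r}
# Illustrative helper: return the cached value if `path` exists,
# otherwise call `compute()`, save its result, and return it.
with_cache <- function(path, compute) {
  if (file.exists(path)) {
    return(readRDS(path))
  }
  value <- compute()
  saveRDS(value, path)
  value
}

demo_path <- tempfile(fileext=".rds")
with_cache(demo_path, function() sum(1:10))           # computed and saved
with_cache(demo_path, function() stop("not called"))  # read from the cache
```

With such a helper, the employment download might be written as `with_cache("employment.rds", function() orcid_employments(ucsb_ids$id))`.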
The return value is a hierarchical list of dataframes whose values contain lists of dataframes whose values contain lists of... the R version of JSON? Fortunately, `pluck` is able to pick out, for each employment affiliation: the organization name; the department name and position title; and the affiliation end date. We then filter for those records that mention some form of the UCSB name. We also filter for those records that *don't* have an affiliation end date, the theory being that those represent current employment. (Of course, it's entirely possible that people neglected to update their ORCID profiles when they left UCSB, just as they might never have added any dates at all.)
```{r}
df <- reduce(
  map(employment_data, pluck, "affiliation-group", "summaries"),
  bind_rows
) %>%
  as_tibble %>%
  mutate(id = str_sub(`employment-summary.path`, 2, 20)) %>%
  select(
    id,
    institution=`employment-summary.organization.name`,
    department=`employment-summary.department-name`,
    title=`employment-summary.role-title`,
    end_date=`employment-summary.end-date.year.value`,
  ) %>%
  filter(institution %in% ucsb_names & is.na(end_date)) %>%
  select(id, department, title) %>%
  arrange(id)
head(df)
```
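The nested shape is easier to see on a toy structure. The sketch below is only loosely modeled on the real response: the IDs, column names (`org`, `end_year`), and values are made up, and the real data has many more columns.

```{r}
library(tidyverse)

# Toy stand-in for the API response: one list element per ORCID ID, each
# holding an "affiliation-group" data frame whose "summaries" column is a
# list of one-row data frames.
toy_employment <- list(
  "0000-0000-0000-0001"=list(
    "affiliation-group"=tibble(
      summaries=list(
        tibble(org="UCSB", end_year=NA_character_),
        tibble(org="Elsewhere", end_year="2019")
      )
    )
  ),
  "0000-0000-0000-0002"=list(
    "affiliation-group"=tibble(
      summaries=list(tibble(org="UCSB", end_year="2021"))
    )
  )
)

# Dig out every "summaries" list, then stack all the small data frames
toy_rows <- map(toy_employment, pluck, "affiliation-group", "summaries") %>%
  reduce(bind_rows)

# Keep only records with no end date, i.e. presumed current
filter(toy_rows, is.na(end_year))
```

Here three employment records are flattened into one table, and only the open-ended one survives the filter.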
How many records did we get?
```{r}
nrow(df)
```
I.e., of the `r attr(ucsb_ids, "found")` ORCID IDs found that have some kind of UCSB affiliation, `r nrow(df)` of those reflect current employment at UCSB.
Well, that's not exactly correct, because `r nrow(df)` is the count of employment records returned, not a count of IDs. If somebody has multiple concurrent positions at UCSB they might have multiple employment records. Or, as noted previously, they might have had serial UCSB employments and neglected to add end dates or any dates at all. The number of unique ORCID IDs among the employment records is:
```{r}
length(unique(df$id))
```
So, the vast majority of IDs have one current UCSB employment recorded.
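One way to look at this more closely is to tabulate how many IDs have one, two, or more open-ended records. A sketch on made-up data (the IDs and titles below are invented for illustration); the same two `count` calls could be applied to `df`:

```{r}
library(tidyverse)

# Made-up employment records: two people, one of them with two open positions
toy_records <- tibble(
  id=c("0000-0000-0000-0001", "0000-0000-0000-0001", "0000-0000-0000-0002"),
  title=c("Lecturer", "Researcher", "Professor")
)

# First count records per ID, then count IDs per number of records
toy_position_counts <- toy_records %>%
  count(id, name="positions") %>%
  count(positions, name="ids")
toy_position_counts
```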
# Grouping by department
Let's group employment records by department to get a sense of the distribution of ORCID IDs across campus. (The table below is scrollable.)
```{r eval=FALSE}
df %>%
  group_by(department) %>%
  summarize(count=n()) %>%
  arrange(desc(count), department, .locale="en")
```
```{r echo=FALSE}
#| class: long-output
df %>%
  group_by(department) %>%
  summarize(count=n()) %>%
  arrange(desc(count), department, .locale="en")
```
\
Well, that's the usual freetext mess. If you look closely, multiple variants of the same department name occur, there are varying abbreviations and typos, multiple departments are listed in the same record, and so forth.
# Cleaning up department names
We can clean up the department names by classifying each one against a list of known-good names (obtained mostly from [here](https://www.ucsb.edu/academics/academic-departments-and-programs)): every name seen is assigned to the good name nearest to it. For a distance metric we use Levenshtein edit distance.
```{r eval=FALSE}
seen_names <- df %>%
  select(name=department) %>%
  drop_na %>%
  filter(str_detect(name, "[a-z]")) %>%  # remove pure acronyms
  mutate(name_lc=str_to_lower(name)) %>%
  mutate(name_lc=str_replace(name_lc, "department( of)?", ""))
good_names <- read_csv("departments.csv") %>%
  mutate(name_lc=str_to_lower(name))
# Distance matrix: one row per seen name, one column per good name
m <- stringdistmatrix(
  seen_names$name_lc,
  good_names$name_lc,
  method="lv"
)
by_row <- 1
# For each seen name, pick the good name at the smallest distance
seen_names$classified <- good_names$name[apply(m, by_row, which.min)]
seen_names %>%
  select(department=classified) %>%
  group_by(department) %>%
  summarize(count=n()) %>%
  arrange(desc(count))
```
Here's the cleaned-up list. (The table below is scrollable.)
```{r echo=FALSE, message=FALSE}
#| class: long-output
seen_names <- df %>%
  select(name=department) %>%
  drop_na %>%
  filter(str_detect(name, "[a-z]")) %>%  # remove pure acronyms
  mutate(name_lc=str_to_lower(name)) %>%
  mutate(name_lc=str_replace(name_lc, "department( of)?", ""))
good_names <- read_csv("departments.csv") %>%
  mutate(name_lc=str_to_lower(name))
# Distance matrix: one row per seen name, one column per good name
m <- stringdistmatrix(
  seen_names$name_lc,
  good_names$name_lc,
  method="lv"
)
by_row <- 1
# For each seen name, pick the good name at the smallest distance
seen_names$classified <- good_names$name[apply(m, by_row, which.min)]
seen_names %>%
  select(department=classified) %>%
  group_by(department) %>%
  summarize(count=n()) %>%
  arrange(desc(count))
```
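Nearest-name classification always picks *some* good name, even for an entry that resembles none of them. One possible refinement, not used above, is to reject matches beyond a distance cutoff. A sketch on made-up inputs; the cutoff of 3 edits is an arbitrary assumption chosen for illustration:

```{r}
library(stringdist)

# Made-up inputs for illustration
toy_seen <- c("geografy", "physic", "histroy", "marine biology")
toy_good <- c("geography", "physics", "history")

toy_m <- stringdistmatrix(toy_seen, toy_good, method="lv")
toy_nearest <- apply(toy_m, 1, which.min)  # column index of the nearest good name
toy_dist <- apply(toy_m, 1, min)           # distance to that nearest name

# Hypothetical cutoff: leave unclassified anything more than 3 edits away
max_dist <- 3
toy_classified <- ifelse(toy_dist <= max_dist, toy_good[toy_nearest], NA)
toy_classified  # "geography" "physics" "history" NA
```

A fixed cutoff is crude, since long names tolerate more edits than short ones; scaling the cutoff by name length would be one alternative.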
# Grouping by title
We can similarly group the employment records by position title, which might give us a sense of the extent to which different groups of people are using ORCID. Note that role/title isn't populated as frequently as department in ORCID profiles. (The table below is scrollable.)
```{r eval=FALSE}
df %>%
  group_by(title) %>%
  summarize(count=n()) %>%
  arrange(desc(count))
```
```{r echo=FALSE}
#| class: long-output
df %>%
  group_by(title) %>%
  summarize(count=n()) %>%
  arrange(desc(count))
```
\
Let's consolidate these varying descriptions into a few broad categories as follows.
```{r}
# Named `matches_any` rather than `match` so as not to mask base::match()
matches_any <- function(title, patterns) {
  # Return TRUE for each title that contains any of the given patterns
  # (regular expressions), ignoring case
  str_detect(
    title,
    regex(paste(patterns, collapse="|"), ignore_case=TRUE)
  )
}
df %>%
  drop_na(title) %>%
  mutate(
    category=case_when(
      matches_any(
        title,
        c("professor", "lecturer", "instructor", "dean")
      ) ~ "faculty",
      matches_any(
        title,
        # "\\bTA\\b" matches "TA" only as a whole word; a bare "ta" would
        # also hit "assistant", "data", "staff", and so on
        c("student", "graduate", "teaching", "\\bTA\\b", "PhD",
          "candidate")
      ) ~ "student",
      matches_any(
        title,
        c("post", "fellow")
      ) ~ "postdoc",
      matches_any(
        title,
        c("research", "specialist", "scientist", "director",
          "coordinator", "librarian", "curator", "associate",
          "manager", "engineer", "developer")
      ) ~ "staff",
      .default="other"
    )
  ) %>%
  group_by(category) %>%
  summarize(count=n()) %>%
  arrange(desc(count))
```
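Because `case_when` returns the first branch whose condition is true, the order of the categories matters: a title such as "Assistant Professor" contains words from more than one category and lands in whichever is listed first. A small sketch on made-up titles; `has_word` is a hypothetical helper written for this example:

```{r}
library(tidyverse)

toy_titles <- c("Assistant Professor", "Research Associate", "Postdoctoral Fellow")

# Case-insensitive test for whether each title matches the pattern `word`
has_word <- function(title, word) {
  str_detect(title, regex(word, ignore_case=TRUE))
}

toy_categories <- case_when(
  has_word(toy_titles, "professor") ~ "faculty",
  has_word(toy_titles, "post") ~ "postdoc",
  has_word(toy_titles, "assistant|associate|research") ~ "staff",
  .default="other"
)
toy_categories  # "faculty" "staff" "postdoc"
```

If the "staff" branch were listed first, "Assistant Professor" would be counted as staff instead.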