---
output: github_document
---

<!-- README.md is generated from README.Rmd. Please edit that file -->

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```

# amRdata

<!-- badges: start -->
[![Lifecycle: experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://lifecycle.r-lib.org/articles/stages.html#experimental)
<!-- badges: end -->

amRdata is the first package in the [amR suite](https://github.com/JRaviLab/amR)
for antimicrobial resistance (AMR) prediction. It takes a user‑provided species or
taxon ID, downloads the corresponding genomes and AST data from BV‑BRC, constructs
pangenomes, extracts features at multiple molecular scales, and prepares a unified
Parquet‑backed DuckDB file for downstream ML modeling in **amRml**.

The workflow comprises six primary steps:

1. BV‑BRC metadata (isolate metadata + AMR phenotypic labels) →
2. BV-BRC genomes (sequence data) →
3. Panaroo pangenome (genes, struct) →
4. CD‑HIT protein clusters (proteins) →
5. Pfam domain extraction (domains) →
6. Database formatting


## Overview

amRdata includes functions to:

- Query and download bacterial genome data from BV-BRC
- Acquire paired antimicrobial susceptibility testing (AST) results
- Extract molecular features across scales:
  - Gene clusters (Panaroo pangenome analysis)
  - Protein clusters (CD-HIT sequence similarity)
  - Protein domains (Pfam annotations)
  - Structural variants (Panaroo pangenome rearrangements)
- Store all data in highly efficient Parquet and DuckDB formats

See the [package vignette](https://jravilab.github.io/amRdata/articles/intro.html) for detailed usage.

## Installation

```r
# Install from GitHub
if (!requireNamespace("remotes", quietly = TRUE))
    install.packages("remotes")

remotes::install_github("JRaviLab/amRdata")
```

## Quick start

```r

library(amRdata)

# Step 1: Download and prepare genomes with paired AST data from BV-BRC
prepareGenomes(
  user_bacs = c("Shigella flexneri"),
  base_dir  = "data/Shigella_flexneri",
  method    = "ftp",   # or "cli"
  verbose   = TRUE
)

# Step 2: Run full feature extraction (Panaroo → CD-HIT → InterProScan → metadata cleaning)
runDataProcessing(
  duckdb_path = "data/Shigella_flexneri/Sfl.duckdb",
  output_path = "data/Shigella_flexneri",
  threads     = 16,
  ref_file_path = "data_raw/"
)

# A final Parquet-backed DuckDB is created:
#   data/Shigella_flexneri/Sfl_parquet.duckdb

# It contains feature presence/absence and count data across scales as
# genome-by-feature matrices, along with all available sample metadata.

```

## Package features

### Data curation
#### 1. BV-BRC data access

amRdata uses the BV-BRC CLI (via Docker) or the BV-BRC FTP server to access:

- Genome metadata
- AMR phenotype data
- Genome assemblies (`.fna`, `.faa`, `.gff`)

Functions involved:

```
.updateBVBRCdata()
.retrieveCustomQuery()
.retrieveQueryIDs()
retrieveGenomes()
.filterGenomes()
prepareGenomes()
```

After initial download, all BV-BRC metadata is cached automatically under:
`data/bvbrc/bvbrcData.duckdb`

The package retrieves bacterial genome sequences and antimicrobial susceptibility testing data from BV-BRC (Bacterial and Viral Bioinformatics Resource Center), using either FTP or the BV-BRC CLI wrapped in a Docker container for reproducible access. This lets you:

- Query isolate metadata with flexible filtering
- Download genome files (`.fna`, `.faa`, `.gff`)
- Retrieve AST results linking genotypes to phenotypes
- Apply quality control filters (assembly quality, metadata completeness)

### Feature extraction

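Features are extracted at four complementary molecular scales: gene clusters and structural variants (both derived from the Panaroo pangenome), protein clusters, and Pfam domains. A final step cleans and stores the data.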

#### 1. Gene clusters
Panaroo is executed inside a container using `.runPanaroo()`.

Our pangenome creation approach:

- Allows end-to-end single pangenome runs
- Offers parallelized multi-batch pangenomes for large isolate sets (>5,000 genomes)
  - Supports automated pangenome merging through `.mergePanaroo()`
- Generates gene presence/absence and count matrices per isolate
- Identifies structural variants (gene triplets indicating genome rearrangements)

Outputs are written into the per-taxon DuckDB for efficient storage and querying.
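
The relationship between the count and presence/absence matrices can be illustrated with a toy example. This is a minimal base R sketch, not the package's internal code; the isolate names, cluster names, and values are made up.

```r
# Toy count matrix: rows are isolates, columns are gene clusters.
# All names and values here are hypothetical.
gene_counts <- matrix(
  c(1L, 0L, 2L,
    0L, 0L, 1L,
    3L, 1L, 0L),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    c("isolate_A", "isolate_B", "isolate_C"),
    c("cluster_1", "cluster_2", "cluster_3")
  )
)

# Presence/absence: 1 if the isolate carries at least one copy, else 0.
gene_presence <- (gene_counts > 0L) * 1L

# Number of isolates carrying each gene cluster.
colSums(gene_presence)
```

The count matrix keeps copy-number information, while the presence/absence matrix only records whether a cluster occurs in an isolate.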

#### 2. Protein clusters
CD-HIT is executed inside a container using `.runCDHIT()`.

Our protein clustering approach:

- Clusters proteins across all isolates from BV-BRC .faa files
- Creates protein presence/absence and count matrices per isolate
- Saves cluster names and annotations

#### 3. Pfam domains
InterProScan is executed inside a container using `domainFromIPR()`.

Our Pfam domain extraction approach:

- Automatically configures InterPro's databases for use
- Runs parallelized and containerized domain annotation
- Maps domain presence/absence and counts to genomes and proteins
- Provides another functional annotation layer

#### 4. Data cleaning and storage
Final data formatting and storage is executed using `cleanData()`.

Our final data storage script:

- Harmonizes drug names, classes, and countries in BV-BRC metadata
- Generates temporal bins to stratify analysis across time
- Summarizes AMR information across the dataset
- Writes all data into highly compressed data structures
  - **Parquet**: Binary, columnar storage for large matrices
    - These can be loaded into R as data frames with `arrow::read_parquet()`
  - **DuckDB**: SQL-queryable database for rapid filtering of linked Parquets
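
As a rough illustration of how the two formats work together, the sketch below writes a small table to Parquet, reads it back with `arrow`, and queries the same file with SQL through DuckDB's `read_parquet()` table function. The toy table and its column names are assumptions for illustration, not the schema amRdata produces.

```r
library(DBI)

# Hypothetical genome-by-feature table; real amRdata tables will differ.
toy <- data.frame(
  genome_id = c("g1", "g2", "g3"),
  cluster_1 = c(1L, 0L, 2L),
  cluster_2 = c(0L, 0L, 1L)
)

# Write the table to a Parquet file in a temporary location.
parquet_path <- tempfile(fileext = ".parquet")
arrow::write_parquet(toy, parquet_path)

# Read it back into R as a regular data frame.
toy_back <- as.data.frame(arrow::read_parquet(parquet_path))

# Query the same file with SQL through an in-memory DuckDB connection.
con <- dbConnect(duckdb::duckdb())
present <- dbGetQuery(
  con,
  sprintf(
    "SELECT genome_id FROM read_parquet('%s') WHERE cluster_1 > 0 ORDER BY genome_id",
    parquet_path
  )
)
dbDisconnect(con, shutdown = TRUE)

# Genomes that carry at least one copy of cluster_1.
present$genome_id
```

Filtering in SQL means only the matching rows are returned to R, which helps when the full matrix is too large to load comfortably.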

## Workflow example
The following example downloads and processes all data and metadata
for _Shigella flexneri_ genomes with paired AST data.

```r
library(amRdata)

# 1. Download & filter genomes
prepareGenomes(
  user_bacs  = c("Shigella flexneri"),
  base_dir   = "data/Shigella_flexneri",
  method     = "ftp"
)

# 2. Run multi-scale feature extraction

runDataProcessing(
  duckdb_path    = "data/Shigella_flexneri/Sfl.duckdb",
  output_path    = "data/Shigella_flexneri",
  threads        = 8, # Or whatever your system supports
  ref_file_path  = "data_raw/"
)

# 3. Load final data

library(DBI)
library(arrow)

# To view all attached data tables in the database

con <- DBI::dbConnect(duckdb::duckdb(), "data/Shigella_flexneri/Sfl_parquet.duckdb")
DBI::dbListTables(con)
DBI::dbDisconnect(con, shutdown = TRUE)  # Close the connection when done


# To load human-readable data tables into R

# e.g., Looking at gene cluster counts per isolate
Sfl_gene_counts <- arrow::read_parquet("data/Shigella_flexneri/gene_count.parquet")

# To connect gene cluster IDs to their annotated names
Sfl_gene_names <- arrow::read_parquet("data/Shigella_flexneri/gene_names.parquet")
```
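
Once both tables are loaded, cluster IDs can be linked to their annotated names with a reshape and a join. The sketch below uses small stand-in data frames in base R; the column names (`genome_id`, `cluster_id`, `gene_name`) and gene assignments are assumptions for illustration, so check the actual column names in your Parquet files before adapting it.

```r
# Hypothetical stand-ins for the gene count and gene name tables.
# Column names and values are illustrative only.
gene_counts <- data.frame(
  genome_id = c("g1", "g2"),
  cluster_1 = c(1L, 0L),
  cluster_2 = c(2L, 1L)
)
gene_names <- data.frame(
  cluster_id = c("cluster_1", "cluster_2"),
  gene_name  = c("blaTEM", "tetA")
)

# Reshape counts from wide (one column per cluster) to long format.
cluster_cols <- setdiff(names(gene_counts), "genome_id")
long_counts <- data.frame(
  genome_id  = rep(gene_counts$genome_id, times = length(cluster_cols)),
  cluster_id = rep(cluster_cols, each = nrow(gene_counts)),
  count      = unlist(gene_counts[cluster_cols], use.names = FALSE)
)

# Attach annotated names by cluster ID.
annotated <- merge(long_counts, gene_names, by = "cluster_id")
annotated
```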

## Data requirements

External dependencies (managed through Docker) <br>

- BV‑BRC CLI
- Panaroo
- CD‑HIT
- InterProScan
- DuckDB
- Arrow (Parquet)

The user does not need to install these manually.

The package requires:

- An internet connection to access BV-BRC data and metadata
- A local Docker installation
  - Containers for internal tools are pulled automatically and do not require configuration
  - Make sure Docker is running before you start processing data!
- Sufficient storage for databases, downloaded files, and processed output (we recommend at least 20 GB)
- A multicore processor and at least 16 GB of RAM (highly recommended)
  - Species with many isolates may run poorly or fail to complete on older hardware
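
One way to confirm that Docker is reachable before starting a long run is sketched below. `docker_is_running()` is a hypothetical helper written for this example, not an amRdata function; it relies only on base R and on `docker info` returning a non-zero exit status when the daemon cannot be reached.

```r
# Hypothetical helper: TRUE if the Docker CLI is on the PATH and the
# daemon responds to `docker info`, FALSE otherwise.
docker_is_running <- function() {
  docker <- unname(Sys.which("docker"))
  if (!nzchar(docker)) {
    return(FALSE)  # Docker CLI not found
  }
  # Discard the command output and keep only the exit status.
  status <- suppressWarnings(
    system2(docker, "info", stdout = FALSE, stderr = FALSE)
  )
  identical(as.integer(status), 0L)
}

if (!docker_is_running()) {
  message("Docker does not appear to be running; start it before processing data.")
}
```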

### Output

Feature matrix dimensions depend on the species:

- Rows: Number of isolates (typically <10,000)
- Columns: Number of features (ballpark estimates)
  - Genes: 5,000-50,000
  - Proteins: 5,000-50,000
  - Domains: 500-10,000
  - Structural variants: 1,000-10,000
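
These dimensions translate into substantial memory use if a matrix is loaded densely into R. The back-of-the-envelope calculation below is a sketch that assumes a dense matrix with 4 bytes per integer cell or 8 bytes per double cell, and ignores any overhead; `dense_matrix_gib()` is a hypothetical helper, not part of amRdata.

```r
# Approximate in-memory size (in GiB) of a dense genome-by-feature matrix.
# R stores integers in 4 bytes and doubles in 8 bytes.
dense_matrix_gib <- function(n_isolates, n_features, bytes_per_cell = 4) {
  n_isolates * n_features * bytes_per_cell / 1024^3
}

# Upper-end ballpark: 10,000 isolates by 50,000 gene clusters.
dense_matrix_gib(10000, 50000)     # integer counts, about 1.9 GiB
dense_matrix_gib(10000, 50000, 8)  # double-precision values, about 3.7 GiB
```

Estimates like these are one reason to filter with DuckDB or read only the needed Parquet columns rather than loading every matrix in full.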

### External dependencies

The package uses established bioinformatics tools:

- **Panaroo** (≥1.3.0): Pangenome analysis
- **CD-HIT** (≥4.8.1): Protein clustering
- **InterProScan** (≥5.0): Domain annotation
- **Docker**: For BV-BRC CLI container

These are managed automatically through Docker containers.

## Performance

Processing times vary by species and isolate count:

- Data download: 0-1 hours
- Pangenome construction: 0-6 hours
- Protein clustering: 0-3 hours
- Domain annotation: 0-1 hours
- Total: 1-12 hours for a complete species analysis

- These numbers will all vary greatly based on isolate number, genome complexity, and available hardware.
- Parallelization significantly reduces processing time when multiple cores are available.

## Integration with the amR suite

amRdata is designed to work seamlessly with other amR packages:

```r
library(amRdata)
library(amRml)
library(amRshiny)

# 1. Curate data
prepareGenomes("Shigella flexneri")
runDataProcessing("amRdata/data/Shigella_flexneri/Sfl.duckdb")

# 2. Train models
runMLmodels("amRdata/data/Shigella_flexneri/Sfl_parquet.duckdb")

# 3. Visualize (to be added)
launch_dashboard()
```

## Related packages

- [amR](https://github.com/JRaviLab/amR): Suite metapackage
- [amRml](https://github.com/JRaviLab/amRml): ML for AMR prediction
- [amRshiny](https://github.com/JRaviLab/amRshiny): Interactive dashboard

## Citation
If you use `amRdata` in your research, please cite:

```
Brenner E, Ghosh A, Wolfe E, Boyer E, Vang C, Lesiyon R, Mayer D, Ravi J. (2026).
amR: an R package suite to predict antimicrobial resistance in bacterial pathogens.
R package version 0.99.0.
https://github.com/JRaviLab/amR
```

## Contributing

We welcome contributions! Please see [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.

## Reporting issues

Report bugs and request features at: https://github.com/JRaviLab/amRdata/issues

## License

BSD 3-Clause License. See [LICENSE](LICENSE) for details.

## Contact

**Corresponding author**: Janani Ravi (janani.ravi@cuanschutz.edu)

**Lab website**: https://jravilab.github.io

## Code of conduct

Please note that `amRdata` is released with a [Contributor Code of Conduct](https://contributor-covenant.org/version/2/1/CODE_OF_CONDUCT.html). By contributing to this project, you agree to abide by its terms.