Commit c3906eb

Merge pull request #49 from NeotomaDB/dev
Merge Dev to Main for Initial Data Product Share with Sponsor
2 parents (ba1903b + 38b4f66)

103 files changed: 46,568 additions & 710 deletions


.gitignore

Lines changed: 4 additions & 1 deletion
```diff
@@ -1,6 +1,5 @@
 # ignore files in models folder but keep .gitkeep
 models/ner/*
-results/ner/*
 !.gitkeep

 # exclude all txt files in data
@@ -11,6 +10,10 @@ data/**/*.json
 data/**/*.csv
 !data/entity-extraction/raw/taxa.csv

+# ignore files in models folder but keep .gitkeep
+models/ner/*
+!.gitkeep
+
 # Byte-compiled / optimized / DLL files
 __pycache__/
 *.py[cod]
```
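The negation pattern above is subtle: `models/ner/*` ignores the directory's *contents* while leaving the directory itself matchable, which is what allows `!.gitkeep` to re-include the placeholder file. A quick sanity check in a throwaway repository (not part of this commit):

```shell
# Build a scratch repo that mirrors the ignore rules added in this commit
cd "$(mktemp -d)"
git init -q demo && cd demo
printf 'models/ner/*\n!.gitkeep\n' > .gitignore
mkdir -p models/ner
touch models/ner/model.bin models/ner/.gitkeep

# check-ignore exits 0 when a path is ignored, 1 when it is not
git check-ignore -q models/ner/model.bin && echo "model.bin ignored"
git check-ignore -q models/ner/.gitkeep || echo ".gitkeep kept"
```

Had the pattern been `models/ner/` (excluding the directory itself), the `!.gitkeep` negation would have no effect, since Git does not re-include files inside an excluded directory.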

README.md

Lines changed: 109 additions & 44 deletions
````diff
@@ -4,68 +4,111 @@
 [![Issues][issues-shield]][issues-url]
 [![MIT License][license-shield]][license-url]

-# MetaExtractor: Finding Fossils in the Literature
-
-This project aims to identify research articles which are relevant to the [Neotoma Paleoecological Database](http://neotomadb.org) (Neotoma), extract data relevant to Neotoma from the article, and provide a mechanism for the data to be reviewed by Neotoma data stewards then submitted to Neotoma. It is being completed as part of the University of British Columbia (UBC) [Masters of Data Science (MDS) program](https://masterdatascience.ubc.ca/) in partnership with the [Neotoma Paleoecological Database](http://neotomadb.org).
+# **MetaExtractor: Finding Fossils in the Literature**
+
+This project aims to identify research articles which are relevant to the [_Neotoma Paleoecological Database_](http://neotomadb.org) (Neotoma), extract data relevant to Neotoma from each article, and provide a mechanism for the data to be reviewed by Neotoma data stewards and then submitted to Neotoma. It is being completed as part of the _University of British Columbia (UBC)_ [_Master of Data Science (MDS)_](https://masterdatascience.ubc.ca/) program in partnership with the [_Neotoma Paleoecological Database_](http://neotomadb.org).
+
+**Table of Contents**
+
+- [**MetaExtractor: Finding Fossils in the Literature**](#metaextractor-finding-fossils-in-the-literature)
+  - [**Article Relevance Prediction**](#article-relevance-prediction)
+  - [**Data Extraction Pipeline**](#data-extraction-pipeline)
+  - [**Data Review Tool**](#data-review-tool)
+  - [How to use this repository](#how-to-use-this-repository)
+    - [Entity Extraction Model Training](#entity-extraction-model-training)
+    - [Data Review Tool](#data-review-tool-1)
+    - [Data Requirements](#data-requirements)
+      - [Article Relevance Prediction](#article-relevance-prediction-1)
+      - [Data Extraction Pipeline](#data-extraction-pipeline-1)
+    - [Development Workflow Overview](#development-workflow-overview)
+    - [Analysis Workflow Overview](#analysis-workflow-overview)
+    - [System Requirements](#system-requirements)
+    - [**Directory Structure and Description**](#directory-structure-and-description)
+  - [**Contributors**](#contributors)
+    - [Tips for Contributing](#tips-for-contributing)

 There are 3 primary components to this project:
-1. **Article Relevance Prediction** - get the latest articles published, predict which ones are relevant to Neotoma and submit for processing
-2. **MetaData Extraction Pipeline** - extract relevant metadata from the article including geographic locations, taxa present, etc.
-3. **Data Review Tool** - this takes the extracted data and allows a user to review and correct it for submission to Neotoma
+
+1. **Article Relevance Prediction** - get the latest published articles, predict which ones are relevant to Neotoma, and submit them for processing.
+2. **MetaData Extraction Pipeline** - extract relevant entities from the article, including geographic locations, taxa, etc.
+3. **Data Review Tool** - takes the extracted data and allows the user to review and correct it for submission to Neotoma.

 ![](assets/project-flow-diagram.png)

-## Article Relevance Prediction
+## **Article Relevance Prediction**
+
+The goal of this component is to monitor and identify new articles that are relevant to Neotoma. This is done by using the public [xDD API](https://geodeepdive.org/) to regularly retrieve recently published articles. Article metadata is queried from the [CrossRef API](https://www.crossref.org/documentation/retrieve-metadata/rest-api/) to obtain data such as journal name, title, abstract, and more. The article metadata is then used to predict whether the article is relevant to Neotoma.
+
+The model was trained on ~900 positive examples (a sample of articles currently contributing to Neotoma) and ~3,500 negative examples (a sample of articles unrelated or closely related to Neotoma). A logistic regression model was chosen for its strong performance and interpretability.
+
+Articles predicted to be relevant are then submitted to the Data Extraction Pipeline for processing.
+
+![](assets/article_prediction_flow.png)

-The goal of this component is to monitor and identify new articles that are relevant to Neotoma. This is done by using the public [xDD API](https://geodeepdive.org/) to regularly get recently published articles. Article metadata is queried from the [CrossRef API](https://www.crossref.org/documentation/retrieve-metadata/rest-api/) to obtain data such as journal name, title, abstract and more. The article metadata is then used to predict whether the article is relevant to Neotoma or not. The predicted articles are then submitted to the MetaData Extraction Pipeline for processing.
+To run the Docker image for the article relevance prediction pipeline, please refer to the instructions [here](docker/article-relevance/README.md).

-## MetaData Extraction Pipeline
+## **Data Extraction Pipeline**

-The predicted relevant articles have their full text provided by the xDD team and a custom trained Named Entity Recognition (NER) model is used to extract relevant data from the article.
+The full text of articles deemed relevant is provided by the xDD team, and a custom-trained **Named Entity Recognition (NER)** model is used to extract entities of interest from each article.

-The entities detected by this model are:
-- **AGE**: when historical ages are mentioned such as 1234 AD or 4567 BP (before present)
-- **TAXA**: plant or animal taxa names indicating what samples contained
-- **GEOG**: geographic coordinates indicating where samples were excavated from, e.g. 12'34"N 34'23"W
-- **SITE**: site names for where samples were excavated from
-- **REGION**: more general regions to provide context for where sites are located
-- **EMAIL**: researcher emails in the articles able to be used for follow-up contact
-- **ALTI**: altitudes of sites from where samples were excavated, e.g. 123 m a.s.l (above sea level)
+The entities extracted by this model are:

-The model was trained on ~40 existing Paleoecology articles manually annotated by the team consisting of ~60,000 tokens with ~4,500 tagged entities.
+- **SITE**: name of the excavation site
+- **REGION**: more general region names that provide context for where sites are located
+- **TAXA**: plant or animal fossil names
+- **AGE**: historical age of the fossils, e.g. 1234 AD, 4567 BP
+- **GEOG**: geographic coordinates of the site, e.g. 12'34"N 34'23"W
+- **EMAIL**: researcher emails referenced in the articles
+- **ALTI**: altitudes of sites, e.g. 123 m a.s.l. (above sea level)

-The trained model is available for inference and re-use on huggingface.co [here](https://huggingface.co/finding-fossils/metaextractor).
+The model was trained on ~40 existing paleoecology articles manually annotated by the team, comprising **~60,000 tokens** with **~4,500 tagged entities**.
+
+The trained model is available for inference and further development on huggingface.co [here](https://huggingface.co/finding-fossils/metaextractor).
 ![](assets/hugging-face-metaextractor.png)

-## Data Review Tool
+## **Data Review Tool**

-Finally, the extracted data is loaded into the Data Review Tool where members of the Neotoma community can review the data and make any corrections necessary before submitting to Neotoma. The Data Review Tool is a web application built using the [Plotly Dash](https://dash.plotly.com/) framework. The tool allows users to view the extracted data, make corrections, and submit the data to be entered into Neotoma.
+Finally, the extracted data is loaded into the Data Review Tool, where members of the Neotoma community can view the extracted data, make any necessary corrections, and submit the data to be entered into Neotoma. The Data Review Tool is a web application built using the [Plotly Dash](https://dash.plotly.com/) framework.

 ![](assets/data-review-tool.png)
-## Contributors

-This project is an open project, and contributions are welcome from any individual. All contributors to this project are bound by a [code of conduct](https://github.com/NeotomaDB/MetaExtractor/blob/main/CODE_OF_CONDUCT.md). Please review and follow this code of conduct as part of your contribution.
+## How to use this repository

-The UBC MDS project team consists of:
-- Ty Andrews
-- Kelly Wu
-- Jenit Jain
-- Shaun Hutchinson
+First, install the requirements, and install Docker if it is not already installed ([Docker install instructions](https://docs.docker.com/get-docker/)):

-Sponsors from Neotoma supporting the project are:
-* [![ORCID](https://img.shields.io/badge/orcid-0000--0002--7926--4935-brightgreen.svg)](https://orcid.org/0000-0002-7926-4935) [Socorro Dominguez Vidana](https://ht-data.com/)
-* [![ORCID](https://img.shields.io/badge/orcid-0000--0002--2700--4605-brightgreen.svg)](https://orcid.org/0000-0002-2700-4605) [Simon Goring](http://www.goring.org)
+```bash
+pip install -r requirements.txt
+```

-### Tips for Contributing
+A conda environment file will be provided in the final release.

-Issues and bug reports are always welcome. Code clean-up, and feature additions can be done either through pull requests to [project forks](https://github.com/NeotomaDB/MetaExtractor/network/members) or [project branches](https://github.com/NeotomaDB/MetaExtractor/branches).
+### Entity Extraction Model Training

-All products of the Neotoma Paleoecology Database are licensed under an [MIT License](LICENSE) unless otherwise noted.
+The Entity Extraction models can be trained using the HuggingFace API by following the instructions in the [Entity Extraction Training README](src/entity_extraction/training/hf_token_classification/README.md).

-## How to use this repository
+The spaCy model training documentation is a WIP.

-WIP
+### Data Review Tool
+
+The Data Review Tool can be launched by running the following command from the root directory of this repository:
+
+```bash
+docker-compose up --build data-review-tool
+```
+
+Once the image is built and the container is running, the Data Review Tool can be accessed at http://localhost:8050/. A sample "extracted entities" JSON file is provided for demo purposes.
+
+### Data Requirements

+Each component of this project has different data requirements, outlined below.
+
+#### Article Relevance Prediction
+
+The article relevance prediction component requires a list of journals that are relevant to Neotoma. The dataset used to train and develop the model is available for download HERE. TODO: Setup public link for data download from project GDrive.
+
+#### Data Extraction Pipeline
+
+As the full-text articles provided by the xDD team are not publicly available, we cannot create a public link to download the labelled training data. For access requests, please contact Ty Andrews at ty.elgin.andrews@gmail.com.

 ### Development Workflow Overview

````
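The hunk above describes the relevance classifier only at a high level: logistic regression over CrossRef article metadata, trained on roughly 900 positive and 3,500 negative examples. A minimal sketch of that setup, assuming scikit-learn, toy stand-in data, and TF-IDF features over title/abstract text (the feature choice is an assumption for illustration, not the project's documented pipeline):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-ins for CrossRef metadata text; the real model was trained on
# ~900 relevant and ~3,500 irrelevant articles.
texts = [
    "Pollen-based vegetation reconstruction from a lake sediment core",
    "Holocene fossil midge assemblages and paleotemperature",
    "Quarterly earnings of a software company",
    "Deep learning for image classification benchmarks",
]
labels = [1, 1, 0, 0]  # 1 = relevant to Neotoma, 0 = not relevant

# TF-IDF features + logistic regression: per-term coefficients stay interpretable
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

print(model.predict(["Fossil pollen record from sediment cores"]))
```

In the real pipeline the positive/negative labels would come from Neotoma's known holdings rather than hand-written strings, but the interpretability argument is the same: the fitted coefficients show which metadata terms drive a relevance call.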
````diff
@@ -79,16 +122,16 @@ WIP

 WIP

-### Data Requirements
-
-WIP
-
-### Directory Structure and Description
+### **Directory Structure and Description**

 ```
 ├── .github/                       <- Directory for GitHub files
 │   ├── workflows/                 <- Directory for workflows
 ├── assets/                        <- Directory for assets
+├── docker/                        <- Directory for docker files
+│   ├── article-relevance/         <- Directory for docker files related to article relevance prediction
+│   ├── data-review-tool/          <- Directory for docker files related to data review tool
+│   ├── entity-extraction/         <- Directory for docker files related to named entity recognition
 ├── data/                          <- Directory for data
 │   ├── entity-extraction/         <- Directory for named entity extraction data
 │   │   ├── raw/                   <- Raw unprocessed data
````
````diff
@@ -113,13 +156,35 @@
 ├── src/                           <- Directory for source code
 │   ├── entity_extraction/         <- Directory for named entity recognition code
 │   ├── article_relevance/         <- Directory for article relevance prediction code
-│   └── data_review_tool/          <- Directory for data review tool code
+│   └── data_review_tool/          <- Directory for data review tool code
 ├── reports/                       <- Directory for reports
 ├── tests/                         <- Directory for tests
 ├── Makefile                       <- Makefile with commands to perform analysis
 └── README.md                      <- The top-level README for developers using this project.
 ```

+## **Contributors**
+
+This project is an open project, and contributions are welcome from any individual. All contributors to this project are bound by a [code of conduct](https://github.com/NeotomaDB/MetaExtractor/blob/main/CODE_OF_CONDUCT.md). Please review and follow this code of conduct as part of your contribution.
+
+The UBC MDS project team consists of:
+
+- **Ty Andrews**
+- **Kelly Wu**
+- **Jenit Jain**
+- **Shaun Hutchinson**
+
+Sponsors from Neotoma supporting the project are:
+
+- [![ORCID](https://img.shields.io/badge/orcid-0000--0002--7926--4935-brightgreen.svg)](https://orcid.org/0000-0002-7926-4935) [Socorro Dominguez Vidana](https://ht-data.com/)
+- [![ORCID](https://img.shields.io/badge/orcid-0000--0002--2700--4605-brightgreen.svg)](https://orcid.org/0000-0002-2700-4605) [Simon Goring](http://www.goring.org)
+
+### Tips for Contributing
+
+Issues and bug reports are always welcome. Code clean-up and feature additions can be done through pull requests to [project forks](https://github.com/NeotomaDB/MetaExtractor/network/members) or [project branches](https://github.com/NeotomaDB/MetaExtractor/branches).
+
+All products of the Neotoma Paleoecology Database are licensed under an [MIT License](LICENSE) unless otherwise noted.
+
 [contributors-shield]: https://img.shields.io/github/contributors/NeotomaDB/MetaExtractor.svg?style=for-the-badge
 [contributors-url]: https://github.com/NeotomaDB/MetaExtractor/graphs/contributors
 [forks-shield]: https://img.shields.io/github/forks/NeotomaDB/MetaExtractor.svg?style=for-the-badge
````
```diff
@@ -129,4 +194,4 @@
 [issues-shield]: https://img.shields.io/github/issues/NeotomaDB/MetaExtractor.svg?style=for-the-badge
 [issues-url]: https://github.com/NeotomaDB/MetaExtractor/issues
 [license-shield]: https://img.shields.io/github/license/NeotomaDB/MetaExtractor.svg?style=for-the-badge
-[license-url]: https://github.com/NeotomaDB/MetaExtractor/blob/master/LICENSE.txt
+[license-url]: https://github.com/NeotomaDB/MetaExtractor/blob/master/LICENSE.txt
```
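The entity labels listed in the README diff above (SITE, REGION, TAXA, AGE, GEOG, EMAIL, ALTI) come from a token-classification model, which typically emits one BIO tag per token. A generic sketch (not code from this repository) of merging those token-level tags back into entity spans:

```python
def group_bio_tags(tokens, tags):
    """Merge token-level BIO tags (B-TAXA, I-TAXA, O, ...) into (label, text) spans."""
    entities, current_label, current_tokens = [], None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current_label:  # close any entity that is still open
                entities.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_label == tag[2:]:
            current_tokens.append(token)  # continue the open entity
        else:  # "O", or an I- tag that does not continue the open entity
            if current_label:
                entities.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = None, []
    if current_label:
        entities.append((current_label, " ".join(current_tokens)))
    return entities

tokens = ["Pinus", "strobus", "at", "Crawford", "Lake", ",", "450", "BP"]
tags   = ["B-TAXA", "I-TAXA", "O", "B-SITE", "I-SITE", "O", "B-AGE", "I-AGE"]
print(group_bio_tags(tokens, tags))
# → [('TAXA', 'Pinus strobus'), ('SITE', 'Crawford Lake'), ('AGE', '450 BP')]
```

The example tokens and tags are illustrative; the published `finding-fossils/metaextractor` model would supply real tags at inference time.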
