A comprehensive Robotic Process Automation (RPA) pipeline for curating systematic literature review datasets to enable Large Language Model (LLM) automation of article selection tasks.
This project addresses a critical challenge in systematic literature reviews: the time-intensive manual process of article selection and metadata curation. Traditional systematic reviews can take 1-3 years and require reviewing thousands of articles. Our solution creates high-quality annotated datasets from published systematic reviews using automated metadata extraction techniques.
- 16 systematic review datasets processed and curated
- 31,548 total articles with extracted metadata (from 32,646 source records, 1,098 excluded)
- 96.6% article recovery rate from academic databases
- 97% automation success rate for metadata extraction
- 8 academic database sources integrated
The curated datasets enable researchers to train and evaluate LLMs for automating systematic review processes, potentially reducing review time from years to weeks while maintaining scientific rigor.
| Component | Technology | Purpose |
|---|---|---|
| Web Automation | Selenium WebDriver | Browser automation and navigation |
| HTML Parsing | BeautifulSoup4 | Content extraction and parsing |
| Bibliography Processing | Pybtex | BibTeX format handling |
| Data Processing | Pandas | Data manipulation and analysis |
| Text Processing | NLTK | Natural language processing |
| Cross-platform Support | Python 3.8+ | Windows and Linux compatibility |
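The stack above maps to a simple `requirements.txt`; the repository ships its own, so the following is only an illustrative sketch (unpinned — pin versions to whatever you have tested):

```text
selenium
beautifulsoup4
pybtex
pandas
nltk
```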
Scripts/
├── core/ # Core infrastructure
│ ├── SRProject.py # Base systematic review class
│ ├── os_path.py # Cross-platform path management
│ └── __init__.py
├── datasets/ # Individual dataset processors (16 datasets)
│ ├── ArchiML.py # Architecture & Machine Learning (2,723 articles)
│ ├── CodeClone.py # Code Clone Detection (9,700 articles)
│ ├── GameSE.py # Game Software Engineering (3,489 + 1,133 articles)
│ ├── ModelingAssist.py # Modeling Assistance (2,249 articles)
│ └── ... (12 more datasets)
├── extraction/ # Metadata extraction pipeline
│ ├── findMissingMetadata.py # Core extraction logic
│ ├── webScraping.py # Selenium-based scraping
│ ├── htmlParser.py # HTML content parsing
│ └── searchInSource.py # Source-specific search
├── specialized/ # Specialized processors
│ ├── GameSE_abstract.py # Abstract-level analysis
│ ├── GameSE_title.py # Title-level analysis
│ └── Demo.py, IFT3710.py # Course-specific demos
├── utilities/ # Helper scripts
│ ├── convert_encoding.py # Character encoding conversion
│ ├── get_non_matching_titles.py # Quality control
│ └── rename_html.py # File management
├── data/ # Data files and caches
├── testing/ # Test scripts
├── logs/ # Log files and documentation
└── main.py # Main pipeline entry point
- Python 3.8+ with pip
- Firefox browser (for Selenium WebDriver)
- Academic database access (institutional subscriptions recommended)
- Windows 10/11 or Ubuntu Linux
- Clone the repository:

  ```bash
  git clone [repository-url]
  cd "Projet Curation des métadonnées"
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Configure paths:
  - Edit `Scripts/core/os_path.py` for your system paths
  - Ensure Firefox and geckodriver are properly installed
```bash
# Process a single dataset
python Scripts/main.py ArchiML

# Process multiple datasets
python Scripts/main.py CodeClone ModelingAssist GameSE

# Process all default datasets
python Scripts/main.py
```

| Dataset | Domain | Articles | SLR Title |
|---|---|---|---|
| CodeClone | Code Analysis | 9,700 | A systematic literature review on source code similarity measurement and clone detection: Techniques, applications, and challenges |
| CodeCompr | Code Comprehension | 3,930 | A systematic literature review on the impact of formatting elements on code legibility |
| GameSE-title | Game SE | 3,489 | The consolidation of game software engineering: A systematic literature review of software engineering for industry-scale computer games (title screening) |
| GameSE-abstract | Game SE | 1,133 | The consolidation of game software engineering: A systematic literature review of software engineering for industry-scale computer games (abstract screening) |
| ArchiML | ML Architecture | 2,723 | Architecting ML-enabled systems: Challenges, best practices, and design decisions |
| ModelingAssist | Modeling Tools | 2,249 | Understanding the landscape of software modelling assistants: A systematic mapping |
| ModelGuidance | Model-Driven | 1,741 | Modelling guidance in software engineering: A systematic literature review |
| SmellReprod | Code Quality | 1,714 | How far are we from reproducible research on code smell detection? A systematic literature review |
| SecSelfAdapt | Security | 1,433 | A systematic review on security and safety of self-adaptive systems |
| ESPLE | Empirical SE | 963 | Empirical software product line engineering: A systematic literature review |
| OODP | Design Patterns | 708 | A mapping study of language features improving object-oriented design patterns |
| Behave | BDD | 590 | Behaviour driven development: A systematic mapping study |
| TrustSE | Trust & Security | 553 | A systematic literature review on trust in the software ecosystem |
| DTCPS | Cyber-Physical | 403 | Digital-twin-based testing for cyber-physical systems: A systematic literature review |
| ESM_2 | Adaptive UI | 114 | Adaptive user interfaces in systems targeting chronic disease: A systematic literature review |
| TestNN | Neural Testing | 105 | Testing and verification of neural-network-based safety-critical control software: A systematic literature review |
| Total | | 31,548 | |
Our RPA pipeline extracts metadata from 8 major academic databases:
- IEEE Xplore - Technical publications and conferences
- ACM Digital Library - Computing and information technology
- ScienceDirect - Elsevier's multidisciplinary publications
- SpringerLink - Academic books and journals
- Scopus - Citation and abstract database
- Web of Science - Multidisciplinary citation database
- arXiv - Preprint repository for STEM fields
- PubMed Central - Biomedical literature archive
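Each database needs its own parsing logic for its result pages. A minimal sketch of how per-source handling can be dispatched by URL domain — the function names and registry here are illustrative, and the pipeline's actual `searchInSource.py` may be organized differently:

```python
# Map each supported database domain to the parser responsible for its pages.
# The parser bodies are placeholders; real ones would run BeautifulSoup over fetched HTML.
def parse_ieee(html: str) -> dict:
    return {"source": "IEEE Xplore"}

def parse_acm(html: str) -> dict:
    return {"source": "ACM Digital Library"}

PARSERS = {
    "ieeexplore.ieee.org": parse_ieee,
    "dl.acm.org": parse_acm,
    # ... one entry per supported database
}

def parser_for(url: str):
    """Pick the parser whose domain appears in the article URL."""
    for domain, parser in PARSERS.items():
        if domain in url:
            return parser
    raise ValueError(f"Unsupported source: {url}")

print(parser_for("https://dl.acm.org/doi/10.1145/1234567")("<html>...</html>"))
# {'source': 'ACM Digital Library'}
```

Dispatching on domain keeps source-specific scraping quirks isolated in one function per database, so adding a ninth source is a registry entry plus one parser.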
```python
# Enable/disable metadata extraction
do_extraction = True  # Set to False for testing without web scraping

# Process specific datasets
args = ['ArchiML', 'CodeClone', 'ModelingAssist']

# Run identifier for batch processing
run = 999
```

Edit `Scripts/core/os_path.py` for your environment:

```python
# Main project path
MAIN_PATH = "C:\\Users\\...\\Projet Curation des métadonnées"

# Extracted content cache
EXTRACTED_PATH = "C:\\Users\\...\\Database"
```

All datasets follow this standardized schema:
| Field | Type | Description |
|---|---|---|
| `key` | String | Unique article identifier |
| `project` | String | Dataset name |
| `title` | String | Article title |
| `abstract` | String | Article abstract |
| `keywords` | String | Article keywords (semicolon-separated) |
| `authors` | String | Author list (semicolon-separated) |
| `venue` | String | Publication venue |
| `doi` | String | Digital Object Identifier |
| Field | Type | Description |
|---|---|---|
| `screened_decision` | String | Initial screening decision |
| `final_decision` | String | Final inclusion decision |
| `mode` | String | Review mode (`new_screen`, `snowballing`) |
| `inclusion_criteria` | String | Inclusion criteria description |
| `exclusion_criteria` | String | Exclusion criteria description |
| `reviewer_count` | Integer | Number of reviewers |
| Field | Type | Description |
|---|---|---|
| `source` | String | Academic database source |
| `year` | String | Publication year |
| `meta_title` | String | Source dataset title |
| `link` | String | Source URL |
| `publisher` | String | Publisher information |
| `metadata_missing` | String | Missing metadata indicators |
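The schema above can be enforced programmatically before export. A minimal sketch, assuming records are plain dicts — the field names come from the tables, but the validation helper itself is illustrative, not part of the pipeline:

```python
# All fields from the three schema tables above.
REQUIRED_FIELDS = [
    "key", "project", "title", "abstract", "keywords", "authors", "venue", "doi",
    "screened_decision", "final_decision", "mode",
    "inclusion_criteria", "exclusion_criteria", "reviewer_count",
    "source", "year", "meta_title", "link", "publisher", "metadata_missing",
]

def missing_fields(record: dict) -> list:
    """Return the schema fields absent from a curated article record."""
    return [f for f in REQUIRED_FIELDS if f not in record]

record = {"key": "behave-001", "project": "Behave",
          "title": "Behaviour driven development: A systematic mapping study"}
print(missing_fields(record)[:3])  # ['abstract', 'keywords', 'authors']
```

Running such a check per dataset makes gaps explicit instead of surfacing later as downstream training errors.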
```python
# Load systematic review dataset
sr_project = ArchiML()  # Example dataset
```

- Duplicate title detection and resolution
- Data schema normalization
- Character encoding standardization

```python
# Enable web scraping
do_extraction = True
completed_df = findMissingMetadata.main(sr_project.df, do_extraction, run, dataset_name)
```

- Unicode character normalization
- Illegal character removal
- Format standardization
- Title matching validation
- Missing metadata reporting
- Statistical analysis

```python
# Export to TSV format
ExportToCSV(sr_project)
```

- Create dataset class in `Scripts/datasets/`:
```python
class NewDataset(SRProject):
    def __init__(self):
        super().__init__()
        self.project_name = "NewDataset"
        # Define inclusion/exclusion criteria
        # Set source file paths
```

- Add to `main.py`:

```python
from Scripts.datasets.NewDataset import NewDataset

# Add to main() function
elif arg == "NewDataset":
    sr_project = NewDataset()
```

- Article Recovery: 96.6% of source records successfully retained (31,548 of 32,646)
- Metadata Extraction: 97% automation success rate
- Title Matching Accuracy: >95% using fuzzy matching algorithms
- Edit distance algorithms for title similarity
- Cross-reference verification when multiple sources available
- Format standardization across all datasets
- Comprehensive error logging for manual review
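To illustrate the title-matching step, here is a minimal sketch using Python's standard-library `difflib` as a stand-in for edit-distance comparison; the exact algorithm and the 0.95 threshold are assumptions for this example, not the pipeline's actual settings:

```python
from difflib import SequenceMatcher

def normalize(title: str) -> str:
    """Lowercase and collapse whitespace so formatting differences don't count as edits."""
    return " ".join(title.lower().split())

def titles_match(a: str, b: str, threshold: float = 0.95) -> bool:
    """Treat two titles as the same article if their similarity ratio clears the threshold."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold

# A scraped title with different casing/spacing still matches its source record
print(titles_match("Behaviour Driven Development:  A Systematic Mapping Study",
                   "behaviour driven development: a systematic mapping study"))  # True
```

Fuzzy rather than exact comparison is what lets the pipeline tolerate the casing, punctuation, and whitespace variations that scraped titles typically carry.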
This project supports research in:
- Systematic Literature Review Automation
- Large Language Model Training for Academic Tasks
- Robotic Process Automation in Research
Note: This pipeline requires academic database access and appropriate institutional subscriptions for optimal functionality. The system is designed for research and educational purposes.