This project aims to build a scalable data pipeline to scrape, process, and analyze data from websites listing psychologists and therapists. The pipeline integrates multiple tools and frameworks, leveraging Spark/Databricks for ETL processing and analytics. The end goal is to create groupings of therapists using the following techniques:
- Keyword Extraction and Matching
  - Use Natural Language Processing (NLP) techniques to extract key terms from therapist descriptions.
  - Group therapists based on predefined categories such as:
    - Target populations: "Black individuals," "LGBTQ+," "adolescents."
    - Specializations: "trauma," "grief," "anxiety."
    - Therapeutic approaches: "CBT," "mindfulness," "play therapy."
  - Tools: Libraries such as SpaCy or NLTK can help extract and match keywords from text (see the sketch below).
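A minimal sketch of keyword matching with spaCy's `PhraseMatcher`; the category terms and the sample description are illustrative, not the project's actual schema:

```python
import spacy
from spacy.matcher import PhraseMatcher

# Illustrative category keyword lists; the real ones could live in configs/config.yaml.
CATEGORIES = {
    "target_population": ["Black individuals", "LGBTQ+", "adolescents"],
    "specialization": ["trauma", "grief", "anxiety"],
    "approach": ["CBT", "mindfulness", "play therapy"],
}

nlp = spacy.load("en_core_web_sm")
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
for category, terms in CATEGORIES.items():
    matcher.add(category, [nlp.make_doc(term) for term in terms])

def tag_description(description: str) -> set[str]:
    """Return the set of category labels whose keywords appear in the text."""
    doc = nlp(description)
    return {nlp.vocab.strings[match_id] for match_id, _, _ in matcher(doc)}

print(tag_description("Warm, mindfulness-based therapist serving adolescents with anxiety."))
# e.g. {'approach', 'target_population', 'specialization'}
```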
- Sentiment Analysis
  - Assess the tone or emotional alignment of the therapists' descriptions (e.g., warm and welcoming, energetic).
  - Sentiment analysis may not directly address grouping by needs, but it can provide insight into how each therapist communicates their services (see the sketch below).
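A minimal sketch using NLTK's VADER analyzer (one possible choice; any sentiment library would work here). The sample description is illustrative:

```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)  # one-time lexicon download

analyzer = SentimentIntensityAnalyzer()

description = "I offer a warm, welcoming space for clients working through grief and anxiety."
scores = analyzer.polarity_scores(description)
print(scores)  # e.g. {'neg': ..., 'neu': ..., 'pos': ..., 'compound': ...}
```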
data-pipeline-project
├── src
│ ├── web_scraping # Contains web scraping scripts
│ ├── data_ingestion # Contains data ingestion scripts
│ ├── etl_transformation # Contains ETL transformation scripts
│ ├── analytics # Contains analytics scripts
│ ├── visualization # Contains visualization scripts
│ └── pipeline_automation # Contains automation scripts
├── data
│ ├── raw # Stores raw scraped data
│ └── processed # Stores processed data
├── notebooks # Contains Jupyter notebooks for analysis and ML
├── configs # Contains configuration files
├── requirements.txt # Lists project dependencies
├── README.md # Project documentation
└── .gitignore # Specifies files to ignore in version control
- Clone the repository: `git clone <repository-url>` and `cd data-pipeline-project`
- Install dependencies: `pip install -r requirements.txt`
- Configure settings: Update the `configs/config.yaml` file with your specific configuration settings, such as database connections and API keys (see the loader sketch below).
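A minimal sketch of reading that configuration from Python; the key names shown (`database`, `api_keys`, `scraping`) are assumptions for illustration, not the file's actual schema:

```python
import yaml

# Load settings shared across the pipeline stages.
with open("configs/config.yaml") as f:
    config = yaml.safe_load(f)

# Hypothetical keys -- adjust to whatever configs/config.yaml actually defines.
db_url = config["database"]["url"]
api_key = config["api_keys"]["maps"]
target_sites = config["scraping"]["target_urls"]
```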
- Web Scraping: Use the scripts in `src/web_scraping` to scrape data from target websites (a simplified example follows).
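A simplified, hypothetical scraping sketch with `requests` and BeautifulSoup; the URL, CSS classes, and output path are placeholders, since the real selectors depend on the target directory site:

```python
import json
import requests
from bs4 import BeautifulSoup

# Placeholder listing page; real targets would come from configs/config.yaml.
URL = "https://example.com/therapists?city=chattanooga"

response = requests.get(URL, timeout=30)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")

records = []
for card in soup.select(".therapist-card"):  # placeholder CSS class
    records.append({
        "name": card.select_one(".name").get_text(strip=True),
        "location": card.select_one(".location").get_text(strip=True),
        "description": card.select_one(".bio").get_text(strip=True),
    })

# Land the raw data for the ingestion step.
with open("data/raw/therapists.json", "w") as f:
    json.dump(records, f, indent=2)
```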
- Data Ingestion: Run `src/data_ingestion/ingestion.py` to clean and load the raw data into the data lake (sketched below).
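A minimal sketch of what the ingestion step could look like with PySpark and Delta Lake; the paths and column names are assumptions based on the raw scrape format above:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("therapist-ingestion").getOrCreate()

# Read the raw scraped JSON and do light cleanup before landing it in the lake.
raw = spark.read.json("data/raw/therapists.json", multiLine=True)
cleaned = (
    raw.dropDuplicates(["name", "location"])
       .withColumn("name", F.trim(F.col("name")))
       .withColumn("ingested_at", F.current_timestamp())
)

# Delta output assumes Databricks or the delta-spark package is available.
cleaned.write.format("delta").mode("overwrite").save("data/processed/therapists_bronze")
```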
- ETL Transformation: Execute `src/etl_transformation/transformation.py` to transform the data for analysis (sketched below).
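A minimal transformation sketch, again with assumed column names; it normalizes ratings and splits the comma-separated specialty field so downstream grouping can work per specialty:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("therapist-transformation").getOrCreate()

bronze = spark.read.format("delta").load("data/processed/therapists_bronze")

# Cast ratings to numbers and explode the specialty list into one row per specialty.
silver = (
    bronze.withColumn("rating", F.col("rating").cast("double"))
          .withColumn("specialty", F.explode(F.split(F.col("specialty"), ",\\s*")))
          .withColumn("specialty", F.lower(F.trim(F.col("specialty"))))
)

silver.write.format("delta").mode("overwrite").save("data/processed/therapists_silver")
```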
- Analytics: Use `src/analytics/insights.py` to generate insights from the processed data (sketched below).
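An illustrative insight query over the assumed silver table: therapist counts and average ratings per specialty:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("therapist-insights").getOrCreate()

silver = spark.read.format("delta").load("data/processed/therapists_silver")

# How many therapists cover each specialty, and how are they rated?
insights = (
    silver.groupBy("specialty")
          .agg(
              F.countDistinct("name").alias("therapist_count"),
              F.round(F.avg("rating"), 2).alias("avg_rating"),
          )
          .orderBy(F.desc("therapist_count"))
)
insights.show(truncate=False)
```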
- Visualization: Run `src/visualization/dashboard.py` to create visualizations of the insights (sketched below).
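A minimal plotting sketch (matplotlib is used here for brevity; the actual dashboard could be built with Plotly, Dash, or Databricks SQL dashboards). The input path and columns are illustrative:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Assume the analytics step exported its summary; path and columns are placeholders.
insights = pd.read_parquet("data/processed/specialty_insights.parquet")

fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(insights["specialty"], insights["therapist_count"])
ax.set_xlabel("Specialty")
ax.set_ylabel("Number of therapists")
ax.set_title("Therapist coverage by specialty")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.savefig("data/processed/specialty_coverage.png")
```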
- Pipeline Automation: Schedule the pipeline using `src/pipeline_automation/automation.py` (an Airflow-style sketch follows).
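Since the features list mentions Apache Airflow or Databricks Workflows, here is a hedged Airflow-style sketch; the task names and script filenames are assumptions:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# A daily run of the core stages; adjust paths and schedule to the real deployment.
with DAG(
    dag_id="therapist_data_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    scrape = BashOperator(task_id="scrape", bash_command="python src/web_scraping/scraper.py")
    ingest = BashOperator(task_id="ingest", bash_command="python src/data_ingestion/ingestion.py")
    transform = BashOperator(task_id="transform", bash_command="python src/etl_transformation/transformation.py")
    analyze = BashOperator(task_id="analyze", bash_command="python src/analytics/insights.py")

    scrape >> ingest >> transform >> analyze
```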
- Explore data analysis and insights generation in `notebooks/data_analysis.ipynb`.
- Build and evaluate machine learning models in `notebooks/ml_model.ipynb` (a clustering sketch follows).
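One plausible modeling direction for grouping therapists (not necessarily what `ml_model.ipynb` does): cluster TF-IDF vectors of the descriptions with scikit-learn:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative input; in practice this would come from the processed table.
df = pd.DataFrame({
    "name": ["Victoria J Davidson", "Blair Cobb"],
    "description": [
        "Play therapy for children and adolescents working through anxiety.",
        "Mindfulness-based support for stress and burnout, online sessions only.",
    ],
})

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(df["description"])

# Tiny n_clusters because the toy dataset is tiny; tune this on real data.
kmeans = KMeans(n_clusters=2, random_state=42, n_init="auto")
df["cluster"] = kmeans.fit_predict(X)
print(df[["name", "cluster"]])
```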
- Scalability: Utilizes Spark for distributed processing of large datasets.
- Resilience: Delta Lake ensures data consistency with ACID transactions.
- Automation: Apache Airflow or Databricks Workflows for scheduling and monitoring.
- Visualization: Real-time dashboards for actionable insights.
- Dataset:

  | Name | Location | Rating | Reviews | Specialty | Contact Info |
  | --- | --- | --- | --- | --- | --- |
  | Victoria J Davidson | Signal Mountain | 4.8 | 50 | Play Therapy, Anxiety | (423) 464-4354 |
  | Blair Cobb | Online Only | 4.7 | 35 | Mindfulness, Stress | (865) 813-8075 |

- Dashboard:
  - Review sentiment analysis heatmap.