This project aims to build a scalable data pipeline to scrape, process, and analyze data from websites listing psychologists and therapists. The pipeline integrates multiple tools and frameworks, leveraging Spark/Databricks for ETL processing and analytics. The end goal is to create groupings of therapists using the following techniques:
- Keyword Extraction and Matching
  - Use Natural Language Processing (NLP) techniques to extract key terms from therapist descriptions.
  - Group therapists based on predefined categories such as:
    - Target populations: "Black individuals," "LGBTQ+," "adolescents."
    - Specializations: "trauma," "grief," "anxiety."
    - Therapeutic approaches: "CBT," "mindfulness," "play therapy."
  - Tools: Libraries such as SpaCy or NLTK can help extract and match keywords from text (see the sketch below).
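A minimal sketch of keyword matching with spaCy's `PhraseMatcher`; the category terms and the sample description are illustrative, not the project's actual schema:

```python
import spacy
from spacy.matcher import PhraseMatcher

# Illustrative category keyword lists; the real ones could live in configs/config.yaml.
CATEGORIES = {
    "target_population": ["Black individuals", "LGBTQ+", "adolescents"],
    "specialization": ["trauma", "grief", "anxiety"],
    "approach": ["CBT", "mindfulness", "play therapy"],
}

nlp = spacy.load("en_core_web_sm")
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
for category, terms in CATEGORIES.items():
    matcher.add(category, [nlp.make_doc(term) for term in terms])

def tag_description(description: str) -> set[str]:
    """Return the set of category labels whose keywords appear in the text."""
    doc = nlp(description)
    return {nlp.vocab.strings[match_id] for match_id, _, _ in matcher(doc)}

print(tag_description("Warm, mindfulness-based therapist serving adolescents with anxiety."))
# e.g. {'approach', 'target_population', 'specialization'}
```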
- Sentiment Analysis
  - Assess the tone or emotional alignment of the therapists' descriptions (e.g., warm and welcoming, energetic).
  - Sentiment analysis may not directly address grouping by needs, but it can provide insight into how each therapist communicates their services (see the sketch below).
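A minimal sketch using NLTK's VADER analyzer (one possible choice; any sentiment library would work here). The sample description is illustrative:

```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)  # one-time lexicon download

analyzer = SentimentIntensityAnalyzer()

description = "I offer a warm, welcoming space for clients working through grief and anxiety."
scores = analyzer.polarity_scores(description)
print(scores)  # e.g. {'neg': ..., 'neu': ..., 'pos': ..., 'compound': ...}
```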
data-pipeline-project
├── src
│ ├── web_scraping # Contains web scraping scripts
│ ├── data_ingestion # Contains data ingestion scripts
│ ├── etl_transformation # Contains ETL transformation scripts
│ ├── analytics # Contains analytics scripts
│ ├── visualization # Contains visualization scripts
│ └── pipeline_automation # Contains automation scripts
├── data
│ ├── raw # Stores raw scraped data
│ └── processed # Stores processed data
├── notebooks # Contains Jupyter notebooks for analysis and ML
├── configs # Contains configuration files
├── requirements.txt # Lists project dependencies
├── README.md # Project documentation
└── .gitignore # Specifies files to ignore in version control
- Clone the repository: `git clone <repository-url>` and `cd data-pipeline-project`
- Install dependencies: `pip install -r requirements.txt`
- Configure settings: Update the `configs/config.yaml` file with your specific configuration settings, such as database connections and API keys (see the loader sketch below).
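A minimal sketch of reading that configuration from Python; the key names shown (`database`, `api_keys`, `scraping`) are assumptions for illustration, not the file's actual schema:

```python
import yaml

# Load settings shared across the pipeline stages.
with open("configs/config.yaml") as f:
    config = yaml.safe_load(f)

# Hypothetical keys -- adjust to whatever configs/config.yaml actually defines.
db_url = config["database"]["url"]
api_key = config["api_keys"]["maps"]
target_sites = config["scraping"]["target_urls"]
```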
- Web Scraping: Use the scripts in `src/web_scraping` to scrape data from target websites (a simplified example follows).
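A simplified, hypothetical scraping sketch with `requests` and BeautifulSoup; the URL, CSS classes, and output path are placeholders, since the real selectors depend on the target directory site:

```python
import json
import requests
from bs4 import BeautifulSoup

# Placeholder listing page; real targets would come from configs/config.yaml.
URL = "https://example.com/therapists?city=chattanooga"

response = requests.get(URL, timeout=30)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")

records = []
for card in soup.select(".therapist-card"):  # placeholder CSS class
    records.append({
        "name": card.select_one(".name").get_text(strip=True),
        "location": card.select_one(".location").get_text(strip=True),
        "description": card.select_one(".bio").get_text(strip=True),
    })

# Land the raw data for the ingestion step.
with open("data/raw/therapists.json", "w") as f:
    json.dump(records, f, indent=2)
```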
- Data Ingestion: Run `src/data_ingestion/ingestion.py` to clean and load the raw data into the data lake (sketched below).
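A minimal sketch of what the ingestion step could look like with PySpark and Delta Lake; the paths and column names are assumptions based on the raw scrape format above:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("therapist-ingestion").getOrCreate()

# Read the raw scraped JSON and do light cleanup before landing it in the lake.
raw = spark.read.json("data/raw/therapists.json", multiLine=True)
cleaned = (
    raw.dropDuplicates(["name", "location"])
       .withColumn("name", F.trim(F.col("name")))
       .withColumn("ingested_at", F.current_timestamp())
)

# Delta output assumes Databricks or the delta-spark package is available.
cleaned.write.format("delta").mode("overwrite").save("data/processed/therapists_bronze")
```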
- ETL Transformation: Execute `src/etl_transformation/transformation.py` to transform the data for analysis (sketched below).
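A minimal transformation sketch, again with assumed column names; it normalizes ratings and splits the comma-separated specialty field so downstream grouping can work per specialty:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("therapist-transformation").getOrCreate()

bronze = spark.read.format("delta").load("data/processed/therapists_bronze")

# Cast ratings to numbers and explode the specialty list into one row per specialty.
silver = (
    bronze.withColumn("rating", F.col("rating").cast("double"))
          .withColumn("specialty", F.explode(F.split(F.col("specialty"), ",\\s*")))
          .withColumn("specialty", F.lower(F.trim(F.col("specialty"))))
)

silver.write.format("delta").mode("overwrite").save("data/processed/therapists_silver")
```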
- Analytics: Use `src/analytics/insights.py` to generate insights from the processed data (sketched below).
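An illustrative insight query over the assumed silver table: therapist counts and average ratings per specialty:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("therapist-insights").getOrCreate()

silver = spark.read.format("delta").load("data/processed/therapists_silver")

# How many therapists cover each specialty, and how are they rated?
insights = (
    silver.groupBy("specialty")
          .agg(
              F.countDistinct("name").alias("therapist_count"),
              F.round(F.avg("rating"), 2).alias("avg_rating"),
          )
          .orderBy(F.desc("therapist_count"))
)
insights.show(truncate=False)
```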
- Visualization: Run `src/visualization/dashboard.py` to create visualizations of the insights (sketched below).
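A minimal plotting sketch (matplotlib is used here for brevity; the actual dashboard could be built with Plotly, Dash, or Databricks SQL dashboards). The input path and columns are illustrative:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Assume the analytics step exported its summary; path and columns are placeholders.
insights = pd.read_parquet("data/processed/specialty_insights.parquet")

fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(insights["specialty"], insights["therapist_count"])
ax.set_xlabel("Specialty")
ax.set_ylabel("Number of therapists")
ax.set_title("Therapist coverage by specialty")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.savefig("data/processed/specialty_coverage.png")
```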
- Pipeline Automation: Schedule the pipeline using `src/pipeline_automation/automation.py` (an Airflow-style sketch follows).
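Since the features list mentions Apache Airflow or Databricks Workflows, here is a hedged Airflow-style sketch; the task names and script filenames are assumptions:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# A daily run of the core stages; adjust paths and schedule to the real deployment.
with DAG(
    dag_id="therapist_data_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    scrape = BashOperator(task_id="scrape", bash_command="python src/web_scraping/scraper.py")
    ingest = BashOperator(task_id="ingest", bash_command="python src/data_ingestion/ingestion.py")
    transform = BashOperator(task_id="transform", bash_command="python src/etl_transformation/transformation.py")
    analyze = BashOperator(task_id="analyze", bash_command="python src/analytics/insights.py")

    scrape >> ingest >> transform >> analyze
```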
- Explore data analysis and insights generation in `notebooks/data_analysis.ipynb`.
- Build and evaluate machine learning models in `notebooks/ml_model.ipynb` (a clustering sketch follows).
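One plausible modeling direction for grouping therapists (not necessarily what `ml_model.ipynb` does): cluster TF-IDF vectors of the descriptions with scikit-learn:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative input; in practice this would come from the processed table.
df = pd.DataFrame({
    "name": ["Victoria J Davidson", "Blair Cobb"],
    "description": [
        "Play therapy for children and adolescents working through anxiety.",
        "Mindfulness-based support for stress and burnout, online sessions only.",
    ],
})

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(df["description"])

# Tiny n_clusters because the toy dataset is tiny; tune this on real data.
kmeans = KMeans(n_clusters=2, random_state=42, n_init="auto")
df["cluster"] = kmeans.fit_predict(X)
print(df[["name", "cluster"]])
```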
- Scalability: Utilizes Spark for distributed processing of large datasets.
- Resilience: Delta Lake ensures data consistency with ACID transactions.
- Automation: Apache Airflow or Databricks Workflows for scheduling and monitoring.
- Visualization: Real-time dashboards for actionable insights.
- Dataset:

  | Name | Location | Rating | Reviews | Specialty | Contact Info |
  | --- | --- | --- | --- | --- | --- |
  | Victoria J Davidson | Signal Mountain | 4.8 | 50 | Play Therapy, Anxiety | (423) 464-4354 |
  | Blair Cobb | Online Only | 4.7 | 35 | Mindfulness, Stress | (865) 813-8075 |

- Dashboard:
  - Review sentiment analysis heatmap.