- Overview
- Features
- Project Structure
- Technical Details
- API Routes
- Installation
- Environment Variables
- Usage
CodeWhisper is a Flask-based web application designed to help students and educators with various study-related tasks. The application integrates multiple technologies including ASR, OCR, and intelligent chatbot capabilities.
- PDF Text Extraction: Convert PDF documents(including images) into clean, formatted text.
- Speech-to-Text: Convert audio recordings into text transcriptions
- Keyword Search: Extract keywords from text and search related GitHub repositories
- AI Teaching Assistant: Interactive chatbot for computer science education
- Multiple Study Tools:
- Code Whisper
- Notes Helper
- Slide to Note
- Speech to Note
- Additional Resources
project/
βββ main.py
βββ templates/
β βββ CodeWhisper.html
β βββ NotesHelper.html
β βββ SlideToNote.html
β βββ SpeechToNote.html
β βββ AdditionalResources.html
β βββ TeachingAssistant.html
β βββ results.html
βββ uploads/
- Flask
- PyMuPDF (fitz)
- SpeechBrain
- PyTorch
- Transformers
- PyGithub
- rake-nltk
- OpenAI
- Torchaudio
- pytesseract
- pdf2image
- opencv-python
- numpy
- Pillow
- librosa
- speechbrain
- Loads and resamples audio files:
- Converts audio to 16kHz sampling rate for consistent processing.
- Automatic silence removal:
- Trims unnecessary silence or noise at the beginning and end of audio.
- Normalizes audio:
- Adjusts volume levels for uniform loudness across files.
- Handles audio formats:
- Supports common formats like WAV, MP3, and FLAC.
- Converts stereo to mono:
- Ensures compatibility with single-channel processing models.
- Trims unnecessary silence or noise at the beginning and end of audio.
- Converts raw audio to features:
- Uses Wav2Vec2Processor to convert audio signals into input tensors.
- Handles varying audio lengths:
- Pads or truncates sequences to match model input requirements.
- Noise reduction:
- Reduces background noise for better recognition accuracy.
- Dynamic range compression:
- Ensures uniform audio dynamics for model stability.
- Speech-to-Text (STT) Inference:
- Model-driven transcription:
- Utilizes Wav2Vec2ForCTC for automatic speech recognition.
- Batch processing support:
- Handles multiple languages:
- Extracts text from both digital and scanned PDF documents
- Implements intelligent OCR detection and processing
- Supports multiple languages recognition
- Image preprocessing for better OCR accuracy:
- Automatic image enhancement
- Noise reduction
- Contrast optimization
- Sharpening
- Handles hybrid PDFs (mix of digital and scanned content)
- Cleans and formats extracted text
- Handles formatting for titles, subtitles, and bullet points
- Removes excessive whitespace and line breaks
- Extracts text from PDF documents
- Cleans and formats text content
- Handles formatting for titles, subtitles, and bullet points
- Removes excessive whitespace and line breaks
- Searches repositories based on extracted keywords
- Ranks results by stars
- Returns top 5 most relevant repositories
- Uses OpenAI's GPT-4 model
- Implements computer science teaching assistant functionality
- Provides contextualized responses to student queries
/- Main landing page/NotesHelper- Notes assistance tool/SlideToNote- Slide conversion tool/SpeechToNote- Speech-to-text tool/AdditionalResources- Additional learning resources/TeachingAssistant- AI teaching assistant interface
POST /process_pdf
Content-Type: multipart/form-data- Accepts PDF files
- Returns cleaned and formatted text
POST /process_audio
Content-Type: multipart/form-data- Accepts multiple audio formats
- Returns transcribed text
POST /keyword_search
Content-Type: application/x-www-form-urlencoded- Accepts text input
- Returns keywords and related GitHub repositories
POST /chat
Content-Type: application/json- Accepts user messages
- Returns AI assistant responses with timestamps
-
Clone the repository
-
Install Python dependencies:
pip install flask pymupdf speechbrain torch torchaudio transformers PyGithub rake-nltk openai pytesseract pdf2image opencv-python numpy Pillow- Install Tesseract OCR Engine:
For Ubuntu/Debian:
sudo apt-get update
sudo apt-get install tesseract-ocr
# Optional: Install additional language packs
sudo apt-get install tesseract-ocr-chi-sim # Simplified Chinese
sudo apt-get install tesseract-ocr-chi-tra # Traditional ChineseFor MacOS:
brew install tesseract
# Optional: Install language packs
brew install tesseract-langFor Windows:
- Download Tesseract installer from: https://github.com/UB-Mannheim/tesseract/wiki
- Install to default location (Usually
C:\Program Files\Tesseract-OCR) - Add to system PATH
- Install Poppler (required for pdf2image):
For Ubuntu/Debian:
sudo apt-get install poppler-utilsFor MacOS:
brew install popplerFor Windows:
- Download from: http://blog.alivate.com.au/poppler-windows/
- Extract to a suitable location
- Add bin directory to system PATH
- Set up environment variables
- Create an
uploadsdirectory in the project root
GITHUB_TOKEN=your_github_token
OPENAI_API_KEY=your_openai_api_key
# For Windows users only:
TESSERACT_PATH=C:\Program Files\Tesseract-OCR\tesseract.exe-
Add these codes into main.py
GITHUB_TOKEN = os.getenv('GITHUB_TOKEN', 'your_github_token') #please replace 'your_github_token' with your GitHub token chatbot_bp = Blueprint('chatbot', __name__, template_folder='templates') client = OpenAI(api_key='your_api_key') # please replace 'your_api_key' with your OpenAI API key # For Windows users, uncomment and update the following line: # pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
-
Start the Flask server:
python app.py-
Access the application at
http://localhost:5000 -
Use different tools:
- Upload PDFs for text extraction (now supports scanned documents)
- Record or upload audio for transcription
- Submit text for keyword extraction and GitHub repository search
- Interact with the AI teaching assistant
- The application runs in debug mode by default
- Ensure sufficient disk space for uploaded files
- Regularly clean the uploads directory
- Monitor API usage limits for GitHub and OpenAI services
- OCR processing may take longer for large scanned documents
- For optimal OCR results:
- Ensure good quality scans
- Use appropriate language packs
- Consider preprocessing settings for specific document types