To begin, the text from the PDFs is extracted, split into sentences, and mapped into a 384-dimensional dense vector space. The PDF sentences, along with their vectors, are stored in MongoDB Atlas. An Atlas Vector Search index then allows the stored sentences to be queried, returning the PDFs that are most relevant to the query.
The text extractor reads the PDFs from a local directory.
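As a rough illustration of the extraction and sentence-splitting steps, the sketch below reads every PDF in a directory with PyPDF2 and splits the text with a naive regular expression. The function names and the splitting rule are assumptions made for this example; the source does not show how upload_files.py actually splits sentences.

```python
import re
from pathlib import Path


def split_sentences(text):
    """Naively split text into sentences on '.', '!' or '?' followed by whitespace."""
    # Collapse the line breaks and runs of spaces that PDF extraction leaves behind.
    normalized = " ".join(text.split())
    parts = re.split(r"(?<=[.!?])\s+", normalized)
    return [part for part in parts if part]


def extract_sentences(pdf_path):
    """Return the sentences found in one PDF file."""
    # Imported here so split_sentences can be used without PyPDF2 installed.
    from PyPDF2 import PdfReader

    reader = PdfReader(str(pdf_path))
    # extract_text() may yield nothing for image-only pages, hence the "or".
    text = " ".join(page.extract_text() or "" for page in reader.pages)
    return split_sentences(text)


def extract_directory(pdf_dir):
    """Map each PDF file name in pdf_dir to its list of sentences."""
    return {
        path.name: extract_sentences(path)
        for path in sorted(Path(pdf_dir).glob("*.pdf"))
    }
```

A regex splitter like this will mis-split abbreviations such as "e.g."; a dedicated sentence tokenizer would be a reasonable substitute.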
Open params.py and configure your connection to Atlas, along with the names of the database and collection where you'd like to store your text.
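The contents of params.py are not shown here, so the following is only a guess at its shape. Every variable name and value is a placeholder assumed for illustration; check the actual file for the names it expects.

```python
# params.py (hypothetical layout; all names and values below are placeholders)

# Atlas connection string, as shown in the Atlas "Connect" dialog.
mongodb_conn_string = "mongodb+srv://<user>:<password>@<cluster-host>/"

# Database and collection that will hold the sentences and their vectors.
database = "<database-name>"
collection = "<collection-name>"
```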
Install the requirements. This implementation uses:
- PyPDF2 Python library for text extraction
- Hugging Face sentence-transformers/all-MiniLM-L6-v2 pretrained model for the dense vector mapping
- pymongo - the Python driver for MongoDB
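As a minimal sketch of how the embedding model and the driver listed above could fit together, the code below encodes a list of sentences and inserts one document per sentence. The `documentVector` field matches the index path used in this README; the `pdf` and `sentence` field names and both function names are assumptions for this example, not necessarily what upload_files.py uses.

```python
def build_documents(pdf_name, sentences, vectors):
    """Pair each sentence with its vector in a MongoDB-ready document."""
    return [
        {"pdf": pdf_name, "sentence": sentence, "documentVector": list(vector)}
        for sentence, vector in zip(sentences, vectors)
    ]


def upload_sentences(pdf_name, sentences, conn_string, db_name, coll_name):
    """Encode sentences and insert them into an Atlas collection."""
    # Third-party imports live here so build_documents has no dependencies.
    from pymongo import MongoClient
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    # encode() returns a NumPy array with one 384-dimensional row per sentence;
    # tolist() converts it to plain Python floats that can be stored as BSON.
    vectors = model.encode(sentences).tolist()

    docs = build_documents(pdf_name, sentences, vectors)
    if docs:  # insert_many rejects an empty list
        client = MongoClient(conn_string)
        client[db_name][coll_name].insert_many(docs)
    return len(docs)
```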
pip install -r requirements.txt

Run upload_files.py:

python3 upload_files.py

Create an Atlas Vector Search index on the collection, named vector_index to match the query, with this definition:
{
  "fields": [
    {
      "numDimensions": 384,
      "path": "documentVector",
      "similarity": "cosine",
      "type": "vector"
    }
  ]
}

Your query will be mapped using the same sentence transformer that was used to encode the data and then submitted to Atlas Vector Search, which returns the top 3 matches.
For example:
The following PDFs may contain the answers you seek:
----------------------------------------------------
PDF: Resume1.pdf
PDF: Resume2.pdf
PDF: Resume3.pdf

This is the simple query passed to MongoDB:

[
  {
    "$vectorSearch": {
      "index": "vector_index",
      "path": "documentVector",
      "queryVector": query_vector,
      "numCandidates": 150,
      "limit": 3
    }
  }
]
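To show where `query_vector` comes from, here is a hedged sketch of the query path: the query string is encoded with the same model used for the documents, and the resulting vector is placed in the pipeline above. The function names are assumptions for this example; `model` is expected to be a SentenceTransformer instance and `collection` a pymongo collection.

```python
def build_pipeline(query_vector, limit=3, num_candidates=150):
    """Build the $vectorSearch aggregation pipeline shown above."""
    return [
        {
            "$vectorSearch": {
                "index": "vector_index",
                "path": "documentVector",
                "queryVector": query_vector,
                "numCandidates": num_candidates,
                "limit": limit,
            }
        }
    ]


def search(collection, model, query):
    """Encode the query text and return the matching documents as a list."""
    # Encoding a single string yields a 1-D array; tolist() makes it a
    # plain list of floats, which is what $vectorSearch expects.
    query_vector = model.encode(query).tolist()
    return list(collection.aggregate(build_pipeline(query_vector)))
```

Because each stored document is a sentence, several of the top matches may come from the same PDF; a caller that wants distinct file names would need to deduplicate the results.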