MongoDB PDF Vector Search

Introduction

First, the text of each PDF is extracted, split into sentences, and mapped into a 384-dimensional dense vector space. The sentences and their vectors are then stored in MongoDB Atlas. An Atlas Vector Search index makes the collection queryable, returning the PDFs most relevant to a query.
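The steps above can be sketched roughly as follows. This is a minimal illustration, not the code in this repository: the regex-based sentence splitter and the `pdf` and `sentence` field names are assumptions, while `documentVector` and the model name come from this README.

```python
import re
from typing import Any, Dict, List, Sequence


def split_sentences(text: str) -> List[str]:
    """Naively split text on '.', '!' or '?' followed by whitespace."""
    # Collapse the line breaks and repeated spaces that PDF extraction leaves behind.
    normalized = re.sub(r"\s+", " ", text).strip()
    if not normalized:
        return []
    return [s for s in re.split(r"(?<=[.!?])\s+", normalized) if s]


def encode_sentences(sentences: Sequence[str]) -> List[List[float]]:
    """Map each sentence to a 384-dimensional vector with all-MiniLM-L6-v2."""
    # Imported lazily so the splitter can be used without the model installed.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    return model.encode(list(sentences)).tolist()


def build_documents(
    pdf_name: str,
    sentences: Sequence[str],
    vectors: Sequence[Sequence[float]],
) -> List[Dict[str, Any]]:
    """Pair every sentence with its vector in one MongoDB document each."""
    # "pdf" and "sentence" are hypothetical field names; "documentVector"
    # matches the path used by the search index in this README.
    return [
        {"pdf": pdf_name, "sentence": s, "documentVector": list(v)}
        for s, v in zip(sentences, vectors)
    ]
```

Storing one document per sentence keeps each vector tied to a short span of text, at the cost of more documents per PDF.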

Setup

The text extractor reads the PDFs from a local directory.

Atlas

Open params.py and configure your connection to Atlas, along with the name of the database and collection you'd like to store your text.
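A params.py along these lines would cover the three settings mentioned above. The variable names, the `MONGODB_URI` environment variable, and the database and collection names are all hypothetical; check the actual params.py in the repository for the names the scripts expect.

```python
import os

# Hypothetical names throughout; the real params.py may differ.
# Reading the connection string from the environment keeps credentials out of
# version control. The fallback is a placeholder, not a working URI.
mongodb_conn_string = os.environ.get(
    "MONGODB_URI",
    "mongodb+srv://<user>:<password>@<cluster-host>/",
)

# Database and collection that will hold the sentences and their vectors.
database = "pdf_search"
collection = "sentences"
```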

Extract and Encode the PDFs

Install the requirements. This implementation uses:

PyPDF2 - Python library for text extraction
sentence-transformers/all-MiniLM-L6-v2 - Hugging Face pretrained model for the dense vector mapping
pymongo - the Python driver for MongoDB

pip install -r requirements.txt

Run upload_files.py:

python3 upload_files.py
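For orientation, an upload script of this kind might read and store the PDFs roughly as shown here. This is a sketch under assumptions, not the contents of upload_files.py: the function names are hypothetical, and the sentence splitting and encoding steps are left out to keep the focus on reading PDFs and writing to Atlas.

```python
from pathlib import Path
from typing import Any, Dict, Iterable, List


def list_pdfs(directory: str) -> List[Path]:
    """Return the PDF files in a local directory, sorted by path."""
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def extract_text(pdf_path: Path) -> str:
    """Concatenate the text of every page of one PDF using PyPDF2."""
    # Imported lazily: only needed once PDFs are actually read.
    from PyPDF2 import PdfReader

    reader = PdfReader(str(pdf_path))
    # Pages without extractable text (for example scanned images) contribute
    # an empty string.
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def upload(
    documents: Iterable[Dict[str, Any]],
    uri: str,
    database: str,
    collection: str,
) -> int:
    """Insert prepared documents into Atlas and return how many were written."""
    docs = list(documents)
    if not docs:
        return 0  # insert_many rejects an empty list, so skip the round trip
    from pymongo import MongoClient

    with MongoClient(uri) as client:
        result = client[database][collection].insert_many(docs)
    return len(result.inserted_ids)
```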

Create Search Index

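Create an Atlas Vector Search index on the collection. The query in The Search Query section refers to the index as vector_index, so give it that name: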

{
  "fields": [
    {
      "numDimensions": 384,
      "path": "documentVector",
      "similarity": "cosine",
      "type": "vector"
    }
  ]
}
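The same index can also be created from Python instead of the Atlas UI. The sketch below assumes a recent pymongo (the `type` argument of `SearchIndexModel` requires 4.7 or later) and an existing `collection` handle; the helper names are hypothetical.

```python
from typing import Any, Dict


def vector_index_definition(
    path: str = "documentVector",
    dimensions: int = 384,
) -> Dict[str, Any]:
    """Build the index definition shown above as a Python dict."""
    return {
        "fields": [
            {
                "type": "vector",
                "path": path,
                "numDimensions": dimensions,
                "similarity": "cosine",
            }
        ]
    }


def create_vector_index(collection, name: str = "vector_index") -> str:
    """Create the vector search index and return its name."""
    # Imported lazily so the definition helper works without pymongo installed.
    from pymongo.operations import SearchIndexModel

    model = SearchIndexModel(
        definition=vector_index_definition(),
        name=name,
        type="vectorSearch",
    )
    return collection.create_search_index(model)
```

Index builds on Atlas are asynchronous, so queries may return nothing until the new index becomes queryable.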

Demo

Your query is encoded with the same sentence transformer that was used to encode the data and then submitted to Atlas Vector Search, which returns the top 3 matches.

For example:

The following PDFs may contain the answers you seek:
----------------------------------------------------
PDF:      Resume1.pdf

PDF:      Resume2.pdf

PDF:      Resume3.pdf
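To make the ranking concrete, here is a brute-force version of what a cosine-similarity search computes. It is only an illustration: Atlas Vector Search uses an approximate nearest-neighbor index rather than scanning every document, and the `pdf` field name is an assumption.

```python
import math
from typing import Any, Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_matches(
    query_vector: Sequence[float],
    documents: Sequence[Dict[str, Any]],
    limit: int = 3,
) -> List[Tuple[str, float]]:
    """Score every document against the query and keep the best `limit`."""
    scored = [
        (doc["pdf"], cosine_similarity(query_vector, doc["documentVector"]))
        for doc in documents
    ]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]
```

With real data the vectors would have 384 components, but the arithmetic is the same for any dimension.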

The Search Query

This is the simple query passed to MongoDB:

[
  {
    "$vectorSearch": {
      "index": "vector_index",
      "path": "documentVector",
      "queryVector": query_vector,
      "numCandidates": 150,
      "limit": 3
    }
  }
]
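Putting the query together in Python might look like the sketch below. The `$vectorSearch` stage mirrors the one above; the `$project` stage, the `pdf` field, and the function names are assumptions added for illustration, and `collection` is expected to be a pymongo collection.

```python
from typing import Any, Dict, List, Sequence


def build_pipeline(
    query_vector: Sequence[float],
    limit: int = 3,
    num_candidates: int = 150,
) -> List[Dict[str, Any]]:
    """Build the aggregation pipeline for one vector search query."""
    return [
        {
            "$vectorSearch": {
                "index": "vector_index",
                "path": "documentVector",
                "queryVector": list(query_vector),
                "numCandidates": num_candidates,
                "limit": limit,
            }
        },
        # Assumed extra stage: return only the PDF name and the match score.
        {
            "$project": {
                "_id": 0,
                "pdf": 1,
                "score": {"$meta": "vectorSearchScore"},
            }
        },
    ]


def find_pdfs(collection, query: str) -> List[Dict[str, Any]]:
    """Encode the query text and run the vector search against Atlas."""
    # Imported lazily so build_pipeline works without the model installed.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    query_vector = model.encode(query).tolist()
    return list(collection.aggregate(build_pipeline(query_vector)))
```

numCandidates controls how many nearest neighbors are considered before the top `limit` are returned; larger values generally trade latency for recall.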
