Information-Retrieval

Abstract:

In this project, a parser and an inverter were built to parse HTML pages and create an inverted index. Four scoring functions (Okapi-TF, Okapi-TFIDF, Okapi-BM25 and a Language Model with Jelinek-Mercer Smoothing) were also implemented for document retrieval.

How to run:

Run the files in the following order:

parser.py

python parser.py <folder containing HTML files>
uses stoplist.txt and the HTML files in the folder given on the command line
creates docids.txt, termids.txt, doc_index.txt
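
The parsing step can be illustrated with a minimal sketch. It is not the code in parser.py: the class `TextExtractor`, the function `tokenize`, and the tokenization rule (lowercase runs of letters and digits) are assumptions made for illustration, and the real parser may tokenize or stem differently.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def tokenize(html, stopwords):
    """Return the lowercase tokens of an HTML page, with stopwords removed."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    text = " ".join(extractor.chunks).lower()
    # Assumed rule: a token is a run of ASCII letters or digits.
    return [tok for tok in re.findall(r"[a-z0-9]+", text) if tok not in stopwords]


page = "<html><body><p>The quick brown fox</p><script>var x = 1;</script></body></html>"
tokens = tokenize(page, stopwords={"the"})
```

In a full run, the stopword set would be loaded from stoplist.txt and each distinct token and document would be assigned a numeric id.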

inverter.py

python inverter.py
uses docids.txt, termids.txt, doc_index.txt
creates term_info.txt, term_index.txt
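
As a rough sketch of what inversion means here, the following turns a forward index (document to terms) into an inverted index (term to documents and positions). The in-memory layout and the names `invert` and `term_stats` are hypothetical; the on-disk formats of doc_index.txt, term_index.txt and term_info.txt are not described in this README and may differ.

```python
from collections import defaultdict


def invert(forward_index):
    """Build {term_id: {doc_id: [positions]}} from {doc_id: [term_id, ...]}.

    Each document's term ids are assumed to be listed in document order,
    so positions are simply 1-based offsets into that list.
    """
    postings = defaultdict(lambda: defaultdict(list))
    for doc_id, term_ids in forward_index.items():
        for position, term_id in enumerate(term_ids, start=1):
            postings[term_id][doc_id].append(position)
    return {term_id: dict(docs) for term_id, docs in postings.items()}


def term_stats(postings):
    """Return {term_id: (total occurrences, number of documents)}."""
    return {
        term_id: (sum(len(pos) for pos in docs.values()), len(docs))
        for term_id, docs in postings.items()
    }


forward = {1: [10, 11, 10], 2: [11]}
index = invert(forward)
stats = term_stats(index)
```

Per-term statistics such as these (collection frequency and document frequency) are what the scoring functions need at query time.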

docLengthCalculator.py

python docLengthCalculator.py
uses doc_index.txt
creates doc_lengths.txt
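
Document lengths matter because BM25 and the language model both normalize term counts by length. A minimal sketch, assuming a forward index held in memory as a dictionary (the function name `doc_lengths` and this layout are illustrative, not taken from docLengthCalculator.py):

```python
def doc_lengths(forward_index):
    """Return ({doc_id: token count}, average token count) for the collection."""
    lengths = {doc_id: len(term_ids) for doc_id, term_ids in forward_index.items()}
    average = sum(lengths.values()) / len(lengths) if lengths else 0.0
    return lengths, average


lengths, average = doc_lengths({1: [10, 11, 10], 2: [11]})
```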

query.py

python query.py --score <score function> --query <search query>
available score functions: TF, TF-IDF, BM25, JM
uses docids.txt, termids.txt, stoplist.txt, term_index.txt, doc_lengths.txt
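
To give a feel for two of the score functions, here is a hedged sketch of BM25 and Jelinek-Mercer scoring for a single document. It is not the code in query.py. The parameter values (`k1=1.2`, `b=0.75`, `lam=0.2`) are common defaults chosen for illustration, the IDF uses the non-negative `log(1 + ...)` variant, and `lam` here weights the document model (some texts use the opposite convention).

```python
import math


def bm25_score(query_terms, doc_tf, doc_len, avg_len, df, n_docs, k1=1.2, b=0.75):
    """Okapi BM25 score of one document for a list of query terms."""
    score = 0.0
    for term in query_terms:
        tf = doc_tf.get(term, 0)
        if tf == 0 or term not in df:
            continue  # terms absent from the document contribute nothing
        idf = math.log(1.0 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
        length_norm = k1 * (1.0 - b + b * doc_len / avg_len)
        score += idf * tf * (k1 + 1.0) / (tf + length_norm)
    return score


def jm_score(query_terms, doc_tf, doc_len, cf, collection_len, lam=0.2):
    """Query log-likelihood under a Jelinek-Mercer smoothed language model."""
    score = 0.0
    for term in query_terms:
        p_doc = doc_tf.get(term, 0) / doc_len if doc_len else 0.0
        p_coll = cf.get(term, 0) / collection_len
        p = lam * p_doc + (1.0 - lam) * p_coll
        if p > 0.0:  # terms unseen in the whole collection are ignored
            score += math.log(p)
    return score


# Toy collection: document A is "fox fox cat", document B is "dog".
query = ["fox"]
df = {"fox": 1, "cat": 1, "dog": 1}   # document frequencies
cf = {"fox": 2, "cat": 1, "dog": 1}   # collection frequencies
bm25_a = bm25_score(query, {"fox": 2, "cat": 1}, 3, 2.0, df, 2)
bm25_b = bm25_score(query, {"dog": 1}, 1, 2.0, df, 2)
jm_a = jm_score(query, {"fox": 2, "cat": 1}, 3, cf, 4)
jm_b = jm_score(query, {"dog": 1}, 1, cf, 4)
```

In both cases document A, which contains the query term, scores higher than document B. Smoothing is why B still receives a finite language-model score even though it never mentions the term.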

Contact

You can get in touch with me on my LinkedIn Profile: Farhan Shoukat

