The goal of this project is to build a semantic search engine that finds relevant documents from a list of ArXiv PDF links based on a given query. The pipeline combines modern NLP techniques with efficient similarity search algorithms. Through this project, we addressed three challenges: extracting text from PDFs, embedding text for semantic similarity, and searching those embeddings efficiently.
Library Installation:
Installed transformers, Annoy, PyMuPDF, requests, and sentence-transformers for tasks such as text extraction, text embedding, and similarity search.
Text Extraction:
Extracted text from the PDFs using the PyMuPDF library.
Text Embedding:
Initially used the transformers library to generate text embeddings. To improve the semantic relevance of the embeddings, we later switched to the Sentence Transformers library, which is tailored to producing semantically meaningful sentence embeddings.
Vector Database Creation:
Created a vector database with the Annoy library to store the document embeddings. This database enables efficient similarity search for the documents most relevant to a given query.
Embedding Population:
Index Building:
Semantic Search Function:
Query Execution:
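
Taken together, the steps above can be sketched end to end as follows. This is a minimal illustration rather than the project's actual code: the function names, the all-MiniLM-L6-v2 model, the angular metric, and the tree count are all assumptions chosen for the example. Each third-party library is imported inside the function that needs it, so the steps can be tried one at a time.

```python
from typing import Callable, List, Sequence, Tuple


def download_pdf(url: str) -> bytes:
    """Fetch a PDF over HTTP and return its raw bytes."""
    import requests  # imported lazily so each step only needs its own library

    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.content


def extract_text(pdf_bytes: bytes) -> str:
    """Extract plain text from an in-memory PDF with PyMuPDF."""
    import fitz  # PyMuPDF is imported under the name "fitz"

    with fitz.open(stream=pdf_bytes, filetype="pdf") as doc:
        return "\n".join(page.get_text() for page in doc)


def make_embedder(model_name: str = "all-MiniLM-L6-v2") -> Callable[[str], List[float]]:
    """Return a function mapping text to a sentence embedding.

    The model name is an assumption; any Sentence Transformers model works.
    """
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    return lambda text: model.encode(text).tolist()


def build_index(vectors: Sequence[Sequence[float]], n_trees: int = 10):
    """Populate an Annoy index with one vector per document, then build it."""
    from annoy import AnnoyIndex

    # "angular" distance ranks neighbours the same way cosine similarity does.
    index = AnnoyIndex(len(vectors[0]), "angular")
    for item_id, vector in enumerate(vectors):
        index.add_item(item_id, vector)
    index.build(n_trees)  # no items can be added after this call
    return index


def semantic_search(
    query: str,
    embed: Callable[[str], Sequence[float]],
    index,
    urls: Sequence[str],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Embed the query and return (url, distance) pairs, closest first."""
    ids, distances = index.get_nns_by_vector(
        embed(query), top_k, include_distances=True
    )
    return [(urls[i], d) for i, d in zip(ids, distances)]


# Example usage (requires network access and the libraries listed above):
#   urls = [...]  # your list of ArXiv PDF links
#   embed = make_embedder()
#   index = build_index([embed(extract_text(download_pdf(u))) for u in urls])
#   print(semantic_search("your query here", embed, index, urls))
```

Passing the embedding function into semantic_search keeps the query and the documents in the same vector space, and makes it easy to swap the model without touching the search logic. Note that Sentence Transformers models truncate input beyond their maximum sequence length, so embedding a whole paper as one string only captures its opening portion; splitting the text into chunks is a common refinement.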