ArXivExplorer: Semantic Search with Vector Database

Goal

The goal is to build a semantic search engine that finds relevant documents in a list of ArXiv PDF links for a given query. The project packages this as a pipeline that combines modern NLP techniques with efficient similarity search. Along the way it addresses three challenges: extracting text from PDFs, embedding text for semantic similarity, and searching the embeddings efficiently.

Description

Library Installation:
- Installed the required libraries: requests and PyMuPDF for downloading PDFs and extracting their text, transformers and sentence-transformers for text embedding, and Annoy for similarity search.
Text Extraction:
- Defined a function to download PDFs from given ArXiv links and extract text using the PyMuPDF library.
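
The extraction step could look roughly like the sketch below. The function names and the timeout value are illustrative assumptions, not taken from the project; the PyMuPDF and requests calls are the standard ones for opening a PDF from memory and fetching a URL.

```python
def extract_text_from_pdf(pdf_bytes: bytes) -> str:
    """Return the plain text of every page of a PDF held in memory."""
    # PyMuPDF is imported as `fitz`; the import is local so the module
    # still loads on machines where PyMuPDF is not installed.
    import fitz

    with fitz.open(stream=pdf_bytes, filetype="pdf") as doc:
        return "\n".join(page.get_text() for page in doc)


def extract_text_from_url(url: str, timeout: float = 30.0) -> str:
    """Download the PDF at `url` (assumed to be a direct PDF link) and return its text."""
    import requests

    response = requests.get(url, timeout=timeout)
    # Fail loudly on HTTP errors instead of trying to parse an error page as a PDF.
    response.raise_for_status()
    return extract_text_from_pdf(response.content)
```

Splitting the download from the parsing keeps the parsing function usable on PDFs that are already on disk or in memory.
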
Text Embedding:
- Initially used the GPT-2 model from the transformers library to generate text embeddings. To improve the semantic relevance of the embeddings, later switched to the Sentence Transformers library, which is designed to produce semantically meaningful sentence embeddings.
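
A minimal sketch of the Sentence Transformers step follows. The checkpoint name `all-MiniLM-L6-v2` is an assumption (the project does not say which model it used), and `load_model` and `embed_text` are hypothetical helper names.

```python
def load_model(name: str = "all-MiniLM-L6-v2"):
    """Load a Sentence Transformers model; the default checkpoint is an assumption."""
    # Local import: the library is heavy and pulls in PyTorch.
    from sentence_transformers import SentenceTransformer

    return SentenceTransformer(name)


def embed_text(model, text: str) -> list:
    """Encode one text as a single fixed-length vector of Python floats."""
    # `model` only needs an `encode(text)` method returning a 1-D sequence,
    # which is what SentenceTransformer.encode returns for a single string.
    return [float(x) for x in model.encode(text)]
```

One caveat worth keeping in mind: these models truncate input beyond their maximum sequence length, so encoding the full text of a paper in one call likely embeds only its opening portion.
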
Vector Database Creation:
- Created a vector database using the Annoy library to store the embeddings of the documents. This database facilitates efficient similarity search to find the most relevant documents for a given query.
Embedding Population:
- Iterated through the list of document links, extracted text from each document, generated embeddings using the Sentence Transformers model, and populated the vector database with these embeddings.
Index Building:
- Built the Annoy index to enable efficient similarity search.
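
The three steps above (create the index, populate it, build it) might be combined as in this sketch. `build_document_index` is a hypothetical name, `extract_text` and `embed` stand in for whatever extraction and embedding functions are in use, and the `"angular"` metric and `n_trees=10` are assumed defaults rather than the project's settings.

```python
def build_document_index(urls, extract_text, embed, n_trees: int = 10):
    """Embed each document and return (annoy_index, indexed_urls).

    `extract_text(url)` must return a string and `embed(text)` a
    fixed-length sequence of floats. Annoy item ids are positions
    in `indexed_urls`.
    """
    from annoy import AnnoyIndex  # local import: compiled extension

    index = None
    indexed_urls = []
    for url in urls:
        vector = embed(extract_text(url))
        if index is None:
            # The vector length is only known once the first document is embedded.
            index = AnnoyIndex(len(vector), "angular")
        index.add_item(len(indexed_urls), vector)
        indexed_urls.append(url)
    if index is None:
        raise ValueError("no documents to index")
    index.build(n_trees)  # after build(), Annoy accepts no further items
    return index, indexed_urls
```

Annoy stores only integer ids and vectors, so the URL list is returned alongside the index to map search results back to documents.
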
Semantic Search Function:
- Defined a search function to take a query, generate an embedding for the query, perform a similarity search on the vector database to find the nearest document embeddings, and return the corresponding document URLs.
Query Execution:
- Demonstrated how to use the search function to execute a query and retrieve a list of relevant document URLs.
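
The search and query steps can be sketched as follows, assuming an embedding callable, an already built Annoy index, and a URL list kept in the same order as the index ids. The name `search` and its parameters are illustrative; `get_nns_by_vector` is Annoy's nearest-neighbor lookup.

```python
def search(query: str, embed, index, urls, k: int = 5) -> list:
    """Return the URLs of the `k` documents whose embeddings are nearest to `query`.

    `embed(query)` must produce a vector of the same dimension as the
    indexed documents, and `urls[i]` must be the document stored under
    Annoy item id `i`.
    """
    query_vector = embed(query)
    # Annoy returns item ids ordered from nearest to farthest.
    nearest_ids = index.get_nns_by_vector(query_vector, k)
    return [urls[i] for i in nearest_ids]
```

A query would then be a single call such as `search("graph neural networks", embed, index, urls, k=3)`, where the query string and the three arguments after it are placeholders for the caller's own objects.
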

Demo

Google Colaboratory
