Goal

The goal is to build a semantic search engine that finds relevant documents from a list of arXiv PDF links for a given query. The project implements the full pipeline: extracting text from the PDFs, embedding that text so that semantic similarity can be measured, and running an efficient similarity search over the embeddings to return the documents closest to the query.

Description

  1. Library Installation:

  2. Text Extraction:

  3. Text Embedding:

  4. Vector Database Creation:

  5. Embedding Population:

  6. Index Building:

  7. Semantic Search Function:

  8. Query Execution:
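
The steps above do not name specific libraries, so the following is only a minimal sketch of steps 3 to 8 using the Python standard library. Everything in it is an assumption made for illustration: `embed` is a toy hashed bag-of-words function standing in for a real embedding model, `VectorIndex` is a brute-force cosine-similarity store standing in for a vector database, and the `documents` dictionary holds placeholder strings standing in for text extracted from the PDFs.

```python
import hashlib
import math
import re

DIM = 256  # hypothetical embedding size chosen for this toy example


def embed(text, dim=DIM):
    """Toy stand-in for an embedding model: hashed bag of words, L2-normalised."""
    vec = [0.0] * dim
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        # md5 gives a hash that is stable across runs (unlike the built-in hash())
        bucket = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


class VectorIndex:
    """Brute-force cosine-similarity index (hypothetical stand-in for a vector database)."""

    def __init__(self):
        self.ids = []
        self.vectors = []

    def add(self, doc_id, vector):
        # Steps 4-6: store each document vector alongside its identifier
        self.ids.append(doc_id)
        self.vectors.append(vector)

    def search(self, query_vector, k=3):
        # Vectors are unit length, so the dot product equals cosine similarity
        scores = [
            (sum(q * d for q, d in zip(query_vector, vec)), doc_id)
            for doc_id, vec in zip(self.ids, self.vectors)
        ]
        scores.sort(reverse=True)
        return [(doc_id, score) for score, doc_id in scores[:k]]


def semantic_search(index, query, k=3):
    """Step 7: embed the query and return the k closest documents."""
    return index.search(embed(query), k)


# Placeholder strings standing in for text extracted from the PDFs (step 2)
documents = {
    "doc_a": "attention mechanisms in transformer language models",
    "doc_b": "convolutional neural networks for image classification",
    "doc_c": "graph algorithms for shortest path routing",
}

index = VectorIndex()
for doc_id, text in documents.items():
    index.add(doc_id, embed(text))

# Step 8: run a query against the populated index
results = semantic_search(index, "transformer language models", k=2)
for doc_id, score in results:
    print(doc_id, round(score, 3))
```

A real pipeline would replace `embed` with a trained sentence-embedding model and `VectorIndex` with an approximate nearest-neighbour index, since a linear scan over every document does not scale to large collections. The `semantic_search` interface could stay the same.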

Demo

Google Colaboratory