Building a Powerful API for Data Retrieval and Reranking

Over the past few weeks, I’ve been working on an exciting project that tackles one of the most common yet challenging problems in data science: retrieving relevant information from large datasets. The project evolved into an API for Data Retrieval and Reranking, and it’s built using Python Flask. I wanted to take a moment to walk you through the process, share what I’ve learned, and explain how this API works.

You can check out the full project on my GitHub here.

The Problem: Searching for Meaning in Large Datasets

We live in an age where data is everywhere, and the ability to extract meaningful information from it has never been more critical. Whether it’s research papers, legal documents, or business reports, the challenge is the same — finding the exact information you need without having to wade through irrelevant data.

I initially came across Sentence Transformers’ retrieve & rerank example (which you can find here) and found it really inspiring. However, I knew I could make some improvements and tailor it to a more efficient, practical use case. So, I set out to build my own upgraded version of this retrieval and reranking system, focusing on handling PDF documents and performing semantic search.

The Solution: A Two-Stage Retrieval System

Stage 1: Retrieval

At its core, my system starts by storing encoded representations of documents in a vector database. For now, I focused on PDF documents, but this could easily be extended to other formats. The API ingests these documents, encodes their content, and stores the vectors in a dense retrieval system using a bi-encoder.
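To make the ingestion step concrete, here is a minimal sketch of how encoding and storage could look, assuming the all-MiniLM-L6-v2 bi-encoder and ChromaDB’s default in-memory client (the names here are illustrative, not necessarily what the repository uses):

import chromadb
from sentence_transformers import SentenceTransformer

bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
client = chromadb.Client()

def ingest(passages, collection_name):
    # Encode each text passage into a dense vector and store it,
    # alongside the raw text, in a ChromaDB collection.
    collection = client.get_or_create_collection(collection_name)
    embeddings = bi_encoder.encode(passages).tolist()
    collection.add(
        documents=passages,
        embeddings=embeddings,
        ids=[f"doc-{i}" for i in range(len(passages))],
    )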

When a user submits a search query, the system performs an initial retrieval of potentially relevant documents. It retrieves a list of candidates (usually around 15) that it considers the best matches for the query. However, not every retrieved document will be equally relevant: some are good matches for certain keywords but irrelevant when it comes to the actual content.
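Reusing the client and bi-encoder from the sketch above, the first-stage lookup might look like this (the candidate count of 15 mirrors the figure mentioned above):

def retrieve_candidates(query, collection_name, top_k=15):
    # Embed the query with the same bi-encoder and ask ChromaDB
    # for the top_k nearest stored passages.
    collection = client.get_collection(collection_name)
    query_embedding = bi_encoder.encode(query).tolist()
    results = collection.query(query_embeddings=[query_embedding], n_results=top_k)
    return results["documents"][0]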

Stage 2: Reranking

That’s where the second stage comes in. Using a cross-encoder, the system reranks the retrieved documents based on their actual relevance to the search query. This cross-encoder evaluates how well the document truly matches the query, rather than relying purely on the initial dense vector search.
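As a sketch, the reranking step could be implemented with a standard cross-encoder checkpoint; ms-marco-MiniLM-L-6-v2 is an assumption here, not necessarily the model the project ships with:

from sentence_transformers import CrossEncoder

cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, candidates, top_n=3):
    # Score every (query, passage) pair jointly, then keep the
    # highest-scoring passages.
    scores = cross_encoder.predict([(query, passage) for passage in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [passage for passage, _ in ranked[:top_n]]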

In short, this two-stage system helps ensure that the top results presented to the user are the most relevant ones, not just the ones that matched the keywords but those that actually address the user’s needs.

The Architecture: Python Flask API

To make this functionality accessible to developers, I decided to expose it as a RESTful API using Python Flask. Here’s how the architecture works:

  1. Document Ingestion: The user uploads a PDF document to the API, and the document is encoded and stored in a vector database (I used ChromaDB for this). Each uploaded PDF is stored as its own collection, and each collection is essentially a group of encoded text chunks (what ChromaDB calls documents).

  2. Search Query: When the user sends a query to the API, it fetches the relevant documents from the specified collection using dense retrieval. The system retrieves a list of candidates that are potentially relevant.

  3. Reranking: The cross-encoder then reranks the documents based on their relevance, and the API returns the top 3 most relevant results to the user.
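Wiring those three steps into Flask routes could look roughly like this; the helper functions are the ones sketched in the previous sections (extract_pdf_text is sketched further below), and the route shapes mirror the endpoints listed next:

from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/create/<file_path>/<collection_name>", methods=["POST"])
def create(file_path, collection_name):
    passages = extract_pdf_text(file_path)  # PyPDF2 helper, sketched later
    ingest(passages, collection_name)       # encode and store in ChromaDB
    return jsonify({"status": "created", "collection": collection_name})

@app.route("/query/<user_query>/<collection_name>", methods=["GET"])
def query(user_query, collection_name):
    candidates = retrieve_candidates(user_query, collection_name)
    return jsonify({"results": rerank(user_query, candidates)})

@app.route("/collections/", methods=["GET"])
def collections():
    # Depending on the ChromaDB version, list_collections returns
    # Collection objects or plain names; normalize to names.
    names = [getattr(c, "name", c) for c in client.list_collections()]
    return jsonify({"collections": names})

if __name__ == "__main__":
    app.run(debug=True)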

Endpoints

  • Create Collection: Allows you to upload a PDF, encode it, and store it in a collection.

    Endpoint: POST /create/<file_path>/<collection_name>
    Example: POST /create/sample.pdf/my_collection

  • Query Collection: Takes a user query and a collection name and returns the top 3 most relevant results.

    Endpoint: GET /query/<user_query>/<collection_name>
    Example: GET /query/what is artificial intelligence/my_collection

  • Get Collection Names: Returns a list of all collections currently in the database.

    Endpoint: GET /collections/

Building the System

Choosing the Right Tools

For this project, I leaned on a few key libraries that made the development process smoother:

  • Flask: For building the API.

  • PyPDF2: To handle PDF extraction.

  • SentenceTransformers: To generate the dense vector representations of the documents (bi-encoder) and to score query-document pairs for reranking (cross-encoder).

  • rank-bm25: For initial keyword-based (lexical) retrieval.

  • ChromaDB: A vector database to store the document embeddings.

  • sklearn, tqdm: Various utilities to handle the backend processes.

After defining the architecture, I started implementing the code. Flask made it easy to create the API routes and handle incoming requests. I wrote a function to extract text from a PDF using PyPDF2, then used SentenceTransformer to encode the document’s text into vector form. After encoding, the vectors were stored in ChromaDB, which allows for efficient querying.
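For illustration, the extraction step might look like the following; splitting the PDF into one passage per page is an assumption about how the text gets chunked:

from PyPDF2 import PdfReader

def extract_pdf_text(file_path):
    # Return one text passage per page; extract_text() can return
    # None for image-only pages, so fall back to an empty string.
    reader = PdfReader(file_path)
    return [page.extract_text() or "" for page in reader.pages]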

The retrieval and reranking stages were handled using a combination of the rank-bm25 algorithm for initial retrieval and cross-encoder models for reranking. I aimed to keep the system both scalable and efficient, ensuring that it could handle large datasets without significant slowdowns.
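The BM25 side of candidate retrieval is cheap to sketch with the rank-bm25 package; the whitespace tokenizer here is a simplifying assumption:

from rank_bm25 import BM25Okapi

def bm25_candidates(query, passages, top_k=15):
    # Score passages against the query with BM25 and return the
    # top_k lexical matches for the reranker to refine.
    tokenized = [passage.lower().split() for passage in passages]
    bm25 = BM25Okapi(tokenized)
    return bm25.get_top_n(query.lower().split(), passages, n=top_k)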

Why a Two-Stage Approach?

One of the major lessons I learned during this project was the importance of combining different models to achieve the best results. While bi-encoders are fast and effective for dense retrieval, they often fall short when it comes to precise ranking of results. The cross-encoder, on the other hand, significantly improves the accuracy of the top results, but it’s slower and not practical for the initial retrieval of large datasets.

By combining these two approaches, I was able to build a system that finds the balance between speed and accuracy.
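Put together, the two stages compose into one short search function, reusing the helpers sketched earlier (the candidate step could equally be the BM25 variant):

def search(query, collection_name):
    # Stage 1: fast, broad candidate retrieval with the bi-encoder.
    candidates = retrieve_candidates(query, collection_name)
    # Stage 2: slower but precise reranking of the small shortlist.
    return rerank(query, candidates)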

Example Workflow

Let me walk you through a typical workflow with this API:

  1. Upload a Document: The user uploads a PDF to the API and creates a collection for it.

     Example: POST /create/Anime.pdf/Anime

  2. Query the Collection: After the document has been stored, the user can submit a query and retrieve the top 3 most relevant documents.

     Example: GET /query/when was the first anime created/Anime

  3. Explore Available Collections: You can also get a list of all collections currently in the vector database.

     Example: GET /collections/
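From a client’s perspective, assuming the server is running locally on Flask’s default port 5000, that whole workflow boils down to three requests (the requests library percent-encodes the spaces in the query path):

import requests

BASE = "http://localhost:5000"  # assumed local Flask default

requests.post(f"{BASE}/create/Anime.pdf/Anime")

response = requests.get(f"{BASE}/query/when was the first anime created/Anime")
print(response.json())  # the top 3 reranked results

print(requests.get(f"{BASE}/collections/").json())  # all collection names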

Conclusion

Working on this project has been a challenging but incredibly rewarding experience. I’ve learned so much about semantic search, vector databases, and the importance of reranking in information retrieval. This project is a foundation that can be extended and improved in many ways, from supporting more file formats to incorporating more advanced models for better performance.

If you’re looking to build something similar or have an interest in data retrieval or semantic search, I hope this post gives you some insight into the process. You can check out the full project on my GitHub here. Feel free to fork the project, play around with it, and contribute your own improvements.

Happy coding!