Modules

Server_Architecture

Indexing Module and Retrieval Module

Indexing Module

This module integrates external resources and prepares the collected data for loading into a vector database. It consists of connectors and an indexer. The connectors are abstract Python classes designed to gather data from external sources and format it into a structured format using LangChain Documents. This structured data is then used by the indexer, which creates embeddings from the documents and loads them into the vector database. Each connector is specific to a particular type of data resource (e.g., a geojson/OSM connector for reading OSM-type geojsons). Therefore, there is a unique connector for each resource type. In contrast, the indexer is generic and can handle the outputs from all connectors, meaning only one indexer is needed for all connectors. For more detail about the connectors see the documentation here.

Indexer Class

The Indexer class is responsible for creating and managing embeddings of documents for efficient retrieval and loading them into a vector database. It utilizes the LangChain library for embedding and indexing operations. The Indexer class is used by both the Indexing Module to load documents into the vector database and the Retrieval Module to retrieve documents based on queries.

Key Features

  • Embeddings: Supports both GPT-4 All and HuggingFace models for generating embeddings from documents. The default model is nomic-ai/nomic-embed-text-v1.5-GGUF by Nomic AI. Any other embedding model available via HuggingFace can also be used. To use a HuggingFace model, set the use_hf_model flag to true and specify the model name with the embedding_model parameter (e.g., embedding_model='sentence-transformers/all-MiniLM-L6-v2').

  • Vector Storage: Uses ChromaDB for storing and managing vectors persistently.

  • Record Management: Integrates with an SQLRecordManager for efficient indexing and retrieval operations, maintaining a schema in a local SQLite database.

  • Document Retrieval: Provides a flexible retriever for similarity searches with configurable parameters for the number of results (k) and score threshold.

Usage

The Indexer class offers several methods to interact with the indexed data:

  • _index(documents: List[Document]): Indexes a list of documents.

  • _clear(): Clears the current index.

  • _get_doc_by_id(_id: str): Retrieves a document by its ID.

  • _delete_doc_from_index(_id: str): Deletes a document from the index and updates the record manager accordingly.

Example Initialization

from your_module import Indexer  # Replace with your actual module path

indexer = Indexer(
    index_name="my_index",
    persist_directory="./chroma_db",
    embedding_model="nomic-embed-text-v1.5.f16.gguf",
    use_hf_model=False,
    k=3,
    score_treshold=0.6
)

:information_source: The score_threshold argument sets the minimum relevance score that documents must have to be retrieved for a query. Only documents with a relevance score above this threshold will be returned.

Retrieval Module

The Retrieval Module uses the Indexer class to retrieve documents from the vector database based on user queries. It enables contextual retrieval through dense retrieval techniques, ensuring relevant documents are fetched efficiently.

Endpoints

Each index created by the Indexing Module can be accessed through specific retrieval endpoints. These endpoints are dynamically created based on the available indexes. This setup ensures that each index can be queried individually, providing a flexible and scalable retrieval mechanism.

Example Usage

Below is an example of how to set up the retrieval endpoints:

# Create a dictionary of indexes
indexes = {
    "pygeoapi": Indexer(index_name="pygeoapi"),
}

# Add retrieval routes for each index
for index_name, index_instance in indexes.items():
    add_routes(app, index_instance.retriever, path=f"/retrieve_{index_name}")

These endpoints can be accessed with POST requests, allowing you to retrieve documents for specific queries. For example, to retrieve documents from the pygeoapi index, you can use the following endpoint:

curl -X POST \
  http://localhost:8000/retrieve_pygeoapi/invoke \
  -H 'Content-Type: application/json' \
  -d '{
    "input": "Precipitation"
}'

Search Criteria and Answer Generation Modules

Search Criteria Module

The Search Criteria Module is designed to handle user inputs in a chat-bot manner, assisting users in finding geospatial data. It extracts search criteria from user inputs and can generate follow-up questions if the provided information is insufficient. The module uses a state graph to manage the flow of user interactions and determine the necessary actions to refine the search criteria.

Answer Generation Module

The Answer Generation Module uses the refined search criteria and retrieved documents to generate contextual answers for user queries. It leverages a combination of retrieval and language generation techniques to provide accurate and useful responses.

State Graph

The SpatialRetrieverGraph class orchestrates the flow of user interactions and processes search criteria. Here are the key nodes and their functions within the graph. The module column shows which of the modules the nodes belong to:

Node

Description

Module

decide_if_data_search

Determines if the user query is related to data search or off-topic.

Search Criteria Module

early_end

Ends the interaction if the query is off-topic.

Answer Generation Module

process_query

Processes the user query to extract initial search criteria.

Search Criteria Module

temporal_parser

Checks if a temporal dimension is necessary for the search.

Search Criteria Module

analyze_search_dict

Analyzes the search criteria to determine if sufficient information is available for retrieval.

Search Criteria Module

follow_up_gen

Generates follow-up questions if more information is needed.

Answer Generation Module

search

Initiates the search in the vector database using the refined search criteria.

Retrieval Module

final_answer

Generates the final answer based on the retrieved documents.

Answer Generation Module

See the complete graph workflow here:

current_workflow