Modules
Indexing Module and Retrieval Module
Indexing Module
This module integrates external resources and prepares the collected data for loading into a vector database. It consists of connectors and an indexer. The connectors are abstract Python classes that gather data from external sources and convert it into LangChain Documents. The indexer then creates embeddings from these documents and loads them into the vector database. Each connector is specific to one type of data resource (e.g., a geojson/OSM connector for reading OSM-type geojsons), so there is one connector per resource type. In contrast, the indexer is generic and handles the output of every connector, so a single indexer is enough. For more detail about the connectors see the documentation here.
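The connector/indexer split described above can be sketched as follows. This is a minimal illustration, not the project's actual code: `BaseConnector`, `GeoJSONConnector`, and the `load` method are hypothetical names, and the small `Document` dataclass is a stand-in that only mirrors the `page_content` and `metadata` fields of a LangChain Document so the example runs without extra dependencies.

```python
import json
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Document:
    """Stand-in mirroring the page_content/metadata fields of a LangChain Document."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


class BaseConnector(ABC):
    """Hypothetical base class: one subclass per resource type."""

    @abstractmethod
    def load(self) -> List[Document]:
        """Read the external resource and return it as documents."""


class GeoJSONConnector(BaseConnector):
    """Hypothetical connector that turns GeoJSON features into documents."""

    def __init__(self, geojson_text: str):
        self.geojson_text = geojson_text

    def load(self) -> List[Document]:
        collection = json.loads(self.geojson_text)
        docs = []
        for feature in collection.get("features", []):
            props = feature.get("properties") or {}
            # Flatten the feature properties into a single text to embed later.
            text = ", ".join(f"{key}: {value}" for key, value in sorted(props.items()))
            geometry = feature.get("geometry") or {}
            docs.append(Document(page_content=text,
                                 metadata={"geometry_type": geometry.get("type")}))
        return docs


sample = json.dumps({
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "properties": {"name": "Example Cafe", "amenity": "cafe"},
            "geometry": {"type": "Point", "coordinates": [7.6, 51.9]},
        }
    ],
})
docs = GeoJSONConnector(sample).load()
print(docs[0].page_content)
```

Because every connector returns the same document type, a single generic indexer can consume the output of any of them.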
Indexer Class
The Indexer class is responsible for creating and managing embeddings of documents for efficient retrieval and loading them into a vector database. It utilizes the LangChain library for embedding and indexing operations. The Indexer class is used by both the Indexing Module to load documents into the vector database and the Retrieval Module to retrieve documents based on queries.
Key Features
Embeddings: Supports both GPT4All and HuggingFace models for generating embeddings from documents. The default model is nomic-ai/nomic-embed-text-v1.5-GGUF by Nomic AI. Any other embedding model available via HuggingFace can also be used. To use a HuggingFace model, set the use_hf_model flag to True and specify the model name with the embedding_model parameter (e.g., embedding_model='sentence-transformers/all-MiniLM-L6-v2').
Vector Storage: Uses ChromaDB for storing and managing vectors persistently.
Record Management: Integrates with an SQLRecordManager for efficient indexing and retrieval operations, maintaining a schema in a local SQLite database.
Document Retrieval: Provides a flexible retriever for similarity searches with configurable parameters for the number of results (k) and score threshold.
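To illustrate why a record manager makes repeated indexing cheaper, here is a rough sketch of the underlying idea: keep a hash of every indexed document in SQLite and skip documents that were already seen. `SimpleRecordManager` and `filter_new` are hypothetical names invented for this example; the sketch does not reproduce the behaviour or schema of LangChain's SQLRecordManager.

```python
import hashlib
import sqlite3
from typing import Iterable, List


class SimpleRecordManager:
    """Illustrative only: tracks content hashes of indexed documents in SQLite."""

    def __init__(self, db_path: str = ":memory:"):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS records (doc_hash TEXT PRIMARY KEY)"
        )

    def filter_new(self, texts: Iterable[str]) -> List[str]:
        """Return only the texts not recorded before, and record them."""
        new_texts = []
        for text in texts:
            doc_hash = hashlib.sha256(text.encode("utf-8")).hexdigest()
            cursor = self.conn.execute(
                "INSERT OR IGNORE INTO records (doc_hash) VALUES (?)", (doc_hash,)
            )
            # rowcount is 1 when the hash was inserted, 0 when it already existed.
            if cursor.rowcount == 1:
                new_texts.append(text)
        self.conn.commit()
        return new_texts


manager = SimpleRecordManager()
first_pass = manager.filter_new(["rainfall grid", "soil map"])
second_pass = manager.filter_new(["rainfall grid", "land cover"])
print(first_pass, second_pass)
```

On the second pass only the unseen text is returned, so only that text would need to be embedded and written to the vector store.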
Usage
The Indexer class offers several methods to interact with the indexed data:
_index(documents: List[Document]): Indexes a list of documents.
_clear(): Clears the current index.
_get_doc_by_id(_id: str): Retrieves a document by its ID.
_delete_doc_from_index(_id: str): Deletes a document from the index and updates the record manager accordingly.
Example Initialization
from your_module import Indexer  # Replace with your actual module path

indexer = Indexer(
    index_name="my_index",
    persist_directory="./chroma_db",
    embedding_model="nomic-embed-text-v1.5.f16.gguf",
    use_hf_model=False,
    k=3,
    score_threshold=0.6
)
Note: The score_threshold argument sets the minimum relevance score that documents must have to be retrieved for a query. Only documents with a relevance score above this threshold will be returned.
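The interplay of k and the score threshold can be sketched with plain Python. This is a toy illustration under stated assumptions: cosine similarity is used as the relevance score, the vectors are made up, and `retrieve` is a hypothetical helper, not the retriever used by the Indexer class.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(
    query_vec: Sequence[float],
    doc_vecs: Dict[str, Sequence[float]],
    k: int = 3,
    score_threshold: float = 0.6,
) -> List[Tuple[str, float]]:
    """Return at most k (doc_id, score) pairs whose score is above the threshold."""
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    kept = [pair for pair in scored if pair[1] > score_threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:k]


# Made-up two-dimensional "embeddings" for demonstration purposes.
toy_vectors = {
    "precipitation": [1.0, 0.0],
    "rainfall": [0.8, 0.6],
    "roads": [0.0, 1.0],
}
results = retrieve([1.0, 0.0], toy_vectors, k=3, score_threshold=0.6)
print([doc_id for doc_id, _ in results])
```

Even though k is 3, only two documents come back here, because the third one scores below the threshold.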
Retrieval Module
The Retrieval Module uses the Indexer class to retrieve documents from the vector database based on user queries. It enables contextual retrieval through dense retrieval techniques, ensuring relevant documents are fetched efficiently.
Endpoints
Each index created by the Indexing Module can be accessed through specific retrieval endpoints. These endpoints are dynamically created based on the available indexes. This setup ensures that each index can be queried individually, providing a flexible and scalable retrieval mechanism.
Example Usage
Below is an example of how to set up the retrieval endpoints:
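from fastapi import FastAPI
from langserve import add_routes  # assumption: add_routes comes from LangServe
from your_module import Indexer  # Replace with your actual module path

app = FastAPI()

# Create a dictionary of indexes
indexes = {
    "pygeoapi": Indexer(index_name="pygeoapi"),
}

# Add retrieval routes for each index
for index_name, index_instance in indexes.items():
    add_routes(app, index_instance.retriever, path=f"/retrieve_{index_name}")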
These endpoints accept POST requests, allowing you to retrieve documents for specific queries. For example, to retrieve documents from the pygeoapi index, you can send the following request:
curl -X POST \
http://localhost:8000/retrieve_pygeoapi/invoke \
-H 'Content-Type: application/json' \
-d '{
"input": "Precipitation"
}'
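The same request can be built from Python using only the standard library. This is a sketch that mirrors the curl call above; `build_invoke_request` is a hypothetical helper name, and the host and port are taken from the example, not from a verified deployment.

```python
import json
import urllib.request


def build_invoke_request(base_url: str, index_name: str, query: str) -> urllib.request.Request:
    """Build a POST request for the retrieval endpoint of one index."""
    url = f"{base_url.rstrip('/')}/retrieve_{index_name}/invoke"
    body = json.dumps({"input": query}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_invoke_request("http://localhost:8000", "pygeoapi", "Precipitation")
print(request.get_method(), request.full_url)

# To send it against a running server:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

Keeping request construction separate from sending makes it easy to reuse the helper for every index name.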
Search Criteria and Answer Generation Modules
Search Criteria Module
The Search Criteria Module is designed to handle user inputs in a chat-bot manner, assisting users in finding geospatial data. It extracts search criteria from user inputs and can generate follow-up questions if the provided information is insufficient. The module uses a state graph to manage the flow of user interactions and determine the necessary actions to refine the search criteria.
Answer Generation Module
The Answer Generation Module uses the refined search criteria and retrieved documents to generate contextual answers for user queries. It leverages a combination of retrieval and language generation techniques to provide accurate and useful responses.
State Graph
The SpatialRetrieverGraph class orchestrates the flow of user interactions and processes search criteria. The table below lists the key nodes of the graph and their functions. The Module column shows which module each node belongs to:
| Node | Description | Module |
|---|---|---|
| decide_if_data_search | Determines if the user query is related to data search or off-topic. | |
| early_end | Ends the interaction if the query is off-topic. | |
| process_query | Processes the user query to extract initial search criteria. | |
| temporal_parser | Checks if a temporal dimension is necessary for the search. | |
| analyze_search_dict | Analyzes the search criteria to determine if sufficient information is available for retrieval. | |
| follow_up_gen | Generates follow-up questions if more information is needed. | |
| search | Initiates the search in the vector database using the refined search criteria. | |
| final_answer | Generates the final answer based on the retrieved documents. | |
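The control flow through these nodes can be approximated with a small plain-Python state machine. This is only a sketch: the node names come from the table, but the routing between nodes, the toy decision rules (a keyword check and a word count), and the `run_graph` helper are assumptions made for illustration and are not taken from the SpatialRetrieverGraph implementation.

```python
from typing import Callable, Dict, List, Optional

State = Dict[str, object]


def decide_if_data_search(state: State) -> Optional[str]:
    # Toy rule: treat the query as a data search if it mentions "data".
    is_data_search = "data" in str(state["query"]).lower()
    return "process_query" if is_data_search else "early_end"


def early_end(state: State) -> Optional[str]:
    state["answer"] = "This assistant only helps with finding geospatial data."
    return None


def process_query(state: State) -> Optional[str]:
    state["search_dict"] = {"topic": state["query"]}
    return "temporal_parser"


def temporal_parser(state: State) -> Optional[str]:
    # Toy rule: a digit in the query hints at a temporal dimension.
    state["needs_time"] = any(ch.isdigit() for ch in str(state["query"]))
    return "analyze_search_dict"


def analyze_search_dict(state: State) -> Optional[str]:
    # Toy rule: queries with fewer than three words need a follow-up.
    enough = len(str(state["query"]).split()) >= 3
    return "search" if enough else "follow_up_gen"


def follow_up_gen(state: State) -> Optional[str]:
    state["follow_up"] = "Which region or time period are you interested in?"
    return None


def search(state: State) -> Optional[str]:
    state["documents"] = []  # a real node would query the vector database here
    return "final_answer"


def final_answer(state: State) -> Optional[str]:
    state["answer"] = f"Found {len(state['documents'])} documents."
    return None


NODES: Dict[str, Callable[[State], Optional[str]]] = {
    "decide_if_data_search": decide_if_data_search,
    "early_end": early_end,
    "process_query": process_query,
    "temporal_parser": temporal_parser,
    "analyze_search_dict": analyze_search_dict,
    "follow_up_gen": follow_up_gen,
    "search": search,
    "final_answer": final_answer,
}


def run_graph(query: str) -> List[str]:
    """Run the nodes from the entry point and return the visited node names."""
    state: State = {"query": query}
    visited: List[str] = []
    node: Optional[str] = "decide_if_data_search"
    while node is not None:
        visited.append(node)
        node = NODES[node](state)
    return visited


print(run_graph("precipitation data for 2020"))
```

Each node returns the name of the next node, or None to stop, which keeps the routing decisions next to the logic that produces them.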
See the complete graph workflow here: