DMP Midpoint Evaluation
Project 1: Query Method Evaluation System on Vector Databases for Grievance Data
Overview
The Query Method Evaluation System is a robust framework designed to identify and recommend the most effective query methods for various categories within a grievance management system. It leverages RAGAS metrics (Context Precision, Context Recall, and Response Relevancy) to provide a comprehensive evaluation of different search techniques on a large dataset of structured grievance information.
Project Scope and Impact:
The project involved:
Structuring 15,000 grievance records: Transforming raw grievance data into a structured, query-ready format. This step is crucial for consistent and accurate information retrieval.
Synthetic Query Generation: Creating 15,000 synthetic queries, one for each structured grievance, to expand the dataset for thorough evaluation and benchmarking (see the sketch after this list).
Evaluation and Benchmarking: Systematically testing and comparing various query methods against the structured data and synthetic queries using RAGAS metrics.
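As an illustration of the synthetic query generation step, a minimal sketch is shown below. The model name, prompt wording, and the generateSyntheticQuery helper are assumptions for illustration only, not the project's actual implementation:

```javascript
// Illustrative sketch only — model, prompt, and helper name are assumptions.
import OpenAI from 'openai';

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// Generate one natural-language user query for a structured grievance record.
async function generateSyntheticQuery(grievance) {
  const prompt = `You are a telecom subscriber filing a complaint.
Category: ${grievance.category} > ${grievance.sub_category_1}
Issue: ${grievance.description}
Write one short query this user might type into a grievance portal.`;

  const completion = await openai.chat.completions.create({
    model: 'gpt-4o-mini', // assumed model choice
    messages: [{ role: 'user', content: prompt }],
    temperature: 0.7,
  });

  return completion.choices[0].message.content.trim();
}
```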
Features
The system boasts a comprehensive set of features to facilitate in-depth analysis and optimization of query performance:
9 Query Methods Evaluated:
Semantic Search
Hybrid Search
Keyword Search (BM25)
Generative Search (RAG)
Vector Similarity Search
Multiple Target Vectors
Reranking (Hybrid + Rerank)
Aggregate Data
Filtered Search
3 Core RAGAS Metrics:
Context Precision: Measures the proportion of relevant information within the retrieved contexts. A good score (roughly 0.7 or higher) indicates less noise in the results.
Context Recall: Measures how many of the relevant pieces of information were successfully retrieved. A good score (roughly 0.8 or higher) indicates fewer missed relevant results.
Response Relevancy: Measures how well the generated or retrieved response addresses the user's query. A good score (roughly 0.7 or higher) indicates better alignment between the query and the response. The results below also report a single combined score per method; see the sketch after this feature list.
Category-wise Analysis: Provides tailored recommendations for the best query method for each specific grievance category, ensuring optimized performance across diverse complaint types.
Comprehensive Reporting: Generates detailed CSV exports of raw evaluation results and JSON recommendations for easy integration and analysis.
Scalable Evaluation: Designed to efficiently handle large datasets, making it suitable for real-world grievance systems with extensive data.
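Judging from the averageScore values in recommendations.json and the tables later in this report, the single combined score per method appears to be the unweighted mean of the three RAGAS metrics. A minimal sketch under that assumption:

```javascript
// Assumption: the combined score is the unweighted mean of the three RAGAS metrics.
function combinedScore({ contextPrecision, contextRecall, responseRelevancy }) {
  return (contextPrecision + contextRecall + responseRelevancy) / 3;
}

// Matches the recommendations.json example below: (0.89 + 0.82 + 0.86) / 3 ≈ 0.856
console.log(combinedScore({ contextPrecision: 0.89, contextRecall: 0.82, responseRelevancy: 0.86 }));
```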
Setup Instructions
To set up and run the evaluation system, follow these steps:
1. Environment Setup:
Create a .env file in your project root with the following variables:
```
WEAVIATE_URL=https://your-cluster.weaviate.network
WEAVIATE_API_KEY=your-weaviate-api-key
OPENAI_API_KEY=your-openai-api-key
```
2. Install Dependencies:
Use npm to install the necessary packages:
```bash
npm install weaviate-client dotenv node-fetch csv-writer
```
3. Data Preparation:
Ensure your JSON data adheres to the specified structure:
```json
[
  {
    "department_code": "DOTEL",
    "department_name": "Telecommunications",
    "category": "Mobile Related",
    "sub_category_1": "Call Drop",
    "description": "Detailed description of the issue...",
    "user_queries": [
      "My phone calls keep getting dropped. What can I do?",
      "Why does my cell phone keep disconnecting during calls?"
    ]
  }
]
```
4. Run Data Preparation:
Execute the data preparation script:
```bash
node data_preparation.js
```
5. Run Evaluation:
Initiate the query evaluation process:
```bash
node query_evaluation.js
```
Query Methods Evaluated and Their Best Use Cases
Each query method is suited to different types of user queries and data characteristics; a short client-side sketch follows the list.
1. Semantic Search (nearText):
Best for: Natural language queries, conceptual searches.
Use case: When users ask questions in conversational language (e.g., "My internet is slow").
2. Hybrid Search:
Best for: A balanced approach combining semantic understanding and keyword matching.
Use case: General-purpose queries with mixed search patterns.
3. Keyword Search (BM25):
Best for: Exact term matching, technical queries.
Use case: When users search for specific terms or codes (e.g., "error code 404").
4. Generative Search (RAG):
Best for: Complex queries requiring synthesized answers.
Use case: When users need explanations or detailed responses (e.g., "What should I do about dropped calls?").
5. Vector Similarity Search:
Best for: Semantic similarity without text preprocessing.
Use case: Finding conceptually similar content based on raw vector embeddings.
6. Multiple Target Vectors:
Best for: Multi-aspect queries or complex categorization.
Use case: When different aspects of content need different vector representations (e.g., technical content + emotional sentiment vectors).
7. Reranking (Hybrid + Rerank):
Best for: High-precision requirements, quality over quantity.
Use case: When accuracy is more important than speed (e.g., legal or medical queries).
8. Aggregate Data:
Best for: Statistical queries, counting, summarization.
Use case: Analytics and reporting queries (e.g., "How many network issues were reported?").
9. Filtered Search:
Best for: Category-specific searches, scoped queries.
Use case: When search should be limited to specific categories (e.g., only searching within "Mobile Related" issues).
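As a rough guide to how several of these methods map onto the Weaviate JavaScript client, here is a sketch using the v3-style API; the 'Grievance' collection name, property names, and option values are assumptions about the schema, not the project's actual code:

```javascript
// query_methods_sketch.js — illustrative only; collection and property names are assumptions.
import weaviate from 'weaviate-client';

async function runExamples(client, userQuery) {
  const grievances = client.collections.get('Grievance');

  // 1. Semantic search (nearText)
  const semantic = await grievances.query.nearText(userQuery, { limit: 5 });

  // 2. Hybrid search (semantic + keyword, weighted by alpha)
  const hybrid = await grievances.query.hybrid(userQuery, { alpha: 0.5, limit: 5 });

  // 3. Keyword search (BM25)
  const keyword = await grievances.query.bm25(userQuery, { limit: 5 });

  // 9. Filtered search — scope results to one category
  const filtered = await grievances.query.nearText(userQuery, {
    limit: 5,
    filters: grievances.filter.byProperty('category').equal('Mobile Related'),
  });

  return { semantic, hybrid, keyword, filtered };
}
```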
Output Files
The evaluation system generates the following output files for detailed analysis:
evaluation_results.csv: Detailed results for each query and method combination.
Includes Department, Category, Sub-category, User Query, scores for each method (Semantic, Hybrid, etc.), and individual RAGAS metric scores.
recommendations.json: Category-wise recommendations for the best-performing query method.
Example structure:
```json
{
  "Mobile Related > Call Drop": {
    "bestMethod": "hybrid",
    "bestScore": 0.856,
    "methodPerformance": {
      "hybrid": {
        "averageScore": 0.856,
        "contextPrecision": 0.89,
        "contextRecall": 0.82,
        "responseRelevancy": 0.86
      }
    }
  }
}
```
Console Output:
Provides real-time progress and a summary during the evaluation process.
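One way a downstream service could consume recommendations.json is to route each incoming query to the best-known method for its category. A sketch, assuming the "Category > Sub-category" keys shown in the example above:

```javascript
// route_query_sketch.js — illustrative only.
import { readFileSync } from 'fs';

const recommendations = JSON.parse(readFileSync('recommendations.json', 'utf-8'));

// Pick the recommended query method for a grievance category, falling back to hybrid.
function pickMethod(category, subCategory) {
  const entry = recommendations[`${category} > ${subCategory}`];
  return entry ? entry.bestMethod : 'hybrid';
}

console.log(pickMethod('Mobile Related', 'Call Drop')); // e.g. "hybrid"
```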
Evaluation Process Phases
The evaluation proceeds through three distinct phases:
🧩 Phase 1: Initialization
This phase:
Loads env variables
Connects to Weaviate
Loads evaluation JSON
Creates evaluator instance
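A minimal sketch of this initialization, assuming the Weaviate v3 JavaScript client and the evaluation_data.json file described later in this report (file and helper names are illustrative):

```javascript
// Phase 1 sketch — illustrative only.
import 'dotenv/config';
import { readFileSync } from 'fs';
import weaviate from 'weaviate-client';

async function initialize() {
  // Connect to the Weaviate Cloud cluster defined in .env
  const client = await weaviate.connectToWeaviateCloud(process.env.WEAVIATE_URL, {
    authCredentials: new weaviate.ApiKey(process.env.WEAVIATE_API_KEY),
    headers: { 'X-OpenAI-Api-Key': process.env.OPENAI_API_KEY },
  });

  // Load the structured grievances and their synthetic queries
  const evaluationData = JSON.parse(readFileSync('evaluation_data.json', 'utf-8'));

  return { client, evaluationData };
}
```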
🧩 Phase 2: Evaluate All Methods
This is the core loop:
For each category and query
Run all query methods
Calculate 3 metrics (Precision, Recall, Relevancy)
Log & store results
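In sketch form, the core loop might look like the following, where runMethod and the three scoring functions are placeholders standing in for the project's actual implementations:

```javascript
// Phase 2 sketch — illustrative only; runMethod() and the scoring functions are placeholders.
async function evaluateAllMethods(evaluationData, methods, results = []) {
  for (const record of evaluationData) {
    for (const userQuery of record.user_queries) {
      for (const method of methods) {
        const retrieved = await runMethod(method, userQuery); // e.g. semantic, hybrid, bm25...
        const contextPrecision = scoreContextPrecision(retrieved, record);
        const contextRecall = scoreContextRecall(retrieved, record);
        const responseRelevancy = scoreResponseRelevancy(retrieved, userQuery);

        results.push({
          category: record.category,
          subCategory: record.sub_category_1,
          userQuery,
          method,
          contextPrecision,
          contextRecall,
          responseRelevancy,
          score: (contextPrecision + contextRecall + responseRelevancy) / 3,
        });
      }
    }
  }
  return results;
}
```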
🧩 Phase 3: Result Analysis + Export
This phase:
Aggregates scores across all queries
Calculates final average P, R, RR, Score
Finds the best method
Writes to CSV and JSON
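A sketch of this aggregation and export step, assuming the per-query results produced in Phase 2 and an unweighted mean for the combined score (the CSV export via csv-writer is omitted for brevity):

```javascript
// Phase 3 sketch — illustrative only.
import { writeFileSync } from 'fs';

function analyseAndExport(results) {
  const byCategory = {};

  // Group results per "Category > Sub-category" and method
  for (const r of results) {
    const key = `${r.category} > ${r.subCategory}`;
    byCategory[key] ??= {};
    (byCategory[key][r.method] ??= []).push(r);
  }

  const recommendations = {};
  for (const [key, methods] of Object.entries(byCategory)) {
    const methodPerformance = {};
    for (const [method, rows] of Object.entries(methods)) {
      const avg = (field) => rows.reduce((sum, r) => sum + r[field], 0) / rows.length;
      methodPerformance[method] = {
        averageScore: avg('score'),
        contextPrecision: avg('contextPrecision'),
        contextRecall: avg('contextRecall'),
        responseRelevancy: avg('responseRelevancy'),
      };
    }
    // Pick the method with the highest average score
    const [bestMethod, best] = Object.entries(methodPerformance)
      .sort((a, b) => b[1].averageScore - a[1].averageScore)[0];
    recommendations[key] = { bestMethod, bestScore: best.averageScore, methodPerformance };
  }

  writeFileSync('recommendations.json', JSON.stringify(recommendations, null, 2));
  return recommendations;
}
```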
Detailed Pipeline: https://excalidraw.com/
Consolidated Evaluation Metrics Table (Excluding Generative Search)
The following tables summarize the performance of the query methods across different grievance subcategories, excluding Generative Search, which often shows distinct performance characteristics because it synthesizes answers.
Mobile Related > Call Drop

| Method | Score | Context Precision | Context Recall | Response Relevancy |
| --- | --- | --- | --- | --- |
| semantic | 0.949 | 1.00 | 1.00 | 0.846 |
| hybrid | 0.949 | 1.00 | 1.00 | 0.846 |
| keyword | 0.943 | 1.00 | 1.00 | 0.830 |
| vector | 0.911 | 1.00 | 1.00 | 0.732 |
| reranking | 0.944 | 1.00 | 1.00 | 0.833 |
| filtered | 0.949 | 1.00 | 1.00 | 0.846 |

Mobile Related > Improper Network Coverage

| Method | Score | Context Precision | Context Recall | Response Relevancy |
| --- | --- | --- | --- | --- |
| semantic | 0.934 | 0.96 | 1.00 | 0.842 |
| hybrid | 0.930 | 0.96 | 1.00 | 0.829 |
| keyword | 0.943 | 1.00 | 1.00 | 0.829 |
| vector | 0.847 | 0.84 | 0.986 | 0.716 |
| reranking | 0.943 | 1.00 | 1.00 | 0.829 |
| filtered | 0.937 | 1.00 | 1.00 | 0.812 |

Mobile Related > Data Speed lower than committed

| Method | Score | Context Precision | Context Recall | Response Relevancy |
| --- | --- | --- | --- | --- |
| semantic | 0.956 | 1.00 | 1.00 | 0.868 |
| hybrid | 0.956 | 1.00 | 1.00 | 0.868 |
| keyword | 0.943 | 0.96 | 1.00 | 0.868 |
| vector | 0.877 | 0.92 | 0.985 | 0.726 |
| reranking | 0.939 | 1.00 | 1.00 | 0.817 |
| filtered | 0.933 | 1.00 | 1.00 | 0.800 |

Mobile Related > Mobile Number Portability (MNP)

| Method | Score | Context Precision | Context Recall | Response Relevancy |
| --- | --- | --- | --- | --- |
| semantic | 0.941 | 1.00 | 1.00 | 0.822 |
| hybrid | 0.938 | 1.00 | 1.00 | 0.814 |
| keyword | 0.936 | 1.00 | 1.00 | 0.808 |
| vector | 0.907 | 1.00 | 1.00 | 0.722 |
| reranking | 0.753 | 0.80 | 0.80 | 0.659 |
| filtered | 0.929 | 1.00 | 1.00 | 0.786 |

Mobile Related > UCC related complaints

| Method | Score | Context Precision | Context Recall | Response Relevancy |
| --- | --- | --- | --- | --- |
| semantic | 0.921 | 0.96 | 1.00 | 0.804 |
| hybrid | 0.921 | 0.96 | 1.00 | 0.804 |
| keyword | 0.922 | 0.96 | 1.00 | 0.806 |
| vector | 0.885 | 0.96 | 0.96 | 0.736 |
| reranking | 0.747 | 0.80 | 0.80 | 0.641 |
| filtered | 0.926 | 1.00 | 1.00 | 0.779 |

Mobile Related > Activation/Deactivation of VAS

| Method | Score | Context Precision | Context Recall | Response Relevancy |
| --- | --- | --- | --- | --- |
| semantic | 0.751 | 0.80 | 0.80 | 0.653 |
| hybrid | 0.751 | 0.80 | 0.80 | 0.653 |
| keyword | 0.751 | 0.80 | 0.80 | 0.653 |
| vector | 0.903 | 0.96 | 1.00 | 0.750 |
| reranking | 0.939 | 1.00 | 1.00 | 0.816 |
| filtered | 0.923 | 1.00 | 1.00 | 0.770 |

Mobile Related > Activation/Deactivation/Fault of SIM

| Method | Score | Context Precision | Context Recall | Response Relevancy |
| --- | --- | --- | --- | --- |
| semantic | 0.945 | 1.00 | 1.00 | 0.834 |
| hybrid | 0.754 | 0.80 | 0.80 | 0.663 |
| keyword | 0.944 | 1.00 | 1.00 | 0.832 |
| vector | 0.912 | 1.00 | 1.00 | 0.737 |
| reranking | 0.944 | 1.00 | 1.00 | 0.831 |
| filtered | 0.925 | 1.00 | 1.00 | 0.775 |
A tabular report of this kind is produced for each evaluated query.
Summary Evaluation Report and Analysis
Output Files Summary:
evaluation_data.json: Contains metadata and user queries for telecom grievance categories.
evaluation_results.csv: Provides raw evaluation scores per query per method.
recommendations.json: Offers aggregated performance metrics (Precision, Recall, Response Relevancy, and Average Score) for each category and query method.
Evaluation Analysis Report: Weaviate Search Method Performance
📌 Overview:
The evaluation rigorously compares various search and retrieval methods on telecommunications grievance data. The primary objective is to assess each method's effectiveness in retrieving relevant and helpful results, leveraging the three crucial RAGAS metrics: Context Precision, Context Recall, and Response Relevancy.
📂 Categories Evaluated:
| Category | Sub-category | Queries Evaluated |
| --- | --- | --- |
| Mobile Related | Call Drop | 5 |
| Mobile Related | Improper Network Coverage | 5 |
| Mobile Related | Data Speed lower than committed | 5 |
| Mobile Related | Mobile Number Portability (MNP) | 5 |
| Mobile Related | UCC related complaints | 5 |
| Mobile Related | Activation/Deactivation of Value Added Services | 5 |
| Mobile Related | Activation/Deactivation/Fault of SIM Card | 5 |
Summary of Best Methods Per Category (Including Generative Search):
This summary highlights the top-performing methods, with Generative Search demonstrating exceptional performance across the board.
| Category | Best Method | Score | Context Precision | Context Recall | Response Relevancy |
| --- | --- | --- | --- | --- | --- |
| Call Drop | Generative | 0.961 | 1.0 | 1.0 | 0.882 |
| Improper Network Coverage | Generative | 0.946 | 0.96 | 1.0 | 0.878 |
| Data Speed lower than committed | Generative | 0.958 | 1.0 | 1.0 | 0.874 |
| Mobile Number Portability (MNP) | Generative | 0.958 | 1.0 | 1.0 | 0.874 |
| UCC related complaints | Generative | 0.947 | 0.96 | 1.0 | 0.881 |
| Activation/Deactivation of VAS | Generative | 0.965 | 1.0 | 1.0 | 0.896 |
| SIM Card Activation/Deactivation/Fault | Generative | 0.967 | 1.0 | 1.0 | 0.902 |
It is important to note that Generative search consistently outperforms all other methods across every category evaluated.
Key Metrics Definitions:
Context Precision: Proportion of relevant chunks in retrieved contexts.
Context Recall: How many relevant pieces of information were retrieved.
Response Relevancy: How well the response addresses the user's query.
Detailed Category-wise Breakdown (Excluding Generative Search):
For a more granular view of non-generative methods, the following breakdown provides insights into each method's performance within specific categories.
Category: Mobile Related > Call Drop
Best Method (Excluding Generative): Semantic / Hybrid / Filtered (Tie at 0.949)
Category: Mobile Related > Improper Network Coverage
Best Method (Excluding Generative): Keyword / Reranking (Tie at 0.943)
Category: Mobile Related > Data Speed Lower Than Committed
Best Method (Excluding Generative): Semantic / Hybrid (Tie at 0.956)
Category: Mobile Related > Mobile Number Portability (MNP)
Best Method (Excluding Generative): Semantic (0.941)
Category: Mobile Related > UCC Related Complaints
Best Method (Excluding Generative): Filtered (0.926)
Category: Mobile Related > VAS Activation/Deactivation Without Consent
Best Method (Excluding Generative): Reranking (0.939)
Category: Mobile Related > SIM Card Activation/Deactivation/Fault
Best Method (Excluding Generative): Semantic (0.945)
Overall Summary (Best Non-Generative Methods):
| Category | Best Non-Generative Method(s) | Score |
| --- | --- | --- |
| Call Drop | Semantic / Hybrid / Filtered | 0.949 |
| Improper Network Coverage | Keyword / Reranking | 0.943 |
| Data Speed Lower Than Committed | Semantic / Hybrid | 0.956 |
| Mobile Number Portability (MNP) | Semantic | 0.941 |
| UCC Related Complaints | Filtered | 0.926 |
| VAS Activation/Deactivation Without Consent | Reranking | 0.939 |
| SIM Card Activation/Deactivation/Fault | Semantic | 0.945 |
📌 Insights from Evaluation:
Semantic Search Consistency: Semantic search consistently demonstrated strong performance, particularly in scenarios where the context was straightforward and vocabulary overlap was high.
Filtered Search Effectiveness: Filtered search showed surprisingly good results in categories like "UCC related complaints" and "Call Drop," likely attributable to precise taxonomy and well-labeled data within those categories.
Vector Search Limitations: While sometimes achieving perfect precision/recall, Vector search generally lagged in overall response relevancy. This suggests that while it can retrieve relevant contexts, the directness or utility of the generated response might be lower without further processing (like reranking or generative augmentation).
Reranking for Precision: Reranking methods proved valuable in specific edge cases where an initial search might return good results, but a finer-grained prioritization was required for optimal accuracy and user satisfaction. This is particularly relevant for sensitive or critical queries.
Project 2: Structuring Amendment Documents for Legal Justice
Overview
This project focuses on the crucial task of structuring amendment documents to build a robust platform for tracking legal amendments. The primary goal is to provide legal professionals and the general public with a clearer, more efficient view of changes within legal articles, ultimately contributing to better legal justice.
Project Scope and Impact:
The current legal landscape is characterized by frequent amendments to articles and laws, making it challenging for lawyers, judges, and citizens to keep track of the latest versions and their implications. This project aims to address this by:
Extracting and Structuring Amendment Information: Developing a system to parse and structure data from legal amendment documents (an illustrative structured record follows this list). This includes identifying:
The original article/section being amended.
The specific changes made (additions, deletions, modifications).
The effective date of the amendment.
The source or act introducing the amendment.
The context and rationale behind the amendment (if available).
Building a Track Record Platform: Designing and implementing a platform that centralizes this structured amendment information. This platform will:
Version Control: Maintain a comprehensive history of all amendments to any given article.
Comparisons: Allow users to easily compare different versions of an article side-by-side to highlight changes.
Searchability: Enable efficient searching for amendments based on keywords, dates, article numbers, or topics.
Contextualization: Provide tools to understand the impact of amendments on related articles or legal provisions.
User-Friendly Interface: Offer an intuitive interface for legal professionals and the public to navigate complex legal changes.
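For illustration, a structured record for a single amendment might look like the following. This is a hypothetical example: the field names follow the items listed above rather than a finalized schema, and the values are placeholders.

```json
{
  "amended_article": "Article 12, Section 3",
  "change_type": "modification",
  "original_text": "...",
  "amended_text": "...",
  "effective_date": "2021-07-01",
  "amending_act": "Hypothetical Amendment Act, 2021",
  "rationale": "..."
}
```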
Objectives
Enhance Transparency: Make legal amendments more accessible and understandable to a wider audience.
Improve Legal Accuracy: Reduce the risk of misinterpreting or misapplying laws due to outdated information.
Streamline Legal Research: Significantly cut down the time and effort required for legal professionals to research statutory changes.
Support Legal Aid and Justice: Empower lawyers and legal aid organizations to provide more informed and effective assistance.
Foster Compliance: Help individuals and organizations stay compliant with the latest legal requirements.
Technical Approach (Anticipated)
While specific technical details for this project were not provided, a typical approach for structuring legal documents would involve:
Natural Language Processing (NLP): Utilizing NLP techniques for entity recognition (e.g., article numbers, dates, legal terms), relationship extraction (e.g., which amendment affects which article), and change detection (identifying specific textual modifications).
Machine Learning (ML): Employing ML models for classification of amendment types, sentiment analysis of changes (if applicable), and potentially for predicting the impact of new amendments.
Knowledge Graph/Ontology Development: Building a knowledge graph to represent legal articles, acts, and their interconnections, allowing for more intelligent querying and contextual understanding of amendments.
Database Design: Designing a robust database schema to store the structured amendment data, supporting versioning and efficient retrieval.
Web Development Frameworks: Using modern web development frameworks to build the user-facing platform with interactive features.
Expected Outcomes
The successful implementation of the Amendment Document Structuring Platform is expected to yield the following significant outcomes:
A centralized, searchable repository of all legal amendments.
Tools for visualizing legislative changes over time.
Improved efficiency for legal professionals in tracking and analyzing amendments.
Enhanced accuracy in legal advice and judicial decisions.
Greater public understanding and access to up-to-date legal information.
A foundation for future legal tech innovations, such as AI-powered legal assistants.