DMP Midpoint Evaluation


Project 1: Query Method Evaluation System on Vector Databases for Grievance Data

Overview

The Query Method Evaluation System is a robust framework designed to identify and recommend the most effective query methods for various categories within a grievance management system. It leverages RAGAS metrics (Context Precision, Context Recall, and Response Relevancy) to provide a comprehensive evaluation of different search techniques on a large dataset of structured grievance information.

Project Scope and Impact:

The project involved:

  1. Structuring 15,000 grievance records: Transforming raw grievance data into a structured, query-ready format. This step is crucial for consistent and accurate information retrieval.

  2. Synthetic Query Generation: Creating 15,000 synthetic queries, one for each structured grievance. This expanded the dataset for thorough evaluation and benchmarking.

  3. Evaluation and Benchmarking: Systematically testing and comparing various query methods against the structured data and synthetic queries using RAGAS metrics.

Features

The system boasts a comprehensive set of features to facilitate in-depth analysis and optimization of query performance:

  • 9 Query Methods Evaluated:

    • Semantic Search

    • Hybrid Search

    • Keyword Search (BM25)

    • Generative Search (RAG)

    • Vector Similarity Search

    • Multiple Target Vectors

    • Reranking (Hybrid + Rerank)

    • Aggregate Data

    • Filtered Search

  • 3 Core RAGAS Metrics:

    • Context Precision: Measures the proportion of relevant information within the retrieved contexts. A score above 0.7 generally indicates little noise in the results.

    • Context Recall: Measures how many of the relevant pieces of information were successfully retrieved. A score above 0.8 generally indicates few missed relevant results.

    • Response Relevancy: Measures how well the generated or retrieved response addresses the user's query. A score above 0.7 generally indicates good alignment between the query and the response.

  • Category-wise Analysis: Provides tailored recommendations for the best query method for each specific grievance category, ensuring optimized performance across diverse complaint types.

  • Comprehensive Reporting: Generates detailed CSV exports of raw evaluation results and JSON recommendations for easy integration and analysis.

  • Scalable Evaluation: Designed to efficiently handle large datasets, making it suitable for real-world grievance systems with extensive data.

Setup Instructions

To set up and run the evaluation system, follow these steps:

  1. Environment Setup:

    Create a .env file in your project root with the following variables:

    WEAVIATE_URL=https://your-cluster.weaviate.network
    WEAVIATE_API_KEY=your-weaviate-api-key
    OPENAI_API_KEY=your-openai-api-key

  2. Install Dependencies:

    Use npm to install the necessary packages:

    npm install weaviate-client dotenv node-fetch csv-writer

  3. Data Preparation:

    Ensure your JSON data adheres to the specified structure:

    [ { "department_code": "DOTEL", "department_name": "Telecommunications", "category": "Mobile Related", "sub_category_1": "Call Drop", "description": "Detailed description of the issue...", "user_queries": [ "My phone calls keep getting dropped. What can I do?", "Why does my cell phone keep disconnecting during calls?" ] } ]

  4. Run Data Preparation:

    Execute the data preparation script:

    node data_preparation.js

  5. Run Evaluation:

    Initiate the query evaluation process:

    node query_evaluation.js

Query Methods Evaluated and Their Best Use Cases

Each query method is suited to different types of user queries and data characteristics; a short client-code sketch follows this list.

  • 1. Semantic Search (nearText):

    • Best for: Natural language queries, conceptual searches.

    • Use case: When users ask questions in conversational language (e.g., "My internet is slow").

  • 2. Hybrid Search:

    • Best for: A balanced approach combining semantic understanding and keyword matching.

    • Use case: General-purpose queries with mixed search patterns.

  • 3. Keyword Search (BM25):

    • Best for: Exact term matching, technical queries.

    • Use case: When users search for specific terms or codes (e.g., "error code 404").

  • 4. Generative Search (RAG):

    • Best for: Complex queries requiring synthesized answers.

    • Use case: When users need explanations or detailed responses (e.g., "What should I do about dropped calls?").

  • 5. Vector Similarity Search:

    • Best for: Semantic similarity without text preprocessing.

    • Use case: Finding conceptually similar content based on raw vector embeddings.

  • 6. Multiple Target Vectors:

    • Best for: Multi-aspect queries or complex categorization.

    • Use case: When different aspects of content need different vector representations (e.g., technical content + emotional sentiment vectors).

  • 7. Reranking (Hybrid + Rerank):

    • Best for: High-precision requirements, quality over quantity.

    • Use case: When accuracy is more important than speed (e.g., legal or medical queries).

  • 8. Aggregate Data:

    • Best for: Statistical queries, counting, summarization.

    • Use case: Analytics and reporting queries (e.g., "How many network issues were reported?").

  • 9. Filtered Search:

    • Best for: Category-specific searches, scoped queries.

    • Use case: When search should be limited to specific categories (e.g., only searching within "Mobile Related" issues).
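
As a rough illustration of how several of these methods map onto code, the sketch below uses the Weaviate JavaScript client (v3 collections API). The collection name Grievance and all parameter values are assumptions for demonstration, not the project's actual configuration, and client is assumed to be an already-connected weaviate-client instance (see the initialization sketch under Phase 1 below).

    // Illustrative Weaviate v3 client calls for a few of the methods above.
    const grievances = client.collections.get('Grievance');

    // 1. Semantic search (nearText): conversational, conceptual queries.
    const semantic = await grievances.query.nearText('My internet is slow', { limit: 5 });

    // 2. Hybrid search: blends vector and keyword signals (alpha controls the mix).
    const hybrid = await grievances.query.hybrid('slow data speed', { alpha: 0.5, limit: 5 });

    // 3. Keyword search (BM25): exact term matching.
    const keyword = await grievances.query.bm25('error code 404', { limit: 5 });

    // 9. Filtered search: scope results to a specific category.
    const filtered = await grievances.query.nearText('network issue', {
      limit: 5,
      filters: grievances.filter.byProperty('category').equal('Mobile Related'),
    });

    // Generative (RAG), reranking, aggregation, and multi-vector queries follow the
    // same pattern via the collection's generate, rerank, and aggregate helpers.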

Output Files

The evaluation system generates the following output files for detailed analysis:

  1. evaluation_results.csv:

    • Detailed results for each query and method combination.

    • Includes Department, Category, Sub-category, User Query, scores for each method (Semantic, Hybrid, etc.), and the individual RAGAS metric scores (an illustrative header row appears after this list).

  2. recommendations.json:

    • Category-wise recommendations for the best-performing query method.

    • Example structure:

      { "Mobile Related > Call Drop": { "bestMethod": "hybrid", "bestScore": 0.856, "methodPerformance": { "hybrid": { "averageScore": 0.856, "contextPrecision": 0.89, "contextRecall": 0.82, "responseRelevancy": 0.86 } } } }

  3. Console Output:

    • Provides real-time progress and a summary during the evaluation process.
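
For orientation, the first line of evaluation_results.csv might look roughly like the header below; the exact column names and ordering are assumptions based on the description above, not the script's literal output.

    Department,Category,SubCategory,UserQuery,Method,ContextPrecision,ContextRecall,ResponseRelevancy,AverageScore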

Evaluation Process Phases

The evaluation proceeds through three distinct phases:

🧩 Phase 1: Initialization

This phase:

  • Loads env variables

  • Connects to Weaviate

  • Loads evaluation JSON

  • Creates evaluator instance
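
A minimal initialization sketch, assuming the v3 weaviate-client API, the .env variables from the setup section, and that data_preparation.js writes evaluation_data.json (file, class, and variable names are illustrative):

    // Load environment variables and connect to Weaviate Cloud.
    import 'dotenv/config';
    import fs from 'node:fs';
    import weaviate from 'weaviate-client';

    const client = await weaviate.connectToWeaviateCloud(process.env.WEAVIATE_URL, {
      authCredentials: new weaviate.ApiKey(process.env.WEAVIATE_API_KEY),
      headers: { 'X-OpenAI-Api-Key': process.env.OPENAI_API_KEY }, // used by vectorizer/generative modules
    });

    // Load the structured evaluation dataset and create the evaluator.
    const evaluationData = JSON.parse(fs.readFileSync('evaluation_data.json', 'utf8'));
    const evaluator = new QueryEvaluator(client, evaluationData); // QueryEvaluator is a hypothetical class name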

🧩 Phase 2: Evaluate All Methods

This is the core loop:

  • For each category and query

  • Run all query methods

  • Calculate 3 metrics (Precision, Recall, Relevancy)

  • Log & store results

🧩 Phase 3: Result Analysis + Export

This phase:

  • Aggregates scores across all queries

  • Calculates the final average Precision, Recall, Response Relevancy, and combined Score (a scoring sketch follows this list)

  • Finds the best method

  • Writes to CSV and JSON
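
The combined score in the tables below is consistent with a simple unweighted mean of the three metrics; for example, (0.89 + 0.82 + 0.86) / 3 ≈ 0.856, matching the recommendations.json sample above. A minimal sketch under that assumption:

    // Assumed scoring rule: unweighted mean of the three RAGAS metrics.
    function averageScore({ contextPrecision, contextRecall, responseRelevancy }) {
      return (contextPrecision + contextRecall + responseRelevancy) / 3;
    }

    averageScore({ contextPrecision: 0.89, contextRecall: 0.82, responseRelevancy: 0.86 }); // ≈ 0.856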

Detailed Pipeline: https://excalidraw.com/

The following table summarizes the performance of the query methods across grievance subcategories, excluding Generative Search, which shows distinct performance characteristics because it synthesizes answers rather than only retrieving them.

| Category | Method | Avg Score | Precision | Recall | Relevancy |
|---|---|---|---|---|---|
| Mobile Related > Call Drop | semantic | 0.949 | 1.00 | 1.00 | 0.846 |
| | hybrid | 0.949 | 1.00 | 1.00 | 0.846 |
| | keyword | 0.943 | 1.00 | 1.00 | 0.830 |
| | vector | 0.911 | 1.00 | 1.00 | 0.732 |
| | reranking | 0.944 | 1.00 | 1.00 | 0.833 |
| | filtered | 0.949 | 1.00 | 1.00 | 0.846 |
| Mobile Related > Improper Network Coverage | semantic | 0.934 | 0.96 | 1.00 | 0.842 |
| | hybrid | 0.930 | 0.96 | 1.00 | 0.829 |
| | keyword | 0.943 | 1.00 | 1.00 | 0.829 |
| | vector | 0.847 | 0.84 | 0.986 | 0.716 |
| | reranking | 0.943 | 1.00 | 1.00 | 0.829 |
| | filtered | 0.937 | 1.00 | 1.00 | 0.812 |
| Mobile Related > Data Speed lower than committed | semantic | 0.956 | 1.00 | 1.00 | 0.868 |
| | hybrid | 0.956 | 1.00 | 1.00 | 0.868 |
| | keyword | 0.943 | 0.96 | 1.00 | 0.868 |
| | vector | 0.877 | 0.92 | 0.985 | 0.726 |
| | reranking | 0.939 | 1.00 | 1.00 | 0.817 |
| | filtered | 0.933 | 1.00 | 1.00 | 0.800 |
| Mobile Related > Mobile Number Portability (MNP) | semantic | 0.941 | 1.00 | 1.00 | 0.822 |
| | hybrid | 0.938 | 1.00 | 1.00 | 0.814 |
| | keyword | 0.936 | 1.00 | 1.00 | 0.808 |
| | vector | 0.907 | 1.00 | 1.00 | 0.722 |
| | reranking | 0.753 | 0.80 | 0.80 | 0.659 |
| | filtered | 0.929 | 1.00 | 1.00 | 0.786 |
| Mobile Related > UCC related complaints | semantic | 0.921 | 0.96 | 1.00 | 0.804 |
| | hybrid | 0.921 | 0.96 | 1.00 | 0.804 |
| | keyword | 0.922 | 0.96 | 1.00 | 0.806 |
| | vector | 0.885 | 0.96 | 0.96 | 0.736 |
| | reranking | 0.747 | 0.80 | 0.80 | 0.641 |
| | filtered | 0.926 | 1.00 | 1.00 | 0.779 |
| Mobile Related > Activation/Deactivation of VAS | semantic | 0.751 | 0.80 | 0.80 | 0.653 |
| | hybrid | 0.751 | 0.80 | 0.80 | 0.653 |
| | keyword | 0.751 | 0.80 | 0.80 | 0.653 |
| | vector | 0.903 | 0.96 | 1.00 | 0.750 |
| | reranking | 0.939 | 1.00 | 1.00 | 0.816 |
| | filtered | 0.923 | 1.00 | 1.00 | 0.770 |
| Mobile Related > Activation/Deactivation/Fault of SIM | semantic | 0.945 | 1.00 | 1.00 | 0.834 |
| | hybrid | 0.754 | 0.80 | 0.80 | 0.663 |
| | keyword | 0.944 | 1.00 | 1.00 | 0.832 |
| | vector | 0.912 | 1.00 | 1.00 | 0.737 |
| | reranking | 0.944 | 1.00 | 1.00 | 0.831 |
| | filtered | 0.925 | 1.00 | 1.00 | 0.775 |

A tabular report for each individual query is provided in evaluation_results.csv.

Summary Evaluation Report and Analysis

Output Files Summary:

  • evaluation_data.json: Contains metadata and user queries for telecom grievance categories.

  • evaluation_results.csv: Provides raw evaluation scores per query per method.

  • recommendations.json: Offers aggregated performance metrics (Precision, Recall, Response Relevancy, and Average Score) for each category and query method.

Evaluation Analysis Report: Weaviate Search Method Performance

📌 Overview:

The evaluation rigorously compares various search and retrieval methods on telecommunications grievance data. The primary objective is to assess each method's effectiveness in retrieving relevant and helpful results, leveraging the three crucial RAGAS metrics: Context Precision, Context Recall, and Response Relevancy.

📂 Categories Evaluated:

| Category | Subcategory | No. of Queries |
|---|---|---|
| Mobile Related | Call Drop | 5 |
| Mobile Related | Improper Network Coverage | 5 |
| Mobile Related | Data Speed lower than committed | 5 |
| Mobile Related | Mobile Number Portability (MNP) | 5 |
| Mobile Related | UCC related complaints | 5 |
| Mobile Related | Activation/Deactivation of Value Added Services | 5 |
| Mobile Related | Activation/Deactivation/Fault of SIM Card | 5 |

Summary of Best Methods Per Category (Including Generative Search):

This summary highlights the top-performing methods, with Generative Search demonstrating exceptional performance across the board.

| Category | Best Method | Avg Score | Precision | Recall | Response Relevancy |
|---|---|---|---|---|---|
| Call Drop | Generative | 0.961 | 1.0 | 1.0 | 0.882 |
| Improper Network Coverage | Generative | 0.946 | 0.96 | 1.0 | 0.878 |
| Data Speed lower than committed | Generative | 0.958 | 1.0 | 1.0 | 0.874 |
| Mobile Number Portability (MNP) | Generative | 0.958 | 1.0 | 1.0 | 0.874 |
| UCC related complaints | Generative | 0.947 | 0.96 | 1.0 | 0.881 |
| Activation/Deactivation of VAS | Generative | 0.965 | 1.0 | 1.0 | 0.896 |
| SIM Card Activation/Deactivation/Fault | Generative | 0.967 | 1.0 | 1.0 | 0.902 |

It is important to note that Generative search consistently outperforms all other methods across every category evaluated.

Key Metrics Definitions:

  • Context Precision: Proportion of relevant chunks in retrieved contexts.

  • Context Recall: How many relevant pieces of information were retrieved.

  • Response Relevancy: How well the response addresses the user's query.

Detailed Category-wise Breakdown (Excluding Generative Search):

For a more granular view of non-generative methods, the following breakdown provides insights into each method's performance within specific categories.

  • Category: Mobile Related > Call Drop

    • Best Method (Excluding Generative): Semantic / Hybrid / Filtered (Tie at 0.949)

  • Category: Mobile Related > Improper Network Coverage

    • Best Method (Excluding Generative): Keyword / Reranking (Tie at 0.943)

  • Category: Mobile Related > Data Speed Lower Than Committed

    • Best Method (Excluding Generative): Semantic / Hybrid (Tie at 0.956)

  • Category: Mobile Related > Mobile Number Portability (MNP)

    • Best Method (Excluding Generative): Semantic (0.941)

  • Category: Mobile Related > UCC Related Complaints

    • Best Method (Excluding Generative): Filtered (0.926)

  • Category: Mobile Related > VAS Activation/Deactivation Without Consent

    • Best Method (Excluding Generative): Reranking (0.939)

  • Category: Mobile Related > SIM Card Activation/Deactivation/Fault

    • Best Method (Excluding Generative): Semantic (0.945)

Overall Summary (Best Non-Generative Methods):

| Category | Best Non-Generative Method(s) | Score |
|---|---|---|
| Call Drop | Semantic / Hybrid / Filtered | 0.949 |
| Improper Network Coverage | Keyword / Reranking | 0.943 |
| Data Speed Lower Than Committed | Semantic / Hybrid | 0.956 |
| Mobile Number Portability (MNP) | Semantic | 0.941 |
| UCC Related Complaints | Filtered | 0.926 |
| VAS Activation/Deactivation Without Consent | Reranking | 0.939 |
| SIM Card Activation/Deactivation/Fault | Semantic | 0.945 |

📌 Insights from Evaluation:

  • Semantic Search Consistency: Semantic search consistently demonstrated strong performance, particularly in scenarios where the context was straightforward and vocabulary overlap was high.

  • Filtered Search Effectiveness: Filtered search showed surprisingly good results in categories like "UCC related complaints" and "Call Drop," likely attributable to precise taxonomy and well-labeled data within those categories.

  • Vector Search Limitations: While sometimes achieving perfect precision/recall, Vector search generally lagged in overall response relevancy. This suggests that while it can retrieve relevant contexts, the directness or utility of the generated response might be lower without further processing (like reranking or generative augmentation).

  • Reranking for Precision: Reranking methods proved valuable in specific edge cases where an initial search might return good results, but a finer-grained prioritization was required for optimal accuracy and user satisfaction. This is particularly relevant for sensitive or critical queries.

Project 2: Amendment Document Structuring Platform

Overview

This project focuses on structuring amendment documents to build a robust platform for tracking legal amendments. The primary goal is to give legal professionals and the general public a clearer, more efficient view of changes within legal articles, ultimately supporting better access to justice.

Project Scope and Impact:

The current legal landscape is characterized by frequent amendments to articles and laws, making it challenging for lawyers, judges, and citizens to keep track of the latest versions and their implications. This project aims to address this by:

  1. Extracting and Structuring Amendment Information: Developing a system to parse and structure data from legal amendment documents (an illustrative record follows this list). This includes identifying:

    • The original article/section being amended.

    • The specific changes made (additions, deletions, modifications).

    • The effective date of the amendment.

    • The source or act introducing the amendment.

    • The context and rationale behind the amendment (if available).

  2. Building a Track Record Platform: Designing and implementing a platform that centralizes this structured amendment information. This platform will:

    • Version Control: Maintain a comprehensive history of all amendments to any given article.

    • Comparisons: Allow users to easily compare different versions of an article side-by-side to highlight changes.

    • Searchability: Enable efficient searching for amendments based on keywords, dates, article numbers, or topics.

    • Contextualization: Provide tools to understand the impact of amendments on related articles or legal provisions.

    • User-Friendly Interface: Offer an intuitive interface for legal professionals and the public to navigate complex legal changes.
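
As an illustrative example only, a single structured amendment record might look like the sketch below; every field name and value here is an assumption, since the project's schema has not been finalized.

    {
      "amended_provision": "Section 12(3) of the principal Act",
      "amendment_type": "modification",
      "changes": {
        "omitted_text": "thirty days",
        "substituted_text": "sixty days"
      },
      "effective_date": "2024-04-01",
      "amending_act": "Amendment Act of 2024 (placeholder)",
      "rationale": "Statement of objects and reasons, if available",
      "related_provisions": ["Section 12(1)", "Section 14"]
    }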

Objectives

  • Enhance Transparency: Make legal amendments more accessible and understandable to a wider audience.

  • Improve Legal Accuracy: Reduce the risk of misinterpreting or misapplying laws due to outdated information.

  • Streamline Legal Research: Significantly cut down the time and effort required for legal professionals to research statutory changes.

  • Support Legal Aid and Justice: Empower lawyers and legal aid organizations to provide more informed and effective assistance.

  • Foster Compliance: Help individuals and organizations stay compliant with the latest legal requirements.

Technical Approach (Anticipated)

While specific technical details for this project were not provided, a typical approach for structuring legal documents would involve:

  • Natural Language Processing (NLP): Utilizing NLP techniques for entity recognition (e.g., article numbers, dates, legal terms), relationship extraction (e.g., which amendment affects which article), and change detection (identifying specific textual modifications); a toy regex-based sketch follows this list.

  • Machine Learning (ML): Employing ML models for classification of amendment types, sentiment analysis of changes (if applicable), and potentially for predicting the impact of new amendments.

  • Knowledge Graph/Ontology Development: Building a knowledge graph to represent legal articles, acts, and their interconnections, allowing for more intelligent querying and contextual understanding of amendments.

  • Database Design: Designing a robust database schema to store the structured amendment data, supporting versioning and efficient retrieval.

  • Web Development Frameworks: Using modern web development frameworks to build the user-facing platform with interactive features.
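
As a deliberately simplistic, hypothetical illustration of the entity-recognition step, the sketch below uses plain regular expressions; a real implementation would rely on proper NLP models and a richer grammar of legal drafting conventions.

    // Extract rough article/section references and dates from amendment text.
    function extractEntities(text) {
      const provisionRefs = text.match(/\b(?:Article|Section)\s+\d+[A-Z]?(?:\(\d+\))?/gi) || [];
      const dates = text.match(/\b\d{1,2}(?:st|nd|rd|th)?\s+\w+,?\s+\d{4}\b/g) || [];
      return { provisionRefs, dates };
    }

    const sample = 'In Section 12(3), for the words "thirty days", the words "sixty days" ' +
                   'shall be substituted, with effect from 1st April, 2024.';
    console.log(extractEntities(sample));
    // → { provisionRefs: [ 'Section 12(3)' ], dates: [ '1st April, 2024' ] }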

Expected Outcomes

The successful implementation of the Amendment Document Structuring Platform is expected to yield the following significant outcomes:

  • A centralized, searchable repository of all legal amendments.

  • Tools for visualizing legislative changes over time.

  • Improved efficiency for legal professionals in tracking and analyzing amendments.

  • Enhanced accuracy in legal advice and judicial decisions.

  • Greater public understanding and access to up-to-date legal information.

  • A foundation for future legal tech innovations, such as AI-powered legal assistants.

