Initial Guide

Query Method Evaluation System

Overview

This evaluation system helps you determine the best query method for each category in your grievance system using RAGAS metrics (Context Precision, Context Recall, and Response Relevancy).

Features

  • 9 Query Methods: Semantic, Hybrid, Keyword, Generative, Vector, Multiple Vectors, Reranking, Aggregate, and Filtered search

  • 3 RAGAS Metrics: Context Precision, Context Recall, Response Relevancy

  • Category-wise Analysis: Find the best method for each grievance category

  • Comprehensive Reporting: CSV exports and JSON recommendations

  • Scalable Evaluation: Handle large datasets efficiently

Setup Instructions

1. Environment Setup

Create a .env file in your project root:

WEAVIATE_URL=https://your-cluster.weaviate.network
WEAVIATE_API_KEY=your-weaviate-api-key
OPENAI_API_KEY=your-openai-api-key
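
For reference, here is a minimal connection sketch using these variables. It assumes an ESM setup, the v3 weaviate-client package (installed in step 2 below), and its connectToWeaviateCloud helper:

import 'dotenv/config';
import weaviate from 'weaviate-client';

// Connect to Weaviate Cloud with the credentials from .env; the OpenAI key is
// passed as a header so server-side vectorization and generation can use it.
const client = await weaviate.connectToWeaviateCloud(process.env.WEAVIATE_URL, {
  authCredentials: new weaviate.ApiKey(process.env.WEAVIATE_API_KEY),
  headers: { 'X-OpenAI-Api-Key': process.env.OPENAI_API_KEY },
});

console.log(await client.isReady());
await client.close();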

2. Install Dependencies

npm install weaviate-client dotenv node-fetch csv-writer

3. Data Preparation

Your JSON data should follow this structure:

[
  {
    "department_code": "DOTEL",
    "department_name": "Telecommunications",
    "category": "Mobile Related",
    "sub_category_1": "Call Drop",
    "description": "Detailed description of the issue...",
    "user_queries": [
      "My phone calls keep getting dropped. What can I do?",
      "Why does my cell phone keep disconnecting during calls?"
    ]
  }
]

4. Run Data Preparation

node data_preparation.js
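
As a rough sketch, the data preparation step could read the JSON from step 3 and flatten each record into one evaluation row per user query. The file names (grievances.json, evaluation_dataset.json) are assumptions for illustration:

import { readFileSync, writeFileSync } from 'node:fs';

// Hypothetical input file following the structure shown in step 3.
const records = JSON.parse(readFileSync('grievances.json', 'utf8'));

// One evaluation row per user query, keeping the description as the
// reference context used later for Context Recall.
const rows = records.flatMap((record) =>
  record.user_queries.map((query) => ({
    department: record.department_name,
    category: record.category,
    subCategory: record.sub_category_1,
    query,
    reference: record.description,
  }))
);

writeFileSync('evaluation_dataset.json', JSON.stringify(rows, null, 2));
console.log(`Prepared ${rows.length} evaluation rows`);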

5. Run Evaluation

node query_evaluation.js

Query Methods Evaluated

1. Semantic Search (nearText)

  • Best for: Natural language queries, conceptual searches

  • Use case: When users ask questions in conversational language

  • Example: "My internet is slow" → finds speed-related issues
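
A minimal nearText sketch, continuing from the connection shown in the setup section (the Grievance collection name is an assumption):

// Semantic search: Weaviate vectorizes the natural-language query server-side.
const grievances = client.collections.get('Grievance');

const semantic = await grievances.query.nearText('My internet is slow', {
  limit: 5,
  returnMetadata: ['distance'],
});

semantic.objects.forEach((o) =>
  console.log(o.properties.category, o.metadata?.distance)
);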

2. Hybrid Search

  • Best for: Balanced approach combining semantic and keyword matching

  • Use case: General purpose queries with mixed search patterns

  • Example: Combines semantic understanding with exact keyword matches
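
A minimal hybrid sketch, using the same assumed Grievance collection as above:

// Hybrid search: alpha = 0 is pure BM25 keyword matching, alpha = 1 is pure vector search.
const hybrid = await grievances.query.hybrid('calls keep dropping', {
  alpha: 0.5,
  limit: 5,
});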

3. Keyword Search (BM25)

  • Best for: Exact term matching, technical queries

  • Use case: When users search for specific terms or codes

  • Example: "error code 404" → finds exact technical matches
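
A minimal BM25 sketch; the property names passed to queryProperties are assumptions:

// Keyword search: exact term matching restricted to selected text properties.
const keyword = await grievances.query.bm25('error code 404', {
  queryProperties: ['description', 'sub_category_1'],
  limit: 5,
});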

4. Generative Search (RAG)

  • Best for: Complex queries requiring synthesized answers

  • Use case: When users need explanations or detailed responses

  • Example: "What should I do about dropped calls?" → generates actionable advice
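
A minimal generative (RAG) sketch, assuming a generative module (e.g. OpenAI) is configured on the collection and that the v3 client's generate.nearText signature of (query, generateOptions, queryOptions) applies:

// Retrieve matching grievances, then synthesize one grouped answer from them.
const rag = await grievances.generate.nearText(
  'What should I do about dropped calls?',
  { groupedTask: 'Using these grievance records, give the user actionable advice.' },
  { limit: 3 }
);

console.log(rag.generated);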

5. Vector Search (nearVector)

  • Best for: Semantic similarity without text preprocessing

  • Use case: Finding conceptually similar content

  • Example: Raw vector-based similarity matching
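
A minimal nearVector sketch: the query is embedded client-side, then matched against stored vectors. The embedding model name is an assumption and must match the one used to vectorize the collection:

import fetch from 'node-fetch';

// Embed the query text with OpenAI, then search by raw vector similarity.
const response = await fetch('https://api.openai.com/v1/embeddings', {
  method: 'POST',
  headers: {
    Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({ model: 'text-embedding-3-small', input: 'My internet is slow' }),
});
const queryVector = (await response.json()).data[0].embedding;

const vectorResult = await grievances.query.nearVector(queryVector, { limit: 5 });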

6. Multiple Target Vectors

  • Best for: Multi-aspect queries or complex categorization

  • Use case: When different aspects of content need different vector representations

  • Example: Technical content + emotional sentiment vectors
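
A minimal multi-target sketch, assuming the collection defines named vectors (hypothetically "technical" and "sentiment") and a Weaviate version that supports multi-target search:

// Search across two named vector spaces at once.
const multiVector = await grievances.query.nearText('angry about constant call drops', {
  targetVector: ['technical', 'sentiment'],
  limit: 5,
});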

7. Reranking (Hybrid + Rerank)

  • Best for: High-precision requirements, quality over quantity

  • Use case: When accuracy is more important than speed

  • Example: Legal or medical queries requiring high precision
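
A minimal rerank sketch, assuming a reranker module is enabled on the collection and the client's rerank query option is available:

// Hybrid retrieval followed by reranking on the description property.
const reranked = await grievances.query.hybrid('billing dispute after contract termination', {
  limit: 10,
  rerank: {
    property: 'description',
    query: 'billing dispute after contract termination',
  },
});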

8. Aggregate Data

  • Best for: Statistical queries, counting, summarization

  • Use case: Analytics and reporting queries

  • Example: "How many network issues were reported?"
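
A minimal aggregation sketch; the "Network Related" category value is an assumption:

// Count grievances in one category instead of retrieving documents.
const stats = await grievances.aggregate.overAll({
  filters: grievances.filter.byProperty('category').equal('Network Related'),
});

console.log(`Reported network issues: ${stats.totalCount}`);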

9. Filtered Search

  • Best for: Category-specific searches, scoped queries

  • Use case: When search should be limited to specific categories

  • Example: Only searching within "Mobile Related" issues
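
A minimal filtered-search sketch; property and value names are assumptions:

// Semantic search scoped to a single category via a structured filter.
const filtered = await grievances.query.nearText('calls keep disconnecting', {
  filters: grievances.filter.byProperty('category').equal('Mobile Related'),
  limit: 5,
});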

RAGAS Metrics

Context Precision

  • What it measures: Proportion of relevant chunks in retrieved contexts

  • Formula: (Number of relevant chunks) / (Total retrieved chunks)

  • Good score: > 0.7

  • Interpretation: Higher scores mean less noise in results

Context Recall

  • What it measures: How much of the reference information is covered by the retrieved contexts

  • Formula: (Reference claims attributable to retrieved contexts) / (Total claims in reference)

  • Good score: > 0.8

  • Interpretation: Higher scores mean fewer missed relevant results

Response Relevancy

  • What it measures: How well the response addresses the user's query

  • Formula: Average cosine similarity between the user input and questions generated from the response

  • Good score: > 0.7

  • Interpretation: Higher scores mean better query-response alignment
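
The per-method "Avg Score" reported by the evaluation combines these three metrics. A sketch assuming a simple equal-weight average (the actual script may weight or round differently):

// Hypothetical aggregation of the three RAGAS metrics into one score.
function averageScore({ contextPrecision, contextRecall, responseRelevancy }) {
  return (contextPrecision + contextRecall + responseRelevancy) / 3;
}

console.log(averageScore({ contextPrecision: 0.89, contextRecall: 0.82, responseRelevancy: 0.86 })); // ≈ 0.857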

Output Files

1. evaluation_results.csv

Detailed results for each query and method combination:

  • Department, Category, Sub-category

  • User Query

  • Scores for each method (Semantic, Hybrid, etc.)

  • Individual metric scores

2. recommendations.json

Category-wise recommendations:

{
  "Mobile Related > Call Drop": {
    "bestMethod": "hybrid",
    "bestScore": 0.856,
    "methodPerformance": {
      "hybrid": {
        "averageScore": 0.856,
        "contextPrecision": 0.89,
        "contextRecall": 0.82,
        "responseRelevancy": 0.86
      }
    }
  }
}
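
A small usage sketch: read the recommendations file and pick the best method for a category before issuing a live query (key format as shown above):

import { readFileSync } from 'node:fs';

const recommendations = JSON.parse(readFileSync('recommendations.json', 'utf8'));

// Look up the winning method for a category.
const key = 'Mobile Related > Call Drop';
console.log(recommendations[key].bestMethod); // "hybrid"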

3. Console Output

Real-time progress and summary:

📊 Evaluating category: Mobile Related > Call Drop
  🔍 Query: "My phone calls keep getting dropped. What can I do?"
    ✅ SEMANTIC: Avg Score = 0.823
    ✅ HYBRID: Avg Score = 0.856
    ✅ KEYWORD: Avg Score = 0.734
    ...