Initial Guide

Query Method Evaluation System

Overview

This evaluation system helps you determine the best query method for each category in your grievance system using RAGAS metrics (Context Precision, Context Recall, and Response Relevancy).

Features

  • 9 Query Methods: Semantic, Hybrid, Keyword, Generative, Vector, Multiple Vectors, Reranking, Aggregate, and Filtered search

  • 3 RAGAS Metrics: Context Precision, Context Recall, Response Relevancy

  • Category-wise Analysis: Find the best method for each grievance category

  • Comprehensive Reporting: CSV exports and JSON recommendations

  • Scalable Evaluation: Handle large datasets efficiently

Setup Instructions

1. Environment Setup

Create a .env file in your project root:

WEAVIATE_URL=https://your-cluster.weaviate.network
WEAVIATE_API_KEY=your-weaviate-api-key
OPENAI_API_KEY=your-openai-api-key
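
For reference, here is a minimal connection sketch using these variables. It assumes an ESM setup, the v3 weaviate-client package (installed in step 2 below), and its connectToWeaviateCloud helper:

import 'dotenv/config';
import weaviate from 'weaviate-client';

// Connect to Weaviate Cloud with the credentials from .env; the OpenAI key is
// passed as a header so server-side vectorization and generation can use it.
const client = await weaviate.connectToWeaviateCloud(process.env.WEAVIATE_URL, {
  authCredentials: new weaviate.ApiKey(process.env.WEAVIATE_API_KEY),
  headers: { 'X-OpenAI-Api-Key': process.env.OPENAI_API_KEY },
});

console.log(await client.isReady());
await client.close();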

2. Install Dependencies

npm install weaviate-client dotenv node-fetch csv-writer

3. Data Preparation

Your JSON data should follow this structure:

[
  {
    "department_code": "DOTEL",
    "department_name": "Telecommunications",
    "category": "Mobile Related",
    "sub_category_1": "Call Drop",
    "description": "Detailed description of the issue...",
    "user_queries": [
      "My phone calls keep getting dropped. What can I do?",
      "Why does my cell phone keep disconnecting during calls?"
    ]
  }
]

4. Run Data Preparation

node data_preparation.js
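
As a rough sketch, the data preparation step could read the JSON from step 3 and flatten each record into one evaluation row per user query. The file names (grievances.json, evaluation_dataset.json) are assumptions for illustration:

import { readFileSync, writeFileSync } from 'node:fs';

// Hypothetical input file following the structure shown in step 3.
const records = JSON.parse(readFileSync('grievances.json', 'utf8'));

// One evaluation row per user query, keeping the description as the
// reference context used later for Context Recall.
const rows = records.flatMap((record) =>
  record.user_queries.map((query) => ({
    department: record.department_name,
    category: record.category,
    subCategory: record.sub_category_1,
    query,
    reference: record.description,
  }))
);

writeFileSync('evaluation_dataset.json', JSON.stringify(rows, null, 2));
console.log(`Prepared ${rows.length} evaluation rows`);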

5. Run Evaluation

node query_evaluation.js

Query Methods Evaluated

1. Semantic Search (nearText)

  • Best for: Natural language queries, conceptual searches

  • Use case: When users ask questions in conversational language

  • Example: "My internet is slow" → finds speed-related issues
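
A minimal nearText sketch, continuing from the connection shown in the setup section (the Grievance collection name is an assumption):

// Semantic search: Weaviate vectorizes the natural-language query server-side.
const grievances = client.collections.get('Grievance');

const semantic = await grievances.query.nearText('My internet is slow', {
  limit: 5,
  returnMetadata: ['distance'],
});

semantic.objects.forEach((o) =>
  console.log(o.properties.category, o.metadata?.distance)
);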

2. Hybrid Search

  • Best for: Balanced approach combining semantic and keyword matching

  • Use case: General purpose queries with mixed search patterns

  • Example: Combines semantic understanding with exact keyword matches
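
A minimal hybrid sketch, using the same assumed Grievance collection as above:

// Hybrid search: alpha = 0 is pure BM25 keyword matching, alpha = 1 is pure vector search.
const hybrid = await grievances.query.hybrid('calls keep dropping', {
  alpha: 0.5,
  limit: 5,
});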

3. Keyword Search (BM25)

  • Best for: Exact term matching, technical queries

  • Use case: When users search for specific terms or codes

  • Example: "error code 404" → finds exact technical matches
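
A minimal BM25 sketch; the property names passed to queryProperties are assumptions:

// Keyword search: exact term matching restricted to selected text properties.
const keyword = await grievances.query.bm25('error code 404', {
  queryProperties: ['description', 'sub_category_1'],
  limit: 5,
});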

4. Generative Search (RAG)

  • Best for: Complex queries requiring synthesized answers

  • Use case: When users need explanations or detailed responses

  • Example: "What should I do about dropped calls?" → generates actionable advice
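
A minimal generative (RAG) sketch, assuming a generative module (e.g. OpenAI) is configured on the collection and that the v3 client's generate.nearText signature of (query, generateOptions, queryOptions) applies:

// Retrieve matching grievances, then synthesize one grouped answer from them.
const rag = await grievances.generate.nearText(
  'What should I do about dropped calls?',
  { groupedTask: 'Using these grievance records, give the user actionable advice.' },
  { limit: 3 }
);

console.log(rag.generated);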

5. Vector Search (nearVector)

  • Best for: Semantic similarity without text preprocessing

  • Use case: Finding conceptually similar content

  • Example: Raw vector-based similarity matching
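
A minimal nearVector sketch: the query is embedded client-side, then matched against stored vectors. The embedding model name is an assumption and must match the one used to vectorize the collection:

import fetch from 'node-fetch';

// Embed the query text with OpenAI, then search by raw vector similarity.
const response = await fetch('https://api.openai.com/v1/embeddings', {
  method: 'POST',
  headers: {
    Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({ model: 'text-embedding-3-small', input: 'My internet is slow' }),
});
const queryVector = (await response.json()).data[0].embedding;

const vectorResult = await grievances.query.nearVector(queryVector, { limit: 5 });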

6. Multiple Target Vectors

  • Best for: Multi-aspect queries or complex categorization

  • Use case: When different aspects of content need different vector representations

  • Example: Technical content + emotional sentiment vectors
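
A minimal multi-target sketch, assuming the collection defines named vectors (hypothetically "technical" and "sentiment") and a Weaviate version that supports multi-target search:

// Search across two named vector spaces at once.
const multiVector = await grievances.query.nearText('angry about constant call drops', {
  targetVector: ['technical', 'sentiment'],
  limit: 5,
});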

7. Reranking (Hybrid + Rerank)

  • Best for: High-precision requirements, quality over quantity

  • Use case: When accuracy is more important than speed

  • Example: Legal or medical queries requiring high precision
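
A minimal rerank sketch, assuming a reranker module is enabled on the collection and the client's rerank query option is available:

// Hybrid retrieval followed by reranking on the description property.
const reranked = await grievances.query.hybrid('billing dispute after contract termination', {
  limit: 10,
  rerank: {
    property: 'description',
    query: 'billing dispute after contract termination',
  },
});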

8. Aggregate Data

  • Best for: Statistical queries, counting, summarization

  • Use case: Analytics and reporting queries

  • Example: "How many network issues were reported?"
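
A minimal aggregation sketch; the "Network Related" category value is an assumption:

// Count grievances in one category instead of retrieving documents.
const stats = await grievances.aggregate.overAll({
  filters: grievances.filter.byProperty('category').equal('Network Related'),
});

console.log(`Reported network issues: ${stats.totalCount}`);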

9. Filtered Search

  • Best for: Category-specific searches, scoped queries

  • Use case: When search should be limited to specific categories

  • Example: Only searching within "Mobile Related" issues
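
A minimal filtered-search sketch; property and value names are assumptions:

// Semantic search scoped to a single category via a structured filter.
const filtered = await grievances.query.nearText('calls keep disconnecting', {
  filters: grievances.filter.byProperty('category').equal('Mobile Related'),
  limit: 5,
});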

RAGAS Metrics

Context Precision

  • What it measures: Proportion of relevant chunks in retrieved contexts

  • Formula: (Number of relevant chunks) / (Total retrieved chunks)

  • Good score: > 0.7

  • Interpretation: Higher scores mean less noise in results

Context Recall

  • What it measures: How much of the reference information is covered by the retrieved contexts

  • Formula: (Reference claims attributable to retrieved contexts) / (Total claims in reference)

  • Good score: > 0.8

  • Interpretation: Higher scores mean fewer missed relevant results

Response Relevancy

  • What it measures: How well the response addresses the user's query

  • Formula: Average cosine similarity between the user input and questions generated from the response

  • Good score: > 0.7

  • Interpretation: Higher scores mean better query-response alignment
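
The per-method "Avg Score" reported by the evaluation combines these three metrics. A sketch assuming a simple equal-weight average (the actual script may weight or round differently):

// Hypothetical aggregation of the three RAGAS metrics into one score.
function averageScore({ contextPrecision, contextRecall, responseRelevancy }) {
  return (contextPrecision + contextRecall + responseRelevancy) / 3;
}

console.log(averageScore({ contextPrecision: 0.89, contextRecall: 0.82, responseRelevancy: 0.86 })); // ≈ 0.857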

Output Files

1. evaluation_results.csv

Detailed results for each query and method combination:

  • Department, Category, Sub-category

  • User Query

  • Scores for each method (Semantic, Hybrid, etc.)

  • Individual metric scores

2. recommendations.json

Category-wise recommendations:

{
  "Mobile Related > Call Drop": {
    "bestMethod": "hybrid",
    "bestScore": 0.856,
    "methodPerformance": {
      "hybrid": {
        "averageScore": 0.856,
        "contextPrecision": 0.89,
        "contextRecall": 0.82,
        "responseRelevancy": 0.86
      }
    }
  }
}
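
A small usage sketch: read the recommendations file and pick the best method for a category before issuing a live query (key format as shown above):

import { readFileSync } from 'node:fs';

const recommendations = JSON.parse(readFileSync('recommendations.json', 'utf8'));

// Look up the winning method for a category.
const key = 'Mobile Related > Call Drop';
console.log(recommendations[key].bestMethod); // "hybrid"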

3. Console Output

Real-time progress and summary:

📊 Evaluating category: Mobile Related > Call Drop
  🔍 Query: "My phone calls keep getting dropped. What can I do?"
    ✅ SEMANTIC: Avg Score = 0.823
    ✅ HYBRID: Avg Score = 0.856
    ✅ KEYWORD: Avg Score = 0.734
    ...