Initial Guide
Query Method Evaluation System
Overview
This evaluation system helps you determine the best query method for each category in your grievance system using RAGAS metrics (Context Precision, Context Recall, and Response Relevancy).
Features
9 Query Methods: Semantic, Hybrid, Keyword, Generative, Vector, Multiple Vectors, Reranking, Aggregate, and Filtered search
3 RAGAS Metrics: Context Precision, Context Recall, Response Relevancy
Category-wise Analysis: Find the best method for each grievance category
Comprehensive Reporting: CSV exports and JSON recommendations
Scalable Evaluation: Handle large datasets efficiently
Setup Instructions
1. Environment Setup
Create a .env file in your project root:
WEAVIATE_URL=https://your-cluster.weaviate.network
WEAVIATE_API_KEY=your-weaviate-api-key
OPENAI_API_KEY=your-openai-api-key
2. Install Dependencies
npm install weaviate-client dotenv node-fetch csv-writer
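A quick sanity check that the environment variables and dependencies work together might look like the following (a minimal sketch assuming the v3 weaviate-client API and an ES-module project; check_connection.js is a hypothetical filename):
// check_connection.js — hypothetical connection check (assumes "type": "module" in package.json)
import 'dotenv/config';
import weaviate from 'weaviate-client';

const client = await weaviate.connectToWeaviateCloud(process.env.WEAVIATE_URL, {
  authCredentials: new weaviate.ApiKey(process.env.WEAVIATE_API_KEY),
  // Passed through to OpenAI-backed vectorizer/generative modules.
  headers: { 'X-OpenAI-Api-Key': process.env.OPENAI_API_KEY },
});

console.log('Weaviate ready:', await client.isReady());
await client.close();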
3. Data Preparation
Your JSON data should follow this structure:
[
  {
    "department_code": "DOTEL",
    "department_name": "Telecommunications",
    "category": "Mobile Related",
    "sub_category_1": "Call Drop",
    "description": "Detailed description of the issue...",
    "user_queries": [
      "My phone calls keep getting dropped. What can I do?",
      "Why does my cell phone keep disconnecting during calls?"
    ]
  }
]
4. Run Data Preparation
node data_preparation.js
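The exact contents of data_preparation.js are project-specific. Conceptually, it flattens each user query into its own object and batch-imports the result; a minimal sketch, assuming a collection named Grievance and the client from the connection check above (the filename and property names are illustrative):
// Sketch only: one Weaviate object per user query, batch-inserted into a hypothetical Grievance collection.
import { readFile } from 'node:fs/promises';

const records = JSON.parse(await readFile('./grievance_data.json', 'utf-8'));
const grievances = client.collections.get('Grievance');

const objects = records.flatMap((r) =>
  r.user_queries.map((q) => ({
    department_code: r.department_code,
    department_name: r.department_name,
    category: r.category,
    sub_category_1: r.sub_category_1,
    description: r.description,
    user_query: q,
  }))
);

await grievances.data.insertMany(objects);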
5. Run Evaluation
node query_evaluation.js
Query Methods Evaluated
1. Semantic Search (nearText)
Best for: Natural language queries, conceptual searches
Use case: When users ask questions in conversational language
Example: "My internet is slow" → finds speed-related issues
2. Hybrid Search
Best for: Balanced approach combining semantic and keyword matching
Use case: General purpose queries with mixed search patterns
Example: Combines semantic understanding with exact keyword matches
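Sketch of the corresponding call, reusing the grievances handle from the semantic example (the alpha value is illustrative):
// Hybrid search: alpha blends vector scoring (1.0) with keyword scoring (0.0).
const res = await grievances.query.hybrid('dropped calls on my mobile', {
  alpha: 0.5,
  limit: 5,
});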
3. Keyword Search (BM25)
Best for: Exact term matching, technical queries
Use case: When users search for specific terms or codes
Example: "error code 404" → finds exact technical matches
4. Generative Search (RAG)
Best for: Complex queries requiring synthesized answers
Use case: When users need explanations or detailed responses
Example: "What should I do about dropped calls?" → generates actionable advice
5. Vector Similarity Search
Best for: Semantic similarity without text preprocessing
Use case: Finding conceptually similar content
Example: Raw vector-based similarity matching
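Sketch, where embedQuery is a hypothetical helper that returns an embedding (e.g. from the OpenAI embeddings API):
// Vector search: supply a query embedding directly instead of query text.
const queryVector = await embedQuery('billing dispute'); // hypothetical helper returning number[]
const res = await grievances.query.nearVector(queryVector, { limit: 5 });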
6. Multiple Target Vectors
Best for: Multi-aspect queries or complex categorization
Use case: When different aspects of content need different vector representations
Example: Technical content + emotional sentiment vectors
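Sketch only: this assumes the collection was created with named vectors (hypothetically description_vector and query_vector) and a Weaviate version that supports multi-target search:
// Multi-target search across two hypothetical named vectors.
const res = await grievances.query.nearText('no signal in my area', {
  targetVector: ['description_vector', 'query_vector'],
  limit: 5,
});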
7. Reranking (Hybrid + Rerank)
Best for: High-precision requirements, quality over quantity
Use case: When accuracy is more important than speed
Example: Legal or medical queries requiring high precision
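Sketch, assuming a reranker module (e.g. reranker-cohere) is enabled for the collection:
// Hybrid retrieval followed by a reranker pass over the description property.
const res = await grievances.query.hybrid('contract termination penalty', {
  limit: 20, // retrieve broadly, then let the reranker reorder by relevance
  rerank: { property: 'description', query: 'contract termination penalty' },
});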
8. Aggregate Data
Best for: Statistical queries, counting, summarization
Use case: Analytics and reporting queries
Example: "How many network issues were reported?"
9. Filtered Search
Best for: Category-specific searches, scoped queries
Use case: When search should be limited to specific categories
Example: Only searching within "Mobile Related" issues
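Sketch, scoping a semantic query to one category:
// Filtered semantic search: restrict nearText results to a single category.
const res = await grievances.query.nearText('calls keep dropping', {
  filters: grievances.filter.byProperty('category').equal('Mobile Related'),
  limit: 5,
});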
RAGAS Metrics
Context Precision
What it measures: Proportion of relevant chunks in retrieved contexts
Formula: (Number of relevant chunks) / (Total retrieved chunks)
Good score: > 0.7
Interpretation: Higher scores mean less noise in results
Context Recall
What it measures: How many relevant pieces of information were retrieved
Formula: (Claims in the reference answer supported by the retrieved context) / (Total claims in the reference answer)
Good score: > 0.8
Interpretation: Higher scores mean fewer missed relevant results
Response Relevancy
What it measures: How well the response addresses the user's query
Formula: Average cosine similarity between the user's query and questions generated back from the response
Good score: > 0.7
Interpretation: Higher scores mean better query-response alignment
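As an illustration of how these three metrics can be turned into the per-method averages reported below, here is a small sketch (the unweighted average and the cosine-similarity helper are assumptions of this guide's scoring, not part of RAGAS itself):
// Cosine similarity between two embedding vectors, as used for Response Relevancy.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Unweighted average of the three metrics for one query/method pair.
function averageScore({ contextPrecision, contextRecall, responseRelevancy }) {
  return (contextPrecision + contextRecall + responseRelevancy) / 3;
}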
Output Files
1. evaluation_results.csv
Detailed results for each query and method combination:
Department, Category, Sub-category
User Query
Scores for each method (Semantic, Hybrid, etc.)
Individual metric scores
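One way such a file can be produced with the csv-writer dependency (the column ids and the rows variable are illustrative):
// Sketch: write one row per (query, method) result using csv-writer.
import { createObjectCsvWriter } from 'csv-writer';

const csvWriter = createObjectCsvWriter({
  path: 'evaluation_results.csv',
  header: [
    { id: 'department', title: 'Department' },
    { id: 'category', title: 'Category' },
    { id: 'subCategory', title: 'Sub-category' },
    { id: 'userQuery', title: 'User Query' },
    { id: 'method', title: 'Method' },
    { id: 'contextPrecision', title: 'Context Precision' },
    { id: 'contextRecall', title: 'Context Recall' },
    { id: 'responseRelevancy', title: 'Response Relevancy' },
  ],
});

await csvWriter.writeRecords(rows); // rows: array of objects keyed by the ids above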
2. recommendations.json
Category-wise recommendations:
{
  "Mobile Related > Call Drop": {
    "bestMethod": "hybrid",
    "bestScore": 0.856,
    "methodPerformance": {
      "hybrid": {
        "averageScore": 0.856,
        "contextPrecision": 0.89,
        "contextRecall": 0.82,
        "responseRelevancy": 0.86
      }
    }
  }
}
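A small sketch of consuming this file to print the winning method per category (field names match the example above):
// Sketch: load recommendations.json and print the best method for each category.
import { readFile } from 'node:fs/promises';

const recommendations = JSON.parse(await readFile('./recommendations.json', 'utf-8'));
for (const [category, rec] of Object.entries(recommendations)) {
  console.log(`${category}: ${rec.bestMethod} (score ${rec.bestScore})`);
}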
3. Console Output
Real-time progress and summary:
📊 Evaluating category: Mobile Related > Call Drop
🔍 Query: "My phone calls keep getting dropped. What can I do?"
✅ SEMANTIC: Avg Score = 0.823
✅ HYBRID: Avg Score = 0.856
✅ KEYWORD: Avg Score = 0.734
...