Report: Query Methods Evaluation

The evaluation produced three files:

  1. evaluation_data.json – Contains metadata and user queries for telecom grievance categories.

  2. evaluation_results.csv – Raw evaluation scores per query per method.

  3. recommendations.json – Aggregated performance metrics (Precision, Recall, Response Relevancy, and Average Score) for each category and query method.
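
For reproducibility, the step from the raw per-query scores in evaluation_results.csv to the aggregates in recommendations.json can be sketched as below. This is a minimal illustration that assumes columns named category, method, context_precision, context_recall, and response_relevancy; the actual column names in the CSV may differ.

```python
import pandas as pd

# Load the raw per-query scores (assumed columns: category, method,
# context_precision, context_recall, response_relevancy).
df = pd.read_csv("evaluation_results.csv")

metrics = ["context_precision", "context_recall", "response_relevancy"]

# Average each RAGAS metric per category/method pair.
agg = df.groupby(["category", "method"])[metrics].mean().reset_index()

# The Avg Score figures in the tables below are consistent with the
# unweighted mean of the three metrics.
agg["avg_score"] = agg[metrics].mean(axis=1)

# Write out the aggregate that recommendations.json corresponds to.
agg.to_json("recommendations.json", orient="records", indent=2)
```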


📊 Evaluation Analysis Report: Weaviate Search Method Performance

📌 Overview

The evaluation compares multiple search and retrieval methods applied over grievance categories from the telecommunications sector, focusing on how well each method retrieves relevant and helpful results. Each method is scored using three RAGAS metrics:

  • Context Precision – what proportion of the retrieved contexts is relevant to the query?

  • Context Recall – do the retrieved contexts cover all the factual claims in the reference answer?

  • Response Relevancy – does the generated or retrieved response actually answer the user query?

Goal: Determine the best-performing query method for each grievance category.
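
For context, a per-query evaluation of this kind can be run with the ragas library; the snippet below is a minimal sketch, not the exact evaluation harness used here. The sample query, contexts, and answers are invented for illustration, and metric import names vary slightly across ragas versions (answer_relevancy corresponds to Response Relevancy).

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, context_recall

# One illustrative Call Drop query with its retrieved contexts,
# generated answer, and reference answer (all invented for this sketch).
data = {
    "question": ["Why do my calls keep dropping in my area?"],
    "contexts": [[
        "Call drops are commonly caused by weak signal strength or congested towers.",
    ]],
    "answer": ["Frequent call drops usually point to weak coverage or tower congestion."],
    "ground_truth": ["Call drops are typically due to weak signal or network congestion."],
}

# Score the query on the three metrics used throughout this report.
result = evaluate(
    Dataset.from_dict(data),
    metrics=[context_precision, context_recall, answer_relevancy],
)
print(result)
```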


📂 Categories Evaluated

| Category | Subcategory | No. of Queries |
| --- | --- | --- |
| Mobile Related | Call Drop | 5 |
| Mobile Related | Improper Network Coverage | 5 |
| Mobile Related | Data Speed Lower than Committed | 5 |
| Mobile Related | Mobile Number Portability (MNP) | 5 |
| Mobile Related | UCC (Unsolicited Commercial Communication) Related Complaints | 5 |
| Mobile Related | Activation/Deactivation of Value Added Services (VAS) | 5 |
| Mobile Related | Activation/Deactivation/Fault of SIM Card | 5 |


Summary of Best Methods Per Category

| Category | Best Method | Avg Score | Context Precision | Context Recall | Response Relevancy |
| --- | --- | --- | --- | --- | --- |
| Call Drop | Generative | 0.961 | 1.00 | 1.00 | 0.882 |
| Improper Network Coverage | Generative | 0.946 | 0.96 | 1.00 | 0.878 |
| Data Speed Lower than Committed | Generative | 0.958 | 1.00 | 1.00 | 0.874 |
| Mobile Number Portability (MNP) | Generative | 0.958 | 1.00 | 1.00 | 0.874 |
| UCC Related Complaints | Generative | 0.947 | 0.96 | 1.00 | 0.881 |
| Activation/Deactivation of VAS | Generative | 0.965 | 1.00 | 1.00 | 0.896 |
| SIM Card Activation/Deactivation/Fault | Generative | 0.967 | 1.00 | 1.00 | 0.902 |

Generative search outperforms all other methods in every category evaluated, so the per-category breakdowns below exclude it and compare only the remaining retrieval methods.


Key Metric Definitions

  • Context Precision – the proportion of retrieved contexts that are relevant to the query.

  • Context Recall – the extent to which the retrieved contexts cover the factual claims in the reference answer.

  • Response Relevancy – how directly the generated or retrieved response answers the user query.

  • Avg Score – the unweighted mean of the three metrics (e.g. Call Drop's Generative row: (1.0 + 1.0 + 0.882) / 3 ≈ 0.961).


Per-Category Metric Breakdown (Excluding Generative Search)

Call Drop

| Method | Context Precision | Context Recall | Response Relevancy | Avg Score |
| --- | --- | --- | --- | --- |
| Semantic | 1.00 | 1.00 | 0.846 | 0.949 |
| Hybrid | 1.00 | 1.00 | 0.846 | 0.949 |
| Keyword | 1.00 | 1.00 | 0.830 | 0.943 |
| Vector | 1.00 | 1.00 | 0.732 | 0.911 |
| Reranking | 1.00 | 1.00 | 0.833 | 0.944 |
| Filtered | 1.00 | 1.00 | 0.846 | 0.949 |

Best Method (Excluding Generative): Semantic / Hybrid / Filtered (tie at 0.949)


Improper Network Coverage

| Method | Context Precision | Context Recall | Response Relevancy | Avg Score |
| --- | --- | --- | --- | --- |
| Semantic | 0.96 | 1.00 | 0.842 | 0.934 |
| Hybrid | 0.96 | 1.00 | 0.829 | 0.930 |
| Keyword | 1.00 | 1.00 | 0.829 | 0.943 |
| Vector | 0.84 | 0.986 | 0.716 | 0.847 |
| Reranking | 1.00 | 1.00 | 0.829 | 0.943 |
| Filtered | 1.00 | 1.00 | 0.812 | 0.937 |

Best Method (Excluding Generative): Keyword / Reranking (tie at 0.943)


Data Speed Lower than Committed

| Method | Context Precision | Context Recall | Response Relevancy | Avg Score |
| --- | --- | --- | --- | --- |
| Semantic | 1.00 | 1.00 | 0.868 | 0.956 |
| Hybrid | 1.00 | 1.00 | 0.868 | 0.956 |
| Keyword | 0.96 | 1.00 | 0.868 | 0.943 |
| Vector | 0.92 | 0.985 | 0.726 | 0.877 |
| Reranking | 1.00 | 1.00 | 0.817 | 0.939 |
| Filtered | 1.00 | 1.00 | 0.800 | 0.933 |

Best Method (Excluding Generative): Semantic / Hybrid (tie at 0.956)


Mobile Number Portability (MNP)

| Method | Context Precision | Context Recall | Response Relevancy | Avg Score |
| --- | --- | --- | --- | --- |
| Semantic | 1.00 | 1.00 | 0.822 | 0.941 |
| Hybrid | 1.00 | 1.00 | 0.814 | 0.938 |
| Keyword | 1.00 | 1.00 | 0.808 | 0.936 |
| Vector | 1.00 | 1.00 | 0.722 | 0.907 |
| Reranking | 0.80 | 0.80 | 0.659 | 0.753 |
| Filtered | 1.00 | 1.00 | 0.786 | 0.929 |

Best Method (Excluding Generative): Semantic (0.941)


UCC Related Complaints

| Method | Context Precision | Context Recall | Response Relevancy | Avg Score |
| --- | --- | --- | --- | --- |
| Semantic | 0.96 | 1.00 | 0.804 | 0.921 |
| Hybrid | 0.96 | 1.00 | 0.804 | 0.921 |
| Keyword | 0.96 | 1.00 | 0.806 | 0.922 |
| Vector | 0.96 | 0.96 | 0.736 | 0.885 |
| Reranking | 0.80 | 0.80 | 0.641 | 0.747 |
| Filtered | 1.00 | 1.00 | 0.779 | 0.926 |

Best Method (Excluding Generative): Filtered (0.926)


Activation/Deactivation of Value Added Services (VAS)

| Method | Context Precision | Context Recall | Response Relevancy | Avg Score |
| --- | --- | --- | --- | --- |
| Semantic | 0.80 | 0.80 | 0.653 | 0.751 |
| Hybrid | 0.80 | 0.80 | 0.653 | 0.751 |
| Keyword | 0.80 | 0.80 | 0.653 | 0.751 |
| Vector | 0.96 | 1.00 | 0.750 | 0.903 |
| Reranking | 1.00 | 1.00 | 0.816 | 0.939 |
| Filtered | 1.00 | 1.00 | 0.770 | 0.923 |

Best Method (Excluding Generative): Reranking (0.939)


Activation/Deactivation/Fault of SIM Card

| Method | Context Precision | Context Recall | Response Relevancy | Avg Score |
| --- | --- | --- | --- | --- |
| Semantic | 1.00 | 1.00 | 0.834 | 0.945 |
| Hybrid | 0.80 | 0.80 | 0.663 | 0.754 |
| Keyword | 1.00 | 1.00 | 0.832 | 0.944 |
| Vector | 1.00 | 1.00 | 0.737 | 0.912 |
| Reranking | 1.00 | 1.00 | 0.831 | 0.944 |
| Filtered | 1.00 | 1.00 | 0.775 | 0.925 |

Best Method (Excluding Generative): Semantic (0.945)


Overall Summary

| Category | Best Non-Generative Method(s) | Avg Score |
| --- | --- | --- |
| Call Drop | Semantic / Hybrid / Filtered | 0.949 |
| Improper Network Coverage | Keyword / Reranking | 0.943 |
| Data Speed Lower than Committed | Semantic / Hybrid | 0.956 |
| Mobile Number Portability (MNP) | Semantic | 0.941 |
| UCC Related Complaints | Filtered | 0.926 |
| Activation/Deactivation of VAS | Reranking | 0.939 |
| SIM Card Activation/Deactivation/Fault | Semantic | 0.945 |


📌 Insights

  • Semantic search performs consistently well, especially where the query wording closely overlaps the document vocabulary and the context is straightforward.

  • Filtered search performed surprisingly well in categories such as UCC complaints and Call Drop, likely because the precise category taxonomy and labeled data make structured filters effective.

  • Vector search generally lags in response relevancy, even in categories where its precision and recall are perfect.

  • Reranking shines where the initial retrieval is good but needs finer-grained prioritization (it is the best non-generative method for VAS activation/deactivation), though it degrades sharply in the MNP and UCC categories.
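
For reference, the methods compared in this report correspond to distinct query types in Weaviate. The sketch below shows roughly how each would be issued with the Weaviate Python client (v4); the collection name "Grievance" and the property names "category" and "description" are illustrative assumptions, not taken from the actual evaluation code.

```python
import weaviate
from weaviate.classes.query import Filter, Rerank

client = weaviate.connect_to_local()  # adjust for your deployment
grievances = client.collections.get("Grievance")  # hypothetical collection
q = "frequent call drops in my area"

# Semantic: vector search over the text query (embedded server-side).
semantic = grievances.query.near_text(query=q, limit=5)

# Keyword: BM25 lexical matching.
keyword = grievances.query.bm25(query=q, limit=5)

# Hybrid: blends vector and BM25 scores (alpha=1.0 is pure vector).
hybrid = grievances.query.hybrid(query=q, alpha=0.5, limit=5)

# Vector: search with a precomputed query embedding
# (placeholder values; real vectors come from your embedding model).
vector = grievances.query.near_vector(near_vector=[0.12, -0.03, 0.57], limit=5)

# Filtered: semantic search restricted by a structured property filter.
filtered = grievances.query.near_text(
    query=q,
    filters=Filter.by_property("category").equal("Mobile Related"),
    limit=5,
)

# Reranking: rescore the initial hits with a reranker module.
reranked = grievances.query.near_text(
    query=q,
    rerank=Rerank(prop="description", query=q),
    limit=5,
)

# Generative: retrieve, then have an LLM module answer over the hits.
generative = grievances.generate.near_text(
    query=q,
    grouped_task="Summarize how this grievance is typically resolved.",
    limit=5,
)

client.close()
```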



