This phase:
Loads environment variables
Connects to Weaviate
Loads the evaluation JSON
Creates an evaluator instance
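
The setup steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the environment variable names (`WEAVIATE_URL`, `WEAVIATE_API_KEY`, `EVAL_JSON_PATH`), the `Settings` and `Evaluator` classes, and the helper names are all hypothetical. The `connect` helper assumes the v4 Weaviate Python client and a cloud-hosted instance; a local instance would use a different connect call.

```python
import json
import os
from dataclasses import dataclass


@dataclass
class Settings:
    weaviate_url: str
    weaviate_api_key: str
    eval_path: str


def load_settings(env=os.environ) -> Settings:
    # Variable names are assumptions; substitute the project's own.
    return Settings(
        weaviate_url=env["WEAVIATE_URL"],
        weaviate_api_key=env["WEAVIATE_API_KEY"],
        eval_path=env.get("EVAL_JSON_PATH", "evaluation.json"),
    )


def connect(settings: Settings):
    # Imported lazily so the rest of the sketch runs without the client installed.
    # Assumes the v4 Weaviate Python client and a cloud-hosted cluster.
    import weaviate
    from weaviate.classes.init import Auth

    return weaviate.connect_to_weaviate_cloud(
        cluster_url=settings.weaviate_url,
        auth_credentials=Auth.api_key(settings.weaviate_api_key),
    )


def load_eval_set(path: str) -> dict:
    # Assumed layout: {category: [{"query": ..., "relevant_ids": [...]}, ...]}
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


class Evaluator:
    """Hypothetical holder for the client and the evaluation data."""

    def __init__(self, client, eval_set: dict):
        self.client = client
        self.eval_set = eval_set
```

Passing the environment in as a mapping keeps the settings loader easy to exercise without touching the real process environment.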
This is the core loop:
For each category and query:
Runs all query methods
Calculates three metrics (Precision, Recall, Relevancy)
Logs and stores the results
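
A rough sketch of such a loop is shown below. Precision and recall use their standard set-based definitions. The text does not define Relevancy, so the sketch plugs in reciprocal rank purely as a stand-in; the real metric may differ. The data layout (`relevant_ids`, `query`), the `methods` mapping, and all function names are assumptions made for illustration.

```python
import logging
from typing import Callable, Dict, List, Sequence, Set

logger = logging.getLogger("evaluation")


def precision(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """Fraction of retrieved ids that are relevant."""
    if not retrieved:
        return 0.0
    return sum(doc in relevant for doc in retrieved) / len(retrieved)


def recall(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """Fraction of relevant ids that were retrieved."""
    if not relevant:
        return 0.0
    return len(set(retrieved) & relevant) / len(relevant)


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """Stand-in for Relevancy (an assumption): 1 / rank of the first relevant hit."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def run_evaluation(
    eval_set: Dict[str, List[dict]],
    methods: Dict[str, Callable[[str], Sequence[str]]],
    relevancy: Callable[[Sequence[str], Set[str]], float] = reciprocal_rank,
) -> List[dict]:
    rows = []
    for category, items in eval_set.items():
        for item in items:
            relevant = set(item["relevant_ids"])
            # Run every query method against the same query.
            for name, search in methods.items():
                retrieved = search(item["query"])
                row = {
                    "category": category,
                    "query": item["query"],
                    "method": name,
                    "precision": precision(retrieved, relevant),
                    "recall": recall(retrieved, relevant),
                    "relevancy": relevancy(retrieved, relevant),
                }
                logger.info("result: %s", row)
                rows.append(row)
    return rows
```

Each query method is modelled here as a callable that takes a query string and returns a ranked list of document ids, which keeps the loop independent of how a method talks to Weaviate.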
Aggregates scores across all queries
Calculates the final averages for P, R, RR, and Score
Finds the best method
Writes the results to CSV and JSON
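
The aggregation and export steps might look something like the sketch below. How Score is computed is not stated, so the sketch assumes it is the unweighted mean of the three averaged metrics; the row layout, function names, and output file names are likewise hypothetical.

```python
import csv
import json
from collections import defaultdict
from statistics import mean
from typing import Dict, List

METRICS = ("precision", "recall", "relevancy")


def aggregate(rows: List[dict]) -> List[dict]:
    """Average each metric per method across all queries."""
    by_method: Dict[str, List[dict]] = defaultdict(list)
    for row in rows:
        by_method[row["method"]].append(row)

    summary = []
    for method, items in by_method.items():
        entry = {"method": method}
        for metric in METRICS:
            entry[metric] = mean(r[metric] for r in items)
        # Assumption: Score is the plain mean of the three averages.
        entry["score"] = sum(entry[m] for m in METRICS) / len(METRICS)
        summary.append(entry)
    return summary


def best_method(summary: List[dict]) -> dict:
    """Pick the method with the highest score."""
    return max(summary, key=lambda entry: entry["score"])


def write_outputs(summary: List[dict], csv_path: str, json_path: str) -> None:
    fields = ["method", *METRICS, "score"]
    with open(csv_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fields)
        writer.writeheader()
        writer.writerows(summary)
    with open(json_path, "w", encoding="utf-8") as fh:
        json.dump(
            {"summary": summary, "best": best_method(summary)["method"]},
            fh,
            indent=2,
        )
```

A typical call would be `write_outputs(aggregate(rows), "results.csv", "results.json")`, where `rows` is a list of per-query dictionaries with `method`, `precision`, `recall`, and `relevancy` keys.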