Evaluation.
Evaluating retrieval, generation, and agentic systems — benchmarks, metrics, and the gaps between them. From offline IR evaluation with sparse labels to live benchmarks for generative research synthesis.
Project write-ups for this topic are coming soon. In the meantime, see DeepScholar-Bench, Fréchet Distance, Set-Based Text-to-Image Evaluation, and others on my publications page.