I specialize in evaluating the conversational capabilities of frontier Large Language Models (LLMs) such as GPT-4, Claude, and LLaMA. My work includes creating dialogue tasks and systematically assessing the responses of two AI agents side by side for fluency, relevance, coherence, and alignment. This helps teams compare model behavior, optimize prompts, and ensure consistent performance in AI-driven applications.
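To make the side-by-side workflow concrete, here is a minimal sketch of a pairwise scoring harness under assumed conventions: the criteria set mirrors the ones named above, while the 1-5 scale, the unweighted averaging, and names such as `Judgment` and `compare` are illustrative assumptions rather than a fixed implementation.

```python
from dataclasses import dataclass, field
from statistics import mean

# Criteria used for the pairwise comparison; the weighting here (none) is illustrative.
CRITERIA = ("fluency", "relevance", "coherence", "alignment")


@dataclass
class Judgment:
    """Scores (1-5) assigned to one agent's response on a single dialogue task."""
    agent: str
    scores: dict = field(default_factory=dict)  # criterion -> score

    def overall(self) -> float:
        # Simple unweighted average; a real rubric may weight criteria differently.
        return mean(self.scores[c] for c in CRITERIA)


def compare(task_id: str, judgment_a: Judgment, judgment_b: Judgment) -> dict:
    """Produce a per-task side-by-side verdict for two agents."""
    a, b = judgment_a.overall(), judgment_b.overall()
    winner = judgment_a.agent if a > b else judgment_b.agent if b > a else "tie"
    return {
        "task": task_id,
        "per_criterion": {
            c: (judgment_a.scores[c], judgment_b.scores[c]) for c in CRITERIA
        },
        judgment_a.agent: round(a, 2),
        judgment_b.agent: round(b, 2),
        "winner": winner,
    }


if __name__ == "__main__":
    # Hypothetical scores for one dialogue task, purely for demonstration.
    agent_a = Judgment("gpt-4", {"fluency": 5, "relevance": 4, "coherence": 5, "alignment": 4})
    agent_b = Judgment("claude", {"fluency": 5, "relevance": 5, "coherence": 4, "alignment": 5})
    print(compare("task-001", agent_a, agent_b))
```

In practice the per-criterion tuples are what make the comparison actionable: they show not just which response won overall, but on which dimension one model fell short, which is what prompt optimization typically targets.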