I evaluate AI-generated and translated content for accuracy, consistency, tone, and contextual alignment, with a focus on identifying subtle failure modes in model outputs.
My work centres on cases where outputs appear correct at first glance but break down in nuance, intent, or context, particularly in multilingual, dialogue-based, or instruction-sensitive scenarios.
This includes:
– identifying ambiguity, hallucination patterns, and context loss
– evaluating tone, intent, and instruction adherence
– detecting inconsistencies across outputs or datasets
– assessing coherence, readability, and linguistic quality
– highlighting edge cases and systematic failure patterns
With a background in translation, localisation, and editorial QA, I approach language as a structured system rather than isolated sentences. This allows me to identify recurring issues and patterns that impact model reliability at scale.
This service is particularly useful for:
– LLM output evaluation and dataset QA
– multilingual model testing (EN–DE)
– prompt/response quality analysis
– improving consistency in AI-generated content
– supporting RLHF-style evaluation and annotation workflows
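As an illustration of how such evaluations can feed into structured annotation workflows, the sketch below captures the dimensions listed above in a simple record format. The schema and field names are hypothetical, not a standard annotation format:

```python
from dataclasses import dataclass, field

# Hypothetical evaluation record; field names are illustrative only.
@dataclass
class EvalRecord:
    prompt: str
    response: str
    # Per-dimension scores on a 1-5 scale (accuracy, tone, etc.).
    scores: dict = field(default_factory=dict)
    notes: str = ""

def flag_failures(records, threshold=3):
    """Return records where any dimension scores below the threshold."""
    return [r for r in records
            if any(v < threshold for v in r.scores.values())]

records = [
    EvalRecord("Translate to German: 'bank'", "Bank",
               scores={"accuracy": 2, "tone": 4},
               notes="Ambiguity: financial vs. river bank not resolved"),
    EvalRecord("Summarise this email politely", "Sure! Here's a summary...",
               scores={"accuracy": 5, "tone": 5, "instruction_adherence": 4}),
]
flagged = flag_failures(records)
print(len(flagged))  # records needing reviewer attention
```

A record-per-output structure like this makes it easy to aggregate scores across a dataset and surface systematic failure patterns rather than isolated errors.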
I deliver clear, structured feedback that helps improve both individual outputs and overall system performance.