AI Training Data & Cultural Heritage Dataset Specialist (LLMs / Multimodal Models)
Project Overview
We are developing a large-scale, high-quality dataset of approximately 3 million images covering art, culture, and history (paintings, objects, architecture, historical scenes, etc.). Our goal is to license this collection to leading AI companies (OpenAI, Meta, Google, Amazon, Apple, and others) for LLM and multimodal model training.
We are structuring the dataset to meet modern LLM training, alignment, retrieval, and multimodal reasoning requirements.
We are seeking a specialist who understands exactly what AI model developers look for in training datasets, and who can help us structure, annotate, document, and package this collection for commercial licensing.
Responsibilities
The selected specialist will:
Advise on what types of image + metadata pairs LLM and multimodal model developers value most
Define or refine a scalable metadata schema (e.g., LIDO / CDWA aligned, JSON-based) suitable for:
Vision-language models (VLMs)
Retrieval-augmented generation (RAG)
Fine-tuning and alignment
Evaluate and improve our existing schema (currently ~17 fields per image, including visual, cultural, technical, and legal metadata)
Recommend annotation standards for:
Visual objects and composition
Style, period, and cultural context
Captioning at multiple levels of complexity
VQA (Visual Question Answering) pairs
Synthetic negatives for robustness
Advise on dataset packaging and delivery formats preferred by major AI labs (JSONL, Parquet, TFRecord, etc.)
Ensure the dataset adheres to FAIR data principles (Findable, Accessible, Interoperable, Reusable)
Provide guidance on licensing language, provenance, and legal compliance for AI training use
Help position the dataset as a premium, commercially licensable AI training asset
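To make the scope above more concrete, here is a minimal sketch of the kind of JSON-based record and JSONL packaging we have in mind. Every field name and value below is a hypothetical placeholder chosen for illustration; it is not our current schema, and the right structure is exactly what we expect the specialist to help define.

```python
import io
import json

# Hypothetical required fields, for illustration only (not the actual schema).
REQUIRED_FIELDS = {"image_id", "title", "license", "captions", "vqa_pairs"}


def make_record(image_id: str) -> dict:
    """Build one illustrative image + metadata record with placeholder values."""
    return {
        "image_id": image_id,
        "title": "Untitled landscape",
        "period": "unknown",
        "license": "CC0-1.0",
        # Captions at more than one level of complexity.
        "captions": {
            "short": "A landscape painting.",
            "detailed": "A landscape painting with hills and a river in the foreground.",
        },
        # Visual question answering pairs tied to the image.
        "vqa_pairs": [
            {"question": "What type of artwork is this?", "answer": "A painting"},
        ],
        # Synthetic negative: a deliberately wrong caption for robustness training.
        "negative_captions": ["A marble sculpture of a horse."],
    }


def validate(record: dict) -> list:
    """Return a sorted list of missing required fields (empty if none are missing)."""
    return sorted(REQUIRED_FIELDS - set(record))


def write_jsonl(records: list, stream) -> None:
    """Write one JSON object per line, which is the JSONL convention."""
    for record in records:
        stream.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(stream) -> list:
    """Parse a JSONL stream back into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in stream if line.strip()]


# Round trip through an in-memory buffer instead of a real file.
records = [make_record("img-000001"), make_record("img-000002")]
buffer = io.StringIO()
write_jsonl(records, buffer)
buffer.seek(0)
round_tripped = read_jsonl(buffer)
```

The same records could be written to Parquet or TFRecord instead; JSONL is used here only because it needs nothing beyond the standard library.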
We are looking for someone with demonstrated experience in one or more of the following:
LLM or multimodal model training data (vision-language and CLIP-style datasets, VQA, object detection data for models such as DETR, etc.)
Dataset design for AI labs, research institutions, or large-scale ML pipelines
Cultural heritage, museum, or archival metadata standards (LIDO, CDWA, CIDOC CRM, Getty AAT)
Structuring datasets for zero-shot learning, cross-modal reasoning, and multilingual AI
Data licensing for AI training (public domain, CC0, custom licenses)
AI data evaluation, benchmarking, or alignment
...