Posted 2 days ago  ·  Job ID: 2114960  ·  49 quotes received

Fixed Price or Hourly

  Send before: February 12, 2026

AI Training Data & Cultural Heritage Dataset Specialist (LLMs / Multimodal Models)


 Project Overview 


We are developing a large-scale, high-quality dataset of approximately 3 million images covering art, culture, and history (paintings, objects, architecture, historical scenes, etc.) with the goal of licensing this collection to leading AI companies (OpenAI, Meta, Google, Amazon, Apple, and others) for LLM and multimodal model training.

Our focus is on structuring the collection to meet the requirements of modern LLM training, alignment, retrieval, and multimodal reasoning.

We are seeking a specialist who understands exactly what AI model developers look for in training datasets, and who can help us structure, annotate, document, and package this collection for commercial licensing.


 Responsibilities 


The selected specialist will:

  • Advise on what types of image + metadata pairs LLM and multimodal model developers value most

  • Define or refine a scalable metadata schema (e.g., LIDO / CDWA aligned, JSON-based) suitable for:

    • Vision-language models (VLMs)

    • Retrieval-augmented generation (RAG)

    • Fine-tuning and alignment

  • Evaluate and improve our existing schema (currently ~17 fields per image, including visual, cultural, technical, and legal metadata)

  • Recommend annotation standards for:

    • Visual objects and composition

    • Style, period, and cultural context

    • Captioning at multiple levels of complexity

    • VQA (Visual Question Answering) pairs

    • Synthetic negatives for robustness

  • Advise on dataset packaging and delivery formats preferred by major AI labs (JSONL, Parquet, TFRecord, etc.)

  • Ensure the dataset follows FAIR data principles (Findable, Accessible, Interoperable, Reusable)

  • Provide guidance on licensing language, provenance, and legal compliance for AI training use

  • Help position the dataset as a premium, commercially licensable AI training asset
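
To make the schema, annotation, and packaging items above more concrete, here is a minimal sketch of what a single record and its JSONL export might look like. Every field name, the `REQUIRED_FIELDS` set, and the sample values are hypothetical placeholders chosen for illustration; they are not our actual ~17-field schema, and the final design is exactly what we want the specialist to advise on.

```python
import json

# Hypothetical field names for illustration only; not the actual schema.
REQUIRED_FIELDS = {"image_id", "title", "period", "culture", "license", "captions", "vqa"}


def make_record(image_id, title, period, culture, license_id, captions, vqa_pairs):
    """Build one image record with multi-level captions and VQA pairs."""
    return {
        "image_id": image_id,
        "title": title,
        "period": period,
        "culture": culture,
        "license": license_id,
        # Captions keyed by complexity level, e.g. "short" and "detailed".
        "captions": captions,
        # Each VQA pair is stored as an explicit question/answer object.
        "vqa": [{"question": q, "answer": a} for q, a in vqa_pairs],
    }


def validate(record):
    """Return the sorted list of required fields that are missing or empty."""
    return sorted(f for f in REQUIRED_FIELDS if not record.get(f))


def to_jsonl(records):
    """Serialize records as JSONL: one JSON object per line."""
    return "\n".join(
        json.dumps(r, ensure_ascii=False, sort_keys=True) for r in records
    )


# Placeholder values, not taken from the real collection.
example = make_record(
    image_id="img-0000001",
    title="Placeholder still life",
    period="17th century",
    culture="Dutch",
    license_id="CC0-1.0",
    captions={
        "short": "A still life with fruit.",
        "detailed": "An oil painting of fruit and a glass on a draped table.",
    },
    vqa_pairs=[("What medium is the work?", "Oil on canvas")],
)

if __name__ == "__main__":
    print(validate(example))   # fields that still need values
    print(to_jsonl([example]))
```

The same records could be written to Parquet or TFRecord instead; JSONL is used here only because it needs no third-party libraries.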


 Required Expertise 


We are looking for someone with demonstrated experience in one or more of the following:

  • LLM or multimodal model training data (vision-language pairs, CLIP-style datasets, VQA, DETR-style detection annotations, etc.)

  • Dataset design for AI labs, research institutions, or large-scale ML pipelines

  • Cultural heritage, museum, or archival metadata standards (LIDO, CDWA, CIDOC CRM, Getty AAT)

  • Structuring datasets for zero-shot learning, cross-modal reasoning, and multilingual AI

  • Data licensing for AI training (public domain, CC0, custom licenses)

  • AI data evaluation, benchmarking, or alignment


Nice to have:


  • Prior work with OpenAI, Meta, Google, Amazon, Apple, or comparable AI organizations

  • Experience with large image datasets (100k+ items)

  • Familiarity with RAG systems and embedding-based retrieval

  • Experience preparing datasets for commercial sale or enterprise clients


 Deliverables 


Depending on engagement scope, deliverables may include:

  • Written recommendations on dataset structure and field definitions

  • Improved or finalized metadata schema (JSON examples included)

  • Annotation and captioning guidelines

  • Dataset documentation suitable for enterprise AI buyers

  • Optional: review of a sample subset of images and metadata


Michael V Canada