Posted 2 days ago · Job ID: 2114960 · 49 quotes received

Dataset Specialist (LLMs / Multimodal Models)

Fixed Price or Hourly

Send before: February 12, 2026

Category: Programming & Development · Industry Specific Expertise

Skills: Metadata Modeling · Information Technology · Artificial Intelligence · Meta Language (Ml) · Large Language Models

AI Training Data & Cultural Heritage Dataset Specialist (LLMs / Multimodal Models)

Project Overview

We are developing a large-scale, high-quality dataset of approximately 3 million images covering art, culture, and history (paintings, objects, architecture, historical scenes, etc.) with the goal of licensing this collection to leading AI companies (OpenAI, Meta, Google, Amazon, Apple, and others) for LLM and multimodal model training.

We are focused on structuring it to meet modern LLM training, alignment, retrieval, and multimodal reasoning requirements.

We are seeking a specialist who understands exactly what AI model developers look for in training datasets, and who can help us structure, annotate, document, and package this collection for commercial licensing.

Responsibilities

The selected specialist will:

Advise on what types of image + metadata pairs LLM and multimodal model developers value most
Define or refine a scalable, JSON-based metadata schema (e.g., aligned with LIDO / CDWA) suitable for:
- Vision-language models (VLMs)
- Retrieval-augmented generation (RAG)
- Fine-tuning and alignment
Evaluate and improve our existing schema (currently ~17 fields per image, including visual, cultural, technical, and legal metadata)
Recommend annotation standards for:
- Visual objects and composition
- Style, period, and cultural context
- Captioning at multiple levels of complexity
- VQA (Visual Question Answering) pairs
- Synthetic negatives for robustness
Advise on dataset packaging and delivery formats preferred by major AI labs (JSONL, Parquet, TFRecord, etc.)
Ensure the dataset follows FAIR data principles (Findable, Accessible, Interoperable, Reusable)
Provide guidance on licensing language, provenance, and legal compliance for AI training use
Help position the dataset as a premium, commercially licensable AI training asset
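
To make the schema and packaging items above more concrete, here is a minimal sketch of what one image record and its JSONL serialization might look like. Every field name (`image_id`, `captions`, `vqa`, `negative_captions`, and so on) and every value is a hypothetical illustration, not the project's existing 17-field schema, and the required-field check is only one possible validation rule.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class ImageRecord:
    # Hypothetical field names for illustration only; the real schema
    # would be defined during the engagement.
    image_id: str
    title: str
    period: str
    license: str
    # Caption level -> caption text (e.g. "short", "detailed").
    captions: dict = field(default_factory=dict)
    # Visual question answering pairs: {"question": ..., "answer": ...}.
    vqa: list = field(default_factory=list)
    # Deliberately incorrect captions usable as synthetic negatives.
    negative_captions: list = field(default_factory=list)


# Assumed minimal set of fields every record must carry.
REQUIRED = ("image_id", "title", "license")


def to_jsonl(records):
    """Serialize records to JSONL: one JSON object per line."""
    return "\n".join(json.dumps(asdict(r), ensure_ascii=False) for r in records)


def missing_fields(line):
    """Return the required fields that are absent or empty in one JSONL line."""
    obj = json.loads(line)
    return [name for name in REQUIRED if not obj.get(name)]


# Made-up example record; none of these values come from the actual collection.
record = ImageRecord(
    image_id="img-0000001",
    title="Example still life",
    period="17th century",
    license="CC0-1.0",
    captions={
        "short": "A still life with fruit.",
        "detailed": "A still life of fruit and a glass on a dark table.",
    },
    vqa=[{"question": "What is on the table?", "answer": "Fruit and a glass."}],
    negative_captions=["A portrait of a woman in a blue dress."],
)

if __name__ == "__main__":
    print(to_jsonl([record]))
```

JSONL is used here only because it is easy to inspect line by line; the same records could be written to Parquet or TFRecord once the field definitions are settled.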

Required Expertise

We are looking for someone with demonstrated experience in one or more of the following:

LLM or multimodal model training data (vision-language, CLIP-style datasets, VQA, DETR, etc.)
Dataset design for AI labs, research institutions, or large-scale ML pipelines
Cultural heritage, museum, or archival metadata standards (LIDO, CDWA, CIDOC CRM, Getty AAT)
Structuring datasets for zero-shot learning, cross-modal reasoning, and multilingual AI
Data licensing for AI training (public domain, CC0, custom licenses)
AI data evaluation, benchmarking, or alignment

Nice to have:

Prior work with OpenAI, Meta, Google, Amazon, Apple, or comparable AI organizations
Experience with large image datasets (100k+ items)
Familiarity with RAG systems and embedding-based retrieval
Experience preparing datasets for commercial sale or enterprise clients

Deliverables

Depending on engagement scope, deliverables may include:

Written recommendations on dataset structure and field definitions
Improved or finalized metadata schema (JSON examples included)
Annotation and captioning guidelines
Dataset documentation suitable for enterprise AI buyers
Optional: review of a sample subset of images and metadata

Posted By

Michael v

Canada


Feedback	No Feedback 100.0%
Total Spend	$13,608
Jobs Posted	132
Jobs Paid	23 (17%)
Paid Invoices	40 (95%)
Outstanding Invoices	2
