Clustering using GPU nearest neighbours

Starting at $60/hr

  • Engineered a GPU-accelerated review-clustering pipeline that generates Azure OpenAI embeddings and retrieves nearest neighbors in PostgreSQL using pgvector’s cosine-distance operator (<=>) for fast similarity ranking.

  • Coupled UMAP (non-linear dimensionality reduction) with HDBSCAN to discover dense semantic clusters without pre-specifying the number of clusters k, following the recommended usage of both algorithms.

  • Orchestrated a 72-variant hyperparameter sweep (n_neighbors × min_dist × n_components × min_cluster_size × min_samples) and parallelized evaluation across 8 processes for efficient model selection.

  • Designed an auto-selection score that Min-Max normalizes and weights the Davies–Bouldin (lower is better), Silhouette, and Calinski–Harabasz (both higher is better) indices, and enforces guardrails on valid cluster counts to pick the best iteration.

  • Persisted per-iteration labels, metrics, and configs via SQLAlchemy; materialized the final reviews→cluster mapping in Postgres with safe delete/append semantics for idempotent reruns.

  • Added structured logging (structlog) and timestamped checkpoints, delivering a reproducible, configurable clustering service for large-scale product-review analysis (Python, RAPIDS cuML, PostgreSQL/pgvector, SQLAlchemy, Azure OpenAI, pandas, scikit-learn).
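
The 72-variant sweep described above can be sketched roughly as follows. The grid values here are hypothetical, chosen only so that the five parameters multiply out to 72 combinations (3 × 2 × 2 × 3 × 2); the values actually swept are not listed on this page. The helper names build_configs and run_sweep are also illustrative, and the evaluate callable stands in for whatever fits UMAP and HDBSCAN for one configuration.

```python
from concurrent.futures import ProcessPoolExecutor
from itertools import product

# Hypothetical grid values (assumption): picked only so the product is 72.
PARAM_GRID = {
    "n_neighbors": [15, 30, 50],
    "min_dist": [0.0, 0.1],
    "n_components": [5, 10],
    "min_cluster_size": [10, 25, 50],
    "min_samples": [5, 10],
}


def build_configs(grid):
    """Expand a dict of value lists into one config dict per combination."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*grid.values())]


def run_sweep(evaluate, configs, workers=8):
    """Evaluate every config in a pool of worker processes.

    `evaluate` must be a picklable, module-level function that takes one
    config dict and returns that iteration's metrics.
    """
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so results line up with configs.
        return list(pool.map(evaluate, configs))


configs = build_configs(PARAM_GRID)
```

In a script, run_sweep would typically be called under an `if __name__ == "__main__":` guard so that worker processes can import the module safely.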

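One way the auto-selection score could work is sketched below. It assumes each iteration is a dict holding its cluster count and the three indices; the weights, the cluster-count bounds, and the function names are hypothetical, not the values used in the project. Davies–Bouldin is inverted after normalization because lower values indicate better clustering, while Silhouette and Calinski–Harabasz are better when higher.

```python
def min_max(values):
    """Scale values to [0, 1]; return 0.5 for all when there is no spread."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def select_best(runs, weights=None, min_clusters=2, max_clusters=50):
    """Return (best_run, score) among runs that pass the cluster-count guardrail.

    The default weights and bounds are illustrative assumptions.
    """
    weights = weights or {
        "silhouette": 0.4,
        "calinski_harabasz": 0.3,
        "davies_bouldin": 0.3,
    }
    # Guardrail: drop iterations with too few or too many clusters.
    valid = [r for r in runs if min_clusters <= r["n_clusters"] <= max_clusters]
    if not valid:
        raise ValueError("no iteration satisfies the cluster-count guardrail")

    sil = min_max([r["silhouette"] for r in valid])
    ch = min_max([r["calinski_harabasz"] for r in valid])
    # Lower Davies-Bouldin is better, so flip it after normalizing.
    db = [1.0 - x for x in min_max([r["davies_bouldin"] for r in valid])]

    scores = [
        weights["silhouette"] * s
        + weights["calinski_harabasz"] * c
        + weights["davies_bouldin"] * d
        for s, c, d in zip(sil, ch, db)
    ]
    best = max(range(len(valid)), key=scores.__getitem__)
    return valid[best], scores[best]
```

Normalizing only over the iterations that pass the guardrail keeps a degenerate run (for example, a single giant cluster) from stretching the scale for the rest.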
About

$60/hr Ongoing

Skills & Expertise

Cluster Analysis, Cluster Management, Clustering, cuML, HDBSCAN, OpenAI, UMAP

15 Reviews