🌅 Building Semantic Search for 100k+ Creative Assets

How I transformed a chaotic Dropbox archive into an AI-powered search system that understands concepts instead of filenames.

By Jameson Campbell

Executive Summary

Impact

  • Search time reduced from twenty minutes to under two minutes
  • Time lost hunting for assets down forty percent
  • Adopted daily across design and sales
  • Zero training required

I built an end-to-end semantic search system that lets our teams find past creative work using meaning-based queries. The platform processes more than 100k assets, generates embeddings, stores vectors, and returns visually similar results with high accuracy and low latency. It eliminated the bottleneck of digging through nested folders and guessing at filenames.

1. The Problem

At Sock Club, our creative archive lived in Dropbox with more than 100k PSDs and exports accumulated over years. Filenames were inconsistent, folder structures varied, and searching depended on tribal knowledge. Designers searched for concepts, not file names. Sales needed examples for live client conversations.

Dropbox Dash could not solve this because it was built for general file search, not semantic visual discovery across creative content.

The consequences were clear. Designers recreated assets because finding originals took too long. Sales slowed down client emails while trying to find examples. Junior team members regularly pinged senior designers for help locating files.

We needed meaning-based search that understood visual similarity, aesthetic themes, and conceptual intent.

2. Requirements

After working with designers and sales reps, I defined the core needs.

User Requirements

  • Search by concept, not filename
  • Fast results across the entire library
  • Support PSDs, PNGs, and exports
  • Direct access to original files with secure authentication
  • Integration with HubSpot Deals
  • A pipeline that processes more than 100k assets without manual intervention

The system needed to be simple, intuitive, and reliable.

3. Architecture Overview

The platform uses a unified embedding space for images and text to power semantic similarity search.

Tech Stack

  • Next.js frontend on Vercel
  • FastAPI backend on Vercel serverless Python runtime
  • Vertex AI multimodal embeddings
  • Pinecone for vector search
  • AWS S3 for asset storage
  • AWS Cognito for authentication
  • HubSpot API integration

Data flow

  • Assets migrate from Dropbox to S3.
  • Files convert to PNG when needed.
  • Vertex AI generates embeddings.
  • Pinecone stores vectors with metadata.
  • Search requests embed the query and retrieve nearest matches.
  • The UI displays results with metadata, previews, and PSD downloads.

Figure: Sock Scout architecture diagram showing system components and data flow

4. Ingestion and Embedding Pipeline

Processing more than 100k assets required a durable, recoverable ingestion pipeline.
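
One way to make a long batch recoverable is to record finished asset IDs in a checkpoint file and skip them on the next run. The sketch below illustrates that idea only; it is not the production pipeline, and the names `run_batch`, `process`, and the JSON checkpoint format are assumptions made for the example.

```python
import json
import os
import tempfile


def load_done(path):
    """Return the set of asset IDs already processed, or an empty set."""
    if not os.path.exists(path):
        return set()
    with open(path, "r", encoding="utf-8") as fh:
        return set(json.load(fh))


def save_done(path, done):
    """Write the checkpoint atomically so a crash never leaves a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(sorted(done), fh)
    os.replace(tmp_path, path)  # atomic rename on the same filesystem


def run_batch(asset_ids, process, checkpoint_path, save_every=100):
    """Process each asset once, skipping IDs recorded by an earlier run."""
    done = load_done(checkpoint_path)
    processed_now = 0
    for asset_id in asset_ids:
        if asset_id in done:
            continue
        process(asset_id)  # hypothetical per-asset step (convert, embed, upsert)
        done.add(asset_id)
        processed_now += 1
        if processed_now % save_every == 0:
            save_done(checkpoint_path, done)
    save_done(checkpoint_path, done)
    return processed_now
```

If a run dies between checkpoint writes, at most `save_every` assets are processed again on restart, which is harmless as long as the downstream write is idempotent.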

Key Pipeline Challenges

  • Staying within model rate limits
  • Converting PSDs reliably
  • Skipping corrupt files
  • Recovering from long batch interruptions
  • Detecting duplicates before embedding
  • Migrating from Dropbox to S3 cleanly
  • Handling PSDs that were 99 percent white
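
Two of the challenges above, near-blank renders and duplicate files, can be illustrated in plain Python. This is a hedged sketch rather than the checks the pipeline actually runs: the pixel input is assumed to be an iterable of `(r, g, b)` tuples from whatever image library does the PNG conversion, and the 250 brightness threshold and 0.99 cutoff are illustrative values.

```python
import hashlib


def white_fraction(pixels, threshold=250):
    """Share of pixels whose R, G and B channels are all at or above threshold."""
    total = 0
    white = 0
    for r, g, b in pixels:
        total += 1
        if r >= threshold and g >= threshold and b >= threshold:
            white += 1
    return white / total if total else 0.0


def is_mostly_blank(pixels, max_white=0.99):
    """Flag a render whose white share reaches max_white (assumed cutoff)."""
    return white_fraction(pixels) >= max_white


def content_hash(data: bytes) -> str:
    """SHA-256 of the file bytes, usable as a duplicate-detection key."""
    return hashlib.sha256(data).hexdigest()


def split_duplicates(files):
    """Given (name, bytes) pairs, return (unique_names, duplicate_names)."""
    seen = set()
    unique, duplicates = [], []
    for name, data in files:
        digest = content_hash(data)
        if digest in seen:
            duplicates.append(name)
        else:
            seen.add(digest)
            unique.append(name)
    return unique, duplicates
```

Hashing file bytes only catches exact copies; visually identical exports saved with different settings would need a separate check.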

Design choices

  • Batch execution with adjustable concurrency
  • Exponential backoff and retry logic
  • PNG conversion pipeline for PSDs
  • Pinecone pre-checks to avoid duplicate embeddings
  • Metadata mapping for HubSpot integration
  • Progress tracking to support multi-hour runs
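
The first two design choices, adjustable concurrency and exponential backoff, combine naturally in `asyncio`. The following is a minimal sketch under assumptions: `RateLimitError` is a placeholder for whatever exception the embedding client raises when throttled, `worker` is a hypothetical coroutine that embeds one asset, and the delay values are illustrative.

```python
import asyncio
import random


class RateLimitError(Exception):
    """Placeholder for the error an embedding client raises when throttled."""


async def with_backoff(call, max_attempts=5, base_delay=1.0, max_delay=60.0,
                       sleep=asyncio.sleep):
    """Await call(); on RateLimitError wait base_delay * 2**attempt plus jitter, then retry."""
    for attempt in range(max_attempts):
        try:
            return await call()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            await sleep(delay + random.uniform(0, delay * 0.1))


async def run_all(items, worker, concurrency=4, **backoff_kwargs):
    """Run worker(item) for every item with at most `concurrency` calls in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def guarded(item):
        async with semaphore:
            return await with_backoff(lambda: worker(item), **backoff_kwargs)

    return await asyncio.gather(*(guarded(item) for item in items))
```

Lowering `concurrency` is the blunt control for staying under a rate limit; backoff handles the bursts that still get through.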

The pipeline completed more than 253k embedding units with a total compute cost of $25, reduced to $8 after Google Cloud credits.
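
To put those figures in per-unit terms, here is the arithmetic as a short script. It uses only the numbers quoted above and treats 253k as exact, so the per-unit cost is an upper bound.

```python
units = 253_000          # "more than 253k embedding units"
total_cost_usd = 25.00   # compute cost before credits
after_credits_usd = 8.00

per_unit = total_cost_usd / units
per_thousand = per_unit * 1000
credit_share = (total_cost_usd - after_credits_usd) / total_cost_usd

print(f"about ${per_thousand:.3f} per 1,000 units")
print(f"credits covered {credit_share:.0%} of the bill")
```

That works out to roughly a hundredth of a cent per embedding unit, with credits absorbing about two thirds of the total.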

5. Vector Database Strategy

Pinecone provided fast similarity search with minimal operational overhead.

Why Pinecone

  • Serverless architecture with zero maintenance
  • Metadata-based filtering
  • High read performance
  • Smooth large-scale upserts

Index structure

  • Cosine similarity with 1408-dimensional vectors
  • Rich metadata for filtering
  • Dropbox ID-based keys for designer assets to ensure long-term consistency
  • S3 hash keys for sales assets for backward compatibility

This keying scheme eliminated duplicate vectors and prevented orphaned entries during S3 reorganizations.
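
The reason stable keys prevent duplicates is that an upsert with an existing ID overwrites the record instead of adding a new one. The sketch below shows the idea with a plain dictionary standing in for the vector index. The `design:` and `sales:` prefixes, the choice to hash the S3 object key, and the sample ID are all assumptions for illustration; the exact key format is not described in this write-up.

```python
import hashlib


def designer_vector_id(dropbox_file_id: str) -> str:
    """Key a designer asset by its Dropbox file ID, which stays the same across renames and moves."""
    return f"design:{dropbox_file_id}"


def sales_vector_id(s3_key: str) -> str:
    """Key a sales asset by a hash of its S3 object key (assumed hashing scheme)."""
    return "sales:" + hashlib.sha256(s3_key.encode("utf-8")).hexdigest()


def upsert(index: dict, vector_id: str, vector, metadata: dict) -> bool:
    """Insert or overwrite by ID; return True if the ID was new."""
    is_new = vector_id not in index
    index[vector_id] = {"values": list(vector), "metadata": metadata}
    return is_new


index = {}
vid = designer_vector_id("id:EXAMPLE123")  # hypothetical Dropbox file ID
upsert(index, vid, [0.1, 0.2], {"path": "designs/2021/argyle.psd"})
# The file is later moved: same ID, new path, still a single vector.
upsert(index, vid, [0.1, 0.2], {"path": "archive/argyle.psd"})
```

After both calls the index holds one record whose metadata reflects the new path, which is the behavior that keeps a reorganized archive from leaving stale entries behind.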

6. Search API

The FastAPI service handles:

  • Text or image queries
  • Embedding generation
  • Vector retrieval
  • Metadata hydration
  • Authorization checks
  • Secure access to original PSDs
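
The retrieval and hydration steps in the list above can be sketched without any external service. The example below is an in-memory stand-in for what the vector database does on each request: rank stored vectors by cosine similarity, apply a metadata filter, and merge metadata into each hit. The three-dimensional vectors, the `library` field, and the function names are illustrative assumptions; real vectors have 1408 dimensions and come from the embedding model.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(index, query_vector, top_k=3, where=None):
    """Rank stored vectors by cosine similarity, keeping only metadata matches."""
    where = where or {}
    scored = []
    for vector_id, record in index.items():
        meta = record["metadata"]
        if any(meta.get(key) != value for key, value in where.items()):
            continue  # metadata filter, e.g. design vs. sales library
        scored.append((cosine(query_vector, record["values"]), vector_id))
    scored.sort(reverse=True)
    # "Hydrate" each hit by merging its stored metadata into the result.
    return [
        {"id": vector_id, "score": score, **index[vector_id]["metadata"]}
        for score, vector_id in scored[:top_k]
    ]


demo_index = {
    "a": {"values": [1.0, 0.0, 0.0], "metadata": {"library": "design"}},
    "b": {"values": [0.9, 0.1, 0.0], "metadata": {"library": "sales"}},
    "c": {"values": [0.0, 1.0, 0.0], "metadata": {"library": "design"}},
}
hits = search(demo_index, [1.0, 0.0, 0.0], top_k=2, where={"library": "design"})
```

Because text and images share one embedding space, the same `search` call serves both query types; only the step that produces `query_vector` differs.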

Why Vertex AI

  • High-quality multimodal embeddings
  • Same embedding space for text and images
  • Strong performance on MTEB benchmarks
  • Predictable latency
  • Cost-effective at scale

The API consistently returns top results in under a second. Debouncing on the frontend keeps request volume down while users are still typing.

7. Frontend Experience

The Next.js interface makes searching feel instant and intuitive.

UX Priorities

  • Fast feedback
  • Clear concept-based results
  • Minimal learning curve
  • Useful previews without losing context
  • Full mobile support

Features

  • Unified search bar
  • Toggle for Design or Sales assets
  • Responsive grid of results
  • Fullscreen preview modal
  • Asset metadata with HubSpot links and Internal Portal links
  • Explore Similar Designs using the asset as the query
  • Asset flagging system for quality control
  • Dark mode via theme variables
  • Optimized mobile layout
  • Lazy-loaded result sets
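
The "Explore Similar Designs" feature in the list above amounts to reusing an asset's stored vector as the query and leaving the asset itself out of the results. A minimal sketch of that idea follows; the asset names and two-dimensional vectors are made up for the example, and this is not the production code.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return dot / norm if norm else 0.0


def explore_similar(vectors, asset_id, top_k=3):
    """Rank other assets against the stored vector of `asset_id`, skipping the asset itself."""
    query = vectors[asset_id]
    ranked = sorted(
        ((cosine(query, vec), other) for other, vec in vectors.items() if other != asset_id),
        reverse=True,
    )
    return [other for _, other in ranked[:top_k]]


vectors = {
    "argyle-01": [1.0, 0.0],
    "argyle-02": [0.9, 0.1],
    "floral-01": [0.0, 1.0],
}
similar = explore_similar(vectors, "argyle-01", top_k=2)
```

No new embedding call is needed for this path, since the query vector already exists in the index.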

Figure: Home screen with unified search interface

Figure: Concept-based search results

Figure: Full screen preview with metadata and actions

Users can search for terms like "playful geometric layout" or "vintage airport theme" and the system returns visually coherent matches.

8. Asset Quality Control

To maintain brand quality, I built a flagging system with four preset reasons. This calls out outdated or unusable assets within searches and gives designers a lightweight review workflow.

Figure: Asset flagging system showing predefined flagging reasons

9. Performance and Reliability

The system meets the following internal targets:

  • Search results returned in 300 to 600 milliseconds
  • Efficient preview loading even in large result sets
  • Ingestion pipeline stable over many hours
  • Zero-downtime deployments with Vercel
  • Full analytics and error tracking via PostHog

10. Cost Efficiency

The platform costs about $72 a month to run.

  • Pinecone: $50
  • Vercel: $20
  • S3: about $1 to $2
  • Vertex AI queries: effectively zero after credits
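
As a quick check on the total, the line items above can be summed as low and high estimates. The per-asset figure assumes exactly 100,000 assets, which is the lower bound quoted in this write-up, so it is an upper bound on the real number.

```python
# (low, high) monthly estimates in USD, taken from the list above
monthly_usd = {
    "Pinecone": (50, 50),
    "Vercel": (20, 20),
    "S3": (1, 2),                  # "about $1 to $2"
    "Vertex AI queries": (0, 0),   # effectively zero after credits
}

low = sum(lo for lo, _ in monthly_usd.values())
high = sum(hi for _, hi in monthly_usd.values())

assets = 100_000  # lower bound: "more than 100k assets"
per_asset_high = high / assets

print(f"${low} to ${high} per month")
print(f"at most ${per_asset_high:.5f} per asset per month")
```

The high end matches the $72 figure, and the vector database accounts for most of it.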

The architecture can scale with minimal changes while staying cost efficient.

11. Results

Impact Metrics

  • Search 10x faster
  • 40% reduction in time lost hunting for assets
  • Designers stopped recreating assets unnecessarily
  • Fewer interruptions to senior team members
  • Faster sales responses for client emails

Adoption was organic. No training or rollout was required. People saw value immediately.

"Sock Scout is phenomenal, by the way. I've used it many times this week and it has saved me so much time already. Thanks for building such an amazing tool for us!!" — Taylor Spence, Senior Designer

The biggest validation: people used it without being asked.

12. Lessons Learned

  • Pipelines must be resilient to failure
  • UX drives adoption more than features
  • Trust in search accuracy unlocks new workflows
  • Isolated environments enable safe experimentation
  • Each environment teaches you what to automate next
  • Dry-run mode prevents costly mistakes
  • Early architectural decisions have lasting impact
  • Building for scale means making systems adaptable to constraints, not just fast
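
The dry-run lesson is easy to show in code: the same function plans the work and, only when asked, performs it. This is a generic sketch of the pattern, assuming a hypothetical `copy_file` callable and made-up paths; it is not the migration script used for this project.

```python
def migrate(assets, copy_file, dry_run=True):
    """Plan or perform copies; in dry-run mode nothing is written."""
    plan = []
    for source, destination in assets:
        verb = "WOULD COPY" if dry_run else "COPY"
        plan.append(f"{verb} {source} -> {destination}")
        if not dry_run:
            copy_file(source, destination)  # side effect happens only on a real run
    return plan


copied = []
pairs = [("dropbox/a.psd", "s3/a.psd"), ("dropbox/b.psd", "s3/b.psd")]

preview = migrate(pairs, lambda s, d: copied.append((s, d)))            # safe default
applied = migrate(pairs, lambda s, d: copied.append((s, d)), dry_run=False)
```

Making `dry_run=True` the default means the destructive path has to be requested explicitly, which is the cheap insurance the lesson refers to.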

Final Takeaway

Semantic search turns an archive into a discovery engine. This project showed how modern AI tooling, paired with simple UX, can give a small team the type of internal capability usually reserved for large engineering organizations.

It reflects how I approach internal tools work. I identify a real operational friction point, design a pragmatic architecture, build the solution end to end, and deliver measurable improvements in team productivity and decision making.