What does an LLM app developer or AI agent builder do?
An LLM app developer builds product or internal software around language models. An AI agent builder goes further into multi-step systems that call tools, retrieve information, manage context, and decide when to ask for help. In both cases, the useful work is not just prompt text. It is product engineering around uncertain model behavior.
- Build chat, assistant, extraction, summarization, or workflow interfaces.
- Connect models to product data through retrieval and permissioned APIs.
- Create eval sets so changes can be tested before release.
- Design tool-calling paths, guardrails, and human escalation points.
Role taxonomy: conversational AI developer vs. LLM app developer vs. AI agent engineer
Prompt engineering is now better treated as a skill inside LLM and agent roles, not the title most startups should post. If the work involves shipping software, use a title that names the system being built: LLM app developer, AI agent engineer, conversational AI developer, or AI integration developer.
| Role | Primary work | Proof to request |
|---|---|---|
| Conversational AI developer | Chatbot, voice, and support flows | Conversation design plus analytics |
| LLM app developer | User-facing AI features and backend integration | Code, evals, and shipped product examples |
| AI agent engineer | Tool use, orchestration, and multi-step workflows | Agent traces, failure handling, and test harnesses |
| Prompt-heavy specialist | Instruction design and quality iteration | Prompt systems connected to measurable outcomes |
Common job description mistakes
The risky brief asks for someone to build an AI agent without naming the tools, permissions, target users, data sources, latency needs, or review process. A better brief says what the system must do in week one, what it can access, and what failure looks like.
- Do not use prompt engineer as the default title for product engineering work.
- Do not ask for every model framework unless your stack truly requires them.
- Do not hide whether the work includes RAG, tool calling, fine-tuning, or evals.
- Do not skip data access. The candidate needs to know what the model can retrieve.
What to look for in a portfolio
Strong portfolios show the hard parts around the model: retrieval quality, prompt versioning, cost controls, latency budgets, eval results, and examples of bad outputs that were caught before users saw them.
- RAG pipelines with retrieval examples and relevance notes.
- Tool-calling flows with validation and recovery behavior.
- Evaluation sets that compare prompt or model changes.
- Production screenshots, code samples, or technical writeups.
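A tool-calling flow with validation and recovery behavior, as listed above, might look like the following minimal sketch. The tool names (`lookup_order`, `issue_refund`), the `TOOL_SCHEMAS` registry, and the retry-then-escalate policy are illustrative assumptions, not part of any specific SDK.

```python
from __future__ import annotations

import json

# Hypothetical tool registry: tool name -> required argument names and types.
TOOL_SCHEMAS = {
    "lookup_order": {"order_id": str},
    "issue_refund": {"order_id": str, "amount_cents": int},
}


def validate_tool_call(raw: str) -> tuple[dict | None, str | None]:
    """Parse a model-proposed tool call and return (call, error)."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, f"invalid JSON: {exc.msg}"
    if not isinstance(call, dict):
        return None, "tool call must be a JSON object"
    tool = call.get("tool")
    if not isinstance(tool, str) or tool not in TOOL_SCHEMAS:
        return None, f"unknown tool: {tool!r}"
    args = call.get("args", {})
    if not isinstance(args, dict):
        return None, "args must be a JSON object"
    for name, expected in TOOL_SCHEMAS[tool].items():
        if not isinstance(args.get(name), expected):
            return None, f"argument {name!r} must be {expected.__name__}"
    return call, None


def handle_model_output(raw: str, retries_left: int) -> dict:
    """Decide what to do with a proposed call: execute, retry, or escalate."""
    call, error = validate_tool_call(raw)
    if call is not None:
        return {"action": "execute", "call": call}
    if retries_left > 0:
        # Recovery: feed the validation error back to the model and ask again.
        return {"action": "retry", "feedback": error}
    # Out of retries: hand the case to a human instead of guessing.
    return {"action": "escalate", "reason": error}
```

A candidate who has built this kind of flow can usually explain what happens on each branch: what gets executed, what feedback goes back to the model, and who receives the escalation.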
The 5 best interview questions
- How would you decide between RAG, fine-tuning, and prompt-only design for this use case?
- Show us a model output that failed. How did you catch it?
- How would you evaluate an agent that performs a three-step workflow?
- What data permissions and audit logs would this feature need?
- Take-home test: design a small LLM feature with retrieval, an eval set, and an escalation path.
Scope, structure, and compensation
For a 20-50 person startup, scope should start with one workflow or product surface. A solo developer can ship an assistant, retrieval workflow, or internal agent if the data sources are available and the success criteria are narrow.
Compensation varies with product engineering depth. Candidates who can ship backend systems, evals, and model-integrated features command more than candidates focused only on prompt iteration.
Seven examples of work an LLM app developer can own
The right hire should be able to translate model capability into product or operations software. That can mean building a RAG-backed knowledge base that ingests company documentation, chunks it, stores embeddings in Pinecone or Chroma, and returns cited answers. It can mean a customer-support agent that classifies tickets, drafts responses for review, escalates edge cases, and logs every interaction for quality review. It can also mean a sales call summarizer that extracts next steps and deal-stage signals before updating Salesforce or HubSpot.
More product-heavy examples include a writing assistant, proposal generator, or contract analysis feature embedded inside a SaaS product with streaming responses, feedback collection, and usage metering. More agentic examples include a system that reads PDFs, calls internal APIs, browses permitted sources, and composes structured outputs with human checkpoints. The mature version of the role also owns eval harnesses: test suites that run prompt regressions and alert the team when a model or prompt update lowers output quality.
- RAG-backed knowledge base with source attribution and retrieval-quality checks.
- Support agent with classification, drafted replies, escalation, and quality logs.
- Sales call summarizer that writes structured CRM updates and action items.
- Product-embedded writing, contract, or proposal assistant with usage metering.
- Multi-step agent with tool use, document reading, API calls, and human checkpoints.
- Evaluation harness that tests prompt changes against labeled examples.
- Fine-tuned or adapted model for domain language, output format, or terminology when prompt-only design is not enough.
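The first item in the list, a RAG-backed knowledge base with source attribution, can be sketched without any external services. This is a simplified illustration: a production pipeline would store embeddings in a vector database such as Pinecone or Chroma, while the sketch below swaps in plain word overlap so it runs on the standard library alone. The names `Chunk`, `chunk_document`, `retrieve`, and `build_context` are hypothetical.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass
class Chunk:
    source: str  # document identifier, kept so answers can cite it
    text: str


def tokenize(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def chunk_document(source: str, text: str, size: int = 40, overlap: int = 10) -> list[Chunk]:
    """Split a document into overlapping word windows (assumes overlap < size)."""
    words = text.split()
    if not words:
        return []
    step = size - overlap
    return [
        Chunk(source, " ".join(words[i:i + size]))
        for i in range(0, max(len(words) - overlap, 1), step)
    ]


def retrieve(query: str, chunks: list[Chunk], k: int = 2) -> list[tuple[float, Chunk]]:
    """Rank chunks by the share of query tokens they contain.

    Word overlap is a stand-in for vector similarity in this sketch.
    """
    q = tokenize(query)
    scored = [(len(q & tokenize(c.text)) / max(len(q), 1), c) for c in chunks]
    scored = [item for item in scored if item[0] > 0]
    return sorted(scored, key=lambda item: item[0], reverse=True)[:k]


def build_context(query: str, chunks: list[Chunk]) -> str:
    """Format retrieved chunks with source tags so the model can cite them."""
    return "\n".join(f"[{c.source}] {c.text}" for _, c in retrieve(query, chunks))
```

Keeping the `source` field attached to every chunk is what makes cited answers and retrieval-quality checks possible later; dropping it at ingestion time is hard to undo.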
Skills, portfolio signals, and red flags
A production-ready LLM developer needs more than model enthusiasm. Table-stakes skills include Python, at least one major LLM SDK, production RAG implementation, vector database experience, an eval framework such as Promptfoo or DeepEval, observability through LangSmith, LangFuse, Helicone, or custom logs, and cost awareness through caching, model routing, and token budgeting. Fine-tuning, multi-agent frameworks, streaming UI, and GPU infrastructure are useful, but they are not the first filter for most startup roles.
| Signal | Green flag | Red flag |
|---|---|---|
| Deployment | A feature handling real users, usage volume, and cost per query | Only notebooks, tutorials, or local demos |
| Evaluation | A test set or rubric used before prompt/model changes ship | No answer to how quality is measured |
| Retrieval | Can explain chunking, reranking, source attribution, and hallucination controls | Claims RAG experience but cannot discuss retrieval tradeoffs |
| Operations | Shows logs, traces, regression checks, and cost monitoring | Only discusses model choice or prompt wording |
| Prompt skill | Prompt systems tied to measurable product outcomes | Prompt engineering listed as the whole job without broader implementation depth |
The most common hiring mistake is treating this as a general software role. A strong React or Node engineer may be excellent and still not know how to design RAG, build evals, monitor hallucinated citations, or control model spend. Conversely, a prompt-heavy specialist may improve outputs but fail to own the product system. For a 20-50 person startup, the practical default is an LLM app developer first. Agent engineering is a subset of that skill set, and full autonomous-agent complexity is usually unnecessary in year one.
Scope, compensation, and sourcing
A solo LLM developer at a 20-50 person startup can usually own one production LLM feature, one internal AI tool, and the evaluation and cost monitoring around those systems. They cannot realistically fine-tune large models, build a multi-agent platform, maintain a vector database at large scale, and ship new product features all at the same time. Before hiring, define the first feature, the data sources, the product owner who can judge output quality, and the cost ceiling for model usage.
| Level | Base salary | Total comp | Contract rate |
|---|---|---|---|
| Junior LLM engineer | $130,000-$180,000 | $140,000-$185,000 | $90-$130/hr |
| Mid-level | $180,000-$280,000 | $185,000-$260,000 | $130-$250/hr |
| Senior | $280,000-$400,000 | $240,000-$360,000 | $250-$400/hr |
| Staff / principal | $400,000-$700,000+ | $520,000-$900,000+ | $400-$700/hr |
Seed and Series A startups may compete at the lower end of these ranges if they offer meaningful equity and product ownership. The source research also points to a national senior applied AI engineering median around $230,625 base, which is useful context but should not be treated as a precise benchmark for every LLM role. Strong sourcing channels include practitioner communities, GitHub contributors to LangChain, LlamaIndex, Promptfoo, and LangFuse, Hacker News hiring threads, AI Engineer events, and startup-oriented platforms where candidates understand equity and ambiguity.
- Give the hire API keys and cost monitoring in week one.
- Provide access to the product codebase, deployment pipeline, and data stores.
- Set up or approve a vector database and observability tool before the first sprint.
- Name the stakeholder who decides whether an output is useful, incorrect, or risky.
- Set a take-home test around a broken RAG pipeline, hallucinated citations, and monitoring design rather than a blank chatbot build.
How to calibrate the job post
The job post should start with the product surface or internal workflow, then name the AI system requirements. "Build a customer support assistant over our help center, product docs, and ticket history" is clearer than "hire an AI engineer." The clearer version tells candidates they need retrieval, access control, response quality checks, and support-team handoff. It also signals that the team understands the difference between a demo and a maintained product surface.
If the role includes agents, define what autonomy actually means. A useful agent may gather context, call a tool, draft a recommendation, and pause for human approval. It does not have to be a fully autonomous system that plans for hours. For many startups, the practical version is an LLM app with tool use and checkpoints. That framing attracts builders who can ship and discourages over-engineering around agent frameworks before the business case is proven.
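The bounded form of autonomy described above (gather context, call a tool, draft a recommendation, pause for approval) can be sketched as a small state machine. This is a minimal illustration under stated assumptions: `AgentRun`, `run_until_checkpoint`, and `resolve_checkpoint` are hypothetical names, and the `gather`, `draft`, and `apply` callables stand in for whatever retrieval, model call, and side effect the real system uses.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable


@dataclass
class AgentRun:
    task: str
    context: list[str] = field(default_factory=list)
    draft: str | None = None
    # pending -> awaiting_approval -> approved or rejected
    status: str = "pending"


def run_until_checkpoint(
    task: str,
    gather: Callable[[str], list[str]],
    draft: Callable[[str, list[str]], str],
) -> AgentRun:
    """Do the automated steps, then stop before anything is applied."""
    run = AgentRun(task=task)
    run.context = gather(task)            # e.g. a retrieval or tool call
    run.draft = draft(task, run.context)  # e.g. a model call
    run.status = "awaiting_approval"
    return run


def resolve_checkpoint(run: AgentRun, approved: bool, apply: Callable[[str], None]) -> AgentRun:
    """Apply the draft only after an explicit human decision."""
    if run.status != "awaiting_approval":
        raise ValueError("run is not waiting on a human decision")
    if approved:
        apply(run.draft)
        run.status = "approved"
    else:
        run.status = "rejected"
    return run
```

The design choice worth noticing is that the side effect lives only in `resolve_checkpoint`, so nothing reaches a customer or a system of record without a recorded decision.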
- Name data sources: docs, CRM records, tickets, transcripts, PDFs, product data, or internal APIs.
- Name quality controls: eval set, human review, citations, logging, or regression checks.
- Name cost controls: caching, token budgeting, model routing, usage metering, and prompt compression.
- Name product requirements: latency, streaming, access control, permissions, and feedback collection.
- Avoid making model choice the whole role; the harder work is system design, evaluation, and operations.
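Three of the cost controls named above (caching, model routing, and token budgeting) can be combined in one small wrapper. The sketch below is illustrative only: the model names, the prices in `PRICE_PER_1K`, the 200-token routing threshold, and the four-characters-per-token estimate are placeholder assumptions, and `call_model` is a hypothetical function supplied by the caller.

```python
import hashlib

# Placeholder prices per 1,000 tokens; real prices depend on the provider.
PRICE_PER_1K = {"small-model": 0.0005, "large-model": 0.01}


def estimate_tokens(text: str) -> int:
    # Rough heuristic (about 4 characters per token); a real system would
    # use the provider's tokenizer instead.
    return max(1, len(text) // 4)


class CostGuard:
    def __init__(self, budget_usd: float):
        self.remaining = budget_usd
        self.cache = {}  # prompt hash -> cached answer

    def route(self, prompt: str) -> str:
        """Send short prompts to the cheaper model (threshold is an assumption)."""
        return "small-model" if estimate_tokens(prompt) < 200 else "large-model"

    def complete(self, prompt: str, call_model) -> str:
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        if key in self.cache:
            return self.cache[key]  # cache hit costs nothing
        model = self.route(prompt)
        cost = estimate_tokens(prompt) / 1000 * PRICE_PER_1K[model]
        if cost > self.remaining:
            raise RuntimeError("model budget exhausted")
        self.remaining -= cost
        answer = call_model(model, prompt)
        self.cache[key] = answer
        return answer
```

This version only budgets input tokens and caches on exact prompt matches; a job post that names cost controls is asking for someone who can explain where those simplifications stop being acceptable.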
For a first hire, prioritize someone who can ship a narrow feature and prove it works. Fine-tuning and multi-agent orchestration may become useful later, but most early teams need a dependable retrieval or assistant workflow first. The evaluation process should reward candidates who ask for sample data, success criteria, and failure examples before proposing architecture.
A practical evaluation rubric
Score candidates across four dimensions. First, system design: can they explain data ingestion, retrieval, model calls, permissions, output formatting, and failure handling as one system? Second, evaluation: can they define a test set, metrics, and regression process? Third, product judgment: can they explain when to keep a human checkpoint, when latency matters, and how users will give feedback? Fourth, operations: can they monitor cost, traces, quality drift, and bad outputs after launch?
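One way to record the four dimensions is a simple scorecard. This is a sketch under assumptions the guide does not specify: a 1-5 scale per dimension, equal weighting, and a `floor` below which any single dimension is flagged. The function name `score_candidate` is hypothetical.

```python
DIMENSIONS = ("system_design", "evaluation", "product_judgment", "operations")


def score_candidate(scores: dict, floor: int = 2) -> dict:
    """Summarize interviewer scores (assumed 1-5 per dimension)."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing scores for: {', '.join(missing)}")
    weakest = min(DIMENSIONS, key=lambda d: scores[d])
    return {
        "total": sum(scores[d] for d in DIMENSIONS),
        "weakest": weakest,
        # A strong total should not hide one dimension that is far too low.
        "below_floor": scores[weakest] < floor,
    }
```

Flagging the weakest dimension separately keeps a candidate with excellent system design and no operational experience from averaging out to an acceptable score.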
The take-home should be bounded to three or four hours and should test debugging instead of raw invention. A broken RAG pipeline with hallucinated citations is more revealing than a blank chatbot assignment. Strong candidates will inspect retrieval, chunking, prompt constraints, citation formatting, logs, and eval coverage. Weak candidates will swap models or rewrite the prompt without diagnosing the system.
Last-mile calibration
If the candidate cannot explain how the feature behaves after launch, keep digging. Production LLM work is a loop: ship a narrow capability, collect examples, review failures, update prompts or retrieval, rerun evals, watch cost, and repeat. That loop is the difference between someone who has experimented with models and someone who can own an AI product surface.
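The "rerun evals" step in that loop can be as small as comparing a candidate prompt or model against the current baseline on a labeled set. The following is a minimal sketch, assuming exact-match scoring and hypothetical names (`pass_rate`, `regression_check`); dedicated tools such as Promptfoo or DeepEval support richer metrics.

```python
from __future__ import annotations

from typing import Callable

# A predictor wraps one prompt/model configuration: input text -> output text.
Predictor = Callable[[str], str]


def pass_rate(predict: Predictor, examples: list[dict]) -> float:
    """Fraction of labeled examples where the output matches the expected label."""
    if not examples:
        raise ValueError("eval set is empty")
    passed = sum(
        predict(ex["input"]).strip().lower() == ex["expected"].strip().lower()
        for ex in examples
    )
    return passed / len(examples)


def regression_check(
    baseline: Predictor,
    candidate: Predictor,
    examples: list[dict],
    tolerance: float = 0.0,
) -> dict:
    """Compare a proposed change against the current version before release."""
    base = pass_rate(baseline, examples)
    cand = pass_rate(candidate, examples)
    return {"baseline": base, "candidate": cand, "regressed": cand < base - tolerance}
```

Wired into a deployment pipeline, a `regressed` result would block the release or alert the team, which is the behavior the eval harness examples earlier in this guide describe.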
Sources and review notes
- AI chatbot and agent builder category - Canonical category path.
- AppliedHire taxonomy - Role-family and skill-cluster framing.
- LLM and agent builder role-family research - Uses source brief examples for RAG, support agents, CRM summarizers, eval harnesses, fine-tuning, and 2026 compensation ranges.