GCP AI/ML skill. Use when: (1) running Gemini models (gemini-2.5-pro, gemini-2.5-flash, gemini-2.0-flash) via Vertex AI,
Vertex AI is the production entry point for ML on GCP. It covers:
Gemini models: gemini-2.5-pro, gemini-2.5-flash, gemini-2.0-flash, gemini-1.5-pro, gemini-1.5-flash. Embedding models: text-embedding-004, gemini-embedding-001, text-multilingual-embedding-002.
AI Studio vs Vertex AI: AI Studio (aistudio.google.com) is the prototyping UI — API key auth, consumer Google account, no VPC/IAM controls. Vertex AI is the production path — IAM, service accounts, regional endpoints, VPC Service Controls, quotas. Always use Vertex for anything user-facing.
| Use case | Recommended |
|---|---|
| South African app, low-latency chat | Gemini Flash in europe-west2 (nearest with full model availability) |
| Complex reasoning, agentic | Gemini 2.5 Pro or Claude Opus 4.7 via Model Garden |
| Cheap high-volume summarisation | Gemini 2.0/2.5 Flash |
| Anthropic-loyal app needing GCP billing | Claude via Vertex Model Garden |
| RAG over private docs | Vertex AI Search OR Gemini + embeddings + your vector DB |
| On-device / edge inference | Not Vertex — use Gemini Nano (Android) or Workers AI |
Vertex AI uses Application Default Credentials (ADC) — no API keys. On GCE/Cloud Run/GKE, the attached service account is used automatically. Locally, gcloud auth application-default login.
# Grant the service account Vertex AI user role
gcloud projects add-iam-policy-binding my-app-prod \
--member "serviceAccount:[email protected]" \
--role "roles/aiplatform.user"
npm install @google/genai
import { GoogleGenAI } from '@google/genai';
// Vertex AI mode — uses ADC, no API key
const ai = new GoogleGenAI({
vertexai: true,
project: 'my-app-prod',
location: 'europe-west2',
});
// Simple generate
const res = await ai.models.generateContent({
model: 'gemini-2.5-flash',
contents: 'Summarise the POPIA requirements for processing SA personal data in 3 bullets.',
});
console.log(res.text);
// With system instruction + generation config
const res2 = await ai.models.generateContent({
model: 'gemini-2.5-pro',
contents: [
{ role: 'user', parts: [{ text: 'Draft a 1-line product update for Penny briefings.' }] }
],
config: {
systemInstruction: 'You are a concise technical writer. Max 25 words.',
temperature: 0.3,
maxOutputTokens: 60,
responseMimeType: 'application/json',
responseSchema: {
type: 'object',
properties: { headline: { type: 'string' }, emoji: { type: 'string' } },
required: ['headline', 'emoji'],
},
},
});
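When consuming responseMimeType: 'application/json', it is worth parsing defensively: even with a responseSchema, models sometimes return markdown-fenced JSON. A minimal helper sketch (parseJsonResponse is a hypothetical name, not part of the SDK) that strips fences before parsing:

```typescript
// Hypothetical helper: tolerate markdown-fenced JSON before parsing.
function parseJsonResponse<T>(text: string): T {
  const cleaned = text
    .trim()
    .replace(/^```(?:json)?\s*/i, '') // leading ```json fence, if any
    .replace(/\s*```$/, '');          // trailing fence, if any
  return JSON.parse(cleaned) as T;
}

// Example: the { headline, emoji } schema from the snippet above
const parsed = parseJsonResponse<{ headline: string; emoji: string }>(
  '```json\n{"headline":"Shipped dark mode","emoji":"🌙"}\n```'
);
```

Feed it res2.text; it handles both clean JSON and fenced output with the same call.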
// Streaming
const stream = await ai.models.generateContentStream({
model: 'gemini-2.5-flash',
contents: 'Write a 5-step plan for launching in SA.',
});
for await (const chunk of stream) {
process.stdout.write(chunk.text ?? '');
}
const tools = [{
functionDeclarations: [{
name: 'lookup_client',
description: 'Find a 2n client by ID',
parameters: {
type: 'object',
properties: { clientId: { type: 'string' } },
required: ['clientId'],
},
}],
}];
const res = await ai.models.generateContent({
model: 'gemini-2.5-pro',
contents: 'Show me client 2n-014 summary',
config: { tools },
});
// Inspect and handle function calls
const call = res.candidates?.[0]?.content?.parts?.[0]?.functionCall;
if (call?.name === 'lookup_client') {
const data = await db.clients.get(call.args.clientId as string);
// Send function response back for final answer
const final = await ai.models.generateContent({
model: 'gemini-2.5-pro',
contents: [
{ role: 'user', parts: [{ text: 'Show me client 2n-014 summary' }] },
{ role: 'model', parts: [{ functionCall: call }] },
{ role: 'function', parts: [{ functionResponse: { name: 'lookup_client', response: data } }] },
],
config: { tools },
});
console.log(final.text);
}
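Past one tool, the single if-check above can become a lookup table. A minimal sketch: handlers and dispatchFunctionCall are illustrative names, not SDK API, and the lookup_client handler is a stub standing in for the real db call:

```typescript
// Hypothetical dispatcher: map function-call names to local handlers.
type FunctionCall = { name: string; args: Record<string, unknown> };

const handlers: Record<string, (args: Record<string, unknown>) => Promise<unknown>> = {
  // Stub handler — replace with the real db lookup.
  lookup_client: async (args) => ({ id: args.clientId, status: 'active' }),
};

async function dispatchFunctionCall(call: FunctionCall): Promise<unknown> {
  const handler = handlers[call.name];
  if (!handler) throw new Error(`No handler for tool: ${call.name}`);
  return handler(call.args);
}
```

The return value would then be wrapped in a functionResponse part, as in the snippet above.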
// Generate embedding — 768 dims for text-embedding-004
const emb = await ai.models.embedContent({
model: 'text-embedding-004',
contents: 'POPIA requires consent and purpose limitation for personal data.',
config: { taskType: 'RETRIEVAL_DOCUMENT', outputDimensionality: 768 },
});
const vec = emb.embeddings[0].values; // number[]
// Store in Vectorize / pgvector / Firestore vector search
await env.VECTORIZE.insert([{ id: 'doc-1', values: vec, metadata: { source: 'popia.md' } }]);
// Query with RETRIEVAL_QUERY task type for asymmetric retrieval
const q = await ai.models.embedContent({
model: 'text-embedding-004',
contents: 'Do I need consent to collect email addresses?',
config: { taskType: 'RETRIEVAL_QUERY' },
});
const results = await env.VECTORIZE.query(q.embeddings[0].values, { topK: 5 });
Key embedding task types: RETRIEVAL_DOCUMENT (when storing), RETRIEVAL_QUERY (when searching), SEMANTIC_SIMILARITY (symmetric), CLASSIFICATION, CLUSTERING. Must match between index and query.
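To sanity-check embeddings locally before trusting the vector DB's ranking, cosine similarity is plain arithmetic with no SDK involved:

```typescript
// Cosine similarity between two embedding vectors
// (e.g. the 768-dim number[] returned for text-embedding-004).
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('Dimension mismatch');
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

The dimension check also catches the mismatch described below: embeddings generated with a different outputDimensionality than the index expects.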
Anthropic's Claude is available through Vertex — billed on the GCP invoice, same IAM, same region boundary. Useful if your org requires single-vendor billing or VPC Service Controls around Anthropic inference.
# Enable in Model Garden (one-time, per model, per region)
# Console: Vertex AI → Model Garden → search "Claude 4.5 Sonnet" → Enable
import { AnthropicVertex } from '@anthropic-ai/vertex-sdk';
const client = new AnthropicVertex({
projectId: 'my-app-prod',
region: 'europe-west4', // Claude regions: us-east5, europe-west4, asia-southeast1
});
const msg = await client.messages.create({
model: 'claude-sonnet-4-6@20250106', // Vertex uses pinned model revisions
max_tokens: 1024,
messages: [{ role: 'user', content: 'Write a 3-sentence POPIA compliance check.' }],
});
console.log(msg.content[0].type === 'text' ? msg.content[0].text : '');
Regional caveat for Claude on Vertex: Claude models are only in a subset of regions — commonly us-east5, europe-west4, asia-southeast1. Not all sizes in all regions. Check Model Garden before designing.
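One way to fail fast on this is to validate the configured region at startup against the list above. Treat the hard-coded list as a point-in-time snapshot and verify it in Model Garden:

```typescript
// Regions where Claude on Vertex is commonly available (snapshot — verify in Model Garden).
const CLAUDE_REGIONS = ['us-east5', 'europe-west4', 'asia-southeast1'] as const;

function assertClaudeRegion(region: string): void {
  if (!CLAUDE_REGIONS.includes(region as (typeof CLAUDE_REGIONS)[number])) {
    throw new Error(
      `Claude on Vertex is not available in ${region}; use one of: ${CLAUDE_REGIONS.join(', ')}`
    );
  }
}
```

Catching this at boot beats a confusing 404 from the endpoint at request time.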
# Deploy a tuned model or custom container to a managed endpoint
gcloud ai endpoints create --display-name my-endpoint --region europe-west2
ENDPOINT_ID=$(gcloud ai endpoints list --region europe-west2 \
--filter "displayName=my-endpoint" --format "value(name.scope(endpoints))")
gcloud ai endpoints deploy-model $ENDPOINT_ID \
--region europe-west2 \
--model projects/my-app-prod/locations/europe-west2/models/MODEL_ID \
--display-name my-deployment \
--machine-type n1-standard-4 \
--min-replica-count 1 --max-replica-count 5
Endpoints with min-replica-count >= 1 are billed 24/7 for the underlying machines. Consider batch prediction (cheaper, async) for non-interactive workloads.
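That 24/7 billing is easy to estimate: replicas × hourly machine rate × hours in a month. A sketch with a placeholder rate (the real n1-standard-4 price varies by region; look it up before budgeting):

```typescript
// Rough monthly cost of an always-on endpoint.
// hourlyRatePerReplica is a placeholder — use the real machine-type price for your region.
function monthlyEndpointCost(minReplicas: number, hourlyRatePerReplica: number): number {
  const HOURS_PER_MONTH = 730; // average month
  return minReplicas * hourlyRatePerReplica * HOURS_PER_MONTH;
}

// e.g. 1 replica at a hypothetical $0.20/hr, before autoscaling above the minimum
const baseline = monthlyEndpointCost(1, 0.2);
```

If that baseline exceeds the value of interactive latency, batch prediction is the cheaper shape.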
// Gemini can ground responses in Google Search (reduces hallucination, adds citations)
const res = await ai.models.generateContent({
model: 'gemini-2.5-flash',
contents: 'What are the SARS e-filing deadlines for 2026?',
config: {
tools: [{ googleSearch: {} }],
},
});
// Inspect res.candidates[0].groundingMetadata for citations
// Ground in your own Vertex AI Search datastore (RAG without writing vector code)
const res = await ai.models.generateContent({
model: 'gemini-2.5-pro',
contents: 'What does our onboarding contract say about notice period?',
config: {
tools: [{
retrieval: {
vertexAiSearch: {
datastore: 'projects/my-app-prod/locations/global/collections/default_collection/dataStores/my-contracts',
},
},
}],
},
});
// Per-request safety settings (fragment of a generateContent config)
config: {
safetySettings: [
{ category: 'HARM_CATEGORY_HATE_SPEECH', threshold: 'BLOCK_MEDIUM_AND_ABOVE' },
{ category: 'HARM_CATEGORY_HARASSMENT', threshold: 'BLOCK_MEDIUM_AND_ABOVE' },
{ category: 'HARM_CATEGORY_SEXUALLY_EXPLICIT', threshold: 'BLOCK_LOW_AND_ABOVE' },
{ category: 'HARM_CATEGORY_DANGEROUS_CONTENT', threshold: 'BLOCK_MEDIUM_AND_ABOVE' },
],
}
When a response is blocked, res.candidates[0].finishReason === 'SAFETY' and text is empty. Check finishReason on every call.
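A defensive accessor makes that check hard to forget. The sketch below uses a minimal structural type covering only the fields it reads (a subset of the SDK's response type); getTextOrThrow is an illustrative name:

```typescript
// Minimal response shape for the fields we check (subset of the SDK type).
type GenerateResponse = {
  text?: string;
  candidates?: { finishReason?: string }[];
};

function getTextOrThrow(res: GenerateResponse): string {
  const reason = res.candidates?.[0]?.finishReason;
  if (reason === 'SAFETY') {
    throw new Error('Response blocked by safety filters');
  }
  if (!res.text) {
    throw new Error(`Empty response (finishReason: ${reason ?? 'unknown'})`);
  }
  return res.text;
}
```

Route every generateContent result through it so blocked responses surface as errors instead of silent empty strings.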
| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| Gemini 2.5 Pro | ~$1.25 | ~$5.00 |
| Gemini 2.5 Flash | ~$0.075 | ~$0.30 |
| Gemini 2.0 Flash | ~$0.075 | ~$0.30 |
| Gemini 1.5 Pro | ~$1.25 | ~$5.00 (≤128k) |
| Gemini 1.5 Flash | ~$0.075 | ~$0.30 (≤128k) |
| Claude Sonnet 4.6 via Vertex | ~$3.00 | ~$15.00 |
| Claude Opus 4.7 via Vertex | ~$15.00 | ~$75.00 |
| Claude Haiku 4.5 via Vertex | ~$0.80 | ~$4.00 |
| text-embedding-004 | ~$0.025 / 1M input chars | — |
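The table converts directly into a cost estimator. The rates below are copied from the approximate figures above (USD per 1M tokens) and will drift; treat them as placeholders to be refreshed from the current price list:

```typescript
// Approximate per-1M-token rates from the table above (USD) — refresh before relying on them.
const RATES: Record<string, { input: number; output: number }> = {
  'gemini-2.5-pro': { input: 1.25, output: 5.0 },
  'gemini-2.5-flash': { input: 0.075, output: 0.3 },
};

function estimateCost(model: string, inputTokens: number, outputTokens: number): number {
  const r = RATES[model];
  if (!r) throw new Error(`No rate for ${model}`);
  return (inputTokens / 1_000_000) * r.input + (outputTokens / 1_000_000) * r.output;
}

// e.g. a 10k-token prompt with a 1k-token reply on gemini-2.5-flash
const perCall = estimateCost('gemini-2.5-flash', 10_000, 1_000);
```

Multiplying perCall by expected daily volume is usually enough to pick between Pro and Flash.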
Context caching (Gemini): cache input prompts > 32k tokens for a flat fee, then per-token reads at 25% of input cost. Big savings on repeated long-context prompts (same system prompt + knowledge base across many calls).
- Code samples on ai.google.dev often use API-key auth against the Gemini API (not Vertex). That flow has no IAM, no VPC-SC, no regional control. Always verify which endpoint your SDK hits — look for vertexai: true.
- gemini-2.5-pro is widely available; third-party Model Garden models (Claude, Llama) are limited to specific regions.
- RESOURCE_EXHAUSTED means you hit a per-minute quota. Request quota increases in Cloud Console → IAM & Admin → Quotas. Default quotas for new projects are low.
- responseMimeType: 'application/json' is great for structured output but requires a responseSchema for reliability. Without the schema, Gemini often emits markdown-fenced JSON that fails to parse.
- HARM_CATEGORY_DANGEROUS_CONTENT can trigger on security tutorials, medical advice, or legal questions. If your app is in a regulated domain, loosen thresholds or expect blocks.
- text-embedding-004 supports dimension reduction (768 or 256). This must match your vector DB config; changing it mid-flight invalidates existing embeddings.
- Vertex uses pinned Claude model revisions (e.g. claude-sonnet-4-6@20250106). Using an unpinned name can break without notice as Google moves defaults.
- Context caches expire; set an explicit ttl (e.g. ttl: '3600s') if your caching pattern relies on cross-request reuse.