#LLM Ops

Posts about LLM Ops.

· Engineering

Ardan Ultimate AI #19 — Speculative decoding with a draft model

Run a small draft model to propose several tokens ahead, then verify them in a single forward pass with the large model. Latency drops without quality dropping, because the large model keeps only the tokens it agrees with. It is the technique production LLM serving uses but most application engineers don't see.
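To make the propose-then-verify loop concrete, here is a minimal sketch of the greedy variant, not the code from the post. The `draft` and `target` functions are hypothetical toy stand-ins for real models: each maps a token sequence to its next token. A real server would score all proposed positions in one batched forward pass; the loop below only mimics that for readability.

```python
from typing import Callable, List

# Hypothetical toy "models": a function from a token sequence to its greedy next token.
NextToken = Callable[[List[int]], int]


def target(seq: List[int]) -> int:
    """Stand-in for the large model: counts upward modulo 10."""
    return (seq[-1] + 1) % 10


def draft(seq: List[int]) -> int:
    """Stand-in for the small model: agrees with target except after token 4."""
    return 0 if seq[-1] == 4 else (seq[-1] + 1) % 10


def speculative_step(prefix: List[int], draft: NextToken, target: NextToken, k: int) -> List[int]:
    # 1. The draft model proposes k tokens autoregressively (cheap).
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # 2. The target model checks each position. In a real system this is
    #    one batched forward pass over all k positions.
    accepted: List[int] = []
    for i in range(k):
        expected = target(prefix + proposed[:i])
        if proposed[i] != expected:
            # First disagreement: keep the target's token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(proposed[i])

    # 3. Every proposal matched, so the same pass also yields one bonus token.
    accepted.append(target(prefix + proposed))
    return accepted


def generate(prefix: List[int], draft: NextToken, target: NextToken, k: int, max_new: int) -> List[int]:
    out = list(prefix)
    while len(out) - len(prefix) < max_new:
        out.extend(speculative_step(out, draft, target, k))
    return out[: len(prefix) + max_new]
```

In this greedy form the output is identical to decoding with the target model alone; the draft model only changes how many target passes are needed. Sampling-based variants use a probabilistic accept/reject rule instead of exact matching.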

· Engineering

Ardan Ultimate AI #18 — Incremental message caching (IMC) for chat

A long chat reprocesses the entire history on every turn. Prefix caching lets the server reuse the KV cache computed for the previous turn's prefix and compute only the new suffix. Massive latency win on long conversations.
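The idea can be sketched with a toy cache keyed by token prefix. This is an illustration under stated assumptions, not the post's implementation: `PrefixCache`, `prefill`, and the placeholder `_compute_kv` are hypothetical names, and the "KV entry" is a dummy value standing in for real attention keys and values.

```python
from typing import Dict, List, Tuple


class PrefixCache:
    """Toy prefix cache: stores per-token 'KV entries' keyed by the full token prefix."""

    def __init__(self) -> None:
        self._store: Dict[Tuple[int, ...], List[int]] = {}
        self.tokens_computed = 0  # counts how much prefill work was actually done

    def _compute_kv(self, token: int) -> int:
        # Placeholder for the expensive per-token prefill work. In a real model
        # the entry depends on all earlier tokens, which is why only an exact
        # prefix match can be reused.
        self.tokens_computed += 1
        return token * 2

    def prefill(self, tokens: List[int]) -> List[int]:
        # Find the longest cached sequence that is a prefix of the new input.
        best: Tuple[int, ...] = ()
        for key in self._store:
            if len(key) > len(best) and tuple(tokens[: len(key)]) == key:
                best = key

        # Reuse the cached entries and compute only the new suffix.
        kv = list(self._store.get(best, []))
        for tok in tokens[len(best):]:
            kv.append(self._compute_kv(tok))

        self._store[tuple(tokens)] = kv
        return kv
```

For example, prefilling `[1, 2, 3]` and then `[1, 2, 3, 4, 5]` computes five entries in total rather than eight, because the second call reuses the first three. The linear scan over cached keys is for clarity only; production servers typically index cached blocks by hash or a tree structure.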

· Engineering

Cost-aware agent dispatch — when the cheap agent is enough

Not every query needs the production agent. A cost-aware dispatcher decides whether to route to the cheap-and-fast agent or the expensive-and-thorough one. Same UX, dramatically lower bill.
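One way such a dispatcher could look is sketched below. Everything here is an assumption for illustration: the agent names, the cost figures, the keyword hints, and the scoring heuristic are all hypothetical, and the post may well use a different signal (for example a classifier or a confidence score from the cheap agent).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Agent:
    name: str
    cost_per_call: float  # hypothetical cost units, for illustration only


CHEAP = Agent("cheap-fast", 0.001)
THOROUGH = Agent("expensive-thorough", 0.05)

# Hypothetical phrases that suggest a query needs deeper reasoning.
HARD_HINTS = ("compare", "analyze", "step by step", "why")


def complexity_score(query: str) -> float:
    """Crude difficulty estimate: longer queries and 'hard' phrases score higher."""
    q = query.lower()
    score = min(len(q.split()) / 50.0, 1.0)  # length contributes at most 1.0
    score += 0.5 * sum(hint in q for hint in HARD_HINTS)  # 0.5 per matched hint
    return score


def dispatch(query: str, threshold: float = 0.5) -> Agent:
    """Route to the thorough agent only when the estimated difficulty warrants it."""
    return THOROUGH if complexity_score(query) >= threshold else CHEAP
```

With these assumed values, `dispatch("What time is it?")` returns the cheap agent, while a query containing "compare" and "analyze" crosses the threshold and goes to the thorough one. The caller sees a single `dispatch` entry point either way, which is what keeps the UX unchanged.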