Don't Forget DR for GAI

A colleague asked me about resilience strategies for LLM based applications (you can use LLM interchangeably with GAI in this post), which got me thinking that I've not seen much written about that. So here we are. I have touched tangentially on some aspects of DR in my posts on security and LLMs, so I won't repeat too much of that here.

As with the security of LLM based applications, many of the issues you deal with are the same ones you handle in traditional planning, but there are also differences and new patterns that you need to consider. I won't spend too much time on the standard processes I hope you already have in place. I will also assume that you are using Google Cloud, but you can do the mapping for your cloud of choice.

Any DR solution needs a plan, so if you haven't had to think about this for a while I suggest starting with a refresher by reviewing Disaster recovery planning guide | Cloud Architecture Center.

Your strategy will vary depending on whether your LLM is deployed on-premises, in the cloud, or in a hybrid cloud/on-premises set up. A quick pass through the GCP DR docs (or the DR docs of your cloud of choice) can help you start to think through what you may or may not want to do.

This is about DR, so think RTO (Recovery Time Objective) and RPO (Recovery Point Objective). What is your tolerance for downtime, and what is your budget?

Backup & recovery of data

Without data a model isn't really of any use, so all the guidance around data resilience applies.

Regularly back up the data used to train the model, and also the model checkpoints.

Data for use in LLM based applications can be stored in GCS as well as in databases.

Think about how long it would take to rebuild a vector database, for example; this will affect your RTO and RPO values.

Standard database DR patterns should be implemented, and there is plenty of guidance for whatever flavor of database you use.

For data stored in GCS, keep backups securely in separate GCS buckets. Implement processes so it isn't easy to traverse from the primary bucket to the backup bucket; this helps mitigate against propagating whatever incident caused the loss or corruption of the data in the first place.
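
As a minimal sketch of that separation, assuming hypothetical bucket names and that the backup bucket lives in a different project with its own, more restrictive IAM policy, a scheduled job using the google-cloud-storage client could copy objects across:

```python
from google.cloud import storage

# Hypothetical bucket names; the backup bucket sits in a separate project
# with its own, more restrictive IAM policy.
PRIMARY_BUCKET = "my-llm-data-primary"
BACKUP_BUCKET = "my-llm-data-backup"


def backup_primary_objects() -> None:
    """Copy every object in the primary bucket into the backup bucket."""
    client = storage.Client()
    primary = client.bucket(PRIMARY_BUCKET)
    backup = client.bucket(BACKUP_BUCKET)

    for blob in client.list_blobs(PRIMARY_BUCKET):
        primary.copy_blob(blob, backup, new_name=blob.name)
        print(f"backed up gs://{PRIMARY_BUCKET}/{blob.name}")


if __name__ == "__main__":
    backup_primary_objects()
```

For larger datasets you would more likely schedule this with the Storage Transfer Service, but the separation of buckets (and projects and IAM) is the point here.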

I like the GCS Soft Delete feature as a way to help mitigate against accidental or malicious deletion.

MLOps pipeline

Don’t forget your MLOps pipeline.

Implement processes to ensure that your MLOps pipeline works as expected. Stress test all parts of the pipeline. Can you still pull your application code? Can you deploy to alternative targets? It's actually pretty easy to use the same model weights from an alternative location; I'd say it's even easier than pointing to a restored database.
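
For example, here is a minimal sketch (with hypothetical bucket and object names) of pulling model weights from a backup location when the primary is unavailable:

```python
from google.cloud import storage

# Hypothetical primary and backup locations for the same checkpoint.
WEIGHT_LOCATIONS = [
    ("llm-weights-primary", "checkpoints/model-v3.bin"),
    ("llm-weights-backup", "checkpoints/model-v3.bin"),
]


def download_weights(local_path: str) -> str:
    """Download model weights, falling back to the backup bucket if needed."""
    client = storage.Client()
    for bucket_name, blob_name in WEIGHT_LOCATIONS:
        try:
            client.bucket(bucket_name).blob(blob_name).download_to_filename(local_path)
            return f"gs://{bucket_name}/{blob_name}"
        except Exception:  # broad catch for the sketch; narrow this in production
            continue
    raise RuntimeError("model weights unavailable in all configured locations")
```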

And test, test, test. Can you roll back to a known good state? Don't wait till it's too late to find out that your recovery process doesn't quite work!

Disaster recovery scenarios for data | Cloud Architecture Center details various options for data backup and recovery scenarios.

Model resilience

A pattern that is becoming common is the use of multiple LLMs. These patterns include:

  • The use of multiple specialized models. This post has a nice table contrasting the single large model pattern with multiple specialized models.
  • Ensemble methods that involve combining multiple models to improve the overall performance of a machine-learning task. The underlying principle is that leveraging the strengths and mitigating the weaknesses of multiple models can achieve better accuracy and robustness than any single model. For a deeper dive on this pattern see A Review of Hybrid and Ensemble in Deep Learning for Natural Language Processing
  • Although we are now in the era of multimodal LLMs, prompt chaining is a popular and practical pattern to adopt, as each model behaves in different ways and some are better optimized for specific tasks than a multimodal LLM that does everything. This pattern focuses on strategically feeding outputs from one LLM as prompts to another LLM to achieve a complex task. For example, if you have an LLM that excels at summarizing factual topics and another that is better at creative writing, you could use the first LLM to summarize a factual background for a story and then feed that summary as a prompt to the creative LLM to generate a fictional narrative based on the factual background (a minimal sketch follows this list). This approach typically uses orchestration frameworks like LangChain or LlamaIndex to manage the workflow of prompts, data retrieval, and LLM interactions. My From RAG to agents post describes a similar pattern, and indeed you can implement it with agents, using a different model for each agent.
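
Here is a minimal sketch of prompt chaining using the Vertex AI SDK; the project, location, and model choices are illustrative assumptions, and in a real application you would more likely drive this from an orchestration framework such as LangChain, as noted above.

```python
import vertexai
from vertexai.generative_models import GenerativeModel

# Hypothetical project, location, and model choices for illustration.
vertexai.init(project="my-project", location="us-central1")

summarizer = GenerativeModel("gemini-1.5-flash")  # handles the factual summary step
writer = GenerativeModel("gemini-1.5-pro")        # handles the creative writing step


def chained_story(topic: str) -> str:
    """Feed the output of one model in as the prompt for the next."""
    # Step 1: summarize the factual background.
    summary = summarizer.generate_content(
        f"Summarize the key facts about {topic} in five bullet points."
    ).text

    # Step 2: use that summary as grounding for the creative model.
    return writer.generate_content(
        "Write a short fictional narrative grounded in these facts:\n" + summary
    ).text
```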

With any of these patterns the devil is in the detail of how to actually implement them. This post isn't focused on how you do a thing but on explaining what you can do (although my RAG to agents post does include a how-to)!

So after that small diversion into LLM design patterns, you may be asking: how does this help with DR for LLM applications? Well, funny you should ask.

You can design redundancy into your application by implementing a secondary LLM as a backup. The discussion of multiple-model patterns above describes some approaches to help with this. The backup could be a similar model or one specialized for different tasks. Think about your end-user experience when implementing this strategy.

You can design for failover by implementing routing code that detects outages or network latency problems and automatically switches to the secondary LLM if the primary fails.
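
A minimal sketch of that routing logic, assuming hypothetical primary and secondary Gemini models on Vertex AI and a simple latency budget, might look like this:

```python
import time

import vertexai
from vertexai.generative_models import GenerativeModel

# Hypothetical project, location, and model choices for illustration.
vertexai.init(project="my-project", location="us-central1")

PRIMARY = GenerativeModel("gemini-1.5-pro")
SECONDARY = GenerativeModel("gemini-1.5-flash")

LATENCY_BUDGET_SECONDS = 10.0  # beyond this, flag the model as degraded


def generate_with_failover(prompt: str) -> str:
    """Try the primary model first; on errors fall back to the secondary."""
    for name, model in (("primary", PRIMARY), ("secondary", SECONDARY)):
        start = time.monotonic()
        try:
            text = model.generate_content(prompt).text
        except Exception as exc:  # broad catch for the sketch; narrow this in production
            print(f"{name} model failed ({exc}); failing over")
            continue
        elapsed = time.monotonic() - start
        if elapsed > LATENCY_BUDGET_SECONDS:
            print(f"warning: {name} model took {elapsed:.1f}s")  # wire up real monitoring here
        return text
    return "Sorry, the assistant is temporarily unavailable."  # last-resort canned response
```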

You should also keep versions of the model weights available so you can roll back to a previous version. Model versioning with Model Registry | Vertex AI | Google Cloud describes how you can manage versions of your model using Vertex AI.
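
As a rough sketch (with a hypothetical project, parent model ID, artifact location, and serving container), registering a new model version with the Vertex AI SDK while keeping the known-good version as the default could look something like this:

```python
from google.cloud import aiplatform

# Hypothetical project, parent model ID, artifact location, and serving container.
aiplatform.init(project="my-project", location="us-central1")

new_version = aiplatform.Model.upload(
    display_name="support-bot-model",
    artifact_uri="gs://llm-weights-primary/checkpoints/model-v3/",
    serving_container_image_uri="us-docker.pkg.dev/vertex-ai/prediction/pytorch-gpu.2-1:latest",
    parent_model="projects/my-project/locations/us-central1/models/1234567890",
    is_default_version=False,  # keep the current known-good version as the default until validated
)

# Rolling back (or forward) is then a matter of deploying a specific version,
# e.g. aiplatform.Model("1234567890@1") refers to version 1 of that model.
```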

Keeping it simple

Yes, LLMs are wonderful things, but that doesn't mean you need to over-engineer; keeping your application design simple will go some way towards helping you develop a resilient architecture.

You can implement techniques such as:

Designing your application to gracefully degrade in case of LLM failure. This could involve implementing a fallback mode with reduced functionality, providing canned messages, or maybe failing over to a simpler static site with some simple drop-down lists if it's a support chat bot type application.
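
A minimal sketch of that graceful degradation, wrapping whatever function invokes your models (for example the failover sketch above) so the user always gets some response:

```python
FALLBACK_MESSAGE = (
    "Our assistant is temporarily unavailable. "
    "Please browse the help articles below or email support@example.com."
)


def answer_user(question: str, llm_call) -> str:
    """Wrap the LLM call so the user always gets *some* response.

    llm_call is whatever function invokes your primary (and secondary)
    model, for example the failover sketch earlier in this post.
    """
    try:
        return llm_call(question)
    except Exception:
        # Reduced-functionality mode: a canned message pointing at static content.
        return FALLBACK_MESSAGE
```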

For critical and frequently asked questions/prompts, you could collect a set of responses from the LLM and store them for retrieval in case of an outage.
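
A minimal sketch, with hypothetical questions and answers, of serving those pre-collected responses:

```python
from typing import Optional

# Hypothetical pre-generated answers for the most frequent prompts. In practice
# you might refresh these periodically from the live LLM and store them in GCS,
# Firestore, or alongside the static fallback site.
CACHED_ANSWERS = {
    "how do i reset my password": "Go to Settings > Security > Reset password ...",
    "what are your support hours": "Support is available 09:00-17:00 UTC, Monday to Friday.",
}


def cached_answer(question: str) -> Optional[str]:
    """Return a pre-generated answer if this is a known frequent question."""
    return CACHED_ANSWERS.get(question.strip().lower().rstrip("?"))
```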