Building Useful and Reliable AI Agents
Key takeaways from Princeton's AI agent session (August 29th, 2024): building accessible, reliable, and cost-effective AI for the real world.
On August 29th, 2024, I attended the “Useful and Reliable AI agents” session hosted by Princeton Language & Intelligence.
Attracting a global audience of over 600 people from more than 70 countries, the event highlighted a universal desire to harness the potential of AI agents - autonomous systems that can perceive, reason, and act towards goals creatively and flexibly within defined boundaries - across a wide range of applications in work and life.
The session dove deep into the key issues facing AI agent development today, highlighting six main areas:
Infrastructure
Cost
Reliability
Privacy
Safety/Security
UI
I learned a lot from the session and the ensuing conversations. Here’s my summary of takeaways for navigating these six challenges, which are essential to unlocking the full potential of AI agents and ensuring their responsible development and deployment:
1 - Infrastructure
Agent Architecture: Defining and Programming: Given the complexity of understanding even a single LLM's output, configuring agents demands systems thinking and extensive programming. A single agent can involve multiple LLM calls, prompts, compute optimizations, and tools organized in a graph framework, which quickly gets complicated for any non-trivial task. Much like guiding a competent human, clear communication and feedback are vital - the more of the agent’s behaviour we can explicitly define, such as incorporating expected queries and answers as examples, the better its outputs will be.
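To make this concrete, here’s a minimal sketch of a single-agent loop in Python: the model repeatedly chooses a tool or answers directly, with a step budget to bound cost. The `call_llm` helper, the tool set, and the JSON protocol are all illustrative assumptions rather than any particular framework’s API.

```python
import json

def call_llm(messages: list[dict]) -> str:
    """Hypothetical stand-in for any chat-completion API call."""
    raise NotImplementedError("wire up your model provider here")

# Toy tools; a real agent's tools would be vetted, sandboxed functions.
TOOLS = {
    "search": lambda q: f"(search results for {q!r})",
    "calculator": lambda expr: str(eval(expr)),  # demo only: eval is unsafe
}

SYSTEM_PROMPT = (
    "You are an agent. Reply with JSON only: "
    '{"tool": "<search|calculator|none>", "input": "...", "answer": "..."}. '
    "Use a tool when needed; set tool to 'none' when you can answer."
)

def run_agent(task: str, max_steps: int = 5) -> str:
    messages = [{"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": task}]
    for _ in range(max_steps):  # step budget bounds cost and runaway loops
        decision = json.loads(call_llm(messages))
        if decision["tool"] == "none":
            return decision["answer"]
        result = TOOLS[decision["tool"]](decision["input"])
        messages.append({"role": "assistant", "content": json.dumps(decision)})
        messages.append({"role": "user", "content": f"Tool result: {result}"})
    return "Step budget exhausted; escalate to a human."
```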
Evaluating Agent Performance: Evaluations have become a greater focus in LLM development, and they are even more elusive for agents that chain LLM outputs to autonomously make rational decisions and take actions. The history of agent evaluations spans from text-based games (roughly 2015-2020) to practical applications like WebShop in 2022, which captured basic agent actions in making online purchases. Recent and impressive benchmarks like SWE-bench and TAU-bench offer more thorough assessments of agent capabilities in software development and tool-using applications.
The Importance of Benchmarks: Benchmarks provide standardized measurements of LLM output and agent quality across various domains. However, their realism and applicability to subjective topics and business-critical applications (like law and medicine) demand constant improvement as models become increasingly capable. Model developers should also be aware of possible “overfitting” to highly publicized benchmarks, which can inflate vanity metrics at the expense of real-world effectiveness.
2 - Cost
Cost Optimization: Beyond Accuracy: Given how computationally expensive frontier models are to run, the industry is shifting focus beyond accuracy and output quality towards inference costs, especially for complex tasks and large-scale deployments. Smaller models can be surprisingly effective and may be the best solution for resource-constrained users.
The Power of Simpler Models: Smaller models, with careful tuning and repeated sampling, can sometimes outperform even state-of-the-art models (when zero-shot prompted). Techniques like shorter prompts, varied model temperatures, and better few-shot examples can help ‘warm up’ a model to reach its full potential.
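As a rough illustration, here’s what repeated sampling with majority voting (often called self-consistency) looks like; `sample_model` is a hypothetical stand-in for a single call to a small model.

```python
from collections import Counter

def sample_model(prompt: str, temperature: float) -> str:
    """Hypothetical single query to a small, cheap model."""
    raise NotImplementedError("call your model here")

def best_of_n(prompt: str, n: int = 10) -> str:
    # A higher temperature diversifies the samples; the most common
    # answer is often more accurate than any single zero-shot attempt.
    answers = [sample_model(prompt, temperature=0.8) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```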
Repeated Inference for Higher Quality: OpenAI’s new ‘Strawberry’ (o1) model is making waves in the industry for this new approach to inference, spending more compute at inference time to extract higher-quality outputs, touted as “advanced reasoning abilities”. By combining reinforcement learning with established prompting methods like ReAct and Chain of Thought (CoT), the model can self-reflect and think step by step, in alignment with a safe set of human values, goals, and expectations.
Inference is getting much cheaper: Encouragingly, LLM inference costs continue to decrease rapidly - down roughly 100x over the past two years, much like wireless data costs before them. This is essential for enabling effective agentic workflows, as inference calls can increase drastically to support systematic reflection, reasoning, and iteration on outputs.
3 - Reliability
Ensuring Reliable Agent Behaviour: The inherent variability of LLM outputs makes it challenging to ensure consistent and reliable agent behaviour. Reinforcing the points on Infrastructure, programming more of the architecture instead of relying solely on natural language prompting can improve consistency and reproducibility, which is vital for any business.
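One sketch of what “programming the architecture” can mean in practice: validate model output against an explicit schema and retry on failure, rather than trusting the prompt alone. The schema, `call_llm` helper, and retry policy here are illustrative assumptions.

```python
import json

# Expected output schema (illustrative).
REQUIRED_KEYS = {"customer_id": str, "action": str, "confidence": float}

def call_llm(prompt: str) -> str:
    """Hypothetical single-call helper for any LLM provider."""
    raise NotImplementedError

def reliable_extract(prompt: str, retries: int = 3) -> dict:
    for attempt in range(retries):
        raw = call_llm(prompt)
        try:
            data = json.loads(raw)
            if all(isinstance(data.get(k), t) for k, t in REQUIRED_KEYS.items()):
                return data  # output is well-formed; safe to act on
        except json.JSONDecodeError:
            pass
        # Feed the failure back so the next attempt can self-correct.
        prompt += f"\nAttempt {attempt + 1} was invalid. Return only JSON with keys {list(REQUIRED_KEYS)}."
    raise ValueError("No valid output after retries; fall back to a human.")
```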
100% Accuracy Is Not Realistic: Maximizing accuracy can lead to unbounded cost. It’s practically impossible to achieve 100% accuracy, and the cost of additional inference calls rises exponentially the closer you try to get to perfect (e.g., going from 80% to 90% accuracy could cost as much as going from 95% to 96%). Acknowledging this limitation, the likely future is one where humans remain involved in key decision-making processes, supported by powerful agents that offload much of the work.
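A toy model illustrates why the last few points are so expensive. Assume each independent attempt succeeds with probability p, and (a strong assumption) that failures are detectable so you can simply resample; then the attempts needed to reach a target accuracy t grow without bound as t approaches 1:

```python
import math

def attempts_needed(p: float, t: float) -> int:
    # Solve 1 - (1 - p)**n >= t for n, the number of independent attempts.
    return math.ceil(math.log(1 - t) / math.log(1 - p))

for t in (0.90, 0.99, 0.999, 0.9999):
    print(f"target {t}: {attempts_needed(0.6, t)} attempts")
# At p = 0.6, each extra "nine" keeps adding attempts: 3, 6, 8, 11.
```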
From ML to LLM Reliability: Traditional ML reliability relies on deterministic, rules-based measurement: for a given input, there is a well-defined correct output to check against. In contrast, LLMs are more nuanced, requiring a deeper understanding of human intent and preferences, often involving subjective judgments or “vibe checks”. For example, there’s no single correct answer to “Write a creative email for an outbound client lead.”
4 - Privacy
Protecting User Privacy: As agents become more helpful and handle increasingly sensitive data, it’s imperative to maintain user trust and prioritize ethical AI development. Well-designed agents must incorporate state-of-the-art privacy-preserving technologies that can make use of sensitive data with minimal compromise.
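As one small illustration of the pattern (not a production approach), sensitive fields can be redacted before text ever reaches a third-party model. The regexes below are simplified assumptions; a real system would use a vetted PII detector.

```python
import re

# Simplified PII patterns for illustration only.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def redact(text: str) -> str:
    """Replace detected PII with typed placeholders before inference."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text

print(redact("Contact Jane at jane@example.com or +1 555-123-4567."))
# -> Contact Jane at <EMAIL> or <PHONE>.
```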
Navigating Regulatory Compliance: Adhering to evolving data privacy regulations like the EU AI Act and California Consumer Privacy Act (CCPA) presents obstacles and important guardrails for development. At the same time, AI agents can also be instrumental in the regulatory compliance process itself by analyzing data to enhance risk assessment and management processes.
5 - Safety/Security
The AI Safety Institute’s Role: Established by the UK government, the institute functions like a start-up, focusing on developing safety evaluations, addressing misuse, societal impacts, and risks associated with autonomous systems.
The AI Risk Repository: A Comprehensive Resource: While not explicitly mentioned during the sessions, this comprehensive database introduced in August 2024 by MIT researchers catalogs over 700 AI risks, providing an important common frame of reference for understanding and addressing potential threats from AI. Google DeepMind has also been actively involved in AI risk research, releasing reports like “The Ethics of Advanced AI Assistants” (2024).
Ensuring AI Safety: Just like accuracy, achieving 100% safety in critical scenarios is near impossible, necessitating continuous due diligence and risk mitigation for one of the world’s most pressing problems. In a future with potentially millions or even billions of agents, evaluation systems must evolve to keep pace with progressively greater degrees of freedom and limited supervision of these systems, which can easily be misused by malicious actors seeking to spread misinformation or conduct cyberattacks.
6 - User Interface (UI)
Human-Centred Design: Recognizing that fully autonomous, fully reliable agents are impractical, more focus is needed on transparency around an agent’s capabilities, limitations, and decision-making processes to foster trust and understanding. Human input will be essential to cover the final few percentage points of accuracy and safety that make applications in high-stakes scenarios viable.
Creating Intuitive Interactions: User experiences that allow natural communication and interaction with AI agents hold enormous potential. Advancements in natural language processing and innovative use of multi-modal models that can process voice, audio, and video will bring us closer to realizing the personal assistants we see in science fiction.
Continuous Learning and User Feedback: Extending the established concept of “Human in the Loop”, humans will play a crucial role in teaching agents to improve by providing valuable feedback, especially as agents inevitably run into unforeseen situations. We should also be mindful of when to incorporate human feedback - do we always know the best action to take? When should agents delegate to humans?
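One way to frame the delegation question is as an explicit policy gate: the agent acts autonomously only when its confidence clears a threshold that depends on how reversible the action is. The confidence scores and thresholds below are purely illustrative assumptions.

```python
def decide(action: str, confidence: float, reversible: bool) -> str:
    # Irreversible actions get a stricter bar than reversible ones.
    threshold = 0.90 if reversible else 0.99
    if confidence >= threshold:
        return f"execute: {action}"
    return f"escalate to a human: {action}"

print(decide("refund $25", confidence=0.95, reversible=True))       # execute
print(decide("delete account", confidence=0.95, reversible=False))  # escalate
```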
Emerging AI Agents in the Market
We’re seeing a surge in the launch of AI assistants and agent-like products that go beyond simple Q&A chatbots, performing increasingly sophisticated and useful tasks.
Major tech companies are enabling versatile, more personalized functions on top of their commercial LLM offerings - such as scheduling appointments, analyzing data, and generating creative content. These include:
OpenAI’s Assistants (November 2023)
Anthropic Claude’s Artifacts (August 2024)
Google Gemini’s Gems (August 2024)
We’re also seeing a drive toward novel applications that push the limits of AI capabilities in groundbreaking ways, starting to take on the entire workload of software engineers:
Cognition AI’s ‘Devin’ (March 2024): This full-fledged developer agent goes beyond coding to replicate the complete work scope of a traditional software engineer, including designing, testing, and deploying software.
Cosine’s ‘Genie’ (August 2024): This innovative AI coding assistant is trained on data that mimics the cognitive processes, logic, and workflow of human engineers, making it a truly collaborative coding partner.