What is the difference between an AI agent and an AI workflow?

A workflow is a system where the steps are predefined in code. An agent is a system where the LLM itself decides what steps to take and in what order. The key difference is who controls the logic — the developer or the model.
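
As a rough illustration of that distinction, here is a minimal sketch. The tool names and `stub_chooser` are hypothetical, and the chooser is a plain function standing in for an LLM call, so this shows only where the control flow lives, not how a real agent framework works.

```python
from typing import Callable, Dict, List

# Hypothetical tools; names are illustrative only.
def clean(text: str) -> str:
    return text.strip().lower()

def shout(text: str) -> str:
    return text.upper()

TOOLS: Dict[str, Callable[[str], str]] = {"clean": clean, "shout": shout}

def run_workflow(text: str) -> str:
    """Workflow: the developer fixes the steps and their order in code."""
    return shout(clean(text))

def run_agent(
    text: str,
    choose_next: Callable[[str, List[str]], str],
    max_steps: int = 5,
) -> str:
    """Agent: a chooser (standing in for an LLM) picks each step at run time."""
    history: List[str] = []
    for _ in range(max_steps):
        action = choose_next(text, history)
        if action == "stop":
            break
        text = TOOLS[action](text)
        history.append(action)
    return text

def stub_chooser(text: str, history: List[str]) -> str:
    """Toy stand-in for a model call: clean once if needed, then stop."""
    if text != text.strip().lower() and "clean" not in history:
        return "clean"
    return "stop"
```

Calling `run_workflow("  Hi ")` always runs both steps, while `run_agent("  Hi ", stub_chooser)` runs only the steps the chooser asks for.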

Do I need to know how to code to build an AI agent?

No. Tools like Claude Code let you describe what you want in plain language and handle most of the implementation. What matters more is clarity about what the system should do and what a good result looks like.

Why is observability important for AI agents?

AI systems are non-deterministic — they can behave differently across runs. Observability traces each step of an agent's execution so you can debug failures, understand outputs, and improve performance over time.
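
To make "tracing each step" concrete, here is a minimal sketch using only the Python standard library. The `traced` decorator, the in-memory `TRACE` list, and the two toy steps are assumptions for illustration; this is not the API of the Arthur Engine or any other tool.

```python
import functools
import time
from typing import Any, Callable, Dict, List

# In-memory trace store (an assumption; a real system would export records).
TRACE: List[Dict[str, Any]] = []

def traced(step: Callable[..., Any]) -> Callable[..., Any]:
    """Record the inputs, output or error, and duration of each call."""
    @functools.wraps(step)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        start = time.perf_counter()
        record: Dict[str, Any] = {"step": step.__name__, "args": args, "kwargs": kwargs}
        try:
            result = step(*args, **kwargs)
            record["output"] = result
            record["status"] = "ok"
            return result
        except Exception as exc:
            record["status"] = "error"
            record["error"] = repr(exc)
            raise
        finally:
            # Runs on both success and failure, so every call leaves a record.
            record["seconds"] = time.perf_counter() - start
            TRACE.append(record)
    return wrapper

@traced
def retrieve(query: str) -> str:
    # Hypothetical step standing in for a retrieval call.
    return f"notes about {query}"

@traced
def summarize(text: str) -> str:
    # Hypothetical step standing in for a model call.
    return text[:10]
```

After `summarize(retrieve("drift"))`, `TRACE` holds one record per step in execution order, which is the raw material for debugging a failed run.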

What is the Arthur Engine?

The Arthur Engine is a free, open-source tool for AI observability and evaluation. It traces every step of an AI agent or workflow so teams can see what happened, measure performance, and catch issues before users do.

Team Arthur at NeurIPS-19: A Retrospective

Arthur is fresh off the plane returning from NeurIPS, AI’s largest — and somewhat infamous — research conference. While there, Arthur announced its seed round and hosted a 50-person model monitoring meetup next to the convention center. Beyond that, the full NeurIPS was seven packed days of new advances and directional changes in the machine learning community. Here’s what the experience was like for us, written by Arthur Chief Scientist, John Dickerson.

I’ll start with a broad takeaway: NeurIPS this year was welcoming.

In previous years, the conference was known for being somewhat rowdy, coming to a head in 2017/18 with controversies that led, in part, to the conference being renamed. With 13,000 attendees this year, my worry was that things would feel even more chaotic. But a largely inclusive and constructive tone was set via community-wide norm changes led by leaders in the field, community-building events such as Black in AI and Queer in AI, and workshops such as the ever-growing Women in Machine Learning. This progress is great to see, as it is increasingly evident that AI algorithms are influenced not just by the raw data fed into them, but also by the traits and experiences of those who build them.

Along those lines, bias, fairness, and explainability were front and center in many talks, posters, and workshops. One of my favorite talks, given by Aaron Roth from Penn, blended ideas from individual and statistical (aka “group”) notions of fairness. The former constrains treatment across pairs of inputs (e.g., “similar individuals are treated similarly” due to Dwork et al.), while the latter partitions inputs into groups (e.g., based on age, race, or sex) and enforces equalization of various metrics across those groups. Both of these approaches have well-known pros and cons. Group-level fairness is relatively easy to define and enforce on general input distributions, but may result in unexpected behavior as one looks at sub-groups (e.g., inputs with a specific race and sex). Individual notions of fairness, unsurprisingly, come with attractive guarantees at the individual input level, but may be hard to define and incorporate into models. Aaron and colleagues’ work provides a natural middle ground, where explicit group selection is no longer needed and some level of guarantee (in expectation) is given to individual inputs; check out the paper!
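
The sub-group caveat for group-level fairness is easy to see on toy data. The sketch below is my own illustration, not the method from the talk: the records, the `(race, sex, prediction)` schema, and the helper names are all made up, and the "metric" is simply the gap in positive-prediction rates between groups.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, List, Tuple

Record = Tuple[str, str, int]  # (race, sex, prediction): toy schema, assumed

def positive_rates(
    records: List[Record], key: Callable[[Record], Hashable]
) -> Dict[Hashable, float]:
    """Fraction of positive predictions within each group defined by `key`."""
    totals: Dict[Hashable, int] = defaultdict(int)
    positives: Dict[Hashable, int] = defaultdict(int)
    for rec in records:
        group = key(rec)
        totals[group] += 1
        positives[group] += rec[2]
    return {g: positives[g] / totals[g] for g in totals}

def parity_gap(rates: Dict[Hashable, float]) -> float:
    """Largest difference in positive rate between any two groups."""
    return max(rates.values()) - min(rates.values())

# Made-up predictions: balanced by race and by sex, but not jointly.
records: List[Record] = [
    ("A", "F", 1), ("A", "F", 1),
    ("A", "M", 0), ("A", "M", 0),
    ("B", "F", 0), ("B", "F", 0),
    ("B", "M", 1), ("B", "M", 1),
]

gap_by_race = parity_gap(positive_rates(records, lambda r: r[0]))
gap_by_sex = parity_gap(positive_rates(records, lambda r: r[1]))
gap_by_both = parity_gap(positive_rates(records, lambda r: (r[0], r[1])))
```

Here the gap is 0.0 when grouping by race alone or by sex alone, yet 1.0 across the four race-and-sex sub-groups: the model looks perfectly balanced on each attribute while treating the intersections completely differently.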

Explainability was also center stage across the board, ranging from workshops such as Robust AI in Financial Services: Data, Fairness, Explainability, Trustworthiness, and Privacy with panel participants affirming the financial services industry’s need for trustworthy and explainable systems, to industry expo days like Fairness and Explainability: From Ideation to Implementation to domain-specific papers and demos focusing on computer vision and natural language processing tasks. Understanding why models make particular inferences is important for everything from debugging systems to adhering to regulatory requirements, and it was great to see the research community stepping up to provide new tools that will, hopefully, find their way into industry.

This was a big year for NLP and, specifically, all things BERT (Bidirectional Encoder Representations from Transformers, from Google) — and all kinds of visualizations of and attacks on BERT and friends.¹

Over the last couple of years especially, NLP has seen explosive progress, and that was showcased at this year’s NeurIPS as well.

Panel at the NeurIPS-19 “Minding the Gap: Between Fairness and Ethics” workshop.

Probably the biggest NLP paper presented XLNet (also from Google), which dominates BERT on a number of tasks. It’s exciting to see NLP move forward so quickly, of course, but many, including big names in the field, are starting to get grouchy² about the field’s seeming obsession with making already big models even bigger. This Twitter thread by Yoav Goldberg is a nice place to start for under-loved holes in NLP research, ranging from theory to explainability to generalization to incorporating linguistic theory back into deep-learning-based techniques.

The generalization capabilities of deep networks are poorly understood: they famously don’t align with the traditional statistical view that, after a point, bigger models are worse. Rather, they tend to exhibit a “double descent”, where, as a model grows in size, test error decreases, then increases, then decreases again. At least two nice papers came out in this “understanding deep learning generalization” space: the winner of the Outstanding New Directions Paper Award, from researchers at CMU, argues against a particular class of theoretical bounding technique, and another nice paper from Princeton and CMU advanced the understanding of what larger (namely, infinitely larger) networks can represent.

At the NeurIPS Retrospectives Workshop on Friday, leading researchers asked about their past work, “What should readers of this paper know now, that is not in the original publication?”

I love this idea and hope it takes off — researchers, especially younger researchers (when many do some of their “most famous” work!), often write papers from a single perspective. As their research gains prominence, folks from other fields inevitably chime in with references to earlier related work or methods for better framing. This feels personal to me: before heading off to do my PhD, I submitted my first research paper and was informed a few weeks later that a nearly-identical problem had been solved by researchers in the USSR and at RAND in the US in the 1970s, available online in poorly-scanned PDFs — in both Russian and English!

It turns out Michael Littman, a reinforcement learning luminary, had a similar experience. In the 1990s, he published seminal and extremely highly-cited work (pdf) introducing “Markov games” to the CS community, which helped spur thinking in multi-agent reinforcement learning. In his retrospective on that paper, he mentioned wishing he’d known about “stochastic games” — effectively the same idea, but introduced in the 1950s (!!), and with more theoretical rigor. Still, Mike states he is happy he wrote the paper — as is the community, because this pushed forward interest in multi-agent reinforcement learning, a topic of increasing importance today in application areas ranging from simple game playing to energy production and disaster relief operations. AI is such an interdisciplinary field that this type of cross-pollination is more a blessing than a curse; indeed, we’d all benefit from looking — and talking! — beyond traditional technology fields.

Also on Friday, our co-founder Liz spoke on a panel at the Minding the Gap: Between Fairness and Ethics workshop, alongside a variety of research scientists and engineers from Google focusing on ethics, fairness, and AI. The workshop featured practitioners and, in general, participants from outside the traditional AI/ML communities, which led to in-depth conversations about the gulf between what “standard” ML-based definitions of fairness offer and what practitioners might want or at least find practical to use in their day-to-day jobs.

While the AI/ML community (including myself) has started to address this divide, there’s clearly a ton to be done when it comes to properly incorporating the wants and needs of stakeholders into modern AI systems.

I spent Saturday at the CausalML workshop, aka Do the Right Thing: Machine Learning and Causal Inference for Improved Decision Making. Invited speakers such as Susan Athey and Susan Murphy connected techniques for uncertainty management from the AI/ML world with application areas in business (including labor markets and advertising) and healthcare. I found the poster sessions in this workshop particularly enlightening: causal and counterfactual reasoning are two intertwined topics that are forming the basis for both fair and robust (e.g., to adversarial manipulations, or to simple noise in data) automated systems, and I look forward to these concerns and ideas melting more into the greater ML community.

Finally, what would a large AI/ML conference be without a little drama?

Bengio and Schmidhuber, two deep learning visionaries from different continents, stay at loggerheads about who should give whom credit about what. I won’t dignify linking to the /r/machinelearning and Blind threads discussing this, but suffice it to say that the “who did it first in deep learning” rabbit hole is still getting, ah, deeper.

All in all, NeurIPS this year captured the largely positive zeitgeist of the machine learning and AI community. After a period of extreme growth, the conference (and, by proxy, the community) feels like it’s growing, if not grown, up. Yet, NeurIPS’ workshops, events, and keynotes captured another, more subtle thread: as the impact of AI continues to spread, and the community continues to learn and understand how that spread impacts society, researchers and practitioners alike are desperate for tools and guidance about how to integrate models safely and responsibly. There’s so much left to be done regarding not just the scalability and generalizability of modern ML methods, but also the ability to define and incorporate notions of fairness, bias, trustworthiness, and accountability into models and pipelines in ways that are interpretable both at train/test time and in deployment. I’d love to see more research and development time spent on:

Crossing boundaries and discussing exactly what industry practitioners and other stakeholders want and need. The human-computer interaction (HCI) and AI/ML communities are starting to build this knowledge out, but what we need is a feedback loop between stakeholders and researchers, formalizing the communication pipelines between both parties. This is the only way we will settle on meaningful definitions of what it means to approach “fairness” in different domains.
Tracking and analyzing the impact of implementing various objectives or constraints (e.g., promoting combinations of fairness, diversity, and economic efficiency) on truly dynamic systems, that is, systems where the input data distribution drifts over time, metrics change, and so on.
Simply put, measuring things. A discussion we have constantly at Arthur revolves around which metrics we should (i) show by default, (ii) pre-compute and allow to be toggled on or off, (iii) not pre-compute but allow to be computed, and (iv) leave to the user to define and pass to our system as a custom metric. We need to help industry understand what it needs, then develop scalable methods to measure that, and then integrate those metrics and measurements in live systems to track model performance. That means not just the so-called efficiency metrics such as accuracy, precision, and F-score, but also measures for fairness, bias, diversity, explainability, and anything else stakeholders might want.
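
To give a feel for the metric tiers in that last item, here is a minimal sketch of a metric registry. `MetricRegistry` and its methods are hypothetical and are not Arthur's actual API. For brevity it collapses tiers (ii) and (iii) into a single on/off toggle plus on-demand computation, and it assumes binary labels encoded as 0 and 1.

```python
from typing import Callable, Dict, Sequence, Set

Metric = Callable[[Sequence[int], Sequence[int]], float]

def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    pred_pos = sum(1 for p in y_pred if p == 1)
    return true_pos / pred_pos if pred_pos else 0.0

def recall(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    actual_pos = sum(1 for t in y_true if t == 1)
    return true_pos / actual_pos if actual_pos else 0.0

def f_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1: the harmonic mean of precision and recall."""
    p, r = precision(y_true, y_pred), recall(y_true, y_pred)
    return 2 * p * r / (p + r) if p + r else 0.0

class MetricRegistry:
    """Hypothetical registry: some metrics shown by default, others on demand."""

    def __init__(self) -> None:
        self._metrics: Dict[str, Metric] = {}
        self._shown: Set[str] = set()

    def register(self, name: str, fn: Metric, shown: bool = False) -> None:
        self._metrics[name] = fn
        if shown:
            self._shown.add(name)

    def toggle(self, name: str, shown: bool) -> None:
        if name not in self._metrics:
            raise KeyError(name)
        if shown:
            self._shown.add(name)
        else:
            self._shown.discard(name)

    def compute(self, name: str, y_true: Sequence[int], y_pred: Sequence[int]) -> float:
        """Compute any registered metric on demand, shown or not."""
        return self._metrics[name](y_true, y_pred)

    def report(self, y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
        """Compute only the metrics currently toggled on."""
        return {n: self._metrics[n](y_true, y_pred) for n in sorted(self._shown)}

registry = MetricRegistry()
registry.register("accuracy", accuracy, shown=True)  # shown by default
registry.register("precision", precision)            # available, off by default
registry.register("f_score", f_score)
# A user-defined custom metric: the share of positive predictions.
registry.register("positive_rate", lambda y_true, y_pred: sum(y_pred) / len(y_pred))
```

The point of the design is that a fairness or diversity measure plugs in exactly like `positive_rate` does: it is just another function over labels and predictions (or, in a richer version, over inputs and group attributes too).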

That last point is a multidisciplinary one, and — frankly — there’s no way to escape that. As AI practitioners, we aim to build systems for users in myriad fields, and we build them so they (i) work and (ii) work in a way that users understand. The more progress the field makes toward understanding and then building scalable and general methods, the better. Those methods need to consider the competing and sometimes contradictory wants of stakeholders: efficiency, fairness, robustness, explainability, and even justice. It’s not a simple problem, and it’s not one that can be solved by the AI/ML community alone. I look forward to the work and discussion that will come from future AI/ML conferences — such as AIES and AAAI in NYC, Arthur’s home city, in early 2020!

[1] Or, in this case, attacks on BERT by friends. I’m a fan of bad jokes, and approve of the team from AI2 and UW building Grover, a fake news generator.

[2] There’s an Oscar the Grouch joke in here somewhere.
