The Arthur team is back home in New York after a strong showing at the Conference on Neural Information Processing Systems (a.k.a. NeurIPS), arguably the largest and most influential machine learning conference. We hosted onsite and offsite socials, gave an oral presentation, presented two papers in the main conference as well as papers at two workshops, co-organized a workshop, recruited—and one of us helped compose a song on overfitting, sung live with a pair of ukulele-wielding ML practitioners.

NeurIPS holds a special place in Arthur’s history. Back in 2019, we announced our $3.3M seed round at an event co-located with NeurIPS in Vancouver. Now, three years, 40+ team members in growth, and $50+ million later, it was great to return to NeurIPS, this time in New Orleans. Big themes, many of which our team expands upon below, included large language models (LLMs) and their generalization and semi-rebranding as foundation models, cross-collaboration between AI and other fields (psychology, policy, etc.), human-in-the-loop and user-centric ML pipelines, and context-aware ML as it relates to privacy and fairness.

NeurIPS is known for being both a venue for the dissemination of new industrial and academic research and a networking meetup with a strong event culture. Arthur, ever the responsible community member, also contributed to this latter focus by organizing a couple of well-attended events. On Wednesday night, joint with our partners at Index Ventures, we hosted an offsite happy hour at Jack Rose, with attendance from the investor, founder, big tech, and academic communities. The VC and investing community continues to increase its presence at flagship ML conferences, largely driven by the current excitement surrounding foundation models and generative ML (e.g., Stable Diffusion, ChatGPT), and it was great to see this trend continue at NeurIPS. On Thursday night, joint with friends at Abacus.AI, we held an open mic night at the conference venue with a few hundred attendees. Meant to be a free-form community event full of cheeky machine-learning-oriented fun, this was a great success, with GPT-3-generated poetry and live song, debate about the merits of non-tabular data, discussions of who invented social networking, and the evergreen research topic of how to improve peer review (as if it’s broken!). We’re happy to help build a vibrant ML community.

Below, members of our ML team give their takeaways and hot takes on what’s right, what’s wrong, what’s hot, and what’s not in the academic and industrial machine learning world.

### The Current State of ML Research (It’s Not Just LLMs)

Arthur MLE Valentine d’Hauteville writes, “I still recall the awe I felt on my first day at NeurIPS, walking amongst the myriad of different posters and research projects in the big hall of the New Orleans Convention Center. The mosaic of ideas, topics, and research stories displayed before me was impressive and a stark contrast to my usual research flow, which consists of exploring and (sometimes) getting a bit lost in the roots of a deep research paper reference tree. In this forest of posters, I felt immersed in the AI community, enthralled and slightly overwhelmed by the countless research minds and ideas present at my literal fingertips. As one of the biggest annual AI research conferences, NeurIPS in some way mirrors the collective interest and brain-space of the ML community, displaying both its prominent and upcoming narratives. For instance, as one might have anticipated, I witnessed great enthusiasm for, and a large contingent of work on, generative models and LLMs.” Arthur MLE Max Cembalest echoes this, stating that “the biggest trend at the conference was an increased study of large language models, their robustness, their generalizability to out-of-domain text, and their generalizability to tasks that are not directly language but approachable by LLMs anyway.”

But LLMs were far from the only interesting facet of ML research at NeurIPS. Valentine remembers feeling a “strong energy around the design, development, and nurturing of scientifically sound and usable ML practices, with research outputs spanning from theory to implementation.” “There was also much work on model efficiency—how to reduce the computational requirements for deep learning systems,” Arthur MLE Teresa Datta adds.

“Beyond these ever-present areas of research,” says Teresa, “there were two main threads of messaging that struck a chord. The first: developing neural networks which don’t involve backpropagation. Geoffrey Hinton, motivated by the lack of neuroscientific evidence that the brain’s cerebral cortex performs backpropagation, presented a keynote on a new learning procedure for neural networks that does not rely on it. This forward-forward approach replaces the forward and backward passes with two forward passes, one with positive data and one with negative data, meant to imitate the brain’s cycle of wakefulness and sleep. This is the latest work in the continuing attempts to establish deep learning models as a brain analogue.”
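
For readers who like to see ideas in code, here is a minimal sketch of the two-forward-pass idea for a single layer. It is an illustration under stated assumptions, not a reproduction of the keynote: the "goodness" measure (sum of squared activations), the threshold `theta`, the learning rate, and the toy `x_pos` and `x_neg` patterns are all choices made for this example, and names such as `ff_update` are hypothetical.

```python
import math
import random


def relu_layer(weights, x):
    """One forward pass through a single layer: ReLU(W x)."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in weights]


def goodness(activations):
    """Assumed goodness measure: the sum of squared activations."""
    return sum(a * a for a in activations)


def ff_update(weights, x, is_positive, theta=1.0, lr=0.05):
    """Layer-local update computed from one forward pass (no backward pass)."""
    h = relu_layer(weights, x)
    # Probability that the layer judges x to be positive data.
    p = 1.0 / (1.0 + math.exp(-(goodness(h) - theta)))
    # Gradient of the log-likelihood with respect to the goodness.
    signal = (1.0 - p) if is_positive else -p
    for j, row in enumerate(weights):
        if h[j] > 0.0:  # inactive ReLU units receive no update
            for i, xi in enumerate(x):
                row[i] += lr * signal * 2.0 * h[j] * xi


rng = random.Random(0)
# Small positive initial weights so every unit starts active (a demo convenience).
weights = [[rng.uniform(0.05, 0.3) for _ in range(4)] for _ in range(4)]
x_pos = [1.0, 0.0, 1.0, 0.0]  # hypothetical "positive" pattern
x_neg = [0.0, 1.0, 0.0, 1.0]  # hypothetical "negative" pattern

for _ in range(200):
    ff_update(weights, x_pos, is_positive=True)   # forward pass on positive data
    ff_update(weights, x_neg, is_positive=False)  # forward pass on negative data

good_pos = goodness(relu_layer(weights, x_pos))
good_neg = goodness(relu_layer(weights, x_neg))
print(good_pos > 1.0 > good_neg)
```

Each update uses only the layer's own activations, so no error signal has to travel backward through the network. In a deeper network, each layer would run the same local rule on the output of the layer below it.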

The second messaging thread Teresa resonated with was “more purposefully promoting cross-collaboration between AI researchers and people in other fields (psychologists, policymakers, domain experts, everyday users, neuroscientists, designers, and more). The 2022 NeurIPS keynotes highlighted a variety of figures at the intersection of AI and other fields: Rediet Abebe on perpetuated societal inequalities, Juho Kim on designing interaction-centric AI, Alondra Nelson and her work in the White House Office of Science and Technology Policy, and David Chalmers on the philosophy of sentience. While ‘collaboration with other fields’ has always been invoked as highly important, there were more discussions on how to formalize this: How do we craft incentives for researchers to actually do this difficult and novel work? Graduate students are often chained to publishing goals, such as getting a certain number of acceptances at high-profile venues. How do we create accolades, publishing forums, and funding support for interdisciplinary work?”

Chief Scientist John Dickerson added that “the use of modern ML (e.g., transformer-based models) for ‘traditional’ application areas in operations research such as logistics, planning, routing, assignment, scheduling, and resource allocation has also been increasingly present at ML conferences over the last year or two, and certainly at this recent NeurIPS. Until recently, these ‘old’ application areas, which also happen to drive much of the world’s economy, were viewed by the machine learning world as boring and solved, left to the business analysts and consultants in the INFORMS professional community. Yet, with a touch of domain expertise, modern ML and optimization can be shown to eke out significant gains in efficiency and profit in these proven business problems, where each percentage point corresponds to hundreds of millions or billions of dollars of economic value. I’m excited to see the continued strengthening of ties between the AI/ML and operations research communities and the problems they tackle (i.e., those with both a prediction and a decisioning element).” (Separately, joint with INFORMS, the ACM, and CCC, we’re co-organizers of a series of workshops in this space, e.g. [1] and [2]. Get in touch if you’d like to participate!)

### Distribution Shifts & Benchmarks

Valentine gave the first public presentation of her work, joint with Arthur Research Fellow Naveen Durvasula, on explainability and data drift at the Workshop on Distribution Shifts (DistShift). On the same day as AFCP (more on AFCP below), DistShift attracted a larger crowd: generalization, extrapolation, and robustness to distribution shift are core ML problems, and it’s great to see continuing progress in this fundamental area. It was cool to see Valentine’s work, which ties together clustering, Shapley values, and Skope rules to find emergent clusters of “drifty” points over time, as part of a larger cohort of explainability and data drift research. (We’ll be submitting a full version of this work to one of the January ’23 conference deadlines, so stay tuned!)

Also in the space of model performance under distribution shift, Arthur-MLE-turned-Berkeley-PhD-student Jessica Dai and Arthur Research Fellow Michelle Bao presented at the Women in Machine Learning (WiML) workshop on their ongoing work with our team on understanding the impact of covariate and concept drift on models’ group fairness when ground truth labels are not available at test time.

Valentine was particularly impressed by Isabelle Guyon’s keynote, The Data-Centric Era: How ML is Becoming an Experimental Science. She writes, “Her talk reminded us that, as a scientific endeavor, ML research should abide by the same rigorous scientific research standards as those that govern research in other disciplines such as the natural sciences. Guyon debunked some bad scientific practices within the ML research community, such as a common one which consists in selecting validation datasets based on their anticipated or observed ability to display the behaviors that will confirm a hypothesis (a form of selection and confirmation bias). To combat such practices, Guyon emphasized the importance of adopting scientifically and statistically sound data curation, experimentation, and validation procedures. For instance, one should ensure that published experiments and findings are reproducible and carefully documented. She also advocated for the adoption of more rigorous vetting processes on existing datasets as well as an increased focus towards developing and documenting more comprehensive benchmarks. In fact, her prescriptions seemed to echo those of the NeurIPS community at large, as the conference recently created a new Datasets & Benchmarks research track which rewards datasets and benchmark papers on an equal footing with other traditional research content.”

### Human-in-the-Loop ML

Arthur MLE Daniel Nissani, while researching the Tensions paper (more detail on this paper below), became enamored with the idea that AI systems should somehow encapsulate knowledge about the context of their deployment. “I was happy to see that I wasn’t alone,” he writes. “At the HiLL workshop, Cynthia Rudin did an excellent job explaining how users of AI, those who don’t necessarily have AI skills but want to benefit from AI systems, have opinions beyond ‘which models have the best accuracy.’ Drawing from her research on Rashomon sets and sparse decision trees, she asked for a paradigm shift in how we generate models. Instead of asking a user to accept one heavily optimized model, she wants us to present users with multiple models that achieve similar accuracy scores. She has developed a UI for such discovery processes when searching through different decision trees.”
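
To make the "multiple models with similar accuracy" idea concrete, here is a toy sketch. It is not Rudin's algorithm, which enumerates sparse decision trees; it brute-forces one-split rules (stumps) and keeps every rule whose accuracy is within a tolerance `epsilon` of the best. The data, the tolerance, and names such as `near_optimal_stumps` are hypothetical and chosen only for illustration.

```python
def stump_accuracy(rows, labels, feature, threshold):
    """Accuracy of the rule: predict 1 when row[feature] > threshold."""
    hits = sum((row[feature] > threshold) == bool(y) for row, y in zip(rows, labels))
    return hits / len(labels)


def near_optimal_stumps(rows, labels, epsilon=0.05):
    """Return every stump whose accuracy is within epsilon of the best one."""
    scored = []
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            acc = stump_accuracy(rows, labels, feature, threshold)
            scored.append((acc, feature, threshold))
    best = max(acc for acc, _, _ in scored)
    return [s for s in scored if s[0] >= best - epsilon]


# Hypothetical toy data: columns 0 and 1 both separate the classes, column 2 is noise.
rows = [
    [1, 10, 5], [2, 20, 1], [3, 30, 4], [4, 40, 2],
    [5, 50, 3], [6, 60, 5], [7, 70, 1], [8, 80, 4],
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

models = near_optimal_stumps(rows, labels)
for acc, feature, threshold in models:
    print(f"feature {feature} > {threshold}: accuracy {acc:.2f}")
```

On this toy data, two rules that rely on different features are equally accurate, so a domain expert could pick whichever feature is easier to collect or more defensible to use, rather than being handed a single "best" model.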

“While listening to Cynthia’s talk, a poster caught my eye on participatory systems, which presented an equally inventive idea,” Daniel continues. “This paper devises a model agnostic scheme that trains various models on different sets of features and protected attributes. This allows users to understand the effect of providing or omitting certain types of information, such as a medical status or gender. Both of these ideas present ways for AI systems to interact with the context of their deployment, allowing for feedback between the system and the user.”

Valentine noticed the same theme. She writes, “At NeurIPS, user-centric ML pipelines were also in strong focus. For instance, a couple of presentations I attended intelligently incorporated user feedback as key steps in the design of comprehensive ML solutions. One researcher presented a new explainability pipeline for self-driving cars but had first run user field studies in order to understand what makes a good explanation for his specific use case. He also relied on user feedback to comprehensively validate his first design iterations and presented ways to incorporate the feedback into his future designs. Cynthia Rudin, a prominent scholar in the field of explainability, presented a clean mathematical proof showing that, contrary to the popular conception, designing complicated and high-capacity models is not necessary to obtain peak performance; it is often possible to compute simple, inherently explainable yet suitably performant models for a given task. Her approach echoed Occam’s razor (‘the simplest solution is almost always the best’) and followed the keynote talk’s footsteps in anchoring ML back to core scientific principles. Rudin also created a clean and innovative UI which allows domain experts to explore and understand a set of generated simple models before selecting the one that is most suitable to their use case.”

### Fairness & Related Topics

“I really enjoyed the Algorithmic Fairness through the Lens of Causality and Privacy (AFCP) workshop, a semi-annual gathering that focuses on the nuances of connecting responsible AI to practice,” says Arthur’s Chief Scientist John Dickerson. “This is one of the few but growing communities in ‘core’ machine learning that gives more than lip service to human-centered AI, contextual machine learning, and sociotechnical systems (STS).”

Arthur MLE Teresa Datta presented her co-lead-author paper, Tensions Between the Proxies of Human Values in AI, as an Oral at AFCP. The paper was also accepted to HCAI at NeurIPS as well as SaTML. Check out Arthur MLE and co-lead author Daniel Nissani’s blog post on that work here, and Teresa’s talk below.

Daniel found the talks on causality to be particularly interesting. “Causality has been in the fairness literature for quite a while now, but one of the biggest bottlenecks is making sure you have a causal model (distilled as a causal graph) that can be used for causal analysis,” he writes. “I was pleasantly surprised to see work directly in this space, where some authors ran experiments to see if causal discovery methods could actually create causal graphs that are effective enough to measure fairness notions. Although their results were promising, the authors plan to construct a causal discovery method specific to fairness notions. If successful, this could make otherwise theoretical works, such as another paper at AFCP describing post-treatment bias in causal fairness analyses, more impactful for real-world systems.”

“The final highlight for me,” he adds, “was a roundtable discussion at AFCP, where many researchers, whether from the privacy or fairness space, acknowledged the need for more contextual understanding in our research. The biggest takeaways were calls to start researching entire ML systems, elicit user feedback, and integrate context into our research. It made me feel proud that our team at Arthur presented the Tensions paper at AFCP, since it seems our ideas were not only heard but also preached to an active choir that wants to start integrating context as well.”

“The AFCP workshop gave critical takes on interpretability and explainability in ML, and also touched on the intersection and interactions between forms of privacy and fairness, as well as causality and fairness,” says John. “As in the AI/ML meets OR discussion above, we’re seeing the intersection between traditionally separate areas of focus—statistics, economics, human-computer interaction, machine learning, and others. As productionized ML continues to expand across the economy and our society, these intersections are inevitable and welcome, and I’m happy to see thoughtful workshops like AFCP continue to grow in lockstep.”

Valentine reflected profoundly on this topic as well, noting that “it can often be tempting as ML researchers to think of ourselves as scientists working on objective and universal algorithms that can then be adapted and tailored to fit specific use cases. Such conceptions in some sense make us the principal bearers of truth to the detriment of domain experts, and can lead us astray. During an explainability panel I attended, Zach Lipton pointed out that we might, for instance, benefit from letting domain experts be the ones to first scope out the desired design for an ML system and its associated explainability mechanisms before turning to ML engineers to implement or iterate on it. Such delegation of responsibility could ensure designs and solutions are inherently more usable and useful. Furthermore, one must remember that there is no such thing as scientific objectivity and that science and ML at large are value-laden, which is one of the key calls to action in Teresa’s Tensions paper and awesome presentation. Forgetting or ignoring this reality can lead us to resort to deeply insufficient solutions based on mathematical formulations in our attempt to address problems that are socio-technical in nature, such as fairness.”

### Conclusion

We arrived back in New York last week feeling full (of knowledge, but also of Cajun food and beignets). And, as is the case with many academic conferences, we were left with equal amounts of questions and answers. Here are just some of the questions we’re looking forward to exploring further in 2023:

• How can model interpretation and explanation be aligned with context, audience, and sensible baselines?
• How is the geometry of information informing model design and analysis?
• How do we create accolades, publishing forums, and funding support for interdisciplinary work?
• How can we engage communities via interactive AI/ML systems, so that we enable consent, choice, and trust?
• How do we start doing research about the system that models are deployed in, rather than just the model itself?
• Are we starting to approach the idea that approximate causal models are enough for real world causal analyses?