NeurIPS 2024: Beyond Scale

Jack Hogan
2024-12-21

Last week, we joined thousands of researchers, engineers, entrepreneurs and investors in Vancouver for NeurIPS 2024—the biggest iteration of the conference yet. The headlines may have been dominated by Ilya Sutskever's somewhat sensationalist talk, as well as not one but two unfortunate scandals, but beneath the drama lay something far more interesting: a fundamental shift in how the field thinks about improving AI systems. The era of chasing ever-larger models seems to be drawing to a close, with researchers instead seeking performance gains through more sophisticated approaches.

The mood at the conference was one of genuine optimism. What could have been an existential moment for academic AI research—with compute requirements spiralling beyond reach—has instead become a catalyst for innovation. Presentation after presentation demonstrated how carefully designed systems, despite using orders of magnitude less compute, could match or exceed the performance of frontier models in specific domains. The successes came from every angle: novel post-training techniques, sophisticated inference strategies, adaptive optimisation methods. The era of innovation in AI, it seems, is just beginning.

In this post, we'll share our key takeaways from the conference, focusing on three themes that we believe will be crucial in the coming year: the emergence of advanced post-training techniques, the growing importance of inference-time optimisation, and the renewed emphasis on System 2 reasoning capabilities—a topic particularly relevant to our recent work on the ARC challenge.

Post-training: Beyond Basic Instruction Tuning

With pre-training delivering diminishing marginal returns, much of the conference's focus centred on refining and optimising the post-training pipeline. The contrast with just a year ago was striking, with the basic instruction-tuning techniques that made ChatGPT possible (i.e., InstructGPT) now seeming almost quaint compared to the sophisticated multi-stage pipelines being presented.

A clear recipe has emerged and is thoroughly described in, for example, the Llama 3 and Tulu 3 papers. The pipeline typically starts with initial supervised fine-tuning on a mix of human and synthetic instruction data, followed by multiple iterative rounds of preference fine-tuning on self-generated data that is scored using reward models or LLM-based judges.
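
To make the shape of this recipe concrete, here is a minimal sketch of the loop. All helpers (sft, sample_responses, score, preference_finetune) are hypothetical stand-ins rather than any particular library's API:

```python
# Hedged sketch of the emerging post-training recipe (Llama 3 / Tulu 3 style).
# Every helper function here is a hypothetical stand-in, not a real API.

def post_train(base_model, instruction_data, prompts, reward_model, rounds=3):
    # Stage 1: supervised fine-tuning on human + synthetic instruction data
    model = sft(base_model, instruction_data)

    # Stage 2: iterative rounds of preference fine-tuning on self-generated data
    for _ in range(rounds):
        pairs = []
        for prompt in prompts:
            candidates = sample_responses(model, prompt, n=8)
            ranked = sorted(candidates, key=lambda r: score(reward_model, prompt, r))
            pairs.append((prompt, ranked[-1], ranked[0]))  # (prompt, chosen, rejected)
        model = preference_finetune(model, pairs)  # e.g. a DPO-style update
    return model
```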

The basic outline of the post-training recipe may be settled, but we saw a remarkable diversity in how different teams are implementing and specialising each component. Several papers particularly caught our attention:

  • Direct Q-Function Optimization: Reformulates preference learning as a Markov Decision Process rather than a bandit problem (the framing behind DPO-style losses, one of which is sketched after this list), enabling better handling of long-form reasoning tasks
  • Process Reward Model: Introduces dense, stepwise rewards for code generation, instead of relying only on final pass/fail signals
  • Self-play Preference Optimization: A game-theoretic approach to preference learning that better captures the intransitivity often present in human preferences
  • Asynchronous RLHF: Demonstrates how to achieve the same performance ~40% faster by separating generation and learning in the RLHF pipeline
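
To ground these in something concrete: most of them build on, or generalise, the bandit-style direct preference optimisation (DPO) loss, which in PyTorch looks roughly like this (a sketch; the log-probability tensors are assumed to be summed over response tokens):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss over a batch of (chosen, rejected) response pairs.

    Each tensor holds per-response log-probabilities (summed over tokens)
    under the trainable policy or the frozen reference model.
    """
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Push the policy to prefer the chosen response over the rejected one,
    # measured relative to the reference model.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```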

Each of these innovations represents a different way to extract more performance from existing architectures—better modelling of the reasoning process, denser feedback signals, more nuanced preference structures. For teams like ours working on specialised applications, this is particularly encouraging—with the right post-training pipeline, even relatively small models can achieve remarkable performance in specific domains.

Test-time Compute: From Scaling Training to Scaling Inference

Just as post-training techniques have evolved beyond basic instruction tuning, we're seeing a similar evolution in how the community thinks about inference. What started with simple prompt engineering and few-shot learning has developed into a sophisticated toolkit of techniques for extracting better performance at inference time: ranking and merging multiple generations, chain-of-thought prompting, self-refinement loops, and even test-time fine-tuning of the model weights. Several researchers demonstrated how relatively small models with clever inference strategies could match or exceed the performance of much larger models using basic greedy decoding.
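
The simplest such strategy, best-of-N sampling, already captures the idea of trading inference compute for quality: draw several candidates and keep the one a scorer likes best. A minimal sketch, where model.generate and scorer are hypothetical stand-ins for a sampling-capable LM and a reward model or verifier:

```python
def best_of_n(model, scorer, prompt, n=16, temperature=0.8):
    """Best-of-N sampling: spend n forward passes, keep the best candidate.

    `model.generate` and `scorer` are hypothetical stand-ins for a
    sampling-capable LM and a reward model or verifier.
    """
    candidates = [model.generate(prompt, temperature=temperature) for _ in range(n)]
    return max(candidates, key=lambda c: scorer(prompt, c))
```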

This shift toward inference-time optimisation raises interesting questions about how we evaluate AI systems—it no longer makes sense to evaluate performance based on a single forward-pass completion. Several talks highlighted the need for new benchmarking approaches that consider the full cost-benefit tradeoff, including inference-time compute costs or FLOPs alongside raw performance metrics.

The range of inference-time techniques being explored is remarkable:

  • Input augmentation through more sophisticated prompting and in-context learning
  • Output augmentation via chain-of-thought reasoning and multiple sampling (e.g. self-consistency, sketched after this list)
  • Hybrid approaches that combine multiple models or incorporate external tools
  • Test-time model adaptation, including fine-tuning based on specific inputs
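
As an example of the second category, self-consistency samples several chain-of-thought completions and majority-votes on the final answers. A minimal sketch, with model.generate and extract_answer as hypothetical stand-ins:

```python
from collections import Counter

def self_consistency(model, prompt, n=10):
    """Self-consistency: sample several chains of thought, then majority-vote.

    `model.generate` and `extract_answer` are hypothetical stand-ins for a
    sampling-capable LM and a parser that pulls the final answer from a
    completion.
    """
    answers = []
    for _ in range(n):
        completion = model.generate(prompt + "\nLet's think step by step.",
                                    temperature=0.7)
        answers.append(extract_answer(completion))
    return Counter(answers).most_common(1)[0][0]  # most frequent answer wins
```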

One paper even demonstrated a neural-network-like architecture built from stacked layers of inference strategies, transforming the problem of inference-technique selection into a hyperparameter optimisation problem. By efficiently and automatically searching the space of model choices, inference-time techniques and their compositions, they were able to outperform GPT-4o and Claude 3.5 Sonnet across a range of instruction-following, reasoning, and coding benchmarks.
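
A rough sketch of the idea, treating the composition as a searchable configuration space (build_pipeline and evaluate are hypothetical stand-ins, and the search space below is illustrative only):

```python
import itertools

# Hedged sketch: inference-strategy composition as hyperparameter search.
# Each configuration describes a stack of inference "layers".
SEARCH_SPACE = {
    "n_samples":     [1, 4, 16],              # parallel generations per layer
    "ranker":        [None, "reward_model"],  # optional candidate ranking
    "fusion_rounds": [0, 1, 2],               # rounds of merging candidates
    "critique":      [False, True],           # optional self-refinement layer
}

def search_inference_architecture(dev_set):
    """Exhaustively search the configuration space against a dev set."""
    best_cfg, best_score = None, float("-inf")
    for values in itertools.product(*SEARCH_SPACE.values()):
        cfg = dict(zip(SEARCH_SPACE, values))
        score = evaluate(build_pipeline(cfg), dev_set)  # accuracy on held-out tasks
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg
```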

Other papers that caught our attention were:

  • Divide-and-Conquer Meets Consensus: recursively decomposes complex coding problems into simpler sub-functions, arranged in a tree hierarchy (see the sketch after this list)
  • Decompose, Analyze and Rethink: a framework for iterative reasoning that mimics human cognitive processes, building a tree of decomposed sub-problems and using feedback from solved sub-problems to update higher-level reasoning
  • Efficiently Learning at Test Time: an efficient test-time fine-tuning approach; instead of conventional nearest-neighbour retrieval methods, training examples are carefully selected to maximise information gain and reduce uncertainty
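
A minimal sketch of the divide-and-conquer pattern behind the first of these papers (decompose, solve_directly, merge and is_simple are hypothetical LLM-backed helpers):

```python
def solve(problem, depth=0, max_depth=3):
    """Recursive divide-and-conquer over a tree of sub-problems.

    `is_simple`, `solve_directly`, `decompose` and `merge` are hypothetical
    LLM-backed helpers, not a real API.
    """
    if depth == max_depth or is_simple(problem):
        return solve_directly(problem)            # leaf: answer with one LM call
    subproblems = decompose(problem)              # LM proposes sub-functions
    solutions = [solve(p, depth + 1, max_depth) for p in subproblems]
    return merge(problem, solutions)              # LM composes the sub-solutions
```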

System 2 Reasoning: Beyond Pattern Matching

Sunday featured an entire workshop dedicated to System 2 reasoning, reflecting growing interest in moving beyond the pattern-matching capabilities of current AI systems. Throughout the day, researchers explored the intersection of LLM reasoning and cognitive science, examining how machine "thinking" fundamentally differs from human cognition.

In humans, thinking precedes language, and the learning of language subsequently transforms how we think. In LLMs, the reverse is true: "thinking" emerges from language and depends entirely on carefully curated language data. This fundamental distinction helps explain why even the most powerful models, like GPT-4o, struggle with tasks that require spatial reasoning or intuitive physics understanding—capabilities that humans develop largely independently of language.

Several talks explored how we might bridge this gap. A particularly interesting thread focused on the relationship between neural networks and symbolic reasoning. While traditional "hybrid" approaches that combine neural and symbolic systems have shown promise, they often inherit the scalability challenges of symbolic AI. More recent "unified" approaches that embed symbolic structures directly within neural architectures are promising, potentially offering the best of both worlds.

For example, one paper proposed extending the Differentiable Tree Machine with sparse vector representations of symbolic structures, while another demonstrated how Monte Carlo Tree Search could be used to break down complex reasoning tasks into granular steps. Meta's Jason Weston also presented Coconut, a new paradigm that allows models to reason in continuous latent space rather than being constrained to language tokens.
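
The mechanics behind the latent-reasoning idea are easy to illustrate: rather than decoding a token at each step, the model's final hidden state is fed straight back in as the next input embedding. The sketch below shows this loop with an off-the-shelf HuggingFace model; note that Coconut itself trains the model to make use of these latent thoughts, so this untrained loop illustrates only the plumbing:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative stand-in model; GPT-2's embedding and hidden sizes match,
# so a hidden state can be reused directly as an input embedding.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Q: If I have 3 apples and eat one, how many remain? A:"
embeds = model.get_input_embeddings()(tok(prompt, return_tensors="pt").input_ids)

with torch.no_grad():
    for _ in range(4):  # four "continuous thoughts"
        out = model(inputs_embeds=embeds, output_hidden_states=True)
        thought = out.hidden_states[-1][:, -1:, :]    # final-layer state, last position
        embeds = torch.cat([embeds, thought], dim=1)  # feed it back as an embedding
# After the latent steps, decoding can resume in token space as usual.
```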

The discussions repeatedly returned to a central theme: while language is a powerful tool for communicating reasoning, it might not be the optimal medium for performing reasoning. This aligns closely with our experience at Agemo—we've found that code often serves as a better medium than natural language for modelling and reasoning about the world, offering more precise and verifiable ways to express computational thinking.

A real highlight of NeurIPS was seeing Agemo explicitly mentioned in François Chollet's keynote talk as one of the startups actively contributing to this space!

[Image: slide from François Chollet's keynote mentioning Agemo]

Looking Ahead

The themes that emerged at NeurIPS 2024—sophisticated post-training pipelines, intelligent use of inference-time compute, and a renewed focus on reasoning capabilities—suggest an exciting direction for AI research. Rather than pursuing raw scale, the field is developing more nuanced approaches to extracting intelligence from our models. The resulting systems may be more complex, combining multiple techniques and requiring careful orchestration, but they're also more capable, reliable and efficient.

For us at Agemo, these developments are particularly encouraging. Our thesis has always been that truly intelligent software creation requires more than pattern matching—it demands the ability to reason systematically about requirements, architecture, and correctness. The shift we're seeing in the broader AI community towards approaches that emphasise structured reasoning validates this direction and opens up new possibilities for our work bridging the gap between human intent and working software.

We're excited to be part of this evolving conversation about the future of AI and software development. As always, we'd love to hear your comments and feedback. Follow us on X and LinkedIn.

Jack