The last decade has witnessed an experimental revolution in data science and machine learning, epitomised by deep learning methods. Indeed, many high-dimensional learning tasks previously thought to be beyond reach -- such as computer vision, playing Go, or protein folding -- are in fact feasible with appropriate computational scale. Remarkably, the essence of deep learning is built from two simple algorithmic principles: first, the notion of representation or feature learning, whereby adapte...| arXiv.org
Quantum systems are notoriously difficult to simulate with classical means. Recently, the idea of using another quantum system - which is experimentally more controllable - as a simulator for the original problem has gained significant momentum. Amongst the experimental platforms studied as quantum simulators, superconducting qubits are one of the most promising, due to their relatively straightforward scalability, ease of design, and integration with standard electronics. Here I review the recent state...| arXiv.org
At this early stage of its passage through our Solar System, 3I/ATLAS, the recently discovered interstellar interloper, has displayed various anomalous characteristics, determined from photometric and astrometric observations. As largely a pedagogical exercise, in this paper we present additional analysis into the astrodynamics of 3I/ATLAS, and hypothesize that this object could be technological, and possibly hostile as would be expected from the 'Dark Forest' resolution to the 'Fermi Paradox...| arXiv.org
We study subliminal learning, a surprising phenomenon where language models transmit behavioral traits via semantically unrelated data. In our main experiments, a "teacher" model with some trait T (such as liking owls or being misaligned) generates a dataset consisting solely of number sequences. Remarkably, a "student" model trained on this dataset learns T. This occurs even when the data is filtered to remove references to T. We observe the same effect when training on code or reasoning tra...| arXiv.org
Efficient GPU kernels are crucial for building performant machine learning architectures, but writing them is a time-consuming challenge that requires significant expertise; therefore, we explore using language models (LMs) to automate kernel generation. We introduce KernelBench, an open-source framework for evaluating LMs' ability to write fast and correct kernels on a suite of 250 carefully selected PyTorch ML workloads. KernelBench represents a real-world engineering environment and making...| arXiv.org
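To make the evaluation criteria concrete, here is a minimal sketch of the kind of harness such a benchmark implies, not KernelBench's actual API: the names `reference`, `candidate`, and `evaluate` are illustrative stand-ins, with correctness gated by `torch.allclose` and speedup measured against the PyTorch baseline.

```python
# Hedged sketch of a kernel-evaluation harness (not KernelBench's implementation):
# check a candidate against a PyTorch reference for correctness, then time both.
import time
import torch

def reference(x: torch.Tensor) -> torch.Tensor:
    # Stand-in for one of the PyTorch workloads being optimized.
    return torch.nn.functional.gelu(x)

def candidate(x: torch.Tensor) -> torch.Tensor:
    # Stand-in for an LM-generated kernel (here just an equivalent expression).
    return 0.5 * x * (1.0 + torch.erf(x / 2.0 ** 0.5))

def evaluate(fn, ref, x, warmup=3, iters=20):
    ok = torch.allclose(fn(x), ref(x), rtol=1e-4, atol=1e-4)  # correctness gate
    for _ in range(warmup):
        fn(x)
    t0 = time.perf_counter()
    for _ in range(iters):
        fn(x)
    t_fn = (time.perf_counter() - t0) / iters
    t0 = time.perf_counter()
    for _ in range(iters):
        ref(x)
    t_ref = (time.perf_counter() - t0) / iters
    return ok, t_ref / t_fn  # speedup > 1 means the candidate is faster

x = torch.randn(4096, 4096)
print(evaluate(candidate, reference, x))
```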
The "ringdown" radiation emitted by oscillating black holes has great scientific potential. By carefully predicting the frequencies and amplitudes of black hole quasinormal modes and comparing them with gravitational-wave data from compact binary mergers we can advance our understanding of the two-body problem in general relativity, verify the predictions of the theory in the regime of strong and dynamical gravitational fields, and search for physics beyond the Standard Model or new gravitati...| arXiv.org
Quasinormal modes of rapidly rotating black holes were recently computed in a generic effective-field-theory extension of general relativity with higher-derivative corrections. We exploit this breakthrough to perform the most complete search for signatures of new physics in black hole spectra to date. We construct a template that describes the post-merger gravitational-wave emission in comparable-mass binary black hole mergers at current detector sensitivity, notably including isospectrality ...| arXiv.org
One obstacle to applying reinforcement learning algorithms to real-world problems is the lack of suitable reward functions. Designing such reward functions is difficult in part because the user only has an implicit understanding of the task objective. This gives rise to the agent alignment problem: how do we create agents that behave in accordance with the user's intentions? We outline a high-level research direction to solve the agent alignment problem centered around reward modeling: learni...| arXiv.org
We explore the use of expert iteration in the context of language modeling applied to formal mathematics. We show that at the same compute budget, expert iteration, by which we mean proof search interleaved with learning, dramatically outperforms proof search only. We also observe that when applied to a collection of formal statements of sufficiently varied difficulty, expert iteration is capable of finding and solving a curriculum of increasingly difficult problems, without the need for associat...| arXiv.org
Dark matter in the form of macroscopic composites is largely unconstrained at masses of $\sim 10^{11}- 10^{17}$ g. In this mass range, dark matter may collide with planetary bodies, depositing an immense amount of energy and leaving dramatic surface features that remain detectable on geological timescales. In this paper, we show that Ganymede, the largest Jovian moon, provides a prime target to search for dark matter impacts due to its differentiated composition and Gyr-old surface. We study ...| arXiv.org
We argue that representations in AI models, particularly deep networks, are converging. First, we survey many examples of convergence in the literature: over time and across multiple domains, the ways by which different neural networks represent data are becoming more aligned. Next, we demonstrate convergence across data modalities: as vision models and language models get larger, they measure distances between datapoints in increasingly similar ways. We hypothesize that this convergence is dr...| arXiv.org
We summarize the properties and initial data release of the JADES Origins Field (JOF), which will soon be the deepest imaging field yet observed with the James Webb Space Telescope (JWST). This field falls within the GOODS-S region about 8' south-west of the Hubble Ultra Deep Field (HUDF), where it was formed initially in Cycle 1 as a parallel field of HUDF spectroscopic observations within the JWST Advanced Deep Extragalactic Survey (JADES). This imaging will be greatly extended in Cycle 2 p...| arXiv.org
JWST has revealed a stunning population of bright galaxies at surprisingly early epochs, $z>10$, where few such sources were expected. Here we present the most distant example of this class yet -- MoM-z14, a luminous ($M_{\rm{UV}}=-20.2$) source in the COSMOS legacy field at $z_{\rm{spec}}=14.44^{+0.02}_{-0.02}$ that expands the observational frontier to a mere 280 million years after the Big Bang. The redshift is confirmed with NIRSpec/prism spectroscopy through a sharp Lyman-$\alpha$ break and ...| arXiv.org
This work studies post-training parameter quantization in large language models (LLMs). We introduce quantization with incoherence processing (QuIP), a new method based on the insight that quantization benefits from $\textit{incoherent}$ weight and Hessian matrices, i.e., from the weights being even in magnitude and the directions in which it is important to round them accurately being unaligned with the coordinate axes. QuIP consists of two steps: (1) an adaptive rounding procedure minimizin...| arXiv.org
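As a rough illustration of what incoherence buys (a sketch of the general idea only, not QuIP's adaptive rounding or its actual pre- and post-processing; the `quantize` helper is hypothetical): conjugating a weight matrix by random orthogonal matrices spreads a dominant entry across the whole matrix, so a single quantization scale wastes far less precision, and the rotation is undone afterwards.

```python
# Sketch of why incoherent weights quantize better (illustration only).
import numpy as np

rng = np.random.default_rng(0)

def random_orthogonal(n):
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return q

def quantize(a, bits=4):
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(a).max() / qmax           # a single scale for the whole matrix
    return np.round(a / scale) * scale

W = rng.standard_normal((64, 64))
W[0, 0] = 500.0                              # one dominant ("coherent") entry ruins the scale

U, V = random_orthogonal(64), random_orthogonal(64)
W_rot = U @ W @ V.T                          # conjugation spreads the outlier across entries
W_hat = U.T @ quantize(W_rot) @ V            # quantize in the rotated basis, rotate back

print(np.abs(W - quantize(W)).mean(),        # direct 4-bit rounding: large error
      np.abs(W - W_hat).mean())              # rounding the incoherent version: much smaller
```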
Large language models have been widely adopted but require significant GPU memory for inference. We develop a procedure for Int8 matrix multiplication for feed-forward and attention projection layers in transformers, which cuts the memory needed for inference by half while retaining full precision performance. With our method, a 175B parameter 16/32-bit checkpoint can be loaded, converted to Int8, and used immediately without performance degradation. This is made possible by understanding and ...| arXiv.org
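A hedged sketch of the general recipe (a numpy stand-in, not the paper's kernels; the outlier threshold and the split into an int8 part plus a full-precision part are illustrative): quantize activations per row and weights per column with absmax scales, and route the few outlier feature dimensions through an unquantized matmul.

```python
# Hedged sketch of mixed int8 + full-precision matrix multiplication.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 512)).astype(np.float32)
X[:, 7] *= 60.0                          # one "outlier" feature dimension
W = rng.standard_normal((512, 256)).astype(np.float32)

def absmax_quantize(a, axis):
    scale = np.abs(a).max(axis=axis, keepdims=True) / 127.0
    return np.clip(np.round(a / scale), -127, 127).astype(np.int8), scale

outliers = np.abs(X).max(axis=0) > 6.0               # columns kept in full precision
Xq, sx = absmax_quantize(X[:, ~outliers], axis=1)    # per-row activation scales
Wq, sw = absmax_quantize(W[~outliers, :], axis=0)    # per-column weight scales

Y_int8 = (Xq.astype(np.int32) @ Wq.astype(np.int32)) * sx * sw   # dequantized int8 part
Y_fp = X[:, outliers] @ W[outliers, :]                           # outlier part, unquantized
Y = Y_int8 + Y_fp
print(np.abs(Y - X @ W).max())           # small error relative to the exact product
```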
Transformers are slow and memory-hungry on long sequences, since the time and memory complexity of self-attention are quadratic in sequence length. Approximate attention methods have attempted to address this problem by trading off model quality to reduce the compute complexity, but often do not achieve wall-clock speedup. We argue that a missing principle is making attention algorithms IO-aware -- accounting for reads and writes between levels of GPU memory. We propose FlashAttention, an IO-...| arXiv.org
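The tiling-plus-online-softmax idea can be sketched in a few lines (a numpy illustration of exact attention computed block by block, not the fused GPU kernel): keep running row maxima and softmax denominators so the full attention matrix never has to be materialized.

```python
# Hedged sketch of IO-aware tiled attention with an online softmax.
import numpy as np

def tiled_attention(Q, K, V, block=64):
    N, d = Q.shape
    out = np.zeros_like(Q)
    row_max = np.full(N, -np.inf)          # running max of scores per query
    row_sum = np.zeros(N)                  # running softmax denominator per query
    for start in range(0, N, block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        S = Q @ Kb.T / np.sqrt(d)          # scores for this block only
        new_max = np.maximum(row_max, S.max(axis=1))
        correction = np.exp(row_max - new_max)            # rescale earlier partial sums
        P = np.exp(S - new_max[:, None])
        row_sum = row_sum * correction + P.sum(axis=1)
        out = out * correction[:, None] + P @ Vb
        row_max = new_max
    return out / row_sum[:, None]

def reference_attention(Q, K, V):
    S = Q @ K.T / np.sqrt(Q.shape[1])
    P = np.exp(S - S.max(axis=1, keepdims=True))
    return (P / P.sum(axis=1, keepdims=True)) @ V

rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, 256, 32))
print(np.allclose(tiled_attention(Q, K, V), reference_attention(Q, K, V)))
```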
Recent work has found that sparse autoencoders (SAEs) are an effective technique for unsupervised discovery of interpretable features in language models' (LMs) activations, by finding sparse, linear reconstructions of LM activations. We introduce the Gated Sparse Autoencoder (Gated SAE), which achieves a Pareto improvement over training with prevailing methods. In SAEs, the L1 penalty used to encourage sparsity introduces many undesirable biases, such as shrinkage -- systematic underestimatio...| arXiv.org
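A minimal sketch of the gating idea (simplified relative to the paper's architecture and losses; the module and parameter names below are hypothetical): decouple which features fire from how strongly they fire, and apply the sparsity penalty only to the gating path so reconstructed magnitudes are not shrunk.

```python
# Hedged sketch of a gated sparse-autoencoder encoder.
import torch
import torch.nn as nn

class GatedEncoderSketch(nn.Module):
    def __init__(self, d_model, d_dict):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, d_dict) * 0.01)
        self.b_gate = nn.Parameter(torch.zeros(d_dict))   # gating-path bias
        self.b_mag = nn.Parameter(torch.zeros(d_dict))    # magnitude-path bias
        self.W_dec = nn.Parameter(torch.randn(d_dict, d_model) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def forward(self, x):
        pre = (x - self.b_dec) @ self.W_enc
        gate = (pre + self.b_gate > 0).float()            # which features are active
        mag = torch.relu(pre + self.b_mag)                # how strongly they fire
        f = gate * mag                                    # sparse feature activations
        recon = f @ self.W_dec + self.b_dec
        l1 = torch.relu(pre + self.b_gate).sum(-1).mean() # penalize the gate path only
        return recon, f, l1

x = torch.randn(16, 128)
recon, f, l1 = GatedEncoderSketch(128, 1024)(x)
print(recon.shape, f.shape, float(l1))
```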
We introduce methods for discovering and applying sparse feature circuits. These are causally implicated subnetworks of human-interpretable features for explaining language model behaviors. Circuits identified in prior work consist of polysemantic and difficult-to-interpret units like attention heads or neurons, rendering them unsuitable for many downstream applications. In contrast, sparse feature circuits enable detailed understanding of unanticipated mechanisms. Because they are based on f...| arXiv.org
Foundation models are applied in a broad spectrum of settings with different inference constraints, from massive multi-accelerator clusters to resource-constrained standalone mobile devices. However, the substantial costs associated with training these models often limit the number of unique model sizes that can be offered. Consequently, practitioners are compelled to select a model that may not be optimally aligned with their specific latency and cost requirements. We present MatFormer, a no...| arXiv.org
How can we design safe reinforcement learning agents that avoid unnecessary disruptions to their environment? We show that current approaches to penalizing side effects can introduce bad incentives, e.g. to prevent any irreversible changes in the environment, including the actions of other agents. To isolate the source of such undesirable incentives, we break down side effect penalties into two components: a baseline state and a measure of deviation from this baseline state. We argue that so...| arXiv.org
Narrow bit-width data formats are key to reducing the computational and storage costs of modern deep learning applications. This paper evaluates Microscaling (MX) data formats that combine a per-block scaling factor with narrow floating-point and integer types for individual elements. MX formats balance the competing needs of hardware efficiency, model accuracy, and user friction. Empirical results on over two dozen benchmarks demonstrate the practicality of MX data formats as a drop-in replaceme...| arXiv.org
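A minimal sketch of block-scaled quantization in this spirit (an illustration, not the MX specification; the `block_quantize`/`block_dequantize` helpers are hypothetical): each block of elements shares one power-of-two scale while individual elements are stored in a narrow integer type.

```python
# Hedged sketch of per-block scaled quantization.
import numpy as np

def block_quantize(x, block=32, bits=8):
    pad = (-len(x)) % block
    xb = np.pad(x, (0, pad)).reshape(-1, block)
    qmax = 2 ** (bits - 1) - 1
    # One shared power-of-two scale per block, chosen so elements fit in the narrow type.
    scales = 2.0 ** np.ceil(np.log2(np.abs(xb).max(axis=1, keepdims=True) / qmax + 1e-30))
    q = np.clip(np.round(xb / scales), -qmax, qmax)
    return q, scales

def block_dequantize(q, scales, n):
    return (q * scales).reshape(-1)[:n]

x = np.random.default_rng(0).standard_normal(1000).astype(np.float32)
q, s = block_quantize(x)
print(np.abs(x - block_dequantize(q, s, len(x))).max())   # worst-case rounding error
```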
When numerically evaluating a function's gradient, sparsity detection can enable substantial computational speedups through Jacobian coloring and compression. However, sparsity detection techniques for black-box functions are limited, and existing finite-difference-based methods suffer from false negatives due to coincidental zero gradients. These false negatives can silently corrupt gradient calculations, leading to difficult-to-diagnose errors. We introduce NaN-propagation, which exploits t...| arXiv.org
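The idea can be sketched directly (a simplified illustration, assuming the black-box function is a plain numpy computation): poison one input at a time with NaN and record which outputs turn NaN, which flags a dependence even where a finite difference would coincidentally evaluate to zero.

```python
# Hedged sketch of NaN-propagation sparsity detection for a black-box function.
import numpy as np

def f(x):
    # Example black box: y0 depends on x0, x1; y1 depends on x2 only, and that
    # dependence has zero gradient at x2 = 0 (a finite-difference false negative).
    return np.array([x[0] * x[1], x[2] ** 2])

def nan_sparsity(f, x0):
    m = len(f(x0))
    pattern = np.zeros((m, len(x0)), dtype=bool)
    for j in range(len(x0)):
        x = np.array(x0, dtype=float)
        x[j] = np.nan
        pattern[:, j] = np.isnan(f(x))   # NaN propagates through any real dependence
    return pattern

print(nan_sparsity(f, np.zeros(3)))
# [[ True  True False]
#  [False False  True]]
```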
Energy limits that delineate the `habitable zone' for exoplanets depend on a given exoplanet's net planetary albedo (or `Bond albedo'). We here demonstrate that the planetary albedo of an observed exoplanet is limited by the above-cloud atmosphere - the region of the atmosphere that is probed in remote observation. We derive an analytic model to explore how the maximum planetary albedo depends on the above-cloud optical depth and scattering versus absorbing properties, even in the limit of a ...| arXiv.org
The recent phenomenal success of language models has reinvigorated machine learning research, and large sequence models such as transformers are being applied to a variety of domains. One important problem class that has remained relatively elusive, however, is purposeful adaptive behavior. Currently there is a common perception that sequence models "lack the understanding of the cause and effect of their actions," leading them to draw incorrect inferences due to auto-suggestive delusions. In th...| arXiv.org
Long-horizon tasks in robotic manipulation present significant challenges in reinforcement learning (RL) due to the difficulty of designing dense reward functions and effectively exploring the expansive state-action space. However, despite a lack of dense rewards, these tasks often have a multi-stage structure, which can be leveraged to decompose the overall objective into manageable subgoals. In this work, we propose DEMO3, a framework that exploits this structure for efficient learning from...| arXiv.org
When language models (LMs) are trained via reinforcement learning (RL) to generate natural language "reasoning chains", their performance improves on a variety of difficult question answering tasks. Today, almost all successful applications of RL for reasoning use binary reward functions that evaluate the correctness of LM outputs. Because such reward functions do not penalize guessing or low-confidence outputs, they often have the unintended side-effect of degrading calibration and increasin...| arXiv.org
We present a mathematical analysis of the statistical parallax method. The method yields physical insight into the maximum-likelihood determinations of the luminosity and velocity distribution and enables us to conduct a vigorous Monte Carlo investigation into various systematic effects. We apply our analytic formalism to the RR Lyrae sample of Layden et al. The velocity distribution of RR Lyrae stars is highly non-Gaussian, with kurtoses $K_\pi = 2.04$, $K_\theta = 3.22$, and $K_z = 4.28$ in the three pri...| arXiv.org
State space models have been shown to be effective at modeling long-range dependencies, especially on sequence classification tasks. In this work we focus on autoregressive sequence modeling over English books, Github source code and ArXiv mathematics articles. Based on recent developments around the effectiveness of gated activation functions, we propose a new layer named Gated State Space (GSS) and show that it trains significantly faster than the diagonal version of S4 (i.e. DSS) on TPUs, is fai...| arXiv.org
Transformers do not scale very well to long sequence lengths largely because of quadratic self-attention complexity. In recent months, a wide spectrum of efficient, fast Transformers have been proposed to tackle this problem, more often than not claiming superior or comparable model quality to vanilla Transformer models. To date, there is no well-established consensus on how to evaluate this class of models. Moreover, inconsistent benchmarking on a wide spectrum of tasks and datasets...| arXiv.org
Behavior Cloning (BC) on curated (or filtered) data is the predominant paradigm for supervised fine-tuning (SFT) of large language models, as well as for imitation learning of control policies. Here, we draw on a connection between this successful strategy and the theory and practice of finding optimal policies via Reinforcement Learning (RL). Building on existing literature, we clarify that SFT can be understood as maximizing a lower bound on the RL objective in a sparse reward setting. Givi...| arXiv.org
Autoregressive transformers are spectacular models for short sequences but scale poorly to long sequences such as high-resolution images, podcasts, code, or books. We propose Megabyte, a multi-scale decoder architecture that enables end-to-end differentiable modeling of sequences of over one million bytes. Megabyte segments sequences into patches and uses a local submodel within patches and a global model between patches. This enables sub-quadratic self-attention, much larger feedforward lay...| arXiv.org
With the success of language pretraining, it is highly desirable to develop more efficient architectures of good scalability that can exploit the abundant unlabeled data at a lower cost. To improve the efficiency, we examine the much-overlooked redundancy in maintaining a full-length token-level representation, especially for tasks that only require a single-vector representation of the sequence. With this intuition, we propose Funnel-Transformer which gradually compresses the sequence of hidden ...| arXiv.org
Genomic (DNA) sequences encode an enormous amount of information for gene regulation and protein synthesis. Similar to natural language models, researchers have proposed foundation models in genomics to learn generalizable features from unlabeled genome data that can then be fine-tuned for downstream tasks such as identifying regulatory elements. Due to the quadratic scaling of attention, previous Transformer-based genomic models have used 512 to 4k tokens as context (<0.001% of the human gen...| arXiv.org
In this paper we demonstrate methods for reliable and efficient training of discrete representations using Vector-Quantized Variational Auto-Encoder models (VQ-VAEs). Discrete latent variable models have been shown to learn nontrivial representations of speech, applicable to unsupervised voice conversion and reaching state-of-the-art performance on unit discovery tasks. For unsupervised representation learning, they have become viable alternatives to continuous latent variable models such as the Va...| arXiv.org
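For reference, a minimal sketch of the vector-quantization bottleneck such models are built around (the straight-through estimator plus the standard codebook and commitment terms; simplified relative to the training recipes studied in the paper, and the function name is illustrative):

```python
# Hedged sketch of a VQ bottleneck with a straight-through gradient.
import torch

def vector_quantize(z, codebook, beta=0.25):
    # z: (batch, d), codebook: (K, d)
    dists = torch.cdist(z, codebook)                  # distances to all code vectors
    idx = dists.argmin(dim=1)                         # nearest-codebook lookup
    z_q = codebook[idx]
    codebook_loss = ((z.detach() - z_q) ** 2).mean()  # pull codes toward encoder outputs
    commit_loss = beta * ((z - z_q.detach()) ** 2).mean()
    z_q = z + (z_q - z).detach()                      # straight-through estimator
    return z_q, idx, codebook_loss + commit_loss

codebook = torch.randn(512, 64, requires_grad=True)
z = torch.randn(32, 64, requires_grad=True)
z_q, idx, loss = vector_quantize(z, codebook)
loss.backward()
print(z_q.shape, idx[:5], z.grad is not None)
```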
The recent success of Large Language Models (LLMs) has been predominantly driven by curating the training dataset composition, scaling of model architectures and dataset sizes and advancements in pretraining objectives, leaving tokenizer influence as a blind spot. Shedding light on this underexplored area, we conduct a comprehensive study on the influence of tokenizer choice on LLM downstream performance by training 24 mono- and multilingual LLMs at a 2.6B parameter scale, ablating different ...| arXiv.org
A central goal of sequence modeling is designing a single principled model that can address sequence data across a range of modalities and tasks, particularly on long-range dependencies. Although conventional models including RNNs, CNNs, and Transformers have specialized variants for capturing long dependencies, they still struggle to scale to very long sequences of $10000$ or more steps. A promising recent approach proposed modeling sequences by simulating the fundamental state space model (...| arXiv.org
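The fundamental state space model being simulated can be sketched as a plain discretized recurrence (illustrative numpy, not S4's structured parameterization or its convolutional training mode):

```python
# Hedged sketch of a discretized linear state space model,
# x_{k+1} = A x_k + B u_k, y_k = C x_k, scanned over a long sequence.
import numpy as np

def ssm_scan(A, B, C, u):
    x = np.zeros(A.shape[0])
    ys = np.empty(len(u))
    for k, uk in enumerate(u):            # recurrent view: O(1) state per step
        x = A @ x + B * uk
        ys[k] = C @ x
    return ys

rng = np.random.default_rng(0)
n = 16
A = np.diag(np.exp(-rng.uniform(0.001, 0.1, n)))   # stable, decaying diagonal state matrix
B, C = rng.standard_normal(n), rng.standard_normal(n)
y = ssm_scan(A, B, C, rng.standard_normal(10000))  # 10k+ steps in linear time
print(y.shape)
```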
Large Language Models (LLMs) employ auto-regressive decoding that requires sequential computation, with each step reliant on the previous one's output. This creates a bottleneck as each step necessitates moving the full model parameters from High-Bandwidth Memory (HBM) to the accelerator's cache. While methods such as speculative decoding have been suggested to address this issue, their implementation is impeded by the challenges associated with acquiring and maintaining a separate draft mode...| arXiv.org
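For context, the draft-then-verify pattern that motivates this line of work can be sketched with toy stand-in models (a greedy illustration of classic speculative decoding with a separate draft model, not this paper's proposed alternative; `draft_model` and `target_model` are placeholders):

```python
# Hedged sketch of greedy speculative decoding with a separate draft model.
VOCAB = list("abcd")

def draft_model(prefix):          # toy stand-in for a cheap draft model
    return VOCAB[(len(prefix) * 2) % 4]

def target_model(prefix):         # toy stand-in for the expensive target model
    return VOCAB[(len(prefix) * 2 + len(prefix) // 7) % 4]

def speculative_step(prefix, k=4):
    draft = []
    for _ in range(k):                                   # k cheap sequential draft steps
        draft.append(draft_model(prefix + "".join(draft)))
    accepted = []
    for tok in draft:              # a real system verifies all k positions in one target pass
        expected = target_model(prefix + "".join(accepted))
        if tok != expected:
            accepted.append(expected)                    # replace the first mismatch
            break
        accepted.append(tok)
    return prefix + "".join(accepted)

s = ""
while len(s) < 40:
    s = speculative_step(s)
print(s)
```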
Transformer models yield impressive results on many NLP and sequence modeling tasks. Remarkably, Transformers can handle long sequences which allows them to produce long coherent outputs: full paragraphs produced by GPT-3 or well-structured images produced by DALL-E. These large language models are impressive but also very inefficient and costly, which limits their applications and accessibility. We postulate that having an explicit hierarchical architecture is the key to Transformers that ef...| arXiv.org
We study the problem of efficient generative inference for Transformer models, in one of its most challenging settings: large deep models, with tight latency targets and long sequence lengths. Better understanding of the engineering tradeoffs for inference for large Transformer-based models is important as use cases of these models are growing rapidly throughout application areas. We develop a simple analytical model for inference efficiency to select the best multi-dimensional partitioning t...| arXiv.org
There is growing interest in ensuring that large language models (LLMs) align with human values. However, the alignment of such models is vulnerable to adversarial jailbreaks, which coax LLMs into overriding their safety guardrails. The identification of these vulnerabilities is therefore instrumental in understanding inherent weaknesses and preventing future misuse. To this end, we propose Prompt Automatic Iterative Refinement (PAIR), an algorithm that generates semantic jailbreaks with only...| arXiv.org
Large Language Models (LLMs) have demonstrated remarkable performance in many applications, including challenging reasoning problems via chain-of-thought (CoT) techniques that generate "thinking tokens" before answering the questions. While existing theoretical works demonstrate that CoTs with discrete tokens boost the capability of LLMs, recent work on continuous CoTs lacks a theoretical understanding of why they outperform their discrete counterparts in various reasoning tasks such as directed...| arXiv.org
We present a novel non-attention-based architecture for large language models (LLMs) that efficiently handles very long context windows, on the order of hundreds of thousands to potentially millions of tokens. Unlike traditional Transformer designs, which suffer from quadratic memory and computation overhead due to the nature of the self-attention mechanism, our model avoids token-to-token attention entirely. Instead, it combines the following complementary components: State Space blocks (ins...| arXiv.org
A key design constraint when implementing Monte Carlo and variational inference algorithms is that it must be possible to cheaply and exactly evaluate the marginal densities of proposal distributions and variational families. This takes many interesting proposals off the table, such as those based on involved simulations or stochastic optimization. This paper broadens the design space, by presenting a framework for applying Monte Carlo and variational inference algorithms when proposal densit...| arXiv.org
Key-value (KV) caching plays an essential role in accelerating decoding for transformer-based autoregressive large language models (LLMs). However, the amount of memory required to store the KV cache can become prohibitive at long sequence lengths and large batch sizes. Since the invention of the transformer, two of the most effective interventions discovered for reducing the size of the KV cache have been Multi-Query Attention (MQA) and its generalization, Grouped-Query Attention (GQA). MQA ...| arXiv.org
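The head-sharing trick behind MQA/GQA is easy to sketch (a shapes-only illustration): several query heads attend against each shared key/value head, so the KV cache shrinks by the group factor.

```python
# Hedged sketch of grouped-query attention; shapes are the point, not performance.
import torch

def gqa(q, k, v, n_groups):
    # q: (batch, n_q_heads, seq, d); k, v: (batch, n_kv_heads, seq, d)
    b, n_q, s, d = q.shape
    n_kv = k.shape[1]
    assert n_q % n_kv == 0 and n_q // n_kv == n_groups
    k = k.repeat_interleave(n_groups, dim=1)   # broadcast each KV head to its query group
    v = v.repeat_interleave(n_groups, dim=1)
    attn = torch.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)
    return attn @ v

q = torch.randn(2, 8, 128, 64)       # 8 query heads
k = v = torch.randn(2, 2, 128, 64)   # only 2 KV heads to cache (4x smaller KV cache)
print(gqa(q, k, v, n_groups=4).shape)
```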
Neural networks are trained primarily based on their inputs and outputs, without regard for their internal mechanisms. These neglected mechanisms determine properties that are critical for safety, like (i) transparency; (ii) the absence of sensitive information or harmful capabilities; and (iii) reliable generalization of goals beyond the training distribution. To address this shortcoming, we introduce gradient routing, a training method that isolates capabilities to specific subregions of a ...| arXiv.org
We reconsider calibration of statistical distance scales for planetary nebulae, examining precision and systematic error for various distance methods used as well as the scales themselves. A different calibration strategy, one based on precise trigonometric parallaxes by Harris et al. (2007; some improved by Benedict et al. 2009), is presented. Most statistical scales have an overall scale error; in addition, all four tested show dependence of distance ratio [scale/actual] on nebular radius. ...| arXiv.org
We report the discovery and careful orbital determination of 64 new irregular moons of Saturn found in images taken using the Canada-France-Hawaii Telescope from 2019-2021, bringing the total number of saturnian irregulars to 122. By more than doubling the sample of saturnian irregular moon orbits, including pushing to smaller sizes, we can now see finer detail in their orbital distribution. We note the emergence of potential subgroups associated with each of Siarnaq and Kiviuq within the Inu...| arXiv.org
The increasing complexity of AI systems has made understanding their behavior critical. Numerous interpretability methods have been developed to attribute model behavior to three key aspects: input features, training data, and internal model components, which emerged from explainable AI, data-centric AI, and mechanistic interpretability, respectively. However, these attribution methods are studied and applied rather independently, resulting in a fragmented landscape of methods and terminology...| arXiv.org
Large language models (LLMs) are powerful but static; they lack mechanisms to adapt their weights in response to new tasks, knowledge, or examples. We introduce Self-Adapting LLMs (SEAL), a framework that enables LLMs to self-adapt by generating their own finetuning data and update directives. Given a new input, the model produces a self-edit: a generation that may restructure the information in different ways, specify optimization hyperparameters, or invoke tools for data augmentation and gra...| arXiv.org
As AI agents powered by Large Language Models (LLMs) become increasingly versatile and capable of addressing a broad spectrum of tasks, ensuring their security has become a critical challenge. Among the most pressing threats are prompt injection attacks, which exploit the agent's reliance on natural language inputs -- an especially dangerous threat when agents are granted tool access or handle sensitive information. In this work, we propose a set of principled design patterns for building A...| arXiv.org
AI developers often apply safety alignment procedures to prevent the misuse of their AI systems. For example, before Meta released Llama 2-Chat - a collection of instruction fine-tuned large language models - they invested heavily in safety training, incorporating extensive red-teaming and reinforcement learning from human feedback. We explore the robustness of safety training in language models by subversively fine-tuning Llama 2-Chat. We employ quantized low-rank adaptation (LoRA) as an eff...| arXiv.org
The sub-Neptune frontier has opened a new window into the rich diversity of planetary environments beyond the solar system. The possibility of hycean worlds, with planet-wide oceans and H$_2$-rich atmospheres, significantly expands and accelerates the search for habitable environments elsewhere. Recent JWST transmission spectroscopy of the candidate hycean world K2-18 b in the near-infrared led to the first detections of carbon-bearing molecules CH$_4$ and CO$_2$ in its atmosphere, with a com...| arXiv.org
Cosmic hydrogen reionization and cosmic production of first metals are major phase transitions of the universe occurring during the first billion years after the Big Bang; however, these remain underexplored observationally. Using JWST NIRSpec prism spectroscopy, we report the discovery of a sub-$L_\ast$ galaxy at $z_{\rm spec}=8.1623\pm0.0007$, dubbed RXJ2129-z8HeII, via the detection of a series of strong rest-frame UV/optical nebular emission lines and the clear Lyman break. RXJ2129-...| arXiv.org
Deep learning (DL) creates impactful advances following a virtuous recipe: model architecture search, creating large training data sets, and scaling computation. It is widely believed that growing training sets and models should improve accuracy and result in better products. As DL application domains grow, we would like a deeper understanding of the relationships between training set size, computational scale, and model accuracy improvements to advance the state-of-the-art. This paper presen...| arXiv.org
The outer solar system is theoretically predicted to harbour an undiscovered planet, often referred to as P9. Simulations suggest that its gravitational influence could explain the unusual clustering of minor bodies in the Kuiper Belt. However, no observational evidence for P9 has been found so far, as its predicted orbit lies far beyond Neptune, where it reflects only a faint amount of sunlight. This work aims to find P9 candidates by taking advantage of two far-infrared all-sky surveys, whi...| arXiv.org
"Pasta alla Cacio e pepe" is a traditional Italian dish made with pasta, pecorino cheese, and pepper. Despite its simple ingredient list, achieving the perfect texture and creaminess of the sauce can be challenging. In this study, we systematically explore the phase behavior of Cacio and pepe sauce, focusing on its stability at increasing temperatures for various proportions of cheese, water, and starch. We identify starch concentration as the key factor influencing sauce stability, with dire...| arXiv.org
While large-scale unsupervised language models (LMs) learn broad world knowledge and some reasoning skills, achieving precise control of their behavior is difficult due to the completely unsupervised nature of their training. Existing methods for gaining such steerability collect human labels of the relative quality of model generations and fine-tune the unsupervised LM to align with these preferences, often with reinforcement learning from human feedback (RLHF). However, RLHF is a complex an...| arXiv.org
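For reference, a minimal sketch of the preference-fitting step this RLHF pipeline relies on (notation illustrative; $x$ is a prompt, $y_w$ the human-preferred and $y_l$ the rejected generation): a reward model $r_\phi$ is typically fit to the comparison labels with a Bradley-Terry objective, $\mathcal{L}(\phi) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\big[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\big]$, and the language model is then fine-tuned against $r_\phi$ with reinforcement learning.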
High throughput serving of large language models (LLMs) requires batching sufficiently many requests at a time. However, existing systems struggle because the key-value cache (KV cache) memory for each request is huge and grows and shrinks dynamically. When managed inefficiently, this memory can be significantly wasted by fragmentation and redundant duplication, limiting the batch size. To address this problem, we propose PagedAttention, an attention algorithm inspired by the classical virtua...| arXiv.org
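A hedged sketch of the paging bookkeeping (illustrative Python, not vLLM's implementation; the class and method names are hypothetical): KV entries live in fixed-size physical blocks, and each request keeps a block table from logical positions to whichever physical blocks happened to be free, so memory is claimed on demand and returned when the request finishes.

```python
# Hedged sketch of block-table bookkeeping for a paged KV cache.
BLOCK = 16

class PagedKVCache:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))      # pool of physical blocks
        self.block_tables = {}                   # seq_id -> list of physical block ids
        self.lengths = {}                        # seq_id -> tokens written so far

    def append(self, seq_id, kv):
        table = self.block_tables.setdefault(seq_id, [])
        n = self.lengths.get(seq_id, 0)
        if n % BLOCK == 0:                       # current block full: grab a free one
            table.append(self.free.pop())
        block, offset = table[n // BLOCK], n % BLOCK
        self.lengths[seq_id] = n + 1
        return block, offset                     # where the kernel would store this kv

    def release(self, seq_id):                   # request finished: blocks return to pool
        self.free.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_blocks=8)
print([cache.append("req-1", kv=None) for _ in range(20)][:3], cache.block_tables)
```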
In this paper, we identify and characterize the emerging area of representation engineering (RepE), an approach to enhancing the transparency of AI systems that draws on insights from cognitive neuroscience. RepE places population-level representations, rather than neurons or circuits, at the center of analysis, equipping us with novel methods for monitoring and manipulating high-level cognitive phenomena in deep neural networks (DNNs). We provide baselines and an initial analysis of RepE tec...| arXiv.org
We study empirical scaling laws for language model performance on the cross-entropy loss. The loss scales as a power-law with model size, dataset size, and the amount of compute used for training, with some trends spanning more than seven orders of magnitude. Other architectural details such as network width or depth have minimal effects within a wide range. Simple equations govern the dependence of overfitting on model/dataset size and the dependence of training speed on model size. These re...| arXiv.org
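As a hedged sketch of the functional form reported in this line of work (the fitted constants $N_c$, $D_c$, $C_c$ and exponents $\alpha_N$, $\alpha_D$, $\alpha_C$ are empirical and not reproduced here): when only one resource is limiting, the loss is well described by $L(N) \approx (N_c/N)^{\alpha_N}$, $L(D) \approx (D_c/D)^{\alpha_D}$, and $L(C) \approx (C_c/C)^{\alpha_C}$, where $N$ is the number of model parameters, $D$ the dataset size, and $C$ the training compute.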