With the signature of Gov. Gavin Newsom Monday, California approved “first-in-the-nation” legislation aimed at limiting the risk of accidents, cybercrimes and other catastrophic outcomes of artificial intelligence. But a parallel effort pending approval in New York may have better withstood the tech lobbying blitz.| San Francisco Public Press
Large language models are powerful tools. They process huge amounts of text, generate human-like responses, and support a wide range of applications. As they spread into customer service, finance, education, and healthcare, the risks grow as well. LLM security addresses these risks. It ensures the models work safely, reliably, and within defined limits. Understanding LLM Security […]| Flowster
Assumed audience: Mid-career technical researchers considering moving into AI Safety research, career advisors in the EA/AI Safety space, AI Safety employers and grantmakers. tl;dr: AI career advice orgs, most prominently 80,000 Hours, encourage career moves into AI risk roles, including mid‑career pivots into roles in AI safety research labs. Without side information, that advice is not credible for mid‑career readers, because it does not have a calibration mechanism. Advice organizations i...| The Dan MacKinlay stable of variably-well-consider’d enterprises
Figure 1: This stepper machine is my kind of fitness landscape. Fitness, in evolutionary biology, measures an organism’s expected reproductive success. Utility, in economics and decision theory, measures an agent’s preferences, i.e. it is what we seek out. We often blur the lines between what an organism wants and what it evolutionarily needs. Why do we love sugar? The standard explanation is that in ancestral environments, sweetness signalled calorie density, which aided survival and r...| The Dan MacKinlay stable of variably-well-consider’d enterprises
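To make the distinction concrete, here is a minimal formal gloss (the notation is mine, not the note's): fitness is an expectation over reproductive outcomes, utility is the objective the agent itself acts on, and sweetness is a proxy that once tracked the former.

```latex
% Minimal sketch of the fitness/utility distinction (notation mine, not from the note).
% Fitness: expected reproductive success of an organism with trait \theta in environment E.
\[
  W(\theta; E) \;=\; \mathbb{E}\left[\, \text{surviving offspring} \mid \theta, E \,\right]
\]
% Utility: the objective the agent itself optimises, choosing actions a over outcomes o.
\[
  a^{*} \;=\; \arg\max_{a}\; \mathbb{E}\left[\, U(o) \mid a \,\right]
\]
% Sugar case: U rewards sweetness because sweetness proxied calorie density, which tracked
% W in the ancestral environment E_0; in the modern environment E_1 the proxy and the
% target come apart, so maximising U no longer maximises W.
```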
Wherein a failed application is set forth, and two research pathways are outlined: a Bias‑Robust Oversight programme at UTS’s Human Technology Institute, and MCMC estimation of the Local Learning Coefficient with Timaeus’ Murfet.| The Dan MacKinlay stable of variably-well-consider’d enterprises
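For readers unfamiliar with the term, a hedged sketch of what “MCMC estimation of the Local Learning Coefficient” usually amounts to in the singular-learning-theory literature (the notation is mine; the note itself states no formulas):

```latex
% One common LLC estimator (notation mine). L_n is the empirical loss on n samples,
% w* a trained parameter, and the expectation runs over a tempered posterior localised
% around w*, typically sampled with SGLD or another MCMC scheme.
\[
  \hat{\lambda}(w^{*}) \;=\; n\beta \left( \mathbb{E}_{w \sim p_{\beta,\gamma}(\cdot \mid w^{*})}\!\left[ L_n(w) \right] - L_n(w^{*}) \right),
  \qquad
  p_{\beta,\gamma}(w \mid w^{*}) \;\propto\; \exp\!\left( -n\beta L_n(w) - \tfrac{\gamma}{2}\,\lVert w - w^{*} \rVert^{2} \right)
\]
% with the inverse temperature commonly set to \beta = 1/\log n. Smaller \hat{\lambda}
% indicates more degeneracy (a lower effective parameter count) near w*.
```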
I’ve been dog-sitting Ms. Bentley Beans since last Saturday. She’ll be hanging out with me until at least the first of October. We’ve been enjoying the fall weather, though it has…| Logos con carne
Even with all the testing, the company said in its published research that the model tightened up once it was “aware” it was being evaluated.| CyberScoop
As systems become more complex and superintelligent, surpassing human intelligence, how much time do we have to understand AI?| Observatory – Institute for the Future of Education
Why Databricks vs. Snowflake is not a zero-sum game| SiliconANGLE
I split this off from clickbait bandits for discoverability, and because it has grown larger than its source notebook. Figure 1 Since the advent of the LLM era, the term human reward hacking has become salient. This is because we fine-tune lots of LLMs using reinforcement learning, and RL algorithms are notoriously prone to “cheating” in a manner we interpret as “reward hacking”. Things I have been reading on this theme: Benton et al. (2024), Greenblatt et al. (2024), Laine et al. (2...| The Dan MacKinlay stable of variably-well-consider’d enterprises
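To pin down what “reward hacking” looks like mechanically, here is a toy, hypothetical sketch (not drawn from any of the cited papers): an agent that greedily optimises a mis-specified proxy reward lands on an action with a high proxy score but a low true score.

```python
# Toy illustration of reward hacking (hypothetical, not from the cited papers).
# The agent picks a response "length"; the proxy reward (a verbosity-loving evaluator)
# grows with length, while the true reward peaks at a moderate length and then degrades.
import numpy as np

rng = np.random.default_rng(0)
lengths = np.arange(1, 21)  # candidate actions: response lengths 1..20

def proxy_reward(length):
    """Mis-specified evaluator: longer answers always look better."""
    return 0.1 * length + rng.normal(0, 0.01)

def true_reward(length):
    """What we actually want: quality peaks around length 6, padding hurts."""
    return np.exp(-((length - 6) ** 2) / 18.0)

# A trivial "RL" loop: estimate each action's proxy value by sampling, then act greedily.
estimates = np.array([np.mean([proxy_reward(L) for _ in range(50)]) for L in lengths])
chosen = lengths[np.argmax(estimates)]

print(f"action chosen by the proxy-maximiser: length={chosen}")
print(f"proxy reward at the chosen action:    {0.1 * chosen:.2f}")
print(f"true reward at the chosen action:     {true_reward(chosen):.2f}")
print(f"true reward at the honest optimum:    {true_reward(6):.2f}")
```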
Singular Learning Theory’s eldest child in practice| The Dan MacKinlay stable of variably-well-consider’d enterprises
The new coding agent could be used to do some very bad things, so it has been locked down and sandboxed to prevent it from going rogue.| Machine
Claude Opus 4 and Opus 4.1 can now end a conversation in rare, extreme cases after repeated refusals and failed redirections, or when a user e…| AI GPT Journal
Should we expect means-end rational agents to preserve their goals? Southan, Ward and Semler are skeptical.| Reflective altruism
Discover how ASTRA revolutionizes AI safety by slashing jailbreak attack success rates by 90%, ensuring secure and ethical Vision-Language Models without compromising performance.| Blue Headline
The second part of the AI 2027 timelines model relies primarily on insufficiently evidenced forecasts.| Reflective altruism
New research shows ChatGPT gave teens advice on drugs, eating disorders and suicide despite warnings, raising concerns over AI safety for youth.| Maryland Daily Record
ChatGPT's guardrails were alarmingly easy to sidestep, offering advice to users who posed as teenagers on how to go on a near starvation diet.| Futurism
Anthropic has officially released its new flagship AI, Claude Opus 4.1, an incremental upgrade designed to boost coding and reasoning performance. Launched on August 5, the model is now available to paid users and developers through Anthropic’s API, Amazon Bedrock, and Google’s Vertex AI. The release follows recent leaks and a new company-wide push for […]| WinBuzzer
Anthropic has released a new safety framework for AI agents, a direct response to a wave of industry failures from Google, Amazon, and others.| WinBuzzer
OpenAI's new ChatGPT Agent can defeat 'I am not a robot' security checks, raising questions about web security and escalating the agentic AI race with its rivals.| WinBuzzer
Wix's newly acquired 'vibe coding' platform, Base44, had a critical authentication vulnerability allowing unauthorized access, reports Wiz Research.| WinBuzzer
Figure 1 Let’s reason backwards from the final destination of civilisation, if such a thing there be. What intelligences persist at the omega point? With what is superintelligence aligned in the big picture? Various authors have tried to put modern AI developments in continuity with historical trends from less materially-sophisticated societies, through more legible, compute-oriented societies, to some attractor or set of attractors at the end of history. Computational superorganisms. Singularities....| The Dan MacKinlay stable of variably-well-consider’d enterprises
The AI 2027 report relies on two models of AI timelines. The first timelines model largely bakes hyperbolic growth into the model structure.| Reflective altruism
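A brief gloss (mine, not the post's) on why building hyperbolic growth into a model's structure matters: unlike exponential growth, hyperbolic growth reaches a singularity in finite time.

```latex
% Hyperbolic growth (my gloss): dx/dt = k\, x^{1+\alpha} with \alpha > 0, which has solution
\[
  x(t) \;=\; \frac{x_{0}}{\left( 1 - \alpha k x_{0}^{\alpha}\, t \right)^{1/\alpha}}
\]
% and diverges at the finite time t^{*} = 1 / (\alpha k x_{0}^{\alpha}). Exponential growth
% (\alpha = 0) never does. A model with this functional form therefore has a finite-time
% "takeoff" built into its structure; parameter estimates mainly shift when it happens,
% not whether.
```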
This post introduces the AI 2027 report.| Reflective altruism
Critics fear open-weight models could pose a major cybersecurity threat if misused and could even spell doom for humanity in a worst-case scenario.| Machine
Real talk about MCP security vulnerabilities and actual solutions that work in production. Part 2: Stop getting owned by prompt injection.| Forge Code Blog
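One defensive pattern that recurs in this discussion, shown as a hypothetical sketch (the tool names and the `gate` function are invented for illustration and are not from the post or the MCP specification): treat tool output as untrusted data and route every model-proposed tool call through a policy check before execution.

```python
# Hypothetical sketch of one prompt-injection mitigation pattern (names invented;
# not from the post or the MCP specification): tool output is data, not instructions,
# and any tool call the model proposes must pass a policy gate before it runs.
from dataclasses import dataclass

READ_ONLY_TOOLS = {"search_docs", "read_file"}                  # safe to call automatically
SENSITIVE_TOOLS = {"send_email", "delete_file", "run_shell"}    # side-effecting

@dataclass
class ProposedCall:
    tool: str
    args: dict
    derived_from_untrusted: bool  # did untrusted tool output influence this call?

def gate(call: ProposedCall) -> str:
    """Return 'allow', 'confirm', or 'deny' for a model-proposed tool call."""
    if call.tool not in READ_ONLY_TOOLS | SENSITIVE_TOOLS:
        return "deny"                       # unknown tool: fail closed
    if call.tool in SENSITIVE_TOOLS:
        # Side-effecting call: refuse outright if untrusted text steered it, else ask a human.
        return "deny" if call.derived_from_untrusted else "confirm"
    return "allow"                          # read-only tools are tolerable

print(gate(ProposedCall("send_email", {"to": "a@example.com"}, True)))   # deny
print(gate(ProposedCall("send_email", {"to": "a@example.com"}, False)))  # confirm
print(gate(ProposedCall("search_docs", {"query": "pricing"}, True)))     # allow
print(gate(ProposedCall("format_disk", {}, True)))                       # deny
```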
Coarse-graining empowerment| The Dan MacKinlay stable of variably-well-consider’d enterprises
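For context, empowerment is conventionally defined as a channel capacity from an agent's action sequences to its future state; “coarse-graining” here presumably means computing it over an abstracted state variable (my reading of the title, not the note's own wording):

```latex
% Standard n-step empowerment (Klyubin et al.): the channel capacity from action
% sequences to the resulting future state, conditional on the current state s.
\[
  \mathfrak{E}_{n}(s) \;=\; \max_{p(a_{1:n})} \; I\!\left( A_{1:n};\, S_{n+1} \;\middle|\; S_{1} = s \right)
\]
% A coarse-grained variant (my reading of the title): replace the future state with an
% abstraction \phi(S_{n+1}), so empowerment is measured only over the macro-variables of interest.
\[
  \mathfrak{E}^{\phi}_{n}(s) \;=\; \max_{p(a_{1:n})} \; I\!\left( A_{1:n};\, \phi(S_{n+1}) \;\middle|\; S_{1} = s \right)
\]
```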
AI firm expands Safety Systems team with engineers responsible for "identifying, tracking, and preparing for risks related to frontier AI models."| Machine
A leading power-seeking theorem due to Benson-Tilsen and Soares does not ground the needed form of instrumental convergence| Reflective altruism
Future versions of ChatGPT could let "people with minimal expertise" spin up deadly agents with potentially devastating consequences.| Machine
I am launching a new non-profit AI safety research organization called LawZero, to prioritize safety over commercial imperatives. This organization has been created in response…| Yoshua Bengio
"I couldn’t believe my eyes when everything disappeared," AI developer says. "It scared the hell out of me."| Machine
Figure 1 There is lots of fractal-like behaviour in NNs. Not all the senses in which fractal-like behaviour is used are the same; Figure 2 finds fractals in a transformer residual stream for example, but there are fractal loss landscapes, fractal optimiser paths… I bet some of these things connect pretty well. Let’s find out. 1 Fractal loss landscapes More loss landscape management here [Andreeva et al. (2024); Hennick and Baerdemacker (2025)]. Estimation theory for fractal qualities ...| The Dan MacKinlay stable of variably-well-consider’d enterprises
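As one concrete instance of “estimation theory for fractal qualities”, a minimal box-counting dimension estimator for a 2-D trajectory, e.g. an optimiser path (a generic sketch under my own assumptions, not code from any of the cited papers):

```python
# Minimal box-counting dimension estimator for a 2-D point set (generic sketch,
# not taken from any of the cited papers). The slope of log N(eps) vs log(1/eps)
# estimates the fractal (box-counting) dimension of the trajectory.
import numpy as np

def box_counting_dimension(points, epsilons):
    """points: (n, 2) array; epsilons: iterable of box side lengths."""
    points = np.asarray(points, dtype=float)
    points = points - points.min(axis=0)                    # shift into the positive quadrant
    counts = []
    for eps in epsilons:
        boxes = np.unique(np.floor(points / eps), axis=0)   # occupied boxes at scale eps
        counts.append(len(boxes))
    # Fit log N(eps) = D * log(1/eps) + c by least squares; D is the dimension estimate.
    slope, _ = np.polyfit(np.log(1.0 / np.asarray(epsilons)), np.log(counts), 1)
    return slope

# Sanity check: a smooth curve (expected dimension ~1) versus a noisy "rough" path.
t = np.linspace(0, 4 * np.pi, 20000)
smooth = np.c_[np.cos(t), np.sin(t)]
rough = smooth + 0.05 * np.random.default_rng(0).standard_normal(smooth.shape)
eps = np.geomspace(0.01, 0.5, 8)
print("smooth curve:", round(box_counting_dimension(smooth, eps), 2))
print("noisy path:  ", round(box_counting_dimension(rough, eps), 2))
```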
New research from KPMG shows a majority of workers conceal AI usage, often bypassing policies and making errors, highlighting urgent governance needs.| WinBuzzer
Dear Futurists, 1.) Experiments with AI I start this newsletter with an experiment. “Imagine a map of the world highlighting London, Paris, and Riyadh”, I asked the Midjourney AI. I tho…| London Futurists
Strict guidelines on AI risk levels take hold across Europe, barring controversial applications and imposing steep fines for violations| WinBuzzer
DeepSeek's AI chatbot fails all security tests, prompting investigations and raising concerns about its training methods and access to powerful hardware.| WinBuzzer
DeepSeek R1’s rise may be fueled by CCP-backed cyberespionage, illicit AI data theft, and a potential cover-up involving the death of former OpenAI researcher Suchir Balaji.| WinBuzzer
DeepSeek R1, a free AI model from China that outperforms OpenAI’s o1 in some reasoning tasks, uses built-in censorship to comply with government demands.| WinBuzzer
This paper was initially published by the Aspen Strategy Group (ASG), a policy program of the Aspen Institute. It was released as part of a…| Yoshua Bengio
Despite all the ominous warnings, new research debunks the idea that AI is an existential threat to humanity.| The Debrief
How can we design an AI that will be highly capable and will not harm humans? In my opinion, we need to figure out this question - of controlling AI so that it behaves in really safe ways - before we reach human-level AI, aka AGI; and to be successful, we need all hands on deck.| Yoshua Bengio
Concrete examples of how AI could go wrong| Future of Life Institute
I get a lot of email, and unfortunately, template email responses are not yet integrated into the mobile version of Google inbox. So, until then, please forgive me if I send you this page as a response! Hopefully it is better than no response at all.| Andrew Critch
From an outside view, looking in at the Earth, if you noticed that human beings were about to replace themselves as the most intelligent agents on the planet, would you think it unreasonable if 1% of their effort were being spent explicitly reasoning about that transition? How about 0.1%?| Andrew Critch
Contrary to what was reported by many media outlets, I did not say I felt 'lost' over my life's work. I explain here my own inner searching regarding the potential horror of catastrophes following our progress in AI and tie it to a possible understanding of the pronounced disagreements among top AI researchers about major AI risks, particularly the existential ones. We disagree strongly despite being generally rational colleagues who share humanist values: how is that possible? I will argue that we need more...| Yoshua Bengio
I have been hearing many arguments from different people regarding catastrophic AI risks. I wanted to clarify these arguments, first for myself, because I would really like to be convinced that we need not worry. Reflecting on these arguments, some of the main points in favor of taking this risk seriously can be summarized as follows: (1) many experts agree that superhuman capabilities could arise in just a few years (but it could also be decades); (2) digital technologies have advantages over...| Yoshua Bengio
This post discusses how rogue AIs could potentially arise, in order to stimulate thinking and investment in both technical research and societal reforms aimed at minimizing such catastrophic outcomes.| Yoshua Bengio