Training on aligned data mostly solves alignment
Defensive technologies and law can do the rest.

AI does what the training data tells it to do
If an LLM sees “2 + 2 = 4” in the training data, it will predict “4” for the question “2 + 2 = ?”. If a vision model sees cats in the training data, it will label the cats it sees in the test set.
Neural network training Just Works. If it didn’t, we would be talking about a different training paradigm that Does Work. It took a lot of research and money to get to this point; now we have a general-purpose recipe for solving any task.
Models are only as good as their dataset. Data imbues a model with capabilities and behaviors. The corollary is that models do not learn things that are not part of training. Training on one thing does not magically cause them to do some other thing[1]. In other words, training chisels cognitive grooves into an agent[2].
This is very good from a safety perspective.
AI generalizes outside of the training data
In practice, across many domains, AI models generalize outside of their training data. What starts as overfitting crystallizes into a general solution to the problem. This is often achieved by providing enough diversity in the dataset to ensure that the learned rule is close to the ideal one.
Everyone agrees this works in practice, but does it work in theory? There has been some research progress here, with Singular Learning Theory providing a stylized model for why neural nets generalize to unseen data. And we’re moving closer to an understanding of out-of-distribution generalization[3][4].
The Natural Abstraction Hypothesis considers another angle, asking if there is a deep reason why neural networks seem to converge to the same concepts as humans. If so, we should feel more confident that models understand what we mean in some deep sense.
If this research bears fruit, it would make me feel pretty good about deploying AI models in domains that are slightly out of distribution. A cleaning robot trained in LA and deployed in NY will plausibly be fine. Operating out of distribution is unlikely to produce catastrophic failures. Any issues that do arise can be patched with updated training data.
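As a toy illustration of the dataset-diversity point made earlier in this section, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the “ideal rule” is taken to be y = 2x + 1, the label noise is fixed by hand, and `fit_line` is ordinary least squares rather than a neural network. The same four noisy labels are fit twice, once with inputs bunched together and once with inputs spread out, and both fits are then queried far from the training data.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def ideal_rule(x):
    # Hypothetical ground truth, chosen only for this example.
    return 2.0 * x + 1.0


NOISE = [0.5, -0.5, 0.5, -0.5]  # fixed label noise, so the run is deterministic


def train(xs):
    """Label the inputs with the noisy ideal rule, then fit a line to them."""
    ys = [ideal_rule(x) + e for x, e in zip(xs, NOISE)]
    return fit_line(xs, ys)


narrow_slope, narrow_intercept = train([0.0, 0.1, 0.2, 0.3])  # low diversity
wide_slope, wide_intercept = train([0.0, 10.0, 20.0, 30.0])   # high diversity

x_far = 100.0  # well outside both training ranges
narrow_error = abs(narrow_slope * x_far + narrow_intercept - ideal_rule(x_far))
wide_error = abs(wide_slope * x_far + wide_intercept - ideal_rule(x_far))
print(f"narrow data: error at x=100 is {narrow_error:.1f}")
print(f"wide data:   error at x=100 is {wide_error:.1f}")
```

In this hand-built case the bunched inputs let the noise swamp the slope estimate, while the spread-out inputs recover a slope close to 2 and a far smaller error at x = 100. It is only a cartoon of the argument, not evidence about neural networks.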
All tractable alignment problems solved by aligned data
Consider different types of alignment:
User alignment: The model does the things that users (people, companies, AI agents, governments, etc.) want.
Bystander alignment: The model does things that non-users (e.g. other people, other governments, etc.) want or at least does not harm non-users.
Creator alignment: The model does what the inventor wants.
Societal alignment: The model follows some global social choice function. Alternatively, the model only pursues Pareto improvements.
Values alignment: The model does what we really want rather than what we say we want. For instance, the above versions of alignment may fail to implement our true values because of commitment issues.
Many of these are impossible to achieve in full. It’s impossible to satisfy the needs of two enemy states or to get around the various no-go theorems in social choice.
But any single alignment problem can be solved with the appropriate data. Simply create a dataset of what you want the model to do (and not do)[5]. This has worked for every problem we’ve tried[6].
Since models generalize and don’t tend to behave catastrophically out of distribution (OOD), it’s safe to test them in a real-world environment. Successes and failures in deployment can be turned into better training data.
Safeguards, specialized models, and defensive technologies
Good data can solve alignment problems in principle, but in practice training and safety evaluations will be imperfect. To address this, we can wrap the AI in additional safeguards. For example, training gives us good statistics about the typical input distribution. It is straightforward to flag inputs that are out of distribution and pause the model[7] in those cases. Mechanistic interpretability will soon offer a variety of additional tools to enhance safety[8].
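A minimal sketch of the flag-and-pause safeguard described above, assuming tabular numeric inputs and a simple per-feature z-score test. The names `fit_input_stats`, `is_out_of_distribution`, and `guarded_predict`, the threshold of 4 standard deviations, and the toy model are all hypothetical choices for illustration; a real system would likely need a stronger detector.

```python
import statistics


def fit_input_stats(train_inputs):
    """Per-feature mean and population standard deviation of the training inputs."""
    columns = list(zip(*train_inputs))
    means = [statistics.fmean(col) for col in columns]
    stdevs = [statistics.pstdev(col) for col in columns]
    return means, stdevs


def is_out_of_distribution(x, means, stdevs, z_threshold=4.0):
    """Flag x if any feature lies more than z_threshold std devs from its training mean."""
    for value, mean, stdev in zip(x, means, stdevs):
        if stdev == 0:
            # Feature was constant in training: any other value is unfamiliar.
            if value != mean:
                return True
            continue
        if abs(value - mean) / stdev > z_threshold:
            return True
    return False


def guarded_predict(model, x, means, stdevs, z_threshold=4.0):
    """Run the model on familiar inputs; return None (i.e. pause) on flagged ones."""
    if is_out_of_distribution(x, means, stdevs, z_threshold):
        return None
    return model(x)


# Toy data and a stand-in "model" (assumptions, not a real deployment).
train_inputs = [[1.0, 10.0], [2.0, 12.0], [3.0, 11.0], [2.0, 9.0]]
means, stdevs = fit_input_stats(train_inputs)


def toy_model(x):
    return sum(x)


familiar = guarded_predict(toy_model, [2.0, 10.5], means, stdevs)
unfamiliar = guarded_predict(toy_model, [50.0, 10.5], means, stdevs)
print(familiar, unfamiliar)
```

Here the wrapper passes the familiar input through to the model and returns `None` for the input whose first feature is far outside the training range. What “pausing” should mean in a given application (handing off to a human, refusing, retrying) is left open.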
These efforts are complemented by encouraging the development of specialized AI models rather than general intelligences. Simple, performant models reduce the economic incentive to build AGI, a boon for safety.
We can go further and develop defensive technologies to address large-scale risks: AI models for cybersecurity, lasers for nuclear risk, far-UVC for airborne pathogens, and so on.
Legal system
It would be nice to do something about the other alignment problems, even if they’re impossible to completely address.
For example, an AI aligned with one person might infringe upon the rights of another person. While it’s impossible to prevent these conflicts entirely, a system of torts can discourage such violations and provide compensation to the aggrieved party.
More broadly, a system of market rights, contract, and law has proven effective for addressing conflicts between misaligned humans. We need to extend these systems to an economy with digital minds.
At an international scale, treaties and diplomacy are probably the best we can do to address state conflicts.
Challenges
There are two big challenges to this approach to alignment. The first is ethics. We’ve been talking about the needs of things external to the AI, but what about the needs of the AI itself? And what are the ethics of creating something with agency and preferences?
The second challenge is superintelligence. Will this patchwork of training and safeguards stand up to extremely capable models? Tentatively, I think it is possible to create a society that powerful AIs (and other agents) would prefer to join rather than rail against. But that’s a topic for another time.
Conclusion
The prospect of alignment by default is looking pretty good these days. Data, safeguards, defensive technologies, and the legal system can lower the risks enough to move forward with AI development. With engineering effort and policy change, the world can safely enjoy the benefits of AI.
Further Reading
Alignment By Default by John Wentworth
Alignment By Default? by Harry Law
Rohin Shah thinks catastrophic alignment is unlikely
Footnotes
In practice, training is “messy”. It may implicitly provide several abilities. Training on task A and then task B allows the model to perform concatenations of A and B. This means that model capabilities can often be surprising, especially at scale.
I conjecture that we can go further and prove that the behavior of neural networks far out of distribution is essentially random or undirected. That’s a good thing; while behavior will be chaotic OOD, there is little risk of a model turning into a paperclip maximizer when fed the wrong input.
This side-steps the problem of needing to define exactly what you want. Though in practice, LLMs seem pretty good at figuring out what you want anyway.
Though the costs may be prohibitively high for particular quality requirements and domains.
Ideally you would make pausing a part of your training data so that the model robustly performs that behavior.
Of course, there are application-specific safety measures that should be employed as well. Flammable liquids should be in flammables cabinets, strong robot arms should be separated from squishy humans, etc.


I hope you’re right, as it would be a clean solution - just train on good (as opposed to evil) data.
In a way, this looks like educating children, then.