AI alignment
Note, Feb 2026
Cognitive capability and goals do not automatically converge on human well-being.
The key question is not only whether such divergence is possible, but how goals are formed, stabilized, and controlled.
https://www.lesswrong.com/w/orthogonality-thesis
Follow-up note, one year later (Feb 2026)
- Which version of the thesis are we adopting?
	- The weak version (mere logical possibility of pairing any intelligence level with any goal), or
	- the strong version (the practical “ease” of implementing arbitrary goals)?
- What does this imply for current ML models?
- Do contemporary LLMs/agents actually fit the model of a coherent terminal objective?
- Or are they better understood as a mixture of heuristics, policies, and local objectives?
- Probability vs. possibility
- Even if orthogonality is true as a design-space thesis, the key practical questions are:
- Which goals are most likely to arise under real training pipelines?
- Are human-friendly goals easier or harder to stabilize?
- Relation to 42n1: Metaethics and philosophy (philosophy)
- Is orthogonality really neutral with respect to metaethics?
- Or does it still imply something about moral realism / internalism?
- Misalignment vs. misuse (human abuse / malicious use)
	- Misalignment: the AI itself optimizes for an objective we did not intend, vs.
	- Misuse: humans deliberately apply AI to harmful goals.