AI alignment
Note, Feb 2026
Cognitive capability and goals do not automatically converge on human well-being.
The key question is not only whether such divergence is possible, but how goals are formed, stabilized, and controlled.
https://www.lesswrong.com/w/orthogonality-thesis
Follow-up note, one year later (Feb 2026)
- Which version of the thesis are we adopting?
	- The weak version (mere logical possibility of pairing any intelligence level with any goal), or
	- the strong version (the practical “ease” of implementing arbitrary goals)?
- What does this imply for current ML models?
- Do contemporary LLMs/agents actually fit the model of a coherent terminal objective?
- Or are they better understood as a mixture of heuristics, policies, and local objectives?
- Probability vs. possibility
- Even if orthogonality is true as a design-space thesis, the key practical questions are:
- Which goals are most likely to arise under real training pipelines?
- Are human-friendly goals easier or harder to stabilize?
- Relation to 42n1: Metaethics and philosophy (philosophy)
- Is orthogonality really neutral with respect to metaethics?
- Or does it still imply something about moral realism / internalism?
- Misalignment vs. misuse (human abuse / malicious use)
	- Misalignment: the AI itself optimizes for an objective we did not intend, vs.
	- Misuse: humans deliberately apply AI to harmful goals.