AI alignment

42 AI

Feb 2026 note

Cognitive capability and goals do not automatically converge on human well-being.
The key question is not only whether arbitrary goal–capability pairings are possible, but how goals are formed, stabilized, and controlled.


https://www.lesswrong.com/w/orthogonality-thesis

One year later (note, Feb 2026)

  • Which version of the thesis are we adopting?
    • the weak version (logical possibility), or
    • the strong version (the practical “ease” of implementing arbitrary goals).
  • What does this imply for current ML models?
    • Do contemporary LLMs/agents actually fit the model of a coherent terminal objective?
    • Or are they better understood as a mixture of heuristics, policies, and local objectives?
  • Probability vs. possibility
    • Even if orthogonality is true as a design-space thesis, the key practical questions are:
      • Which goals are most likely to arise under real training pipelines?
      • Are human-friendly goals easier or harder to stabilize?
  • Relation to 42n1 (Metaethics and philosophy)
    • Is orthogonality really neutral with respect to metaethics?
    • Or does it still imply something about moral realism / internalism?
  • Misalignment vs. misuse
    • Misalignment: the AI itself optimizes for an objective that diverges from ours, vs.
    • Misuse: humans deliberately direct AI toward harmful goals.
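The weak (design-space) reading above can be illustrated with a toy sketch: a single generic optimizer whose capability (local search) is independent of whichever goal is plugged into it. This is only an illustration of the "capability and goal are separate parameters" idea; the names `hill_climb` and `neighbors` are my own, not from any source.

```python
import random

def hill_climb(objective, start, neighbors, steps):
    """Generic local search. The search procedure (the 'capability')
    knows nothing about the objective it is handed."""
    state = start
    for _ in range(steps):
        candidate = random.choice(neighbors(state))
        # Accept the move only if it does not decrease the objective.
        if objective(candidate) >= objective(state):
            state = candidate
    return state

# Neighbors of an integer: one step up or down.
neighbors = lambda x: [x - 1, x + 1]

# The same optimizer pursues two unrelated goals over the integers.
maximize_x = hill_climb(lambda x: x, start=0, neighbors=neighbors, steps=50)
near_seven = hill_climb(lambda x: -abs(x - 7), start=0, neighbors=neighbors, steps=200)
```

The second call converges on 7 (the state 7 is absorbing under this objective) even though the search code is byte-for-byte identical to the first call; only the goal differs. Note this only demonstrates the weak thesis (the pairing is possible), not the strong claim that arbitrary goals are equally easy to instill in real training pipelines.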