20251116104113⁝ Safety Alignment in LLM

Nov 23, 2025 · 1 min read

https://github.com/p-e-w/heretic

Interesting idea for removing censorship (refusal behavior) from LLMs.

From the paper's abstract: "we propose a novel white-box jailbreak method that surgically disables refusal with minimal effect on other capabilities."

https://arxiv.org/abs/2406.11717
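
A minimal sketch of the idea behind this kind of "refusal-direction ablation": estimate a single direction in activation space that mediates refusal (difference in mean activations between refused and answered prompts) and project it out of the hidden states. The activations below are random placeholders, not real model internals, and this is not heretic's actual implementation; it is just an illustration of the technique under those assumptions.

```python
# Sketch of refusal-direction ablation (cf. arXiv:2406.11717).
# Activations are random placeholders standing in for one layer's
# residual-stream states; in practice they come from the LLM itself.
import numpy as np

rng = np.random.default_rng(0)
d_model = 512

# Hypothetical activations for prompts the model refuses vs. answers.
acts_harmful = rng.normal(size=(128, d_model))
acts_harmless = rng.normal(size=(128, d_model))

# Difference-in-means gives the candidate "refusal direction".
refusal_dir = acts_harmful.mean(axis=0) - acts_harmless.mean(axis=0)
refusal_dir /= np.linalg.norm(refusal_dir)

def ablate(hidden: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Remove each hidden state's component along `direction`."""
    return hidden - np.outer(hidden @ direction, direction)

# Applying this to hidden states (or baking it into the weights) leaves
# the model unable to represent the refusal direction.
new_hidden = ablate(acts_harmful, refusal_dir)
print(np.allclose(new_hidden @ refusal_dir, 0.0))  # True: direction removed
```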

See also 20251113123357b⁝ LLM for adjacent model notes.

