A Personal Journal of Learning and Discovery
Archive
Tag: rl
1 item tagged.
02 Apr 2026
Reinforcement Learning with GRPO: Fine-Tuning a Small Language Model for Chain-of-Thought Math Reasoning, Similar to DeepSeek R1 Training
llm
coding
training
rl
deepseek