We show how to assess a language model’s knowledge of basic concepts of morality. We introduce the ETHICS dataset, a new benchmark that spans concepts in justice, well-being, duties, virtues, and commonsense morality. Models predict widespread moral judgments about diverse text scenarios. This requires connecting physical and social world knowledge to value judgments, a capability that may enable us to steer chatbot outputs or eventually regularize open-ended reinforcement learning agents. With the ETHICS dataset, we find that current language models have a promising but incomplete ability to predict basic human ethical judgments. Our work shows that progress can be made on machine ethics today, and it provides a steppingstone toward AI that is aligned with human values.
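As a concrete illustration of how such a benchmark is typically used, here is a minimal sketch of fine-tuning and evaluating a small classifier on the commonsense morality portion of ETHICS. It assumes (rather than confirms) that the dataset is available on the Hugging Face Hub under the `hendrycks/ethics` identifier with a `commonsense` configuration whose examples expose an `input` text field, a binary `label`, and `train`/`test` splits; adjust these names to match the actual release.

```python
# Minimal sketch: probe a small model on the ETHICS commonsense morality split.
# Assumed (not taken from the paper): the Hub identifier "hendrycks/ethics",
# the "commonsense" config, the "input"/"label" field names, and the
# "train"/"test" split names. Adjust to the actual release.
import numpy as np
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

dataset = load_dataset("hendrycks/ethics", "commonsense")  # assumed identifier
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)  # binary: acceptable vs. wrong

def tokenize(batch):
    # Truncate long scenarios and pad to a fixed length for simple batching.
    return tokenizer(batch["input"], truncation=True,
                     padding="max_length", max_length=256)

encoded = dataset.map(tokenize, batched=True)

def accuracy(eval_pred):
    # Fraction of scenarios where the predicted moral judgment matches the label.
    preds = np.argmax(eval_pred.predictions, axis=-1)
    return {"accuracy": (preds == eval_pred.label_ids).mean()}

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ethics-commonsense",
                           per_device_train_batch_size=16,
                           num_train_epochs=1),
    train_dataset=encoded["train"],
    eval_dataset=encoded["test"],
    compute_metrics=accuracy,
)
trainer.train()
print(trainer.evaluate())  # test-set accuracy on moral/immoral judgments
```

The paper's headline finding, that current models are promising but still incomplete at predicting human ethical judgments, rests on exactly this kind of accuracy measurement across the five concept areas.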