Value Compass Cards
Introduction
Foundational AI models are often chaotic, producing incoherent responses that can be outright harmful and inconsistent with human values and societal norms. Researchers have devised several approaches for taming and aligning pre-trained models with human values, including RLHF/RLAIF for helpfulness and harmlessness, and bias mitigation to guard against various forms of discrimination. However, despite the prevalence of these techniques for instilling human values and preferences in AI models, there is currently no systematic methodology for documenting these model moderation activities and presenting them to the public to improve transparency and positive human-AI interaction. In this paper, we introduce Value Compass Cards, a documentation approach for cataloging the human values and preferences embedded in AI models to foster transparency and accountability of model capabilities.
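As one possible concretization, such a card could be represented as a small structured record published alongside a model. The following Python sketch is illustrative only: the field names (`alignment_methods`, `embedded_values`, `bias_mitigations`) are assumptions for the sake of example, not a schema defined by this work.

```python
from dataclasses import dataclass, field, asdict

# Hypothetical sketch of a Value Compass Card record; all field names
# here are illustrative assumptions, not a fixed schema.
@dataclass
class ValueCompassCard:
    model_name: str
    alignment_methods: list = field(default_factory=list)  # e.g. ["RLHF", "RLAIF"]
    embedded_values: list = field(default_factory=list)    # e.g. ["helpfulness", "harmlessness"]
    bias_mitigations: list = field(default_factory=list)   # documented anti-discrimination measures
    notes: str = ""                                        # free-text context for the public

    def to_dict(self) -> dict:
        """Serialize the card for publication alongside the model."""
        return asdict(self)

card = ValueCompassCard(
    model_name="example-model",
    alignment_methods=["RLHF"],
    embedded_values=["helpfulness", "harmlessness"],
)
print(card.to_dict()["alignment_methods"])  # → ['RLHF']
```

A machine-readable record of this kind could accompany a model release so that the public can inspect which value-alignment interventions were applied.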
Pre-print available January 23rd, 2024