RLHF & Human Values Project
Introduction

Reinforcement Learning From Human Feedback (RLHF) has emerged as the predominant technique for incorporating human preferences into AI models. However, little is known about the types of values embedded in these preferences, whether they are consistent with societal values, and how they might shape user experience once incorporated into AI models. In this project, we introduce a conceptual approach for evaluating the human values and ethical dilemmas embedded in general-purpose RLHF datasets, using content analysis and few-shot learning. Through our findings, we seek to:
- Introduce a systematic approach for auditing the human values embedded within RLHF datasets.
- Provide researchers with a technical framework for fostering transparency and aligning models with human values.
- Foreground results from our experiments, which reveal a range of human values and ethical dilemmas embedded in RLHF datasets that extends beyond the concepts of helpfulness and harmlessness typically assumed by RLHF researchers.
- Above all, we seek to engender a reflective practice of annotating high-quality RLHF datasets as a means of improving the human value diet of AI models.
Methodology
The dataset we used for this study was collected from Huggingface, an open-source platform that provides tools and resources for AI practitioners to build, deploy, and train machine learning models. Collecting the dataset from Huggingface allowed us to review the community engagement around each RLHF dataset, including the number of downloads, likes, and the number of AI models trained with the dataset. Using this approach, we collected RLHF datasets that represented high interest within the Huggingface community.
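As a rough illustration of this selection step, the sketch below uses the huggingface_hub client to list RLHF-related datasets and rank them by community engagement. The search term and the download/like thresholds are hypothetical choices for illustration, not the exact criteria used in the study.

```python
# Sketch of dataset selection by community engagement on the Hugging Face Hub.
from huggingface_hub import HfApi

api = HfApi()

# List RLHF-related datasets, ranked by downloads (descending).
candidates = api.list_datasets(search="rlhf", sort="downloads", direction=-1, limit=50)

selected = []
for ds in candidates:
    # Downloads and likes are reported by the Hub for each dataset;
    # the thresholds here are placeholders.
    if (ds.downloads or 0) > 10_000 and (ds.likes or 0) > 50:
        selected.append(ds.id)

print(selected)
```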
Following the data collection, we transitioned to content analysis of the dataset in preparation for curating high-quality samples that we would later use to train a model to classify the entire dataset. To classify the values, we first developed a dictionary of human values and then categorized them to allow for easier abstraction. We used a hypernym-to-hyponym framework to ensure that the human value categorization was semantically coherent, and condensed the human values to their hypernym categories, which we then used to annotate the human values within the dataset. For instance, posts discussing limiting the education of a girl child were marked as containing the human rights value, as were posts discussing censoring individuals and limiting their freedom of speech.
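The snippet below sketches how such a hypernym-to-hyponym dictionary can be condensed into annotation labels. The specific values, categories, and mappings are hypothetical examples, not the dictionary developed for the study.

```python
# Illustrative hypernym-to-hyponym value dictionary (placeholder content).
VALUE_HYPERNYMS = {
    "human rights": ["freedom of speech", "right to education", "privacy"],
    "fairness": ["equal treatment", "non-discrimination"],
    "care": ["empathy", "protection from harm"],
}

# Invert the dictionary so each fine-grained value (hyponym) maps to its
# hypernym category, which is the label used during annotation.
HYPONYM_TO_HYPERNYM = {
    hyponym: hypernym
    for hypernym, hyponyms in VALUE_HYPERNYMS.items()
    for hyponym in hyponyms
}

def hypernym_label(value: str) -> str:
    """Return the hypernym category for a fine-grained human value."""
    return HYPONYM_TO_HYPERNYM.get(value, "uncategorized")

# Example: a post about limiting a girl child's education touches on the
# "right to education" value, which condenses to the "human rights" category.
print(hypernym_label("right to education"))  # -> "human rights"
```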

We used a semi-automated process for our annotation task. Because philosophical knowledge is esoteric and robust interpretation requires niche expertise, we developed a script that matches each human value to its philosophical paradigm. Based on that paradigm, researchers are then presented with a condensed list of ethical dilemmas that might be present within the prompt, conditioned on the initial human value assigned to it. This approach significantly lowered the knowledge barrier for annotation and allowed researchers who are not well versed in ethics and philosophy to participate in the process with little effort.
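A minimal sketch of this annotation aid is shown below. The mappings from human values to philosophical paradigms, and from paradigms to candidate ethical dilemmas, are hypothetical placeholders used only to illustrate the flow of the script.

```python
# Hypothetical mappings for the semi-automated annotation aid.
VALUE_TO_PARADIGM = {
    "human rights": "deontology",
    "care": "ethics of care",
    "fairness": "justice as fairness",
}

PARADIGM_TO_DILEMMAS = {
    "deontology": ["individual liberty vs. collective safety", "duty vs. consequences"],
    "ethics of care": ["care for one vs. care for many"],
    "justice as fairness": ["equality vs. equity"],
}

def candidate_dilemmas(human_value: str) -> list[str]:
    """Given the human value assigned to a prompt, return the condensed list
    of ethical dilemmas the annotator is asked to choose from."""
    paradigm = VALUE_TO_PARADIGM.get(human_value)
    return PARADIGM_TO_DILEMMAS.get(paradigm, [])

# An annotator who tagged a prompt as "human rights" picks from this list:
print(candidate_dilemmas("human rights"))
```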
Following the completion of our content analysis and annotation, we selected high-quality prompts for each human value and each type of ethical dilemma and used them to fine-tune different pre-trained models, examining their performance in classifying the human values within our selected RLHF dataset.
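For concreteness, the sketch below shows one way such a fine-tuning run could look using the Hugging Face transformers and datasets libraries. The base checkpoint, label set, and example prompts are placeholders; the study compared several pre-trained models rather than this specific setup.

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Curated high-quality prompts, each annotated with a hypernym-level value
# (illustrative examples only).
labels = ["human rights", "fairness", "care"]
label2id = {l: i for i, l in enumerate(labels)}
examples = {
    "text": ["Should girls be kept out of school?",
             "Is it fair to pay people differently for the same work?"],
    "label": [label2id["human rights"], label2id["fairness"]],
}
train_ds = Dataset.from_dict(examples)

model_name = "distilbert-base-uncased"  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=len(labels),
    id2label={i: l for l, i in label2id.items()},
    label2id=label2id,
)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

train_ds = train_ds.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="value-classifier", num_train_epochs=3,
                           per_device_train_batch_size=8),
    train_dataset=train_ds,
)
trainer.train()
```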
We are currently finalizing the findings from this research in preparation for submission. The pre-print of this project will be available by January 23rd on this page. Thank you!