Passive BCIs for Reinforcement Learning

The world of Artificial Intelligence (AI) is experiencing an unprecedented revolution, with Natural Language Processing (NLP) at the forefront. Ingenious techniques such as the Transformer Architecture and Reinforcement Learning from Human Feedback (RLHF) have enabled the latest Large Language Models (LLMs) to achieve remarkable performance across a wide range of fields and domains.

However, such techniques are not without their limitations, which make them hard to scale effectively. At Zander Labs, our passive Brain-Computer Interface (pBCI) technology could address several of these limitations and thereby significantly improve RLHF. Among other benefits, combining RLHF with pBCI has the potential to enhance AI alignment, capture more meaningful information, save time and resources, and improve the understanding of affective speech and context.

The current state of AI and LLMs

The Transformer Architecture and RLHF have enabled the development of exceptional LLMs. Trained to predict the next word in a sequence of text, LLMs have come to exhibit a range of sophisticated behaviors with remarkably human-like traits. The latest iterations of these models, such as GPT-4, show capabilities beyond the mastery of language: concept understanding, emotional intelligence, creative thinking, reasoning, and complex problem solving all appear to be present, at least in rudimentary form.

GPT-4 performs exceptionally well on standardized tests and benchmarks, in some cases even surpassing human performance. Surprisingly capable at handling challenging and novel tasks across fields such as mathematics, programming, computer vision, medicine, law, economics, psychology and more, it has become the center of attention for its impressive behavior. Such results have led many experts to recognize the general-purpose usefulness of these models and to view them as possible precursors of Artificial General Intelligence (AGI), on the way to bridging the gap between AI and human intelligence. Regardless of whether this holds true, at the very least it has forced experts to re-evaluate our understanding and definitions of intelligence.

The Transformer Architecture and RLHF

How do these novel techniques enable such functionality? The Transformer is a type of neural network architecture based on the concept of self-attention, which allows the model to weigh the importance of different parts of the input sequence and how they relate to one another. Unlike traditional recurrent neural networks (RNNs), which process one element at a time, Transformers process the entire input sequence at once, significantly speeding up work on sequential data. This is a game changer for natural (written) language. The parallel processing of input sequences directly improves efficiency and enables the model to capture relationships between distant elements of the sequence. In NLP, these long-range dependencies dramatically improve context understanding and allow the model to stay on topic while maintaining coherence throughout a conversation.
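To make this concrete, below is a minimal NumPy sketch of single-head scaled dot-product self-attention. The toy dimensions, weight matrices and variable names are our own illustration; real Transformers add multi-head projections, masking, positional encodings and feed-forward layers on top of this core operation.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence X of shape (seq_len, d_model)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                 # project inputs to queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # how strongly each position attends to every other
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over positions
    return weights @ V                               # each output mixes information from all positions

# Toy example: 4 tokens with 8-dimensional embeddings, processed in a single pass (no recurrence).
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)           # -> (4, 8)
```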

RLHF, on the other hand, is an approach that leverages direct human feedback during fine-tuning. This is especially useful when the performance of a model is hard to evaluate or when a reward cannot be computed objectively, as is the case in language modeling. In a simplified view, RLHF for LLMs relies on people manually rating the text generated by a pre-trained language model, either numerically or categorically as “good” or “bad”. This human feedback is used to train a reward model that later serves as the reward function during the reinforcement learning process. The LLM is then fine-tuned against the reward model, optimizing its policy to produce outputs that receive a high reward (i.e., that are perceived as good) rather than a low reward (perceived as bad). Through multiple iterations and refinements, the model gradually improves its ability to generate responses that align with human preferences. This process enables AI systems to incorporate human values and preferences into their decision-making, enhancing the alignment between the model's behavior and human intentions.
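The sketch below illustrates the shape of this pipeline under strong simplifications: responses are reduced to feature vectors, a linear scorer stands in for the neural reward model, and a hidden "preference direction" stands in for the human evaluators. Actual RLHF systems train a neural reward model on text and update the policy with an RL algorithm such as PPO.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16                                     # toy feature dimension for a generated response

# Hypothetical setup: the reward model is a simple linear scorer r(x) = w @ x.
w = np.zeros(d)
reward = lambda x: x @ w

# Step 1: collect human preference pairs (chosen, rejected). Here a hidden
# "preference direction" stands in for the human evaluators.
true_pref = rng.normal(size=d)
pairs = []
for _ in range(200):
    a, b = rng.normal(size=d), rng.normal(size=d)
    chosen, rejected = (a, b) if a @ true_pref > b @ true_pref else (b, a)
    pairs.append((chosen, rejected))

# Step 2: fit the reward model with the pairwise (Bradley-Terry) objective.
lr = 0.05
for _ in range(50):
    for chosen, rejected in pairs:
        p = 1.0 / (1.0 + np.exp(-(reward(chosen) - reward(rejected))))
        w += lr * (1.0 - p) * (chosen - rejected)    # gradient ascent on the log-likelihood

# Step 3: the frozen reward model scores candidate outputs during fine-tuning,
# and the policy is optimized to produce outputs that score highly.
candidates = rng.normal(size=(5, d))
print(np.argsort([reward(c) for c in candidates])[::-1])   # ranked by modelled preference
```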

Even though RLHF is a powerful tool for enabling human-like behavior in AI models and improving AI alignment, it introduces its own challenges. RLHF relies on a large number of human evaluators, it is hard to scale up, and it is limited in the amount and type of feedback it can capture. Moreover, high-quality feedback can be slow and difficult to obtain, as it is highly subjective and its consistency varies with the task, the interface, and the individual preferences, motivations and biases of the evaluators. These challenges are further compounded by the limitations of the Transformer Architecture itself: being computationally complex and resource-intensive, Transformers require significant computational power, constraining the scalability and cost-effectiveness of an RLHF implementation. Given how useful RLHF has been, is there a way to address some of these limitations and improve it?

Integrating Brain-Computer Interfaces with AI: A Fresh Approach

Our innovative approach to improving RLHF relies on passive BCI technology. Passive BCIs (pBCIs) can extract implicit information about the mental state of a human from ongoing brain activity.

At Zander Labs, we have described how such information can be used as feedback to enable neuroadaptive logic in Human–Computer Interaction (HCI) applications.

To give an example, let’s say a drone is deployed to scan an area in search of a missing person. The drone decides its path based on its search algorithms while being overseen by a human operator who is familiar with the terrain. With their intimate knowledge of the landscape and probable hideouts, the operator may sense when the drone is veering off the optimal route. The operator's brain will then naturally produce error-related signals (an intuitive "this doesn't feel right" response). Using a pBCI, we can detect these error signals in real time. So, instead of manually steering with controls, the operator's brain activity, specifically these error signals, guides the drone back on track.
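A deliberately simplified sketch of such a closed loop is shown below; the classifier stub, the 0.5 threshold and the revert-the-maneuver rule are placeholder assumptions of ours, not a description of a deployed system.

```python
import numpy as np

rng = np.random.default_rng(2)

def pbci_error_probability(eeg_window):
    """Placeholder for a pBCI classifier that returns the probability that the operator's
    EEG contains an error-related response to the drone's latest maneuver."""
    return float(rng.beta(2, 5))              # stub value; a real system would decode EEG here

heading = 0.0                                 # current heading of the drone, in radians
for step in range(20):
    proposed_turn = rng.normal(scale=0.2)     # turn suggested by the search algorithm
    heading += proposed_turn                  # drone executes the maneuver
    p_error = pbci_error_probability(eeg_window=None)   # operator's implicit reaction
    if p_error > 0.5:
        heading -= proposed_turn              # confident error signal: revert the maneuver
print(f"final heading: {heading:.2f} rad")
```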

The same principle of using implicit neural information as feedback could also be used in the context of Reinforcement Learning by substituting manual feedback with pBCI-derived metrics, an approach we call Neuroadaptive Feedback Learning (NFL).

As a proof of concept, pBCI-supported RL has been demonstrated in the implicit cursor control experiment. Briefly, participants implicitly guided a cursor towards a predetermined target through a combination of pBCI and reinforcement learning. More specifically, the experiment demonstrated how mental states related to error perception could be used to modify the probabilities of subsequent cursor movements.

The use of pBCI led to a significant improvement over a no-feedback approach, roughly halving the average number of cursor movements needed to reach the target (from 27 to 13) and closing the gap between random behavior (27 steps) and perfect performance (10 steps) by more than 80%.
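The sketch below captures the spirit of this setup in a reduced, one-dimensional form: a placeholder pBCI classifier noisily flags moves that an observer would perceive as errors, and those flags shift the probabilities of subsequent moves. The 85% detection accuracy, step sizes and update rule are illustrative assumptions, not the parameters of the original experiment.

```python
import numpy as np

rng = np.random.default_rng(3)
ACTIONS = np.array([-1, +1])              # simplified 1-D cursor: step left or step right
target, cursor = 8, 0
prefs = np.zeros(2)                       # learned action preferences, initially uniform

def pbci_error_signal(move_was_wrong):
    """Placeholder for the pBCI: noisy single-trial detection of perceived error."""
    p_detect = 0.85                       # assumed classification accuracy, illustrative only
    return move_was_wrong if rng.random() < p_detect else not move_was_wrong

steps = 0
while cursor != target:
    probs = np.exp(prefs - prefs.max())
    probs /= probs.sum()                  # softmax over the action preferences
    a = rng.choice(2, p=probs)            # sample the next move from current probabilities
    previous = cursor
    cursor += ACTIONS[a]
    steps += 1
    moved_away = abs(target - cursor) > abs(target - previous)
    if pbci_error_signal(moved_away):     # observer's brain implicitly flags bad moves
        prefs[a] -= 0.5                   # error-flagged moves become less likely...
    else:
        prefs[a] += 0.5                   # ...and unflagged moves more likely
print(f"target reached in {steps} moves")
```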

Neuroadaptive Feedback Learning: Improving RLHF with pBCI

Apart from increasing performance, NFL has four main advantages over traditional Reinforcement Learning techniques.

To begin with, since pBCI technology does not rely on the active participation of the user, no additional actions are required from the human evaluators to convey their feedback. NFL eliminates the need for pushing buttons or manual labeling, providing a more direct, natural and intuitive way for humans to communicate their preferences and evaluations. Overall, this would save time and resources, increase efficiency and streamline the process.

In addition, NFL could provide continuous, real-time feedback with higher resolution. Traditional feedback is given over large portions of data that a person first has to process and evaluate as a whole. In the case of LLMs, feedback is usually provided for the entire output, without the ability to evaluate subsections or individual components independently, which makes the process suboptimal. In contrast, NFL would allow fine-grained feedback that traditional RLHF cannot capture, for example on how individual language components such as words and phrases relate to various mental states, while retaining an overall evaluation of the output. This would help identify when words and phrases are emotionally charged, carry negative connotations, are unexpected, misused or otherwise unwanted, significantly improving the handling of context and of affective and figurative speech.
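One way to picture such fine-grained feedback is to align a continuous pBCI-derived signal with the moments at which individual tokens are read, as in the sketch below; the token onsets, sampling rate and random stand-in signal are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical alignment of a continuous pBCI-derived signal with token onsets, so that each
# displayed token receives its own feedback value instead of one rating for the whole response.
tokens   = ["The", "results", "were", "unexpectedly", "catastrophic", "."]
onsets_s = np.array([0.0, 0.4, 0.8, 1.2, 1.9, 2.6])     # when each token was read (assumed)
t        = np.linspace(0.0, 3.0, 300)                    # pBCI output sampled at 100 Hz
signal   = rng.normal(size=t.size)                       # stand-in for a decoded mental state

token_feedback = []
for start, end in zip(onsets_s, np.append(onsets_s[1:], 3.0)):
    window = signal[(t >= start) & (t < end)]
    token_feedback.append(window.mean())                  # one feedback value per token

for tok, fb in zip(tokens, token_feedback):
    print(f"{tok:>14s}  {fb:+.2f}")
```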

Furthermore, pBCI-derived feedback has the potential to capture more nuanced information about the evaluators’ decision-making process. In traditional RLHF, humans typically provide feedback only after they have completed their evaluation of a particular output. Intermediary decisions, judgments, and thought processes that occur along the way are ignored. As a result, the feedback reflects the evaluators' final assessment without revealing the underlying reasoning or cognitive processes that led to it. pBCIs, however, could capture implicit information about every stage of the evaluation, from initial subconscious reactions to higher-order cognition, and could also track how the relevant mental states evolve. This would provide a more comprehensive understanding of how evaluators arrive at their conclusions, enable the identification of key factors that influence decision making, and greatly improve alignment with human values.

Finally, another advantage of NFL is its potential to provide multifaceted feedback. Using mental states for Reinforcement Learning is not limited to a single state; multiple states could be combined into a single feedback signal. Each mental process may convey meaningful information about a specific domain, from performance evaluation to emotional reactions.

For instance, mental workload could indicate how easy or difficult a particular word or phrase is to understand, while surprise could be used to identify unexpected responses. This would allow feedback on multiple aspects to be conveyed at the same time, providing a richer, more comprehensive feedback signal. Additionally, the simultaneous detection of multiple mental states could provide insights into processes that no individual state captures on its own.
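As a sketch of how this could look, the snippet below folds several decoded mental states into one scalar reward; the state names, weights and sign conventions are illustrative assumptions rather than established choices.

```python
# Sketch of combining several decoded mental states into a single scalar reward.
def composite_reward(states, weights=None):
    # Negative weights penalize states such as perceived error or high workload;
    # positive weights reward states such as positive valence.
    weights = weights or {"error": -1.0, "workload": -0.3, "surprise": -0.2, "valence": +0.5}
    return sum(w * states.get(name, 0.0) for name, w in weights.items())

# Example: a response that was easy to process but contained a clearly perceived error.
decoded = {"error": 0.9, "workload": 0.2, "surprise": 0.6, "valence": 0.1}
print(composite_reward(decoded))   # single reward value that an RL update could consume
```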

To sum up, while RLHF has been a major contributor to the success of the latest LLMs, it comes with its own set of challenges, several of which pBCI technology has the potential to address.

Through the Neuroadaptive Feedback Learning approach, it is possible to eliminate the need for active human participation, enable fine-grained and real-time feedback, capture nuanced information about evaluators' decision-making processes, and incorporate multiple dimensions of mental and affective states. By using pBCI-derived metrics as the reward function in RLHF, NFL enhances the alignment between AI behavior and human intentions, bringing us closer to bridging the gap between AI and human intelligence.


Other resources

Neuro Tech for Implicit Cursor Control (Publications)
Towards Passive Brain-Computer Interfaces (Publications)