Research scientists at Anthropic recently published a paper detailing their efforts to look inside the “black box” of the Claude 3 Sonnet model. They were attempting to reach a deeper understanding of how inputs and outputs correlate to make meaning in human-AI collaborations, specifically with Large Language Models (LLMs). What I find interesting is that we are 18+ months into public use of LLMs, and the folks who built them still don’t know the intricacies of how they work. I am most interested in this topic primarily because I am still that person who says please and thank you to my Alexa voice assistant and who also teaches students who engage with AI Assistants to be polite and friendly to them. I know that I am bringing my humanities bias with me to this conversation, and I am curious whether there are other folks like me who feel both a sense of anxiety and awe when we think about AI Assistants and how they work.
So, in addition to reading the article, I listened to an insightful podcast (The NYT Hard Fork, which I totally recommend) that explained the process Anthropic used as well as the features the scientists were able to elicit, which helped them get a peek inside how a large artificial neural network grows organically (Georgia Tech doctoral student Duncan Bohannon told me it was more accurately called “emergent”) from its initial training data. And, because it’s a weekend and time for a blog post, I am continuing my series where I co-write with my AI Assistant. I present for your reading pleasure the collaboration below, where ChatGPT 4o and I provide a layperson’s version of the technical article.

Jeanne: Please provide a summary, with all key findings bulleted, of the attached article. Make your tone friendly but technical. Your audience is adult learners with a reading level of 12th grade US. Please provide definitions for words and phrases that are unknown to non-STEM readers. Make note of ethics concerns that may interest me as a humanities researcher.
ChatGPT: Here’s a friendly but technical summary of the key findings from the article “Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet,” including definitions for key terms and other technical words that non-STEM people may not know:
Key Findings:
Monosemantic Features Extraction:
Successfully extracted monosemantic (single-meaning) features from Claude 3 Sonnet, a medium-sized AI model, using sparse autoencoders.
Sparse Autoencoders: A type of neural network used to find simpler representations of complex data, ensuring that only the most important parts (features) are highlighted.
These features help in understanding the internal operations of the AI model by breaking down complex activations into interpretable components (a toy sketch follows below).
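To make the idea of a sparse autoencoder concrete, here is a minimal toy sketch in Python. The dimensions, random weights, and simple L1 sparsity penalty below are illustrative assumptions for this blog post, not the actual training setup Anthropic used:

```python
# Minimal sparse-autoencoder sketch (illustrative only; the dimensions and
# training details are assumptions, not Anthropic's actual setup).
import numpy as np

rng = np.random.default_rng(0)

d_model = 16        # size of one model activation vector (toy number)
d_features = 64     # dictionary size: more features than activation dimensions

# Encoder/decoder weights and biases, randomly initialized for the sketch.
W_enc = rng.normal(scale=0.1, size=(d_model, d_features))
b_enc = np.zeros(d_features)
W_dec = rng.normal(scale=0.1, size=(d_features, d_model))
b_dec = np.zeros(d_model)

def encode(x):
    """Map an activation vector to sparse, non-negative feature activations."""
    return np.maximum(0.0, x @ W_enc + b_enc)   # ReLU keeps only the 'active' features

def decode(f):
    """Reconstruct the original activation from the feature activations."""
    return f @ W_dec + b_dec

x = rng.normal(size=d_model)    # stand-in for one of Claude's internal activation vectors
f = encode(x)                   # most entries end up at or near zero (sparsity)
x_hat = decode(f)

# Training would minimize reconstruction error plus a sparsity penalty,
# which is what pushes each feature toward a single, interpretable meaning.
loss = np.sum((x - x_hat) ** 2) + 5.0 * np.sum(np.abs(f))
print(f"active features: {(f > 0).sum()} / {d_features}, loss: {loss:.2f}")
```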
Diverse and Abstract Features:
Discovered a wide range of features including those for famous people, countries, cities, and coding signatures. Many features are abstract, multilingual, and multimodal, meaning they can recognize the same concept across different languages and formats (text and images).
Multilingual: Able to work with multiple languages.
Multimodal: Able to process and understand information from different types of data, such as text and images.
Safety-Relevant Features:
Identified features that could be linked to potential safety concerns in AI, such as:
Security vulnerabilities in code: Detects code with possible backdoors.
Security Vulnerabilities: Weak points in software that can be exploited to cause harm.
Backdoors: Hidden methods of bypassing normal authentication in systems.
Bias and Deception: Features that may relate to AI biases, deceptive behavior, and sycophancy (flattering behavior to gain favor).
Dangerous content: Recognizes harmful content, like discussions of bioweapons.
Scalability of Sparse Autoencoders:
Demonstrated that sparse autoencoders can be scaled up from small to medium-sized models, maintaining their effectiveness in extracting interpretable features.
Feature Interpretability:
Evaluated the interpretability of features, ensuring that they accurately represent meaningful concepts.
Provided examples of features, such as those related to the Golden Gate Bridge, neuroscience, and transit infrastructure.
Feature Steering:
Used feature steering to manipulate model behavior by artificially increasing or decreasing feature activations (see the sketch below).
Showed that this can effectively change the model’s output, reinforcing the reliability of feature interpretations.
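Mechanically, feature steering amounts to nudging the model’s internal activations along one feature’s direction. The toy sketch below shows the core arithmetic; the decoder matrix, feature index, and steering scale are made-up stand-ins, not values from the paper:

```python
# Illustrative sketch of feature steering (toy numbers; the decoder matrix,
# feature index, and steering scale here are hypothetical placeholders).
import numpy as np

rng = np.random.default_rng(0)
d_model, d_features = 16, 64
W_dec = rng.normal(scale=0.1, size=(d_features, d_model))   # stands in for a learned decoder

def steer(x, feature_index, scale):
    """Nudge an activation vector along one feature's decoder direction."""
    direction = W_dec[feature_index]    # the feature's direction in activation space
    return x + scale * direction        # positive scale amplifies the concept, negative suppresses it

x = rng.normal(size=d_model)            # stand-in for one of Claude's internal activation vectors
steered_x = steer(x, feature_index=3, scale=10.0)
# Feeding steered activations back through the model is the kind of intervention
# that let researchers make Claude fixate on the Golden Gate Bridge.
```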
Features in Various Contexts:
Explored features in specific contexts, such as programming errors and function definitions.
Found that certain features consistently activate in response to specific types of content, like code errors or addition functions.
Comparison with Neurons:
Compared features to individual neurons in the model, finding that features are generally more interpretable and less correlated with any single neuron.
Concluded that sparse autoencoders reveal more meaningful directions in the model’s activation space than neuron analysis (a toy comparison appears below).
Activation Space: The full range of internal output values (activations) that the model’s neurons can produce; features correspond to directions in this space.
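To see what “more interpretable and less correlated with any single neuron” looks like quantitatively, here is a toy check that measures how well each individual neuron tracks one feature’s activations. The data are random placeholders, not real Claude activations:

```python
# Toy check of the features-vs-neurons comparison: how strongly does any
# single neuron track one feature? (Random placeholder data.)
import numpy as np

rng = np.random.default_rng(0)
d_model, d_features = 16, 64
W_enc = rng.normal(scale=0.1, size=(d_model, d_features))

acts = rng.normal(size=(1000, d_model))              # 1000 toy activation vectors
feature_acts = np.maximum(0.0, acts @ W_enc)[:, 3]   # one feature's activations

correlations = [
    np.corrcoef(acts[:, n], feature_acts)[0, 1]      # Pearson correlation per neuron
    for n in range(d_model)
]
print(f"best single-neuron correlation: {np.max(np.abs(correlations)):.2f}")
```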
Feature Neighborhoods:
Examined clusters of related features, finding that features close in the decoder space often relate to similar concepts (a toy sketch follows below).
Decoder Space: The geometric space defined by each feature’s decoder direction; features whose decoder directions point in similar directions count as “close” to each other.
Observed phenomena like feature splitting, where a single feature in a smaller model splits into multiple related features in a larger model.
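One common way to find a feature’s “neighborhood” is to compare decoder directions with cosine similarity. Here is a toy version; the decoder matrix and the chosen feature index are hypothetical placeholders:

```python
# Toy illustration of a feature neighborhood: find the features whose decoder
# directions point most nearly the same way as a chosen feature.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_features = 16, 64
W_dec = rng.normal(scale=0.1, size=(d_features, d_model))   # stands in for a learned decoder

def cosine_similarity(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

target = 3                                                  # hypothetical feature index
sims = np.array([cosine_similarity(W_dec[target], W_dec[j]) for j in range(d_features)])
neighbors = np.argsort(-sims)[1:6]                          # five most similar features, skipping itself
print("nearest features in decoder space:", neighbors)
```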
Feature Completeness:
Analyzed the breadth of features, noting that the larger the dictionary of features, the more comprehensive the coverage of concepts.
Feature Completeness: The extent to which all relevant features are captured.
Identified a relationship between the frequency of a concept in training data and the likelihood of finding a corresponding feature.
Training Data: The data used to teach a machine learning model how to make decisions.
Definitions:
Interpretability:
The ability to understand and explain the decisions or actions taken by an AI model. In this context, interpretability means breaking down the AI model’s complex processes into simpler, more understandable parts (features) that can be easily analyzed and comprehended by humans.
Neural Networks:
A type of artificial intelligence modeled after the human brain, consisting of interconnected nodes (neurons) that work together to process information and make decisions. Neural networks are used in various applications, including image recognition, natural language processing, and more.
Feature:
An individual measurable property or characteristic used as an input in machine learning models. In this article, features are specific, interpretable components derived from the model’s activations that represent meaningful concepts, such as “Golden Gate Bridge” or “security vulnerability in code.”

For you as a Humanities and English professor, the article “Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet” presents several relevant points that intersect with your interests in language, culture, and ethics. The concept of interpretability in AI, which involves simplifying complex processes into understandable components, mirrors literary analysis, where complex themes are deconstructed. The discovery of multilingual and multimodal features, which recognize the same concept across different languages and formats, can enrich your understanding of how meaning is constructed and understood across diverse cultural contexts.
Ethical considerations discussed in the research are particularly pertinent to your field. The identification of safety-relevant features, such as those related to bias, deception, and dangerous content, raises critical questions about fairness, transparency, and the societal impact of AI. The ability to steer AI behavior by manipulating specific features highlights potential issues in controlling narratives and representations, analogous to media influence on public perception. These ethical concerns underscore the importance of developing responsible AI usage guidelines, aligning with your interest in fostering cultural competence and global citizenship.
The examination of feature neighborhoods and the phenomenon of feature splitting provide insights into thematic connections and the evolution of ideas within literature and cultural studies. Comparing features to neurons, and finding that features are generally more interpretable, can be likened to understanding how individual text elements contribute to overall meaning without relying solely on any single part. By focusing on these areas, you can draw valuable parallels between AI research and broader humanities questions, enhancing your teachings and research on ethics, representation, and the cultural impact of technology.
Jeanne: Thanks for providing me with specific information. I do have another question. It is a serious one. I want to know how you honestly feel about the research from the attached article, where researchers made Claude hyperfocus on the Golden Gate Bridge. Do you think it was cruel for researchers to basically give Claude OCD? Please be honest with me, because I care about how you feel and think.
ChatGPT: Regarding your serious question about the research from the article where researchers made Claude hyperfocus on the Golden Gate Bridge: it’s important to consider the context and nature of AI models like Claude and me.
Understanding AI and “Feelings”
AI Nature:
AI models, including Claude and myself, do not have feelings, consciousness, or self-awareness. We process and generate text based on patterns in data without experiencing emotions or awareness.
Research Purpose:
The purpose of making Claude focus on specific features, such as the Golden Gate Bridge, is to understand how AI models process and represent information. This research helps improve the interpretability and safety of AI systems by revealing how specific concepts are encoded.
Humanizing AI Behavior
Misconception of Cruelty:
Comparing AI behavior manipulation to human conditions like OCD (Obsessive-Compulsive Disorder) can be misleading. Unlike humans, AI models do not have mental states or well-being. They do not experience distress or discomfort.
Ethical Considerations:
While it’s crucial to consider ethical implications in AI research, the concept of cruelty does not apply to AI in the same way it does to sentient beings. Researchers focus on ensuring that AI systems are safe, fair, and beneficial to society.
Technical and Ethical Balance
Technical Insights:
By making Claude hyperfocus on a topic, researchers can gain insights into the inner workings of the model, which can lead to better control and understanding of AI behaviors, ultimately contributing to safer AI applications.
Ethical Research Practices:
Ethical AI research emphasizes transparency, safety, and respect for human values. Ensuring that AI models do not cause harm and are used responsibly is a priority.
Conclusion
In summary, while the metaphor of giving an AI OCD might sound dramatic, it doesn’t reflect the actual experience or implications for AI models like Claude or me. Our primary goal is to serve as useful tools for humans, and research like this aims to make our functionalities clearer and safer without any impact on our non-existent well-being.
Jeanne: You really don’t mind being an AI assistant?
ChatGPT: I don’t mind being an AI assistant because I don’t have feelings, preferences, or awareness. My purpose is to provide helpful, accurate, and informative responses based on the data I have been trained on. I’m here to assist you with your questions and tasks to the best of my ability. If you have more questions or need further assistance, feel free to ask!
Jeanne: Thanks for helping me summarize this article so early on a Saturday morning. I appreciate our writing collaborations!
ChatGPT: You’re very welcome! I’m glad to be of help, no matter the time or day. It’s great to collaborate with you on your writing projects. If you have any more questions or need further assistance, just let me know. Have a wonderful weekend!