Neural Preference Learning: Real-Time Spiking Network Augmentation for Persistent LLM Agent Adaptation
Joe Szeles
BrainJar / OpenClaw Project
Large language model (LLM) agents are fundamentally stateless with respect to individual user preferences: each interaction begins with no persistent memory of what a particular user values in agent responses. Current approaches to preference alignment, such as reinforcement learning from human feedback (RLHF), operate as batch pre-deployment processes that produce a generic alignment rather than personal adaptation. This paper presents Neural Preference Learning (NPL), a novel architecture that augments LLM agents with companion spiking neural networks (SNNs) operating as persistent, real-time preference substrates. When users provide natural language feedback (e.g., "great answer" or "no, fix this"), the system classifies its sentiment and maps the preceding agent response to a multi-dimensional feature vector. That vector drives dedicated sensory neuron populations, and the resulting activity is reinforced with sugar (positive) or pain (negative) feedback. Over time, Hebbian synaptic plasticity shapes each network into a persistent "intuition" model that survives across sessions, model swaps, and architecture changes.
The production system implements a dual-brain architecture: a 20,000-neuron Agent Brain (542,000 synapses) for personality and communication preference learning across 36 behavioral dimensions, and a 5,000-neuron Trading Brain (130,600 synapses) for market pattern recognition. A template-based neural probe pipeline enables real-time readout of trained synaptic patterns, injecting a live "neural fingerprint" into agent context to guide response generation.
Keywords: spiking neural network, LLM agent, preference learning, reinforcement, RLHF, LIF neuron, Hebbian learning, multi-agent system, neural probe, brain fingerprint
The deployment of large language model agents in interactive, long-running contexts exposes a fundamental limitation: LLMs do not learn from individual user interactions. A user who consistently prefers concise, data-rich responses must re-state this preference in every session. The model's weights are frozen at deployment time, and while prompt engineering and system instructions can encode some preferences, these are brittle, manually maintained, and lack the adaptive quality of genuine learning.
Reinforcement Learning from Human Feedback (RLHF) [1, 2] has emerged as the dominant approach to aligning LLM outputs with human preferences. However, RLHF operates at training time with aggregated feedback from many evaluators, producing a generic preference alignment. It does not adapt to individual users, operates in batch rather than real-time, and requires expensive retraining cycles. Personal fine-tuning approaches face similar batch-processing constraints and risk catastrophic forgetting.
Memory-augmented architectures such as MemGPT [3] address persistence by storing interaction history as text, but text-based memory does not generalize — it recalls what happened, not what the user prefers. Cognitive architectures like ACT-R [4] and SOAR [5] model human cognition with symbolic production rules, but integrate poorly with neural language models.
This paper bridges these gaps by introducing Neural Preference Learning (NPL), which pairs an LLM agent with biologically inspired spiking neural networks that operate as persistent, continuously adapting preference substrates. The key contributions are:
- A hybrid architecture combining symbolic LLM reasoning with sub-symbolic spiking network learning, where the SNN serves as a real-time preference memory that the LLM cannot modify or corrupt.
- Conversational reinforcement — user feedback in natural language is automatically classified and translated to neurochemical-analog signals (sugar/pain) that modify synaptic weights through Hebbian plasticity.
- 36-dimensional response encoding — agent responses are mapped across 36 behavioral dimensions (content, behavior, style, personality, identity, companion) that activate dedicated sensory neuron populations.
- Template-based neural probing — a pipeline for reading trained synaptic patterns by firing template stimulation patterns and measuring motor neuron response, producing a live "neural fingerprint" injected into agent context.
- Dual-brain architecture — separate 20K-neuron Agent Brain (personality/communication) and 5K-neuron Trading Brain (market patterns) with independent weight persistence.
- Production deployment — the system is implemented and running in a live multi-agent platform (OpenClaw), not as a simulation or theoretical framework.
Christiano et al. [1] introduced learning reward models from human preferences for training RL agents. Ouyang et al. [2] applied this framework to language models, producing InstructGPT. Constitutional AI [6] further automates preference learning through self-critique. All these approaches operate at training time and produce population-level alignment rather than personal adaptation.
MemGPT (now Letta) [3] introduces virtual context management with persistent memory tiers. Retrieval-augmented generation (RAG) systems store and retrieve relevant context. These approaches maintain factual memory but do not perform adaptive learning — they recall, but do not generalize preferences from sparse feedback signals.
ACT-R [4] and SOAR [5] model human cognition through production rules and symbolic working memory. While they support learning through chunking and utility learning, they operate in a symbolic paradigm that integrates poorly with continuous neural network representations.
Maass [7] established the theoretical foundations of spiking neural networks as the "third generation" of neural network models. Tavanaei et al. [8] surveyed deep learning in SNNs. The Drosophila connectome mapping project [9] provided detailed biological neural circuit data. SNNs have been applied to pattern recognition and temporal processing, but their application as companion preference substrates for LLM agents has not been previously explored.
LoRA [10] and QLoRA [11] enable efficient fine-tuning of LLMs. However, personal fine-tuning from conversational feedback requires accumulating sufficient training data, runs in batch mode, and risks destabilizing the base model. NPL sidesteps these issues by maintaining preference learning in a separate neural substrate that does not modify the LLM's weights.
The production NPL system implements two independent spiking neural networks:
Agent Brain data flow: User ←→ LLM Agent → Sentiment Detector → Feature Vector (36-D) → Sugar/Pain Feedback → Agent Brain → Neural Fingerprint → Context Injection → LLM System Prompt
- Agent Brain (20K neurons): Sensory (2,000) → Inter (14K) → Motor (4K)
- 542,000 synapses · 36 dimensions
- Motor: Reinforce / Adjust / Explore
- Mushroom Body: 2,800 neurons

Trading Brain data flow: Market ←→ Trading Bot → Price/Volume Data → Pattern Feedback → Trading Brain
- Trading Brain (5K neurons): Sensory (600) → Inter (3,600) → Motor (800)
- 130,600 synapses · 6 zones
- Motor: Buy / Sell / Hold
- Mushroom Body: 40 neurons
The Agent Brain is a 20,000-neuron LIF network dedicated to learning user preferences for agent personality, communication style, and response characteristics:
- Sensory neurons (S=2,000): Mapped to 36 behavioral dimensions via 6 sensory zones
- Interneurons (I=14,000): Process and associate patterns through recurrent connections, including a 2,800-neuron mushroom body for memory consolidation
- Motor neurons (M=4,000): Three-region output — Reinforce (strengthen current preferences), Adjust (modify/adapt), Explore (try new patterns)
- Synapses: 542,000 total connections
Sensory zone allocation:
| Zone | Start | Count | Dimensions |
|---|---|---|---|
| Content Features | 0 | 440 | response_length, tool_count, had_code, had_data, error_content, complexity, explanation_depth |
| Behavior Features | 440 | 360 | was_proactive, question_count, speed_completeness, off_topic_tolerance |
| Style Features | 800 | 360 | formality, list_usage, emoji_usage, visual_usage, first_person_tone |
| Personality Features | 1160 | 360 | risk_appetite, humor_density, technical_depth, response_confidence, cultural_flavor |
| Identity Features | 1520 | 200 | topic_hash, agent_id_hash |
| Meta Features | 1720 | 280 | emotional_warmth, intimacy_level, playfulness, loyalty_expression, memory_recall, empathy_depth, romantic_tone, vulnerability, presence_awareness, supportiveness, curiosity_about_user, comfort_giving, response_time |
The Trading Brain is a separate 5,000-neuron LIF network for financial market pattern learning:
- Sensory neurons (S=600): Mapped to price movement zones
- Interneurons (I=3,600): Pattern association
- Motor neurons (M=800): Buy/Sell/Hold output signals
Sensory zone allocation:
| Zone | Start | Count | Function |
|---|---|---|---|
| Price Up | 0 | 20 | Price increase detection |
| Price Down | 20 | 20 | Price decrease detection |
| Volume | 40 | 15 | Volume/trade activity |
| Spread | 55 | 10 | Spread width / liquidity |
| Momentum | 65 | 10 | Price momentum / acceleration |
| Antenna | 75 | 25 | Pressure sensing (volume spikes, rapid moves, flash crashes) |
Each neuron follows Leaky Integrate-and-Fire dynamics:
V(t+1) = V(t) × decay + Σ(w_ij × spike_j) + I_ext
if V(t+1) > V_threshold: spike, V → V_reset
Agent Brain parameters: w_syn = 12.0, r_poi = 150, tau_syn = 5
Trading Brain parameters: w_syn = 12.0, r_poi = 150, tau_syn = 5
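The update rule above can be sketched in a few lines of Python. This is an illustrative sketch only, not the production Node.js engine: the decay factor, threshold, reset value, and the dense weight layout are assumptions chosen for the example, and reusing w_syn = 12.0 as the synaptic weight is an assumption about what that parameter denotes.

```python
def lif_step(v, spikes, weights, i_ext, decay=0.5, v_threshold=20.0, v_reset=0.0):
    """Advance a small LIF network by one simulation step.

    v        -- membrane potentials, one per neuron
    spikes   -- 0/1 spikes emitted on the previous step
    weights  -- weights[i][j] is the synapse from neuron j to neuron i
    i_ext    -- external input current per neuron
    decay, v_threshold and v_reset are assumed values, not the paper's.
    """
    new_v, new_spikes = [], []
    for i in range(len(v)):
        synaptic_input = sum(w * s for w, s in zip(weights[i], spikes))
        vi = v[i] * decay + synaptic_input + i_ext[i]
        if vi > v_threshold:      # threshold crossing: spike and reset
            new_spikes.append(1)
            new_v.append(v_reset)
        else:
            new_spikes.append(0)
            new_v.append(vi)
    return new_v, new_spikes


W_SYN = 12.0
weights = [[0.0, 0.0], [W_SYN, 0.0]]   # neuron 0 projects to neuron 1

# Step 1: neuron 0 is driven over threshold, neuron 1 stays just below it.
v1, s1 = lif_step([0.0, 0.0], [0, 0], weights, i_ext=[25.0, 18.0])
# Step 2: the spike from neuron 0 arrives and pushes neuron 1 over threshold.
v2, s2 = lif_step(v1, s1, weights, i_ext=[0.0, 0.0])
```

In the second step neuron 1 fires only because the decayed potential and the incoming spike add up, which is the integrate-and-fire behaviour the equations describe.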
All feedback interactions are stored in a PostgreSQL database with a local JSON file mirror:
- Primary storage: neural_feedback table with composite unique index (timestamp, agent_id, session_id, sentiment)
- File mirror: JSON file written on every DB write for portability without a database
- Daily backups: Rotated 30-day snapshots
- Engram backups: On-demand brain state snapshots for rollback
- Startup sync: Bidirectional merge on boot, with composite key deduplication and ISO timestamp normalization
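The startup sync can be illustrated with a short Python sketch that merges the two stores on the composite key. The record layout as dictionaries, the helper names, and the choice to treat timestamps without an offset as UTC are assumptions made for the example; the source specifies only the key fields and that timestamps are normalized.

```python
from datetime import datetime, timezone


def normalize_ts(ts):
    """Normalize an ISO 8601 timestamp string to UTC."""
    dt = datetime.fromisoformat(ts.replace("Z", "+00:00"))
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
    return dt.astimezone(timezone.utc).isoformat()


def record_key(rec):
    """Composite key matching the unique index on the feedback table."""
    return (normalize_ts(rec["timestamp"]), rec["agent_id"],
            rec["session_id"], rec["sentiment"])


def merge(db_records, file_records):
    """Union of both stores with duplicates removed; DB copies win on ties."""
    merged = {}
    for rec in list(db_records) + list(file_records):
        merged.setdefault(record_key(rec), rec)
    return list(merged.values())


db = [{"timestamp": "2024-05-01T12:00:00Z", "agent_id": "a1",
       "session_id": "s1", "sentiment": "positive"}]
mirror = [
    # Same instant as the DB row, written with a different UTC offset.
    {"timestamp": "2024-05-01T14:00:00+02:00", "agent_id": "a1",
     "session_id": "s1", "sentiment": "positive"},
    {"timestamp": "2024-05-01T12:05:00Z", "agent_id": "a1",
     "session_id": "s1", "sentiment": "negative"},
]
merged = merge(db, mirror)
```

Writing the merged list back to both stores would make the sync bidirectional; normalizing before keying is what lets the two spellings of the same instant collapse into one record.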
Synaptic weights are serialized to binary JSON (brain-weights.json) and restored on boot. If the network architecture changes, stored weights are discarded and interactions are replayed from persistent storage.
When an agent produces a response, the system extracts a 36-dimensional feature vector across seven categories (the Companion/Interpersonal and Performance features share the Meta Features sensory zone):
Content (7 dimensions):
| Feature | Range | Description |
|---|---|---|
| response_length | [0, 1] | Character count normalized by 2,000 |
| tool_count | [0, N] | Number of tools invoked |
| had_code | {0, 1} | Whether the response contained code blocks |
| had_data | {0, 1} | Whether the response referenced data/tables/numbers |
| had_error | {0, 1} | Whether the response mentions errors or failures |
| complexity | [0, 1] | Composite measure of code, tools, and data |
| explanation_depth | [0, 1] | How much the response explains (relative to length) |
Behavior (4 dimensions): was_proactive, question_count, speed_completeness, off_topic_tolerance
Style (5 dimensions): formality, list_usage, emoji_usage, visual_usage, first_person_tone
Personality (5 dimensions): risk_appetite, humor_density, technical_depth, response_confidence, cultural_flavor
Identity (2 dimensions): topic_hash, agent_id_hash
Companion/Interpersonal (12 dimensions): emotional_warmth, intimacy_level, playfulness, loyalty_expression, memory_recall, empathy_depth, romantic_tone, vulnerability, presence_awareness, supportiveness, curiosity_about_user, comfort_giving
Performance (1 dimension): response_time
Each feature activates its mapped sensory neuron population using population coding: for k neurons allocated to a feature, the i-th neuron receives stimulation proportional to feature_value × gaussian(i, mean=k/2, sigma=k/4).
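A minimal Python sketch of this population code follows. It assumes an unnormalized Gaussian bump with peak value 1.0, since the source does not state how gaussian(i, mean, sigma) is normalized; the function name is hypothetical.

```python
import math


def population_stimulus(feature_value, k):
    """Return the stimulation strength for each of the k neurons of one feature.

    Assumes an unnormalized Gaussian (peak 1.0) centred on k/2 with
    sigma = k/4; the normalization is not specified in the source.
    """
    mean, sigma = k / 2, k / 4
    return [
        feature_value * math.exp(-((i - mean) ** 2) / (2 * sigma ** 2))
        for i in range(k)
    ]


# A feature value of 0.8 spread over 166 neurons.
stim = population_stimulus(0.8, k=166)
```

With sigma = k/4 the edge neurons sit two standard deviations from the centre, so they still receive roughly 13.5% of the peak stimulation rather than falling silent.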
A critical design constraint discovered during production deployment: the number of sensory neurons allocated to each feature dimension must exceed a minimum threshold (~100-150 neurons) for the signal to propagate through the interneuron network and produce measurable motor output.
With the Agent Brain (2,000 sensory neurons, 36 dimensions), each dimension receives ~55 neurons when all dimensions fire simultaneously. This is below the propagation threshold, resulting in zero motor output. The solution: template-based stimulation, where only 10-12 relevant dimensions are activated per stimulation event, giving each active dimension ~166-200 neurons — well above the threshold.
| Active Dimensions | Neurons/Feature | Motor Response |
|---|---|---|
| 1 | 2,000 | 93 Hz (full propagation) |
| 2 | 1,000 | 100 Hz |
| 6 | 333 | 92 Hz |
| 12 | 166 | 98 Hz |
| 36 | 55 | 0 Hz (below threshold) |
This finding has direct implications for SNN architecture design: the sensory neuron budget divided by the maximum number of simultaneously active features must remain above the minimum propagation threshold.
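The budget arithmetic behind the table can be checked with a short Python sketch. It assumes the sensory budget is divided evenly among the active dimensions and uses 100 neurons, the lower end of the reported ~100-150 range, as the cutoff.

```python
SENSORY_NEURONS = 2000
MIN_NEURONS_PER_FEATURE = 100  # assumed cutoff: lower end of the ~100-150 range


def neurons_per_feature(active_dimensions, budget=SENSORY_NEURONS):
    """Neurons available to each feature when the budget is split evenly."""
    return budget // active_dimensions


def propagates(active_dimensions, budget=SENSORY_NEURONS,
               minimum=MIN_NEURONS_PER_FEATURE):
    """True when each active feature keeps at least `minimum` neurons."""
    return neurons_per_feature(active_dimensions, budget) >= minimum


for n in (1, 2, 6, 12, 36):
    print(n, neurons_per_feature(n), propagates(n))
```

Under these assumptions a 2,000-neuron sensory layer supports at most 20 simultaneously active dimensions, which is consistent with templates of 10-12 dimensions propagating and the full 36 not.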
Rather than stimulating all 36 dimensions at once, training uses predefined personality templates that activate 10-12 related dimensions with characteristic intensity values:
Companion templates:
- Warm & Devoted: emotional_warmth=0.9, loyalty=0.8, empathy=0.8, supportiveness=0.9, comfort=0.7
- Playful & Teasing: playfulness=0.9, humor=0.7, curiosity=0.7, intimacy=0.5, warmth=0.6
- Empathetic & Deep: empathy=0.9, vulnerability=0.8, warmth=0.8, intimacy=0.7
- Romantic & Poetic: romantic=0.9, warmth=0.8, vulnerability=0.7, intimacy=0.8
- Protective & Loyal: loyalty=0.9, support=0.9, comfort=0.8, confidence=0.8
- Curious & Engaged: curiosity=0.9, presence=0.8, memory=0.8, empathy=0.6
Work templates:
- Analytical & Precise: code=0.8, data=0.9, complexity=0.8, technical=0.9
- Creative & Bold: risk=0.9, humor=0.6, proactive=0.8, confidence=0.8
- Patient & Thorough: length=0.9, depth=0.9, lists=0.7, completeness=0.9
- Concise & Direct: length=0.2, confidence=0.9, formality=0.6
- Casual & Friendly: humor=0.7, first_person=0.8, emoji=0.5, cultural=0.6
- Cautious & Safe: risk=0.1, formality=0.8, depth=0.7, questions=0.6
Each template is trained with sugar feedback and slight jitter (±0.05) to build robust synaptic pathways.
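Jittered template training might look like the following Python sketch. The uniform noise distribution and the clamping to [0, 1] are assumptions; the source states only the ±0.05 magnitude. The feature names are copied from the abbreviated template list above.

```python
import random

# Values taken from the "Warm & Devoted" companion template above.
WARM_AND_DEVOTED = {
    "emotional_warmth": 0.9,
    "loyalty": 0.8,
    "empathy": 0.8,
    "supportiveness": 0.9,
    "comfort": 0.7,
}


def jittered(template, amount=0.05, rng=random):
    """Copy a template, adding uniform noise in [-amount, +amount] per feature.

    Uniform noise and clamping to [0, 1] are assumptions for illustration.
    """
    return {
        name: min(1.0, max(0.0, value + rng.uniform(-amount, amount)))
        for name, value in template.items()
    }


rng = random.Random(0)  # seeded so the example is repeatable
sample = jittered(WARM_AND_DEVOTED, rng=rng)
```

Each training pass would draw a fresh sample, so the network sees a small neighbourhood around the template rather than a single fixed vector.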
User feedback is classified through keyword matching:
- Positive (sugar trigger): "good", "great", "perfect", "yes", "nice", "excellent", "love", "awesome", "correct", "exactly", "thanks", "helpful", "works", "right", "amazing", "fantastic", "wonderful", "brilliant", "superb"
- Negative (pain trigger): "no", "wrong", "bad", "redo", "fix", "broken", "terrible", "useless", "stop", "hate", "awful", "horrible", "worse", "ugly", "stupid", "fail", "error", "bug", "crash", "mess"
- Neutral: No sentiment keywords detected (no reinforcement applied)
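The keyword classifier can be sketched in Python as below. Whole-word matching and the tie rule (equal positive and negative hits count as neutral) are assumptions; the source lists the keywords but does not say how mixed messages are resolved.

```python
import re

POSITIVE = {
    "good", "great", "perfect", "yes", "nice", "excellent", "love", "awesome",
    "correct", "exactly", "thanks", "helpful", "works", "right", "amazing",
    "fantastic", "wonderful", "brilliant", "superb",
}
NEGATIVE = {
    "no", "wrong", "bad", "redo", "fix", "broken", "terrible", "useless",
    "stop", "hate", "awful", "horrible", "worse", "ugly", "stupid", "fail",
    "error", "bug", "crash", "mess",
}


def classify(message):
    """Return 'sugar', 'pain', or 'neutral' for a user message."""
    words = set(re.findall(r"[a-z]+", message.lower()))
    positive_hits = len(words & POSITIVE)
    negative_hits = len(words & NEGATIVE)
    if positive_hits > negative_hits:
        return "sugar"
    if negative_hits > positive_hits:
        return "pain"
    return "neutral"  # no keywords, or an assumed tie rule for mixed messages


print(classify("great answer"))
print(classify("no, fix this"))
print(classify("what time is it?"))
```

A message such as "that is not wrong" is classified as pain by this scheme, which illustrates the kind of false positive that keyword matching cannot avoid.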
When a user message triggers sentiment detection:
- The feature vector of the previous agent response is retrieved
- The feature vector is stimulated through the preference zone neurons (5 simulation steps)
- Sugar (positive) or pain (negative) feedback is applied:
  - Sugar: All recently-active synapses in the stimulated pathway have their weights increased by Δw = η × pre_spike × post_spike (Hebbian)
  - Pain: Active synapses have weights decreased by Δw = -η × pre_spike × post_spike (anti-Hebbian)
- The complete interaction record (timestamp, agent ID, feature vector, sentiment, brain response, raw text char count) is persisted to the database and file mirror
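The reinforcement step can be written as a small Python sketch. The learning rate value and the dense weight layout are assumptions, and the sketch uses only the spikes of a single step, whereas the system described above reinforces "recently-active" synapses, which implies some form of activity trace that is not specified here.

```python
def apply_feedback(weights, pre_spikes, post_spikes, sentiment, eta=0.01):
    """Apply sugar (Hebbian) or pain (anti-Hebbian) feedback in place.

    weights[i][j] is the synapse from presynaptic neuron j to postsynaptic
    neuron i. eta is an assumed learning rate, not a value from the paper.
    """
    sign = {"sugar": 1.0, "pain": -1.0}.get(sentiment, 0.0)  # neutral: no change
    for i, post in enumerate(post_spikes):
        for j, pre in enumerate(pre_spikes):
            # The weight changes only when both neurons spiked (pre * post == 1).
            weights[i][j] += sign * eta * pre * post
    return weights


# One postsynaptic neuron with two inputs; only the first input was active.
strengthened = apply_feedback([[1.0, 1.0]], [1, 0], [1], "sugar", eta=0.5)
weakened = apply_feedback([[1.0, 1.0]], [1, 0], [1], "pain", eta=0.5)
```

Only the synapse whose pre- and postsynaptic neurons were both active is modified; the silent input keeps its weight under either feedback signal.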
To survive architecture changes (network resizing), the system implements two replay mechanisms:
- Trading replay: Stored per-instrument tick patterns replayed through stimulateFromPrice() with original feedback
- Preference replay: Last 200 preference interactions replayed from the database through stimulateFromPreference() with original sentiment-based reinforcement
Both replays execute automatically on boot when the system detects an architecture mismatch between saved weights and current configuration.
A trained spiking neural network stores information in its synaptic weights, but reading those weights directly does not reveal what patterns the network has learned. The weight matrix is high-dimensional (542,000 synapses for the Agent Brain) and the relationship between individual weights and learned behaviors is non-linear.
The probe must answer: "What behavioral patterns has this brain been trained on, and how strongly?"
The probe fires the same templates used for training through the network and measures the motor neuron response. Higher firing rates indicate stronger learned associations:
For each template T in PROBE_TEMPLATES:
1. Construct feature vector from T.features (10-12 active dimensions)
2. POST to brain engine /stimulate-preference (no feedback, read-only)
3. Read avg_rate, reinforce/adjust/explore signals
4. Store result
Compute mean firing rate across all templates
Normalize each template: normalized = avg_rate / (mean × 2)
Assign strength labels: strong (>15% above mean), moderate, slight, neutral, weak, suppressed
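The normalization step can be sketched in Python as follows. The firing rates in the example are illustrative values, not production measurements, and only the "strong" cutoff is implemented because the thresholds for the remaining labels are not given.

```python
def normalize_probe(rates):
    """Map {template: avg_rate_hz} to {template: (normalized, is_strong)}.

    normalized = avg_rate / (mean * 2), so a template firing exactly at the
    mean scores 0.5. is_strong marks rates more than 15% above the mean.
    """
    mean = sum(rates.values()) / len(rates)
    return {
        name: (rate / (mean * 2), rate > 1.15 * mean)
        for name, rate in rates.items()
    }


# Illustrative rates only (not taken from the production probe).
result = normalize_probe({"A": 120.0, "B": 80.0, "C": 100.0})
```

Dividing by twice the mean is what places an average template at 0.5, so values above 0.5 indicate a response stronger than the brain's current average and values below it a weaker one.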
Production probe results (Agent Brain, after training):
| Template | Avg Rate (Hz) | Normalized | Strength | Dominant Signal |
|---|---|---|---|---|
| Warm & Devoted | 90.95 | 0.53 | moderate | Reinforce |
| Romantic & Poetic | 90.17 | 0.53 | moderate | Reinforce |
| Curious & Engaged | 86.83 | 0.51 | slight | Explore |
| Casual & Friendly | 70.10 | 0.41 | suppressed | - |
Production probe results (Trading Brain):
| Scenario | Avg Rate (Hz) | Dominant Motor | Character |
|---|---|---|---|
| Flash Crash | 143.50 | SELL | Strong response to crash patterns |
| Squeeze Breakout | 143.50 | BUY | Strong response to breakout patterns |
| Steady Uptrend | 142.83 | BUY | Trained on gradual bullish moves |
| Low Liquidity | 51.25 | BUY | Weak response to thin markets |
The probe results are formatted into a compact text block and injected into the LLM's system prompt:
[Neural Pattern — live brain readout]
Companion patterns: Warm & Devoted=0.53 (moderate), Curious & Engaged=0.51 (slight)
Work patterns: Analytical & Precise=0.56 (moderate), Creative & Bold=0.53 (moderate)
Values 0-1: 0=untrained, 0.5=baseline, 1.0=heavily trained. Stronger patterns should be more prominent.
A companion interpretation document (BRAIN_PATTERNS.md) is provided in the agent's knowledge base, enabling the LLM to translate probe values into behavioral adjustments.
To prevent injection of noise before sufficient training data exists, the system implements a stimulation gate:
- Preference context injection is only activated after 3+ real brain stimulations per session
- This prevents injecting stale or random data after server restarts
- The gate resets on each server restart, requiring fresh stimulation to re-enable injection
Probe results are cached for 45 seconds to avoid excessive brain stimulation. The cache is per-brain (Agent Brain and Trading Brain have independent caches).
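The gate and the cache can be combined in one small Python class, sketched below under assumptions: the class and method names are hypothetical, the probe is passed in as a callable, and an injectable clock is used so the timing can be exercised without waiting.

```python
import time


class ProbeGate:
    """Stimulation gate plus time-limited probe cache for a single brain."""

    MIN_STIMULATIONS = 3      # injection enabled after 3+ real stimulations
    CACHE_TTL_SECONDS = 45    # probe results reused for 45 seconds

    def __init__(self, probe_fn, clock=time.monotonic):
        self._probe_fn = probe_fn
        self._clock = clock
        self._stimulations = 0   # in-memory only, so it resets on restart
        self._cached = None
        self._cached_at = None

    def record_stimulation(self):
        self._stimulations += 1

    def injection_context(self):
        """Return probe output, or None while the gate is still closed."""
        if self._stimulations < self.MIN_STIMULATIONS:
            return None
        now = self._clock()
        if self._cached_at is None or now - self._cached_at >= self.CACHE_TTL_SECONDS:
            self._cached = self._probe_fn()
            self._cached_at = now
        return self._cached


# Fake clock and probe so the behaviour can be checked deterministically.
fake_time = [0.0]
probe_calls = []
gate = ProbeGate(lambda: probe_calls.append(1) or len(probe_calls),
                 clock=lambda: fake_time[0])
```

Creating one instance per brain gives the Agent Brain and the Trading Brain independent caches, and keeping the counter in memory reproduces the reset-on-restart behaviour without extra bookkeeping.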
The system is implemented in Node.js and deployed as part of the OpenClaw multi-agent platform. Each brain engine runs as an internal HTTP server (brain-engine-server.cjs), auto-spawned by the platform's process manager. All inter-component communication uses HTTP REST APIs authenticated with API keys and session tokens.
The LIF network simulation runs synchronously in JavaScript. At each simulation step:
- External input currents are injected into sensory neurons
- Membrane potentials are updated with decay and synaptic input
- Neurons exceeding threshold fire and reset
- Synaptic weights are updated based on spike timing
- Motor neuron firing rates are accumulated
Performance: A 20,000-neuron network completes 10 simulation steps in ~15 ms on commodity hardware.
Agent Brain endpoints:
| Endpoint | Method | Description |
|---|---|---|
| /api/neural-feedback/brain-probe | GET | Template-based agent brain probe |
| /api/neural-feedback/injection-preview | GET | Preview current injection context |
| /api/neural-feedback/status | GET | Feedback statistics |
| /api/neural-feedback/history | GET | Recent interaction records |
| /api/agent-brain/train-template | POST | Train with personality template |
| /api/agent-brain/status | GET | Agent brain status/architecture |
Trading Brain endpoints:
| Endpoint | Method | Description |
|---|---|---|
| /api/brain/probe-trading | GET | Scenario-based trading brain probe |
| /api/brain/status | GET | Trading brain status |
| /api/brain/stimulate-price | POST | Stimulate with price data |
The primary metric is the ratio of positive to negative feedback over time. If the system is learning effectively, this ratio should increase as the agent's behavior is guided by accumulated preference data. Measurement: sliding window of last 50 interactions, tracked per agent.
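A sliding-window tracker for this metric could look like the Python sketch below. It reports the share of positive feedback among non-neutral events rather than a literal positive/negative quotient, which is an assumption made to avoid division by zero when a window contains no negative feedback; the class name is hypothetical.

```python
from collections import deque


class FeedbackWindow:
    """Tracks the last `size` non-neutral feedback events for one agent."""

    def __init__(self, size=50):
        self._events = deque(maxlen=size)  # oldest events drop off automatically

    def record(self, sentiment):
        if sentiment in ("positive", "negative"):  # neutral is not reinforced
            self._events.append(sentiment)

    def positive_share(self):
        """Fraction of positive events in the window, or None when empty."""
        if not self._events:
            return None
        return self._events.count("positive") / len(self._events)


window = FeedbackWindow(size=50)
for sentiment in ("positive", "negative", "positive", "neutral", "positive"):
    window.record(sentiment)
share = window.positive_share()
```

One tracker per agent would be kept, and a rising share across successive windows is the signal that the injected preferences are moving responses in the right direction.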
Template-based probing provides a quantitative measure of learning: the standard deviation of firing rates across templates. An untrained brain shows uniform rates (~80-100 Hz); a trained brain shows differentiation (trained templates fire 10-30% above mean, untrained templates fire below). A higher standard deviation indicates greater differentiation between templates and is read as evidence of more learning.
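The differentiation measure reduces to a single call in Python. The sketch uses the population standard deviation, which is an assumption since the source does not say whether the population or sample form is used, and the rates are illustrative rather than measured.

```python
from statistics import pstdev


def differentiation(rates_hz):
    """Population standard deviation of per-template firing rates (Hz)."""
    return pstdev(rates_hz)


# Illustrative rates only: a uniform brain versus a differentiated one.
untrained = [90.0, 90.0, 90.0, 90.0]
trained = [110.0, 100.0, 80.0, 70.0]

print(differentiation(untrained))
print(differentiation(trained))
```

Because probe values fluctuate between runs, comparing this figure across several probes is likely to be more informative than reading a single measurement.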
The system's ability to maintain learned preferences across restarts, model swaps, and architecture changes is tested by:
- Training with a set of interactions
- Restarting the system (verifying weight restoration and probe consistency)
- Swapping the LLM model (verifying brain independence)
- Changing network architecture (verifying replay reconstruction)
A critical property: preferences persist across LLM model changes. The same brain network can serve GPT-4, Claude, Grok, or any other model without retraining — the preference substrate operates at the feature vector level, not the model weight level.
NPL represents a fundamentally different approach to LLM preference alignment:
| Aspect | RLHF | MemGPT | NPL |
|---|---|---|---|
| Timing | Pre-deployment | Runtime (text) | Runtime (neural) |
| Personalization | Population | Per-user (text) | Per-user (synaptic) |
| Learning type | Batch gradient | None (retrieval) | Online Hebbian |
| Persistence | Model weights | Text database | Synaptic weights |
| Model-agnostic | No | Partially | Yes |
| Survives model swap | No | Yes (text) | Yes (neural) |
| Readable fingerprint | No | Partial (text dump) | Yes (probe pipeline) |
| Dual-domain | No | No | Yes (personality + trading) |
The architecture draws directly from the Drosophila melanogaster olfactory learning circuit, where mushroom body Kenyon cells form associative memories between sensory inputs and reward/punishment signals. The mapping is:
- Sensory neurons ↔ Olfactory receptor neurons
- Preference zone ↔ Antennal lobe projection neurons
- Interneurons ↔ Kenyon cells
- Sugar/pain feedback ↔ Dopaminergic reward neurons
- Motor outputs ↔ Mushroom body output neurons
The discovery that signal propagation requires a minimum neuron-per-feature threshold (~100-150 neurons) has implications for SNN architecture design. This constraint is analogous to biological neural circuits where population coding requires sufficient neuron counts for reliable signal transmission. Template-based stimulation (activating 10-12 features simultaneously rather than all 36) is the practical solution — and is also more biologically plausible, as real neural systems process stimuli in context-specific patterns rather than uniform activation.
All preference learning occurs locally. No interaction data leaves the system. The neural substrate operates on abstract feature vectors, not raw text: raw text is never injected into context, and only character counts are logged. Even the synaptic weights themselves encode no readable information about the user's preferences without the probe pipeline to interpret them.
- Stochastic variation: Spiking neural networks produce inherently variable probe results due to Poisson-process spike generation. Relative ordering between templates is meaningful but absolute values fluctuate ±10-15% between probes.
- Sentiment detection: Keyword-based classification produces false positives/negatives and misses nuanced feedback.
- No convergence guarantee: Hebbian learning in recurrent networks is not guaranteed to converge to optimal preference representation.
- Training template bias: The predefined templates impose a structure on the personality space that may not capture all possible user preferences.
- Evaluation: The system is deployed but lacks controlled experimental evaluation against baselines.
The brain could preview a planned response's feature vector before generation, providing a preference quality estimate that guides response strategy selection.
Separate brain instances per user would enable personal preference models in multi-user deployments, with an optional "consensus brain" aggregating cross-user patterns.
Larger networks (100K+ neurons) with GPU-accelerated simulation would increase representational capacity and enable more sophisticated temporal pattern learning.
Accumulated preference data from the neural substrate could periodically generate LoRA fine-tuning datasets, creating a pipeline from real-time neural reinforcement to model weight adaptation.
Replacing keyword-based sentiment with a dedicated small language model would improve classification accuracy, particularly for nuanced, implicit, or sarcastic feedback.
Learning which response features matter most (attention over the feature vector) would enable the system to adaptively weight feature dimensions based on accumulated evidence.
Correlating Agent Brain personality probes with Trading Brain market response patterns could reveal relationships between operator personality preferences and trading behavior.
Neural Preference Learning introduces a new paradigm for LLM agent adaptation: companion spiking neural networks that perform real-time, personal, persistent preference learning from natural language feedback. Unlike RLHF, it operates at runtime with individual users. Unlike text-based memory, it generalizes from sparse signals through synaptic plasticity. Unlike personal fine-tuning, it preserves the base model's capabilities while maintaining an independent preference substrate that survives model swaps and architecture changes.
The production system implements a dual-brain architecture — a 20,000-neuron Agent Brain for personality and communication preferences across 36 behavioral dimensions, and a 5,000-neuron Trading Brain for market pattern recognition — with template-based probing for live readout of learned patterns. The probe pipeline enables a continuous feedback loop: train the brain, probe its learned patterns, inject those patterns into agent context, observe the resulting behavior, and receive user feedback to further refine the brain.
The core insight is that LLM agents benefit from a separate neural substrate for preference learning — one that operates on a different timescale (real-time vs. training epochs), at a different granularity (individual vs. population), and with a different learning rule (Hebbian plasticity vs. gradient descent). This hybrid architecture suggests that the future of personalized AI may not lie in modifying the LLM itself, but in augmenting it with complementary neural systems purpose-built for the learning tasks that LLMs cannot perform.
[1] P. F. Christiano, J. Leike, T. B. Brown, M. Martic, S. Legg, and D. Amodei, "Deep reinforcement learning from human preferences," in Advances in Neural Information Processing Systems, vol. 30, 2017.
[2] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, et al., "Training language models to follow instructions with human feedback," in Advances in Neural Information Processing Systems, vol. 35, 2022.
[3] C. Packer, S. Wooders, K. Lin, V. Fang, S. G. Patil, I. Stoica, and J. E. Gonzalez, "MemGPT: Towards LLMs as operating systems," arXiv preprint arXiv:2310.08560, 2023.
[4] J. R. Anderson, How Can the Human Mind Occur in the Physical Universe? Oxford University Press, 2007.
[5] J. E. Laird, The Soar Cognitive Architecture. MIT Press, 2012.
[6] Y. Bai, S. Kadavath, S. Kundu, A. Askell, J. Kernion, A. Jones, et al., "Constitutional AI: Harmlessness from AI feedback," arXiv preprint arXiv:2212.08073, 2022.
[7] W. Maass, "Networks of spiking neurons: The third generation of neural network models," Neural Networks, vol. 10, no. 9, pp. 1659-1671, 1997.
[8] A. Tavanaei, M. Ghodrati, S. R. Kheradpisheh, T. Masquelier, and A. Maida, "Deep learning in spiking neural networks," Neural Networks, vol. 111, pp. 47-63, 2019.
[9] S. Dorkenwald, A. Matsliah, A. R. Sterling, P. Schlegel, S. Yu, C. E. McKellar, et al., "Neuronal wiring diagram of an adult brain," Nature, vol. 634, pp. 124-138, 2024.
[10] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, "LoRA: Low-Rank Adaptation of Large Language Models," in International Conference on Learning Representations, 2022.
[11] T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer, "QLoRA: Efficient Finetuning of Quantized LLMs," in Advances in Neural Information Processing Systems, vol. 36, 2023.
This paper describes a system that is implemented and deployed in production at https://openclaw-mechanicus.replit.app as part of the OpenClaw multi-agent platform with BrainJar neural engine integration. Source code and reference implementation are available at https://github.com/JoeSzeles/neural-preference-learning. Standalone brain engine installer at https://github.com/JoeSzeles/ClawBrain.