Evan Hubinger
Safety Researcher, Anthropic
Verified email at anthropic.com - Homepage
Title · Cited by · Year
Discovering language model behaviors with model-written evaluations
E Perez, S Ringer, K Lukošiūtė, K Nguyen, E Chen, S Heiner, C Pettit, ...
arXiv preprint arXiv:2212.09251, 2022
Cited by 125 · 2022
Risks from learned optimization in advanced machine learning systems
E Hubinger, C van Merwijk, V Mikulik, J Skalse, S Garrabrant
arXiv preprint arXiv:1906.01820, 2019
Cited by 100 · 2019
Studying large language model generalization with influence functions
R Grosse, J Bae, C Anil, N Elhage, A Tamkin, A Tajdini, B Steiner, D Li, ...
arXiv preprint arXiv:2308.03296, 2023
Cited by 46 · 2023
Measuring faithfulness in chain-of-thought reasoning
T Lanham, A Chen, A Radhakrishnan, B Steiner, C Denison, ...
arXiv preprint arXiv:2307.13702, 2023
Cited by 35 · 2023
Question decomposition improves the faithfulness of model-generated reasoning
A Radhakrishnan, K Nguyen, A Chen, C Chen, C Denison, D Hernandez, ...
arXiv preprint arXiv:2307.11768, 2023
Cited by 25 · 2023
An overview of 11 proposals for building safe advanced AI
E Hubinger
arXiv preprint arXiv:2012.07532, 2020
Cited by 24 · 2020
Sleeper agents: Training deceptive LLMs that persist through safety training
E Hubinger, C Denison, J Mu, M Lambert, M Tong, M MacDiarmid, ...
arXiv preprint arXiv:2401.05566, 2024
Cited by 9 · 2024
Steering Llama 2 via contrastive activation addition
N Rimsky, N Gabrieli, J Schulz, M Tong, E Hubinger, AM Turner
arXiv preprint arXiv:2312.06681, 2023
Cited by 7 · 2023
Model organisms of misalignment: The case for a new pillar of alignment research
E Hubinger, N Schiefer, C Denison, E Perez
Alignment Forum, https://www.alignmentforum.org/posts…, 2023
Cited by 6 · 2023
Chris Olah’s views on AGI safety
E Hubinger
AI Alignment Forum, 2020
Cited by 5 · 2020
Engineering monosemanticity in toy models
AS Jermyn, N Schiefer, E Hubinger
arXiv preprint arXiv:2211.09169, 2022
Cited by 4 · 2022
Conditioning predictive models: Risks and strategies
E Hubinger, A Jermyn, J Treutlein, R Hudson, K Woolverton
arXiv preprint arXiv:2302.00805, 2023
Cited by 2 · 2023
Many-shot Jailbreaking
C Anil, E Durmus, M Sharma, J Benton, S Kundu, J Batson, N Rimsky, ...