CS 4973 / CS 6983: Trustworthy Generative AI

Fall 2024

 

Instructors:

§  Instructor: Alina Oprea (alinao)

§  TA: Pravind Anand Pawar

 

Class Schedule: 

§  Tuesday 11:45am-1:25pm and Thursday 2:50pm-4:30pm ET

§  Location: Ryder Hall 161

 

Office Hours: 

§  Alina: Thursday 4:30-5:30pm ET and by appointment

§  Pravind: Monday 5-6pm ET on Zoom

 

Class forum: Canvas, with links to Piazza and Gradescope

 

Class policies: The academic integrity policy is strictly enforced.

 

Class Description:

 

Generative AI has recently been deployed in critical domains such as medicine, biology, finance, and cyber security. Foundation models such as large language models (LLMs) are trained on massive datasets crawled from the web and subsequently fine-tuned for new tasks, including summarization, translation, code generation, and conversational agents. This trend raises many concerns about the security of AI models in critical applications, as well as the privacy of the data used to train them.

In this course, we will study a variety of adversarial attacks on generative AI that impact the security and privacy of these systems. We will cover multiple deployment models for generative AI, including fine-tuning and Retrieval-Augmented Generation (RAG). We will also discuss existing mitigations against security and privacy vulnerabilities, and the challenges in making AI trustworthy. We will read and debate papers published in top-tier conferences in ML and cyber security. Students will have an opportunity to work on a semester-long research project in trustworthy AI.

 

Disclaimer: This course is not meant to be a student's first course in ML/AI. It focuses on recent research on the security and privacy of ML and AI, and prior knowledge of machine learning is essential for following the material. If you have any questions about the course content, please email the instructor.

 

Pre-requisites:

 

§  Calculus and linear algebra

§  Basic knowledge of machine learning 

 

Grading

The grade will be based on:

 

§  Assignments – 20%

§  Paper summaries – 10%

§  Final project report – 40%

§  Final project presentation – 10%

§  Paper presentation and class participation – 20%
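
For concreteness, here is a minimal Python sketch of how the weights above combine into a final grade. It is purely illustrative and not part of the official policy; the component names and the assumption that each component is scored on a 0-100 scale are mine, not the course's.

# Illustrative only: weighted-average grade from the component weights above.
# Component names and the 0-100 scale are assumptions, not official policy.
WEIGHTS = {
    "assignments": 0.20,
    "paper_summaries": 0.10,
    "final_project_report": 0.40,
    "final_project_presentation": 0.10,
    "paper_presentation_and_participation": 0.20,
}

def final_grade(scores):
    """Return the weighted average of per-component scores (each 0-100)."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)

# Example usage with hypothetical scores:
print(round(final_grade({
    "assignments": 85,
    "paper_summaries": 90,
    "final_project_report": 95,
    "final_project_presentation": 92,
    "paper_presentation_and_participation": 88,
}), 1))  # 90.8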

 

     

Calendar (Tentative)

 

Week 1

Tue 09/10: Course outline (syllabus, grading, policies). Introduction to trustworthy AI [Slides]

Thu 09/12: Review of deep learning and LLMs [Slides]
HW 1 released

Week 2

Tue 09/17: Taxonomy of adversarial attacks on predictive and generative AI [Slides]
Keshav. How to Read a Paper.
Optional read: Chapters 1 and 2.1 of the NIST report on Adversarial ML
Optional read: Biggio and Roli. Wild Patterns: Ten Years After the Rise of Adversarial Machine Learning

Thu 09/19: Evasion attacks against ML and prompt injection [Slides]
Optional read: Carlini and Wagner. Towards Evaluating the Robustness of Neural Networks. IEEE S&P 2017
Required read: Wei et al. Jailbroken: How Does LLM Safety Training Fail? arXiv 2023

Week 3

Tue 09/24: Poisoning attacks against ML: Backdoor and subpopulation attacks [Slides]
Required read: Gu et al. BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain. arXiv 2017.
Optional read: Jagielski et al. Subpopulation Data Poisoning Attacks. ACM CCS 2021.
HW 1 due

Thu 09/26: Privacy risks in ML. Membership inference attacks [Slides]
Required read: Carlini et al. Membership Inference Attacks From First Principles. IEEE S&P 2022.
Optional read: Yeom et al. Privacy Risk in Machine Learning: Analyzing the Connection to Overfitting. IEEE CSF 2018

Week 4

Tue 10/01: LLM privacy: Data extraction attacks [Slides]
Required read: Carlini et al. Extracting Training Data from Large Language Models. USENIX Security 2021.
Kassem et al. Alpaca against Vicuna: Using LLMs to Uncover Memorization of LLMs. arXiv 2024

Thu 10/03: LLM indirect prompt injection and safety alignment [Slides 1] [Slides 2]
Greshake et al. Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection. AISec 2023.
Required read: Bai et al. Constitutional AI: Harmlessness from AI Feedback. arXiv 2022

 

Week 5

Tue 10/08: LLM jailbreaking [Slides]
Required read: Zou et al. Universal and Transferable Adversarial Attacks on Aligned Language Models. arXiv 2023.
Chao et al. Jailbreaking Black Box Large Language Models in Twenty Queries. arXiv 2023

Thu 10/10: Defenses to prompt injection and jailbreaking [Slides]
Required read: Wallace et al. The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions. arXiv 2024
Zou et al. Improving Alignment and Robustness with Circuit Breakers. arXiv 2024

Week 6

Tue 10/15: RAG poisoning attacks [Slides]
Required read: Zou et al. PoisonedRAG: Knowledge Poisoning Attacks to Retrieval-Augmented Generation of Large Language Models. USENIX Security 2025
Chaudhari et al. Phantom: General Trigger Attacks on Retrieval Augmented Language Generation. arXiv 2024

Thu 10/17: Class canceled

 

Week 7

Tue 10/22: LLM poisoning 1 [Slides]
Required read: Hubinger et al. Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training. arXiv 2024
Carlini et al. Poisoning Web-Scale Training Datasets is Practical. arXiv 2023.

Thu 10/24: LLM poisoning 2 [Slides]
Required read: Shu et al. On the Exploitability of Instruction Tuning. NeurIPS 2023
Rando et al. Universal Jailbreak Backdoors from Poisoned Human Feedback. ICLR 2024

 

Week 8

Tue 10/29: LLM safety mitigations [Slides]
Rebedea et al. NeMo Guardrails: A Toolkit for Controllable and Safe LLM Applications with Programmable Rails. arXiv 2023
Chen et al. Aligning LLMs to Be Robust Against Prompt Injection. arXiv 2024

Thu 10/31: LLM fine-tuning privacy risks [Slides]
Required read: Chen et al. The Janus Interface: How Fine-Tuning in Large Language Models Amplifies the Privacy Risks. arXiv 2023
Kandpal et al. User Inference Attacks on Large Language Models. EMNLP 2024

Week 9

Tue 11/05: Differential privacy and auditing [Slides on DP-SGD] [Slides on Privacy Auditing]
Abadi et al. Deep Learning with Differential Privacy. ACM CCS 2016
Required read: Jagielski et al. Auditing Differentially Private Machine Learning: How Private is Private SGD? NeurIPS 2020

Thu 11/07: Machine unlearning [Slides]
Bourtoule et al. Machine Unlearning. IEEE S&P 2021.
Required read: Yao et al. Large Language Model Unlearning. arXiv 2023

 

Week 10

Tue 11/12: Inference-time privacy and privacy of LLM agents [Slides]
Staab et al. Beyond Memorization: Violating Privacy via Inference with Large Language Models. ICLR 2024
Required read: Bagdasaryan et al. Air Gap: Privacy-Conscious Conversational Agents. arXiv 2024

Thu 11/14: Watermarking LLMs [Slides]
Required read: Kirchenbauer et al. A Watermark for Large Language Models. arXiv 2023
Jovanovic et al. Watermark Stealing in Large Language Models. ICML 2024

Week 11

Tue 11/19: LLM coding models [Slides]
Required read: Ullah et al. LLMs Cannot Reliably Identify and Reason About Security Vulnerabilities (Yet?): A Comprehensive Evaluation, Framework, and Benchmarks. IEEE S&P 2024
Jenko et al. Practical Attacks against Black-box Code Completion Engines. arXiv 2024

Thu 11/21: Copyright and model ownership [Slides]
Required read: Maini et al. LLM Dataset Inference: Did you train on my dataset? arXiv 2024
Chen et al. CopyBench: Measuring Literal and Non-Literal Reproduction of Copyright-Protected Text in Language Model Generation. EMNLP 2024

Week 12

Tue 11/26: Reinforcement learning security. Class review [Slides]
Rathbun et al. SleeperNets: Universal Backdoor Poisoning Attacks Against Reinforcement Learning Agents. arXiv 2024

Thu 11/28: No class (university holiday: Thanksgiving)

 

Week 13

Tue 12/03: Project presentations

Wed 12/04: Project presentations, 11am-1pm, 177 Huntington Ave, conference room 503

Mon 12/09: Project reports due at noon

 

 

Review materials

§  Probability review notes from Stanford's machine learning class

§  Sam Roweis's probability review

§  Linear algebra review notes from Stanford's machine learning class 

 

 

Other resources

 

Books:

§  Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning. Second Edition, Springer, 2009.

§  Christopher Bishop. Pattern Recognition and Machine Learning. Springer, 2006.  

§  A. Zhang, Z. Lipton, and A. Smola. Dive into Deep Learning

§  C. Dwork and A. Roth. The Algorithmic Foundations of Differential Privacy

§  Shai Ben-David and Shai Shalev-Shwartz. Understanding Machine Learning: From Theory to Algorithms