Overview of traditional (non-LLM) trustworthy machine learning based on the book “Trustworthy Machine Learning” by the presenter
Definitions of trustworthiness and safety in terms of aleatoric and epistemic uncertainty
AI fairness
Human-centered explainability
Adversarial robustness
Control-theoretic view of transparency and governance
What are the new risks introduced by LLMs
Information-related risks
Hallucination, lack of factuality, lack of faithfulness
Lack of source attribution
Leakage of private information
Copyright infringement and plagiarism
Interaction-related risks
Hateful, abusive, and profane language
Bullying and gaslighting
Inciting violence
Prompt injection attacks
Brief discussion of moral philosophy
How to change the behavior of LLMs
Data curation and filtering
Supervised fine-tuning
Parameter-efficient fine-tuning, including low-rank adaptation (LoRA; sketched after this list)
Reinforcement learning from human feedback (RLHF; a reward-model sketch also appears after this list)
Model reprogramming and editing
Prompt engineering and prompt tuning
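As a concrete illustration of the parameter-efficient fine-tuning item above, here is a minimal PyTorch sketch of a low-rank adaptation (LoRA) layer; the rank, scaling factor, and the choice to wrap a single frozen linear layer are illustrative assumptions, not the exact configuration covered in the tutorial.

```python
# Minimal LoRA sketch (assumed rank and scaling, not the tutorial's exact recipe).
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base layer plus a trainable low-rank update: W x + (alpha / r) * B A x."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():  # freeze the pretrained weights
            p.requires_grad = False
        self.r, self.alpha = r, alpha
        self.A = nn.Parameter(0.01 * torch.randn(r, base.in_features))  # small random init
        self.B = nn.Parameter(torch.zeros(base.out_features, r))        # zero init: no change at start

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Original output plus the scaled low-rank correction; only A and B receive gradients.
        return self.base(x) + (self.alpha / self.r) * (x @ self.A.T @ self.B.T)

# Usage: wrap, say, an attention projection and fine-tune as usual.
layer = LoRALinear(nn.Linear(768, 768))
out = layer(torch.randn(4, 768))
```

Because the pretrained weight stays frozen and only the two factor matrices are trained, the trainable parameter count drops from d_out * d_in to r * (d_in + d_out).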
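The reward-modeling step of RLHF can likewise be sketched with a Bradley-Terry loss over pairwise human preferences; the synthetic features below stand in for response embeddings, and the small scoring network and training loop are assumptions for illustration. The downstream policy-optimization stage (e.g., PPO) is omitted.

```python
# Reward-model sketch for RLHF: fit a scalar reward so preferred responses score higher.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
chosen = torch.randn(256, 16) + 0.5    # synthetic embeddings of human-preferred responses
rejected = torch.randn(256, 16) - 0.5  # synthetic embeddings of dispreferred responses

reward_model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

for step in range(200):
    # Bradley-Terry objective: maximize log P(chosen preferred) = log sigmoid(r_chosen - r_rejected).
    loss = -F.logsigmoid(reward_model(chosen) - reward_model(rejected)).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# The trained reward model then scores candidate responses inside an RL loop (e.g., PPO).
```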
How to mitigate risks in LLMs and make them safer
Methods for training data source attribution based on influence functions (a minimal sketch appears at the end of this section)
Methods for in-context source attribution based on post hoc explainability
Fairness interventions such as equi-tuning, the fair infinitesimal jackknife, and fairness reprogramming
Aligning LLMs to unique user-specified values and constraints stemming from use cases, social norms, laws, industry standards, etc. via policy elicitation, parameter-efficient fine-tuning, and red-team audits
Orchestrating multiple possibly conflicting values and constraints
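To make the influence-function item above concrete, here is a minimal sketch of training data attribution on a tiny logistic-regression problem where the exact Hessian is tractable; the synthetic data, model, and damping term are assumptions for illustration. At LLM scale, the explicit Hessian inverse is replaced by Hessian-vector-product approximations.

```python
# Influence-function sketch: I(z_i, z_test) ~= -grad L(z_test)^T H^{-1} grad L(z_i).
import torch
import torch.nn.functional as F

torch.manual_seed(0)
X_train, y_train = torch.randn(100, 5), torch.randint(0, 2, (100,))  # synthetic training set
x_test, y_test = torch.randn(1, 5), torch.tensor([1])                # one test example
w = (0.1 * torch.randn(5, 2)).requires_grad_()                       # logistic-regression weights

def loss_fn(w, X, y):
    return F.cross_entropy(X @ w, y)

# Gradient of the test loss and damped inverse Hessian of the training loss (10 parameters).
grad_test = torch.autograd.grad(loss_fn(w, x_test, y_test), w)[0].flatten()
hess = torch.autograd.functional.hessian(
    lambda v: loss_fn(v.view(5, 2), X_train, y_train), w.detach().flatten())
h_inv = torch.linalg.inv(hess + 1e-3 * torch.eye(10))

# Score every training point; the largest-magnitude scores flag the most influential examples.
influences = []
for i in range(len(X_train)):
    g_i = torch.autograd.grad(loss_fn(w, X_train[i:i + 1], y_train[i:i + 1]), w)[0].flatten()
    influences.append(-(grad_test @ h_inv @ g_i))
print(torch.topk(torch.stack(influences).abs(), k=5).indices)  # top-5 most influential points
```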