Blog
HADESS
Cyber Security Magic

AI/ML Security: Adversarial Attacks, Model Poisoning, and Prompt Injection

AI/ML Security: Adversarial Attacks, Model Poisoning, and Prompt Injection

Part of the Cybersecurity Skills Guide — This article is one deep-dive in our complete guide series.

By HADESS Team | February 28, 2026 | Updated: February 28, 2026 | 5 min read

Machine learning systems introduce attack surfaces that traditional security tools do not address. The model itself becomes a target — its training data, its inference behavior, and its integration with other systems all create opportunities for attackers. As AI gets embedded into more products, understanding these threats is becoming a required skill.

Adversarial Attacks

Adversarial examples are inputs designed to make a model produce wrong outputs while appearing normal to humans. A few pixels changed in an image can cause a classifier to misidentify a stop sign as a speed limit sign. A carefully crafted audio sample can make a speech recognition system hear a different command than what was spoken.

White-box attacks (FGSM, PGD, C&W) require access to model weights and gradients. They generate perturbations by computing how to maximally change the output with minimal input change. These are primarily a concern when model weights are public or extractable.

Black-box attacks work without model access. Transfer attacks generate adversarial examples against a surrogate model — they often transfer to the target model. Query-based attacks estimate gradients through repeated API calls, observing how outputs change with input modifications.

Defenses include adversarial training (training on adversarial examples), input preprocessing (JPEG compression, randomized smoothing), and certified defenses that provide mathematical guarantees within a perturbation bound. No single defense covers everything.

Model Poisoning

Poisoning attacks corrupt the training data to make the model behave incorrectly. Backdoor attacks insert a trigger pattern into training samples — the model behaves normally on clean inputs but produces attacker-chosen outputs when the trigger is present.

Data poisoning requires access to the training pipeline. If your model trains on user-submitted data, scraped web content, or crowdsourced labels, poisoning is a real threat. A small percentage of poisoned samples (often under 1%) can insert a reliable backdoor.

Model supply chain poisoning targets pre-trained models and fine-tuning datasets. Downloading a model from a public repository means trusting that the model weights have not been tampered with. Verify model checksums, use models from trusted sources, and evaluate model behavior on test sets that include potential trigger patterns.

Defenses: validate training data quality, use robust aggregation methods that tolerate outliers, inspect models for backdoors using techniques like Neural Cleanse and Activation Clustering, and maintain provenance records for all training data and model artifacts.

Prompt Injection

Prompt injection attacks target LLM-based applications by embedding instructions in user input or retrieved content that override the system prompt. Direct prompt injection provides malicious instructions in user input. Indirect prompt injection hides instructions in external content that the LLM processes — web pages, emails, documents, database records.

A retrieval-augmented generation (RAG) system that searches documents and answers questions can be attacked by planting instructions in the document corpus. The LLM reads the document, follows the embedded instructions, and performs actions the user did not intend.

Defenses are still evolving. Separate data and instructions architecturally. Use output filtering to detect and block instruction-following behavior. Implement least-privilege access for LLM agents — limit what actions the model can take regardless of what it is instructed to do. Monitor for anomalous behavior patterns in LLM outputs.

Data Privacy

Training data extraction attacks recover private information from model outputs. Language models can memorize and regurgitate training data — names, phone numbers, API keys, and other sensitive content that appeared in the training corpus.

Differential privacy adds noise during training to limit what can be learned about individual training examples. Federated learning keeps raw data on client devices, training on local data and sharing only model updates. Both techniques involve privacy-utility tradeoffs.

For inference privacy, consider that API queries reveal information about the user’s data. Homomorphic encryption and secure multi-party computation enable inference without exposing the input, but performance overhead is significant.

Related Career Paths

AI/ML security expertise maps to the Security Researcher career path. This is a rapidly growing specialization as organizations integrate AI into security-sensitive applications.

Next Steps

Related Guides in This Series

Take the Next Step

Browse 80+ skills on HADESS. Go to the browse 80+ skills on hadess on HADESS.

See your certification roadmap. Check out the see your certification roadmap.

Get started freeCreate your HADESS account and access all career tools.

Frequently Asked Questions

How long does it take to learn this skill?

Most practitioners build working proficiency in 4-8 weeks of dedicated study with hands-on practice. Mastery takes longer and comes primarily through on-the-job experience.

Do I need certifications for this skill?

Certifications validate your knowledge to employers but are not strictly required. Hands-on experience and portfolio projects often carry more weight in technical interviews. Check the certification roadmap for relevant options.

What career paths use this skill?

Explore the career path explorer to see which roles require this skill and how it fits into different cybersecurity specializations.

HADESS Team consists of cybersecurity practitioners, hiring managers, and career strategists who have collectively spent 50+ years in the field.

Leave a Reply

Your email address will not be published. Required fields are marked *