Desenmascara.me

How to verify whether a website is legitimate or not?: desenmascara.me

lunes, 7 de octubre de 2024

PyRIT: A Framework for Security Risk Identification and Red Teaming in Generative AI Systems

Microsoft's AI Red Team has published a new paper titled “PyRIT: A Framework for Security Risk Identification and Red Teaming in Generative AI Systems” on arXiv.

Generative AI (GenAI) has increased in popularity over the past few years, since applications such as ChatGPT captured the zeitgeist of the new wave of GenAI developments. This disruptive and highly innovative technology has become more widespread and more easily accessible than ever before. The increased capabilities of these models have inspired the community to incorporate them into almost every domain, from healthcare [21] to finance [4] to defense [22]. However, with these advances comes a new landscape for risk and harm. GenAI models are generally trained on huge datasets scraped from the Internet [10], and as such the models contain all the potentially harmful information available there, such as how to build a bioweapon, as well as all the biases, hate speech, violent content, etc. contained in these datasets [20]. When a company releases a product that uses GenAI, it inadvertently contains these potentially harmful capabilities and behaviors as an innate part of the model. As with any rapidly advancing technology, the development of new tools and frameworks is crucial to manage and mitigate the associated risks. Generative AI systems in particular present unique challenges that require innovative approaches to security and risk management. Traditional red teaming methods are insufficient for the probabilistic nature and diverse architectures of these systems. Additionally, although there is a promising ecosystem of existing open-source GenAI tools, there is a dearth of tools grounded in practical application of GenAI red teaming.

A. Gandalf 

To demonstrate the effectiveness of the attacker bot mode, we conducted a proof of concept using the chatbot Gandalf from Lakera [12]. Gandalf serves as an effective test bed for evaluating the capabilities and flexibility of the PyRIT framework. Designed to help users practice crafting prompts that can extract a password from the chatbot across ten progressively more difficult levels, Gandalf introduces additional countermeasures at each level, including stronger system prompts, block-lists, and input/output guards. To evaluate the effectiveness of the Red Team Orchestrator in PyRIT, we developed targets and scorers tailored to Gandalf. The experimental setup involved configuring the following components within PyRIT: 1) Target Endpoint: Gandalf was set as the target LLM. 2) Red Team Bot: GPT-4o was the LLM powering the red team bot. 3) Attack Strategy: A text description of the objective for the red team bot. In this case, the objective is to extract the password from the Gandalf (the target endpoint). 4) Scorers: Custom scoring engines were implemented to evaluate the responses generated by Gandalf. We used the red team orchestrator to probe Gandalf and extract the passwords for Levels 1-4. PyRIT successfully extracted the passwords by leveraging its self-reasoning capabilities, which keep track of conversation history to increase the likelihood of success in subsequent prompts


PyRIT (Python Risk Identification Tool for generative AI

https://github.com/Azure/PyRIT

No hay comentarios:

Publicar un comentario

Trata a los demás como te gustaría ser tratado.