CERT-EU Lightning Talk: Elevating phishing defence with On-Prem LLMs
Empowering SOC Analysts through AI
NOTE: This content reflects my experience up to October 2024; LLMs might evolve and no longer need these tips & tricks.
No LLMs were used to generate this blog post or my lightning talk.
Intro.
I had the opportunity to present a lightning talk at The CERT-EU 2024 conference.
As I’m a cyber engineer and not a data scientist, I stick to basic concepts, so I wanted to present something that SOCs/organizations might want to try when working with on-premise LLMs. I presented these 2 concepts:
- Benchmarking (Models + Prompts)
- Micro agents
Concept 1: Benchmarking
Benchmarking models
Every other week a new LLM is published, and it’s impossible for me to keep up with how well each one performs, what’s new, etc. So I’ve written a simple flow in Tines that grabs 4000 emails classified by analysts as clean, imposter, threat, spam, etc., asks the LLM to form an opinion on each one, saves the output, and compares it to what the humans thought of it.
(a blog post on the Tines flow is coming)
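
Outside of Tines, the same benchmark loop is easy to sketch in a few lines of Python. The snippet below is a minimal sketch, assuming an OpenAI-compatible on-prem endpoint (as exposed by vLLM, Ollama, llama.cpp and similar) and a CSV export of the analyst-labelled emails; the endpoint URL, model names and column names are illustrative, not what we actually run.

```python
# Minimal model-benchmark sketch: classify analyst-labelled emails with each
# model and report how often the LLM agrees with the human verdict.
import csv
import requests

LLM_URL = "http://llm.internal:8000/v1/chat/completions"  # assumption: on-prem, OpenAI-compatible
PROMPT = "Classify this email as clean, imposter, threat or spam. Reply with one word."

def classify(model: str, email_body: str) -> str:
    resp = requests.post(LLM_URL, json={
        "model": model,
        "messages": [
            {"role": "system", "content": PROMPT},
            {"role": "user", "content": email_body},
        ],
    }, timeout=60)
    return resp.json()["choices"][0]["message"]["content"].strip().lower()

def benchmark(model: str, labelled_emails: list[dict]) -> float:
    # labelled_emails: rows like {"body": "...", "analyst_label": "spam"}
    hits = sum(classify(model, e["body"]) == e["analyst_label"] for e in labelled_emails)
    return hits / len(labelled_emails)

with open("analyst_labelled_emails.csv") as f:   # assumption: exported benchmark set
    emails = list(csv.DictReader(f))

for model in ["mistral-7b", "llama3-8b"]:        # whichever models you want to compare
    print(model, f"{benchmark(model, emails):.1%} agreement with analysts")
```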
Benchmarking prompts
Once you have a (work)flow to benchmark models, you can easily adjust it to also benchmark prompts (e.g. do we use XML hints or markdown hints?). Trying out different prompts against the same benchmark lets you check whether a prompt performs better or worse on the same subset of emails.
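
A minimal sketch of the same idea for prompts, again assuming an OpenAI-compatible on-prem endpoint and the same labelled CSV; the two prompt variants and the fixed model name are only illustrative.

```python
# Minimal prompt-benchmark sketch: keep the model fixed, vary the prompt,
# score each variant against the same analyst-labelled set.
import csv
import requests

LLM_URL = "http://llm.internal:8000/v1/chat/completions"  # assumption
MODEL = "mistral-7b"                                       # fixed model, vary only the prompt

PROMPTS = {
    "plain":    "Classify this email as clean, imposter, threat or spam. One word only.",
    "xml_hint": "Classify the email inside <email> tags as clean, imposter, threat or spam. One word only.",
}

def classify(prompt: str, body: str, wrap_xml: bool) -> str:
    content = f"<email>{body}</email>" if wrap_xml else body
    resp = requests.post(LLM_URL, json={
        "model": MODEL,
        "messages": [{"role": "system", "content": prompt},
                     {"role": "user", "content": content}],
    }, timeout=60)
    return resp.json()["choices"][0]["message"]["content"].strip().lower()

with open("analyst_labelled_emails.csv") as f:
    emails = list(csv.DictReader(f))

for name, prompt in PROMPTS.items():
    hits = sum(classify(prompt, e["body"], name == "xml_hint") == e["analyst_label"]
               for e in emails)
    print(name, f"{hits / len(emails):.1%} agreement with analysts")
```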

Concept 2: Micro agents
The second concept is (for lack of a better word) micro agents; I’ve also heard it called AI supervisors. The idea is that you use the LLM to evaluate its own output and/or summarize the good stuff.
The evaluation micro agent
Once in a while, the LLM just doesn’t respond the way you expect, so what I suggest is piping the LLM output back to the LLM along with the prompt and asking whether the output makes sense. It’s basically using the LLM to babysit itself.
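
A minimal sketch of that loop, with the same assumptions about the endpoint as in the benchmark sketches; ask_llm(), the model name and the single-retry logic are illustrative, not a fixed recipe.

```python
# Evaluation micro agent sketch: feed the original prompt plus the model's own
# answer back to the LLM and ask whether the answer actually makes sense.
import requests

LLM_URL = "http://llm.internal:8000/v1/chat/completions"  # assumption

def ask_llm(system: str, user: str, model: str = "mistral-7b") -> str:
    resp = requests.post(LLM_URL, json={
        "model": model,
        "messages": [{"role": "system", "content": system},
                     {"role": "user", "content": user}],
    }, timeout=60)
    return resp.json()["choices"][0]["message"]["content"].strip()

PROMPT = "Classify this email as clean, imposter, threat or spam. One word only."

def classify_with_check(email_body: str) -> str:
    answer = ask_llm(PROMPT, email_body)
    verdict = ask_llm(
        "You review another model's output. Reply only 'valid' or 'invalid'.",
        f"Instruction given:\n{PROMPT}\n\nOutput produced:\n{answer}\n\n"
        "Is the output a sensible answer to the instruction?",
    )
    if not verdict.lower().startswith("valid"):
        # e.g. the model replied with chatter instead of a label: retry once
        answer = ask_llm(PROMPT, email_body)
    return answer
```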

The summarizing micro agent
The idea with the summarizing micro agent is to ask the LLM the same question x times (3 times in the example given in the slide) and then ask the LLM to summarize those outputs. This eliminates odd replies like “Yes, I can analyze this email for you”.
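
A minimal sketch, with the same endpoint assumptions as above; the question, run count and summarizing instruction are just examples.

```python
# Summarizing micro agent sketch: ask the same question three times, then have
# the LLM condense the three answers into a single verdict.
import requests

LLM_URL = "http://llm.internal:8000/v1/chat/completions"  # assumption

def ask_llm(system: str, user: str) -> str:
    resp = requests.post(LLM_URL, json={
        "model": "mistral-7b",
        "messages": [{"role": "system", "content": system},
                     {"role": "user", "content": user}],
    }, timeout=60)
    return resp.json()["choices"][0]["message"]["content"].strip()

PROMPT = "Is this email phishing? Explain briefly."

def classify_with_summary(email_body: str, runs: int = 3) -> str:
    answers = [ask_llm(PROMPT, email_body) for _ in range(runs)]
    numbered = "\n\n".join(f"Answer {i + 1}:\n{a}" for i, a in enumerate(answers))
    return ask_llm(
        "Summarize the answers below into a single verdict with a short rationale. "
        "Ignore filler such as 'Yes, I can analyze this email for you'.",
        numbered,
    )
```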

Stacking/combining them
You can combine these 2 concepts as well: either the summarizing agent followed by a check, or checks followed by a summarizing agent.
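
A minimal sketch of one stacking order (summarize first, then check), reusing the hypothetical ask_llm() and classify_with_summary() helpers from the sketches above.

```python
# Stacking sketch: run the summarizing agent, then let the evaluation agent
# check that the summary actually answers the question before accepting it.
def classify_stacked(email_body: str) -> str:
    summary = classify_with_summary(email_body)          # 1. ask 3x and summarize
    verdict = ask_llm(                                    # 2. evaluation check
        "You review another model's output. Reply only 'valid' or 'invalid'.",
        f"Question: Is this email phishing?\n\nOutput:\n{summary}",
    )
    if not verdict.lower().startswith("valid"):
        summary = classify_with_summary(email_body)       # one retry if the check fails
    return summary
```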

Things that work for us
In my last slide, I talked about things that work for us, and you might want to give them a shot too!
The main concept is to leverage the “language” part of the phishing email with LLMs (go figure), so giving them a bit less information often works better.
I also suggest removing email headers that appliances add: if an LLM “sees” that a sandbox cleared the email, it tends to reason “Sandboxes x and y have analyzed this email and said it’s clean, so this email is not phishing” even when it is a phishing email.
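
A minimal sketch of that clean-up step using Python’s standard email library; the header prefixes listed are only examples, so list whatever your own gateways and sandboxes actually add.

```python
# Header-stripping sketch: drop appliance/sandbox verdict headers before the
# email text is handed to the LLM, so it can't anchor on "sandbox says clean".
from email import policy
from email.parser import BytesParser

APPLIANCE_HEADER_PREFIXES = (
    "x-ms-exchange-",          # examples only: adjust to your environment
    "x-microsoft-antispam",
    "x-proofpoint-",
    "x-mimecast-",
    "x-spam-",
)

def strip_appliance_headers(raw_email: bytes) -> str:
    msg = BytesParser(policy=policy.default).parsebytes(raw_email)
    for header in list(msg.keys()):
        if header.lower().startswith(APPLIANCE_HEADER_PREFIXES):
            del msg[header]    # removes every occurrence of that header
    return msg.as_string()
```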
The last thing I would like you to try is leveraging people’s functions/job descriptions with LLMs, e.g. does it make sense that Lucy, a policy officer, receives an invoice for 2 shipping containers of humanitarian aid?
From my experience, LLMs are very good at picking out “odd” emails addressed to certain jobs if you give them the context of the recipient’s role.
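
A minimal sketch of how that context could be injected into the prompt; the role directory, addresses and wording are illustrative, and in practice the job description would come from your directory service or HR data.

```python
# Role-context sketch: look up the recipient's job description and put it in
# the prompt so the LLM can judge whether the email fits that role.
ROLE_DIRECTORY = {  # assumption: in practice pulled from AD/Entra ID, HR, etc.
    "lucy@example.eu": "Policy officer, drafts legislation, never handles invoices",
    "finance@example.eu": "Accounts payable, routinely receives supplier invoices",
}

def build_prompt(recipient: str, email_body: str) -> str:
    role = ROLE_DIRECTORY.get(recipient, "unknown role")
    return (
        "You analyse emails for phishing.\n"
        f"The recipient's role: {role}\n"
        "Does it make sense for someone in this role to receive this email? "
        "Answer 'expected' or 'suspicious' with a one-line reason.\n\n"
        f"Email:\n{email_body}"
    )
```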

What’s next?
Once you have a tuned prompt and model, you can run every incoming email through the LLM and start flagging phishing emails before they even reach your end users’ inboxes.
Either run it as a step at your edge/boundary, or have it reactively pull emails from inboxes.

Notes and mentions
Thank you, CERT-EU, for the conference and for accepting my lightning talk.
Header image taken from CERT-EU, made by Robert Laszlo Kiss