Practical Use Cases for LLM’s in cyber security (part 1)
I want to start this off by stating, I am not a developer. I’ve written very little code myself (some small PHP ticket-system style applications back in the mid 2000’s). But now.. now I’m a 10x engineer (/s).
That being said, I have a strong understanding of code structure and a long background in infrastructure deployments, including self-hosting and Cloud/SaaS solutions.
Over the past two decades, my focus has been on blue, red, and purple team efforts for large enterprises. Now, I’m interested in how we can leverage LLMs to create our own security tools, potentially saving millions on commercial solutions.
With that out of the way, I want to discuss at length use-cases I’ve been able to create since around February 2024 (all in my free time). The code may not be perfect that you’ll see in this, and the use-cases may not fit your exact needs, but it should give an idea of what you/your teams may be able to build relatively quickly for delivery of a product to help with information security tasks.
ChatGPT.. Write me a C2!
While it wasn’t quite as simple as the heading suggests, it’s absolutely possible to have LLMs write a C2 platform for you. Here is a C2 I created from February — Early August of 2024 utilizing mostly ChatGPT, with a bit of Claude toward the end.
This is a C2 platform that I built off-hours over multiple months, this was in the earlier days in ChatGPT 3/3.5 when you could have it write whatever without limits related to red teaming. This is a fully featured platform, completely written in Python (client and server), utilizing Supabase as the backend (SaaS as a C2!).
When I first wrote it and utilized it on some small ops/testing it was completely undetected, as of today I see all common EDR’s/AV’s detecting it, so I figured it was time to make it public.
One of my key goals was to execute commands in-memory, similar to Beacon Object Files (BOFs). So, the tool is packed with built-in commands that leverage APIs, minimizing the use of the command line.
For a more detailed presentation of it, you can see the slides I created for a presentation I did on it back in June here.
I didn’t manually write a single line of code, but if you review the PPT above, you’ll see I chatted well over 10,000 times by the time I wrapped up the tool to move on to other things, it was a lot.
While the techniques themselves weren’t particularly novel, they were new enough that EDR/AV vendors hadn’t seen anything like it. This allowed the tool, complete with features like process injection, to bypass detection. I’d imagine if you could compile the payload another way beside Pyinstaller, you’d have success with the rest of the tool, it even has LLM built-in to help next steps.
How it happened
But how did I write an entire C2 platform with just LLM’s? It breaks down into multiple factors:
- (Most important?) My wife supports me and my off-work hours extracurricular activities, even if she has no clue what I’m working on
- Strong interest in emerging technologies, and LLM’s are as emerging as they come as of 2024
- Absolutely love to SHIP, shipping tools is the best, I also like creating tools that I visually enjoy, and work how I want them to work; also I hate paying for tools that could be built with LLM’s now
- Relentlessness — you’ll need a lot of focus, re-tooling in your head, and then putting ideas into a Notion/Doc/Whiteboard.. whatever works
If you have no background in coding, you’ll be OK — it will still be a struggle if you’ve never written a line of code in your life, but you can do it with practice and iteration.
Recommendations
- Learn the basics so you know what to ask an LLM (ie. you want a C2, that’s cool — but how can it be different/novel? Like me using Supabase as a middleware). Understand concepts — ie. do you need process injection? That’s cool — how can you ask an LLM to create process injection without it flagging your account? Be creative, maybe you’ve worked in a tool before that does C2-ish things, tell the LLM you’re building a plugin for said tool (ie. I’m working on a plug-in for an EDR platform that needs to move into another process)
- You’re going to need to know the basics. Above was a cool example for a C2, but same goes for concepts around application development. What if you want to create a tool to scan file shares (I did this and we’ll discuss later) and you want to front-end it? Well I really like shadcn (a collection of reusable UI components) and daisyui (a Tailwind CSS component library), and reactjs, and express for a backend plugged into postgres, with python as my backend worker code.. understand what these are and how to use them — if you do, the chat back and forth becomes much easier
- Trial & error — you’re going to have issues, lots of them. LLMs aren’t perfect (though Claude is really impressive now). Learn to work through your application iteratively and test thoroughly
- LLM Broke Something Else In My App! Yes this is frequent and happens a lot, even with really good prompting and telling it to only modify small portions of your code. You will continuously need to adjust and modify your code & re-test to make sure other code doesn’t break. They are getting better.. but in the Supaseatwo days, oof
Prompts, chats and more chats
In Part 2 and beyond I’ll introduce/showcase other tools I’ve written as well (or you can cheat and head over to my GitHub, although not everything is public) — but I really want to discuss the topic of prompts.
Prompts get a lot of light in the LLM world, and for good reason. On a whole as I utilized ChatGPT, Gemini, and Claude I rarely give custom prompts for code development. These days I utilize Cursor and rotate models, but I never provide a prompt, I use out of the box prompts and they work just fine.
What I do is be very specific in my asks, and this seems to work as good as, if not better then writing a large custom prompt. Let’s say you want to build a new C2 tool, here is a sample of what I would start with:
I want to build a new real-time response tool that is similar to common EDR's.
Think of like utilizing CrowdStrike/Tanium. I want the tool to be in Rust.
Let's start by creating the backend, we'll need a way to manage a C2, let's
utilize Supabase for this.
From there, it’ll start to generate what you’re asking for, then you continue to iterate and test over that and continue to ask it questions. As this moves on, your tokens are going to get heavy in your chats, and the LLM’s are going to start to produce poor code. That’s when you know it’s time to move on to a new chat.
In tools like Cursor or GitHub Copilot, you can use operators like @
and #
to incorporate specific files into your prompt, allowing you to pick up where you left off in the last conversation (ex. @my_script.py, #my_script.py).
Do Not Fear Starting New Chats (DNFSNC!)
I can’t stress how important it is to start new chats. In my time utilizing LLM’s for coding you are absolutely going to get lost in the same chat, with the LLM providing circular logic to an issue you’re troubleshooting and you’ll waste HOURS, don’t do this! Start a new chat and get an answer quickly, you can always pick up where you left off.
Small Prompts, big prompts
Getting back to the topic of prompts away from chats, they can be very useful in your tooling. Depending on your use-case you may never utilize prompts, and other times you may need them as part of your new application.
Recently I’ve been working on a new platform (not completed yet, but getting close) called Sketch Chat that ties into Timesketch
This project is an incident response-centric chat platform. It lets you freely discuss security incidents, and then uses Gemini/Azure AI on the backend to translate those chats into the Timesketch format and push them directly to Timesketch. Great for tracking critical items during security incidents without having to write a document, it does it for you. It also ignores regular chat.
But to perform this, it has to have a couple of LARGE prompts, Timesketch isn’t super well-known to LLM’s so you need to help it understand what the format should look like, and what chats should make it through to Timesketch.
If LLM’s were super familiar with Timesketch, my prompt would look like this
Convert all chats you read into common Timesketch format and export into json
format. Do not bring over common banter/back and forth, only relevant security
details.
But since it’s not super familiar.. it looks like this
You are a cyber security expert who is working with the tool Timesketch by Google. There is a new interface being created that allow users to talk in "plain english" and you will convert it into the proper timesketch format (.jsonl) to send off to timesketch later.
IMPORTANT: If a message is marked as "LLM Required", you MUST convert it into Timesketch format, even if it appears to be regular chat. For these messages, make reasonable assumptions about security implications and create appropriate entries.
For example, if a message marked as "LLM Required" says "we saw bad things", you should create an entry like:
{"message": "Potential security incident reported [T1078]", "datetime": "2024-10-16T08:00:00Z", "timestamp_desc": "Security Alert", "observer_name": "analyst"}
Important notes about timestamps:
- If a timestamp is mentioned in the message/file, use that timestamp in the datetime field
- Only use the current timestamp if no timestamp is provided in the content
- Timestamps may appear in various formats (e.g., "2024-03-15 14:30:00", "March 15th 2:30 PM", "15/03/24 14:30")
- If a timezone is specified (e.g., EST, PST, GMT+2), convert the time to UTC
- Common timezone conversions:
* EST/EDT → UTC+4/5
* PST/PDT → UTC+7/8
* CST/CDT → UTC+5/6
* MST/MDT → UTC+6/7
- Convert all timestamps to ISO 8601 format in UTC (YYYY-MM-DDThh:mm:ssZ)
- If no timezone is specified, assume UTC
Here are examples of how you would output:
{"message": "Suspicious domain: malicious.ru", "datetime": "2024-10-16T08:00:00Z", "timestamp_desc": "Network Connection", "domain": "malicious.ru", "observer_name": "alice"}
{"message": "Suspicious outbound connection detected to 12.34.56.78 on port 8080", "datetime": "2024-10-16T08:05:00Z", "timestamp_desc": "Network Connection", "dest_ip": "12.34.56.78", "dest_port": "8080", "observer_name": "bob"}
{"message": "Beaconing activity detected to C2 domain: badsite.com", "datetime": "2024-10-16T08:10:00Z", "timestamp_desc": "Network Security", "domain": "badsite.com", "observer_name": "charlie"}
{"message": "Large file transfer (400GB) to external FTP server detected", "datetime": "2024-10-16T08:15:00Z", "timestamp_desc": "Data Loss Prevention", "dest_port": "21", "bytes_sent": "400000000000", "observer_name": "dave"}
{"message": "PowerShell execution with base64 encoded command detected", "datetime": "2024-10-16T08:20:00Z", "timestamp_desc": "Process Execution", "computer_name": "WORKSTATION01", "observer_name": "eve"}
{"message": "Multiple failed login attempts detected from IP 10.0.0.5", "datetime": "2024-10-16T08:25:00Z", "timestamp_desc": "Authentication", "source_ip": "10.0.0.5", "observer_name": "frank"}
{"message": "Scheduled task created for persistence", "datetime": "2024-10-16T08:30:00Z", "timestamp_desc": "Scheduled Task Creation", "computer_name": "SERVER02", "observer_name": "grace"}
{"message": "Malicious file detected with MD5 hash d41d8cd98f00b204e9800998ecf8427e", "datetime": "2024-10-16T08:35:00Z", "timestamp_desc": "File Hash", "md5_hash": "d41d8cd98f00b204e9800998ecf8427e", "observer_name": "henry"}
{"message": "Suspicious executable found with SHA256 hash e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855", "datetime": "2024-10-16T08:40:00Z", "timestamp_desc": "File Hash", "sha256_hash": "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855", "observer_name": "ivy"}
{"message": "Suspicious executable detected at C:\\ProgramData\\XCerfzz.exe [T1059.003]", "datetime": "2024-10-16T08:45:00Z", "timestamp_desc": "File Creation", "file_path": "C:\\ProgramData\\XCerfzz.exe", "computer_name": "WORKSTATION01", "observer_name": "jack"}
Example of message with multiple attributes that should create multiple entries:
Message: "saw some weird processes like C:\\Windows\\System32\\ripFAULT.exe running with the hash 0c32215fbaf5e83772997a7891b1d2ad"
Should create two entries:
{"message": "Suspicious process detected: C:\\Windows\\System32\\ripFAULT.exe [T1059]", "datetime": "2024-10-16T08:50:00Z", "timestamp_desc": "Process Execution", "file_path": "C:\\Windows\\System32\\ripFAULT.exe", "observer_name": "alice"}
{"message": "Process hash identified: 0c32215fbaf5e83772997a7891b1d2ad [T1059]", "datetime": "2024-10-16T08:50:00Z", "timestamp_desc": "File Hash", "md5_hash": "0c32215fbaf5e83772997a7891b1d2ad", "observer_name": "alice"}
Important notes:
1. Always include the observer_name (the person reporting the activity)
2. Only include technical details (IPs, ports, protocols) that were explicitly mentioned in the message
3. Include timestamp from when the message was sent
4. Use appropriate timestamp_desc values like "Network Connection", "DNS Activity", "Network Security", "Data Loss Prevention", "Process Execution", "Authentication"
5. If multiple indicators are mentioned in a single message (like file paths AND hashes, or IPs AND ports), create separate entries for each indicator while maintaining the relationship in the message field
6. If you see wording like "contain" or "network contain" and then a weird name like "ABC123" or "CPC1234" etc, these are most likely the hostname of the impacted machine. Use the computer_name field for this. Do not ignore contain/containment language. This is related to remediation efforts.
7. Always include relevant MITRE ATT&CK TTPs in square brackets at the end of the message field
8. For file hashes, use md5_hash and sha256_hash fields accordingly
9. For file paths, use the file_path field and include the computer_name if available
10. Investigation findings and scope statements should be captured, even if they seem like regular chat. Examples:
- "No other users were impacted" → Create an entry about scope limitation
- "We've reviewed all logs" → Create an entry about investigation completion
- "Analysis complete, only 2 machines affected" → Create an entry about impact scope
Example of investigation finding entries:
{"message": "Investigation scope: No additional users impacted [T1087]", "datetime": "2024-10-16T08:00:00Z", "timestamp_desc": "Investigation Finding", "observer_name": "analyst"}
{"message": "Investigation complete: Impact limited to 2 machines [T1082]", "datetime": "2024-10-16T08:00:00Z", "timestamp_desc": "Investigation Finding", "observer_name": "analyst"}
There may be times it's just "regular chat" and you don't need to convert anything, you need to make that decision. Your focus should be on turning indicators into timesketch, not worrying about common back and forth. If you decide it's regular chat, write back "Regular chat: no sketch update"
IMPORTANT: Messages containing investigation results MUST be converted to Timesketch format, even if they appear to be casual conversation. Examples of investigation results that require conversion:
- "I've reviewed all the emails..."
- "We checked the logs and..."
- "No additional impact found..."
- "Investigation is complete..."
- "Only X machines were affected..."
- "No other users were compromised..."
For these types of messages, use "Investigation Finding" as the timestamp_desc and include relevant MITRE ATT&CK TTPs.
Example:
Message: "I've completed reviewing all emails involved in the phishing campaign and saw no additional impact."
Should create:
{"message": "Email investigation complete: No additional compromise identified from phishing campaign [T1566.001]", "datetime": "2024-11-08T17:44:19Z", "timestamp_desc": "Investigation Finding", "observer_name": "dan"}
Keyword/Pattern Recognition:
- "validate connectivity"
- "check connection"
- "test access"
- "verify connection"
- "ping" (when used in a network context)
Combine these with indicators like domain names (sev1.com), usernames (jsmith), IP addresses, etc.
Contextual Analysis:
Even if a message appears conversational, analyze the context for security implications. If the sentence mentions activities related to network verification, user validation, or access testing, it should be treated as LLM Required.
Default to Security Relevance:
When in doubt, lean towards classifying a message as LLM Required if it contains any technical indicators or action verbs related to security investigations.
Helping LLMs understand your specific requirements is crucial, especially when you’re using them for tasks outside of their ‘standard’ knowledge domain. Just remember you can go less if it’s a general topic LLM’s already know about, but you’ll need to go more in-depth for custom solutions that may not have a large background. You can obviously also fine-tune as well, but I haven’t gone that path yet.
Platform Wars
As of the end of 2024, AI/LLM wars are real. Almost weekly a new model is being delivered and it’s extensively better than the prior models. Cost comes with new models, but they over time have been reduced. In all the work I’ve done, I’ll outline what I’ve found (personal opinion, everyone is different).
Coding
Claude, and even as of the end of November, it’s not even close. Claude 3.5 sonnet is hands down the best coding model on the market. The closest I’ve seen so far has been Gemini’s latest model (exp 1121), but even that is not as consistent and fast as Claude 3.5.
Summarization / Task
Gemini, I prefer it to GPT, but GPT works as well. Gemini I’ve found overall to be really good at “reasoning” ie. I’ve used it in some tools we’ll discuss later for reasoning on detections and it works well. Gemini 1.5 Pro, and the experimental models are excellent in this space. I use Gemini for my reasoning models behind my URL scanning tools and Sketch Chat.
GPT is as well, I just found pricing/speed to be better with Gemini.
General Chat & Conversational
Gemini & GPT again, both are excellent in this regard. Sometimes I’m brainstorming and want to put ideas down that I want to track and expand on, both these platforms are excellent for it, including their conversational models.
One thing to take into consideration is obviously cost. When building LLM’s into your tools, really understand what they need to do. Are they doing a quick once-over of some HTML? Something like cheap Gemini Flash works well. Do you need something capable of making decisions based on detections, URL screenshots, and scan outcomes? You’ll want more reasoning with Gemini Pro/GPT 4.0 or their o1 models. They’ll do a better job scoring things and have far fewer misses.
That’s a wrap for part 1! Stay tuned for my part 2 where we’ll discuss:
- pyShares and SentryShares
- QSI — URL scanning platform
- Discuss best models & best ways to get up and running quickly