MSc. Thesis Defense: Emre Ekmekçioğlu

BEHAVIORAL CLASSIFICATION OF MALWARE VIA WINDOWS API CALL SEQUENCES USING LARGE LANGUAGE MODELS

 

Emre Ekmekçioğlu
Cyber Security, MSc. Thesis, 2025


Thesis Jury
Asst. Prof. Orçun Çetin (Thesis Advisor)
Asst. Prof. Süha Orhun Mutluergil
Prof. Dr. Budi Arief

Date & Time: 18 June 2025 11:00 AM
Place: FENS G015

Keywords: Malware, Large Language Model (LLM), Windows API, Sandboxing,
CAPEv2, Dynamic Analysis, Computer Security

Abstract


The growing scale, sophistication, and automation of malicious software pose a critical challenge to global cybersecurity. In 2024, over 6.2 billion malware infections were detected worldwide, with projections reaching 6.5 billion by 2025. Every day, cybersecurity systems detect approximately 560,000 new threats, reflecting the industrial scale of modern cybercrime. In this context, understanding how malware behaves during execution is essential for effective detection and classification. Among dynamic analysis techniques, API call tracing, which captures the interactions between a program and the victim environment, offers valuable insight into the underlying behavioral patterns of malware. However, modeling these long, complex, and often noisy sequences is a significant challenge. To address it, we explore the use of Large Language Models (LLMs), originally developed for natural language processing, because they excel at processing long, sequential data. Their capacity to learn contextual patterns over extended input windows makes them a promising tool for behavior-based malware classification.

This thesis explores the use of fine-tuned LLMs for behavior-based malware classification using dynamic execution traces from modern Windows systems. As traditional static analysis techniques face limitations, dynamic analysis has emerged as a more resilient alternative. However, modeling API call sequences remains challenging because of their sequential complexity and the behavioral overlap between malware types.

In our study, we constructed a balanced dataset of 10,371 Windows portable executable (PE) malware samples, selected from existing public repositories and executed in a Windows 10 sandbox environment to extract API call traces. Each sample was labeled with one of nine behavior-defined malware types, enabling supervised multi-class classification based on runtime behavior. We treated these traces as textual sequences and performed two fine-tuning stages. In the first stage, we fine-tuned four open-source LLMs, Llama-3.1-8B, Mistral-7B-v0.3, Qwen3-8B, and GLM-4-9B-0414, at context lengths of 4k, 8k, 16k, and 32k tokens. In the second stage, we selected the best-performing combination of model and context length from the first stage and subjected it to more compute-intensive fine-tuning to further optimize performance.
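As an illustration of treating a trace as a textual sequence, the following minimal sketch joins API call names into one string and truncates it to a token budget. It is not the pipeline used in the thesis: the function names, the sample trace, and the whitespace-based token count are assumptions made for this example, and a real setup would count tokens with the tokenizer of the model being fine-tuned.

```python
from typing import List


def trace_to_text(api_calls: List[str]) -> str:
    """Join API call names into a single space-separated string."""
    return " ".join(api_calls)


def truncate_to_budget(text: str, max_tokens: int) -> str:
    """Keep only the first max_tokens whitespace-separated tokens.

    Assumption: one whitespace-separated word counts as one token.
    A model tokenizer would usually split API names into several tokens.
    """
    tokens = text.split()
    return " ".join(tokens[:max_tokens])


# Hypothetical trace, invented for this example.
trace = ["NtCreateFile", "NtWriteFile", "NtClose", "RegOpenKeyExW"]
text = trace_to_text(trace)
short = truncate_to_budget(text, 3)
print(short)
```

Under these assumptions, a longer context length simply raises the budget, so more of the execution trace reaches the model before truncation.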

In the first stage, three models, Mistral-7B-v0.3, Llama-3.1-8B, and Qwen3-8B, proved able to learn the classification task. With regard to context length, Mistral-7B-v0.3 and Llama-3.1-8B performed best at 8k tokens, while Qwen3-8B produced equivalent best results at 8k and 16k tokens. Comparing each model's best first-stage result, Mistral-7B-v0.3 achieved the highest overall accuracy at 84.1%, followed by Llama-3.1-8B at 79.1% and Qwen3-8B at 74.4%.

The second-stage fine-tuning of the best-performing model, Mistral-7B-v0.3, led to further performance gains, with the final model achieving 88.7% accuracy and macro and weighted F1-scores of 0.886 and 0.887, respectively. It showed particularly strong results on malware types with distinct and high-volume behaviors, such as Flooder, Crypto-miner, and Ransomware. Classification accuracy was more modest for types such as Adware, Spyware, and Dropper, which often exhibit overlapping or subtler API call patterns.
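For readers unfamiliar with the two F1 averages reported above, the sketch below computes both from lists of true and predicted labels. The macro score averages per-class F1 equally, while the weighted score weights each class by its number of true samples. The function name and the toy labels are assumptions made for this example and are not data from the thesis.

```python
from collections import Counter
from typing import Dict, List, Tuple


def f1_scores(y_true: List[str], y_pred: List[str]) -> Tuple[float, float]:
    """Return (macro F1, weighted F1) over the classes present in y_true."""
    labels = sorted(set(y_true))
    support = Counter(y_true)  # number of true samples per class
    per_class: Dict[str, float] = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0.
        per_class[label] = 2 * tp / denom if denom else 0.0
    macro = sum(per_class.values()) / len(labels)
    weighted = sum(per_class[c] * support[c] for c in labels) / len(y_true)
    return macro, weighted


# Toy labels, invented for this example.
y_true = ["Ransomware", "Ransomware", "Ransomware", "Adware"]
y_pred = ["Ransomware", "Ransomware", "Adware", "Adware"]
macro, weighted = f1_scores(y_true, y_pred)
print(round(macro, 3), round(weighted, 3))
```

On a balanced dataset the class supports are nearly equal, so the two averages are expected to be close, which is consistent with the reported values of 0.886 and 0.887.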

These findings highlight the effectiveness of modern LLMs in learning rich behavioral patterns from API call traces, offering a scalable and feature-free approach to behavioral malware classification.