Over the past year, the team spent some time looking into Adobe Acrobat. We found multiple vulnerabilities of varying criticality, and many of them are worth talking about. One particularly interesting vulnerability is worth detailing in public.
Before diving into details, let us see a typical use of the functions streamFromString() and stringFromStream() to encode and decode strings:
The function stringFromStream() expects a ReadStream object obtained by a call to streamFromString(). This object is implemented natively in C/C++. It is quite common for clients of native objects to expect certain behavior and overlook unexpected cases. We tried to see what happens when stringFromStream() receives an object that satisfies the ReadStream interface but behaves unexpectedly, for example by returning invalid data that can’t be decoded back using Shift JIS. This is how the bug was initially discovered.
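To make the idea concrete, here is a minimal Python sketch of the same situation, not Acrobat's implementation. It assumes only what the write-up states: read() returns a hex-encoded string and the consumer decodes it with a named codec. The names FakeReadStream and string_from_stream are hypothetical, and Python's strict shift_jis codec stands in for a well-behaved decoder.

```python
class FakeReadStream:
    """Hypothetical stand-in for a ReadStream: read() hands back hex text."""

    def __init__(self, hex_data: str):
        self._hex = hex_data

    def read(self, n_bytes: int) -> str:
        # Each byte is represented by two hex characters.
        chunk = self._hex[: 2 * n_bytes]
        self._hex = self._hex[2 * n_bytes:]
        return chunk


def string_from_stream(stream, codec: str) -> str:
    """Drain the stream, undo the hex encoding, then decode strictly."""
    raw = bytearray()
    while True:
        chunk = stream.read(1024)
        if not chunk:
            break
        raw += bytes.fromhex(chunk)
    # A strict decoder rejects malformed input instead of guessing.
    return raw.decode(codec, errors="strict")


if __name__ == "__main__":
    good = "日本語".encode("shift_jis").hex()
    print(string_from_stream(FakeReadStream(good), "shift_jis"))
    try:
        # 0xfc followed by 0x23 is not a valid Shift JIS sequence.
        string_from_stream(FakeReadStream("fc23"), "shift_jis")
    except UnicodeDecodeError as exc:
        print("rejected:", exc.reason)
```

A decoder that fails loudly like this is the behavior a caller would hope for; the rest of the analysis shows what happens when the failure is silently swallowed instead.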
2. PROOF OF CONCEPT
It passes an object with a read() method to stringFromStream(). This method returns an invalid Shift JIS byte sequence that begins with the bytes 0xfc and 0x23. After running the code, some random memory data was dumped to the debug console, which may include some recognizable strings (the output will differ on different machines):
Surprisingly, this bug does not trigger an access violation or crash the process; we will see why. One useful heuristic for automatically detecting such bugs is to measure the entropy of the function's output. Typically, the output entropy will be high if we pass input with high entropy. Output with low entropy could therefore be an indication of a memory disclosure.
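The entropy heuristic can be sketched in a few lines of Python. This is only an illustration of the idea: the function names and the 2.0 bits-per-byte threshold are assumptions chosen for the example, not values taken from our tooling.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def looks_like_disclosure(input_data: bytes, output_data: bytes,
                          drop: float = 2.0) -> bool:
    """Flag outputs whose entropy is much lower than the input's.

    `drop` is a hypothetical threshold; it would need tuning per target.
    """
    return shannon_entropy(input_data) - shannon_entropy(output_data) >= drop


if __name__ == "__main__":
    fuzz_input = bytes(range(256))      # every byte value once: 8 bits/byte
    suspicious = b"\x00" * 64           # constant output: 0 bits/byte
    print(shannon_entropy(fuzz_input), shannon_entropy(suspicious))
    print(looks_like_disclosure(fuzz_input, suspicious))
```

In a fuzzing harness, a check like this would run on every decoded result, and flagged cases would still need manual triage since low entropy alone proves nothing.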
3. ROOT CAUSE ANALYSIS
To find the root cause of the bug, we will trace the call to stringFromStream(), which is implemented natively in the EScript.api plugin. This is the decompiled pseudocode of the function:
This function decodes the hex string returned by ReadStream’s read() and checks whether the encoding is a CJK encoding or one of a few other encodings, such as the single-byte Windows-1256 (Arabic). It then creates an ASText object from the encoded string using ASTextFromSizedScriptText(). The exact layout of the ASText object is undocumented and we had to reverse engineer it:
The u_str field is a pointer to a Unicode UCS-2/UTF-16 encoded string, and mb_str stores the non-Unicode encoded string. ASTextFromSizedScriptText() initializes mb_str. The string mb_str points to is lazily converted to u_str only if needed.
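The lazy conversion can be pictured with a small Python model. Only the field names u_str and mb_str come from our reverse engineering; the class name, the codec field, and the conversion counter are hypothetical, and a Python str stands in for the UTF-16 buffer.

```python
class ASTextModel:
    """Toy model of an ASText-like object with a lazily filled u_str."""

    def __init__(self, mb_bytes: bytes, codec: str):
        self.mb_str = mb_bytes   # non-Unicode encoded bytes, set at creation
        self.codec = codec       # hypothetical: the host encoding of mb_str
        self.u_str = None        # Unicode form, filled in on first use
        self.conversions = 0     # hypothetical counter, for illustration only

    def get_unicode(self) -> str:
        # Convert only once, and only when a caller asks for Unicode.
        if self.u_str is None:
            self.u_str = self.mb_str.decode(self.codec)
            self.conversions += 1
        return self.u_str


if __name__ == "__main__":
    text = ASTextModel("日本".encode("shift_jis"), "shift_jis")
    print(text.u_str)            # None: nothing converted yet
    print(text.get_unicode())    # triggers the conversion
    print(text.get_unicode())    # served from the cached u_str
    print(text.conversions)      # 1
```

The relevant point for this bug is that creation and conversion are separated, so invalid bytes can be stored without complaint and only cause trouble later.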
It is worth noting that ASTextFromSizedScriptText() does not validate the encoded data apart from looking for the end of the string by locating the null byte. This works fine because 0x00 maps to the same codepoint in all the supported encodings, as they are all supersets of ASCII and no multibyte codepoint uses 0x00.
The function ASTextGetUnicode(), which is implemented in AcroRd32.dll, lazily converts mb_str to u_str if u_str is NULL and then returns the value of u_str:
The function we named convert_mb_to_unicode() is where the conversion happens. It is referenced by many functions to perform the lazy conversion:
The initial call to Host2UCS() computes the size of the buffer required to perform the decoding. Then, it allocates memory, calls Host2UCS() again for the actual decoding and terminates the decoded string. The function change_u_endianness() swaps the byte order of the decoded data. We need to keep this in mind for exploitation.
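The effect of a byte-order swap on 16-bit code units can be sketched as follows. This is a Python illustration of the general technique, not the body of change_u_endianness(); the function name is hypothetical. Since the swap is applied to the decoded buffer, any bytes later read back from it would presumably appear pairwise swapped.

```python
def swap_u16_byte_order(buf: bytes) -> bytes:
    """Swap the two bytes of every 16-bit unit in buf."""
    if len(buf) % 2:
        raise ValueError("buffer length must be a multiple of 2")
    swapped = bytearray(len(buf))
    swapped[0::2] = buf[1::2]   # high bytes move to even offsets
    swapped[1::2] = buf[0::2]   # low bytes move to odd offsets
    return bytes(swapped)


if __name__ == "__main__":
    little = "AB".encode("utf-16-le")          # b'A\x00B\x00'
    print(swap_u16_byte_order(little))         # same text in big-endian order
    # Applying the swap twice restores the original bytes.
    print(swap_u16_byte_order(swap_u16_byte_order(little)) == little)
```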
The initial call to Host2UCS() computes the size of the buffer needed for decoding:
First, Host2UCS() calls MultiByteToWideChar() with the flag MB_ERR_INVALID_CHARS set to get the size of the buffer required for decoding. This flag makes MultiByteToWideChar() fail if it encounters an invalid byte sequence, so this call fails with our invalid input data. Next, it calls MultiByteToWideChar() again, this time without the flag, which means the function successfully returns to convert_mb_to_unicode().
When the first call to Host2UCS() returns, convert_mb_to_unicode() allocates the buffer and calls Host2UCS() again for the actual decoding. In this call, Host2UCS() will try to decode the data with MultiByteToWideChar() again with the flag MB_ERR_INVALID_CHARS set, and this will fail as we have seen earlier.
This time it will not call MultiByteToWideChar() again because u_str_size is not zero and the if condition is not met. This makes Adobe Reader fall back to its own decoder:
Initially, it calls PDEncConvAcquire() to allocate a buffer for holding the context data required for decoding. Then it calls PDEncConvSetEncToUCS(), which looks up the character map for the codec. However, this call always fails and returns zero. As a result, the call to PDEncConvXLateString() is never reached and the function returns with u_str uninitialized.
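The whole flow can be condensed into a small Python model. This is a sketch under stated assumptions, not the decompiled code: Python's strict and "replace" error modes stand in for MultiByteToWideChar() with and without MB_ERR_INVALID_CHARS, passing out=None models the size query (u_str_size == 0), the failing fallback decoder is reduced to "write nothing", and the stale argument is a hypothetical stand-in for whatever the allocator happens to return.

```python
def host2ucs(mb: bytes, out, codec: str = "shift_jis") -> int:
    """Model of the two-mode call: size query (out=None) or conversion."""
    try:
        text = mb.decode(codec, errors="strict")   # strict attempt
    except UnicodeDecodeError:
        if out is None:
            # Size query: retry leniently, so a non-zero size is reported.
            text = mb.decode(codec, errors="replace")
        else:
            # Real conversion: no lenient retry. The fallback decoder
            # fails, so nothing is ever written to `out`.
            return 0
    encoded = text.encode("utf-16-le")
    if out is not None:
        out[: len(encoded)] = encoded
    return len(encoded)


def convert_mb_to_unicode(mb: bytes, stale: bytes) -> bytes:
    """Size query, allocation, then conversion, as described in the text."""
    size = host2ucs(mb, None)
    # Model an uninitialized allocation: the buffer starts out holding
    # leftover bytes (assumes `stale` is non-empty).
    buf = bytearray((stale * (size // len(stale) + 1))[:size])
    # Assumption of this model: the result of the second call is not checked.
    host2ucs(mb, buf)
    return bytes(buf)


if __name__ == "__main__":
    heap = b"LEFTOVER-HEAP-BYTES"
    print(convert_mb_to_unicode(b"abc", heap))                # decoded text
    print(convert_mb_to_unicode(bytes.fromhex("fc23"), heap)) # stale bytes
```

With valid input the buffer is overwritten by the decoded text; with the invalid sequence the size query succeeds but the conversion writes nothing, so the caller receives whatever was already in the buffer.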
The failing function, PDEncConvSetEncToUCS(), initially maps the codepage number to the name of the Adobe Reader character map in the global array CJK_to_UC2_charmaps. For example, Shift JIS maps to 90ms-RKSJ-UCS2:
Once the character map name is resolved, it passes the character map name to sub_6079CCB6():
The function sub_6079CCB6() calls PDReadCMapResource() with the character map name as an argument inside an exception handler.
The function PDReadCMapResource() is where the exception is triggered. This function fetches a relatively large data structure stored in the current thread's local storage area:
It checks for a dictionary within this structure and creates one if it does not exist. Then, it checks for an STL-like vector and creates it too if it does not exist. This dictionary stores the decoder data, and its entries are looked up by the ASAtom of the character map name (90ms-RKSJ-UCS2 in our case). The vector stores the names of the character maps as ASAtoms.
The code that follows is where the exception is triggered:
It looks up the dictionary using the character map name. If the character map is not in the dictionary, it is not expected to be in the vector either; otherwise an exception is triggered. In our case, the character map 90ms-RKSJ-UCS2 (atom 0x1366) is not in the dictionary, so ASDictionaryFind() returns NULL. However, if we dump the vector, we find it there, and this is what causes the exception:
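The consistency check described in this section can be modeled in Python as follows. This is a hedged sketch, not the decompiled logic: a dict and a list stand in for the ASDictionary and the STL-like vector, the function and exception names are hypothetical, and the behavior on a clean first-time lookup (registering the name and creating an entry) is an assumption made only to keep the example complete.

```python
class CMapStateError(Exception):
    """Raised when the name vector and the dictionary disagree."""


def read_cmap_resource(name, cmap_dict, cmap_names):
    """Look up a character map; refuse inconsistent bookkeeping."""
    entry = cmap_dict.get(name)
    if entry is None:
        if name in cmap_names:
            # Known to the vector but missing from the dictionary.
            raise CMapStateError(name)
        # Assumption: a first-time lookup registers and loads the map.
        cmap_names.append(name)
        entry = cmap_dict[name] = {"name": name}
    return entry


def set_enc_to_ucs(name, cmap_dict, cmap_names) -> int:
    """Model of a caller that swallows the exception and reports 0."""
    try:
        read_cmap_resource(name, cmap_dict, cmap_names)
    except CMapStateError:
        return 0
    return 1


if __name__ == "__main__":
    # Inconsistent state: the name is in the vector, not in the dictionary.
    print(set_enc_to_ucs("90ms-RKSJ-UCS2", {}, ["90ms-RKSJ-UCS2"]))  # 0
    # Consistent state: nothing is known about the name yet.
    print(set_enc_to_ucs("90ms-RKSJ-UCS2", {}, []))                  # 1
```

Because the exception is caught by the caller and turned into a plain failure code, the decoder setup fails quietly every time the two structures are out of sync.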
In conclusion, we've demonstrated how we analyzed and root-caused the vulnerability in detail by reversing the code.
Encodings are generally hard for developers to implement. The constant need for encoders and encodings makes them a ripe area for vulnerability research, as every format has its own encoders.
That’s it for today, hope you enjoyed the analysis. As always, happy hunting!
10 – 8 – 2020 – Vulnerability reported to vendor.
31 – 10 – 2020 – Vendor confirms the vulnerability.
3 – 11 – 2020 – Vendor issues CVE-2020-24427 for the vulnerability.
CVE-2020-24427: Adobe Reader CJK Codecs Memory Disclosure Vulnerability