Just a month after we published the Insights post “Digging into Gmail URLs”, Google made the use of their new Gmail interface mandatory. The old Gmail interface (let’s call it the “legacy” interface) had been in use for years, so even though it is no longer available online we expect to be dealing with it within our electronic evidence for years to come. The new Gmail interface includes not only considerable visual changes, but changes in URLs which impacted the Gmail URL decoding we discussed in our previous Insights post.
Throughout this Insights post, we will discuss significant differences between URLs related to the legacy and new Gmail interfaces (hereafter, the legacy and new Gmail URLs) as well as the process of decoding information from the now “obfuscated” URLs. By doing so, we will be able to effectively extract important information from both the legacy and new Gmail URLs.
Executive Summary (a/k/a TL;DR Summary)
We have found that it is still possible, with some additional effort, to extract human-readable and useful information from URLs used by the new Gmail interface. New Gmail URLs related to messages being viewed can be decoded in a way that reveals timestamps, as before (with the exception of messages sent and received from the same account starting in March 2018). Unfortunately, we have not been able to identify composition timestamps in new Gmail URLs as we could in legacy Gmail URLs. We have developed a Python tool (Gmail URL Decoder) which incorporates our research and allows us to extract useful information from both the legacy and new Gmail URLs.
About the new URL format
The basic structure of the new Gmail URL format (user number, folders, and search) is the same as it was before, so you are encouraged to revisit the previous Insights post we published on this topic – Digging into Gmail URLs. Most of the fields already explained are still valid and we will focus on those fields that have changed between legacy and new Gmail URLs.
Message viewing
Here we can see the URL of the same message being viewed in both legacy and new interfaces.
It is evident that what has changed is what we previously called the message ID, which we will now refer to as the legacy view token. Remember that with the legacy interface we were able to extract timestamps from the URLs of messages being viewed per the legacy view token. Now let’s explore and compare different legacy view tokens and their corresponding timestamps as well as the new view token that is present in the new URLs for the same messages.
Legacy View Token | Timestamp (UTC) | New View Token |
---|---|---|
16426a350b26c62c | 2018-06-22 08:36:34 | FMfcgxvwzcMvCVqtTprDSvtNVBhnMBzq |
16426a314de05db3 | 2018-06-22 08:36:19 | FMfcgxvwzcMvCVqtRdXPzwXBsWzvKhJC |
163be85d78c6008b | 2018-06-02 03:23:52 | FMfcgxvwzRXLNMWwJgdjMFzrQGrPLLSF |
163bb9faac3baab8 | 2018-06-01 13:53:13 | FMfcgxvwzRWCxJJrmmrjnGxqpXXDqrpn |
163bb010791228bf | 2018-06-01 10:59:56 | FMfcgxvwzRWCnmszhDdjSTmnTZMLHQrP |
163b569b52f1edb4 | 2018-05-31 08:56:33 | FMfcgxvwzJHWbzrDlQjhhrMnmRdjFjnD |
1632035405b7682f | 2018-05-02 09:35:50 | FMfcgxmZVXvrgkcmfQNccpTZsGhxCSGZ |
1631f9c3a74cd345 | 2018-05-02 06:48:42 | FMfcgxmZVXvrWPNgNTwdRlPCnJbFddNt |
1631b6d1d5505143 | 2018-05-01 11:18:45 | FMfcgxmZVXtjnTSSxQfqCDnVNdcgKSnF |
12f356c653c78c4c | 2011-04-08 14:03:06 | FMfcgxRvprtNNfKFPzvXTwHcmpvzKwZB |
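The relationship between the first two columns of the table can be sketched in a few lines of Python. This is a minimal sketch, assuming that the upper 44 bits of the 64-bit legacy view token hold a Unix timestamp in milliseconds. That layout is an assumption that happens to reproduce the table rows checked here, and the function name is purely illustrative.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def legacy_token_to_datetime(token_hex):
    """Read the upper 44 bits of a 64-bit legacy view token as Unix milliseconds."""
    value = int(token_hex, 16)
    millis = value >> 20  # assumption: the 20 low-order bits are not part of the timestamp
    return EPOCH + timedelta(milliseconds=millis)

for token in ("16426a350b26c62c", "12f356c653c78c4c"):
    print(token, legacy_token_to_datetime(token).strftime("%Y-%m-%d %H:%M:%S"))
# 16426a350b26c62c 2018-06-22 08:36:34
# 12f356c653c78c4c 2011-04-08 14:03:06
```

Integer arithmetic with `timedelta` is used instead of dividing by 1000 so that no floating-point rounding creeps into the result.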
It looks like there is some pattern behind the new view tokens since we have some initial characters that are always the same. We can also see that tokens which are particularly close in time have even more characters in common. Finally, we observe that roughly the second half of each token appears to be completely random, even for those extremely close in time.
Our first intuition was that the new view token is a direct transformation of the legacy view token, that is, that it also contains some kind of numeric timestamp representation. The two have a similar abstract structure: numeric timestamps that are close in time share their most significant digits, while their least significant digits differ completely, even when the timestamps are very close.
We noticed a problem with this assumption: if you compare the last entry in the table above (from a message received in 2011) to any other one, you will notice that the legacy view token’s most significant digits are not really close to all the other ones (samples from 2018) but we still have strong similarity in the first characters between the new view tokens.
The simplest (and correct) explanation is that, assuming the new view token is some kind of transformation of the legacy view token, “something” fixed gets prepended to the numeric timestamp representation before it is transformed into the new view token.
That is, instead of simply having:
numeric timestamp (from legacy view token) → transformation → new view token
we would be dealing with:
something + numeric timestamp (from legacy view token) → transformation → new view token
Unveiling the transformation process
Initially, we tried the simplest approach to recover the legacy view token from the new view token: leaving the first characters fixed (changing the number of fixed characters), assigning numeric values to characters, and performing different mathematical operations between them in many ways.
Those simple transformations, and combinations of them, did not lead to any relevant result. However, as we explored many new tokens we noticed two interesting things:
There are no vowels present in any of the tokens (neither upper nor lower case)
After having been involved in many “Capture the Flag” challenges at security conferences and other events, our CTF-trained vision started to scream out that those tokens looked like something familiar: base64 encoded strings (although no padding characters “=” or “==” were present at all)
Unsurprisingly, directly applying base64 decoding to the tokens did not produce anything close to readable. It was around this time that we had a revelation and decided to take a look at the Gmail website source and… eureka!
Embedded in the source we found an obfuscated JavaScript function that performs some operations using exactly what we were looking for:
A reduced character set with letters, not including any vowel: BCDFGHJKLMNPQRSTVWXZbcdfghjklmnpqrstvwxz
A call to the atob() method, which is the native JavaScript method for decoding base64-encoded strings
References to some constant strings which include “thread-” and “msg-”
If we extract that function and run it against our new view token we obtain a pleasing result.
Here we see that thread-f: is the prefix we speculated to be present, which is followed by a large integer that turns out to be the decimal representation of the (hexadecimal) legacy view token value.
The obfuscated function has been beautified, analyzed, and deobfuscated to understand the transformation algorithm, and then rewritten in Python to make it easier to work with as a simple script. In addition, it has been reversed so that a legacy view token can be encoded into a new view token. The reverse process turned out to be fairly simple: take the legacy token, encode it with base64, remove the padding characters, and apply the same transformation algorithm with the two character sets interchanged (the reduced one in place of the full one, which also appears in the obfuscated code).
With this, now we can not only recover the legacy token from the new one, but can also get the associated timestamp for each message that is being viewed.
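The decoding and encoding steps described above can be sketched as follows. This is not the Gmail URL Decoder source. It is a minimal sketch resting on three assumptions: that the transformation between character sets is a plain positional base change (base 40 to base 64 and back), that the "full" character set is the standard base64 alphabet, and that the "thread-" prefix is restored by the decoder rather than stored inside the token. All function names are illustrative.

```python
import base64

# Assumption: the "full" set is the standard base64 alphabet used by atob().
FULL = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
# The vowel-free set quoted from the obfuscated function.
REDUCED = "BCDFGHJKLMNPQRSTVWXZbcdfghjklmnpqrstvwxz"
THREAD_PREFIX = "thread-"  # assumption: added back after decoding, not stored in the token

def _rebase(text, src, dst):
    """Re-express a big-endian number written in alphabet src using alphabet dst."""
    number = 0
    for ch in text:
        number = number * len(src) + src.index(ch)
    digits = []
    while number:
        number, rem = divmod(number, len(dst))
        digits.append(dst[rem])
    return "".join(reversed(digits))

def decode_view_token(token):
    b64 = _rebase(token, REDUCED, FULL)
    b64 += "=" * (-len(b64) % 4)  # restore the padding that was stripped
    return THREAD_PREFIX + base64.b64decode(b64).decode("ascii")

def encode_view_token(decoded):
    if decoded.startswith(THREAD_PREFIX):
        decoded = decoded[len(THREAD_PREFIX):]
    b64 = base64.b64encode(decoded.encode("ascii")).decode("ascii").rstrip("=")
    return _rebase(b64, FULL, REDUCED)

legacy = "16426a350b26c62c"
decoded = "thread-f:" + str(int(legacy, 16))
token = encode_view_token(decoded)
print(token[:6], len(token))                # FMfcgx 32
print(decode_view_token(token) == decoded)  # True
```

If these assumptions hold, calling `decode_view_token` on a new view token from the table should return `thread-f:` followed by the decimal value of the matching legacy view token. Python's arbitrary-precision integers keep the base change short, where the JavaScript original has to carry digits by hand.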
A subtle exception
We have found one (and only one) kind of new view token that differs substantially from all the others: the token for a message sent and received from the same account (that is, when a user sends a message to themselves). In this case, the new URLs not only differ from all the others of the kind “FMfcgx…” but can also differ from one another.
The good news is that now we know the algorithm that is being used in order to decode those tokens, so we can apply it directly to see what’s going on. Let’s see some examples:
Legacy view token | New view token | Decoded new view token |
---|---|---|
166a1a6e77903f6e | QgrcJHsTnPbZsLtzKKnZnjSgWbTsJbgWmNL | thread-a:r-7956246057532141946 |
166316e92fe65b70 | KtbxLwHDkcrRDgFknhGFpwWXVSwWHDmHRL | thread-a:r4668072437738941715 |
165a2275b18f1824 | FFNDWLvlXsCvcpmszhgMDVsFSpjtQQCM | thread-a:r107250775695396031 |
164b2fecb949bb9f | RdDgqcJHpWcvcDjNzvQsFnntzrvDCJKNtfSZkCQmnpbB | thread-a:mmiai-r-1311628439981527388 |
As we can observe, the differences are due to the fact that these special new view tokens are decoded into strings that contain “a” and not “f” just after the “thread-” prefix. Moreover, the following integer is preceded by an “r” (and sometimes even the “mmiai-” prefix as well) and appears to be either positive or negative and not following any particular order that would clearly reveal the presence of an associated timestamp as before.
It is also important to point out that, for messages sent to the same account, the new view tokens differ from all the others only when the message was received in March 2018 or later. That is, before March 2018, even messages that were sent to the same account had the same “FMfcgx…” format as all the other ones. If anyone reading this Insights post could provide any relevant data to refine the exact date, that would be helpful!
What’s more interesting is that in late April 2018, Google introduced the new Gmail interface to the public. This strongly suggests that “thread-a” prefixes would be related only to the new interface whilst “thread-f” prefixes would have some kind of cross-compatibility with previous URL tokens as we can go back and forth between the legacy and new formats.
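In practice, this means a decoded view token has to be classified before any timestamp is read from it. The sketch below works on already-decoded strings such as those in the table above. It assumes, as before, that the upper 44 bits of the legacy value are Unix milliseconds, and the function name and dictionary keys are hypothetical.

```python
import re
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
THREAD_F = re.compile(r"thread-f:(\d+)")

def describe_decoded_view_token(decoded):
    """Return the token kind plus, for thread-f only, the legacy token and timestamp."""
    match = THREAD_F.fullmatch(decoded)
    if match is None:
        # thread-a values (self-sent messages) do not appear to carry a timestamp
        return {"kind": decoded.split(":", 1)[0], "legacy_token": None, "timestamp": None}
    value = int(match.group(1))
    return {
        "kind": "thread-f",
        "legacy_token": format(value, "x"),
        # assumption: the upper 44 bits are Unix milliseconds
        "timestamp": EPOCH + timedelta(milliseconds=value >> 20),
    }

cross_compatible = "thread-f:" + str(int("16426a350b26c62c", 16))
print(describe_decoded_view_token(cross_compatible)["timestamp"])
print(describe_decoded_view_token("thread-a:r-7956246057532141946"))
```

Returning `None` for the `thread-a` case, rather than attempting a conversion, avoids presenting an arbitrary number as if it were evidence of a date.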
Message composition
We will now take a look at URLs being generated when composing a new message. The basic structure is the same as before, but now the compose token is not a hexadecimal value from which we can get an associated timestamp as we did with old compose tokens, but an encoded value that again looks like a kind of base64 encoded string. Let’s see a few examples with the new compose token, which can also appear combined with other tokens:
We didn’t waste any more time and tried applying our decoding method, giving us satisfactory results:
New compose token | CllgCKCJFLWMHHkqtnWGvdrBDXZKVTcXMWKvVZDHQXJqkKTMZXXdTVPFRXtxPTDkmBrZsxXRRCL |
---|
Decoded new compose token | thread-a:r8973701401802692499+msg-a:r8975353885019376627 |
---|
New compose token | jrjtXRFMDcbWcnHMLDLkbXkhVVJQlzmQBLdSfbgPFPCHfGlZqfbblkPSMnkNztHdfWpSfTFB |
---|
Decoded new compose token | thread-a:r703070513838286583+msg-a:r1256652636240605348 |
---|
In cases where we composed more than one message at the same time, the encoded version would just be larger and we would see the information from all messages being composed concatenated as soon as we decoded it:
New compose token | LRmDGdgRzcLfWjNkTNxQdWRnQpVgBvJDhgJhSnBMGCVrJgBxhLbdJGNqXXHzQWmqNHxjwmFhZDLtbvcFltDSKFLdPlxBzShTnSldcFllpmVvvMPBCGsrFTSmgfFwHhjNDwBgjqKWQHkRcBttPcMxDKBNmhBHhM |
---|
Decoded new compose token | a:r703070513838286583+msg-a:r1256652636240605348,thread-a:r5676369719463841496+msg-a:r4760743514676815931 |
---|
Notice as well that when more than one message is being composed, the first “thread-” prefix is missing (bug or feature?).
The bad news here is that, as you probably already guessed, we are dealing with new “a” elements (thread-a and msg-a) which don’t look like they contain any associated timestamps. You can observe that the values following these prefixes look totally unrelated and arbitrarily spread out among a huge range of values, including negatives. What’s more, in this new interface, the URL does not get updated as drafts are being automatically saved, as was the case with the legacy interface. Remember that the large number of composition-related URLs related to a single message was our first indicator that a timestamp was present in legacy Gmail URLs. This basically means that, even assuming that there were timestamps contained within these values, we would only be able to reveal a timestamp for when a message was first being composed as opposed to when each draft was saved. This is an academic issue of course, because it does not appear that timestamps are contained in these values.
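Even without timestamps, the decoded compose tokens have a regular structure that is easy to split apart. The following sketch assumes that drafts are separated by "," and that the thread and message identifiers within a draft are joined by "+", as in the decoded examples above. The function name and dictionary keys are hypothetical.

```python
def parse_compose_token(decoded):
    """Split a decoded compose token into one {thread, msg} entry per draft."""
    drafts = []
    for item in decoded.split(","):
        thread, _, msg = item.partition("+")
        if not thread.startswith("thread-"):
            # the first "thread-" prefix is missing when several drafts are open
            thread = "thread-" + thread
        drafts.append({"thread": thread, "msg": msg})
    return drafts

decoded = ("a:r703070513838286583+msg-a:r1256652636240605348,"
           "thread-a:r5676369719463841496+msg-a:r4760743514676815931")
for draft in parse_compose_token(decoded):
    print(draft["thread"], draft["msg"])
# thread-a:r703070513838286583 msg-a:r1256652636240605348
# thread-a:r5676369719463841496 msg-a:r4760743514676815931
```

Normalizing the missing prefix makes it possible to match a draft seen in a multi-draft URL against the same draft seen in a single-draft URL.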
Gmail URL Decoder: extracting useful information from Gmail URLs
With all the knowledge we gathered during research on Gmail URLs from both legacy and new interfaces, we have been able to extract information that has been useful in our cases.
Our research has led to the development of a Python 3 tool called Gmail URL Decoder. Gmail URL Decoder finds Gmail URLs within both text files and raw data files and then effectively extracts and decodes useful information that they contain.
Here you can see Gmail URL Decoder running against a text file containing a list of URLs (that could have been previously extracted from an active history file by another digital forensics tool):
Gmail URL Decoder can also carve for Gmail URLs in raw data files with no assumed format at all. In order to accomplish this, heuristics are used in order to delimit the boundaries of the found URLs.
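To give an idea of what such carving can look like, here is a deliberately simple sketch. It does not reproduce the heuristics used by Gmail URL Decoder: it assumes that URLs start with `https://mail.google.com/mail/` and end at the first byte that is a control character, whitespace, a quote, an angle bracket, or non-ASCII. The pattern, the function name, and the sample data are illustrative only.

```python
import re

# Assumption: a URL ends at the first control/whitespace byte, quote,
# angle bracket, or non-ASCII byte following the fixed Gmail prefix.
GMAIL_URL = re.compile(
    rb"https://mail\.google\.com/mail/[^\x00-\x20\x22\x27\x3c\x3e\x7f-\xff]*"
)

def carve_gmail_urls(raw):
    """Return every Gmail-looking URL found in a bytes buffer, in order."""
    return [m.group(0).decode("ascii") for m in GMAIL_URL.finditer(raw)]

blob = (
    b"\x00\x10junk https://mail.google.com/mail/u/0/#inbox/"
    b"FMfcgxvwzcMvCVqtTprDSvtNVBhnMBzq\x00\xffmore junk "
    b'<a href="https://mail.google.com/mail/u/0/#inbox/16426a350b26c62c">'
)
for url in carve_gmail_urls(blob):
    print(url)
```

For a large image file, the same pattern could be applied to overlapping chunks (or to a memory-mapped file) so that the whole file never has to be loaded at once.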
As a test of Gmail URL Decoder’s carving functionality, we ran it against a Windows 10 x64 (build 1803) virtual machine file (VDI). We used verbose mode to print the results in the terminal and compact mode to save the output file without any indentation, as plain JSON that is not intended to be human friendly. We also explicitly indicated that we only expected to find URLs from the new interface in our raw data file:
We could then load the output into other applications (we demonstrated this previously with Firefox). Now we will see how to do this with Python in order to work with the data as desired.
As we can see, the output file gets trivially deserialized into native objects (in the case of Python, a list containing dictionaries, each dictionary containing all the information from each result found). The same can be accomplished in virtually any language with a JSON module.
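A minimal sketch of that deserialization step is shown below. Because the real output schema is not reproduced in this post, the keys `url` and `timestamp_utc` and the file name `output.json` are hypothetical stand-ins. The sample record is written in compact form first so that the example is self-contained.

```python
import json
import os
import tempfile

def load_results(path):
    """Deserialize a JSON output file into a list of dictionaries."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)

# Hypothetical record: these key names are NOT taken from Gmail URL Decoder.
sample = [
    {
        "url": "https://mail.google.com/mail/u/0/#inbox/16426a350b26c62c",
        "timestamp_utc": "2018-06-22 08:36:34",
    }
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "output.json")
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(sample, fh, separators=(",", ":"))  # compact, no indentation
    results = load_results(path)

for result in results:
    print(result["timestamp_utc"], result["url"])
```

Once loaded, the results are ordinary Python objects and can be filtered, sorted by timestamp, or exported to another format as needed.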
After a two-week “Early Release” on our Downloads page, the Gmail URL Decoder source code (along with detailed explanations of Gmail URL Decoder options) will be publicly available under the MIT License in the following GitHub repository:
https://github.com/ArsenalRecon/GmailURLDecoder
Summary
We have found that it is still possible, with some additional effort, to extract human-readable and useful information from URLs used by the new Gmail interface. URLs related to Gmail messages being viewed can be decoded in a way that reveals timestamps, as before (with the exception of messages sent and received from the same account starting in March 2018).
Unfortunately, for URLs related to Gmail messages being composed, we have not found timestamps within new Gmail URLs as we had in the legacy Gmail URLs.
Finally, we have developed a Python tool (Gmail URL Decoder) which incorporates our research and allows us to extract useful information from both the legacy and new Gmail URLs. You can find Gmail URL Decoder on our Downloads page now and we will place it on GitHub soon.