Sunday, September 27, 2015

Cache side channel attacks

Errata 5.11.2015:
As I expected, a few errors and less than accurate formulations got into this post. There are two things I'd specifically wish to address here.

My comments on using a non-inclusive cache hierarchy to defeat cache attacks are misleading at best. The L1 and L2 caches are per core; it is the shared nature of the L3 combined with the inclusive cache hierarchy that allows us to flush those caches from another core in the first place. However, these caches are relatively small, so with only a "small" amount of memory activity by the victim or other code running on the same core (this includes the attacker if she can get herself executing on that core), rare events can still be effectively spied upon.


On using performance counters to detect cache side channel attacks: I focused only on low frequency attacks in this text; Mr. Herath's and my slides from Black Hat do not. The idea is that high frequency cache attacks are relatively easily identified as such from performance counters (either through the sheer number of cache misses or, say, a ratio of cache misses to cache hits). Alternatively, machine learning methods could be applied to such performance counter data. Low frequency data is much more difficult to keep apart from the noise generated by normal memory usage on a busy system. Say the attacker is looking for keyboard events. For this he polls every 0.5 seconds and has, say, 15 good addresses he polls. That gives us 15 cache misses per 0.5-second period (possibly even spread out over the 0.5 seconds), which isn't much if a video encoder is causing thousands of misses, also going up and down periodically as the encoder switches between motion search (large number of cache misses and lots of hits) and entropy encoding (few hits and misses). This is where the approach I describe here can filter the noise and force the attacker into territory where we can detect him with confidence.

Prelude

No tech stuff in this part. Skip to the introduction to read the actual blog post. If you care about my ramblings, do continue.
It's been 8 awesome months since I started my comeback into infosec with this blog. It's time to take stock. A few times I've felt misunderstood. The most common misunderstanding is that I have a different concept of research than many in the infosec community. This blog isn't research. It is play. I do it for fun and games and that's it. There is no intention on my side to do research, because that would require me to be systematic about it, and that again is outside of my time budget. I hope from time to time to present a new idea here, but that's the loftiest I wish to get with this blog. I do hope that the ambitious infosec newbie might be inspired to look into some of the things I write about, and I also hope that infosec professionals occasionally enjoy a lunch break with my musings.
The most read post so far was the nostalgia post, followed by anything row hammer, followed by anything with machine learning in the title. Ironically, the post I like best on machine learning doesn't have those words in the title and isn't being read much. The post nobody cares about is the one on H.a.r.e.s., which I think is a shame; I admit it's not the most relevant, but H.a.r.e.s. sounds like such a cool idea. The LangSec post was the biggest surprise to me: it's quite well read.
As a personal note - I find the best infosec blogs are by far http://lackingrhoticity.blogspot.com/ by Mark Seaborn and http://www.malwaretech.com/ by MalwareTech. Mark's blog is really inspiring on a technical level, and MalwareTech's joy of finding things out is contagious to me.
For now I’ll go offline for a couple of weeks and hope to spend a lot of time with a print out of an article about using performance counters to reverse engineer caches.

Introduction

This blog post benefitted from conversations with some really smart people, and I'd like to extend my thanks to them. Nevertheless, this post is probably full of errors and they are mine alone. I've not released source code because the topic is mostly offensive and the sources would be more useful for offensive purposes than defensive ones. If you for some reason think my sources could be useful, get in contact (I'll be offline the next 3 weeks, but I will get back to you). Finally, I'd like to apologize that this post got rather rushed at the end. In fact, I'm writing this with a plane to catch. Particular damage was done to the section on different timers, where I really would've liked to add more information.


Cache side channel attacks have been around since 1996 or so, perhaps even longer. Recently they have had a renaissance in the literature because of the (re-?)emergence of a 3rd level cache. (The 3rd level cache is identical to the last level cache on systems that have 3 cache levels - worth noting when reading the literature.) The 3rd level cache on modern Intel processors is shared between cores, and this effectively allows a program at any privilege level to spy on programs running simultaneously on other cores without being concerned about sandboxes or privilege levels, and to some degree even through hypervisors. The sheer size of the new 3rd level cache also allows the spying to be more accurate than with previous caches. Cache side channel attacks on the L3 cache have been shown to successfully track mouse cursors from JavaScript on a webpage, grab keyboard input from other users on the same system and grab cryptographic keys from co-located virtual computers in the cloud. It is thus not an entirely academic pursuit.
There are currently three different kinds of cache side channel attacks. I shall write about them all to some extent here, so I'll introduce them briefly before I go into a slightly more detailed description of how the cache works.
1.       Evict + time
The attacker measures the time it takes to execute a piece of victim code. Then the attacker evicts part of the cache, executes and times the victim code again. The difference in timing tells something about whether the victim uses that part of the cache.

2.       Prime + probe
The attacker accesses memory to fill part of the cache with his own data and waits for the victim code to execute (prime). Then the attacker measures the time it takes to access the memory he had carefully placed in the cache before. If it's slow, it is because the victim needed the cache, and this gives us knowledge about what the victim did (probe).
3.       Flush + reload
The flush and reload attack exploits the fact that processes often share memory. By flushing a shared address, then waiting for the victim and finally measuring the time it takes to access the address, an attacker can tell whether the victim pulled the address into the cache by accessing it.

A short introduction to the level 3 cache

So let's spend a bit of time considering how the level 3 cache is built. I'll use details from Sandy Bridge as an example, but it's mostly similar across Intel architectures. The details, however, can be important in implementing an attack or a defense. I use Sandy Bridge for 3 reasons: I only have access to a Sandy Bridge (details and sizes are based on my wife's computer), it's the best researched in the literature and it's likely the simplest.
At some point in the past, CPUs became much faster than memory, forcing CPUs to stall while waiting for memory. Because memory accesses typically concentrate on a relatively small amount of memory at any given time, adding a cache of super fast (and expensive - in terms of CPU real estate and money) memory directly in the CPU gave a large performance gain for only a small price. Thus the first level cache was added. Later, as CPUs got faster and software more memory hungry, a slightly less expensive second level cache was added - real estate prices in the CPU drop with distance from where the magic happens. And finally a 3rd level was added, which is a lot slower than L1 and L2, but still much faster than memory and comparatively huge.
L1 and L2 cache operations are slightly more involved than L3 due to the fact that they are per core, so a messaging system exists to keep L1 and L2 coherent between cores. I'll not discuss that here since my focus is the L3 cache; suffice to say they are otherwise similarly built. There is, however, one design feature of Intel's which is important: the cache hierarchy is inclusive. This means that if memory is loaded in L1 it's also loaded in L2 and L3, and if it's loaded in L2 it's also loaded in L3. This is Intel specific - the AMD K7 for instance also has a cache hierarchy, but it's not inclusive. For an attacker, an inclusive cache makes life much easier: if he can flush memory from L3, he automatically flushes it from all caches.
When a CPU vendor builds a cache, they need to track which parts of memory are in the cache and which are not. This gives rise to a trade-off between how much memory you spend on managing the cache, how flexible the cache is and how fast you can do your lookup. The level 3 cache on Sandy Bridge balances these trade-offs in the following way. Sandy Bridge's L3 cache, like that of many other CPUs, is a so-called N-way set associative cache. This means the memory is first divided into a number of blocks, and each block maps to a particular set in the cache. Each set can store any memory location within its block in N different places in the set, N being a small integer - for the Sandy Bridge L3 either 12 or 16, depending on the exact model. It is implemented this way as a compromise between two extremes. The first is called direct mapping, meaning that each memory location has only 1 location where it can be stored in the cache, making for quick lookup but fierce competition between different parts of memory for this magic location - the N=1 scenario. The second extreme is called a fully associative cache and means that any memory location can be cached anywhere in the cache, the N=cache size scenario (only 1 set). While this makes for a very flexible cache, it also requires searching the entire cache for a match. A good way to think about it is that sets are indexed and ways are searched when looking for a cache hit.

Also, to further reduce the problem, the cache isn't working on single bytes but on 64 byte chunks called cache lines. Finally, Sandy Bridge and many subsequent CPUs have multiple cores, and each core has a part of the cache located next to it, called a slice. The slices are connected by a ring bus (Intel calls it QuickPath Interconnect) so that any core can access the slice of another core, but only at a penalty. This is a compromise between fast lookup in the cache belonging to the core and access to the entire cache for all cores. Graphically this looks something like this:



Summary example for my wife's Sandy Bridge: any given byte in memory can be cached as part of a 64 byte cache line. This cache line is cached in exactly one slice, which belongs to a core. In this slice it can be an occupant of exactly one cache set out of 2048. Within the cache set it can occupy any of 12 ways. 2 (slices/cores) * 2048 (sets) * 12 (ways) * 64 (bytes per cache line) = 3 MB cache total.
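
To make the indexing above concrete, here is a minimal sketch of how the set index could be derived from a physical address under these parameters. This is my own illustration, not code from any reference; the slice-selection hash is undocumented and left out (see Seaborn (2015)).

    #include <stdint.h>

    /* 64-byte lines: bits 0-5 are the byte offset within the line.
       2048 sets per slice: bits 6-16 select the set. Which slice the
       line lands in is decided by a hash of higher address bits. */
    #define LINE_BITS 6
    #define SET_BITS  11                  /* 2^11 = 2048 sets per slice */

    static unsigned l3_set_index(uintptr_t phys_addr)
    {
        return (unsigned)((phys_addr >> LINE_BITS) & ((1u << SET_BITS) - 1));
    }
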
Let's take a look at it differently. A request for data is sent from core 1 down towards memory. We start where the request has arrived at the L3 cache:
1. First, part of the address is used to index the correct slice. If we are lucky it's on the current core's slice; otherwise we pay a small penalty for using the ring bus to access the slice of core 2.
2. Other bits of the address are now used to index the set to which the address belongs.
3. Then the N ways are searched for our address. If it's not found, the cache returns a miss and the request is forwarded to memory.
4. If the requested memory is found, the cache returns the relevant bytes from the cache line, using the lowest 6 bits of the address to index into the cache line.
For the details of how the bits of the physical address of the request map to the cache, see Seaborn (2015).

Eviction
The cache is filled either through memory reads or through the prefetcher. When you read memory, very often you'll shortly thereafter either read or write the same memory or memory in the same cache line. Thus reading memory usually causes the memory you just read to be stored in the appropriate cache set. The prefetcher looks for patterns in your memory accesses and tries to move cache lines from memory to the cache before you need them. Either way, a set is going to be full at some point, and to keep the system up to date memory needs to be evicted from the cache. The method that determines which cache lines are no longer needed is called the eviction policy, and the act of dropping a cache line is called evicting - or sometimes flushing.

The first algorithm to come to mind is to drop the least recently used cache line to make space. This is, not surprisingly, called an LRU eviction policy. The problem with such a policy is that you need to keep track of the order in which the cache lines in a set were accessed so you can evict the least recently used. This requires on the order of n*log2(n) bits (where n is the number of ways) and searching through those bits to figure out which line to evict. This can be done in different ways, but it's always computationally intensive. With bits and time being expensive in the CPU, Sandy Bridge opted for a so-called pLRU or "pseudo least recently used" eviction policy. Instead of accurately keeping track, a simple binary tree is represented with a suitable number of bits so that each leaf is a way in the set. On access, the bits in the nodes along the path that was used are flipped, and for eviction the opposite path is followed to figure out which way is to be evicted. pLRU is not only faster algorithmically, since only the nodes on the path need updating, it also requires fewer bits. The downside is that it only approximates LRU. To read more about eviction policies you may like Jiménez (201x). CPUs more modern than Sandy Bridge have more complex (adaptive) eviction policies.

We can now see that if we wish to remove another program's memory from the cache, we can do so by systematically loading our own memory into each cache set. Theoretically, for Sandy Bridge we need only N addresses (12 or 16) which belong to a cache set of interest; load each of those twice and the cache set will be filled with those addresses. Real life experiments suggest that a little more than twice provides better results. The interesting thing is that it's quite a doable proposition. In fact, it's so doable that even without knowing the actual addresses we can figure them out by using timing, and thus do it from JavaScript without pointers. Oren et al. (2015) does this. A detailed discussion of optimally filling cache sets is given in Gruss & Maurice (2015).
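
As a rough illustration, priming a single cache set could look like the sketch below. It assumes we already hold an array of addresses that all map to the same set and slice - building that eviction set is the hard part, and is what Oren et al. (2015) and Gruss & Maurice (2015) cover.

    /* Minimal priming sketch: touch every address in the eviction set
       twice (the text above notes slightly more than twice works better
       in practice). Afterwards the set should hold only our lines. */
    static void prime_set(volatile char **eviction_set, int n_ways)
    {
        for (int round = 0; round < 2; round++)
            for (int i = 0; i < n_ways; i++)
                (void)*eviction_set[i];   /* a plain load fills one way */
    }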

Cache side channel attacks

Now that we have a partial overview of how the cache works, we can turn our attention to how the 3 previously mentioned cache attacks work the cache to gain information.

Evict + time

In the beginning of this blog post I wrote: "The attacker measures the time it takes to execute a piece of victim code. Then the attacker evicts part of the cache, executes and times the victim code again. The difference in timing tells something about whether the victim uses that part of the cache." We can now fill in the blanks. If we know which cache sets the victim is likely to use, we set up the cache so as to give us an idea about what memory the victim used. If we run the victim once, we know it's likely to have cached part of the memory it used. We time how long that takes with a high resolution timer to give us a baseline. Now we load our own memory systematically so that a number of cache sets (those in which we are interested in terms of the victim) become filled with our data - we can then assume that anything the victim placed in those sets has been evicted. Then we let the victim run again and compare how long it took to our baseline. We can repeat this process for different cache sets until we've figured out which cache sets he used. If branches are sufficiently large (larger than the cache line size), we can say which parts of the code the victim used. If memory structures are bigger than or displaced across cache lines, we can say which memory structures he used. This information can be used to draw inferences about information private to the victim. Typical examples are attacking an AES implementation that uses tables for encryption, with the key being private to the victim, or KASLR where actual addresses are hidden from view. We should notice here that other tasks using the cache will add noise to our measurements, so repeating the attack again and again will improve accuracy - on a multi core system there is always noise on anything to do with memory. Also, the prefetcher is an enemy to be considered, as it might load memory in anticipation of the victim using it without the victim actually doing so, thus giving us false information. In practice these things make the attack less reliable, but by no means infeasible. The real downside to this attack is that we need to be able to cause execution of victim code, which might be difficult in the first place. The positive side is that we can apply it from JavaScript or native code, since we need not know any addresses from the victim to carry out the attack - the actual cache sets used is all the information that is required.
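
A minimal sketch of the measurement loop, assuming we can invoke the victim directly and reuse a priming routine like the one sketched earlier (victim_function and the surrounding plumbing are hypothetical):

    #include <stdint.h>
    #include <x86intrin.h>    /* __rdtscp; use intrin.h on MSVC */

    /* Time one victim run with the time stamp counter. */
    static uint64_t time_victim(void (*victim_function)(void))
    {
        unsigned aux;
        uint64_t t0 = __rdtscp(&aux);
        victim_function();
        uint64_t t1 = __rdtscp(&aux);
        return t1 - t0;
    }

    /* baseline = time_victim(victim_function);  warm run, data cached    */
    /* prime_set(eviction_set, n_ways);          evict the set under test */
    /* a second run slower than baseline => victim uses that cache set    */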

Prime + probe
The prime + probe attack drops the requirement of evict + time that the attacker must be able to start execution of the victim. To make this happen, we again make sure that interesting cache sets are filled with memory we control; this is the prime phase of the attack. Once this is the case, we wait a predefined amount of time to see if a victim uses memory in the region covered by the cache set. If he does, he'll cause one of our addresses to be evicted. The probe phase is then to access each address we originally filled the cache set with. If this is slow, the victim has used memory within this cache set. Probing fast enough, we get a series of observations over time of which cache sets are being used on the computer. This allows us to draw lots of inferences about what the victim is doing at any time. Typically, to map cache set activity the attacker runs lots of attacks on a system under his control to generate a baseline data set. He then uses a machine learning method to map the time series he got from the victim onto that baseline to figure out what the victim is actually doing. Prime + probe is in my opinion the most powerful of the cache side channel attacks. Again we don't need actual addresses, so we can make it work in JavaScript. We don't need to cause execution of the victim code; we just need to loiter around - which is one less prerequisite for the attacker. Thus this attack vector is suitable for almost any situation where we get to run our own code on the same computer as our victim. Hypervisors, privilege levels, sandboxes and other barriers are irrelevant, because we all meet up in the cache. The downside to this attack is that the data we get back tends to be very noisy. We only see which cache sets were used (and partly how many cache lines were loaded), but we have no initial information about which core, thread or even which code caused the access. So to succeed we need to repeat the process a lot, and we can spy only on events that are relatively common. Oren et al. (2015) covers this kind of attack in more detail.
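
The probe phase could be sketched like this; the threshold separating a hit from a miss is machine specific and must be calibrated, and all names here are mine:

    #include <stdint.h>
    #include <x86intrin.h>

    /* Count how many of our primed lines the victim evicted. */
    static int probe_set(volatile char **eviction_set, int n_ways,
                         uint64_t miss_threshold)
    {
        int evicted = 0;
        unsigned aux;
        for (int i = 0; i < n_ways; i++) {
            uint64_t t0 = __rdtscp(&aux);
            (void)*eviction_set[i];
            uint64_t t1 = __rdtscp(&aux);
            if (t1 - t0 > miss_threshold)   /* slow => victim used the set */
                evicted++;
        }
        return evicted;
    }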

Flush + reload
Flush + reload overcomes evict + time's requirement of causing execution as well as prime + probe's noisy data. It does this by assuming that victim and attacker share memory. This proposition at first seems weird, since even in unsafe operating systems such as Windows 95 each process is given its own memory space. But it's not as weird as it may seem, because of a memory management feature of most operating systems called page deduplication. If two identical memory pages are loaded in separate processes, the operating system can save physical memory (and make more efficient use of the cache) by mapping the same physical page into each process. Most operating systems do this - even across different users and privilege levels. A typical example would be the code of kernel32.dll on Windows: it's shared by processes, but most of the DLL (especially read-only portions such as code) is mapped in physical memory only once. Should a process decide to write to such memory, most operating systems incorporate a feature called copy-on-write: essentially the OS sees to it that an exception is thrown on write access, makes a new copy of the page for the offending process and then retries the write operation. Page deduplication is rampant in all modern operating systems, and even some hypervisors are capable of doing it. The attacker now uses this by flushing a specific address shared by attacker and victim. Because he needs to flush a specific address, this attack seems impractical from JavaScript. In native code it's quite easy to flush an address using the CLFLUSH instruction, but there is also no reason why he shouldn't be able to do it using the normal eviction mechanics of the cache. Once the address is flushed, the attacker waits an interval and then times the access to the address. If the victim used the address, it's likely to be in the cache and thus the access will be fast. Since we know exactly which address we are talking about, we have very little noise. Flush + reload can be used as a tailored attack, or by comparing what happens on a machine under the attacker's control with what happens on a victim computer using machine learning methods - the latter is called a cache template attack. The best article on flush + reload attacks is probably Gruss, Spreitzer & Mangard (2015).
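
In native code the core of flush + reload is only a few lines. A hedged sketch - shared_addr must point into memory shared (deduplicated) between attacker and victim, and the threshold is again machine specific:

    #include <stdint.h>
    #include <x86intrin.h>    /* _mm_clflush, __rdtscp */

    static int victim_touched(const void *shared_addr, uint64_t threshold)
    {
        unsigned aux;
        _mm_clflush(shared_addr);             /* flush phase              */
        /* ... wait for the victim to (maybe) touch shared_addr ...      */
        uint64_t t0 = __rdtscp(&aux);
        *(volatile const char *)shared_addr;  /* reload phase             */
        uint64_t t1 = __rdtscp(&aux);
        return (t1 - t0) < threshold;         /* fast => victim cached it */
    }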

Current counter measures to cache side channel attacks


My general opinion on security problems such as CSCs is that we should fix the root cause, and in this case the root cause is the micro architecture of the cache. The core problem, however, is the cache being shared across privilege levels, and that might be a significant hurdle for CPU designers to climb. Also, there is already plenty of existing hardware in use, so even if Intel were to fix the root problem we'd be stuck with these issues for a considerable amount of time unless we look at other options.
1.       Less accurate timer
Most cache side channel attacks use the timing of a memory access to determine whether it's cached or not. This usually works through the RDTSC/RDTSCP instructions, which read the current time stamp counter of the processor, or an operating system/library wrapper thereof. For JavaScript this can be implemented directly in the wrapper, and this has already been suggested and implemented in response to Oren et al. (2015). Obviously, if the timer is sufficiently coarse we cannot tell a cache hit and a cache miss apart. For user mode attacks, RDTSC(P) can be disabled in user mode through a bit in CR4, which will cause it to raise an exception. This again can be used to emulate the instruction with any coarseness we wish, keeping compatibility with most existing software. Now this approach is never going to be perfect. The reason is that if we start a timer at a random point in time and then carry out a flush or a probe, the probability of seeing an increment when reading the timer again increases with the time the flush or probe took. Thus if we can sample enough events, we'll be able to draw inference about what is happening - that is, spy. Keyboard presses are not such an event type, as they are more or less unique events - we can't just ask the user to type his password 10000 times. Whereas blitting a mouse cursor touches multiple cache lines at once, and in neighboring positions as it's moved, so we could probably get a good idea about where it is even with a fairly coarse timer. One approach for attackers to avoid all this would be to use performance counters to count LLC cache misses instead of relying on timing, if their privilege level allows for it. Also, on virtual machines performance counters are often not implemented. OS and virtualization implementers should however be aware that performance counters provide a powerful tool for reading into side channels, even when you only get your own performance data, because of the sharing of resources in the system. I'll return to timing later; a sketch of the emulation idea follows below.
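
With the CR4 bit set, a ring 3 RDTSC faults into the kernel, which can hand back a rounded value. COARSENESS and the handler plumbing below are assumptions on my part, not a real OS interface:

    #include <stdint.h>

    #define COARSENESS 4096   /* hypothetical grain, in TSC ticks */

    /* Ring 0 side: the value the fault handler would return to the
       faulting ring 3 RDTSC. Rounding down hides the hit/miss-sized
       differences an attacker needs. */
    static uint64_t emulated_rdtsc(uint64_t real_tsc)
    {
        return real_tsc - (real_tsc % COARSENESS);
    }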
2.       Non-inclusive cache hierarchy
Obviously this does nothing against code using the CLFLUSH instruction, but people trying to evict the cache through filling a cache set would now need to flush all levels in the hierarchy separately. Obviously this makes it more difficult, but I don't think it's a deal breaker - since we can relatively easily evict a cache set in a 3 MB cache, evicting one in a 32 KB cache seems like a feasible task. I've not read about cache attacks on the AMD K7, but I'm fairly confident that despite the non-inclusive cache hierarchy, CSCs are possible on this processor too.
3.       Fully associative caches
Though they may not deliver the same performance, fully associative caches do make it very difficult to evict the cache fast enough without the CLFLUSH instruction, as we'd essentially need an extreme number of ways in our eviction set. Again this would only be a solution on new hardware, and to really close the gap the CLFLUSH instruction should be made privileged. There are other reasons - row hammer - to follow such a path.

4.       Disable cache-line sharing
Flush + reload attacks can be dealt with effectively if there is no shared memory. This ranges from the operating system/virtual machine disabling page deduplication to actively causing copy-on-write on particularly interesting pages. On Windows 7 a simple ReadProcessMemory() followed by WriteProcessMemory() is enough to trigger a copy-on-write and thus disable flush + reload - you don't actually need to change the memory! There are other reasons to reconsider page deduplication: the copy-on-write takes time and opens up a side channel attack of its own through timing write operations to pages. I believe Daniel Gruss wrote a paper on the subject, but I could be mistaken. A sketch of the copy-on-write trick follows below.
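
A sketch of that trick (Windows API; untested, so treat it as an illustration): reading a shared page and writing the identical bytes back leaves the contents unchanged but gives the process a private copy.

    #include <windows.h>

    /* Break page sharing for one (assumed 4 KB) page of a process. */
    static BOOL break_sharing(HANDLE process, LPVOID page)
    {
        BYTE buf[4096];
        SIZE_T n = 0;
        if (!ReadProcessMemory(process, page, buf, sizeof(buf), &n))
            return FALSE;
        /* writing the same bytes back triggers copy-on-write */
        return WriteProcessMemory(process, page, buf, n, NULL);
    }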
5.       Accessing and flushing memory "randomly" by a victim or operating system obviously adds lots of noise to the process. I love the irony of using cache side channel techniques to mitigate/detect cache side channel attacks - but despite this irony there is obviously a heavy performance penalty associated with causing cache misses to protect against attacks, and in scenarios where the attacker is making many repeated attempts it is likely to be too big.

6.       The Intel CPU tries to load data into the cache before it's used: prefetching. It's especially useful for code and for predictable reads like memcpy(). Obviously the prefetcher continually causes cache hits even when nobody has actually accessed the memory, thus making it more difficult to get good data on memory activity for spying purposes. However, I find it unlikely that we'll see a prefetcher good enough to land a very strong blow on the possibilities of an attacker.

7.       It has also been suggested to modify software so that each user has a version with a unique memory layout and access path. Imagine a powerful mutation engine at work. This has been suggested as a remedy against code reuse attacks in the past and is likely to make CSCs quite a bit more difficult. It is also quite feasible, as such a mutation engine was used in a successful virus by Z0mbie back in 2005. The downside is of course that mutation engines of any kind pose a QA problem to software developers. Certainly this method has not yet gained traction in fighting code reuse attacks, so the probability of it gaining traction for fighting CSCs seems slim.

8.       Using performance counters is also an option for detection. I suggested in my and Nishad Herath's Black Hat slides to instrumentalize RDTSC(P) or other high resolution timers: first make them less accurate, and second start counting cache misses using the Intel performance counters. This forces a spy into territory where he has to cause more cache misses to get reliable information. This, combined with the presence of a high resolution timer, makes a very good heuristic for cache side channel attacks - especially since one potential response would be to make the timer even less accurate, thus only mildly impacting the system as a whole in the event of a false positive. In my experiments false positives have been rare, though low-frequency spying sometimes ends in false negatives. I imagine that continued monitoring as well as tuning the coarseness of the timer could make it a viable solution. However, let's take a look at timing from an attacker's perspective: even if the attacker manages without a high resolution timer, it might well be worth looking into performance counters to detect cache side channel attacks - because any high-resolution-timer substitute is likely to leave uncommon artifacts in the performance counters. A sketch of arming such a counter follows below.
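
For the flavor of it, a ring 0 sketch of arming the architectural "LLC Misses" event (event 0x2E, umask 0x41 per Intel's SDM) on general purpose counter 0. write_msr is a hypothetical kernel helper, not a real API.

    #include <stdint.h>

    #define IA32_PERFEVTSEL0 0x186
    #define IA32_PMC0        0xC1

    extern void write_msr(uint32_t msr, uint64_t value);  /* assumed helper */

    static void arm_llc_miss_counter(void)
    {
        uint64_t evtsel = 0x2Eull           /* event: LLC misses       */
                        | (0x41ull << 8)    /* umask                   */
                        | (1ull << 16)      /* count ring 3 events     */
                        | (1ull << 22);     /* enable the counter      */
        write_msr(IA32_PMC0, 0);            /* zero the count          */
        write_msr(IA32_PERFEVTSEL0, evtsel);
    }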

Timer issues


From the above it looks like attacking the timer is the most promising defensive technique, as it deals with all three methods of cache side channel attacks. It's feasible, relatively easy to implement (at least if we don't consider Patch Guard), and though it might not give 100% peace of mind, it at least goes a long way in the right direction. However, there might be other ways to get a high resolution timer on the x86. I'll suggest two such methods here; others may exist:
1.       Delta timer
Using a low-res timer, wait for the moment when it increments, then do your measurement and try with register-only instructions to fill the gap so that a cache miss will cause the timer to increment exactly once and a cache hit will not. I call this method of timing a delta timer. Probably somebody in the literature has assigned another name; if so, please contact me and tell me. On old single core CPUs, register-only instructions had deterministic timing. This is not the case on multi core CPUs, which makes it a bit less reliable. The figure below shows a graph of 30 observations from my Sandy Bridge when trying to time out 8000 nanoseconds. It suggests that if we need a delta timer to be 95% correct, we need at least 10 cache misses to be sure there were any cache misses at all. This brings it into a range where the performance counter detection method would work well. However, if we accept an error rate of 1/6, the delta timer might do the job while being difficult to detect using performance counters. Now remember that this is not a real scenario, where we'd have a lot more noise on top. But it's unfortunately not infeasible that a delta timer would work in many real world scenarios - especially for flush + reload, since flush + reload is a lot less susceptible to noise than the other methods anyway. But who says the coarseness of the timer should be constant? With a random value (random drift?) the delta timer would be difficult to use - that again is a different scenario than what has been proposed for JavaScript. A sketch of the idea follows below.
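
A sketch of the delta timer idea; coarse_time() stands in for whatever low resolution timer the environment exposes, and the register-only padding needed to make exactly one miss straddle a tick is omitted.

    #include <stdint.h>

    extern uint64_t coarse_time(void);   /* assumed low resolution timer */

    /* Align to a tick edge, do the access, and check whether the next
       tick arrived "early": a miss stalls us past the edge, a hit
       should not. */
    static int access_probably_missed(volatile const char *addr)
    {
        uint64_t t = coarse_time();
        while (coarse_time() == t)       /* spin to the increment edge   */
            ;
        t = coarse_time();
        (void)*addr;                     /* the access under test        */
        /* register-only padding would go here to tune the window       */
        return coarse_time() != t;       /* tick advanced => likely miss */
    }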
2.       Thread timer
I've dubbed this kind of timer a "thread timer", without knowing whether there is any literature on the subject. The idea is to have a thread running that starts counting on a signal through a global variable. When the global variable is reset, it stops counting. The count then serves as a high resolution timer. There are problems in implementing such a timer. Mainly, accesses to the global variable are not automatically perfectly synchronized, leaving a non-deterministic (and significant) time gap from when the timer is told to start until it actually starts - and the same problem exists for stopping. Now Intel does provide instructions to deal with this - MFENCE and the LOCK prefix - however I've not investigated those, as I figure they'd be difficult to generate from JavaScript (I find the JavaScript attack vector the most threatening). For native code they are likely to bring improvement, but not determinism. The first thing I noticed while experimenting was that I got absolutely random results until I started assigning the thread to a specific CPU (core or hyper thread). The behavior is very different on hyper threads and real cores. The unit for the thread timer is the number of times ecx was incremented during 1 uncached read and for 10 cache misses. The curve for hits looks similar. You may notice that it's shorter than the delta timer's: this is because I've deleted a handful of observations with the value 0 - that is, where the thread timer didn't start at all! When timing only one cache miss, I get only a few observations with counts. This hints that we need at least a handful of cache misses to be able to tell misses from hits with a thread timer. Obviously this is not the last word on thread timers, but it hints that they are fairly reliable on hyper threads, a little less so on real cores, and that thread timers are unsuitable for spying on events that give only a few cache misses as feedback. The thread timer code I used is listed below. In short, my suggested use of CR4 to reduce timer resolution does pose a problem to an attacker looking for rare events with only a few cache misses - say, key presses.
// Thread timer, MSVC inline assembly. g_Start and g_Time are assumed to
// be volatile globals, and the thread pinned to a fixed CPU (see above).
while (g_Start == 0);            // C-level wait for the controller
g_Time = 0;
__asm
{
WaitLooo:
       //mfence                  // could tighten the start edge; untested
       cmp g_Start, 0
       jz WaitLooo               // spin until the controller sets g_Start
       xor ecx, ecx              // reset the counter
again:
       inc ecx                   // count while the measurement runs
       cmp [g_Start], 0
       jnz again                 // controller clears g_Start to stop
       mov [g_Time], ecx         // publish the count - our timer reading
}

g_Start = 0;                     // signal that the reading is ready


Maybe I drink too much

In private conversations I've suggested in the past to use cache side channel attacks to detect rootkits, including SMM rootkits. My suggestion goes like this: suppose we have a rootkit type that tries to read user mode private keys in memory. The possibility exists that we can trigger this behavior and use it for an evict + time style attack. Imagine an SMM rootkit with a hook on SomeOpenPrivateKeyFunction() which reads the input irrespective of validity, while SomeOpenPrivateKeyFunction() itself won't read past a syntax error in the key, and our input exceeds a cache line. The scenario sounds farfetched, but maybe not too farfetched. Then set up a fake private key the size of two cache lines with a syntax error in the first cache line. Flush the second cache line and then call SomeOpenPrivateKeyFunction(). Now time accessing the second cache line of the fake private key. If SMM touched it, chances are it used the cache, and we'll be able to see the time difference. Obviously there is plenty a rootkit could do to mitigate this, but at least we've started a game of hide and seek.
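
In pseudo-C the probe could look like this; every name here, and the scenario itself, is hypothetical.

    #include <stdint.h>
    #include <x86intrin.h>

    extern void SomeOpenPrivateKeyFunction(char *key);  /* hypothetical */

    char fake_key[128];   /* spans two 64-byte cache lines */

    /* Line 1 holds a deliberate syntax error so the legitimate parser
       stops early; only code reading past the error touches line 2. */
    void probe_for_hook(void)
    {
        unsigned aux;
        _mm_clflush(fake_key + 64);            /* flush the second line */
        SomeOpenPrivateKeyFunction(fake_key);  /* bait the (SMM) hook   */
        uint64_t t0 = __rdtscp(&aux);
        *(volatile char *)(fake_key + 64);     /* time the reload       */
        uint64_t t1 = __rdtscp(&aux);
        /* a fast reload (small t1 - t0) => something read past the error */
    }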
Another interesting (mis-)use of cache side channel attacks would be detection of undesired runtime environments. Emulators and VMs could be detected. Emulators, because it's quite difficult to emulate a cache correctly and most just plain skip it. VMs could be detected by bootkits through the fact that the cache is likely to be full under a VM, whereas on a real system it should be empty on boot. I'm sure other cache based methods are available.

 

Conclusion

Cache side channel attacks certainly have a high coolness factor, which is why I bothered to write about them. On the surface of it, spying on another VM from a sandboxed JavaScript webpage - being able to exfiltrate crypto keys, key strokes, mouse movements etc. - sounds like the perfect storm. Add to it that (in my opinion and limited knowledge) nobody has found a good strategy for mitigation yet.
Certainly cache side channels pose a real danger to systems. But cache side channel attacks won't be responsible for a buffer-overflow-type wave of new breaches. The value of figuring out private keys or reducing the entropy of KASLR is just so much lower than a traditional breach. In the face of the mitigation options already available, the value of cache side channel attacks is reduced even further. Attackers can only figure out private keys if they are used often; attackers cannot really spy on key strokes with a reduced resolution timer. JavaScript spying is made more difficult, and with its level of noise it spies only on repeated events and only until somebody closes the webpage. However, the message should be: on sensitive systems these mitigations should be implemented - especially the lower resolution timer and the deduplication measures. And that JavaScript implementations are already lowering timer resolution is a good thing, even in the presence of other timers and even though I to some extent would've done the details differently.
And finally, just because there currently are no good mitigations on the table certainly doesn't mean there aren't any, and even less than perfect mitigations raise the cost/benefit ratio for an attacker. I for one would love to spend more time with performance counters. I just have to wonder what kind of quirks thread timers and delta timers have in the micro-ops world of performance counters.

Literature
Oren et al. (2015): "The Spy in the Sandbox - Practical Cache Attacks in Javascript"; Yossef Oren, Vasileios P. Kemerlis, Simha Sethumadhavan and Angelos D. Keromytis. arXiv:1502.07373v2
Gruss, Spreitzer & Mangard (2015): "Cache Template Attacks: Automating Attacks on Inclusive Last-Level Caches"; Daniel Gruss, Raphael Spreitzer, Stefan Mangard.
Jiménez (201x): "Insertion and Promotion for Tree-Based PseudoLRU Last-Level Caches"; Daniel A. Jiménez. http://taco.cse.tamu.edu/pdfs/15-jimenez.pdf
Seaborn (2015): "L3 cache mapping on Sandy Bridge CPUs"; Mark Seaborn. http://lackingrhoticity.blogspot.de/2015/04/l3-cache-mapping-on-sandy-bridge-cpus.html
Gruss & Maurice (2015): "Rowhammer.js"; Daniel Gruss, Clémentine Maurice. http://arxiv.org/pdf/1507.06955v1.pdf

Sunday, August 16, 2015

Speaking at Black Hat

Introduction

 The final version of the slides of our talk is here

As some might have noticed, I had the honor of speaking at Black Hat 2015 in Las Vegas with my old friend Nishad Herath. The topic of our talk was "These Are Not Your Grand Daddys CPU Performance Counters". More specifically, our talk was about using the Intel performance counters (PMCs) for defensive security purposes. Unfortunately the talk didn't go as well as I'd wanted it to, and I certainly don't think I conveyed what I wished to. Also, we made some last minute changes to the slides in the presentation. Hence this blog post with the updated slides and my personal comments on them.

I apologize for the format of this blog post, but I really want to get done with it and move on. The majority of these comments were written down as notes and I had them on stage with me. A few augmentations have been made for this blog post, and I've made an effort to make them readable to people other than myself. These notes were obviously heavily inspired by discussions with Mr. Herath, so I have to share any credit with him. Nevertheless they were my notes, and any errors are mine alone.

The story

The story of how I got to speak at Black Hat is interesting in itself (I think), and it certainly did have an impact on the talk. If this doesn't interest you, skip straight to the next header. Originally it was Mr. Herath's talk and I was invited to join late in the process. It wasn't until the 25th of June that things became official and I started really believing that it would work out. This left just 5 weeks of preparation time for me, and at that point I didn't even know that the slides were due on the 20th of July. So the big rush started on my side to get up to date on the details of performance counters, what other people had done with them, etc. It didn't make it easier that Mr. Herath was in an entirely different time zone. Also worth mentioning is that we'd been asked to spend a significant amount of our time on row hammer. Personally I would've rather spent my time elsewhere, given that Mark Seaborn and Halvar Flake were already giving a talk on row hammer and they know much more about it anyway. Especially I found it silly that the two talks touching on row hammer ended up being scheduled in the same time slot.

While we were working on the slides, the big speculation was that row hammer was doable in JavaScript, and we wrote up slides in line with this speculation while working frantically to figure out if/how this could actually be done. I succeeded in flushing the cache fast enough without CLFLUSH (only on Sandy Bridge) the night before Daniel Gruss and Clémentine Maurice published their (really nice) rowhammer.js paper, which obviously then turned over our slides. No more speculation, row hammer in JS was fact.

We had added cache side channel attacks because I'd noticed that my CSC code lit up my row hammer detection as well. It was always meant as a stop-gap in case we didn't use up our time with ROP, row hammer and rootkits, and I saw it as a way to get some fresh meat on the table. Just a few days before Black Hat, a W3C JavaScript draft came to my attention. In this draft they wanted to make the timer less accurate (3 µs) in response to "Spy in the Sandbox". This rang a bell in my head and turned everything upside down in the CSC slides - from having something that only worked reasonably at a very high frequency of polling the side channel, we went to something where we could actually do a fair detection at much lower frequencies. This caused the last minute changes to this section - as you may notice, they are pretty severe on the low frequency stuff. Now I don't pretend that the CSC changes are the final word on cache side channel attacks and performance counters. I think it's enough to show that PMCs can be relevant to cache side channel attacks. If you're interested in this subject, I'm writing another blog post to elucidate my findings more accurately than the slides were ever intended to.

It is such fun to work with stuff that is still in motion, but it does throw you through some hoops when you are on a tight deadline.

The final event that really turned things upside down was that Mr. Herath's laptop (which we would use for the presentation) gave up, and we had to spend the morning before our talk (the only half day of actually being physically at the same location before the talk) summoning up a new laptop and agreeing on the last minute slide changes - thanks to an anonymous benefactor who lent us a laptop to do changes on while we were still in pursuit of one for the presentation. Thus I went to speak at the Black Hat conference without ever having seen a conference talk, and not with the best of preparations either. I don't think it was a disaster, but I should've done a lot of things better. I learned a lot and I'll do better next time - if there will be one. As a kind soul remarked, it takes a few iterations to get it right - though I agree with this, I had certainly set the bar higher for myself. The real error, I think, was insufficient preparation on my part and too much detail and too much material.


My comments on the slides


Performance counters count different micro events in the CPU. These micro events tell a lot about the code running on the CPU, and since malicious low-level code often behaves differently than "normal" code, the performance counters can be used as a signal to look for malicious code. As an example, ROP code causes excessive "return misses", as the RET opcode is not used in a pair with a CALL, but rather to progress along a gadget chain. Unfortunately events are plentiful, so there is a lot of noise in the signal too. However, there are ways to harness this noise and make useful inference about the presence of malicious code from the performance counters. The talk, in my opinion, can be summarized in one statement: performance counters can be very useful for security purposes in a great many ways. My wishful thinking in this regard would be that these things made their way into security products, that Microsoft got their stuff together and made a decent interface for using PMCs under Windows, and that Intel would consider adding security related events to the PMCs instead of just stuff for optimizing their CPU or your code.

Slide 24
I think it's worth noting that I did the majority of my work with PMCs on Windows XP 32-bit. The reason behind this move was that I had such a system available to me and that I avoided driver signing and Patch Guard issues. Most other work has been done on Linux derivatives such as Debian, which have a much nicer interface to PMCs than Windows. There are of course ways around Patch Guard, but it wouldn't be nice to pursue them. Also, there might be undocumented HAL APIs such as HalpSetSystemInformation() on WinXP to hook into the PMI. For non-PMI use there is an API on newer systems on a per process basis for user mode. Visual Studio 12 comes with horrible tools for using that API - I'm not aware of any documentation, but you can reverse engineer those tools; they are not too big. Sorry - I don't have much information on this. Maybe I'll look into it... no promises though.

Slide 30
This slide is to me our key slide. Not only does it show how you can minimize the performance impact of using performance counters to look for malicious activity, furthermore it gives some basic ideas on how you can manage the noise problem with performance counters. The basic idea is that we can often, from other analysis, say something about the conditions under which malicious code will run and where it won't. The simple example: row hammering is not very useful if the attacker is already running code in ring 0, thus we need not examine events in ring 0 to detect row hammering. Of our 4 examples of detecting/mitigating malicious activity, all of the methods use some aspect of the methodology in this slide.

Slide 35
Ret-prediction is terrible around task switches, because the CPU's shadow stack does not get swapped while the real stack does. The 400k bad case scenario on slide 29 was for: for (int i=0;i<300;i++) Sleep(100); which causes lots of task switches. Thus on task switches we'd see heavy performance penalties and problems with the limited depth of the LBR. kBouncer essentially gets around the limited depth of the LBR (which records only the last 16 branches) by instrumentalizing the code of targets of interest around API calls. This can be seen as the instrumentalization method of slide 30. Interestingly, the first workaround from an attacker's point of view was to use the still limited depth of the LBR to bypass kBouncer. However, traditional performance counters could be used with instrumentalization as well, and then depth would be limited only by acceptable performance loss and memory. The implementation would be slightly more complex than kBouncer (because we'd need instrumentalization to start and stop collection), but we are well within the realm of possibility.

Slide 37
We have essentially the same problem as in slide 35 for kBouncer; however, the solution is different. Georg Wicherski limits detection to ring 0 and gets around the problem of task switching that way.

Slide 48
I wished to make the point that row hammer kinds of problems are likely to increase in the future as DRAM grows increasingly dense. It also means that the problem is likely to see hardware fixes in the future.

Slide 49
This slide can serve a way of sorting the following row hammer mitigation methods into categories. It should be noted that having two physical rows in the same bank is required even for single side hammering because of the row buffer. This slide gives a false impression.

Slide 52
Even though not good enough on its own, it might be a useful supplement to other methods, including ours.

Slide 53
The performance measure is on memory access only. It doesn't map in a well defined way to system performance, because the Intel CPU is able to rearrange instructions while waiting for memory; this will tend to make the system-level loss smaller. On the other hand, memory speed is very often the bottleneck on performance, dragging things in the other direction. I think the latter effect will dominate, but I have no evidence backing this belief.
The answer to the last question is a definite no, though it might be enough for some RAM.
Slide 54
It's my guess that power consumption will not be relevant. I found two different sources painting different pictures. I think it would make a difference in sleep, but since we cannot row hammer in sleep, the point is moot. While the system is awake, other components of a laptop should outspend the RAM by a large factor. So if implemented right, it should be a non-issue.
Slide 52+55+56
All three are mitigations through faster refresh as on slide 49. PARA and pTRR are more targeted refresh, to avoid the steep penalties of refreshing the entire RAM more often. The latter two methods seem to me to be the "right" solution - essentially row hammer is a micro architecture problem and it should be fixed in the micro architecture too, and I consider that likely to be done - see also the slide 48 comments on RAM growing more dense. However, that requires that people get new hardware.

Slide 60
I've not been able to do row hammer with non-temporal instructions without flushing the cache by other means first. It seems that using these instructions from JavaScript is very difficult because JIT compilers don't generate them. (There was a paper on this; if you're interested, ask me and I'll find it.) There might be other reasons why Intel should consider making CLFLUSH a privileged instruction: cache side channel attacks. Additionally, there is little use for CLFLUSH except for optimizing cache usage to speed up code, which seems rather difficult. Again, making CLFLUSH privileged does not solve the problem, but it doesn't exactly hurt either.

Slide 68
In addition to mitigation through delay, we could try to figure out the victim row and read from it to trigger a refresh (a row is automatically re-written on read).
Also interesting: here we use a rarely triggering interrupt (say every 1000 LLC misses) to trigger the costly analysis. It's an example of using a rare event from slide 30 to trigger more costly analysis. (Totally irrelevant note that I didn't know where else to put: at one point I used ret-miss events to alert me of process switching while playing around - instead of hooking my way into getting alerted on task switches.)
Performance cost analysis: a normal LLC miss costs around 200 ns and an interrupt around 500 ns. Triggering an interrupt every 1000 misses thus adds roughly 500 ns per 200,000 ns of miss time - about 0.25% - and 1000 is a really low number for row hammer; 100000 would probably be more appropriate.

Slide 71
Essentially the rootkit detection here is detecting code executing out-of-bounds, i.e. outside of a known white list. If we use the probabilistic element of slide 30 to keep the performance cost down while staying in ring 0 only, it becomes feasible to keep a white list of known good code. I dislike using the word heuristic from slide 30 in this sense - if an interrupt triggers on this code, there is no doubt the code executed; however, nobody says that executing the code will trigger a PMI (thus probabilistic). Finally, we hint that using instrumentalization around particularly vulnerable code could increase the true positive rate enough to get a chance at finding traditional exploits doing their work.


Slide 86 & Cache side channel attacks in general
I'll just note that we're using the instrumentalization (slide 30 again) of the fine grained timer, in addition to making it less fine grained, to force behavior that is less likely to occur naturally and thus make it detectable. It is in many ways a heuristic approach. I should note too that false positives are possible, and you'd have to measure your response with that in mind - e.g. make the timer even more inaccurate, flush the cache, trigger copy-on-write etc. What should also not be forgotten: this is intended to show that PMCs can be relevant for cache side channel attacks. It is not the final word, as the attacker does have options to evade, and I'll get back to that when I finish my next blog post. On the good side, though - there are more defensive possibilities too... To be continued...



Monday, July 20, 2015

Reverse Engineering nostalgia

I wrote this blog post about a month ago. I thought it might amuse some of the nostalgically minded people around. After having read it, I dropped pushing it because it was too much about me (and I don't wish to blog about me) and not enough about the technology (which I do wish to blog about). Rethinking - maybe it would serve its purpose of putting somebody in a nostalgic mood. An additional factor is that I've not gotten around to writing anything else for a while and vacation time is coming up, so it'll probably be a while before I do. With that in mind...

My first program and the first computer

The first program I ever wrote was malware. That is, if malice as intent and effect makes software malware. I was around 10 years old and my father had just convinced my mom that purchasing a monitor for his New Brain computer would not ruin our family. The New Brain was a computer with 32 KB of memory, a terrible keyboard and an 11 character calculator display, programmable in BASIC. To store programs and data, my father had soldered around in my cassette recorder. The newly purchased monitor was monochrome and small. By monochrome I don't mean black and white; I mean dark brown and beige-yellowish. I was at least as fascinated with the computer as my geek father. It was probably no accident that I immediately realized it was the perfect instrument for mischievous behavior. The program that I'd written was:
10 print “Karen is stupid”
20 goto 10

Karen is my older sister. And it's worth noting that even this simple program ran slowly enough that you could see it scrolling down the screen. In those days - probably 1984ish - computers were magical things, and one had better listen when they had computed something. My sister screamed with rage - which was cool, since she probably deserved it.


Commodore

A few years later my father had upgraded the "New Brain" to a Commodore C128. By now I was fairly good at writing BASIC programs and only half bad at Comal 80. But I didn't want to code. I wanted to crack games and have my "nickname" on the startup screen of the games everybody else played. My problem was that I needed a machine code monitor, and the only machine code monitor I knew about for the C64 was the "Final Cartridge". I wasn't allowed to buy it. So instead I booted into C128 mode and started "monitor". Anybody who has ever used "monitor" has either given up quickly or committed suicide - at least that's what I think. Especially since I (and I assume most other people) did not have any C128 software to be debugged. I gave up.



Early PC days

1992, and I had my first break actually "reverse engineering" something. I'd stumbled upon a text called "act-13.txt". It was about cracking PC games using debug.exe. The irony was that debug.exe was incredibly powerful and terrible at the same time. It served as hex editor, disk editor, debugger, disassembler and assembler built into one terrible scrolling user interface. Act-13.txt described how to use debug.exe to search through a binary file for "CD 13", that is, interrupt 13h. You'd have to be sufficiently lucky to also get hold of Ralph Brown's interrupt list to have a chance of figuring out how that interrupt worked. Ralph Brown's list was the documentation for programming the PC at that point - say, the MSDN of the day. Remember, you had to find another guy your age with the same interest who'd actually copy it onto a 3.5" diskette for you. I was lucky. Copy protection in those days mostly worked around either letting the user enter a code from a code sheet or having the game distributed on a floppy diskette that had been formatted off spec. For the diskette copy protection we'd thus look for stuff that read physically from the diskette, and this was interrupt 13h. So search the entire code for int 13, then disassemble, see if you could figure out what it was doing, and usually change a JC/JNC into two NOPs or a JMP. If you're a modern hacker you're probably wondering what the carry flag has to do with anything. But in those days the most common calling style was for a bool to be returned in the carry flag, using the CLC and STC instructions respectively.
Call xxxx:xxxx
jnc yyyy

was a common construct. A modern hacker would probably consider the far call an exotic instruction, but back then far calls were used all the time, as they were the only way to cross 64kb boundaries in code.


The first virus


In the early 90s I became terribly obsessed with biologically inspired terminology in computers: viruses, worms and artificial intelligence. These terms were in every computer journal you could imagine, all the time. Dark Avenger and his nemesis Vesselin Bontchev were both heroes of mine, pushing the boundaries of what could be done, on the verge of changing the world. I wanted to write a virus, but had absolutely no clue how. So I tried to code up a neural network and thought at the time I'd succeeded. In retrospect I believe the chance of that being true is around 0. I made a good effort though, but I should've kept quiet - the comment in the school yard that "I'd gotten my intelligence to work" was too much of a softball not to follow me for a couple of years. I must've been around 14 at the time. Viruses were difficult to find, despite my best efforts. My first breakthrough was getting an antivirus - IBM Antivirus 2.0, if I recall correctly. It came with a human-readable signature file with a jaw-dropping 23 signatures. By playing around with this file and Norton Disk Edit I managed to have it detect other files as viruses. Soon after I got infected by the "form.a" virus and figured out how to remove it by running "sys.exe". Form was a boot sector virus that spread when computers booted from 3.5" floppy drives. During the boot process it hooked interrupts and could thus propagate itself when DOS called on these interrupts to mount another floppy disk. Ironically, boot sector viruses were amongst the first viruses and are the immediate ancestors of today's boot kits - now considered a more advanced type of malware.

Getting connected

Within a year of moving out to attend university I was cohabiting with 4 computer geeks like myself. It turned into a massive skill upgrade for me, due to synergies and because we ran the BBS "Psychic Damage", which at its prime was probably the best (1 Zyxel 19k2 node and 1 ISDN 64kb node) H/P/A/V BBS in Denmark. H/P/A/V stood for Hacking, Phreaking, Anarchy (explosives and drugs) and Virus. We had tons of documents on hacking, hacking logs, password files. However, we also had house rules: no hacking, no viruses. So even though we had a large archive of viruses I was banned from examining them. I stuck by the rules because I did not want to lose the moral high ground on the hacking question. At least one of the other guys would've loved to hack anything, and there is little doubt that he was capable of doing it. The first waves of hacking busts by what seemed like Denmark's only security expert at the time, Joergen Bo Madsen, were highly publicized. And Mr. Madsen had mentioned in a lecture that the password file from the department of computer science at my university was lying around on our BBS. He was right too, but I haven't got a clue how he could've known, and that kept me a bit scared. Nevertheless I would get the opportunity to reverse engineer my first virus in 1994. While making my first intro (a small graphics demo showing off coding skills, in the style of the day) I stumbled upon an int 21h, sub function 40h in another intro whose fading routine I wanted to steal. Int 21h sub function 40h is the equivalent of WriteFile() today, and it had no business in an intro. It turned out the intro was infected with Taipan. Debugging and disassembling a virus that had already infected our system was OK with the house rules - at least the way I bent them. So using the Game Tools debugger, the SoftICE debugger and the Sourcer disassembler I took apart the virus. Sourcer was the IDA of the day, and SoftICE by NuMega was by far the best debugger of that time period; however, it used too much memory and was unable to debug protected mode code that used the DOS4GW DOS extender. For those reasons one would, a great many times, stray to Game Tools - a freeware debugger that was just more convenient - for real mode stuff, and to the Watcom debugger for protected mode stuff. The latter came with the Watcom C compiler, which was behind the so-called DOS4GW DOS extender. The Watcom debugger was, however, always very buggy, crashing all the time. Also, in this period there was only one book you needed to own: Assembly Language Master Class (Wrox Press Master Class). My roommate Mads owned a copy that was constantly being explored by yours truly.


1995 - 2000

1995 was of course the year when Windows 95 came out, and it was a game changer. It was the final architecture change to protected mode. It was also the year when the internet became more than a VAX/VMS connection to "lynx", a text-based HTML browser, with almost no valuable information to be found anywhere. I found my way to EFFNET on IRC and met a large number of fellow assembler freaks, and that boosted possibilities again to the power of 10. I had documentation, help, mentors, tools and uncharted territory everywhere. One of the things I quickly became interested in was writing executable packers and crypters. There was a mailing list in those days for executable packers: Rose's exelist. Game Tools, Sourcer and SoftICE were out. There could be only one debugger, and that was WinIce from NuMega - anybody familiar with SoftICE could easily migrate. Obviously, writing executable encrypters immediately meant defeating debuggers - especially WinIce. I spent hours writing code to detect WinIce in all sorts of devious ways. Not long after, Mads bought Matt Pietrek's "Windows 95 System Programming Secrets". At first I didn't understand the book, but on my second go I immediately knew that the cool things he was doing, I could do too. WinIce, lots of Coca-Cola, and every secret in Windows 95 was mine. Even after Win2k came out, privilege elevation was so trivial that it wasn't even considered a feat unless it was particularly spectacular. Buffer overflows were in vogue and a dime a dozen. I remember Halvar Flake used to say: find me a sprintf and I'll give you an exploit. That wasn't quite true, but not far from it. And it was then I (semi-)retired as a malware hobbyist and started working as a programmer - ironically, just as the malware era had really begun.



 act-13.txt can be found here: http://web.textfiles.com/hacking/act-13.txt
New Brain picture was stolen from Wikipedia.com
Machine code monitor picture was shamelessly stolen from C-64 Wiki.


Monday, June 29, 2015

Machine learning with malware part 2: Model selection


Unrelated background information:
I wished I'd gotten a lot further a lot sooner with this. Also, this blog post is only barely up to the standard I set for myself. Compressing so much statistics and machine learning into a blog post of reasonable length while avoiding excessive math is a terribly difficult thing to do. I hope that it lifts the veil a bit on what machine learning is and how it can be leveraged in a malware setting, but if you really wish to get down with it there can be no two opinions about it: you need a real textbook, with all the rigor that requires.


It also took a long time because most of my family conspired to celebrate round birthdays, and because climbing season started, which keeps me away from the computer. Finally, I have this feeling that the "Predator" from the Machine Learning Part 1 blog post produced an insufficient number of features. So I spent a significant amount of time embedding my code emulator (from "Unpacking with primitive emulation") into Predator, at first thinking it would be easy. Emulating 80,000 malformed files, a significant number of which are missing DLLs etc., turned out to be way more time consuming than I imagined.

The blog post is here:
Machine learning with malware part 2: Model Selection

Thursday, June 18, 2015

LangSec opinions and a case INTO overflows


On an entirely different note: the next machine learning and malware post is almost finished...

A very short introduction to LangSec

Sassaman, Patterson & Bratus give a better and more formal introduction; they are probably less boring than I am and definitely more accurate.
When I first read about LangSec (language security) I did as I always do when reading about a new topic: boil it down as much as I can. My distillate was "If you don't make errors, your code will be secure". It is a truism, and an annoying one at that, because it either asks the impossible or points fingers at you for being what you are - a human. It is in many ways still how I think of LangSec, but digging below the surface of this new computer science discipline is nevertheless very much worthwhile. In fact I think it poses one of the most promising ways of approaching one of the roots of the insecurity problem - which is of course why I bother to write about it. Though it is mostly defensive, anybody playing offense might benefit from the way of thinking.

I think the two most important insights I've found in LangSec are:
1)      Unexpected user input should not be treated as data. Once user input wasn't expected, it is no longer data; it's a program running on a computer made up of your software, your hardware and their state - a so-called "weird machine". The input is a program written in a language that runs on the weird machine; it does weird things, and weird is bad.
2)      Postel's law states that you should be "liberal in what data you accept, and strict in what you send". Kill this law and replace it with "Be strict with user data".
The reason why 1) is really important is that it translates a security problem, which is often diffuse, into a very well-defined language-theoretic problem. Writing secure code becomes a question of developing a language that is well defined for any user data - and what I mean by "well defined" is backed by standard theory. Awesome!
If 1) describes the way to think about information security, 2) is a large part of how you do it in practice. Being strict with input and output radically reduces the chance that we as coders will retreat to assumptions. Assumptions are the mother of all f-ups.
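To make 2) concrete, here is a minimal sketch of what "be strict with user data" can look like in C++. The function, the field layout and the accepted version values are all made up for illustration - the point is only the shape: validate everything, and reject rather than guess.

#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical parser fragment: read a 4-byte little-endian version field
// and refuse anything the (imagined) specification does not explicitly allow.
uint32_t readVersionField(const std::vector<uint8_t>& input)
{
    if (input.size() < 4)
        throw std::runtime_error("truncated input");   // no guessing
    uint32_t version = uint32_t(input[0]) | (uint32_t(input[1]) << 8) |
                       (uint32_t(input[2]) << 16) | (uint32_t(input[3]) << 24);
    if (version != 1 && version != 2)
        throw std::runtime_error("unknown version");   // Postel's law inverted
    return version;
}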

Despite not working in classical infosec, I've spent a significant part of my career exploiting that people made their protocols more complex than they had to, or that those who implemented a protocol weren't particularly strict in interpreting it. As an example, I once developed a DVD video copy protection; though it isn't infosec, it's very much an exercise in utilizing that programmers had not taken 1) and 2) to heart. Part of that copy protection is just a denial of service (DoS) attack on ripping software. Three components made copy protection for video DVDs possible in the first place. The first is that the DVD Video specification has inconsistencies, undefined behavior and unnecessary flexibility; it is huge and confusing. The complete documentation is about two feet of shelf space. This almost surely makes programmers of rippers and players offer me a weird machine to program in the first place. Secondly, neither DVD rippers nor players are strict with the input data. The third element is that rippers and players react differently to fringe input. The challenge then boils down to writing a program for the weird machine in rippers that'll cause a denial of service of some kind, while making sure that particular program does not disrupt players.

A lot of LangSec has focused on parsers (and the "reverse parser", that is, the code building the data to be parsed), and this seems reasonable. With two shelf-feet of documentation, most of it written only in the notoriously imprecise language of human-readable English, errors are bound to be made when implementing it. LangSec has recommendations for how you can improve the process of writing the documentation in the first place - for instance, replacing .h files with something that also describes the relations between fields. LangSec also has recommendations on how you should deal with implementing a parser, and this is something most coders should read up on and take to heart. It will significantly improve the security of software. It's a different angle of attack than the classic approach of leaving security to compilers, operating systems and hardware. Now, I'm a great fan of strcpy_s-type functions, ASLR, DEP, CFG, sandboxes etc. and all the approaches made in this spirit, but they obviously aren't sufficient for security.

A real life integer overflow programming error

Below I have listed the sanity checking of the Import Directory in Sebastian Porst's PeLib (ImportDirectory.h). I've chosen this as an example of a classic int overflow problem, for a couple of reasons. The first is that I'd stumbled upon it recently, so it was readily available to me. The second is that it's a pretty classic example of not taking 1) and 2) above to heart. The third is that it's written by somebody who was sensitive to security issues. Mr. Porst has made himself an impressive career in infosec. Yet he made a mistake. I'm not arguing that Mr. Porst is a bad coder (quite the contrary). I'm arguing that if he made such a mistake, most programmers not only can but are likely to make this kind of mistake at one point or another. Before this leads to misunderstandings: I consider Mr. Porst's code generally of good quality, and Mr. Porst did indeed have good reasons to ignore 1) and 2) above - he wanted his lib to be able to repair broken PE files, which means he cannot be strict when parsing PEs.
So let's turn our attention to the code:
      

/**
       * Read an import directory from a file.
       * \todo Check if streams failed.
       * @param strFilename Name of the file which will be read.
       * @param uiOffset Offset of the import directory (see #PeLib::PeHeader::getIDImportRVA).
       * @param uiSize Size of the import directory (see #PeLib::PeHeader::getIDImportSize).
       * @param pehHeader A valid PE header.
       **/
1: int ImportDirectory<bits>::read(const std::string& strFilename, unsigned int
2:     uiOffset, unsigned int uiSize, const PeHeaderT<bits>& pehHeader)
3:{
4:     std::ifstream ifFile(strFilename.c_str(), std::ios_base::binary);
5:     if (!ifFile)
6:     {
7:            return ERROR_OPENING_FILE;
8:     }
9:          
10:    unsigned int uiFileSize = fileSize(ifFile);
11:         
12:    if (uiFileSize < uiOffset + uiSize)
13:    {
14:           return ERROR_INVALID_FILE;
15:    }
16: ifFile.seekg(uiOffset, std::ios_base::beg);
17: std::vector<unsigned char> vImportdirectory(uiSize);
18: ifFile.read(reinterpret_cast<char*>(&vImportdirectory[0]), uiSize);

19: PELIB_IMAGE_IMPORT_DIRECTORY<bits> iidCurr;
20: unsigned int uiDesccounter = 0;

21: InputBuffer inpBuffer(vImportdirectory);

22: std::vector<PELIB_IMAGE_IMPORT_DIRECTORY<bits> > vOldIidCurr;

23: do // Read and store all descriptors
24: {
25:    inpBuffer >> iidCurr.impdesc.OriginalFirstThunk;
26:    inpBuffer >> iidCurr.impdesc.TimeDateStamp;
27:    inpBuffer >> iidCurr.impdesc.ForwarderChain;
28:    inpBuffer >> iidCurr.impdesc.Name;
29:    inpBuffer >> iidCurr.impdesc.FirstThunk;
30:
31:    if (iidCurr.impdesc.OriginalFirstThunk != 0 || iidCurr.impdesc.TimeDateStamp !=
32:           0 || iidCurr.impdesc.ForwarderChain != 0 ||
33:          iidCurr.impdesc.Name != 0 || iidCurr.impdesc.FirstThunk != 0)
34:    {
35:           vOldIidCurr.push_back(iidCurr);
36:    }
37:
38:    uiDesccounter++;
39:
40:    if (uiSize < (uiDesccounter + 1) * PELIB_IMAGE_IMPORT_DESCRIPTOR::size()) break;
41: } while (iidCurr.impdesc.OriginalFirstThunk != 0 ||
42:    iidCurr.impdesc.TimeDateStamp != 0      ||
43:    iidCurr.impdesc.ForwarderChain != 0 ||
44:    iidCurr.impdesc.Name != 0 || iidCurr.impdesc.FirstThunk != 0);


        

Though there are a few layers of code above this, essentially the "uiSize" and "uiOffset" parameters are unverified user data (uiOffset is checked against 0, but not otherwise). We have the verification of the parameters in line 12. What Mr. Porst must have thought is pretty clear: if the sum of these two is bigger than the file size, it's wrong. What he forgot was that uiFileSize > uiOffset + uiSize holds for uiOffset = 0xFFFFFFFF and uiSize = 2, because the sum wraps around to 1 in an unsigned integer overflow[1]. In fact we can craft PE files with arbitrary values of uiOffset, and that is not expected. We are now programming a weird machine. In Mr. Porst's code what we can do with our overflow error is fairly limited. We can cause a regular crash, but beyond that the code looks solid - see what happens if we use uiSize=0xFFFFFFFF and uiOffset=2. But what would have happened if we changed lines 16, 17, 18 and 21 a wee bit, so that we read the entire file and just parse the right portion of it instead of reading only the import section:
16: ifFile.seekg(0, std::ios_base::beg);
17: std::vector<unsigned char> vImportdirectory(uiFileSize /*uiSize*/);
18: ifFile.read(reinterpret_cast<char*>(&vImportdirectory[0]), uiFileSize /*uiSize*/);
21: InputBuffer inpBuffer(vImportdirectory); inpBuffer.set(uiOffset);

In the well-behaved case everything remains functional, but we now have the potential to leak information. The point being: the sanity checking doesn't work. With uiSize > uiFileSize, and uiOffset chosen so that the check in line 12 passes, we'd be able to read beyond the vector-allocated buffer as much as we want. If some web service dumps the imports to the user using this function, we'd be able to dump heap content of the web service following the vector from line 17, and that might contain information not meant for anybody's eyes - which can be quite valuable to attackers; go google Heartbleed. If we had a write operation instead, we'd be writing arbitrary memory, and with a memory full of user data and lots of vtables lying around we'd have code execution in no time! It's pretty much standard exploit stuff - except we'll call it by a different name: programming the weird machine.

The LangSec perspective on this error
There are any number of ways to fix the problem above. For example, checking that the file read in line 18 succeeds would, in the unmodified case, stop the DoS from happening. You could easily do fixes of this type, and a great many developers do.
What LangSec suggests instead is that you honor what I've written as point 2). We should be strict. A check for uiSize < uiFileSize should be added, and a check for the overflow itself too; both should abort parsing if they fail. It would solve the problem. Also, being strict, the check in line 40 should return an error instead of just breaking out of the loop. Even though you could probably still find DLL names etc., it's a breach of protocol, and aborting processing will minimize the risk of relying on assumptions that'll lead to another instance of a weird machine. Ideally you'd even go so far as to do this as part of sanity checking before you start pushing values into other variables, say in line 35. Be sure the data is correct and you know exactly what to do with it, before you use it.
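For illustration, here is a minimal sketch of what such strict, overflow-safe checks could look like at the top of the read function. The variable names follow the PeLib snippet above, but this is my sketch, not Mr. Porst's code:

// Reject out-of-range values before doing any arithmetic that could wrap.
// Rearranged as a subtraction, the range check itself cannot overflow.
if (uiSize > uiFileSize || uiOffset > uiFileSize - uiSize)
{
    return ERROR_INVALID_FILE;
}

In the well-formed case this is equivalent to what line 12 intended, but uiOffset = 0xFFFFFFFF now fails the check instead of slipping through.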
If we step back to 1), what we need to notice is that in the malformed case, what happens becomes dependent on what's on the heap after the vector - that is, our code isn't well defined. We could use a theorem prover to check for this, at least in this case. I found the bug by running some 80,000 malware files through it, which I suppose counts as a kind of fuzzing. The key point is: if we first make sure that any data gives a predictable outcome, even if that outcome means turning down the request for processing, we have written safe code.

The old school compiler solution for this bug

The x86 platform has always had a number of exception interrupts. I sometimes think of them as old school CPU developers thinking deeply about what could go wrong in a computer - probably because the first exception I always think of happens to be division by zero, which also happens to be the first in the x86 list. In 5th place on the founding fathers' list comes the "overflow" interrupt. It's triggered by the INTO instruction, which essentially checks whether the overflow flag is set and, if so, causes an interrupt 4 (strictly, this catches signed overflow; for unsigned wrap-around like the one above you'd test the carry flag instead). In short, the CPU has since the dawn of time held the solution to the integer overflow problem: add eax, ebx; into. Done. Overflows no longer serve as "mov"-type instructions in a weird machine, but are always reduced to a DoS - in fact a kind of DoS most developers know very well how to deal with, using structured exception handling. Unfortunately the INTO instruction is almost never generated by compilers. Even the JO/JNO instructions wired to the overflow flag are hardly ever generated by modern compilers. All three are listed in Z0mbie's opcode frequency list with 0%, and that they are in there at all is more likely due to errors in Z0mbie's disassembler than because they actually see use. So this remains an illusion. To make it worse, operators cannot be overloaded for built-in integer types in C++, so I can't just bolt an overflow-checking + onto plain ints. I have no clue how many security breaches are the result of over/underflows in addition and subtraction, but it's probably not that uncommon. As we've seen above, it's an easy mistake to make, because the mathematical + which we all know and love turns out to behave differently from the x86 "add". And while I'd love to have an "overflow free" integer available in C++, the LangSec solution of doing things right seems like where we'd want to go. (Well, if I had a choice, I'd do both.)
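You can get close with a wrapper type, though. Below is a minimal sketch of an "overflow free" unsigned integer - my own illustration, nothing from PeLib - which amounts to sprinkling INTO-style checks after every add in software:

#include <limits>
#include <stdexcept>

// A checked unsigned int: addition either yields the mathematical result
// or throws, reducing any overflow to a DoS, just as a trapping INTO would.
struct checked_uint
{
    unsigned int v;
    explicit checked_uint(unsigned int x) : v(x) {}

    checked_uint operator+(checked_uint o) const
    {
        if (v > std::numeric_limits<unsigned int>::max() - o.v)
            throw std::overflow_error("unsigned add overflow");
        return checked_uint(v + o.v);
    }
};

With such a type, the check in line 12 would have thrown on uiOffset = 0xFFFFFFFF instead of silently wrapping. The cost is that it has to be used deliberately, everywhere - which is exactly why compiler-generated INTO or JO would have been so much nicer.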

No bugs, no insecurity. Even if it's a truism.

Literature:

Sassaman, Patterson & Bratus: http://langsec.org/insecurity-theory.pdf
Z0mbie's opcode statistics: http://z0mbie.daemonlab.org/opcodes.html





[1]              There is another error in the code. uiSize < sizeof(PELIB_IMAGE_IMPORT_DIRECTORY<bits>) will lead to leaking information too. I'll not discuss that error here.

Speculation on rowhammer

Update:

27.07.2015: Daniel Gruss, Clementine Maurice and Stefan Mangard released a paper today detailing how row hammer can be done in JavaScript on Sandy Bridge, Ivy Bridge and Haswell. The method is indeed a pattern of reads in memory to evict the aggressor address from the cache. They utilize that a cache set only has 12 or 16 entries (ways), and thus repeatedly using 13 or 17 addresses that map to a single cache set will cause L3 cache misses on the 13th/17th access - instead of my experiments with keeping the entire cache full (which works like a charm, but seems too slow for row hammer). What they ended up with is certainly more advanced than what I've played with, but it's the same ball game. Thus the speculation in the old post below unfortunately holds true.
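To give a feel for eviction-based flushing, here is a rough sketch under heavily simplified assumptions: a 16-way last level cache, a made-up set stride, and no slice hashing or virtual-to-physical translation (a real attack, like the one in the paper, has to build its eviction sets empirically):

#include <cstddef>
#include <cstdint>
#include <vector>

// Touch 17 addresses that (under our simplified model) map to the same
// 16-way cache set, so every pass forces at least one cache miss - and
// thus a DRAM access - without ever executing CLFLUSH.
void hammerByEviction(volatile uint8_t* buf, std::size_t setStride, int rounds)
{
    for (int r = 0; r < rounds; ++r)
        for (int i = 0; i < 17; ++i)
            (void)buf[i * setStride];
}

int main()
{
    const std::size_t setStride = 8192 * 64;   // assumed number of sets * line size
    std::vector<uint8_t> pool(17 * setStride);
    hammerByEviction(pool.data(), setStride, 1000000);
}

In JavaScript the same access pattern is just ordinary array reads, which is exactly what makes it so hard to filter out.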

It's time to test your RAM! Disabling JavaScript will go a long way towards protecting you. Whitelist sites that you consider safe, and HTTPS is important to avoid MitM injection of JavaScript. However, these measures are insufficient on their own - other advanced scripting languages could probably be abused too (Flash, Silverlight, ...).

A bit more speculation: Apple made an update to their EFI BIOS to mitigate it. I speculate that they increased the refresh rate, so that there is only 32ms between refreshes instead of the usual 64ms - the reason being that a 32ms refresh rate is recommended for running at high temperature and is thus likely to be readily available. This is insufficient to entirely rule out row hammer, as you'd need to drop the interval as low as 8ms to be safe. It's important to realize that less time between refreshes means speed penalties, because you cannot read during a refresh. For 8ms intervals the penalty will be pretty steep, though it's probably acceptable for 32ms. More refreshes also cause more power consumption. I have seen wildly differing estimates of this, but I speculate that on a running laptop it's not a real issue, and if implemented right (say, a 64ms refresh interval while the computer is sleeping) no issue at all.





Original Post
It's cool to be a malware hobbyist. I can write purely speculative blogs. It's like being a comedian doing news - it's just comedy... And this blog is pretty speculative - sorry.



The speculation

In my last post on the subject, "Row hammer fix POC and generic root kit detection using performance counters", I wrote that I doubted we'd ever see a script version of this bug. My reasons were that scripts are unlikely to use CLFLUSH or any non-temporal (MOVNT) instructions in any predictable manner, and would probably be too slow to flush the cache by accessing it in a pattern that the CPU wouldn't predict while still hammering quickly enough. The first thing that made me change my mind was stumbling upon this article: http://iss.oy.ne.ro/SpyInTheSandbox.pdf, which manipulates the cache and then uses it as a side channel to obtain information from outside a sandbox through JavaScript. This chipped away at my confidence. Then came this article, https://xuanwulab.github.io/2015/06/09/Research-report-on-using-JIT-to-trigger-RowHammer/, which concludes that JIT compilers would not generate the instructions we needed for row hammering. This scared me, because I'd forgotten all about JIT compiling: while we might be fast enough without JIT, having JIT would definitely make it fast enough. And finally came a tweet from Lava (@lavados):

Yesterday @BloodyTangerine and I flipped bits on Ivy Bridge without clflush... #rowhammer

So the plot thickens. I don't know how BloodyTangerine and Lavados are flipping the bits, but if I were a betting man I'd place my money on an approach like Spy in the Sandbox. There is a bit of evidence in that direction: the cache is built differently on different chipsets, and that could be a reason to mention the specific chipsets. Conclusion: I was wrong. My new opinion is that we'll see a JavaScript row hammer exploit.
Full disclosure: I had a 280-character chat with Lava after I wrote this. That conversation is not part of this blog, because everything was said in private. I dare say that if this stuff interests you, it would probably be worthwhile following Lava's tweets.

My Experiments

I played around a bit with row hammer after my last blog post. The reason that row hammer doesn't occur naturally all the time in memory is that the cache catches reads to any address being hammered. This is also why the performance counter on the last level cache works well for detecting and preventing row hammering. To get around this, Dullien and Seaborn (the original row hammer work) used the CLFLUSH instruction. I tried to hammer using the MOVNTQ instruction and was not successful. Then, after reading "Spy in the Sandbox", I started writing up a cache flush routine using multiple threads, inspired by it. The idea behind using multiple threads is that the biggest cache on modern CPUs is the level 3 cache, which is shared by all cores (and hyper threads), making it much easier to keep the cache full of content unrelated to the row I'm hammering. The 1st and 2nd level caches I don't consider too much of a problem, since they are quite small and could probably be kept full of unrelated stuff by a single thread. Unfortunately I never finished this code, so I'm not sure if it'd actually work for row hammer. The real issue isn't keeping my hammering row out of the cache, but doing so while hammering enough to cause bit flips.

Why row hammer in scripts would be really bad news

If Lavados and BloodyTangerine are indeed using the method I was playing with - or even a derivative of it - then it's really bad. Mark Seaborn's fix of blacklisting CLFLUSH in the validator of the NaCl sandbox would not extend to this, because now normal MOV instructions would suffice. Even ADC, CMP, SUB, INC, ... would suffice, and these are common and useful instructions. Even worse, since browsers all too easily run JavaScript, row hammer could easily be extended from a local privilege elevation to a full remote breach of the attacked host. Worse yet, such an attack could be inserted automatically through man-in-the-middle on non-HTTPS connections - like China's "Great Cannon", in a worst-case scenario. You might argue that ECC RAM would mitigate this, but that's only a half-truth. ECC will most of the time reduce it to a DoS attack, but row hammer would from time to time flip more than one bit and dance right through the ECC on the RAM. It wouldn't be the perfect storm, because there is a tiny bit of good news, though not much: row hammer is difficult to weaponize, and we know how to defeat it - even if the method is far more complex than traditional fixes.

Literature:

xuanwulab (2015): https://xuanwulab.github.io/2015/06/09/Research-report-on-using-JIT-to-trigger-RowHammer/
Oren et al(2015): http://iss.oy.ne.ro/SpyInTheSandbox.pdf
Fogh (2015): Row hammer fix POC and generic root kit detection using performance counters (This blog)