-=( ---------------------------------------------------------------------- )=- -=( Natural Selection Issue #1 -------------------- How AVs Detect Viruses )=- -=( ---------------------------------------------------------------------- )=- -=( 0 : Contents --------------------------------------------------------- )=- 0 : Contents 1 : Introduction 2 : The Basics 3 : Emulation 4 : Last Resort 5 : Conclusion -=( 1 : Introduction ----------------------------------------------------- )=- Over the last 10 years, Anti-Virus products evolved a great deal to well beyond the simple scan-stringers of the early 90s. Yet, beyond some vague concepts like "emulation", the AV industry has been reasonably successful in keeping everyone in the dark as to exactly the mechanisms by which viruses are detected. The time has come (in fact been long overdue) to shed light on a little of the info the AV guys don't want you to know. -=( 2 : The Basics ------------------------------------------------------- )=- AVs still use scan strings to identify the vast majority of viruses. Several newer techniques have been added into their scanning engines which allow them more options incase that signature scanning is not practical. So, let's look at the problem from an AV point of view. You have a file and you wish to determine which, if any, of a big list of viruses patterns it contains. Since there are a large number of viruses (over 40,000) it is not practical to go through a file once for each virus individually. Instead all the simple to ID viruses are checked for first and all at once. Viruses which are simple to ID include those for which regular scan strings exist and for those viruses who use a overly simple method of encryption like add, sub, or xor (there is an article about it in matrix#2 if you're interested). First the file type of the file being scanned is determined. Depending on the file type being checked the corresponding subset of viruses which are capable of infecting that file type are included and the rest excluded from the search (there is no point is scanning for a MBR infector inside a Word document afterall). The exact nature of the scan strings depend on the vendor. Scan strings are typically 20-30 bytes long. Some AVs allows scan strings with wild cards. Others allow for half-byte (nibble) matches. Some even allow for some variation in the length of wildcards (e.g. B8 02 [wildcard - 3 to 5 bytes] B9 00 ) to accommodate for garbage code. It is worthwhile to note that the AV guys don't like using wildcards to match viruses, and instead, rather put multiple scan strings per virus to catch all possibilities (some viruses have 8 or more different scan strings). Anyways, this is what a typical scans string looks like: E8 3D 00 C3 BA 00 01 B9 55 10 B4 40 E8 41 FE C3 B8 02 42 33 C9 33 D2 E8 (an actual scan string of Kampi.4181) Some AV products, after getting a scan string match, then preform a CRC check on the rest of the suspected virus so as to be sure they identified it correctly. In an effort to speed up scanning, some of the AV guys like to have all their scan strings conform to a certain type. That being scan strings which start with the only a small set of bytes instead of being able to start with any byte. This has the effect that a scanner does not have to check every byte in the input file as a possible start of a scan string. This set of start bytes is typically small - around 16 or so - and correspond to some of the most common instructions found in programs. For example, in a DOS file scanner bytes corresponding to some of the following instructions are probably used: "cs:", "cmp ax,#", "push ax", "int", "jz", "mov reg, r/m", "mov ax, #", "call", "jmp (full)" When scanning for unknown viruses, most AV products use emulation, but some still use a form of scan strings. These strings are short, and represent code sequences commonly found in viruses (EXE header checks, open file, the word "virus" [haha], etc). The presence of only one of these strings will not set off a scanner, but the presence of multiple ones may (though it is rare these scanners ever find anything). -=( 3 : Emulation -------------------------------------------------------- )=- So, now that all the easy stuff has been detected, the AV is left with the problem of detecting the strong polymorphic and metamorphic viruses. For this, one of the methods of checking is to use emulation. The emulator attempts to go through the code, and decrypt part or all of the main virus body in memory at which point it again tries to scan string. Since most of the viruses which are still left to scan for are polymorphic, the scanners expect to see stuff like a decryption routines and garbage instructions. They thus formulated a few nice speed ups. The emulator works by keeping track of the areas of virtual memory (where the file is being emulated) that have been changed. Every so often, the emulator is interrupted and the changed bytes in virtual memory (typically their emulators can handle around 64kb of dirty pages) are scanned for the presence of a scan string of one of the remaining viruses. Interruptions can occur after any write to memory - especially those to a code segment, and when the number writes to a single page is much more common than writes to another page. In order to prevent the emulator from running forever, the emulator stops when either 1) the file could not contain any known virus, or 2) a certain amount of instructions have been processed. The upper limit of instruction emulated is usually somewhere around 1 to 2 million but it could be higher now. Emulators also tend to shut down when they are presented with an unknown API (the emulator doesn't know how many parameters the API thus can't clean the stack). Emulation is slow - and hence it is desirable for it to stop before the upper limit of emulation is reached. Early scanners used a heuristic on how long the emulator should run. The file is assumed to have some initial percentage chance to be infected (10% or so). As the file is emulated, this probability updated with Promoter and Inhibitor Rules. Promoter Rules are suspicious things that indicate the presence of a virus. For example, encountering a garbage instruction (destroys content of registers before use) will increase this by about 1%. Inhibitor rules do the opposite. They reduce the possibility of a virus being in the file (rules like: system calls, and few memory writes). Emulation continued with this method while probability of a virus was more than zero. While this method is no longer used for scanning for known viruses, it is likely some form of it is in use for detecting unknown viruses. In newer scanners, as the emulator progresses, the list of viruses which could still be in the file is updated based on the instructions encountered. For example, if a file contains the instruction: "ADD eax, 3". In machine code there are three ways to represent this instruction (great design): 05 03 00 00 00 ; EAX register "short" form 83 C0 03 ; Sign extended general reg add 81 C0 03 00 00 00 ; Not sign extended general Strangely enough, the first is the normal one to use in this case. Go Figure. Anyways, if a poly engine is only capable of making the first and third of these instructions, but the emulator finds the second, then that virus is eliminated from the list of viruses possibly infecting the file. This is called an exclusion string. Exclusion strings are typically small (around 3 bytes) which can eliminate the presence of a virus from a file. There also exist something called Inclusion Strings. Inclusion strings register an unusual set of instructions which could signify a virus and they let the emulator know that it should keep on scanning because something funny is happening. Things that make inclusion strings are things like the third case from the above example, and various do nothing garbage instructions like "ADD al, 0", "CLC / JC", etc. Inclusion strings are also used if a virus always contains a small constant string near the beginning that is too short for a full scan string. Failure to find the inclusion string will eliminate the virus from the viruses that the file could be infected with. Some of the most interesting inclusion/exclusion choices are those in which a uninitialized value has been used as an index to a memory location. An uninitialized value for a memory read is a dead give away that something funny is going on as this is not a common feature of most programs :). A random memory write though is hardly ever used in program or virus, and it indicated that there is something wrong with the file - maybe corrupt - and thus is an exclusion condition. The newer AV emulators also do not have to go through a program in the same order as the program would run. The AV were troubled now for quite some time about conditional branches that would cause the virus to not run. Something like "if ( seconds < 20 ) then early_abort" could cause a virus to exit prematurely which meant that the emulator would fail to find it. To combat this, whenever a conditional branch of this type is encountered, the AVs can save the state of the program before and after the branch, and if it appears that the path the emulator took is a dead end, it can go back and try running the other sequence of code. With this method, they are capable - IN THEORY - to make sure that each and every instruction of a program can be run. The emulator can stop early if all of the known viruses are eliminated from being in the file (and heuristics doesn't want to continue). Little of this should come as any huge surprise or anything. The AV guys at this point are hoping that 99.9% or more of all viruses are eliminated as being in the file. So, since this is defeatable with a good long poly decryptor, some good EPO tactics (Entry Point Obscursion), and of course metamorphism, it'd be nice to see what the AVs use when all else fails. -=( 4 : Last Resort ------------------------------------------------------ )=- The AVs still have a few tricks up their sleeves. They hate using them from what I gather though as it makes their scanners pretty slow as they have to deal with the virus more or less individually (plus it's more work for them). In the case of a virus with EPO, the AV guys have the option of trying to guess where the start of the decryptor and emulating from there normally. This methods sometimes works ok, but some quick exclusion code/condition is a necessity for this kind of scanning. Furthermore, it may not always be possible to locate the decryptor or, good heavens, the virus could even be metamorphic :-) Sometimes, like in the early days of polymorphic viruses, they attempt to detect the polymorphic decryptor itself, rather than the virus code. This method however is seldom used as it yields both false negatives and false positives a high percentage of the time. What is often tried at this point is to detect a virus without the use of a scan string. I mean really - how many real files have the entry point in the last section (writable of course) of a PE file with the section name being "VIRUS", has an incorrect SizeOfData, and has a file size divisible evenly by the number 101? Yes, I agree it's an extremely cheap, dishonest, and cheesy way of detecting viruses, but unfortunately it works. If they can't detect the virus code, they just look for other give away signs, combine it with the little info they know the virus does, and they have a detection routine. For some reason, they call this method "Geometric Detection". It's error prone and yields false positives at times, but it can work well enough. So, what do they look for in the file with their "Geometric Detection"? A lot of things - some of which is used by their heuristics to pick up unknown viruses too. Among these include: - Entry Point Location (in the last section) - Entry Point code starts with a "JMP" - Incorrectly calculated SizeOfCode/SizeOfImage in the header - Suspicious section names - Suspicious imports from KERNEL32 (like by ordinals) - Finding Delta offset like code (CALL/POP) near Entry Point - Multiple PE Headers (in case of a prepender) - Incorrect checksum (Good reason NOT to use it as an infection marker) - A Patched Import table They combine this with virus specific things (they call a 'filter') like: - Any inclusion strings - The infection marker - The approximate code size - The instructions encountered during emulation (type of garbage code) - etc - there are many others of course. This is what they do if they are lazy and or they let their scanning engine get a little old and didn't have time for a complete overhaul to the new way of doing things. There are two more things they can do. The first is to combine a state-machine with their emulator. If a virus is metamorphic by only inserting garbage code, then as the emulator goes along, it compares instructions to ones used in the virus. If the emulator finds the same series of instructions as the virus (the virus without the garbage code), then it is flagged. One would have to assume that after a certain number of mismatches, the emulator would have to give up and claim not found. Thus a large enough amount of garbage code could do the trick. A last and best option that the AV guys have currently is to scan the dataflow while emulating. Suppose there was not a single byte constant from virus generation to generation. The AVs figured out that they must scan for functionality, not the opcodes. For example, when you have some random series of instructions which is completely metamorphic which generate the following: - push offset to "CreateFileA" string - push handle of KERNEL32.DLL - call GetProcAddress The AVs can break from the emulator when they reach the first instruction in GetProcAddress in KERNEL32, and examine the parameters to see what is being requested. If the emulator detects the same combination of calls in the same order as the virus, even a metamorphic virus can be detected in this way. The scanners are switching over to this method of scanning, if they have not done so already. There are only few ways around this method of scanning: 1) Use EPO to hide the start of the virus/decryptor 2) Make sure the emulator doesn't get to the virus code - This can be done via very long polymorphic loops, and brute force decryption loops. If the polymorphic engine is just right (garbage read and WRITE statement), then it should be possible for the AVs to get to no single point where they can check values. Although this can't hide the data (like offsets, etc), if can try to obscure it by masking it with dozens of unrelated garbage instructions. 3) Vary the algorithm - Granted, this method is hard. In theory it should be possible though. What you could do is figure out which API and other functions have to be done before others and then rearrange the order in which those parts are executed from one generation to the next. 4) Garbage API calls - Call random APIs and discard the answers. This should confuse the state machines. Beware - you need to call the APIs with most of the parameters correct as APIs like to crash when it expects a pointer and it gets NULL instead. -=( 5 : Conclusion ------------------------------------------------------- )=- Since the information in this article was not exactly easy to gather - for some reason the AV guys are not forthcoming about it - I cannot guarantee accuracy. If there are any new developments, I'd love to hear about them. Hopefully though, this will give you a good enough idea as to what you are up against, and how to make them invest just a little more time on your virus. They like to boast that it takes vXers months to write the stuff that they can detect in minutes and at worst in days. Let's see if we can fix that dreadful imbalance. :-) -=( ---------------------------------------------------------------------- )=- -=( Natural Selection Issue #1 --------------- (c) 2002 Feathered Serpents )=- -=( ---------------------------------------------------------------------- )=-