Variant analysis
Variant analysis is the process of taking a known problem, such as a security vulnerability, and finding other occurrences (or "variants") of that problem in a codebase. The existence of one vulnerability in a codebase strongly suggests that similar vulnerabilities are present elsewhere in it. Traditional methods for variant analysis have typically been manual, inefficient, and time-consuming, but automated approaches are now becoming increasingly accessible.
About variant analysis
Variant analysis is the process of using a known security vulnerability as a seed to find similar problems in your code. It’s a technique that security engineers use to identify potential vulnerabilities, and ensure these threats are properly fixed across multiple codebases.
Querying code using CodeQL is the most efficient way to perform variant analysis. You can use the standard CodeQL queries to identify seed vulnerabilities, or find new vulnerabilities by writing your own custom CodeQL queries. Then, develop and iterate on the query to automatically find logical variants of the same bug that could be missed using traditional manual techniques.
Traditional variant analysis
Traditional variant analysis involves manually examining code to identify instances of similar vulnerabilities after uncovering one flaw within a specific system. For instance, if you come across a buffer-overflow vulnerability in your application:
char settings[80];

void set_param_string(char *user_input) {
    sprintf(settings, "username = %s", user_input);
}
If user_input is user-controlled, the user can supply a string longer than the settings buffer can hold, and the program will crash (or worse). We can fix this by replacing sprintf with the safer snprintf function, which is told the size of the destination buffer:
#define SETTINGS_LEN 80

char settings[SETTINGS_LEN];

void set_param_string(char *user_input) {
    snprintf(settings, SETTINGS_LEN, "username = %s", user_input);
}
Automated variant analysis
With QL, variant analysis is done by running queries to find syntactic or semantic patterns in the code. Such patterns range from the use of unsafe functions to more complex patterns like user data flowing to the response of an HTTP request, leading to cross-site scripting vulnerabilities. The ability to quickly write and change queries is central to an effective variant analysis tool, since finding variants accurately requires you to iterate on queries to reduce both false negatives and false positives. Even the safer snprintf can be used unsafely, so let's build up a query that finds one such unsafe usage.
Consider this snippet of unsafe code:
while (!bFoundPositiveMatch) { /* Loop broken below */
    ...
    iAllNames += snprintf(allNames + iAllNames,
                          sizeof(allNames) - iAllNames,
                          "DNSname: %s", szAltName);
    ...
}
There are two aspects of this call to snprintf that together make it unsafe:
- The format string contains %s. (The vulnerability relies on integer overflow as snprintf's return values accumulate in iAllNames; without %s in the format string, the return values would be too small to cause an overflow.)
- The return value of snprintf is fed back into its size parameter. (When the accumulated return values cause iAllNames to overflow and become negative, the size parameter becomes larger than the size of the buffer, and a buffer overflow results; the sketch below makes this arithmetic concrete.)
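To make the second point concrete, here is a minimal, hypothetical sketch (the 1024-byte buffer and the particular negative value of iAllNames are invented for illustration) showing how a negative iAllNames turns the size argument into a value larger than the buffer:
#include <stdio.h>

int main(void) {
    char allNames[1024];
    /* Hypothetical state: repeated additions of snprintf's return value
     * have overflowed the signed counter and left it negative. */
    int iAllNames = -16;

    /* sizeof() yields an unsigned size_t, so subtracting a negative int
     * produces a value larger than the buffer itself: snprintf would be
     * told it has 1040 bytes of room in a 1024-byte buffer, while
     * allNames + iAllNames already points before the start of the buffer. */
    size_t size_arg = sizeof(allNames) - iAllNames;
    printf("buffer: %zu bytes, size argument: %zu bytes\n",
           sizeof(allNames), size_arg);
    return 0;
}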
Let's start by finding all calls to snprintf:
import cpp
from FunctionCall call
where call.getTarget().getName() = "snprintf"
select call
We first import the cpp library. QL is language-agnostic, and relies on libraries written in QL to define the basic types involved in a C/C++ query, such as the FunctionCall type. A query over a Java codebase would import the Java library instead.
We next define our query. The query looks a bit like an SQL query with its SELECT clause at the end instead of the beginning. It can be read as "calculate the set of all FunctionCalls such that the name of the function being called is snprintf".
Note that, even with this simple query, we are already ahead of text-search tools like grep, since call.getTarget() takes macro expansion into account. This query will find every call to snprintf in the codebase. Let's restrict the query to only find calls whose format strings contain %s:
import cpp
from FunctionCall call
where call.getTarget().getName() = "snprintf"
  and call.getArgument(2).getValue().regexpMatch("(?s).*%s.*")
select call
This only takes into account calls to snprintf whose format string is a string literal, but that is by far the most common case. We could extend the query to handle format strings defined elsewhere, but we won't cover that here.

Finally, let's further restrict the query to those calls to snprintf whose return values flow back into their size arguments. Writing such a query from scratch is non-trivial, because it requires modeling the data flow of the program. However, QL includes comprehensive data-flow libraries for its supported languages. Using the data-flow taint-tracking library, the query remains quite simple:
import cpp
import semmle.code.cpp.dataflow.TaintTracking
from FunctionCall call
where call.getTarget().getName() = "snprintf"
  and call.getArgument(2).getValue().regexpMatch("(?s).*%s.*")
  and TaintTracking::localTaint(DataFlow::exprNode(call),
        DataFlow::exprNode(call.getArgument(1)))
select call
TaintTracking::localTaint(source, sink) is true if there is a path in the program's data-flow graph from the source node to the sink node. In the query above, we use the call itself as the source (DataFlow::exprNode(call) returns the node in the data-flow graph corresponding to the call to snprintf), and the call's second argument (i.e., snprintf's size parameter) as the sink. This accurately finds the calls to snprintf that are vulnerable to this buffer overflow, saving us from having to manually inspect every call to snprintf for this kind of data flow.
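To illustrate the kind of logical variant this approach catches, consider the following hypothetical rewrite of the pattern (the function, its parameters, and the buffer are invented for illustration). The return value reaches the size argument only through the intermediate locals written and offset, so a text search for the original pattern would miss it; assuming the usual local taint steps through assignments and arithmetic, localTaint should still connect the call to its size argument:
#include <stdio.h>

char allNames[1024];

/* Hypothetical variant: the snprintf return value flows through the
 * locals written and offset before reaching the size argument on the
 * next loop iteration, so the textual shape differs from the original
 * snippet, but the underlying data flow is the same. */
void collect_names(char **names, int count) {
    int offset = 0;
    for (int i = 0; i < count; i++) {
        int written = snprintf(allNames + offset,
                               sizeof(allNames) - offset,
                               "DNSname: %s", names[i]);
        offset += written;
    }
}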
Variant analysis to prevent future bugs
Variant analysis can help find undiscovered bugs and vulnerabilities that already reside in your code, but what about bugs yet to be introduced by mistakes that developers make in the future? You could make traditional variant analysis part of your manual code review process, checking for every class of bug that has been the subject of a past variant analysis. However, this quickly becomes infeasible and error-prone as the list of things to check for grows, and as the size and complexity of your codebase grow with it; your code review process would eventually grind to a halt.

Automated tools like QL can help here too, just as they help with on-demand variant analysis. Over time, you can build up a collection of queries developed in response to past bugs and vulnerabilities, and run all of them automatically during code review. This frees your developers to review more interesting things. When running the queries produces results, developers can fix the issues before they are ever introduced into your codebase; when they produce false positives, you can take the opportunity to refine your queries to account for those situations.