Install-level fraud detection with Bayesian networks
AppsFlyer processes millions of app installs every day.
Each install provides a plethora of measurable data points, from timestamps to device sensor indicators. Together, these data points give clear insight into each install and its quality.
Unfortunately, many installs traveling through our ecosystem are fraudulent.
Now, more than ever, fraud protection is a fundamental necessity for the mobile industry, second only to attribution in measurement.
The low-scale challenge
As fraud protection methods evolve, so do fraud operations.
As new technology emerges, fraudsters identify and exploit new loopholes, and new identification methods follow.
Many fraudulent installs can be identified by install authentication methods or through their association with fraudulent cluster patterns. However, smaller fraud cases can sometimes slip under the radar, as low-volume sites exploit their small sample size.
Identifying fraud at the level of an individual install is perhaps one of the biggest challenges.
This is not merely a classification problem: experience has shown that we must also provide the reason behind each classification. That requirement keeps us searching for the ideal classifier.
The ideal classifier learns not only specific parameters but also the rules for making an informed decision and the logic for blocking a specific install. It works at the install level, overcomes the challenge of identifying fraud on low-scale, smaller sites, and still blocks fraud accurately in real time with a minimal false-positive rate.
As we looked for the ideal classifier, we came across Bayesian networks.
A Bayesian network is a probabilistic model that represents the dependencies between variables as a directed acyclic graph.
This identification method calculates the probability of observing a specific set of install parameters. It does so by modeling the dependencies between pairs of variables, identifying which pairs depend on one another and which do not.
Bayesian networks are well suited to taking an event that has occurred and estimating the probability of each of its known possible causes.
For example, a Bayesian network could represent the probabilistic relationships between a disease and its symptoms. Once specific symptoms are presented, a Bayesian network can be used to calculate the probability of various diseases.
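The disease and symptom example can be sketched in a few lines of Python. This is a minimal illustration of Bayes' rule on a two-node network (disease → symptom), not AppsFlyer's model. The disease names and every probability below are made-up values chosen only for illustration.

```python
# Hypothetical prior probabilities P(disease); illustrative values only.
prior = {"flu": 0.05, "cold": 0.20, "healthy": 0.75}

# Hypothetical likelihoods P(fever | disease); illustrative values only.
p_fever_given = {"flu": 0.90, "cold": 0.20, "healthy": 0.01}


def posterior(prior, likelihood):
    """Bayes' rule: P(d | s) = P(s | d) * P(d) / sum over d' of P(s | d') * P(d')."""
    joint = {d: prior[d] * likelihood[d] for d in prior}
    evidence = sum(joint.values())  # P(s), the probability of seeing the symptom
    return {d: p / evidence for d, p in joint.items()}


# Probability of each disease once a fever has been observed.
post = posterior(prior, p_fever_given)
for disease, p in sorted(post.items(), key=lambda kv: -kv[1]):
    print(f"P({disease} | fever) = {p:.3f}")
```

Although flu is rare in the prior, observing the symptom makes it the most likely explanation under these numbers, because fever is far more likely given flu than given the alternatives.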
How does it work?
We use a variant of the chi-square test to test conditional dependence between variables. Treating every variable as dependent on every other would keep the calculation correct but make it intractable.
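The exact variant is not described here, so the sketch below shows only the standard Pearson chi-square test of independence between two categorical variables, as a rough idea of what a dependence test looks like. The field names (`device`, `os`), the sample counts, and the 5% significance level are assumptions made for illustration.

```python
from collections import Counter


def chi_square_statistic(pairs):
    """Pearson chi-square statistic for independence of two categorical variables.

    `pairs` is a list of (a, b) observations. All names here are illustrative.
    """
    n = len(pairs)
    observed = Counter(pairs)
    row_totals = Counter(a for a, _ in pairs)
    col_totals = Counter(b for _, b in pairs)
    stat = 0.0
    for a, row_n in row_totals.items():
        for b, col_n in col_totals.items():
            expected = row_n * col_n / n  # count expected if a and b were independent
            stat += (observed[(a, b)] - expected) ** 2 / expected
    return stat


# Hypothetical (device, os) observations: new devices mostly run the new OS.
dependent = (
    [("new", "os_new")] * 45 + [("new", "os_old")] * 5
    + [("old", "os_new")] * 5 + [("old", "os_old")] * 45
)
# Hypothetical observations where device tells us nothing about the OS.
independent = (
    [("new", "os_new")] * 25 + [("new", "os_old")] * 25
    + [("old", "os_new")] * 25 + [("old", "os_old")] * 25
)

# Chi-square critical value for 1 degree of freedom at the 5% level.
CRITICAL_1DF_5PCT = 3.841


def looks_dependent(pairs):
    """True when the independence hypothesis is rejected (2x2 table, 5% level)."""
    return chi_square_statistic(pairs) > CRITICAL_1DF_5PCT


print(looks_dependent(dependent), looks_dependent(independent))
```

Pairs that fail the independence test are candidates for an edge in the network. Pairs that pass can be modeled separately, which is what keeps the overall calculation tractable.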
The network enables us to find combinations of two or more fields that have a very low, or even non-existent, probability of actually occurring.
Some combinations may be trivial to spot, like a new device model with an old OS, but covering each case manually is hard to carry out and maintain at scale.
Bayesian networks specifically help when examining a combination of several parameters.
While each pair might seem legitimate when examined individually, it is the combination of all variables put together that is statistically implausible.
For the sake of this example, let’s consider 50 different variables, each with 10 options. This would yield 10^50 possible combinations of values.
If all variables were independent, all we would have to do is learn each of them separately, leaving us with 50 × 10 = 500 probabilities to estimate, which is very easy to compute.
However, not all of them are independent.
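The gap between these two extremes can be checked with a few lines of arithmetic. The sketch below also adds a middle case, a network in which each variable depends on at most a few others. The cap of two parents per variable is an assumption chosen for illustration, not a figure from our model.

```python
n_vars = 50       # variables per install, as in the example above
n_options = 10    # possible values per variable
max_parents = 2   # assumption: illustrative cap on parents per variable

# One probability for every possible combination of values.
full_joint_entries = n_options ** n_vars

# One probability per value of each variable, if all were independent.
independent_entries = n_vars * n_options

# Each variable needs a table over its own values and those of its parents:
# at most n_options ** (max_parents + 1) entries per variable.
sparse_network_entries = n_vars * n_options ** (max_parents + 1)

print(f"full joint table:      {full_joint_entries:.0e} entries")
print(f"fully independent:     {independent_entries} entries")
print(f"network, <= {max_parents} parents: at most {sparse_network_entries} entries")
```

Even with some dependencies modeled, the number of values to learn stays many orders of magnitude below the full joint table.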
To calculate this accurately, we must first identify which variables are independent and which depend on one another.
This identification produces the structure of a Bayesian network, which allows us to accurately compute the probability of an install across many different variables.
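Once the structure is known, the probability of a full set of install parameters is the product of each variable's probability given its parents. The following is a minimal sketch on a toy four-variable network. The variable names, the edges, and all probabilities are hypothetical and do not reflect our production model.

```python
# Toy network: each variable maps to (parents, conditional probability table).
# The table is keyed by the tuple of parent values. All numbers are hypothetical.
network = {
    "device": ((), {(): {"new": 0.4, "old": 0.6}}),
    "os": (("device",), {
        ("new",): {"os_new": 0.98, "os_old": 0.02},
        ("old",): {"os_new": 0.30, "os_old": 0.70},
    }),
    "country": ((), {(): {"US": 0.5, "DE": 0.5}}),
    "carrier": (("country",), {
        ("US",): {"carrier_a": 0.90, "carrier_b": 0.10},
        ("DE",): {"carrier_a": 0.05, "carrier_b": 0.95},
    }),
}


def install_probability(network, install):
    """Joint probability of an install: product of P(variable | its parents)."""
    prob = 1.0
    for var, (parents, cpt) in network.items():
        parent_values = tuple(install[p] for p in parents)
        prob *= cpt[parent_values][install[var]]
    return prob


common = {"device": "new", "os": "os_new", "country": "US", "carrier": "carrier_a"}
rare = {"device": "new", "os": "os_old", "country": "DE", "carrier": "carrier_a"}

print(f"common install: {install_probability(network, common):.4f}")
print(f"rare install:   {install_probability(network, rare):.4f}")
```

In this toy network, the second install combines a new device with an old OS and a carrier that is unusual for its country, so its joint probability is a tiny fraction of the first one's.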
We essentially calculate the probability that an install is fraudulent. However, this probability must be high enough to pass AppsFlyer’s strict threshold before the install is blocked, as we aim to avoid false positives.
With this advanced model already in use across our ecosystem, we identify over a million fraudulent installs a day. About 50% of these installs would otherwise have gone undetected by previous rule sets. Blocking these additional installs saves AppsFlyer’s customers millions of dollars.
By using Bayesian networks and constantly adding new features and methods to our fraud detection capabilities, we are now better equipped to take on current and future fraud challenges as we continue our efforts to fight mobile ad fraud.