Chapter 2 Data Collection, Sampling, and Preprocessing
2.1 Introduction
GIGO: the garbage in, garbage out principle
2.2 Types of data sources
- Transactional data:
  - stored in OLTP (online transaction processing) relational databases
  - RFM variables (recency, frequency, monetary)
- Contractual, subscription, or account data:
  - stored in a CRM (customer relationship management) database
  - a source of sociodemographic information: slow-moving data dimensions
  - sources to retrieve sociodemographic or factual data: subscription data, data poolers, survey data, publicly available data sources
- Data poolers:
  - gather data and sell it to interested customers
  - build predictive models and sell the output of these models as risk scores
  - e.g., Experian, Equifax, CIFAS, Dun & Bradstreet, Thomson Reuters
  - the FICO score uses data from Experian, Equifax, and TransUnion
- Surveys:
  - online: Facebook, LinkedIn, Twitter
  - offline
- Behavioral information:
  - fast-moving data / dynamic characteristics
  - includes: customer preferences, usage information, frequencies of events, trend variables (e.g., of turnover, solvency, number of employees), …
- Unstructured data: text documents
- Unstructured data: contextual or network information
- Qualitative, expert-based data:
  - a popular example of applying expert-based validation is checking the univariate signs of a regression model
- Publicly available data:
  - such as macroeconomic data (GDP, inflation, unemployment), weather observations, social media data from Facebook, Twitter, LinkedIn, …
2.3 Merging data sources
rows: instances/ observations/lines
columns: variables/ fields/ characteristics/ attributes/ indicators /features
keys: identifiers (e.g., a customer ID) shared across sources, used to merge and aggregate the data
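A minimal sketch, assuming pandas and hypothetical tables/column names, of merging an aggregated transactional table with a CRM table on a customer-ID key:

```python
import pandas as pd

# Hypothetical example tables: a transactions table and a CRM table,
# both carrying a shared customer identifier that serves as the key.
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "amount": [120.0, 35.5, 990.0, 15.0],
})
crm = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "age": [34, 51, 28],
    "region": ["North", "South", "East"],
})

# Aggregate the fast-moving transactional data per customer (e.g., RFM-style
# summaries), then join the result to the slow-moving CRM data on the key.
summary = (transactions.groupby("customer_id")["amount"]
           .agg(["count", "sum"])
           .reset_index())
merged = crm.merge(summary, on="customer_id", how="left")
print(merged)
```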
2.4 Sampling
a good sample should be representative of the future entities on which the analytical model will be run —> this drives the choice of the optimal time window
stratified sampling: the sample is drawn so that the classes (e.g., fraud/no fraud) occur in the same proportions as in the original data
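A minimal sketch of stratified sampling, assuming scikit-learn and a hypothetical binary `fraud` label, showing that the class proportions are preserved in the resulting partitions:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data set with a rare binary fraud label.
df = pd.DataFrame({
    "amount": range(1000),
    "fraud": [1 if i % 100 == 0 else 0 for i in range(1000)],
})

# Stratified sampling: the fraud/no-fraud proportions of the original data
# are preserved in both the modelling sample and the hold-out sample.
train, test = train_test_split(
    df, test_size=0.3, stratify=df["fraud"], random_state=42
)
print(train["fraud"].mean(), test["fraud"].mean())  # comparable fraud rates
```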
2.5 types of data elements
- continuous data
- categorical data:
  - nominal
  - ordinal
  - binary
2.6 Visual data exploration and exploratory statistical analysis
2.7 Benford’s Law
the frequency distribution of the first digit in many real-life data sets complies with Benford's law, expressed as follows:
$p(d)=\log_{10}\left(\frac{d+1}{d}\right)=\log_{10}(d+1)-\log_{10}(d)$
a partially negative rule: if the law is not satisfied, it is probable that the involved data were manipulated or tampered with, and further investigation or testing is required. Conversely, a data set that complies with Benford's law can still be fraudulent.
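A minimal sketch, assuming NumPy and a hypothetical array of transaction amounts, that compares the observed first-digit distribution with the probabilities predicted by Benford's law:

```python
import numpy as np

def benford_expected():
    """Expected first-digit probabilities p(d) = log10(1 + 1/d), d = 1..9."""
    digits = np.arange(1, 10)
    return digits, np.log10(1 + 1 / digits)

def first_digit_distribution(values):
    """Observed relative frequency of the leading digit of nonzero values."""
    values = np.asarray(values, dtype=float)
    values = np.abs(values[values != 0])
    first_digits = (values / 10 ** np.floor(np.log10(values))).astype(int)
    return np.array([(first_digits == d).mean() for d in range(1, 10)])

# Hypothetical usage on a column of transaction amounts.
amounts = np.random.lognormal(mean=5, sigma=2, size=10_000)
digits, expected = benford_expected()
observed = first_digit_distribution(amounts)
for d, e, o in zip(digits, expected, observed):
    print(f"digit {d}: expected {e:.3f}, observed {o:.3f}")
```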
2.8 descriptive statistics
- continuous variables:
  - basic descriptive statistics:
    - mean
    - median
    - variation / standard deviation
    - percentile values
  - specific descriptive statistics:
    - skewness: symmetry/asymmetry of a distribution
    - kurtosis: peakedness/flatness of a distribution
- categorical variables:
  - mode: the most frequently occurring value
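A minimal sketch of these descriptive statistics with pandas, on a hypothetical data frame with one continuous and one categorical variable:

```python
import pandas as pd

# Hypothetical data frame: one continuous and one categorical variable.
df = pd.DataFrame({
    "amount": [10.0, 12.5, 11.0, 250.0, 9.5, 13.0],
    "channel": ["web", "web", "pos", "web", "pos", "web"],
})

# Continuous variable: basic and specific descriptive statistics.
print(df["amount"].mean(), df["amount"].median(), df["amount"].std())
print(df["amount"].quantile([0.25, 0.5, 0.75]))   # percentile values
print(df["amount"].skew(), df["amount"].kurt())   # skewness and kurtosis

# Categorical variable: the mode (most frequently occurring value).
print(df["channel"].mode()[0])
```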
2.9 missing values
- missing values originate from: non-applicable information, undisclosed information, or errors during merging
- dealing with missing values (see the sketch below):
  - replace (impute): with the mean/median of the known values, or via regression-based imputation
  - delete: if the information is missing at random and has no meaningful interpretation or relationship to the target
  - keep: if the fact that a value is missing is itself meaningful
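A minimal sketch of the three strategies with pandas, on a hypothetical data frame (column names are made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [2500.0, np.nan, 3200.0, np.nan, 4100.0],
    "claims": [0, 1, np.nan, 2, 0],
})

# Replace (impute): here with the median of the known values per column.
imputed = df.fillna(df.median(numeric_only=True))

# Delete: drop rows with missing values when they carry no information.
deleted = df.dropna()

# Keep: add an explicit indicator so "missing" can act as a category itself.
flagged = df.assign(income_missing=df["income"].isna().astype(int))
```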
2.10 outlier detection and treatment
- valid observations (e.g., a salary of $1,000,000) / invalid observations (e.g., an age of 300)
- univariate / multivariate outliers
- detection:
  - univariate outlier detection:
    - inspect minimum/maximum values, histograms, box plots
    - box-plot rule: values more than 1.5 × IQR away from the edges of the box, with IQR = Q3 − Q1 (the interquartile range)
    - calculate z-scores, $z_i = (x_i - \mu)/\sigma$, and flag observations with $|z_i| > 3$
  - multivariate outlier detection:
    - fitting regression lines and flagging observations with large errors
    - clustering or calculating the Mahalanobis distance
    - in practice, multivariate outliers often have only a marginal impact on model performance
- treatment:
  - invalid observations: treat the outlier as a missing value
  - valid observations: truncation / capping / winsorizing (see the sketch at the end of this section)
    - impose a lower/upper limit on each variable and bring any values below/above these limits back to the limit
    - limit calculation, e.g., based on the median and the IQR: $limit = Median \pm s$, with $s = IQR/(2 \times 0.6745)$; limits can also be derived from z-scores or the IQR directly
    - a sigmoid transformation ranging between 0 and 1 can be used for capping: $f(x) = \frac{1}{1+e^{-x}}$
    - expert-based limits based on business knowledge
- not all invalid values are outlying (e.g., gender = male & pregnant = yes)
  —> explicit precautions are needed
  —> construct a set of rules formulated based on expert knowledge and experience (in fact similar to a fraud detection rule engine)
  —> a network representation of the variables may help to construct the rule set and to reason about the relations between the different variables, with links representing constraints on combinations of variable values, resulting in rules added to the rule set
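A minimal sketch, assuming pandas/NumPy and hypothetical columns, combining univariate outlier detection (z-scores and the IQR rule), capping of valid outliers, and treating an invalid observation as missing:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25.0, 31.0, 42.0, 300.0, 38.0, 29.0],   # 300 is an invalid observation
    "salary": [30_000, 45_000, 1_000_000, 52_000, 61_000, 48_000],
})

# Univariate detection with z-scores: flag |z| > 3.
z = (df["salary"] - df["salary"].mean()) / df["salary"].std()
z_outliers = df[np.abs(z) > 3]

# Univariate detection with the IQR / box-plot rule: flag values further than
# 1.5 * IQR beyond the quartiles.
q1, q3 = df["salary"].quantile([0.25, 0.75])
iqr = q3 - q1
iqr_outliers = df[(df["salary"] < q1 - 1.5 * iqr) | (df["salary"] > q3 + 1.5 * iqr)]

# Treatment of valid outliers: truncation / capping (winsorizing) to
# hypothetical expert-based lower/upper limits.
lower, upper = 20_000, 200_000
df["salary_capped"] = df["salary"].clip(lower=lower, upper=upper)

# Treatment of invalid observations: treat them as missing values.
df.loc[df["age"] > 120, "age"] = np.nan
```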
2.11 red flags
deviations from normality are called red flags of fraud.
the fundamental red flag of fraud is the anomaly, e.g.:
- Tax evasion fraud red flags:
  - an identical financial statement, since fraudulent companies may copy the financial statements of non-fraudulent companies
  - a non-existing accountant, since the name of an accountant is unique
- Credit card fraud red flags:
  - a small payment immediately followed by a large payment (a fraudster might first check whether the card is still active before placing a bet)
  - regular, rather small payments (a technique to avoid getting noticed)
- Telecommunications-related fraud red flags:
…
when handling valid outliers with the treatment techniques discussed above, we may impair the ability of descriptive analytics to find anomalous fraud patterns.
2.12 standardizing data
standardization procedures:
- min/max standardization: $\frac{x - min}{max - min} \cdot (newmax - newmin) + newmin$
- z-score standardization: $\frac{x - \mu}{\sigma}$
- decimal scaling: $\frac{x}{10^n}$
standardization is useful for regression-based techniques; it is not needed for decision trees.
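A minimal sketch of the three standardization procedures with NumPy (the example values are made up):

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 100.0])

# Min/max standardization to an arbitrary new range [new_min, new_max].
new_min, new_max = 0.0, 1.0
x_minmax = (x - x.min()) / (x.max() - x.min()) * (new_max - new_min) + new_min

# Z-score standardization.
x_z = (x - x.mean()) / x.std()

# Decimal scaling: divide by 10**n, with n chosen so that the largest
# absolute value falls below 1.
n = int(np.floor(np.log10(np.abs(x).max()))) + 1
x_decimal = x / 10 ** n
```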
2.13 categorization
categorization/ coarse-classification/ classing/ grouping/ binning
- Reasons:
- categorical variables: to reduce the number of categories.
- continuous variables: to model nonlinear effects into linear models.
- Methods:
  - equal interval binning / equal frequency binning (see the sketch after this list)
  - Chi-squared analysis:
    $\chi^2 = \sum_{k} \frac{(observed_k - expected_k)^2}{expected_k}$ over the k groups;
    compare this value with a chi-squared distribution with k − 1 degrees of freedom
  - use pivot tables:
    categorize values based on similar odds
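A minimal sketch of equal interval and equal frequency binning with pandas, on a hypothetical `age` variable:

```python
import numpy as np
import pandas as pd

age = pd.Series(np.random.randint(18, 80, size=1000), name="age")

# Equal interval binning: bins of equal width.
equal_interval = pd.cut(age, bins=5)

# Equal frequency binning: bins with (roughly) the same number of observations.
equal_frequency = pd.qcut(age, q=5)

print(equal_interval.value_counts().sort_index())
print(equal_frequency.value_counts().sort_index())
```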
2.14 weights of evidence coding
after categorization, each group is recoded so that the transformed variable has a monotonically increasing or decreasing relationship with the target Y:
$WOE = \ln\frac{dist.\ no\ fraud}{dist.\ fraud} = \ln\frac{good\%}{bad\%}$
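A minimal sketch of WOE coding with pandas, on a hypothetical binned variable and binary fraud target (1 = fraud):

```python
import numpy as np
import pandas as pd

# Hypothetical binned variable and binary fraud target.
df = pd.DataFrame({
    "age_bin": ["18-30", "18-30", "31-45", "31-45", "46+", "46+", "46+", "18-30"],
    "fraud":   [1, 0, 0, 0, 1, 0, 0, 1],
})

# Distribution of non-fraud (goods) and fraud (bads) over the bins.
counts = df.groupby("age_bin")["fraud"].agg(bads="sum", total="count")
counts["goods"] = counts["total"] - counts["bads"]
dist_good = counts["goods"] / counts["goods"].sum()
dist_bad = counts["bads"] / counts["bads"].sum()

# WOE = ln(dist. no fraud / dist. fraud); a small constant avoids log(0).
eps = 1e-6
counts["woe"] = np.log((dist_good + eps) / (dist_bad + eps))
print(counts)
```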
2.15 variable selection
by measuring univariate correlations between each variable and the target:
|  | continuous target | categorical target |
|---|---|---|
| continuous variable | Pearson correlation | Fisher score |
| categorical variable | Fisher score / ANOVA | Information value, Cramer's V, Gain/entropy |
- Pearson correlation: ρ (rho)
- Fisher score: $\frac{|\bar{X}_G - \bar{X}_B|}{\sqrt{s_G^2 + s_B^2}}$
- Information value (IV): $IV = \sum (dist\ good - dist\ bad) \times WOE$
| IV | predictive power |
|---|---|
| < 0.02 | unpredictive |
| 0.02 – 0.1 | weak |
| 0.1 – 0.3 | medium |
| > 0.3 | strong |
- Cramer's V: $CV = \sqrt{\frac{\chi^2}{n}}$
drawback: these filter measures work univariately and do not take correlations between variables into account.
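A minimal sketch of the information value filter, assuming pandas/NumPy and a hypothetical binned variable; in practice candidate variables would be ranked by their IV:

```python
import numpy as np
import pandas as pd

def information_value(binned: pd.Series, target: pd.Series, eps: float = 1e-6) -> float:
    """IV = sum over bins of (dist_good - dist_bad) * WOE."""
    counts = pd.crosstab(binned, target)          # columns 0 (good) and 1 (bad)
    dist_good = counts[0] / counts[0].sum()
    dist_bad = counts[1] / counts[1].sum()
    woe = np.log((dist_good + eps) / (dist_bad + eps))
    return float(((dist_good - dist_bad) * woe).sum())

# Hypothetical usage: compute the IV of one binned candidate variable.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age_bin": ["18-30", "31-45", "46+"] * 100,
    "fraud": rng.binomial(1, 0.1, size=300),
})
print(information_value(df["age_bin"], df["fraud"]))
```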
2.16 principal components analysis
an alternative method for input or variable selection.
a technique to reduce the dimensionality of data by forming new variables that are linear composites of the original variables.
multicollinearity: correlation among the explanatory or predictor variables
—> results in unstable models.
the stability or robustness of a model refers to the stability of the exact values of the model parameters, which are estimated from the sample of observations.
principal components are calculated via an eigenvector decomposition (of the correlation or covariance matrix of the predictors).
drawback: the principal components can hardly be interpreted, even though they may yield a better model in terms of stability as well as predictive performance.
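A minimal sketch of principal components via an eigenvector decomposition of the correlation matrix, using NumPy on hypothetical correlated predictors:

```python
import numpy as np

# Hypothetical correlated predictors (multicollinearity between x1 and x2).
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = 0.9 * x1 + rng.normal(scale=0.1, size=500)
x3 = rng.normal(size=500)
X = np.column_stack([x1, x2, x3])

# Standardize, then decompose the correlation matrix into eigenvectors.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
corr = np.corrcoef(Z, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(corr)

# Sort by decreasing eigenvalue and project the data on the components.
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
scores = Z @ eigenvectors
print(eigenvalues / eigenvalues.sum())   # fraction of variance per component
```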
2.17 RIDITs
relative to an identified distribution unit
it reflects the relative abnormality of a particular response.
can be interpreted to be an adjusted or transformed percentile score.
the RIDIT score for a categorical response value i of variable t, with $\hat{p}_{tj}$ the proportion of the population having value j for variable t, is calculated as follows:
$B_{ti} = \sum_{j<i} \hat{p}_{tj} - \sum_{j>i} \hat{p}_{tj}$
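A minimal sketch of the RIDIT score formula above with pandas, on hypothetical population proportions for an ordered response:

```python
import pandas as pd

def ridit_scores(proportions: pd.Series) -> pd.Series:
    """B_ti = sum_{j<i} p_tj - sum_{j>i} p_tj for an ordered categorical variable."""
    p = proportions / proportions.sum()
    below = p.cumsum() - p                  # total proportion of lower categories
    above = p[::-1].cumsum()[::-1] - p      # total proportion of higher categories
    return below - above

# Hypothetical ordered response with observed population proportions.
props = pd.Series({"low": 0.5, "medium": 0.3, "high": 0.2})
print(ridit_scores(props))                  # low: -0.5, medium: 0.3, high: 0.8
```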
2.18 PRIDIT Analytics
2.19 Segmentation
Reasons:
- strategic: focus on specific segments of customers
- motivated from an operational viewpoint: e.g., new customers
- to take significant variable interactions into account
however, segmentation increases the production, monitoring, and maintenance costs.