[Reading notes]_Fraud analysis_Chp2

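This post covers the data collection, sampling, and preprocessing steps in fraud analytics, including the different types of data sources, merging data, sampling methods, types of data elements, data visualization, Benford's Law, descriptive statistics, missing value handling, outlier detection, standardization, categorization, weights of evidence coding, variable selection, principal components analysis, RIDITs, and PRIDIT Analytics.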


Chp2 Data Collection, Sampling, and Preprocessing

2.1 Introduction

GIGO: the garbage in, garbage out principle

2.2 Types of data sources

  • Transactional data:
    stored in an OLTP (online transaction processing) relational database
    RFM (recency, frequency, monetary) variables
  • Contractual, subscription, or account data:
    stored in a CRM(customer relationship management) database
    a source of sociodemographic information: slow-moving data dimensions
    sources to retrieve sociodemographic or factual data: subscription data, data poolers, survey data, publicly available data sources
  • data poolers:
    gather data and sell it to interested customers
    build predictive models and sell the output of these models as risk scores
    e.g.: Experian, Equifax, CIFAS, Dun & Bradstreet, Thomson Reuters
    FICO score: uses data from Experian, Equifax, and TransUnion
  • Surveys:
    online: Facebook, LinkedIn, Twitter
    offline
  • Behavioral information:
    fast-moving data / dynamic characteristics
    include: customer preferences / usage information / frequencies of events / trend variables … turnover / solvency / number of employees
  • unstructured data: text documents
  • unstructured data : contextual or network information
  • qualitative, expert-based data:
    a popular example of applying expert-based validation is checking the univariate signs of a regression model.
  • publicly available data:
    such as macroeconomic data (GDP, inflation, unemployment) / weather observations / social media data from Facebook, Twitter, LinkedIn …

2.3 Merging data sources

rows: instances/ observations/lines
columns: variables/ fields/ characteristics/ attributes/ indicators /features
keys: identifiers (e.g., a customer ID) used to link rows across data sources
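
Merging two sources on a key can be sketched in plain Python as follows. The table contents and the names `customers`, `transactions`, and `left_join` are made up for illustration; in practice this is usually done with SQL joins or a data-frame library.

```python
# Hypothetical data: customer records keyed by customer ID,
# and transactions that refer to customers through "customer_id".
customers = {
    "C1": {"age": 34, "segment": "retail"},
    "C2": {"age": 51, "segment": "business"},
}
transactions = [
    {"tx_id": 1, "customer_id": "C1", "amount": 120.0},
    {"tx_id": 2, "customer_id": "C2", "amount": 990.0},
    {"tx_id": 3, "customer_id": "C9", "amount": 15.0},  # no matching customer
]

def left_join(rows, lookup, key):
    """Attach the matching lookup record to each row.

    Rows without a match keep all their own fields and get None for
    the lookup fields, which is one way missing values arise in merging.
    """
    fields = sorted({f for rec in lookup.values() for f in rec})
    merged = []
    for row in rows:
        match = lookup.get(row[key], {})
        merged.append({**row, **{f: match.get(f) for f in fields}})
    return merged

merged = left_join(transactions, customers, "customer_id")
for row in merged:
    print(row)
```

The unmatched transaction shows why merge errors are one of the origins of missing values discussed in 2.9.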

2.4 Sampling

a good sample should be representative of the future entities on which the analytical model will be run --> choosing the optimal time window is important
stratified sampling
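
A minimal sketch of stratified sampling, assuming the strata are defined by the fraud label so that the sample keeps the same fraud/non-fraud proportions as the full data set. The function name, the `fraud` field, and the 5% fraud rate are illustrative assumptions.

```python
import random
from collections import defaultdict

def stratified_sample(rows, label_key, fraction, seed=0):
    """Draw the same fraction from every stratum (here: each label value)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[row[label_key]].append(row)
    sample = []
    for members in strata.values():
        k = round(len(members) * fraction)
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical data set: 1000 rows, 5% of them fraudulent.
data = [{"id": i, "fraud": i % 20 == 0} for i in range(1000)]
sample = stratified_sample(data, "fraud", fraction=0.1)

fraud_rate = sum(r["fraud"] for r in sample) / len(sample)
print(len(sample), fraud_rate)
```

With a skewed target such as fraud, a simple random sample of this size could easily contain too few fraud cases, which is the reason for sampling per stratum.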

2.5 types of data elements

  1. continuous data
  2. categorical data
    2.1 nominal
    2.2 ordinal
    2.3 binary

2.6 Visual data exploration and exploratory statistical analysis

2.7 Benford’s Law

the frequency distribution of the first digit in many real-life data sets complies with Benford's law, expressed as follows for d = 1, …, 9:
$p(d)=\log_{10}\left(\frac{d+1}{d}\right)=\log_{10}(d+1)-\log_{10}(d)$
a partially negative rule: if the law is not satisfied, it is probable that the data were manipulated or tampered with, and further investigation or testing is required. Conversely, a data set that complies with Benford's law can still be fraudulent.
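
As a rough sketch, the expected first-digit frequencies can be compared with observed ones through a chi-squared statistic. The helper names and both test data sets are illustrative assumptions. Powers of 2 are used because they are known to follow Benford's law closely.

```python
import math
from collections import Counter

def benford_expected():
    """Expected first-digit probabilities p(d) = log10((d + 1) / d)."""
    return {d: math.log10((d + 1) / d) for d in range(1, 10)}

def first_digit(x):
    """Leading nonzero digit of a nonzero number, via scientific notation."""
    return int(f"{abs(x):.15e}"[0])

def benford_chi_squared(values):
    """Chi-squared distance between observed and expected digit counts."""
    digits = [first_digit(v) for v in values if v != 0]
    n = len(digits)
    counts = Counter(digits)
    return sum(
        (counts.get(d, 0) - n * p) ** 2 / (n * p)
        for d, p in benford_expected().items()
    )

benford_like = [2 ** k for k in range(1, 200)]   # close to Benford
uniform_digits = list(range(100, 1000))          # every first digit equally likely

print(benford_chi_squared(benford_like))
print(benford_chi_squared(uniform_digits))
```

The statistic can be compared with a chi-squared distribution with 8 degrees of freedom (9 digits minus 1). A large value only signals that further investigation is needed, in line with the partially negative rule.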

2.8 descriptive statistics

  • continuous variables:
    • basic descriptive statistics:
      • mean
      • median
      • variance / standard deviation
      • percentile values
    • specific descriptive statistics:
      • skewness: symmetry/asymmetry of a distribution
      • kurtosis: peakedness/flatness of a distribution
  • categorical variables:
    • mode: the most frequently occurring value

2.9 missing values

  • origins: nonapplicable information / undisclosed information / an error during merging
  • dealing with missing values:
    • replace (impute): use the average or median of the known values, or regression-based imputation
    • delete: if the information is missing at random and has no meaningful interpretation or relationship to the target
    • keep: if the fact that a value is missing is itself meaningful
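
A small sketch of the replace and keep options combined: impute the median and also keep a flag that records which values were missing, so the information is not lost. Representing missing values as `None` and the function name are assumptions made for this example.

```python
import statistics

def impute_median(values):
    """Replace None by the median of the known values.

    Also returns a parallel list of flags marking the imputed positions,
    which can serve as an extra variable if missingness is meaningful.
    """
    known = [v for v in values if v is not None]
    median = statistics.median(known)
    imputed = [median if v is None else v for v in values]
    was_missing = [v is None for v in values]
    return imputed, was_missing

ages = [25, None, 40, 31, None]  # hypothetical data
imputed, was_missing = impute_median(ages)
print(imputed)
print(was_missing)
```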

2.10 outlier detection and treatment

  • valid observations (salary of $1,000,000) / invalid observations (age of 300)
  • univariate / multivariate

  • detection:
    • univariate outliers detection:
      • inspect the minimum/maximum values, histograms, and box plots
      • box plot: values too far from the edges of the box (more than 1.5 × IQR), where IQR = Q3 − Q1 (interquartile range)
      • calculate z-scores: $z_i=(x_i-\mu)/\sigma$; an observation is flagged as an outlier if $|z_i|>3$
    • multivariate outliers detection:
      • fitting regression lines: observations with large errors.

      • clustering or calculating the Mahalanobis distance.

      • multivariate outliers typically have only a marginal impact on model performance.
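
The two univariate checks (z-scores and the box plot rule) can be sketched as below. The data are made up, and the thresholds 3 and 1.5 are exposed as parameters rather than treated as fixed.

```python
import statistics

def zscore_outliers(xs, threshold=3.0):
    """Values whose absolute z-score exceeds the threshold."""
    mu, sigma = statistics.fmean(xs), statistics.pstdev(xs)
    return [x for x in xs if abs((x - mu) / sigma) > threshold]

def iqr_outliers(xs, k=1.5):
    """Values more than k * IQR below Q1 or above Q3 (the box plot rule)."""
    q1, _, q3 = statistics.quantiles(xs, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x for x in xs if x < lower or x > upper]

ages = list(range(20, 60, 2)) + [300]  # hypothetical ages with one invalid value
print(zscore_outliers(ages))
print(iqr_outliers(ages))
```

Note that in very small samples a single extreme value inflates the standard deviation so much that its own z-score may stay below 3. The quartile-based rule is less sensitive to this.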


  • treatment:
    invalid observations: treat the outlier as a missing value.
    valid observations:
    • truncation/capping/winsorizing: impose a lower/upper limit for each variable and bring any values below/above back to these limits.
      • calculating the limits (from z-scores or, more robustly, from the IQR):
        $limit = Median \pm 3s$
        $s = IQR/(2 \times 0.6745)$
    • a sigmoid transformation ranging between 0 and 1 can be used for capping: $f(x)=\frac{1}{1+e^{-x}}$
    • expert-based limits based on business knowledge
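
A minimal sketch of capping with IQR-based limits. The factor 0.6745 rescales the IQR so that s approximates the standard deviation for normally distributed data. The `multiplier` of 3 is an assumption of this sketch (it mirrors the usual three-sigma rule) and can be changed, and the data are made up.

```python
import statistics

def robust_limits(xs, multiplier=3.0):
    """Lower/upper limits: median -/+ multiplier * s, with s = IQR / (2 * 0.6745)."""
    q1, median, q3 = statistics.quantiles(xs, n=4)
    s = (q3 - q1) / (2 * 0.6745)
    return median - multiplier * s, median + multiplier * s

def winsorize(xs, multiplier=3.0):
    """Bring values outside the limits back to the nearest limit."""
    lower, upper = robust_limits(xs, multiplier)
    return [min(max(x, lower), upper) for x in xs]

amounts = list(range(20, 60, 2)) + [300]  # hypothetical, one extreme value
lower, upper = robust_limits(amounts)
capped = winsorize(amounts)
print(lower, upper)
print(capped[-1])
```

Only the extreme value is moved. All observations inside the limits stay exactly as they were.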

  • not all invalid values are outlying (e.g., gender=male & pregnant=yes)
    --> some explicit precautions are needed.
    --> construct a set of rules formulated from expert knowledge and experience (similar, in fact, to a fraud detection rule engine)
    --> a network representation of the variables may help to construct the rule set and to reason about the relations between variables: links represent constraints on combinations of variable values, and each constraint results in a rule added to the rule set.
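
Such a rule set can be sketched as a collection of named checks applied to each record. The rule names, fields, and the age range of 0 to 120 are hypothetical examples, not rules taken from the book.

```python
# Hypothetical expert rules: each returns True when a record is inconsistent.
RULES = {
    "male_and_pregnant": lambda r: r["gender"] == "male" and r["pregnant"],
    "age_out_of_range": lambda r: not 0 <= r["age"] <= 120,
}

def violated_rules(record):
    """Names of all rules that the record violates (empty list if valid)."""
    return [name for name, rule in RULES.items() if rule(record)]

records = [
    {"gender": "female", "pregnant": True, "age": 29},
    {"gender": "male", "pregnant": True, "age": 30},   # invalid but not outlying
    {"gender": "male", "pregnant": False, "age": 300},
]
for record in records:
    print(record, violated_rules(record))
```

The second record would pass every univariate outlier check, since each value is plausible on its own. Only the combination is invalid.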

2.11 red flags

deviations from normal behavior are called red flags of fraud.
the fundamental red flag of fraud is the anomaly, e.g.:

  • Tax evasion fraud red flags:
    • an identical financial statement, since fraudulent companies copy the financial statements of nonfraudulent companies.
    • a nonexistent accountant, since the name of an accountant is unique
  • Credit card fraud red flags:
    • a small payment followed immediately by a large payment (a fraudster might first check whether the card is still active before placing a bet)
    • regular, rather small payments (a technique to avoid getting noticed)
  • Telecommunications-related fraud red flags:

when valid outliers are handled with the treatment techniques discussed before, we may impair the ability of descriptive analytics to find anomalous fraud patterns.

2.12 standardizing data

standardization procedures:

  1. min/max standardization: $\frac{x-\min}{\max-\min}\times(newmax-newmin)+newmin$
  2. z-score standardization: $\frac{x-\mu}{\sigma}$
  3. decimal scaling: $\frac{x}{10^n}$, with n the number of digits of the maximum absolute value

standardization is useful for regression-based approaches, but is not needed for decision trees.
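
The three procedures can be sketched as follows. The function names and sample values are illustrative. For decimal scaling, n is taken here as the number of digits in the integer part of the largest absolute value, which is an assumption about how n is chosen.

```python
import statistics

def min_max(xs, new_min=0.0, new_max=1.0):
    """Rescale linearly so that min(xs) -> new_min and max(xs) -> new_max."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) * (new_max - new_min) + new_min for x in xs]

def z_score(xs):
    """Center on the mean and scale by the standard deviation."""
    mu, sigma = statistics.fmean(xs), statistics.pstdev(xs)
    return [(x - mu) / sigma for x in xs]

def decimal_scale(xs):
    """Divide by 10**n, n = digits in the integer part of the largest |x|."""
    max_abs = max(abs(x) for x in xs)
    n = len(str(int(max_abs))) if max_abs >= 1 else 0
    return [x / 10 ** n for x in xs]

amounts = [5, 250, 999]  # hypothetical values
print(min_max([10, 20, 30]))
print(z_score([10, 20, 30]))
print(decimal_scale(amounts))
```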

2.13 categorization

categorization/ coarse-classification/ classing/ grouping/ binning

  • Reasons:
    • categorical variables: to reduce the number of categories.
    • continuous variables: to model nonlinear effects into linear models.
  • Methods:
    • equal interval binning/ equal frequency binning
    • Chi-squared analysis
      $\chi^2 = \sum \frac{(observed-expected)^2}{expected}$, summed over the cells of the k groups
      compare the value with a chi-squared distribution with k-1 degrees of freedom
    • use pivot tables:
      categorize values based on similar odds.
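
A sketch of the chi-squared analysis for comparing two candidate groupings: each grouping is a table of (non-fraud, fraud) counts per group, and the expected counts are those under independence between group and target. All counts below are invented for illustration.

```python
def chi_squared(table):
    """Chi-squared statistic of a contingency table (list of rows of counts).

    Expected counts assume independence: row total * column total / n.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical groupings of the same variable into k = 3 groups,
# each row holding (non-fraud count, fraud count).
grouping_a = [[600, 20], [300, 30], [100, 50]]
grouping_b = [[500, 40], [300, 30], [200, 30]]

print(chi_squared(grouping_a))
print(chi_squared(grouping_b))
```

With the same number of groups, the grouping with the larger statistic separates fraud from non-fraud better. For k groups and a binary target the reference distribution has k-1 degrees of freedom.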

2.14 weights of evidence coding

for each group obtained from categorization, we need a transformation that gives the groups a monotonically increasing/decreasing relationship with Y
$WOE=\ln\frac{dist\ nofraud}{dist\ fraud}=\ln\frac{good\%}{bad\%}$
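
A minimal sketch of the WOE calculation, where dist nofraud (good%) is the share of all non-fraud cases that falls in a group and dist fraud (bad%) is the share of all fraud cases in that group. The groups and counts are invented for illustration.

```python
import math

def weights_of_evidence(counts):
    """WOE per group from {group: (n_nofraud, n_fraud)} counts.

    Assumes every group has at least one fraud and one non-fraud case;
    otherwise the ratio or its logarithm is undefined.
    """
    total_good = sum(good for good, _ in counts.values())
    total_bad = sum(bad for _, bad in counts.values())
    woe = {}
    for group, (good, bad) in counts.items():
        dist_good = good / total_good
        dist_bad = bad / total_bad
        woe[group] = math.log(dist_good / dist_bad)
    return woe

# Hypothetical age groups with (non-fraud, fraud) counts.
counts = {"young": (100, 50), "middle": (300, 30), "old": (600, 20)}
woe = weights_of_evidence(counts)
print(woe)
```

A negative WOE means the group holds relatively more fraud than non-fraud cases, a positive WOE the opposite, and a WOE of 0 means both shares are equal.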

2.15 variable selection

by measuring univariate correlations between each variable and the target:

| | continuous target | categorical target |
|---|---|---|
| continuous variables | Pearson correlation | Fisher score |
| categorical variables | Fisher score / ANOVA | Information value (IV), Cramer's V, Gain/entropy |

  • Pearson correlation: $\rho$
  • Fisher score: $\frac{|\bar{X}_G-\bar{X}_B|}{\sqrt{s^2_G+s^2_B}}$
  • IV: $\sum (dist\ good - dist\ bad) \times WOE$

| IV | predictive power |
|---|---|
| < 0.02 | unpredictive |
| 0.02-0.1 | weak |
| 0.1-0.3 | medium |
| > 0.3 | strong |

  • Cramer's V: $CV=\sqrt{\frac{\chi^2}{n}}$ (for a binary target)

drawback: these filters work univariately and do not consider the correlations between the variables.
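
Two of these filter measures can be sketched as below: the Fisher score for a continuous variable with a binary target, and the information value for a categorized variable. The sample values and counts are invented, and "good"/"bad" stand for non-fraud/fraud.

```python
import math
import statistics

def fisher_score(good_values, bad_values):
    """|mean_G - mean_B| / sqrt(var_G + var_B), using sample variances."""
    diff = abs(statistics.fmean(good_values) - statistics.fmean(bad_values))
    spread = statistics.variance(good_values) + statistics.variance(bad_values)
    return diff / math.sqrt(spread)

def information_value(counts):
    """IV = sum over groups of (dist good - dist bad) * WOE.

    counts maps group -> (n_good, n_bad); every count must be nonzero.
    """
    total_good = sum(good for good, _ in counts.values())
    total_bad = sum(bad for _, bad in counts.values())
    iv = 0.0
    for good, bad in counts.values():
        dist_good, dist_bad = good / total_good, bad / total_bad
        iv += (dist_good - dist_bad) * math.log(dist_good / dist_bad)
    return iv

# Hypothetical data.
score = fisher_score([1, 2, 3], [4, 5, 6])
iv = information_value({"young": (100, 50), "middle": (300, 30), "old": (600, 20)})
print(score, iv)
```

Each variable is scored on its own, so two highly correlated variables can both rank high even though the second adds little once the first is in the model.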

2.16 principal components analysis

an alternative method for input or variable selection.
a technique to reduce the dimensionality of data by forming new variables that are linear composites of the original variables.
multicollinearity: correlation among the explanatory or predictor variables.
--> results in unstable models.
the stability or robustness of a model refers to the stability of the exact values of the parameters of the model that are being estimated based on the sample of observations.
calculation of principal components: using the eigenvector decomposition.

drawback: the principal components can hardly be interpreted, although they may yield a better model in terms of stability as well as predictive performance.
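
To illustrate the idea without external libraries, the sketch below finds only the first principal component, using power iteration on the covariance matrix in place of a full eigenvector decomposition. The data are two made-up, strongly correlated variables. In practice a linear algebra library would compute all components at once.

```python
import math

def covariance_matrix(rows):
    """Sample covariance matrix of a list of equally long observations."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [
        [sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
        for a in range(d)
    ]

def first_principal_component(cov, iterations=200):
    """Largest eigenvalue and its unit eigenvector via power iteration."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    cov_v = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
    eigenvalue = sum(v[a] * cov_v[a] for a in range(d))
    return eigenvalue, v

# Hypothetical data: the second variable is roughly twice the first.
rows = [(1, 2.0), (2, 4.1), (3, 5.9), (4, 8.2), (5, 9.8)]
cov = covariance_matrix(rows)
eigenvalue, direction = first_principal_component(cov)
explained = eigenvalue / (cov[0][0] + cov[1][1])
print(direction, explained)
```

Here one linear composite captures almost all of the variance of the two correlated inputs, which is how PCA removes multicollinearity while reducing dimensionality.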

2.17 RIDITs

relative to an identified distribution unit
it reflects the relative abnormality of a particular response.
can be interpreted to be an adjusted or transformed percentile score.
the RIDIT score for categorical response value i of variable t, with $\hat{p}_{tj}$ indicating the proportion of the population having value j for variable t, is calculated as follows:
$B_{ti} = \sum_{j<i} \hat{p}_{tj}-\sum_{j>i} \hat{p}_{tj}$
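
A worked sketch of the formula for a single variable, assuming its response categories are ordered and the proportions are listed in that order. The proportions 0.1, 0.2, and 0.7 are made up.

```python
def ridit_scores(proportions):
    """B_i = (sum of proportions below i) - (sum of proportions above i)."""
    scores = []
    for i in range(len(proportions)):
        below = sum(proportions[:i])
        above = sum(proportions[i + 1:])
        scores.append(below - above)
    return scores

# Hypothetical proportions for three ordered response values.
proportions = [0.1, 0.2, 0.7]
scores = ridit_scores(proportions)
print(scores)
```

For these proportions the scores are approximately -0.9, -0.6, and 0.3: the rare first response receives the most extreme score, reflecting its relative abnormality, and every score lies between -1 and 1.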

2.18 PRIDIT Analytics

2.19 Segmentation

Reasons:

  • strategic: focus on specific segments of customers
  • operational: e.g., new customers
  • take into account significant variable interactions

segmentation will increase the production/monitoring/maintenance costs.


