Chapter 2 Data Collection, Sampling, and Preprocessing
2.1 Introduction
GIGO: the garbage in, garbage out principle
2.2 Types of data sources
- Transactional data:
  - stored in OLTP (online transaction processing) relational databases
  - RFM variables (recency, frequency, monetary)
- Contractual, subscription, or account data:
  - stored in a CRM (customer relationship management) database
  - a source of sociodemographic information: slow-moving data dimensions
  - sources to retrieve sociodemographic or factual data: subscription data, data poolers, survey data, publicly available data sources
- Data poolers:
  - gather data and sell it to interested customers
  - build predictive models and sell the output of these models as risk scores
  - e.g., Experian, Equifax, CIFAS, Dun & Bradstreet, Thomson Reuters
  - the FICO score uses data from Experian, Equifax, and TransUnion
- Surveys:
  - online: Facebook, LinkedIn, Twitter
  - offline
- Behavioral information:
  - fast-moving data / dynamic characteristics
  - includes: customer preferences, usage information, frequencies of events, trend variables (e.g., of turnover, solvency, number of employees), …
- Unstructured data: text documents
- Unstructured data: contextual or network information
- Qualitative, expert-based data:
  - a popular example of applying expert-based validation is checking the univariate signs of a regression model
- Publicly available data:
  - such as macroeconomic data (GDP, inflation, unemployment), weather observations, social media data from Facebook, Twitter, LinkedIn, …
2.3 Merging data sources
rows: instances/ observations/lines
columns: variables/ fields/ characteristics/ attributes/ indicators /features
keys: identifiers (e.g., a customer ID) shared across sources, used to merge and aggregate the data
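A minimal sketch, assuming pandas and hypothetical tables/column names, of merging an aggregated transactional table with a CRM table on a customer-ID key:

```python
import pandas as pd

# Hypothetical example tables: a transactions table and a CRM table,
# both carrying a shared customer identifier that serves as the key.
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "amount": [120.0, 35.5, 990.0, 15.0],
})
crm = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "age": [34, 51, 28],
    "region": ["North", "South", "East"],
})

# Aggregate the fast-moving transactional data per customer (e.g., RFM-style
# summaries), then join the result to the slow-moving CRM data on the key.
summary = (transactions.groupby("customer_id")["amount"]
           .agg(["count", "sum"])
           .reset_index())
merged = crm.merge(summary, on="customer_id", how="left")
print(merged)
```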
2.4 Sampling
a good sample should be representative of the future entities on which the analytical model will be run —> this drives the choice of the optimal time window
stratified sampling: the sample is drawn so that the classes (e.g., fraud/no fraud) occur in the same proportions as in the original data
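A minimal sketch of stratified sampling, assuming scikit-learn and a hypothetical binary `fraud` label, showing that the class proportions are preserved in the resulting partitions:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data set with a rare binary fraud label.
df = pd.DataFrame({
    "amount": range(1000),
    "fraud": [1 if i % 100 == 0 else 0 for i in range(1000)],
})

# Stratified sampling: the fraud/no-fraud proportions of the original data
# are preserved in both the modelling sample and the hold-out sample.
train, test = train_test_split(
    df, test_size=0.3, stratify=df["fraud"], random_state=42
)
print(train["fraud"].mean(), test["fraud"].mean())  # comparable fraud rates
```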
2.5 types of data elements
- continuous data
- categorical data:
  - nominal
  - ordinal
  - binary
2.6 Visual data exploration and exploratory statistical analysis
2.7 Benford’s Law
the frequency distribution of the first digit in many real-life data sets complies with Benford's law, expressed as follows:
$p(d)=\log_{10}\left(\frac{d+1}{d}\right)=\log_{10}(d+1)-\log_{10}(d)$
a partially negative rule: if the law is not satisfied, it is probable that the involved data were manipulated or tampered with, and further investigation or testing is required. Conversely, a data set that complies with Benford's law can still be fraudulent.
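A minimal sketch, assuming NumPy and a hypothetical array of transaction amounts, that compares the observed first-digit distribution with the probabilities predicted by Benford's law:

```python
import numpy as np

def benford_expected():
    """Expected first-digit probabilities p(d) = log10(1 + 1/d), d = 1..9."""
    digits = np.arange(1, 10)
    return digits, np.log10(1 + 1 / digits)

def first_digit_distribution(values):
    """Observed relative frequency of the leading digit of nonzero values."""
    values = np.asarray(values, dtype=float)
    values = np.abs(values[values != 0])
    first_digits = (values / 10 ** np.floor(np.log10(values))).astype(int)
    return np.array([(first_digits == d).mean() for d in range(1, 10)])

# Hypothetical usage on a column of transaction amounts.
amounts = np.random.lognormal(mean=5, sigma=2, size=10_000)
digits, expected = benford_expected()
observed = first_digit_distribution(amounts)
for d, e, o in zip(digits, expected, observed):
    print(f"digit {d}: expected {e:.3f}, observed {o:.3f}")
```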
2.8 descriptive statistics
- continuous variables:
  - basic descriptive statistics:
    - mean
    - median
    - variation / standard deviation
    - percentile values
  - specific descriptive statistics:
    - skewness: symmetry/asymmetry of a distribution
    - kurtosis: peakedness/flatness of a distribution
- categorical variables:
  - mode: the most frequently occurring value
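A minimal sketch of these descriptive statistics with pandas, on a hypothetical data frame with one continuous and one categorical variable:

```python
import pandas as pd

# Hypothetical data frame: one continuous and one categorical variable.
df = pd.DataFrame({
    "amount": [10.0, 12.5, 11.0, 250.0, 9.5, 13.0],
    "channel": ["web", "web", "pos", "web", "pos", "web"],
})

# Continuous variable: basic and specific descriptive statistics.
print(df["amount"].mean(), df["amount"].median(), df["amount"].std())
print(df["amount"].quantile([0.25, 0.5, 0.75]))   # percentile values
print(df["amount"].skew(), df["amount"].kurt())   # skewness and kurtosis

# Categorical variable: the mode (most frequently occurring value).
print(df["channel"].mode()[0])
```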
2.9 missing values
- missing values originate from: non-applicable information, undisclosed information, or errors during merging
- dealing with missing values (see the sketch below):
  - replace (impute): with the mean/median of the known values, or via regression-based imputation
  - delete: if the information is missing at random and has no meaningful interpretation or relationship to the target
  - keep: if the fact that a value is missing is itself meaningful
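A minimal sketch of the three strategies with pandas, on a hypothetical data frame (column names are made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [2500.0, np.nan, 3200.0, np.nan, 4100.0],
    "claims": [0, 1, np.nan, 2, 0],
})

# Replace (impute): here with the median of the known values per column.
imputed = df.fillna(df.median(numeric_only=True))

# Delete: drop rows with missing values when they carry no information.
deleted = df.dropna()

# Keep: add an explicit indicator so "missing" can act as a category itself.
flagged = df.assign(income_missing=df["income"].isna().astype(int))
```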
2.10 outlier detection and treatment
- valid observations (e.g., a salary of $1,000,000) / invalid observations (e.g., an age of 300)
- univariate / multivariate outliers
- detection:
  - univariate outlier detection:
    - inspect minimum/maximum values, histograms, box plots
    - box-plot rule: values more than 1.5 × IQR away from the edges of the box, with IQR = Q3 − Q1 (the interquartile range)
    - calculate z-scores, $z_i = (x_i - \mu)/\sigma$, and flag observations with $|z_i| > 3$
  - multivariate outlier detection:
    - fitting regression lines and flagging observations with large errors
    - clustering or calculating the Mahalanobis distance
    - in practice, multivariate outliers often have only a marginal impact on model performance
- treatment:
  - invalid observations: treat the outlier as a missing value
  - valid observations: truncation / capping / winsorizing (see the sketch at the end of this section)
    - impose a lower/upper limit on each variable and bring any values below/above these limits back to the limit
    - limit calculation, e.g., based on the median and the IQR: $limit = Median \pm s$, with $s = IQR/(2 \times 0.6745)$; limits can also be derived from z-scores or the IQR directly
    - a sigmoid transformation ranging between 0 and 1 can be used for capping: $f(x) = \frac{1}{1+e^{-x}}$
    - expert-based limits based on business knowledge
- not all invalid values are outlying (e.g., gender = male & pregnant = yes)
  —> explicit precautions are needed
  —> construct a set of rules formulated based on expert knowledge and experience (in fact similar to a fraud detection rule engine)
  —> a network representation of the variables may help to construct the rule set and to reason about the relations between the different variables, with links representing constraints on combinations of variable values, resulting in rules added to the rule set
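A minimal sketch, assuming pandas/NumPy and hypothetical columns, combining univariate outlier detection (z-scores and the IQR rule), capping of valid outliers, and treating an invalid observation as missing:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25.0, 31.0, 42.0, 300.0, 38.0, 29.0],   # 300 is an invalid observation
    "salary": [30_000, 45_000, 1_000_000, 52_000, 61_000, 48_000],
})

# Univariate detection with z-scores: flag |z| > 3.
z = (df["salary"] - df["salary"].mean()) / df["salary"].std()
z_outliers = df[np.abs(z) > 3]

# Univariate detection with the IQR / box-plot rule: flag values further than
# 1.5 * IQR beyond the quartiles.
q1, q3 = df["salary"].quantile([0.25, 0.75])
iqr = q3 - q1
iqr_outliers = df[(df["salary"] < q1 - 1.5 * iqr) | (df["salary"] > q3 + 1.5 * iqr)]

# Treatment of valid outliers: truncation / capping (winsorizing) to
# hypothetical expert-based lower/upper limits.
lower, upper = 20_000, 200_000
df["salary_capped"] = df["salary"].clip(lower=lower, upper=upper)

# Treatment of invalid observations: treat them as missing values.
df.loc[df["age"] > 120, "age"] = np.nan
```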
2.11 red flags
deviations from normality are called red flags of fraud.
the fundamental red flag of fraud is the anomaly, e.g.:
- Tax evasion fraud red flags:
  - an identical financial statement, since fraudulent companies may copy the financial statements of non-fraudulent companies
  - a non-existing accountant, since the name of an accountant is unique
- Credit card fraud red flags:
  - a small payment immediately followed by a large payment (a fraudster might first check whether the card is still active before placing a bet)
  - regular, rather small payments (a technique to avoid getting noticed)
- Telecommunications-related fraud red flags:
…
when handling valid outliers with the treatment techniques discussed above, we may impair the ability of descriptive analytics to find anomalous fraud patterns.
2.12 standardizing data
standardization procedures:
- min/max standardization: $\frac{x - min}{max - min} \cdot (newmax - newmin) + newmin$
- z-score standardization: $\frac{x - \mu}{\sigma}$
- decimal scaling: $\frac{x}{10^n}$
standardization is useful for regression-based techniques; it is not needed for decision trees.
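A minimal sketch of the three standardization procedures with NumPy (the example values are made up):

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 100.0])

# Min/max standardization to an arbitrary new range [new_min, new_max].
new_min, new_max = 0.0, 1.0
x_minmax = (x - x.min()) / (x.max() - x.min()) * (new_max - new_min) + new_min

# Z-score standardization.
x_z = (x - x.mean()) / x.std()

# Decimal scaling: divide by 10**n, with n chosen so that the largest
# absolute value falls below 1.
n = int(np.floor(np.log10(np.abs(x).max()))) + 1
x_decimal = x / 10 ** n
```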
2.13 categorization
categorization/ coarse-classification/ classing/ grouping/ binning
- Reasons:
- categorical variables: to reduce the number of categories.
- continuous variables: to model nonlinear effects into linear models.
- Methods:
  - equal interval binning / equal frequency binning (see the sketch after this list)
  - Chi-squared analysis:
    $\chi^2 = \sum_{k} \frac{(observed_k - expected_k)^2}{expected_k}$ over the k groups;
    compare this value with a chi-squared distribution with k − 1 degrees of freedom
  - use pivot tables:
    categorize values based on similar odds
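A minimal sketch of equal interval and equal frequency binning with pandas, on a hypothetical `age` variable:

```python
import numpy as np
import pandas as pd

age = pd.Series(np.random.randint(18, 80, size=1000), name="age")

# Equal interval binning: bins of equal width.
equal_interval = pd.cut(age, bins=5)

# Equal frequency binning: bins with (roughly) the same number of observations.
equal_frequency = pd.qcut(age, q=5)

print(equal_interval.value_counts().sort_index())
print(equal_frequency.value_counts().sort_index())
```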
2.14 weights of evidence coding
after categorization, each group is recoded so that the transformed variable has a monotonically increasing or decreasing relationship with the target Y:
$WOE = \ln\frac{dist.\ no\ fraud}{dist.\ fraud} = \ln\frac{good\%}{bad\%}$
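A minimal sketch of WOE coding with pandas, on a hypothetical binned variable and binary fraud target (1 = fraud):

```python
import numpy as np
import pandas as pd

# Hypothetical binned variable and binary fraud target.
df = pd.DataFrame({
    "age_bin": ["18-30", "18-30", "31-45", "31-45", "46+", "46+", "46+", "18-30"],
    "fraud":   [1, 0, 0, 0, 1, 0, 0, 1],
})

# Distribution of non-fraud (goods) and fraud (bads) over the bins.
counts = df.groupby("age_bin")["fraud"].agg(bads="sum", total="count")
counts["goods"] = counts["total"] - counts["bads"]
dist_good = counts["goods"] / counts["goods"].sum()
dist_bad = counts["bads"] / counts["bads"].sum()

# WOE = ln(dist. no fraud / dist. fraud); a small constant avoids log(0).
eps = 1e-6
counts["woe"] = np.log((dist_good + eps) / (dist_bad + eps))
print(counts)
```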
2.15 variable selection
by measuring univariate correlations between each variable and the target:
|  | continuous target | categorical target |
|---|---|---|
| continuous variable | Pearson correlation | Fisher score |
| categorical variable | Fisher score / ANOVA | Information value, Cramer's V, Gain/entropy |
- Pearson correlation: ρ (rho)
- Fisher score: $\frac{|\bar{X}_G - \bar{X}_B|}{\sqrt{s_G^2 + s_B^2}}$
- Information value (IV): $IV = \sum (dist\ good - dist\ bad) \times WOE$
| IV | predictive power |
|---|---|
| < 0.02 | unpredictive |
| 0.02 – 0.1 | weak |
| 0.1 – 0.3 | medium |
| > 0.3 | strong |
- Cramer's V: $CV = \sqrt{\frac{\chi^2}{n}}$
drawback: these filter measures work univariately and do not take correlations between variables into account.
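A minimal sketch of the information value filter, assuming pandas/NumPy and a hypothetical binned variable; in practice candidate variables would be ranked by their IV:

```python
import numpy as np
import pandas as pd

def information_value(binned: pd.Series, target: pd.Series, eps: float = 1e-6) -> float:
    """IV = sum over bins of (dist_good - dist_bad) * WOE."""
    counts = pd.crosstab(binned, target)          # columns 0 (good) and 1 (bad)
    dist_good = counts[0] / counts[0].sum()
    dist_bad = counts[1] / counts[1].sum()
    woe = np.log((dist_good + eps) / (dist_bad + eps))
    return float(((dist_good - dist_bad) * woe).sum())

# Hypothetical usage: compute the IV of one binned candidate variable.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age_bin": ["18-30", "31-45", "46+"] * 100,
    "fraud": rng.binomial(1, 0.1, size=300),
})
print(information_value(df["age_bin"], df["fraud"]))
```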
2.16 principal components analysis
an alternative method for input or variable selection.
a technique to reduce the dimensionality of data by forming new variables that are linear composites of the original variables.
multicollinearity: correlation among the explanatory or predictor variables
—> results in unstable models.
the stability or robustness of a model refers to the stability of the exact values of the model parameters, which are estimated from the sample of observations.
principal components are calculated via an eigenvector decomposition (of the correlation or covariance matrix of the predictors).
drawback: the principal components can hardly be interpreted, even though they may yield a better model in terms of stability as well as predictive performance.
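A minimal sketch of principal components via an eigenvector decomposition of the correlation matrix, using NumPy on hypothetical correlated predictors:

```python
import numpy as np

# Hypothetical correlated predictors (multicollinearity between x1 and x2).
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = 0.9 * x1 + rng.normal(scale=0.1, size=500)
x3 = rng.normal(size=500)
X = np.column_stack([x1, x2, x3])

# Standardize, then decompose the correlation matrix into eigenvectors.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
corr = np.corrcoef(Z, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(corr)

# Sort by decreasing eigenvalue and project the data on the components.
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
scores = Z @ eigenvectors
print(eigenvalues / eigenvalues.sum())   # fraction of variance per component
```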
2.17 RIDITs
relative to an identified distribution unit
it reflects the relative abnormality of a particular response.
can be interpreted to be an adjusted or transformed percentile score.
the RIDIT score for a categorical response value i of variable t, with $\hat{p}_{tj}$ the proportion of the population having value j for variable t, is calculated as follows:
$B_{ti} = \sum_{j<i} \hat{p}_{tj} - \sum_{j>i} \hat{p}_{tj}$
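A minimal sketch of the RIDIT score formula above with pandas, on hypothetical population proportions for an ordered response:

```python
import pandas as pd

def ridit_scores(proportions: pd.Series) -> pd.Series:
    """B_ti = sum_{j<i} p_tj - sum_{j>i} p_tj for an ordered categorical variable."""
    p = proportions / proportions.sum()
    below = p.cumsum() - p                  # total proportion of lower categories
    above = p[::-1].cumsum()[::-1] - p      # total proportion of higher categories
    return below - above

# Hypothetical ordered response with observed population proportions.
props = pd.Series({"low": 0.5, "medium": 0.3, "high": 0.2})
print(ridit_scores(props))                  # low: -0.5, medium: 0.3, high: 0.8
```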
2.18 PRIDIT Analytics
2.19 Segmentation
Reasons:
- strategic: focus on specific segments of customers
- motivated from an operational viewpoint: e.g., new customers
- to take significant variable interactions into account
however, segmentation increases the production, monitoring, and maintenance costs.