SAS Module 2 Data Exploration and Preparation

License: CC 4.0 BY-SA

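This article covers data exploration and preparation techniques in SAS, including the fundamentals of dimensional modeling, data types such as nominal, ordinal, interval, and ratio measurements, and attribute types such as discrete and continuous attributes. It introduces data preparation with SAS Data Studio, for example creating and managing columns, applying filters, converting data types, appending data, and joining tables. It also discusses challenges such as data redundancy, imprecise match keys, inconsistent attributes, and transposing data.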

Dimensional Modeling Fundamentals:

Fact: A basic measurement, generally “numeric” and “additive”, recorded at the lowest level of detail
Dimension: A way of categorizing and summarizing facts (often signaled by “by”, e.g. profit by region or by month). Dimensional models are generally represented as star schemas in relational databases or as OLAP cubes
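
The fact/dimension split can be sketched in a few lines. This is a minimal illustration in plain Python rather than SAS, and the table `sales_facts`, its columns, and the helper `summarize` are made-up names for the example:

```python
from collections import defaultdict

# Hypothetical fact rows: each row carries dimension values (region, month)
# and one additive numeric fact (profit).
sales_facts = [
    {"region": "East", "month": "Jan", "profit": 120.0},
    {"region": "East", "month": "Feb", "profit": 80.0},
    {"region": "West", "month": "Jan", "profit": 200.0},
    {"region": "West", "month": "Feb", "profit": 50.0},
]


def summarize(facts, dimension, measure):
    """Sum an additive measure for each value of one dimension."""
    totals = defaultdict(float)
    for row in facts:
        totals[row[dimension]] += row[measure]
    return dict(totals)


# "profit by region" and "profit by month" reuse the same facts.
profit_by_region = summarize(sales_facts, "region", "profit")
profit_by_month = summarize(sales_facts, "month", "profit")
print(profit_by_region)
print(profit_by_month)
```

Because profit is additive, the same detailed rows can be rolled up along any dimension without storing separate totals.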

Data Types:
Category:

Nominal: categories, states, “names of things”, or identifiers; there is no order among the values of a nominal attribute (ex: ID number, Zip code, Title)
Binary: Nominal attribute with only 2 states (Male/Female, True/False)
Ordinal: Similar to Nominal, but the values have a meaningful order (ranking); the magnitude between successive values is not known

Measure:

Interval: Measured on a scale of equal-sized units; values have order and differences are meaningful (dates); no true zero-point
Ratio: Inherent zero-point; we can speak of one value as being a multiple of another, e.g. a balance of 200 is twice a balance of 100 (grades, balance)
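
One way to remember the levels is by which operations make sense at each one. The sketch below is only an illustration in plain Python; the size scale `SIZE_RANK` and the sample values are assumptions made up for the example:

```python
from datetime import date

# Ordinal: order is meaningful, distance is not.
# SIZE_RANK is a hypothetical ranking used only for comparisons.
SIZE_RANK = {"small": 0, "medium": 1, "large": 2}


def is_larger(a, b):
    """Compare two ordinal values by rank; the rank gap itself means nothing."""
    return SIZE_RANK[a] > SIZE_RANK[b]


# Interval: differences are meaningful, ratios are not
# (there is no "twice as much" of a calendar date).
days_between = (date(2024, 3, 1) - date(2024, 1, 1)).days

# Ratio: a true zero exists, so division is meaningful.
balance_a, balance_b = 200.0, 50.0
times_larger = balance_a / balance_b

print(is_larger("large", "medium"), days_between, times_larger)
```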

Attribute Types:

Discrete Attribute: Has only a finite or countably infinite set of values (ex: zip code, ISBN); Binary attributes are a special case of discrete attributes
Continuous Attribute: Has real numbers as attribute values (ex: weight, height); in practice, real values can only be measured and represented using a finite number of digits; Typically represented as floating-point variables

Data Preparation in SAS: SAS Data Studio - Prepare Data
Prepare the data first to make visual analytics easier

Create and Manage Columns: Rename, Convert Column, Change case (upper/lower case), Split, Remove, Trim whitespace
Apply Filters
Convert Data Types (Measure to Category)
Append
Join: Combine two tables based on a common identifier to form one integrated table
Join challenges:
- Data Redundancy: Two identical or highly correlated columns can cause confusion in data modeling because they can obscure the actual drivers of an output, so remove redundant data before doing the join
- Imprecise Match Keys: The same entity is represented differently in different tables. Standardizing the data first helps with this issue. In SAS, we can also create “Matchcodes” with different levels of sensitivity (a higher number means higher sensitivity, i.e. stricter, more exact matching) to perform “fuzzy joins”
- Inconsistent Attributes: Need to standardize the data format and data scale
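
The effect of standardizing keys before a join can be sketched in plain Python. This is not the SAS matchcode algorithm, only a simple normalization that stands in for the idea; `standardize_key`, `inner_join`, and the sample tables are hypothetical names for the example:

```python
import re


def standardize_key(raw):
    """Lowercase, drop punctuation, and collapse whitespace in a join key."""
    cleaned = re.sub(r"[^a-z0-9 ]", " ", raw.lower())
    return " ".join(cleaned.split())


# Hypothetical tables that spell the same customers differently.
customers = [
    {"name": "Acme, Inc.", "segment": "Retail"},
    {"name": "Globex  Corp", "segment": "Energy"},
]
orders = [
    {"customer": "ACME INC", "amount": 100},
    {"customer": "globex corp.", "amount": 250},
    {"customer": "Initech", "amount": 75},
]


def inner_join(left, left_key, right, right_key):
    """Inner join two lists of dicts on their standardized keys."""
    index = {}
    for row in right:
        index.setdefault(standardize_key(row[right_key]), []).append(row)
    joined = []
    for left_row in left:
        for right_row in index.get(standardize_key(left_row[left_key]), []):
            joined.append({**left_row, **right_row})
    return joined


joined = inner_join(orders, "customer", customers, "name")
for row in joined:
    print(row["customer"], row["segment"], row["amount"])
```

An exact-match join on the raw strings would pair none of these rows; after standardization the two real customers match and the unmatched "Initech" order drops out of the inner join.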
Transpose Data: turn variables (columns) into observations (rows)
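
As a rough picture of what transposing does, the plain-Python sketch below turns one row per store with a column per quarter into one row per store and quarter. The table and the helper `transpose_to_long` are made up for the example and do not reproduce the Data Studio transform itself:

```python
# Hypothetical wide table: one variable (column) per quarter.
wide_rows = [
    {"store": "A", "q1": 10, "q2": 12},
    {"store": "B", "q1": 7, "q2": 9},
]


def transpose_to_long(rows, id_column, value_columns):
    """Turn each listed column into its own observation (row)."""
    long_rows = []
    for row in rows:
        for column in value_columns:
            long_rows.append(
                {id_column: row[id_column], "variable": column, "value": row[column]}
            )
    return long_rows


long_rows = transpose_to_long(wide_rows, "store", ["q1", "q2"])
for row in long_rows:
    print(row)
```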
Insert Custom Code (Data step, CASL)
Create Data Preparation Plan Files and Jobs
Cleanse Data
Data Enhancement: Variable Transformation
- Variable Transformation: Replacing a variable with the result of a function of that variable, to understand the data better (log, square, normalization)
- Variable/Feature Creation: Generating new variables/features based on existing variables (dummy variables, parsing components, e.g. splitting a date into day, month, year)
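
The transformations named above can be written out as small functions. This is a minimal sketch in plain Python, not SAS code; all function names are hypothetical, and the log transform assumes strictly positive inputs:

```python
import math
from datetime import date


def log_transform(values):
    """Natural log of each value; assumes every value is > 0."""
    return [math.log(v) for v in values]


def min_max_normalize(values):
    """Rescale values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: map everything to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def dummy_variables(values):
    """One 0/1 indicator per distinct category level."""
    levels = sorted(set(values))
    return [{f"is_{level}": int(v == level) for level in levels} for v in values]


def date_parts(d):
    """Parse a date into separate year, month, and day features."""
    return {"year": d.year, "month": d.month, "day": d.day}


incomes = [10.0, 100.0, 1000.0]
print(log_transform(incomes))
print(min_max_normalize(incomes))
print(dummy_variables(["red", "blue", "red"]))
print(date_parts(date(2024, 5, 17)))
```

The log transform compresses the wide range of `incomes` into evenly spaced values, which is the usual reason to apply it to skewed measures.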

Data Exploration and Visualization in SAS:

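A table must be loaded before it can be used. A “lightning bolt” icon indicates that it is loaded; “*” indicates that it still needs to be loaded
If bad data appears, go back to the Prepare Data tab to fix it
Histograms often tell more than box plots, because they show the shape of the distribution (for example, several peaks) that a box plot summarizes away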
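
To see why, compare the numbers a box plot is drawn from with simple bin counts for a sample that has two clusters. This is a small plain-Python sketch; the sample `values` is invented, and `histogram_counts` assumes integer data and an integer bin width:

```python
import statistics

# Hypothetical sample with two separate clusters (bimodal).
values = [1, 2, 2, 3, 3, 3, 17, 17, 17, 18, 18, 19]


def five_number_summary(data):
    """Min, quartiles, and max: the numbers a box plot is drawn from."""
    q1, median, q3 = statistics.quantiles(data, n=4)
    return min(data), q1, median, q3, max(data)


def histogram_counts(data, bin_width):
    """Count integer values per bin, keeping empty bins so gaps stay visible."""
    first = (min(data) // bin_width) * bin_width
    last = (max(data) // bin_width) * bin_width
    counts = {start: 0 for start in range(first, last + bin_width, bin_width)}
    for v in data:
        counts[(v // bin_width) * bin_width] += 1
    return counts


summary = five_number_summary(values)
counts = histogram_counts(values, 5)
print(summary)
for start, count in counts.items():
    print(f"{start:>2}-{start + 4:<2} {'#' * count}")
```

The summary places the median in the middle of the range, where there are no observations at all; the bin counts make the two clusters and the empty gap between them obvious.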