featuretools

sorroooo

License: CC 4.0 BY-SA

Category: Machine Learning. Tags: feature engineering

Article link: https://blog.youkuaiyun.com/weixin_33948160/article/details/93722840

THE STATE OF FEATURE TOOLS

OVERVIEW
Featuretools can generate features that express a rich feature space. In addition, an end-to-end Data Science Machine was developed to autotune a machine learning pathway, extract the most value from the synthesized features, and produce submissions for online data science competitions.

Algorithm for Deep Feature Synthesis

Input

The input to Deep Feature Synthesis is a set of interconnected entities and the tables associated with them. An instance of the entity has features which fall into one of the following data types: numeric, categorical, timestamps and freetext.

For a given dataset:
$E^{1 \ldots K}$: the given entities, where each entity table has $1 \ldots J$ features.
$x_{i, j}^{k}$: the value of feature $j$ for the $i^{th}$ instance of the $k^{th}$ entity.

Entity Features (efeat)

Entity features are the first-level features, calculated by considering only the features and their values in the table corresponding to entity $E^{k}$. They are derived by computing a value for each entry $x_{i, j}$. This can mean applying a function element-wise to the column array $x_{:, j}$, for example converting a categorical string to a pre-decided unique numeric value, or rounding a numeric value.
An entity feature can also be defined by applying a function to the whole column vector together with a single entry:
$x_{i, j^{\prime}}=\text{efeat}\left(x_{:, j}, i\right)$
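
The two kinds of entity feature described above can be sketched with pandas. This is a minimal illustration, not the paper's implementation: the `orders` table and its column names are hypothetical, and the empirical cumulative distribution is used here only as one plausible example of a function that needs both the column $x_{:, j}$ and the entry $i$.

```python
import pandas as pd

# Hypothetical entity table; the column names are illustrative only.
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "amount": [10.0, 40.0, 20.0, 30.0],
    "placed_at": pd.to_datetime(
        ["2024-01-01", "2024-01-02", "2024-01-06", "2024-01-07"]
    ),
})

# Element-wise efeat: convert each timestamp to its weekday (Monday = 0).
orders["WEEKDAY(placed_at)"] = orders["placed_at"].dt.weekday

# Column-and-entry efeat: the fraction of the column that is <= this entry,
# i.e. an empirical cumulative distribution value efeat(x[:, j], i).
orders["CDF(amount)"] = orders["amount"].rank(method="max") / len(orders)

print(orders)
```

Both new columns live in the same table as their inputs, which is what distinguishes entity features from the relational features that follow.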

Relational level features

The second set of features is derived by jointly analyzing two related entities, $E^{l}$ and $E^{k}$. These two entities relate to each other in one of two ways: forward or backward.

Forward: A forward relationship is between an instance $m$ of entity $E^{l}$ and a single instance $i$ of another entity $E^{k}$, where $m$ has an explicit dependence on $i$ (every $m$ corresponds to exactly one $i$).
Direct features (dfeat): Direct features are applied over forward relationships. Features of the related instance $i \in E^{k}$ are directly transferred as features of $m \in E^{l}$.

Backward: A backward relationship runs from an instance $i$ in $E^{k}$ to all the instances $m=\{1 \ldots M\}$ in $E^{l}$ that have a forward relationship to $i$.
Relational features (rfeat): Relational features are applied over backward relationships:
$x_{i, j^{\prime}}^{k}=\text{rfeat}\left(x_{:, j}^{l} \mid e^{k}=i\right)$
Here $x_{:, j}^{l} \mid e^{k}=i$ is the collection of values of feature $j$ in $E^{l}$, taken over the instances whose identifier of $E^{k}$ equals $i$.
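
As a rough pandas sketch of the two relational feature types, assume a hypothetical parent table `customers` (playing the role of $E^{k}$) and a child table `orders` (playing the role of $E^{l}$). The table names, column names, and feature labels are assumptions made for illustration. A direct feature becomes a left join from child to parent, and a relational feature becomes a group-by aggregation over the child rows of each parent.

```python
import pandas as pd

# Hypothetical parent entity E^k (customers) and child entity E^l (orders).
customers = pd.DataFrame({"customer_id": [1, 2], "region": ["north", "south"]})
orders = pd.DataFrame({
    "order_id": [10, 11, 12],
    "customer_id": [1, 1, 2],  # each order points to exactly one customer
    "amount": [5.0, 15.0, 7.0],
})

# dfeat (forward): copy a parent feature onto every related child row.
orders_with_dfeat = orders.merge(
    customers[["customer_id", "region"]], on="customer_id", how="left"
).rename(columns={"region": "customers.region"})

# rfeat (backward): aggregate the child column for each parent instance.
rfeat = (
    orders.groupby("customer_id")["amount"]
    .agg(["sum", "mean", "count"])
    .rename(columns={
        "sum": "SUM(orders.amount)",
        "mean": "MEAN(orders.amount)",
        "count": "COUNT(orders)",
    })
    .reset_index()
)
customers_with_rfeat = customers.merge(rfeat, on="customer_id", how="left")

print(orders_with_dfeat)
print(customers_with_rfeat)
```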

Some Details
Rfeat functions: AVG(), MAX(), MIN(), SUM(), STD(), and COUNT().
Efeat functions: length() to calculate the number of characters in a text field, and WEEKDAY() and MONTH() to convert dates to the day of the week or month they occurred.
Extracting the Most Valuable Features
To reduce the size of the feature space:
first, apply a truncated SVD transformation and select $\eta_{c}$ components of the SVD;
then, rank each SVD-feature by calculating its f-value with respect to the target value, and select the $\gamma \%$ highest-ranking features.
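
The two reduction steps can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the helper `select_svd_features` and its parameter names are hypothetical, the target is assumed to be numeric, and the f-value is computed as the univariate regression F-statistic derived from the Pearson correlation (a categorical target would call for an ANOVA F-test instead). scikit-learn's `TruncatedSVD`, `SelectPercentile`, and `f_regression` offer library versions of the same steps.

```python
import numpy as np

def select_svd_features(X, y, n_components, gamma_percent):
    """Project X onto its leading SVD components, then keep the top gamma% by F-value."""
    # Truncated SVD: keep the n_components leading right-singular vectors.
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    Z = X @ vt[:n_components].T  # SVD-features, shape (n_samples, n_components)

    # Univariate F-value of each SVD-feature against a numeric target,
    # from the Pearson correlation r: F = r^2 / (1 - r^2) * (n - 2).
    n = len(y)
    zc = Z - Z.mean(axis=0)
    yc = y - y.mean()
    r = (zc.T @ yc) / (np.linalg.norm(zc, axis=0) * np.linalg.norm(yc))
    f_values = r**2 / (1.0 - r**2) * (n - 2)

    # Keep the gamma% highest-ranking SVD-features (at least one).
    n_keep = max(1, int(np.ceil(n_components * gamma_percent / 100.0)))
    keep = np.sort(np.argsort(f_values)[::-1][:n_keep])
    return Z[:, keep], keep, f_values

# Synthetic data, used only to exercise the function.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

Z_sel, keep, f_values = select_svd_features(X, y, n_components=4, gamma_percent=50)
print(Z_sel.shape, keep)
```

Here `n_components` plays the role of $\eta_{c}$ and `gamma_percent` the role of $\gamma$.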

Details About The Code

Entity:
The class Entity represents an entity in an EntitySet and stores the relevant data and metadata; an entity is analogous to a table in a relational database.

Args:

id (str): Id of Entity.
df (pd.DataFrame): Dataframe providing the data for the entity.
entityset (EntitySet): Entityset for this Entity.
variable_types (dict[str -> dict[str -> type]]): An entity’s variable_types dict maps string variable ids to types, or to (type, kwargs) to pass keyword arguments to the Variable.
index (str): Name of id column in the dataframe.
time_index (str): Name of time column in the dataframe.
secondary_time_index (dict[str -> str]): Dictionary mapping columns in the dataframe to the time index column they are associated with.
last_time_index (pd.Series): Time index of the last event for each instance across all child entities.
make_index (bool, optional): If True, assume index does not exist as a column in dataframe, and create a new column of that name using integers in the range (0, len(dataframe)). Otherwise, assume index exists in dataframe.

Relationship:
The class Relationship represents a relationship between entities.

Args:

parent_variable: Instance of variable in parent entity. Must be a Discrete Variable.
child_variable: Instance of variable in child entity. Must be a Discrete Variable.

EntitySet:
The class EntitySet is a collection of entities and the relationships between them.
Args:

id (str) : Unique identifier to associate with this instance.
entities (dict[str -> tuple(pd.DataFrame, str, str)]): Dictionary of entities. Entries take the format {entity id -> (dataframe, id column, (time_column), (variable_types))}. Note that time_column and variable_types are optional.
relationships (list[(str, str, str, str)]): List of relationships between entities. List items are a tuple with the format (parent entity id, parent variable, child entity id, child variable).

Feature Primitives:
Feature Primitives are the building blocks of Featuretools. They define individual computations that can be applied to raw datasets to create new features. Because a primitive only constrains the input and output data types, it can be used to transfer calculations known in one domain to another.
Aggregation Primitives: These primitives take related instances as an input and output a single value. They are applied across a parent-child relationship in an entity set.
Transform Primitives: These primitives take one or more variables from an entity as an input and output a new variable for that entity. They are applied to a single entity.
Custom Primitives: Users can define their own primitive. To define a primitive, a user will:

Specify the type of primitive: Aggregation or Transform
Define the input and output data types
Write a function in Python to do the calculation
Annotate with attributes to constrain how it is applied
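
The four steps can be illustrated without depending on any particular Featuretools version. The `Primitive` dataclass below is a hypothetical stand-in written for this sketch, not the Featuretools class, and the `commutative` flag is just one example of an attribute that constrains how a primitive is applied.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class Primitive:
    """Hypothetical stand-in for a feature primitive; not the Featuretools class."""
    name: str
    kind: str                    # step 1: "aggregation" or "transform"
    input_types: Sequence[str]   # step 2: accepted input data types
    return_type: str             # step 2: output data type
    function: Callable           # step 3: the calculation itself
    commutative: bool = False    # step 4: an attribute constraining application

def value_range(values):
    """Aggregation: spread between the largest and smallest related value."""
    values = list(values)
    return max(values) - min(values)

RANGE = Primitive(
    name="range",
    kind="aggregation",
    input_types=["numeric"],
    return_type="numeric",
    function=value_range,
)

def applies_to(primitive, column_types):
    """A primitive is applicable when the column types match its declared inputs."""
    return list(primitive.input_types) == list(column_types)

print(RANGE.function([3, 9, 4]), applies_to(RANGE, ["numeric"]))
```

Because applicability is decided from data types alone, the same `RANGE` definition could be reused on any numeric column of any child entity.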

DeepFeatureSynthesis:
The class DeepFeatureSynthesis automatically produces features for a target entity in an EntitySet.
Args:

target_entity_id (str): Id of entity for which to build features.
entityset (EntitySet): Entityset for which to build features.
agg_primitives (list[str or :class:.primitives.], optional): list of Aggregation Feature types to apply. Default: [“sum”, “std”, “max”, “skew”, “min”, “mean”, “count”, “percent_true”, “num_unique”, “mode”]
trans_primitives (list[str or :class:.primitives.TransformPrimitive], optional): list of Transform primitives to use. Default: [“day”, “year”, “month”, “weekday”, “haversine”, “num_words”, “num_characters”]
where_primitives (list[str or :class:.primitives.PrimitiveBase], optional): Only add where clauses to these types of Primitives. Default: [“count”]
max_depth (int, optional) : maximum allowed depth of features. Default: 2. If -1, no limit.
max_hlevel (int, optional) : #TODO how to document. Default: 2. If -1, no limit.
max_features (int, optional) : Cap the number of generated features to this number. If -1, no limit.
allowed_paths (list[list[str]], optional): Allowed entity paths to make features for. If None, use all paths.
seed_features (list[:class:.FeatureBase], optional): List of manually defined features to use.
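
To make the role of max_depth concrete, the following pandas sketch builds one depth-2 feature by hand, stacking a MEAN aggregation on top of a SUM aggregation. The three tables and their column names are hypothetical, and the code mimics what such stacking would produce rather than calling Featuretools itself.

```python
import pandas as pd

# Hypothetical entity chain: customers <- sessions <- transactions.
customers = pd.DataFrame({"customer_id": [1, 2]})
sessions = pd.DataFrame({"session_id": [10, 11, 12], "customer_id": [1, 1, 2]})
transactions = pd.DataFrame({
    "transaction_id": [100, 101, 102, 103],
    "session_id": [10, 10, 11, 12],
    "amount": [4.0, 6.0, 20.0, 8.0],
})

# Depth 1: aggregate transactions up to their parent session.
session_sum = (
    transactions.groupby("session_id")["amount"].sum()
    .rename("SUM(transactions.amount)")
    .reset_index()
)
sessions = sessions.merge(session_sum, on="session_id", how="left")

# Depth 2: aggregate the depth-1 feature up to the target entity.
customer_mean = (
    sessions.groupby("customer_id")["SUM(transactions.amount)"].mean()
    .rename("MEAN(sessions.SUM(transactions.amount))")
    .reset_index()
)
customers = customers.merge(customer_mean, on="customer_id", how="left")

print(customers)
```

With a maximum depth of 1, only features like the per-session sum would be generated; allowing depth 2 is what permits the stacked per-customer mean.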

SVD:

The Featuretools paper mentions that SVD can be used to select the most valuable of the synthesized features. The details of this method are as follows.

Singular value decomposition (SVD) can be looked at from three mutually compatible points of view. On the one hand, we can see it as a method for transforming correlated variables into a set of uncorrelated ones that better expose the various relationships among the original data items. At the same time, SVD is a method for identifying and ordering the dimensions along which data points exhibit the most variation. This ties in to the third way of viewing SVD, which is that once we have identified where the most variation is, it is possible to find the best approximation of the original data points using fewer dimensions. Hence, SVD can be seen as a method for data reduction.
A rectangular matrix $A$ can be broken down into the product of three matrices: an orthogonal matrix $U$, a diagonal matrix $S$, and the transpose of an orthogonal matrix $V$:
$A_{m n}=U_{m m} S_{m n} V_{n n}^{T}$
where $U^{T} U=I$ and $V^{T} V=I$;
the columns of $U$ are orthonormal eigenvectors of $A A^{T}$;
the columns of $V$ are orthonormal eigenvectors of $A^{T} A$;
$S$ is a diagonal matrix containing, in descending order, the square roots of the eigenvalues shared by $A A^{T}$ and $A^{T} A$ (the singular values of $A$).
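
These properties can be checked numerically with NumPy's `np.linalg.svd`. The matrix below is random and serves only as an example; the rank-2 truncation at the end illustrates the data-reduction view, where the approximation error equals the dropped singular value.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))  # arbitrary rectangular matrix, m = 5, n = 3

# full_matrices=True returns U (5x5) and V^T (3x3); s holds the singular values.
U, s, Vt = np.linalg.svd(A, full_matrices=True)

# Rebuild the rectangular m x n matrix S with s on its diagonal.
S = np.zeros(A.shape)
np.fill_diagonal(S, s)
A_rebuilt = U @ S @ Vt

# The singular values are the square roots of the eigenvalues of A^T A.
eigvals = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]

# Rank-2 approximation: keep only the two largest singular values.
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]

print(np.allclose(A, A_rebuilt), np.allclose(s**2, eigvals))
print(np.linalg.norm(A - A_k), s[k])
```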

Summary

Featuretools uses entities and entity sets to store data and the relationships between entities. It then uses primitives to construct complex features, either automatically through the many built-in functions defined in Featuretools or through feature functions that users write themselves.
Because of the relationships between entities and the two classes of primitive functions, the features produced by Featuretools are generally reasonable and meaningful.
Further, there are many issues to consider before applying such an automated feature tool to our data platform. For example: how to process large amounts of data, how to select the most valuable features, and so on.