featuretools

THE STATE OF FEATURE TOOLS

OVERVIEW
Featuretools can generate features that express a rich feature space. In addition, an end-to-end Data Science Machine was developed to autotune a machine learning pathway, extract the most value from the synthesized features, and produce submissions for online data science competitions.

Algorithm for Deep Feature Synthesis

  • Input

The input to Deep Feature Synthesis is a set of interconnected entities and the tables associated with them. An instance of the entity has features which fall into one of the following data types: numeric, categorical, timestamps and freetext.

For a given dataset:
$E^{1 \ldots K}$: the given entities, where each entity table has $1 \ldots J$ features.
$x_{i,j}^{k}$: the value of feature $j$ for the $i^{th}$ instance of the $k^{th}$ entity.

  • Entity Features(efeat)

Entity features are first-level features, calculated from the features and values in the table of a single entity $E^{k}$ alone. They are derived either by computing a value for each entry $x_{i,j}$ on its own, for example converting a categorical string to a pre-decided unique numeric value or rounding a numerical value, or by applying a function to the whole column array $x_{:,j}$.
In the general form, the function takes the column vector and the index of a single entry:
$$x_{i,j'} = efeat(x_{:,j}, i)$$
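
As a rough illustration of entity features, the sketch below computes three efeat-style columns on a toy table in plain Python. The table, its column names and the helper `cumulative_fraction` are made up for this example and are not part of Featuretools.

```python
from datetime import date

# Toy entity table stored column-wise (hypothetical data).
orders = {
    "order_date": [date(2024, 1, 1), date(2024, 1, 6), date(2024, 2, 14)],
    "comment": ["fast delivery", "ok", "arrived late"],
    "amount": [10.0, 30.0, 60.0],
}

# Entry-wise efeat: each new value depends on a single entry x[i][j].
weekday = [d.weekday() for d in orders["order_date"]]  # Monday == 0
num_chars = [len(s) for s in orders["comment"]]


def cumulative_fraction(column, i):
    """efeat(x[:, j], i): uses the whole column and the row index i.

    Returns the share of the column total reached at row i.
    """
    return sum(column[: i + 1]) / sum(column)


cum_frac = [cumulative_fraction(orders["amount"], i)
            for i in range(len(orders["amount"]))]
```

The first two features only look at one entry at a time, while `cumulative_fraction` needs the full column, which is why the general form passes both $x_{:,j}$ and $i$.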

  • Relational level features

The second set of features is derived by jointly analyzing two related entities, $E^{l}$ and $E^{k}$. The two entities relate to each other in one of two ways: forward or backward.

Forward: A forward relationship is between an instance $m$ of entity $E^{l}$ and a single instance $i$ of another entity $E^{k}$, where $m$ has an explicit dependence on $i$ (each $m$ corresponds to exactly one $i$).
Direct features (dfeat): applied over forward relationships. Features of the related instance $i \in E^{k}$ are directly transferred as features of $m \in E^{l}$.
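
A direct feature can be pictured as a lookup along the forward relationship. In this sketch the `customers` and `orders` tables, their columns and the `dfeat` helper are all hypothetical: each order points to exactly one customer, and a customer attribute is copied onto the order.

```python
# Parent entity E^k: one row per customer, keyed by customer_id.
customers = {
    1: {"country": "DE", "age": 34},
    2: {"country": "FR", "age": 51},
}

# Child entity E^l: each order references exactly one customer.
orders = [
    {"order_id": 10, "customer_id": 1, "amount": 20.0},
    {"order_id": 11, "customer_id": 2, "amount": 35.0},
    {"order_id": 12, "customer_id": 1, "amount": 5.0},
]


def dfeat(child_rows, parent_table, key, parent_feature):
    """Copy one parent feature onto every child row via the forward link."""
    return [parent_table[row[key]][parent_feature] for row in child_rows]


order_customer_age = dfeat(orders, customers, "customer_id", "age")
```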

Backward: A backward relationship runs from an instance $i$ in $E^{k}$ to all the instances $m = \{1 \ldots M\}$ in $E^{l}$ that have a forward relationship to $i$.
Relational features (rfeat): applied over backward relationships:
$$x_{i,j'}^{k} = rfeat(x_{:,j}^{l} \mid e^{k} = i)$$
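
The conditioning on $e^{k} = i$ amounts to a group-by on the child table. The following is a minimal sketch in plain Python, with a hypothetical `orders` table and a hypothetical `rfeat` helper; it is not how Featuretools implements aggregation.

```python
from collections import defaultdict
from statistics import mean

# Child entity E^l: several orders may point to the same customer.
orders = [
    {"order_id": 10, "customer_id": 1, "amount": 20.0},
    {"order_id": 11, "customer_id": 2, "amount": 35.0},
    {"order_id": 12, "customer_id": 1, "amount": 5.0},
]


def rfeat(child_rows, key, feature, agg):
    """Apply agg to the child column x[:, j], restricted to rows with e^k = i."""
    groups = defaultdict(list)
    for row in child_rows:
        groups[row[key]].append(row[feature])
    return {parent_id: agg(values) for parent_id, values in groups.items()}


sum_amount = rfeat(orders, "customer_id", "amount", sum)
mean_amount = rfeat(orders, "customer_id", "amount", mean)
count_orders = rfeat(orders, "customer_id", "amount", len)
```

Note that in this sketch a parent instance with no child rows simply does not appear in the result.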

  • Some Details
    Rfeat functions: AVG(), MAX(), MIN(), SUM(), STD(), and COUNT().
    Efeat functions: length() to calculate the number of characters in a text field, and WEEKDAY() and MONTH() to convert dates to the day of the week or month they occurred.

  • Extracting the most valuable features
    To reduce the size of the feature space:
    first, apply a truncated SVD transformation and select $\eta_{c}$ components of the SVD;
    then, rank each SVD feature by its f-value w.r.t. the target value, and select the $\gamma\%$ highest-ranking features.
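
The ranking step can be sketched as follows. This example assumes a numeric target and uses the univariate regression F-statistic $F = \frac{r^2}{1 - r^2}(n - 2)$, which is one common definition of an f-value and may differ from the one used in the paper. The SVD step is omitted: the columns `c1` to `c4` stand in for SVD components, and all names and data are hypothetical.

```python
from math import ceil, sqrt


def pearson_r(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def f_value(x, y):
    """Univariate regression F-statistic with (1, n - 2) degrees of freedom."""
    r2 = pearson_r(x, y) ** 2
    return float("inf") if r2 >= 1.0 else r2 / (1.0 - r2) * (len(x) - 2)


def select_top(features, target, gamma):
    """Keep the gamma percent of features with the highest F-value."""
    ranked = sorted(features, key=lambda name: f_value(features[name], target),
                    reverse=True)
    keep = max(1, ceil(len(ranked) * gamma / 100))
    return ranked[:keep]


target = [1.0, 2.0, 3.0, 4.0, 5.0]
features = {
    "c1": [1.1, 1.9, 3.2, 3.9, 5.1],  # tracks the target closely
    "c2": [2.0, 1.0, 2.0, 1.0, 2.0],  # uncorrelated with the target
    "c3": [5.0, 3.0, 4.0, 1.0, 2.0],  # negatively related
    "c4": [1.0, 1.0, 2.0, 2.0, 4.0],  # moderately related
}
selected = select_top(features, target, gamma=50)
```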

Details About The Code

Entity:
Class Entity represents an entity in an EntitySet and stores the relevant data and metadata; an entity is analogous to a table in a relational database.

Args:

  • id (str): Id of Entity.
  • df (pd.DataFrame): Dataframe providing the data for the entity.
  • entityset (EntitySet): Entityset for this Entity.
  • variable_types (dict[str -> dict[str -> type]]): An entity’s variable_types dict maps string variable ids to types, or to (type, kwargs) tuples to pass keyword arguments to the Variable.
  • index (str): Name of id column in the dataframe.
  • time_index (str): Name of time column in the dataframe.
  • secondary_time_index (dict[str -> str]): Dictionary mapping columns in the dataframe to the time index column they are associated with.
  • last_time_index (pd.Series): Time index of the last event for each instance across all child entities.
  • make_index (bool, optional): If True, assume index does not exist as a column in the dataframe, and create a new column of that name using the integers in [0, len(dataframe)). Otherwise, assume index exists in the dataframe.

Relationship:
The class Relationship represents a relationship between entities.

Args:

  • parent_variable: Instance of variable in parent entity. Must be a Discrete Variable.
  • child_variable: Instance of variable in child entity. Must be a Discrete Variable.

EntitySet:
The class EntitySet is a collection of entities and the relationships between them.
Args:

  • id (str) : Unique identifier to associate with this instance.
  • entities (dict[str -> tuple(pd.DataFrame, str, str)]): Dictionary of entities. Entries take the format {entity id -> (dataframe, id column, (time_column), (variable_types))}. Note that time_column and variable_types are optional.
  • relationships (list[(str, str, str, str)]): List of relationships between entities. List items are a tuple with the format (parent entity id, parent variable, child entity id, child variable).
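
To make the two argument formats concrete, the sketch below builds an `entities` dictionary and a `relationships` list in the shapes described above and checks that they are consistent. It deliberately avoids calling Featuretools: plain dicts of lists stand in for pd.DataFrame, and `check_relationships` is a hypothetical helper, not a library function.

```python
# Stand-in for pd.DataFrame: a dict mapping column name -> list of values.
customers_df = {"customer_id": [1, 2], "country": ["DE", "FR"]}
orders_df = {
    "order_id": [10, 11, 12],
    "customer_id": [1, 2, 1],
    "order_time": ["2024-01-01", "2024-01-06", "2024-02-14"],
}

# {entity id -> (dataframe, id column, (time_column))}
entities = {
    "customers": (customers_df, "customer_id"),
    "orders": (orders_df, "order_id", "order_time"),
}

# (parent entity id, parent variable, child entity id, child variable)
relationships = [("customers", "customer_id", "orders", "customer_id")]


def check_relationships(entities, relationships):
    """Return a list of problems; an empty list means the spec is consistent."""
    problems = []
    for parent, parent_var, child, child_var in relationships:
        found = len(problems)
        for entity_id, var in ((parent, parent_var), (child, child_var)):
            if entity_id not in entities:
                problems.append(f"unknown entity {entity_id!r}")
            elif var not in entities[entity_id][0]:
                problems.append(f"{entity_id!r} has no column {var!r}")
        if len(problems) > found:
            continue  # skip the value check when an entity or column is missing
        parent_ids = set(entities[parent][0][parent_var])
        orphans = set(entities[child][0][child_var]) - parent_ids
        if orphans:
            problems.append(f"{child!r} rows without a parent: {sorted(orphans)}")
    return problems


problems = check_relationships(entities, relationships)
```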

Feature Primitives:
Feature Primitives are the building blocks of Featuretools. They define individual computations that can be applied to raw datasets to create new features. Because a primitive only constrains its input and output data types, it can be used to transfer calculations known in one domain to another.
Aggregation Primitives: These primitives take related instances as an input and output a single value. They are applied across a parent-child relationship in an entity set.
Transform Primitives: These primitives take one or more variables from an entity as an input and output a new variable for that entity. They are applied to a single entity.
Custom Primitives: Users can define their own primitive. To define a primitive, a user will:

  • Specify the type of primitive: Aggregation or Transform
  • Define the input and output data types
  • Write a function in python to do the calculation
  • Annotate with attributes to constrain how it is applied
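
The four steps can be mirrored in a small stand-alone sketch. `PrimitiveSpec`, `applicable` and the two example primitives are hypothetical names invented here to show the idea; they are not the Featuretools primitive classes, whose actual interface should be taken from the library documentation.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class PrimitiveSpec:
    """Hypothetical, minimal description of a primitive (not the library class)."""
    name: str
    kind: str                     # step 1: "aggregation" or "transform"
    input_types: Tuple[str, ...]  # step 2: e.g. ("numeric",)
    return_type: str              # step 2: output data type
    function: Callable            # step 3: the Python calculation


def value_range(values):
    """Aggregation: many related values in, one value out."""
    return max(values) - min(values)


def absolute(values):
    """Transform: one output value per input value."""
    return [abs(v) for v in values]


RANGE = PrimitiveSpec("range", "aggregation", ("numeric",), "numeric", value_range)
ABSOLUTE = PrimitiveSpec("absolute", "transform", ("numeric",), "numeric", absolute)


def applicable(primitive, column_type):
    """Step 4 (simplified): a primitive applies where the column type matches."""
    return primitive.input_types == (column_type,)


column = [3.0, -7.5, 4.0]
range_value = RANGE.function(column)
abs_values = ABSOLUTE.function(column)
```

Because the specification only mentions data types, the same `RANGE` primitive could be applied to any numeric column, whatever domain the data comes from.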

DeepFeatureSynthesis:
The class DeepFeatureSynthesis automatically produces features for a target entity in an EntitySet.
Args:

  • target_entity_id (str): Id of entity for which to build features.
  • entityset (EntitySet): Entityset for which to build features.
  • agg_primitives (list[str or :class:.primitives.], optional): list of Aggregation Feature types to apply. Default: [“sum”, “std”, “max”, “skew”, “min”, “mean”, “count”, “percent_true”, “num_unique”, “mode”]
  • trans_primitives (list[str or :class:.primitives.TransformPrimitive], optional): list of Transform primitives to use. Default: [“day”, “year”, “month”, “weekday”, “haversine”, “num_words”, “num_characters”]
  • where_primitives (list[str or :class:.primitives.PrimitiveBase], optional): only add where clauses to these types of Primitives. Default: [“count”]
  • max_depth (int, optional) : maximum allowed depth of features. Default: 2. If -1, no limit.
  • max_hlevel (int, optional) : #TODO how to document. Default: 2. If -1, no limit.
  • max_features (int, optional) : Cap the number of generated features to this number. If -1, no limit.
  • allowed_paths (list[list[str]], optional): Allowed entity paths to make features for. If None, use all paths.
  • seed_features (list[:class:.FeatureBase], optional): List of manually defined features to use.
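
The role of max_depth is easiest to see with stacked aggregations. The sketch below is plain Python over hypothetical `orders` and `items` tables, with a hypothetical `aggregate` helper; it builds a depth-1 feature per order and then a depth-2 feature per customer on top of it.

```python
from collections import defaultdict
from statistics import mean

# customers <- orders <- items (each arrow is a parent-child relationship)
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 1},
    {"order_id": 12, "customer_id": 2},
]
items = [
    {"order_id": 10, "price": 4.0},
    {"order_id": 10, "price": 6.0},
    {"order_id": 11, "price": 20.0},
    {"order_id": 12, "price": 7.0},
]


def aggregate(rows, key, feature, agg):
    """Group rows by key and aggregate one feature per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[feature])
    return {k: agg(v) for k, v in groups.items()}


# Depth 1: SUM(items.price) is a feature of each order.
order_total = aggregate(items, "order_id", "price", sum)
orders_with_total = [dict(o, total=order_total.get(o["order_id"], 0.0))
                     for o in orders]

# Depth 2: MEAN(orders.SUM(items.price)) is a feature of each customer.
mean_order_total = aggregate(orders_with_total, "customer_id", "total", mean)
```

With a maximum depth of 1, only features like `order_total` would be produced; allowing depth 2 admits features that stack one primitive on the output of another.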

SVD:

The Featuretools paper mentions that SVD can be used to select the most valuable of the synthesized features. The details of this method follow.

Singular value decomposition (SVD) can be looked at from three mutually compatible points of view. On the one hand, we can see it as a method for transforming correlated variables into a set of uncorrelated ones that better expose the various relationships among the original data items. At the same time, SVD is a method for identifying and ordering the dimensions along which data points exhibit the most variation. This ties in to the third way of viewing SVD, which is that once we have identified where the most variation is, it’s possible to find the best approximation of the original data points using fewer dimensions. Hence, SVD can be seen as a method for data reduction.
A rectangular matrix $A$ can be broken down into the product of three matrices: an orthogonal matrix $U$, a diagonal matrix $S$, and the transpose of an orthogonal matrix $V$:
$$A_{mn} = U_{mm} S_{mn} V_{nn}^{T}$$
where $U^{T}U = I$ and $V^{T}V = I$;
the columns of $U$ are orthonormal eigenvectors of $AA^{T}$;
the columns of $V$ are orthonormal eigenvectors of $A^{T}A$;
$S$ is a diagonal matrix containing the square roots of the eigenvalues shared by $AA^{T}$ and $A^{T}A$, in descending order.
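
The link between singular values and eigenvalues can be checked by hand on a small matrix. The sketch below uses a 2 x 2 example chosen for this illustration and only the standard library; the helper names are hypothetical, and a real pipeline would use a numerical library rather than the closed-form 2 x 2 formula.

```python
from math import sqrt

A = [[3.0, 0.0],
     [4.0, 5.0]]


def gram(a):
    """Return A^T A for a 2 x 2 matrix given as nested lists."""
    return [[sum(a[k][i] * a[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def symmetric_eigenvalues_2x2(m):
    """Eigenvalues of a symmetric 2 x 2 matrix, largest first."""
    trace = m[0][0] + m[1][1]
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    disc = sqrt(trace * trace - 4.0 * det)
    return [(trace + disc) / 2.0, (trace - disc) / 2.0]


ata = gram(A)                                 # [[25, 20], [20, 25]]
eigenvalues = symmetric_eigenvalues_2x2(ata)  # 45 and 5
singular_values = [sqrt(v) for v in eigenvalues]
```

Here the singular values are $\sqrt{45}$ and $\sqrt{5}$, and their product equals $|\det A| = 15$, which is a quick sanity check on the decomposition.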

Summary

Featuretools uses entities and entity sets to store data and the relationships between entities. Primitives are then used to construct complex features, either automatically through the built-in functions of Featuretools or through feature functions that users define themselves.
Because features are built along the relationships between entities using the two classes of primitive functions, each generated feature corresponds to an interpretable computation over related data.
Further, several issues need to be considered before applying such an automated feature tool to our data platform, for example how to process large amounts of data and how to select the most valuable features.
