The new Collect Summary Statistics feature starting with Teradata Release 14

This article explains how to optimize table-level statistics collection in Teradata 14. It recommends collecting all column statistics in one batch rather than column by column, especially for large tables, to improve performance and help the parser make more accurate estimates.



DIAGNOSTIC HELPSTATS ON FOR SESSION; (run this statement first; every subsequent EXPLAIN will then list the COLLECT STATISTICS statements the Optimizer recommends you add)

EXPLAIN
SELECT * FROM <TABLE>;

Summary statistics are a very useful feature introduced in Teradata Release 14. In contrast to traditional statistics, which are collected on columns or indexes, summary statistics are table-level statistics:

COLLECT SUMMARY STATISTICS ON <TABLE>;

Table-level statistics comprise the number of rows per table, the average data block size, and the average row size.
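The captured table-level values can be inspected directly. A minimal sketch, assuming the Teradata 14 SHOW syntax for statistics (with <TABLE> as a placeholder):

SHOW SUMMARY STATISTICS VALUES ON <TABLE>; -- should print the stored table-level values, including the row count, as DDL-style text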

The number of rows per table is a very important measure for the Teradata Optimizer to estimate the cost of a full table scan. So, how was this information made available in previous releases of Teradata?

In previous releases of Teradata, this was achieved by collecting column statistics on the dummy “PARTITION” column and on the Primary Index columns.
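As SQL, that pre-14 workaround looked like this (a sketch using the article's <TABLE> placeholder):

COLLECT STATISTICS ON <TABLE> COLUMN PARTITION; -- statistics on the dummy PARTITION column give the Optimizer the table cardinality cheaply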

Summary statistics defined on a table are refreshed each time the statistics on any column of the table are refreshed. To avoid wasting resources, the best practice is therefore to refresh all column statistics at once instead of doing it column by column:

The preferable solution is this one:

COLLECT STATISTICS ON <TABLE>; — all column statistics are refreshed, summary statistics are only refreshed once.

Although less resource-intensive than collecting traditional column/index statistics, the following pattern should still be avoided:

COLLECT STATISTICS ON <TABLE> <COLUMN_1>; — first collection of summary statistics

COLLECT STATISTICS ON <TABLE> <COLUMN_2>; — needless 2nd collection of summary statistics

Conclusion: Summary statistics collection is a very useful feature for quickly refreshing the row count information of a table.

  1. Dimitrios says

    May 7, 2014 at 11:07 pm

    Hello,

    This new feature has only advantages, except for one thing: the way we have to change the procedure that collects the stats in order to use this feature (plus the max value length / max intervals and sample options).

    It is also recommended for big tables, where collecting stats consumes a lot of IO/CPU: just run
    Collect Summary Stats on DB.TB ; after every DML transaction on the table, in order to keep its summary stats up to date. In this way we help the parser to make better extrapolations.

  2. Johne490 says

    August 20, 2014 at 6:23 am

    Very informative post. Really thank you! Awesome.

  3. Sreeraj says

    March 30, 2015 at 7:23 pm

    Thank you, very informative.
    Question: from TD 14 onwards, the best practice is to collect all column statistics at once instead of column by column. Is this true for big tables as well? I was under the impression that we need to collect stats only for certain columns, based on the PI, joins, and WHERE conditions. Could you please clarify?

  4. Roland Wenzlofsky says

    March 30, 2015 at 7:31 pm

    Hi Sreeraj. I think this is a misunderstanding.

    “All columns at once” means collecting statistics with the new syntax, which allows collecting the stats on different columns of the same table in a single statement (see the example after this thread).

    Before Teradata 14, we had to collect stats column by column, i.e. issue one statement per column. The only performance advantage previously possible was synchronized scanning on the spool (achieved by starting the collect statistics statements for one table at the same time).

    “All columns at once” does not mean to collect statistics on each and every column of a table…

    Roland
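To illustrate Roland's point, the Teradata 14 syntax bundles several columns into one statement; a sketch with placeholder names:

COLLECT STATISTICS COLUMN (<COLUMN_1>), COLUMN (<COLUMN_2>) ON <TABLE>; -- one statement collects both columns' statistics and triggers only one refresh of the summary statistics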


Running a regression with daily data for each stock per month involves several steps. Here's a detailed guide:

### Step-by-Step Guide

#### 1. Data Collection

- **Daily Stock Returns**: Collect daily closing prices for each stock and calculate the daily returns.
- **Market Return**: Collect the daily returns of the market index (e.g., the CRSP Value-Weighted Market Index).
- **VIX Data**: Collect daily VIX data from the Chicago Board Options Exchange (CBOE).

#### 2. Data Preparation

- **Stock Returns**: Calculate the excess returns for each stock by subtracting the risk-free rate from the daily returns (a sketch follows this list).
- **Market Excess Returns**: Calculate the excess returns for the market index similarly.
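A minimal pandas sketch of this preparation step, assuming a `prices` DataFrame of daily closing prices (one column per stock, indexed by date) and an `rf` Series holding the daily risk-free rate; both names are placeholders:

```python
import pandas as pd

# Simple daily returns from closing prices
returns = prices.pct_change()

# Excess returns: subtract the daily risk-free rate row-wise from every stock
excess_returns = returns.sub(rf, axis=0)
```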
#### 3. Define the Regression Model

The regression model you want to run is:

\[ r_{i,t} = \alpha_i + \beta_{M,i}\,\text{MKT}_t + \beta_{V,i}\,\Delta\text{VIX}_t + \epsilon_{i,t} \]

where:

- \( r_{i,t} \) is the excess return of stock \( i \) on day \( t \).
- \( \text{MKT}_t \) is the excess return of the market on day \( t \).
- \( \Delta\text{VIX}_t \) is the change in VIX on day \( t \).
- \( \alpha_i \) is the intercept.
- \( \beta_{M,i} \) is the market beta.
- \( \beta_{V,i} \) is the VIX beta.
- \( \epsilon_{i,t} \) is the error term.

#### 4. Run the Regression for Each Stock

For each stock \( i \) and each month:

1. **Extract Daily Data**: Extract the daily excess returns of the stock, the market, and VIX for the month.
2. **Calculate VIX Changes**: Compute the daily changes in VIX (\( \Delta\text{VIX}_t \)).
3. **Run the Regression**: Use a statistical software package (e.g., Python with `statsmodels`, R, or MATLAB) to run the regression.

### Example in Python

Here is an example using Python and the `statsmodels` library:

```python
import pandas as pd
import statsmodels.api as sm

# Assume you have a DataFrame `data` with columns: 'stock_returns', 'market_returns', 'vix'
# and a 'date' column with daily dates.

# Convert date to datetime format and set it as the index
data['date'] = pd.to_datetime(data['date'])
data.set_index('date', inplace=True)

# Define the start and end dates for the month
start_date = '2023-01-01'
end_date = '2023-01-31'

# Filter the data for the specified month (copy to avoid modifying a view of `data`)
monthly_data = data.loc[start_date:end_date].copy()

# Calculate daily changes in VIX
monthly_data['delta_vix'] = monthly_data['vix'].diff().fillna(0)

# Prepare the independent variables and add a constant term for the intercept
X = monthly_data[['market_returns', 'delta_vix']]
X = sm.add_constant(X)

# Prepare the dependent variable
y = monthly_data['stock_returns']

# Run the regression and print the results
model = sm.OLS(y, X)
results = model.fit()
print(results.summary())
```

### Example in R

Here is an example using R:

```r
# Assume you have a data frame `data` with columns: 'stock_returns', 'market_returns', 'vix'
# and a 'date' column with daily dates.

# Convert date to Date format
data$date <- as.Date(data$date)

# Define the start and end dates for the month
start_date <- as.Date("2023-01-01")
end_date <- as.Date("2023-01-31")

# Filter the data for the specified month
monthly_data <- subset(data, date >= start_date & date <= end_date)

# Calculate daily changes in VIX
monthly_data$delta_vix <- c(0, diff(monthly_data$vix))

# Run the regression (lm() adds the intercept automatically)
model <- lm(stock_returns ~ market_returns + delta_vix, data = monthly_data)

# Print the regression results
summary(model)
```

### Automation for Multiple Stocks and Months

To automate the process for multiple stocks and months, you can loop through the stocks and months. Here is an example in Python:

```python
# Assume `all_data` is a DataFrame with columns:
# 'stock_id', 'date', 'stock_returns', 'market_returns', 'vix',
# where 'date' has already been converted to datetime.

# Function to run the regression for a single stock and month
def run_regression(stock_id, start_date, end_date, all_data):
    monthly_data = all_data[(all_data['stock_id'] == stock_id) &
                            (all_data['date'] >= start_date) &
                            (all_data['date'] <= end_date)].copy()
    monthly_data['delta_vix'] = monthly_data['vix'].diff().fillna(0)
    X = sm.add_constant(monthly_data[['market_returns', 'delta_vix']])
    y = monthly_data['stock_returns']
    results = sm.OLS(y, X).fit()
    return results.params

# List of unique stock IDs and month start dates
unique_stocks = all_data['stock_id'].unique()
months = pd.date_range(start='2023-01-01', end='2023-12-31', freq='MS')

# Store one row of coefficients per (stock, month)
rows = []
for stock_id in unique_stocks:
    for month_start in months:
        month_end = month_start + pd.offsets.MonthEnd(1)
        params = run_regression(stock_id, month_start, month_end, all_data)
        params['stock_id'] = stock_id
        params['month'] = month_start
        rows.append(params)

# Convert the list of results to a DataFrame and save it to CSV
results_df = pd.DataFrame(rows)
results_df.to_csv('regression_results.csv', index=False)
```

### Key Points

- **Data Alignment**: Ensure that the dates align correctly for the stock returns, market returns, and VIX.
- **Handling Missing Data**: Fill or drop missing values appropriately.
- **Performance Considerations**: Running regressions for a large number of stocks and months can be computationally intensive. Consider optimizing your code or using parallel processing techniques (see the sketch below).

By following these steps, you can systematically run the required regressions for each stock and each month using daily data.
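Since each (stock, month) regression is independent, the loop above parallelizes trivially. A minimal sketch of the parallel-processing idea, assuming the `run_regression` helper and `all_data` frame from the automation example and the third-party `joblib` package (any process pool would work just as well):

```python
import pandas as pd
from joblib import Parallel, delayed

# One task per (stock, month-start, month-end) combination
tasks = [
    (stock_id, month_start, month_start + pd.offsets.MonthEnd(1))
    for stock_id in all_data['stock_id'].unique()
    for month_start in pd.date_range(start='2023-01-01', end='2023-12-31', freq='MS')
]

# n_jobs=-1 uses all available CPU cores
param_rows = Parallel(n_jobs=-1)(
    delayed(run_regression)(stock_id, start, end, all_data)
    for stock_id, start, end in tasks
)

# Attach the identifiers back to each coefficient row
for (stock_id, start, _), params in zip(tasks, param_rows):
    params['stock_id'] = stock_id
    params['month'] = start

results_df = pd.DataFrame(param_rows)
```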