PySpark Programming Questions and Answers

License: CC 4.0 BY-SA

Tags: PySpark, data frame operations, UDF

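17. Modify the following code block to use Celsius instead of Fahrenheit. How does the result differ if the modified UDF (user-defined function) is applied to the same data frame?

The output is the same. The normalization does not depend on the temperature unit. Converting Fahrenheit to Celsius is an affine transformation with a positive slope, so for a min-max normalization, (temp - min) / (max - min), the offset cancels in both the numerator and the denominator and the scale factor cancels in the ratio. Apart from floating-point rounding, temp_norm is unchanged.

Here is the modified function: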

def scale_temperature_C(temp_by_day: pd.DataFrame) -> pd.DataFrame:
    """Returns a simple normalization of the temperature for a site, in Celsius.
    If the temperature is constant for the whole window, defaults to 0.5."""
    def f_to_c(temp):
        return (temp - 32.0) * 5.0 / 9.0
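
The listing above stops after the conversion helper. The following is a minimal sketch of how the complete function might look. It assumes that the original scale_temperature performs a min-max normalization, (temp - min) / (max - min), and keeps the columns stn, year, mo, da, and temp. Both points are assumptions, since the original function is not shown here. The helper _normalize and the sample data frame are made up for illustration. The sketch then checks that the Fahrenheit and Celsius versions yield the same temp_norm.

```python
import numpy as np
import pandas as pd


def f_to_c(temp):
    """Convert Fahrenheit to Celsius."""
    return (temp - 32.0) * 5.0 / 9.0


def _normalize(temp_by_day: pd.DataFrame, temp: pd.Series) -> pd.DataFrame:
    # Assumed min-max normalization; falls back to 0.5 for a constant window.
    # The returned temp column keeps the original (Fahrenheit) readings.
    answer = temp_by_day[["stn", "year", "mo", "da", "temp"]]
    if temp.min() == temp.max():
        return answer.assign(temp_norm=0.5)
    return answer.assign(temp_norm=(temp - temp.min()) / (temp.max() - temp.min()))


def scale_temperature(temp_by_day: pd.DataFrame) -> pd.DataFrame:
    """Normalize using the Fahrenheit readings as they are."""
    return _normalize(temp_by_day, temp_by_day.temp)


def scale_temperature_C(temp_by_day: pd.DataFrame) -> pd.DataFrame:
    """Normalize after converting the readings to Celsius."""
    return _normalize(temp_by_day, f_to_c(temp_by_day.temp))


# Hypothetical readings for one station over four days.
sample = pd.DataFrame(
    {
        "stn": ["A"] * 4,
        "year": ["2019"] * 4,
        "mo": ["01"] * 4,
        "da": ["01", "02", "03", "04"],
        "temp": [32.0, 50.0, 41.0, 68.0],
    }
)

in_f = scale_temperature(sample)
in_c = scale_temperature_C(sample)
same = bool(np.allclose(in_f.temp_norm, in_c.temp_norm))
print(same)
```

Because only the ratio of differences enters temp_norm, any unit change of the form a * temp + b with a > 0 would pass the same check.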

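18. Suppose there is a Spark data frame named gsod and a function named scale_temperature. The data frame returned by the function has six columns: stn, year, mo, da, temp, and temp_norm. We now group gsod by year and mo and apply a group map UDF:

gsod_exo = gsod.groupby("year", "mo").applyInPandas(scale_temperature, schema=???)

Complete the schema definition in the code, and explain what happens when the group map UDF is applied this way.

The schema should be: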

schema = "year string, mo string, stn string, da string, temp double, temp_norm double"

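Applied this way, the data is grouped by year and mo, scale_temperature is called once per group, and the results are combined into a Spark data frame with the columns named in the schema. Because stn is not part of the grouping key, each group holds every station's readings for that month, so temp_norm is scaled against the monthly minimum and maximum across all stations rather than per station.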
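
The effect of the grouping key can be imitated in plain pandas. This is only a local sketch, not Spark: it loops over the groups, calls the function once per group, and concatenates the results, which mirrors what applyInPandas does for each group. The body of scale_temperature (a min-max normalization), the helper apply_per_group, and the sample readings are assumptions made for illustration.

```python
import pandas as pd


def scale_temperature(temp_by_day: pd.DataFrame) -> pd.DataFrame:
    # Assumed body: min-max normalization, 0.5 for a constant window.
    temp = temp_by_day.temp
    answer = temp_by_day[["stn", "year", "mo", "da", "temp"]]
    if temp.min() == temp.max():
        return answer.assign(temp_norm=0.5)
    return answer.assign(temp_norm=(temp - temp.min()) / (temp.max() - temp.min()))


# Two hypothetical stations reporting in the same month.
readings = pd.DataFrame(
    {
        "stn": ["A", "A", "B", "B"],
        "year": ["2019"] * 4,
        "mo": ["01"] * 4,
        "da": ["01", "02", "01", "02"],
        "temp": [10.0, 20.0, 30.0, 50.0],
    }
)


def apply_per_group(df: pd.DataFrame, keys: list) -> pd.DataFrame:
    # Call the function once per group and stack the results in row order.
    parts = [scale_temperature(group) for _, group in df.groupby(keys)]
    return pd.concat(parts).sort_index()


by_month = apply_per_group(readings, ["year", "mo"])
by_station = apply_per_group(readings, ["stn", "year", "mo"])

print(by_month.temp_norm.tolist())    # [0.0, 0.25, 0.5, 1.0]
print(by_station.temp_norm.tolist())  # [0.0, 1.0, 0.0, 1.0]
```

With the month-only key, station A never reaches 1.0 and station B never reaches 0.0, because both are measured against the shared monthly range.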

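19. Modify the following code block so that it returns the intercept and slope of the linear regression as an ArrayType. (Hint: the intercept is in the intercept_ attribute of the fitted model.)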

from sklearn.linear_model import LinearRegression
from typing import Sequence
import pyspark.sql.functions as F
import pyspark.sql.types as T
import pandas as pd

@F.pandas_udf(T.ArrayType(T.DoubleType()))
def rate_of_change_temperature(day: pd.Series, temp: pd.Series) -> Sequence[float]:
    """Returns the intercept and slope of the daily temperature for a given period of time."""
    model = LinearRegression().fit(
        X=day.astype(int).values.reshape(-1, 1),
        y=temp
    )
    return [model.intercept_, model.coef_[0]]
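
The UDF returns a two-element array with the intercept first and the slope second. As a local sanity check of that ordering, the same least-squares line can be fitted without Spark. The sketch below swaps in numpy.polyfit for LinearRegression so that it needs only NumPy and pandas. The function name intercept_and_slope is hypothetical, and the day and temp values are made up so that they lie exactly on temp = 50 + 2 * day.

```python
import numpy as np
import pandas as pd


def intercept_and_slope(day: pd.Series, temp: pd.Series) -> list:
    """Fit temp = intercept + slope * day and return [intercept, slope]."""
    # np.polyfit returns coefficients from the highest degree to the lowest,
    # so a degree-1 fit yields (slope, intercept).
    slope, intercept = np.polyfit(day.astype(int).to_numpy(), temp.to_numpy(), deg=1)
    return [float(intercept), float(slope)]


# Day values are strings here, which is why they are cast with astype(int).
day = pd.Series(["1", "2", "3", "4"])
temp = pd.Series([52.0, 54.0, 56.0, 58.0])

coefficients = intercept_and_slope(day, temp)
print(np.round(coefficients, 6))
```

In Spark, the two elements of the returned array column can be read by index, for example F.col("coef")[0] for the intercept and F.col("coef")[1] for the slope, where coef is a hypothetical alias given to the aggregated column.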