2021-10-27 If Life Could Be Overwritten

shiter

Licensed under CC 4.0 BY-SA

Categories: Programmer's Life; Life Reflections; Lao Wang and His Friends in IT. Tags: scala, spark, programmer's life

Article link: https://blog.youkuaiyun.com/wangyaninglm/article/details/121006035

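October 27, 2021. The stock market was down across the board today, the weather wasn't great either, and at noon I suddenly got word that my eldest uncle had passed away. It hit me hard: I had visited him just over the National Day holiday, and now he was suddenly gone.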

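Thinking back to my primary-school summer holidays, some of the happiest days were spent at my uncle's home. My older cousins took great care of me. On their days off they took me swimming, rock climbing, go-karting, hiking in the Qinling Mountains, and riding a motorcycle on an airstrip... Those days gave me many first experiences, and I miss them dearly.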

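According to my cousins, my uncle lost a lot of money trading stocks. He sat in front of the computer all day, jumping in and out of positions, and his savings all went to commissions and "tuition". His impatient temper, they say, brought on diabetes. Every time I think of this, the obsessive in me wants to open the Ant app and top up my funds by another 100.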

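On my later visits his health kept declining, but he always remembered to send me home with a few packs of cigarettes for my dad: soft-pack Zhonghua. I never understood why he insisted on giving cigarettes and would decline, saying there was no need, so he would have my aunt force them on me, and he was only satisfied once he saw me put them in my bag. I suppose it was his way of showing he cared.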

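When I visited him over National Day, I learned that his lung cancer and bladder cancer had spread and that he had little time left. Knowing this might be the last time I saw him, I thought: we only live once, but if life could be started over, I would probably urge him to smoke less, buy Moutai, and work out more.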

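Perhaps I was oversimplifying things.

That afternoon, while writing code, I fell right into a trap. If life could be overwritten, I would rather have multiple backups, plus Ctrl+Z.

Spark lets you write a model to a directory like this: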

model.write.overwrite().save(".")

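Writing it this way is a damn big problem, especially that overwrite(). When the code above runs, it overwrites the current working directory on the file system, wiping out the code in it; not even data recovery software could bring it back. I really don't understand how it gets to have that much power.

And the code even ran successfully. It wiped out the local test framework project I had spent half a year writing, leaving behind nothing but a data model that can't do squat...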
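
One way to guard against this kind of accident is to check the target path before ever calling overwrite(). The following is a minimal sketch in plain Python, not part of Spark: the helper name `check_overwrite_target` and the idea of a dedicated `output_root` directory are assumptions made for illustration. It refuses any target that is not strictly inside the chosen output root.

```python
from pathlib import Path


def check_overwrite_target(target: str, output_root: str) -> Path:
    """Return the resolved target if it looks safe to overwrite, else raise.

    Hypothetical helper (not a Spark API). A target counts as safe only if,
    after resolving "." and "..", it lies strictly inside output_root.
    """
    root = Path(output_root).resolve()
    # Relative targets resolve against the current working directory.
    resolved = Path(target).resolve()
    if root not in resolved.parents:
        raise ValueError(f"refusing to overwrite {resolved}: not inside {root}")
    return resolved
```

With a guard like this, a call such as `check_overwrite_target(".", "out")` raises instead of handing the project directory to the writer, and only the returned path would be passed on to save().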

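Then I suddenly remembered: I have made this damn mistake twice.

The previous time was on AWS EMR, with the same slick move. I wanted to write files from AWS S3 back to the local disk, and used what I think was overwrite together with: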

save("local:///test/user/")

It wiped my test directory clean.

An even more dangerous variant, which I suspect could take out everything up to the root directory, would be:

save("../../")
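
To see why a relative path is so dangerous, it helps to work out what it resolves to. The Spark source quoted at the end of this post qualifies the path against the file system's working directory before deleting it. The sketch below only approximates that step with Python's `posixpath`; the function name `qualify` and the example working directories are made up for illustration, and Hadoop's own path handling may differ in details.

```python
import posixpath


def qualify(path: str, working_dir: str) -> str:
    """Approximate how a relative path resolves against a working directory."""
    return posixpath.normpath(posixpath.join(working_dir, path))


# Hypothetical working directory; substitute your own to see the effect.
for p in (".", "../../", "out/model"):
    print(p, "->", qualify(p, "/home/user/project"))
```

From a working directory that is only two levels deep, such as a hypothetical `/data/job`, `"../../"` resolves to `/`, which is exactly the scenario feared above.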

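So if life could be started over, would we stop making mistakes? Is it really as simple as keeping multiple save files in a game?

Most things in this world are like that.

Any resemblance in the above to real events is purely coincidental; if it does match, there's nothing I can do about it.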

If you have some spare time, read the source and see what the save and overwrite logic actually does...

https://github.com/apache/spark/blob/v3.2.0/mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala
https://spark.apache.org/docs/latest/api/python/_modules/pyspark/ml/util.html#MLWriter.overwrite

In the Python code, it is called like this:


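# From pyspark/ml/util.py
def overwrite(self):
    """Overwrites if the output path already exists."""
    self._jwrite.overwrite()
    return self

# Tail of the DataFrame writer's mode(saveMode) method. The def line and the
# start of the docstring were missing from the excerpt and are restored here.
def mode(self, saveMode):
    """Specifies the behavior when data or table already exists.

    >>> df.write.mode('append').parquet(os.path.join(tempfile.mkdtemp(), 'data'))
    """
    # At the JVM side, the default value of mode is already set to "error".
    # So, if the given saveMode is None, we will not call JVM-side's mode method.
    if saveMode is not None:
        self._jwrite = self._jwrite.mode(saveMode)
    return self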

The Spark Scala source looks roughly like this; I have excerpted part of it:


  /**
   * Saves the ML instances to the input path.
   */
  @Since("1.6.0")
  @throws[IOException]("If the input path already exists but overwrite is not enabled.")
  def save(path: String): Unit = {
    new FileSystemOverwrite().handleOverwrite(path, shouldOverwrite, sparkSession)
    saveImpl(path)
  }

  /**
   * `save()` handles overwriting and then calls this method.  Subclasses should override this
   * method to implement the actual saving of the instance.
   */
  @Since("1.6.0")
  protected def saveImpl(path: String): Unit

  /**
   * Overwrites if the output path already exists.
   */
  @Since("1.6.0")
  def overwrite(): this.type = {
    shouldOverwrite = true
    this
  }

// ... (rest of MLWriter omitted; the following class is from the same file)
private[ml] class FileSystemOverwrite extends Logging {

  def handleOverwrite(path: String, shouldOverwrite: Boolean, session: SparkSession): Unit = {
    val hadoopConf = session.sessionState.newHadoopConf()
    val outputPath = new Path(path)
    val fs = outputPath.getFileSystem(hadoopConf)
    val qualifiedOutputPath = outputPath.makeQualified(fs.getUri, fs.getWorkingDirectory)
    if (fs.exists(qualifiedOutputPath)) {
      if (shouldOverwrite) {
        logInfo(s"Path $path already exists. It will be overwritten.")
        // TODO: Revert back to the original content if save is not successful.
        fs.delete(qualifiedOutputPath, true)
      } else {
        throw new IOException(s"Path $path already exists. To overwrite it, " +
          s"please use write.overwrite().save(path) for Scala and use " +
          s"write().overwrite().save(path) for Java and Python.")
      }
    }
  }
}
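
The excerpt above explains the accident: when overwrite is enabled and the path exists, the whole path is deleted recursively before anything is written. The following is a small simulation of that delete-then-write order in plain Python, under stated assumptions: it uses the local file system via `shutil` rather than Hadoop, and the names `handle_overwrite`, `save`, and the `metadata` file are stand-ins invented for illustration, not Spark's implementation.

```python
import shutil
import tempfile
from pathlib import Path


def handle_overwrite(path: Path, should_overwrite: bool) -> None:
    """Mimic the delete-before-write step: remove the whole tree if it exists."""
    if path.exists():
        if should_overwrite:
            shutil.rmtree(path)  # recursive, like fs.delete(path, true)
        else:
            raise FileExistsError(f"Path {path} already exists.")


def save(path: Path, should_overwrite: bool = False) -> None:
    handle_overwrite(path, should_overwrite)
    path.mkdir(parents=True)
    (path / "metadata").write_text("model")  # stand-in for saveImpl


with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp) / "project"
    project.mkdir()
    (project / "main.py").write_text("print('half a year of work')")
    # Saving "into" the project directory with overwrite enabled:
    save(project, should_overwrite=True)
    survivors = sorted(p.name for p in project.iterdir())
    print(survivors)  # only the freshly written model file remains
```

In this simulation `main.py` is gone after the save, because the directory itself was the overwrite target. Saving to a dedicated, otherwise empty subdirectory avoids that.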