Stock Data Modeling: From Data Processing to Model Application

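1. Basic Concepts and Function Definitions

In stock data modeling, we need to evaluate each classifier. For every classifier we compute the sum of squared errors (SSE) and then average those values to judge how well classification performs under a given set of parameters.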
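As a rough illustration of that metric, the sketch below computes an SSE per fold and then averages the folds. The function names `sse` and `mean` and the sample numbers are made up for this example; they only stand in for whatever the project's own utilities do.

```clojure
(defn sse
  "Sum of squared errors between expected and actual output sequences."
  [expected actual]
  (reduce + (map (fn [e a] (let [d (- e a)] (* d d))) expected actual)))

(defn mean
  "Arithmetic mean of a non-empty collection of numbers."
  [xs]
  (/ (reduce + xs) (double (count xs))))

;; One SSE per fold (made-up numbers), then the average used to rank a setting.
(def fold-errors
  [(sse [1.0 0.0] [0.5 0.5])    ; 0.25 + 0.25 = 0.5
   (sse [1.0 1.0] [1.0 0.0])])  ; 0.0  + 1.0  = 1.0

(mean fold-errors) ; => 0.75
```

A lower average means the parameter setting did better across the folds.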

First, a few key functions are defined:

(defn accum
  ([] [])
  ([v x] (conj v x)))

(defn x-validate [vocab-size hidden-count period coll]
  (v/k-fold #(make-train vocab-size hidden-count period %)
            #(test-on %1 period %2)
            accum
            10
            coll))

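The accum function accumulates error values into a vector. The x-validate function uses make-train to create and train a new network, test-on to test that network, and accum to collect the error rates.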
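The source of v/k-fold is not shown here, so the following is only a minimal sketch of how a k-fold helper with the same argument order (train function, test function, combiner, k, collection) might work. The toy "model" and "error" are invented so that the block runs on its own.

```clojure
(defn k-fold
  "Splits coll into k folds. For each fold, trains on the remaining folds,
  tests on the held-out fold, and folds the results together with combine.
  combine is called with no arguments to get the initial accumulator."
  [train-fn test-fn combine k coll]
  (let [fold-size (long (Math/ceil (/ (count coll) (double k))))
        folds (vec (partition-all fold-size coll))]
    (reduce (fn [acc i]
              (let [held-out (nth folds i)
                    training (apply concat (concat (subvec folds 0 i)
                                                   (subvec folds (inc i))))
                    model (train-fn training)]
                (combine acc (test-fn model held-out))))
            (combine)
            (range (count folds)))))

(defn accum
  ([] [])
  ([v x] (conj v x)))

;; Toy "model": the mean of the training numbers.
;; Toy "error": squared distance from the model to the held-out mean.
(defn mean [xs] (/ (reduce + xs) (double (count xs))))

(def errors
  (k-fold mean
          (fn [model held-out] (let [d (- model (mean held-out))] (* d d)))
          accum
          5
          (range 10)))
;; errors holds one value per fold
```

The zero-argument arity of accum supplies the empty vector that the fold results are collected into, which is why it is defined with two arities.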

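2. Finding the Best Parameters

We can tune two key parameters: the number of neurons in the hidden layer and how far into the future to predict. Together they span a large search space. We cannot try every combination, but we can test a selection of them to tune the neural network toward the best results.

The following functions explore this search space: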

(defn explore-point [vocab-count period hidden-count training]
  (println period hidden-count)
  (let [error (x-validate
                vocab-count hidden-count period training)]
    (println period hidden-count
             '=> \tab (u/mean error) \tab error)
    (println)
    error))

(def ^:dynamic *hidden-counts* [5 10 25 50 75])

(defn final-eval [vocab-size period hidden-count
                  training-set test-set]
  (let [nnet (make-train
               vocab-size hidden-count period training-set)
        error (test-on nnet period test-set)]
    {:period period
     :hidden-count hidden-count
     :nnet nnet
     :error error}))

(defn explore-params
  ([error-ref vocab-count training]
   (explore-params
     error-ref vocab-count training *hidden-counts* 0.2))
  ([error-ref vocab-count training hidden-counts test-ratio]
   (let [[test-set dev-set] (u/rand-split training test-ratio)
         search-space (for [p periods, h hidden-counts] [p h])]
     (doseq [pair search-space]
       (let [[p h] pair,
             error (explore-point vocab-count p h dev-set)]
         (dosync
           (commute error-ref assoc pair error))))
     (println "Final evaluation against the test set.")
     (let [[period hidden-count]
           ;; min-key takes its candidates as separate arguments, hence apply
           (first (apply min-key #(u/mean (second %)) @error-ref))
           final (final-eval
                   vocab-count period hidden-count
                   dev-set test-set)]
       (dosync
         (commute error-ref assoc :final final))))
   @error-ref))

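The explore-point function tests a single parameter combination. The explore-params function walks the search space, calls explore-point for each combination, and collects the error rates.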
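The search itself can be sketched without any neural network. In the example below, `fake-explore-point` is a hypothetical stand-in that returns invented fold errors; the point is only to show how the search space is enumerated and how the best pair is selected by mean error.

```clojure
(defn mean [xs] (/ (reduce + xs) (double (count xs))))

;; Hypothetical stand-in for explore-point: returns made-up fold errors
;; that grow with the period and with the distance from 10 hidden nodes.
(defn fake-explore-point [period hidden-count]
  (let [base (+ period (Math/abs (- hidden-count 10)))]
    [(* 1.0 base) (* 1.5 base)]))

;; Every [period hidden-count] pair to try.
(def search-space
  (for [p [1 2 3], h [5 10 25]] [p h]))

;; Map from parameter pair to its vector of fold errors.
(def results
  (into {} (map (fn [[p h :as pair]] [pair (fake-explore-point p h)])
                search-space)))

;; min-key takes the candidates as separate arguments, hence apply.
;; Each candidate is a map entry [pair errors]; first extracts the pair.
(def best
  (first (apply min-key #(mean (second %)) results)))
;; best is [1 10] for these made-up errors
```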

3. Loading and Processing the Data

Before modeling the stock data, the relevant data must be loaded and processed.
- Load the stock prices:

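user=> (require
         '[clojure.java.io :as io]   ; io/file is used when loading the articles
         '[clj-time.core :as time]   ; assumed source of time/date-time, used later
         '[me.raynes.fs :as fs]
         '[financial]
         '[financial.types :as t]
         '[financial.nlp :as nlp]
         '[financial.nn :as nn]
         '[financial.oanc :as oanc]
         '[financial.csv-data :as csvd]
         '[financial.utils :as u])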

user=> (def stocks (csvd/read-stock-prices "d/d-1996-2001.csv"))
user=> (def stock-index (nn/index-by :date stocks))

The code above loads the stock prices from a CSV file and indexes them by date.
- Load the news articles:

user=> (def slate (doall
                    (map oanc/load-article
                         (oanc/find-slate-files
                           (io/file "d/OANC-GrAF")))))
user=> (def corpus (nlp/process-articles slate))
user=> (def freqs (nlp/tf-idf-all corpus))
user=> (def vocab (nlp/get-vocabulary corpus))

After the news articles are loaded, their content is processed, the TF-IDF frequencies are computed, and the vocabulary is extracted.
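For readers unfamiliar with TF-IDF, here is a small self-contained sketch of one common variant (relative term frequency times the natural log of N over document frequency). It is not the implementation behind nlp/tf-idf-all, which may weight terms differently; the function names and the two tiny documents are invented for illustration.

```clojure
(defn term-freqs
  "Relative frequency of each token within one document."
  [tokens]
  (let [n (double (count tokens))]
    (into {} (map (fn [[t c]] [t (/ c n)]) (frequencies tokens)))))

(defn idf
  "Inverse document frequency: log(N / number of documents containing the term)."
  [docs]
  (let [n (double (count docs))
        doc-counts (frequencies (mapcat distinct docs))]
    (into {} (map (fn [[t c]] [t (Math/log (/ n c))]) doc-counts))))

(defn tf-idf
  "Returns one {term -> tf-idf weight} map per document."
  [docs]
  (let [idf-map (idf docs)]
    (mapv (fn [tokens]
            (into {} (map (fn [[t tf]] [t (* tf (idf-map t))])
                          (term-freqs tokens))))
          docs)))

(def docs [["stocks" "rise" "stocks"]
           ["stocks" "fall"]])

(def weights (tf-idf docs))
;; "stocks" appears in every document, so its weight is 0.0 in both;
;; "rise" and "fall" each appear in one document and get positive weights.
```

Terms that occur everywhere carry no information about a particular article, which is why the weighting pushes them toward zero.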

4. Creating the Training and Test Sets

Combine the stock price data and the news article data into a training set:

user=> (def training
         (nn/make-training-set stock-index vocab freqs))

At this point, each article has an input vector and a series of outputs for the different stock prices associated with it.
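How nn/make-training-set builds the input vector is not shown. One plausible approach, sketched below under that assumption, is to lay out an article's term weights in a fixed vocabulary order so that every article yields a vector of the same length. The name `feature-vector` and the sample weights are hypothetical.

```clojure
(defn feature-vector
  "Orders one document's term weights by a fixed vocabulary,
  using 0.0 for vocabulary terms the document does not contain."
  [vocab-order weights]
  (mapv #(double (get weights % 0.0)) vocab-order))

;; A fixed ordering of the vocabulary; every article uses the same one.
(def vocab-order ["fall" "rise" "stocks"])

(feature-vector vocab-order {"rise" 0.23 "stocks" 0.0})
;; => [0.0 0.23 0.0]
```

The vector length equals the vocabulary size, which matches the use of (count vocab) as the network's input size elsewhere in this article.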

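5. Finding the Best Parameters for the Neural Network

Use the training data and the parameter ranges to explore the network's parameter space: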

user=> (def error-rates (ref {}))
user=> (nn/explore-params error-rates (count vocab) training)

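This process is very time-consuming. Once it has run for a while and it becomes clear that the model does poorly when predicting more than a day or two ahead, it can be stopped. Because we passed in a reference, we can interrupt the processing and still keep the results collected so far.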
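The benefit of passing in a ref can be seen in a toy version. In this sketch the interruption is simulated with an exception, and `slow-search` is an invented name; each result is committed as soon as it is computed, so whatever finished before the interruption is still in the ref afterwards.

```clojure
(def partial-results (ref {}))

(defn slow-search
  "Records one result per candidate in result-ref as soon as it is computed."
  [result-ref candidates]
  (doseq [c candidates]
    (when (= c 3)
      ;; Simulates the user interrupting the search part-way through.
      (throw (ex-info "interrupted" {:at c})))
    (dosync
      (commute result-ref assoc c (* c c)))))

(try
  (slow-search partial-results [1 2 3 4])
  (catch clojure.lang.ExceptionInfo _
    nil))

@partial-results ; => {1 1, 2 4}
```

Had the function built its result map locally and returned it only at the end, the interruption would have discarded everything.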

Compute the mean of the errors and print the results:

user=> (def error-means
         (into {}
               (map #(vector (first %) (u/mean (second %)))
                    @error-rates)))
user=> (pprint (sort-by second error-means))

The results are as follows:
| Key [period, hidden nodes] | Mean error |
| --- | --- |
| [# 10] | 1.0435393 |
| [# 5] | 1.5253379 |
| [# 25] | 5.0099998 |
| [# 50] | 32.00977 |
| [# 100] | 34.264244 |
| [# 200] | 60.73007 |
| [# 300] | 100.29568 |

Based on these results, we continue training with a prediction period of one day and 10 hidden nodes.

6. Training and Validating the Neural Network

user=> (def nn (nn/make-network (count vocab) 10))
user=> (def day1 (first nn/periods))
user=> (nn/train-for nn day1 training)

During training, the error for each iteration is printed:

Iteration # 1 Error: 22.025400% Target-Error: 1.000000%
Iteration # 2 Error: 19.332094% Target-Error: 1.000000%
Iteration # 3 Error: 14.241920% Target-Error: 1.000000%
Iteration # 4 Error: 6.283643% Target-Error: 1.000000%
Iteration # 5 Error: 0.766110% Target-Error: 1.000000%

Once training finishes, we have a trained neural network.
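The log above shows training stopping once the error drops below the target. The internals of nn/train-for are not shown, so the following is only a generic sketch of a "train until the target error is reached" loop, applied to a one-parameter toy problem. `train-until`, `error-of`, and `gd-step` are names invented for this example.

```clojure
(defn train-until
  "Repeatedly applies step to state until the most recent error is at or
  below target, or max-iterations is reached. Returns the final state and
  the error recorded after each iteration."
  [step error-fn target max-iterations state]
  (loop [state state, i 0, errors []]
    (if (or (>= i max-iterations)
            (and (seq errors) (<= (peek errors) target)))
      {:state state :errors errors}
      (let [next-state (step state)]
        (recur next-state (inc i) (conj errors (error-fn next-state)))))))

;; Toy problem: find w so that w * 2.0 is close to 1.0.
;; The error is the squared difference (2w - 1)^2.
(defn error-of [w] (let [d (- (* w 2.0) 1.0)] (* d d)))

;; One gradient descent step with learning rate 0.1;
;; the gradient of (2w - 1)^2 with respect to w is 2 * 2 * (2w - 1).
(defn gd-step [w] (- w (* 0.1 2.0 2.0 (- (* w 2.0) 1.0))))

(def result (train-until gd-step error-of 0.01 100 0.0))
;; (:errors result) shrinks each iteration until it is at or below 0.01
```

The max-iterations argument guards against a target that is never reached.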

7. Running the Network on New Data

To run the network on new data, we need to prepare new news articles and stock price data.

(def idf-cache (nlp/get-idf-cache corpus))
(def sample-day (time/date-time 2014 3 20 0 0 0))
(def used-vocab (set (map first idf-cache)))

(def articles (doall
                (->> "d/slate/"
                  fs/list-dir
                  (map #(str "d/slate/" %))
                  (map #(oanc/load-text-file sample-day %))
                  (nlp/load-text-files used-vocab idf-cache))))

(def recent-stocks (csvd/read-stock-prices "d/d-2013-2014.csv"))
(def recent-index (nn/index-by :date recent-stocks))

(def inputs
  (map #(nn/make-feature-vector recent-index used-vocab %)
       articles))

user=> (pprint
         (flatten
           (map vec
                (map #(nn/run-network nn %) inputs))))

The output is as follows:

(0.5046613110846201
 0.5046613110846201
 0.5046613135395166
 0.5046613110846201
 0.5046613110846201
 0.5046613110846201
 0.5046613110846201
 0.5046613110846201
 0.5046613112651592
 0.5046613110846201)

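These outputs are nearly identical. Read through the sigmoid function, values this close to 0.5 mean the model predicts little change in the stock price over the next day. In fact, the stock closed at $69.77 on March 20 and at $70.06 on March 21, a rise of only $0.29, which is consistent with the model's prediction.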
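To make the "close to 0.5" reading concrete, the sketch below inverts the sigmoid for one of the outputs and works out the actual price change from the two closing prices quoted above. It assumes a plain logistic sigmoid on the output; how the network scales its targets is not shown in this article, so the pre-activation value is illustrative only.

```clojure
(defn sigmoid [x] (/ 1.0 (+ 1.0 (Math/exp (- x)))))

(defn logit
  "Inverse of sigmoid: maps an output in (0, 1) back to the pre-activation value."
  [y]
  (Math/log (/ y (- 1.0 y))))

(sigmoid 0.0)               ; => 0.5, the midpoint of the sigmoid
(logit 0.5046613110846201)  ; a small positive number, roughly 0.019

;; Closing prices quoted in the text.
(def close-mar-20 69.77)
(def close-mar-21 70.06)

(def pct-change
  (* 100.0 (/ (- close-mar-21 close-mar-20) close-mar-20)))
;; roughly 0.42 percent
```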

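8. Limitations of the Analysis

Several limitations should be kept in mind when modeling stock data this way:
- Data:
  - Articles from more than one source are needed.
  - Articles covering a wider time range are needed.
  - Articles need to be spread more densely across the time period.
- Machine learning and market modeling:
  - Simply combining stock data with machine learning is risky, and it should not be treated as a get-rich-quick scheme.
  - The relationship between news articles and stock prices is weak, and stock prices may simply be hard to predict from news coverage.

Doing stock data modeling well requires at least a working knowledge of both financial modeling and machine learning in order to design better models.

In short, stock data modeling is a complex process. Every stage, from data processing to model training and application, needs careful attention. The limitations of the analysis should also be recognized so that the model can be improved over time.

Mermaid flowchart: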

graph LR
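    A[Load stock prices and news articles] --> B[Create training and test sets]
    B --> C[Find the best neural network parameters]
    C --> D[Train and validate the neural network]
    D --> E[Run the network on new data]
    E --> F[Analyze results and consider limitations]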

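With these steps we have walked through the whole stock data modeling workflow: acquiring and processing the data, training and tuning the model, and applying it to new data and analyzing the results. The stages depend closely on one another, and each needs careful work to improve the model's accuracy and reliability. The stock market is also complex and uncertain, so investment decisions should not rest on a model alone.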

9. Summary of Key Steps

To make the overall workflow easier to follow, the key steps are summarized below:
| Step | Action | Code example |
| --- | --- | --- |
| 1 | Load the required namespaces | (require '[me.raynes.fs :as fs] '[financial] '[financial.types :as t] '[financial.nlp :as nlp] '[financial.nn :as nn] '[financial.oanc :as oanc] '[financial.csv-data :as csvd] '[financial.utils :as u]) |
| 2 | Load the stock price data | (def stocks (csvd/read-stock-prices "d/d-1996-2001.csv")) (def stock-index (nn/index-by :date stocks)) |
| 3 | Load the news article data | (def slate (doall (map oanc/load-article (oanc/find-slate-files (io/file "d/OANC-GrAF"))))) (def corpus (nlp/process-articles slate)) (def freqs (nlp/tf-idf-all corpus)) (def vocab (nlp/get-vocabulary corpus)) |
| 4 | Create the training set | (def training (nn/make-training-set stock-index vocab freqs)) |
| 5 | Find the best parameters | (def error-rates (ref {})) (nn/explore-params error-rates (count vocab) training) |
| 6 | Train the neural network | (def nn (nn/make-network (count vocab) 10)) (def day1 (first nn/periods)) (nn/train-for nn day1 training) |
| 7 | Prepare the new data | (def idf-cache (nlp/get-idf-cache corpus)) (def sample-day (time/date-time 2014 3 20 0 0 0)) (def used-vocab (set (map first idf-cache))) (def articles (doall (->> "d/slate/" fs/list-dir (map #(str "d/slate/" %)) (map #(oanc/load-text-file sample-day %)) (nlp/load-text-files used-vocab idf-cache)))) (def recent-stocks (csvd/read-stock-prices "d/d-2013-2014.csv")) (def recent-index (nn/index-by :date recent-stocks)) (def inputs (map #(nn/make-feature-vector recent-index used-vocab %) articles)) |
| 8 | Run the network | (pprint (flatten (map vec (map #(nn/run-network nn %) inputs)))) |

10. A Closer Look at the Code

A few functions do most of the work in this process. Their logic is examined in more detail below:
- The x-validate function:

(defn x-validate [vocab-size hidden-count period coll]
  (v/k-fold #(make-train vocab-size hidden-count period %)
            #(test-on %1 period %2)
            accum
            10
            coll))

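This function uses v/k-fold to perform k-fold cross-validation. #(make-train vocab-size hidden-count period %) creates and trains a new network, #(test-on %1 period %2) tests that network, accum accumulates the error values, and 10 is the number of folds.
- The explore-params function: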

(defn explore-params
  ([error-ref vocab-count training]
   (explore-params
     error-ref vocab-count training *hidden-counts* 0.2))
  ([error-ref vocab-count training hidden-counts test-ratio]
   (let [[test-set dev-set] (u/rand-split training test-ratio)
         search-space (for [p periods, h hidden-counts] [p h])]
     (doseq [pair search-space]
       (let [[p h] pair,
             error (explore-point vocab-count p h dev-set)]
         (dosync
           (commute error-ref assoc pair error))))
     (println "Final evaluation against the test set.")
     (let [[period hidden-count]
           ;; min-key takes its candidates as separate arguments, hence apply
           (first (apply min-key #(u/mean (second %)) @error-ref))
           final (final-eval
                   vocab-count period hidden-count
                   dev-set test-set)]
       (dosync
         (commute error-ref assoc :final final))))
   @error-ref))

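This function first splits the data into a test set and a development set. It then walks the search space, calling explore-point for each parameter combination and storing the error rates in error-ref. Finally, it picks the combination with the lowest mean error for a final evaluation against the test set.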
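The split step can be sketched as follows. The real u/rand-split is not shown, so this version is an assumption about its behavior inferred from the call site: it takes a collection and a ratio and returns the held-out portion first, followed by the remainder.

```clojure
(defn rand-split
  "Shuffles coll and returns [held-out remainder], where held-out holds
  roughly ratio of the items."
  [coll ratio]
  (let [shuffled (shuffle coll)
        n (long (Math/round (* ratio (count coll))))]
    [(vec (take n shuffled)) (vec (drop n shuffled))]))

(def split (rand-split (range 10) 0.2))
(def test-set (first split))   ; 2 items, held back for the final evaluation
(def dev-set (second split))   ; 8 items, used for the parameter search
```

Keeping the test set out of the parameter search means the final evaluation measures the chosen parameters on data that played no part in choosing them.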

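11. Practical Use and Caveats

In practice, the model's predictions can feed into decisions, but the following points need attention:
- Data quality: Make sure the stock price data and the news article data are accurate and complete. Errors or missing values in the data can degrade the model's performance.
- Generalization: Cross-validation and parameter tuning can improve performance, but watch the model's ability to generalize. Overfitting the training data leads to poor results on new data.
- Market changes: The stock market changes constantly, so the model's performance may decline over time. The data should be refreshed and the model retrained regularly to keep up with the market.

12. Future Directions

As the technology develops, stock data modeling can move in several directions:
- More data sources: Beyond news articles and stock prices, social media data, macroeconomic data, and similar sources could be combined to improve the model's predictive power.
- More complex models: Deep learning models such as recurrent neural networks (RNN) and long short-term memory networks (LSTM) could be tried in order to handle sequential data and capture more complex patterns.
- Real-time prediction: Processing data and updating the model in real time would provide more timely predictions.

Mermaid flowchart: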

graph LR
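    A[Check data quality] --> B[Train and tune the model]
    B --> C[Deploy and monitor]
    C --> D{Does performance meet requirements?}
    D -- Yes --> E[Continue using the model]
    D -- No --> F[Adjust the model or data]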
    F --> B

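The analysis and summary above give a deeper understanding of stock data modeling. In practice, choose methods and techniques that suit the situation at hand and keep refining the model to improve the accuracy and reliability of its predictions. The stock market remains complex and uncertain, so investment decisions should not rest on a model alone.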