model ensemble guide

This article gives a detailed introduction to model ensembling techniques for improving accuracy on machine learning tasks. From the perspective of Kaggle competitions, it explores how to combine the predictions of multiple models through voting, averaging, and stacked generalization, analyzes why model ensembling reduces generalization error, and compares the effectiveness of several ensembling methods.


http://mlwave.com/kaggle-ensembling-guide/

There are many very good blog posts here.



Model ensembling is a very powerful technique to increase accuracy on a variety of ML tasks. In this article I will share my ensembling approaches for Kaggle Competitions.

For the first part we look at creating ensembles from submission files. The second part will look at creating ensembles through stacked generalization/blending.

I answer why ensembling reduces the generalization error. Finally I show different methods of ensembling, together with their results and code to try it out for yourself.

“This is how you win ML competitions: you take other peoples’ work and ensemble them together.” Vitaly Kuznetsov NIPS2014

Creating ensembles from submission files

The most basic and convenient way to ensemble is to ensemble Kaggle submission CSV files. You only need the predictions on the test set for these methods — no need to retrain a model. This makes it a quick way to ensemble already existing model predictions, ideal when teaming up.

Voting ensembles.

We first take a look at a simple majority vote ensemble. Let’s see why model ensembling reduces error rate and why it works better to ensemble low-correlated model predictions.

Error correcting codes

During space missions it is very important that all signals are correctly relayed.

If we have a signal in the form of a binary string like:

1110110011101111011111011011

and somehow this signal is corrupted (a bit is flipped) to:

1010110011101111011111011011

then lives could be lost.

A coding solution was found in error-correcting codes. The simplest error-correcting code is a repetition code: relay the signal multiple times in equally sized chunks and take a majority vote.

Original signal:
1110110011

Encoded (the 10-bit signal relayed 3 times):
101011001111101100111110110011

Decoding:
1010110011
1110110011
1110110011

Majority vote:
1110110011

Signal corruption is a very rare occurrence and usually happens in small bursts. So it figures that it is even rarer to have a corrupted majority vote.

As long as the corruption is not completely unpredictable (i.e. a bit does not have a 50% chance of being flipped), signals can be repaired.

A machine learning example

Suppose we have a test set of 10 samples. The ground truth is all positive (“1”):

1111111111

We furthermore have 3 binary classifiers (A,B,C) with a 70% accuracy. You can view these classifiers for now as pseudo-random number generators which output a “1” 70% of the time and a “0” 30% of the time.

We will now show how these pseudo-classifiers are able to obtain 78% accuracy through a voting ensemble.

A pinch of maths

For a majority vote with 3 members we can expect 4 outcomes:

All three are correct
  0.7 * 0.7 * 0.7
= 0.343

Two are correct
  0.7 * 0.7 * 0.3
+ 0.7 * 0.3 * 0.7
+ 0.3 * 0.7 * 0.7
= 0.441

Two are wrong
  0.3 * 0.3 * 0.7
+ 0.3 * 0.7 * 0.3
+ 0.7 * 0.3 * 0.3
= 0.189

All three are wrong
  0.3 * 0.3 * 0.3
= 0.027

We see that in roughly 44% of cases the majority vote corrects an error. This majority vote ensemble will be correct on average ~78% of the time (0.343 + 0.441 = 0.784).

Number of voters

Just as repetition codes increase their error-correcting capability when the signal is repeated more often, ensembles usually improve when more ensemble members are added.

Repetition codes performance on graph

Using the same pinch of maths as above: a voting ensemble of 5 pseudo-random classifiers with 70% accuracy would be correct ~84% of the time. One or two errors are corrected in ~67% of the majority votes (0.36015 + 0.3087 = 0.66885).
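These numbers are easy to reproduce for any ensemble size. A minimal Python sketch (standard library only; the function name is mine):

from math import comb

def majority_vote_accuracy(n_members, p_correct):
    # Probability that a strict majority of n independent members is correct.
    k_needed = n_members // 2 + 1
    return sum(comb(n_members, k) * p_correct**k * (1 - p_correct)**(n_members - k)
               for k in range(k_needed, n_members + 1))

print(majority_vote_accuracy(3, 0.7))  # 0.784
print(majority_vote_accuracy(5, 0.7))  # ~0.837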

Correlation

When I first joined the team for KDD-cup 2014, Marios Michailidis (KazAnova) proposed something peculiar. He calculated the Pearson correlation for all our submission files and gathered a few well-performing models which were less correlated.
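Such a correlation check takes only a few lines of pandas. A rough sketch, assuming submission CSVs with “Id” and “Prediction” columns (the file names are hypothetical):

import pandas as pd

files = ["model_a.csv", "model_b.csv", "model_c.csv"]  # hypothetical submission files
preds = pd.concat(
    [pd.read_csv(f).set_index("Id")["Prediction"].rename(f) for f in files],
    axis=1)

# Pearson correlation matrix between submissions: pick well-performing,
# low-correlation candidates for the ensemble.
print(preds.corr(method="pearson"))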

Creating an averaging ensemble from these diverse submissions gave us the biggest 50-spot jump on the leaderboard. Uncorrelated submissions clearly do better when ensembled than correlated submissions. But why?

To see this, let us take 3 simple models again. The ground truth is still all 1’s:

1111111100 = 80% accuracy
1111111100 = 80% accuracy
1011111100 = 70% accuracy.

These models are highly correlated in their predictions. When we take a majority vote we see no improvement:

1111111100 = 80% accuracy

Now we compare to 3 less-performing, but highly uncorrelated models:

1111111100 = 80% accuracy
0111011101 = 70% accuracy
1000101111 = 60% accuracy

When we ensemble this with a majority vote we get:

1111111101 = 90% accuracy

Which is an improvement: A lower correlation between ensemble model members seems to result in an increase in the error-correcting capability.

Use for Kaggle: Forest Cover Type prediction

Majority votes make the most sense when the evaluation metric requires hard predictions, for instance with (multiclass) classification accuracy.

The forest cover type prediction challenge uses the UCI Forest CoverType dataset. The dataset has 54 attributes and 7 classes.

We create a simple starter model with a 500-tree Random Forest. We then create a few more models and pick the best performing one. For this task and our model selection an ExtraTreesClassifier works best.

Weighing

We then use a weighted majority vote. Why weighing? Usually we want to give a better model more weight in a vote. So in our case we count the vote by the best model 3 times. The other 4 models count for one vote each.
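A minimal sketch of such a weighted vote over class-label submissions (file and column names are assumptions, not the actual competition files):

import pandas as pd

# Hypothetical submission files with an "Id" column and a hard class label in "Cover_Type".
files   = ["gbm.csv", "rf_gini.csv", "rf_entropy.csv", "et_entropy.csv", "et_gini.csv"]
weights = [1, 1, 1, 1, 3]  # the best model casts 3 votes, the others 1 vote each

subs = [pd.read_csv(f).set_index("Id")["Cover_Type"] for f in files]

columns = []
for sub, w in zip(subs, weights):
    columns.extend([sub] * w)          # a model with weight w contributes w identical columns
votes = pd.concat(columns, axis=1)

ensemble = votes.mode(axis=1)[0]       # row-wise majority vote (ties fall back to the lowest label)
ensemble.rename("Cover_Type").astype(int).to_csv("weighted_vote.csv", header=True)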

The reasoning is as follows: The only way for the inferior models to overrule the best model (expert) is for them to collectively (and confidently) agree on an alternative.

We can expect this ensemble to repair a few erroneous choices by the best model, leading to a small improvement only. That’s our punishment for forgoing a democracy and creating a Plato’s Republic.

“Every city encompasses two cities that are at war with each other.” Plato in The Republic

Table 1. shows the result of training 5 models, and the resulting score when combining these with a weighted majority vote.

MODEL PUBLIC ACCURACY SCORE
GradientBoostingMachine 0.65057
RandomForest Gini 0.75107
RandomForest Entropy 0.75222
ExtraTrees Entropy 0.75524
ExtraTrees Gini (Best) 0.75571
Voting Ensemble (Democracy) 0.75337
Voting Ensemble (3*Best vs. Rest) 0.75667

Use for Kaggle: CIFAR-10 Object detection in images

CIFAR-10 is another multi-class classification challenge where accuracy matters.

Our team leader for this challenge, Phil Culliton, first found the best setup to replicate a good model from Dr. Graham.

Then he used a voting ensemble of around 30 convnet submissions (all scoring above 90% accuracy). The best single model of the ensemble scored 0.93170.

A voting ensemble of 30 models scored 0.94120. A ~0.01 reduction in error rate, pushing the resulting score beyond the estimated human classification accuracy.

Code

We have a sample voting script you could use at the MLWave Github repo. It operates on a directory of Kaggle submissions and creates a new submission. Update: Armando Segnini has added weighing.

Ensembling. Train 10 neural networks and average their predictions. It’s a fairly trivial technique that results in easy, sizeable performance improvements.

One may be mystified as to why averaging helps so much, but there is a simple reason for the effectiveness of averaging. Suppose that two classifiers have an error rate of 70%. Then, when they agree they are right. But when they disagree, one of them is often right, so now the average prediction will place much more weight on the correct answer.

The effect will be especially strong whenever the network is confident when it’s right and unconfident when it’s wrong. Ilya Sutskever A brief overview of Deep Learning.

Averaging

Averaging works well for a wide range of problems (both classification and regression) and metrics (AUC, squared error or logarithmic loss).

There is not much more to averaging than taking the mean of individual model predictions. An often heard shorthand for this on Kaggle is “bagging submissions”.
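As a sketch, “bagging submissions” with both an arithmetic and a geometric mean (file names assumed; the geometric mean needs strictly positive predictions):

import numpy as np
import pandas as pd

files = ["sub1.csv", "sub2.csv", "sub3.csv"]  # hypothetical submission files
preds = pd.concat(
    [pd.read_csv(f).set_index("Id")["Prediction"] for f in files], axis=1)

arithmetic = preds.mean(axis=1)
geometric  = np.exp(np.log(preds).mean(axis=1))  # geometric mean of the predictions

arithmetic.rename("Prediction").to_csv("avg_submission.csv", header=True)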

Averaging predictions often reduces overfit. You ideally want a smooth separation between classes, and a single model’s predictions can be a little rough around the edges.

Learning from noise

The above image is from the Kaggle competition: Don’t Overfit!, the black line shows a better separation than the green line. The green line has learned from noisy datapoints. No worries! Averaging multiple different green lines should bring us closer to the black line.

Remember our goal is not to memorize the training data (there are far more efficient ways to store data than inside a random forest), but to generalize well to new unseen data.

Kaggle use: Bag of Words Meets Bags of Popcorn

This is a movie sentiment analysis contest. In a previous post we used an online perceptron script to get 95.2 AUC.

The perceptron is a decent linear classifier which is guaranteed to find a separation if the data is linearly separable. This is a welcome property to have, but you have to realize a perceptron stops learning once this separation is reached. It does not necessarily find the best separation for new data.

So what would happen if we initialize 5 perceptrons with random weights and combine their predictions through an average? Why, we get an improvement on the test set!

MODEL PUBLIC AUC SCORE
Perceptron 0.95288
Random Perceptron 0.95092
Random Perceptron 0.95128
Random Perceptron 0.95118
Random Perceptron 0.95072
Bagged Perceptrons 0.95427

Above results also illustrate that ensembling can (temporarily) save you from having to learn about the finer details and inner workings of a specific Machine Learning algorithm. If it works, great! If it doesn’t, not much harm done.

Perceptron bagging

You also won’t get a penalty for averaging 10 copies of exactly the same linear regression. Bagging a single poorly cross-validated and overfitted submission may even bring you some gain through the added diversity (and thus lower correlation).

Code

We have posted a simple averaging script on Github that takes as input a directory of .csv files and outputs an averaged submission. Update: Dat Le has added a geometric averaging script. Geometric mean can outperform a plain average.

Rank averaging

When averaging the outputs from multiple different models some problems can pop up. Not all predictors are perfectly calibrated: they may be over- or underconfident when predicting a low or high probability. Or the predictions cluster around a certain range.

In the extreme case you may have a submission which looks like this:

Id,Prediction
1,0.35000056
2,0.35000002
3,0.35000098
4,0.35000111

Such a prediction may do well on the leaderboard when the evaluation metric is ranking or threshold based like AUC. But when averaged with another model like:

Id,Prediction
1,0.57
2,0.04
3,0.96
4,0.99

it will not change the ensemble much at all.

Our solution is to first turn the predictions into ranks, then average these ranks.

Id,Rank,Prediction
1,1,0.35000056
2,0,0.35000002
3,2,0.35000098
4,3,0.35000111

After normalizing the averaged ranks between 0 and 1 you are sure to get an even distribution in your predictions. The resulting rank-averaged ensemble:

Id,Prediction
1,0.33
2,0.0
3,0.66
4,1.0
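
A sketch of this rank averaging with pandas (file names assumed; pandas’ rank() does the ranking):

import pandas as pd

files = ["well_calibrated.csv", "clustered_predictions.csv"]  # hypothetical submissions
preds = pd.concat(
    [pd.read_csv(f).set_index("Id")["Prediction"] for f in files], axis=1)

ranks = preds.rank(axis=0)        # rank each submission separately
avg_rank = ranks.mean(axis=1)     # average the ranks per Id
normalized = (avg_rank - avg_rank.min()) / (avg_rank.max() - avg_rank.min())  # rescale to [0, 1]

normalized.rename("Prediction").to_csv("rank_average.csv", header=True)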
Historical ranks.

Ranking requires a test set. So what do you do when you want predictions for a single new sample? You could rank it together with the old test set, but this will increase the complexity of your solution.

A solution is using historical ranks. Store the old test set predictions together with their rank. Now when you predict a new test sample like “0.35000110” you find the closest old prediction and take its historical rank (in this case rank “3” for “0.35000111”).
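A small sketch of that historical-rank lookup, using the toy numbers above:

import numpy as np

# Old test-set predictions with their stored ranks.
old_preds = np.array([0.35000056, 0.35000002, 0.35000098, 0.35000111])
old_ranks = np.array([1, 0, 2, 3])

def historical_rank(new_pred):
    # Find the closest old prediction and reuse its rank.
    return old_ranks[np.argmin(np.abs(old_preds - new_pred))]

print(historical_rank(0.35000110))  # 3, closest to 0.35000111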

Kaggle use case: Acquire Valued Shoppers Challenge

Ranking averages do well on ranking and threshold-based metrics (like AUC) and search-engine quality metrics (like average precision at k).

The goal of the shopper challenge was to rank the chance that a shopper would become a repeat customer.

Our team first took an average of multiple Vowpal Wabbit models together with an R GLMNet model. Then we used a ranking average to improve the exact same ensemble.

MODEL PUBLIC PRIVATE
Vowpal Wabbit A 0.60764 0.59962
Vowpal Wabbit B 0.60737 0.59957
Vowpal Wabbit C 0.60757 0.59954
GLMNet 0.60433 0.59665
Average Bag 0.60795 0.60031
Rank average Bag 0.61027 0.60187

I already wrote about the Avito challenge where rank averaging gave us a hefty increase.

Finally, when weighted rank averaging the bagged perceptrons from the previous chapter (1x) with the new bag-of-words tutorial (3x) on fastML.com we improve that model’s performance from 0.96328 AUC to 0.96461 AUC.

Code

A simple work-horse rank averaging script is added to the MLWave Github repo.

Competitions are effective because there are any number of techniques that can be applied to any modeling problem, but we can’t know in advance which will be most effective. Anthony Goldbloom Data Prediction Competitions — Far More than Just a Bit of Fun

Whiskey blending

From ‘How Scotch Blended Whisky is Made’ on Youtube

Stacked Generalization & Blending

Averaging prediction files is nice and easy, but it’s not the only method that the top Kagglers are using. The serious gains start with stacking and blending. Hold on to your top-hats and petticoats: Here be dragons. With 7 heads. Standing on top of 30 other dragons.

Netflix

Netflix organized and popularized the first data science competitions. Competitors in the movie recommendation challenge really pushed the state of the art on ensemble creation, perhaps so much so that Netflix decided not to implement the winning solution in production. That one was simply too complex.

Nevertheless, a number of papers and novel methods resulted from this challenge.

All are interesting, accessible and relevant reads when you want to improve your Kaggle game.

Netflix Prize Leaderboard

This is a truly impressive compilation and culmination of years of work, blending hundreds of predictive models to finally cross the finish line. We evaluated some of the new methods offline but the additional accuracy gains that we measured did not seem to justify the engineering effort needed to bring them into a production environment. Netflix Engineers

Stacked generalization

Stacked generalization was introduced by Wolpert in a 1992 paper, 2 years before the seminal Breiman paper “Bagging Predictors“. Wolpert is famous for another very popular machine learning theorem: “There is no free lunch in search and optimization“.

The basic idea behind stacked generalization is to use a pool of base classifiers, then use another classifier to combine their predictions, with the aim of reducing the generalization error.

Let’s say you want to do 2-fold stacking:

  • Split the train set in 2 parts: train_a and train_b
  • Fit a first-stage model on train_a and create predictions for train_b
  • Fit the same model on train_b and create predictions for train_a
  • Finally fit the model on the entire train set and create predictions for the test set.
  • Now train a second-stage stacker model on the probabilities from the first-stage model(s).

A stacker model gets more information on the problem space by using the first-stage predictions as features, than if it was trained in isolation.
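A minimal scikit-learn sketch of the 2-fold recipe above, assuming numpy arrays and a binary target (model choices and names are mine, not a reference implementation):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

def two_fold_stacking(X_train, y_train, X_test):
    base = RandomForestClassifier(n_estimators=100, random_state=0)
    oof = np.zeros(len(X_train))                      # out-of-fold predictions for the train set

    for tr_idx, val_idx in KFold(n_splits=2, shuffle=True, random_state=0).split(X_train):
        base.fit(X_train[tr_idx], y_train[tr_idx])
        oof[val_idx] = base.predict_proba(X_train[val_idx])[:, 1]

    base.fit(X_train, y_train)                        # refit on the entire train set
    test_meta = base.predict_proba(X_test)[:, 1]      # first-stage predictions for the test set

    stacker = LogisticRegression()                    # second-stage model on the probabilities
    stacker.fit(oof.reshape(-1, 1), y_train)
    return stacker.predict_proba(test_meta.reshape(-1, 1))[:, 1]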

It is usually desirable that the level 0 generalizers are of all “types”, and not just simple variations of one another (e.g., we want surface-fitters, Turing-machine builders, statistical extrapolators, etc., etc.). In this way all possible ways of examining the learning set and trying to extrapolate from it are being exploited. This is part of what is meant by saying that the level 0 generalizers should “span the space”.

[…] stacked generalization is a means of non-linearly combining generalizers to make a new generalizer, to try to optimally integrate what each of the original generalizers has to say about the learning set. The more each generalizer has to say (which isn’t duplicated in what the other generalizer’s have to say), the better the resultant stacked generalization. Wolpert (1992) Stacked Generalization

Blending

Blending is a word introduced by the Netflix winners. It is very close to stacked generalization, but a bit simpler and with less risk of an information leak. Some researchers use “stacked ensembling” and “blending” interchangeably.

With blending, instead of creating out-of-fold predictions for the train set, you create a small holdout set of say 10% of the train set. The stacker model then trains on this holdout set only.
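A rough sketch of that holdout scheme with scikit-learn (a 10% holdout; model choices are mine):

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def blend(X, y, X_test):
    X_base, X_hold, y_base, y_hold = train_test_split(X, y, test_size=0.1, random_state=0)

    generalizers = [RandomForestClassifier(n_estimators=100, random_state=0),
                    GradientBoostingClassifier(random_state=0)]
    for g in generalizers:
        g.fit(X_base, y_base)                         # the base models never see the holdout

    # Holdout predictions become the training features of the stacker.
    hold_meta = np.column_stack([g.predict_proba(X_hold)[:, 1] for g in generalizers])
    test_meta = np.column_stack([g.predict_proba(X_test)[:, 1] for g in generalizers])

    stacker = LogisticRegression().fit(hold_meta, y_hold)
    return stacker.predict_proba(test_meta)[:, 1]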

Blending has a few benefits:

  • It is simpler than stacking.
  • It wards against an information leak: The generalizers and stackers use different data.
  • You do not need to share a seed for stratified folds with your teammates. Anyone can throw models in the ‘blender’ and the blender decides if it wants to keep that model or not.

The cons are:

  • You use less data overall
  • The final model may overfit to the holdout set.
  • Your CV is more solid with stacking (calculated over more folds) than using a single small holdout set.

As for performance, both techniques are able to give similar results, and which one you use seems to be a matter of preference and skill. I myself prefer stacking.

If you can not choose, you can always do both. Create stacked ensembles with stacked generalization and out-of-fold predictions. Then use a holdout set to further combine these models at a third stage.

Stacking with logistic regression

Stacking with logistic regression is one of the more basic and traditional ways of stacking. A script I found by Emanuele Olivetti helped me understand this.

When creating predictions for the test set, you can do that in one go, or take an average of the out-of-fold predictors. Though taking the average is the cleaner and more accurate way to do this, I still prefer to do it in one go as that slightly lowers both model and coding complexity.

Kaggle use: “Papirusy z Edhellond”

I used the above blend.py script by Emanuele to compete in this inClass competition. Stacking 8 base models (diverse ET’s, RF’s and GBM’s) with Logistic Regression gave me my second best score of 0.99409 accuracy, good for first place.

Kaggle use: KDD-cup 2014

Using this script I was able to improve a model from Yan Xu. Her model before stacking scored ~0.605 AUC. With stacking this improved to ~0.625.

Stacking with non-linear algorithms

Popular non-linear algorithms for stacking are GBM, KNN, NN, RF and ET.

Non-linear stacking with the original features on multiclass problems gives surprising gains. Obviously the first-stage predictions are very informative and get the highest feature importance. Non-linear algorithms find useful interactions between the original features and the meta-model features.

Kaggle use: TUT Headpose Estimation Challenge

The TUT Headpose Estimation challenge can be treated as a multi-class, multi-label classification challenge.

For every label a separate ensemble model was trained.

The following table shows the result of training individual models, and their improved scores when stacking the predicted class probabilities with an extremely randomized trees model.

MODEL PUBLIC MAE PRIVATE MAE
Random Forests 500 estimators 6.156 6.546
Extremely Randomized Trees 500 estimators 6.317 6.666
KNN-Classifier with 5 neighbors 6.828 7.460
Logistic Regression 6.694 6.949
Stacking with Extremely Randomized Trees 4.772 4.718

We see that stacked generalization with standard models is able to reduce the error by around 30%(!).

Read more about this result in the paper: Computer Vision for Head Pose Estimation: Review of a Competition.

Code

You can find a function to create out-of-fold probability predictions in the MLWave Github repo. You could use numpy horizontal stacking (hstack) to create blended datasets.

Feature weighted linear stacking

Feature-weighted linear stacking stacks engineered meta-features together with model predictions. The hope is that the stacking model learns which base model is the best predictor for samples with a certain feature value. Linear algorithms are used to keep the resulting model fast and simple to inspect.

Blended prediction

Vowpal Wabbit can implement a form of feature-weighted linear stacking out of the box. If we have a train set like:

1 |f f_1:0.55 f_2:0.78 f_3:7.9 |s RF:0.95 ET:0.97 GBM:0.92

We can add quadratic feature interactions between the s-featurespace and the f-featurespace by adding -q fs. The features in the f-namespace can be engineered meta-features like in the paper, or they can be the original features.

Quadratic linear stacking of models

This did not have a name so I made one up. It is very similar to feature-weighted linear stacking, but it creates combinations of model predictions. This improved the score on numerous experiments, most noticeably on the Modeling Women’s Healthcare Decision competition on DrivenData.

Using the same VW training set as before:

1 |f f_1:0.55 f_2:0.78 f_3:7.9 |s RF:0.95 ET:0.97 GBM:0.92

We can train with -q ss creating quadratic feature interactions (RF*GBM) between the model predictions.

This can easily be combined with feature-weighted linear stacking: -q fs -q ss, possibly improving on both.

So now you have a case where many base models should be created. You don’t know a priori which of these models are going to be helpful in the final meta model. In the case of two stage models, it is highly likely weak base models are preferred.

So why tune these base models very much at all? Perhaps tuning here is just obtaining model diversity. But at the end of the day you don’t know which base models will be helpful. And the final stage will likely be linear (which requires no tuning, or perhaps a single parameter to give some sparsity). Mike Kim, Tuning doesn’t matter. Why are you doing it?

Stacking classifiers with regressors and vice versa

Stacking allows you to use classifiers for regression problems and vice versa. For instance, one may try a base model with quantile regression on a binary classification problem. A good stacker should be able to take information from the predictions, even though a regressor is usually not the best classifier.

Using classifiers for regression problems is a bit trickier. You use binning first: You turn the y-label into evenly spaced classes. A regression problem that requires you to predict wages can be turned into a multiclass classification problem like so:

  • Everything under 20k is class 1.
  • Everything between 20k and 40k is class 2.
  • Everything over 40k is class 3.

The predicted probabilities for these classes can help a stacking regressor make better predictions.
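The binning step itself is one line with numpy (using the 20k/40k cut points from the list above):

import numpy as np

def bin_wages(y_wages):
    # Class 1: under 20k, class 2: 20k-40k, class 3: over 40k.
    return np.digitize(y_wages, bins=[20_000, 40_000]) + 1

print(bin_wages(np.array([12_000, 25_000, 73_000])))  # [1 2 3]

A classifier trained on these binned labels then supplies class probabilities as extra features for the stacking regressor.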

“I learned that you never, ever, EVER go anywhere without your out-of-fold predictions. If I go to Hawaii or to the bathroom I am bringing them with. Never know when I need to train a 2nd or 3rd level meta-classifier” T. Sharf

Stacking unsupervised learned features

There is no reason we are restricted to using supervised learning techniques with stacking. You can also stack with unsupervised learning techniques.

K-Means clustering is a popular technique that makes sense here. Sofia-ML implements a fast online k-means algorithm suitable for this.

Another more recent interesting addition is to use t-SNE: Reduce the dataset to 2 or 3 dimensions and stack this with a non-linear stacker. Using a holdout set for stacking/blending feels like the safest choice here. See here for a solution by Mike Kim, using t-SNE vectors and boosting them with XGBoost: ‘0.41599 via t-SNE meta-bagging‘.
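A sketch of both ideas with scikit-learn’s KMeans and TSNE (parameter choices are mine; the article itself points at Sofia-ML for the k-means part):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

def add_unsupervised_features(X_train, X_test):
    km = KMeans(n_clusters=8, random_state=0).fit(X_train)
    train_km = km.transform(X_train)                  # distances to cluster centers
    test_km = km.transform(X_test)

    # t-SNE has no transform(): embed train and test together, then split.
    emb = TSNE(n_components=2, random_state=0).fit_transform(np.vstack([X_train, X_test]))
    train_tsne, test_tsne = emb[:len(X_train)], emb[len(X_train):]

    return (np.hstack([X_train, train_km, train_tsne]),
            np.hstack([X_test, test_km, test_tsne]))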

t-SNE

Piotr shows a nice visualization with t-SNE on the Otto Product Classification Challenge data set.

Online Stacking

I spent quite a lot of time working out an idea I had for online stacking: first create small fully random trees from the hashed binary representation. Add profit when a tree makes a correct prediction and subtract profit when it does not. Now take the most profitable and least profitable trees and add them to the feature representation.

It worked, but only on artificial data. For instance, a linear perceptron with online random tree stacking was able to learn a non-linear XOR-problem. It did not work on any real-life data I tried it on, and believe me, I tried. So from now on I’ll be suspicious of papers which only feature artificial data sets to showcase their new algorithm.

A similar idea did work for the author of the paper: random bit regression. Here many random linear functions are created from the features, and the best are found through heavy regularization. This I was able to replicate with success on some datasets. This will be the topic of a future post.

A more concrete example of (semi-) online stacking is with ad click prediction. Models trained on recent data perform better there. So when a dataset has a temporal effect, you could use Vowpal Wabbit to train on the entire dataset, and use a more complex and powerful tool like XGBoost to train on the last day of data. Then you stack the XGBoost predictions together with the samples and let Vowpal Wabbit do what it does best: optimizing loss functions.

The natural world is complex, so it figures that ensembling different models can capture more of this complexity. Ben Hamner ‘Machine learning best practices we’ve learned from hundreds of competitions’ (video)

Everything is a hyper-parameter

When doing stacking/blending/meta-modeling it is healthy to think of every action as a hyper-parameter for the stacker model.

So for instance:

  • Not scaling the data
  • Standard-Scaling the data
  • Minmax scaling the data

are simply extra parameters to be tuned to improve the ensemble performance. Likewise, the number of base models to use can be seen as a parameter to optimize. Feature selection (top 70%) or imputation (impute missing features with a 0) are other examples of meta-parameters.

Just as a random gridsearch is a good candidate for tuning algorithm parameters, it also works for tuning these meta-parameters.
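One way to express this is to put the preprocessing choice inside a pipeline and let a randomized search tune it together with the stacker parameters. A scikit-learn sketch (model choice and ranges are illustrative):

from scipy.stats import randint
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

pipe = Pipeline([("scaler", "passthrough"),                       # "no scaling" is a valid choice
                 ("stacker", GradientBoostingClassifier(random_state=0))])

param_distributions = {
    "scaler": ["passthrough", StandardScaler(), MinMaxScaler()],  # the meta-parameter from above
    "stacker__n_estimators": randint(50, 500),
}

search = RandomizedSearchCV(pipe, param_distributions, n_iter=20, cv=5, random_state=0)
# search.fit(meta_features, y)  # meta_features = first-stage predictions (hypothetical)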

Sometimes it is useful to allow XGBoost to see what a KNN-classifier sees. – Marios Michailidis

Model Selection

You can further optimize scores by combining multiple ensembled models.

  • There is the ad-hoc approach: Use averaging, voting or rank averaging on manually-selected well-performing ensembles.
  • Greedy forward model selection (Caruana et al.). Start with a base ensemble of 3 or so good models. Add a model when it increases the train set score the most. By allowing put-back of models, a single model may be picked multiple times (weighing).
  • Genetic model selection uses genetic algorithms and CV-scores as the fitness function. See for instance inversion‘s solution ‘Strategy for top 25 position‘.
  • I use a fully random method inspired by Caruana’s method: create 100 or so ensembles from randomly selected models (without put-back), then pick the highest-scoring one.
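
A rough sketch of the Caruana-style greedy forward selection mentioned above, with put-back (AUC as the score; names are mine):

import numpy as np
from sklearn.metrics import roc_auc_score

def greedy_forward_selection(preds, y_true, n_init=3, n_steps=20):
    # preds: dict of {model_name: predictions on a scoring set}; returns chosen names (repeats allowed).
    ranked = sorted(preds, key=lambda m: roc_auc_score(y_true, preds[m]), reverse=True)
    ensemble = list(ranked[:n_init])                  # start from the best few single models

    for _ in range(n_steps):
        current = np.mean([preds[m] for m in ensemble], axis=0)
        current_score = roc_auc_score(y_true, current)
        best_gain, best_model = 0.0, None
        for m in preds:                               # put-back: a model may be added again
            candidate = (current * len(ensemble) + preds[m]) / (len(ensemble) + 1)
            gain = roc_auc_score(y_true, candidate) - current_score
            if gain > best_gain:
                best_gain, best_model = gain, m
        if best_model is None:                        # no model improves the score: stop
            break
        ensemble.append(best_model)
    return ensemble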

Automation

When stacking for the Otto product classification competition I quickly got a good top 10 spot. Adding more and more base models and bagging multiple stacked ensembles I was able to keep improving my score.

Once I had reached 7 base models stacked by 6 stackers, a sense of panic and gloom started to set in. Would I be able to replicate all of this? These complex and slow unwieldy models were out of my comfort zone of fast and simple Machine Learning.

I spent the rest of the competition building a way to automate stacking. For base models, purely random algorithms with purely random parameters are trained. Wrappers were written to make classifiers like VW, Sofia-ML, RGF, MLP and XGBoost play nicely with the Scikit-learn API.

Whiteboard automated stacking
The first whiteboard sketch for a parallelized automated stacker with 3 buckets

For stackers I let the script use SVM, random forests, extremely randomized trees, GBM and XGBoost with random parameters and a random subset of base models.

Finally the created stackers are averaged when their fold-predictions on the train set produced a lower loss.

This automated stacker was able to reach 57th place a week before the competition ended, and it contributed to my final ensemble. The only difference was that I never spent time tuning or selecting: I started the script, went to bed, and awoke to a good solution.

Otto Leaderboard

The automated stacker is able to get a top 10% score without any tuning or manual model selection on a competitive task with over 3000 competitors.

Automatic stacking is one of my new big interests. Expect a few follow-up articles on this. The best result of automatic stacking was found on the TUT Headpose Estimation challenge. This black-box solution beats the current state-of-the-art set by domain experts who created special-purpose algorithms for this particular problem.

Tut headpose leaderboard

Noteworthy: This was a multi-label classification problem. Predictions for both “yaw” and “pitch” were required. Since the “yaw” and “pitch”-labels of a head pose are interrelated, stacking a model with predictions for “yaw” increased the accuracy for “pitch” predictions and vice versa. An interesting result.

Models visualized as a network can be trained using back-propagation: the stacker models then learn which base models reduce the error the most.

Ensemble Network

Next to CV-scores one could take the standard deviation of the CV-scores into account (a smaller deviation is a safer choice). One could look at optimizing complexity/memory usage and running times. Finally one can look at adding correlation into the mix — make the script prefer uncorrelated model predictions when creating ensembles.

The entire automated stacking pipeline can be parallelized and distributed. This also brings speed improvements and faster good results on a single laptop.

Contextual bandit optimization seems like a good alternative to fully random gridsearch: We want our algorithm to start exploiting good parameters and models and remember that the random SVM it picked last time ran out of memory. These additions to stacking will be explored in greater detail soon.

In the meantime you can get a sneak preview on the MLWave Github repo: “Hodor-autoML“.

The #1 and #2 winners of the Otto product classification challenge used ensembles of over a 1000 different models. Read more about the first place and the second place.

Why create these Frankenstein ensembles?

You may wonder why this exercise in futility: stacking and combining 1000s of models and computational hours is insanity right? Well… yes. But these monster ensembles still have their uses:

  • You can win Kaggle competitions.
  • You can beat most state-of-the-art academic benchmarks with a single approach.
  • You can then compare your new-and-improved benchmark with the performance of a simpler, more production-friendly model.
  • One day, today’s computers and clouds will seem weak. You’ll be ready.
  • It is possible to transfer knowledge from the ensemble back to a simpler shallow model (Hinton’s Dark Knowledge, Caruana’s Model Compression).
  • Not all base models necessarily need to finish in time. In that regard, ensembling introduces a form of graceful degradation: loss of one model is not fatal for creating good predictions.
  • Automated large ensembles ward against overfit and add a form of regularization, without requiring much tuning or selection. In principle stacking could be used by lay-people.
  • It is currently one of the best methods to improve machine learning algorithms, perhaps telling us something about efficient human ensemble learning.
  • A 1% increase in accuracy may push an investment fund from making a loss, into making a little less loss. More seriously: Improving healthcare screening methods helps save lives.

Update: Thanks a lot to Dat Le for documenting and refactoring the code accompanying this article. Thanks to Armando Segnini for adding weighted averaging. Thanks a lot everyone for the encouraging comments. My apologies if I have forgotten to link to your previous inspirational work. Further reading at “More is always better – The power of Simple Ensembles” by Carter Sibley, “Tradeshift Benchmark Tutorial with two-stage SKLearn models” by Dmitry Dryomov, “Stacking, Blending and Stacked Generalization” by Eric Chio, “Ensemble Learning: The wisdom of the crowds (of machines)” by Lior Rokach, and “Deep Support Vector Machines” by Marco Wiering.

Terminology: When I say ensembling I mean ‘model averaging’: combining multiple models. Algorithms like Random Forests use ensembling techniques like bagging internally. For this article we are not interested in that.

The intro image came from WikiMedia Commons and is in the public domain, courtesy of Jesse Merz.


call get_obs_data(nobs_actual, obs_wind_ptr, obs_locations_ptr, obs_errors_ptr, true_data_ptr%u, obs_weight_ptr) if (myid_monitor == 0) write(*,'(A)') 'Step 3: get_obs_data completed.' end subroutine read_and_store_shared_data ! 全局归约分析增量 - 仅归约到主进程 subroutine global_reduce_analysis_increment(analysis_increment, fillvalue) use mpi type(model_data), intent(inout) :: analysis_increment integer :: ierr integer :: array_size type(model_data) :: global_result ! 全局归约结果 real :: fillvalue ! 必须初始化 global_result 为 fillvalue global_result%pi = fillvalue global_result%u = fillvalue global_result%v = fillvalue global_result%th = fillvalue global_result%q = fillvalue ! 计算数组大小:(ide-ids+1)*(kme-kms+1)*(jde-jds+1) array_size = (ide-ids+1)*(kme-kms+1)*(jde-jds+1) ! 调试:输出归属前的统计信息 ! if (myid == 0) then ! write(*,'(A,2ES12.5)') 'Process 0 before MPI_Reduce π range: ', & ! minval(analysis_increment%pi), maxval(analysis_increment%pi) ! end if ! 仅归约到主进程(myid==0),不进行广播 ! 注意:这里使用MPI_MIN来选择有效的分析增量(避免累加) ! 策略:只有处理过的网格点才有真实值,其他保持fillvalue=999999 call MPI_Reduce(analysis_increment%pi, global_result%pi, & array_size, MPI_REAL, MPI_MIN, 0, MPI_COMM_WORLD, ierr) call MPI_Reduce(analysis_increment%u, global_result%u, & array_size, MPI_REAL, MPI_MIN, 0, MPI_COMM_WORLD, ierr) call MPI_Reduce(analysis_increment%v, global_result%v, & array_size, MPI_REAL, MPI_MIN, 0, MPI_COMM_WORLD, ierr) call MPI_Reduce(analysis_increment%th, global_result%th, & array_size, MPI_REAL, MPI_MIN, 0, MPI_COMM_WORLD, ierr) call MPI_Reduce(analysis_increment%q, global_result%q, & array_size, MPI_REAL, MPI_MIN, 0, MPI_COMM_WORLD, ierr) ! 只有主进程需要更新结果 if (myid == 0) then where (global_result%pi == fillvalue) global_result%pi = 0.0 end where where (global_result%u == fillvalue) global_result%u = 0.0 end where where (global_result%v == fillvalue) global_result%v = 0.0 end where where (global_result%th == fillvalue) global_result%th = 0.0 end where where (global_result%q == fillvalue) global_result%q = 0.0 end where analysis_increment%pi = global_result%pi analysis_increment%u = global_result%u analysis_increment%v = global_result%v analysis_increment%th = global_result%th analysis_increment%q = global_result%q end if end subroutine global_reduce_analysis_increment end program da_system ( 194, 292) 7.578E-03 3.629E+01 1.146E+01 4.143E+01 6.526E-04 307 19 -1.212E-02 -1.150E+01 -1.813E+01 -3.372E+01 -1.355E-03 ( 350, 280) 1.147E-02 3.519E+01 1.470E+01 1.051E+01 3.516E-03 614 19 -1.377E-01 -1.603E+01 -1.698E+01 -1.568E+01 -5.877E-04 ( 195, 292) 8.463E-03 3.842E+01 8.463E+00 1.535E+01 1.253E-04 ===================================================================== All stages completed successfully! ===================================================================== Master: Waiting for all processes to reach final barrier... 
Master: All processes reached final barrier successfully

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 127799 RUNNING AT e2111r4n1
=   EXIT CODE: 9
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 55572 RUNNING AT e2203r3n4
=   EXIT CODE: 9
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
   Intel(R) MPI Library troubleshooting guide:
      https://software.intel.com/node/561764
===================================================================================
[hejx@login03 work]$ Master: terminating worker 945, active workers: 1
Master: terminating worker 946, active workers: 0
Master: All tasks completed. Sent: 7854, Done: 7854
Worker 2 received termination signal
Worker 2 completed all assigned work
Worker 26 received termination signal
Worker 26 completed all assigned work
srun: error: slurm_msg_sendto: address:port=10.1.142.13:49310 msg_type=8001: Unexpected missing socket error
srun: error: slurm_msg_sendto: address:port=10.1.147.12:49946 msg_type=8001: Unexpected missing socket error
srun: error: slurm_msg_sendto: address:port=10.1.147.6:47950 msg_type=8001: Unexpected missing socket error
srun: error: slurm_msg_sendto: address:port=10.1.142.12:48310 msg_type=8001: Unexpected missing socket error
srun: error: slurm_msg_sendto: address:port=10.1.142.21:52088 msg_type=8001: Unexpected missing socket error
srun: error: Task 4 reported exit for a second time.
srun: error: Task 20 reported exit for a second time.
srun: error: Task 14 reported exit for a second time.
srun: error: Task 3 reported exit for a second time.
srun: error: Task 12 reported exit for a second time.
[hejx@login03 work]$
The Fortran program you provided is a typical data assimilation system built on a hybrid **MPI + shared-memory** programming model. From your description, the run aborts just before it finishes (`BAD TERMINATION`) with exit code `9`, which usually means the following:

---

## ✅ Why the run aborted (Exit Code 9)

### ❗1. **A process was forcibly killed from outside (killed by the system)**
- Exit code `9` here corresponds to the Linux signal `SIGKILL`.
- It means a process was forcibly terminated by the operating system or the resource manager (e.g. Slurm).
- Common causes include:
  - Out of memory (OOM killer)
  - Wall-time limit exceeded
  - Node crash or network interruption
  - An MPI communication deadlock that eventually times out

---

## ✅ Key clues in the log

### 📌 Relevant log excerpt:
```text
Master: terminating worker 945, active workers: 1
Master: terminating worker 946, active workers: 0
Master: All tasks completed. Sent: 7854, Done: 7854
Worker 2 received termination signal
Worker 2 completed all assigned work
...
= PID 127799 RUNNING AT e2111r4n1
= EXIT CODE: 9
```

This tells us:
- the master process has finished dispatching all tasks;
- the worker processes have also finished their assigned work;
- **yet some processes were still terminated from outside**;
- most likely a resource limit on the node triggered a forced kill.

---

## ✅ Likely root causes and fixes

---

### ❗1. **Memory leak or shared memory not released correctly**
The program allocates many shared-memory windows (`MPI_Win_allocate_shared`) and maps them through `c_f_pointer`. If any process fails to release these resources, memory can leak.

#### ✅ Check:
- Do all processes call `MPI_Win_free`?
- Is any shared-memory window allocated repeatedly without being freed?

#### ✅ Suggested fix:
Make sure every `MPI_Win_allocate_shared` has a matching `MPI_Win_free`, and that every process executes it:

```fortran
call MPI_Win_free(win_obs, ierr)
call MPI_Win_free(win_ensemble_pi, ierr)
call MPI_Win_free(win_ensemble_u, ierr)
call MPI_Win_free(win_ensemble_v, ierr)
call MPI_Win_free(win_ensemble_th, ierr)
call MPI_Win_free(win_ensemble_q, ierr)
call MPI_Win_free(win_back, ierr)
call MPI_Win_free(win_true, ierr)
```

---

### ❗2. **MPI deadlock or unfinished non-blocking communication**
The master process uses a fairly involved non-blocking communication scheme (`MPI_Isend`, `MPI_Irecv`, `MPI_Test`). If the completion status of those operations is not handled properly, the result can be a deadlock or leaked requests.

#### ✅ Check:
- Has every `MPI_Isend` actually completed?
- Are there unmatched `MPI_Recv` calls?
- Does the program wait for all communication to finish before exiting?

#### ✅ Suggested fix:
Add, near the end of the program:

```fortran
call MPI_Barrier(MPI_COMM_WORLD, ierr)
```

and make sure every non-blocking operation is completed with `MPI_Wait` or `MPI_Test` (see the sketch after this answer).

---

### ❗3. **Slurm resource limits triggering the OOM killer**
With a very large number of processes in use (more than 900, judging by the worker ranks in the log), some nodes may simply run out of memory and trigger the kernel's OOM killer.

#### ✅ Check:
- Is the memory use per process reasonable?
- Are large arrays copied unnecessarily, or redundant data loaded?
- Are several jobs sharing a node and overloading its memory?

#### ✅ Suggested fixes:
- Reduce the number of concurrent processes;
- Optimize the data structures to reduce the memory footprint;
- Set a sensible memory limit in the Slurm submission script, for example:

```bash
#SBATCH --mem=64G
#SBATCH --mem-per-cpu=2G
```

---

### ❗4. **Missing synchronization before `MPI_Finalize` causing an abnormal exit**
Even though the code calls `MPI_Barrier(MPI_COMM_WORLD)` and `MPI_Finalize()`, a process that exits early or hits an error can still prevent the other processes from shutting down cleanly.

#### ✅ Check:
- Does every process reach `MPI_Finalize`?
- Is all communication verified to be complete right before `MPI_Finalize`?

#### ✅ Suggested fix:
Add around `MPI_Finalize`:

```fortran
if (myid == 0) write(*,*) "Before MPI_Finalize"
call MPI_Barrier(MPI_COMM_WORLD, ierr)
call MPI_Finalize(ierr)
if (myid == 0) write(*,*) "After MPI_Finalize"
```

---

## ✅ Summary: recommended debugging order

| Step | Action | Purpose |
|------|--------|---------|
| 1 | Check the memory limits in the Slurm submission script | Keep the OOM killer from killing processes |
| 2 | Verify that every `MPI_Win_free` is called | Avoid leaking shared memory |
| 3 | Verify that all non-blocking communication completes | Avoid deadlocks |
| 4 | Add debug output confirming every process reaches finalize | Confirm no process exits early |
| 5 | Re-run the test with fewer processes | Narrow down the problem |

---

## ✅ Example of an improved shutdown sequence (extra barrier + finalize safety check)

```fortran
! ... last part of the original code ...
call MPI_Barrier(MPI_COMM_WORLD, ierr)

! Free shared-memory windows - every process must call these
call MPI_Win_free(win_obs, ierr)
call MPI_Win_free(win_ensemble_pi, ierr)
call MPI_Win_free(win_ensemble_u, ierr)
call MPI_Win_free(win_ensemble_v, ierr)
call MPI_Win_free(win_ensemble_th, ierr)
call MPI_Win_free(win_ensemble_q, ierr)
call MPI_Win_free(win_back, ierr)
call MPI_Win_free(win_true, ierr)

! Free communicators
if (inter_comm /= MPI_COMM_NULL) then
   call MPI_Comm_free(inter_comm, ierr)
end if
call MPI_Comm_free(node_comm, ierr)

call MPI_Barrier(MPI_COMM_WORLD, ierr)
if (myid == 0) then
   write(0,'(A)') 'Master: MPI resources cleaned up successfully'
   write(0,'(A)') 'Master: Program completed successfully'
end if

call MPI_Finalize(ierr)
if (myid == 0) then
   write(0,'(A)') 'Master: MPI finalized successfully'
end if
```
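To make point 2 concrete, here is a minimal sketch of draining outstanding non-blocking requests before the windows are freed and MPI is finalized. It is only an illustration, not the program's actual master loop: the subroutine name `drain_pending_requests` and the arguments `pending_reqs` / `n_pending` are hypothetical names standing in for whatever request handles the real master keeps around.

```fortran
! Minimal sketch (assumed names): complete all outstanding non-blocking requests
! before freeing shared-memory windows and calling MPI_Finalize.
subroutine drain_pending_requests(pending_reqs, n_pending, ierr)
   use mpi
   implicit none
   integer, intent(in)    :: n_pending                      ! number of requests still in flight (hypothetical)
   integer, intent(inout) :: pending_reqs(n_pending)        ! handles returned by MPI_Isend/MPI_Irecv (hypothetical)
   integer, intent(out)   :: ierr
   integer :: statuses(MPI_STATUS_SIZE, n_pending)

   if (n_pending > 0) then
      ! Blocks until every listed request has completed; afterwards the handles
      ! are set to MPI_REQUEST_NULL and their buffers may safely be reused or freed.
      call MPI_Waitall(n_pending, pending_reqs, statuses, ierr)
   end if
end subroutine drain_pending_requests
```

Called right before the final `MPI_Barrier`, a routine like this guarantees that no request is still referencing a shared-memory buffer when `MPI_Win_free` runs.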
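It can also help to confirm the OOM hypothesis in point 3 by estimating the per-node shared-memory footprint from the grid dimensions the program already uses. The sketch below is only a back-of-the-envelope calculation under the assumption that the five ensemble variables dominate memory use; the numeric values of `nx`, `nz`, `ny` and `nsample` are placeholders and must be replaced with the dimensions actually passed to `setup_shared_memory_data`.

```fortran
! Minimal sketch: estimate the shared-memory footprint of the five ensemble
! variables on one node, to compare against the node's RAM and the Slurm limits.
program estimate_ensemble_memory
   use iso_fortran_env, only: int64
   implicit none
   ! Placeholder dimensions: substitute the program's actual nx, nz, ny, nsample
   integer, parameter :: nx = 500, nz = 65, ny = 330, nsample = 60
   integer, parameter :: bytes_per_real = 4          ! matches disp_unit = 4 in the shared-memory setup
   integer(int64) :: words_per_var, total_bytes

   words_per_var = int(nx, int64) * nz * ny * nsample
   total_bytes   = 5_int64 * words_per_var * bytes_per_real   ! five ensemble variables: pi, u, v, th, q

   write(*,'(A,F8.2,A)') 'Ensemble shared memory per node: ', &
        real(total_bytes) / (1024.0**3), ' GiB'
end program estimate_ensemble_memory
```

If the estimate is close to (or above) the memory available per node, that alone explains the `SIGKILL`, and either fewer processes per node or a larger `--mem` request is needed.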