Analysis of Dropout

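This article takes an in-depth look at Dropout, a technique for preventing overfitting, covering how it works, how it is implemented, and how it is combined with other regularization techniques.


Original article: https://pgaleone.eu/deep-learning/regularization/2017/01/10/anaysis-of-dropout/

This is a good analysis of dropout, so I am saving it here. A Chinese translation is available at http://www.wtoutiao.com/p/649MGEJ.html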


Overfitting is a problem in Deep Neural Networks (DNN): the model learns to classify only the training set, adapting itself to the training examples instead of learning decision boundaries capable of classifying generic instances. Many solutions to the overfitting problem have been proposed over the years; one of them has overshadowed the others thanks to its simplicity and its good empirical results: Dropout.

Dropout

[Figure: Dropout image]

Visual representation of Dropout, taken directly from the paper. On the left is the network before applying Dropout; on the right is the same network with Dropout applied.
The network on the left is the same network used at test time, once the parameters have been learned.

The idea behind Dropout is to train an ensemble of DNNs and average the results of the whole ensemble instead of training a single DNN.

The DNNs are built by dropping out neurons with probability $p$, therefore keeping the others on with probability $q = 1 - p$. When a neuron is dropped out, its output is set to zero, no matter what the input or the associated learned parameter is.

The dropped neurons do not contribute to the training phase in either the forward or the backward pass of back-propagation: for this reason, every time a single neuron is dropped out, it’s as if the training phase were done on a new network.

Quoting the authors:

In a standard neural network, the derivative received by each parameter tells it how it should change so the final loss function is reduced, given what all other units are doing. Therefore, units may change in a way that they fix up the mistakes of the other units. This may lead to complex co-adaptations. This in turn leads to overfitting because these co-adaptations do not generalize to unseen data. We hypothesize that for each hidden unit, Dropout prevents co-adaptation by making the presence of other hidden units unreliable. Therefore, a hidden unit cannot rely on other specific units to correct its mistakes.

In short: Dropout works well in practice because it prevents the co-adaptation of neurons during the training phase.

Now that we have an intuitive idea of what Dropout does, let’s analyze it in depth.

How Dropout works

As said before, Dropout turns off neurons with probability $p$ and therefore leaves the others turned on with probability $q = 1 - p$.

Every single neuron has the same probability of being turned off. This means that:

Given

  • $h(x) = xW + b$, a linear projection of a $d_i$-dimensional input $x$ into a $d_h$-dimensional output space.
  • $a(h)$, an activation function

it’s possible to model the application of Dropout, in the training phase only, to the given projection as a modified activation function:

$$f(h) = D \odot a(h)$$

Where $D = (X_1, \cdots, X_{d_h})$ is a $d_h$-dimensional vector of Bernoulli variables $X_i$.

A Bernoulli random variable has the following probability mass distribution:

$$f(k; p) = \begin{cases} p & \text{if } k = 1 \\ 1 - p & \text{if } k = 0 \end{cases}$$

Where $k$ are the possible outcomes.

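It’s evident that this random variable perfectly models the application of Dropout to a single neuron. Since $X_i$ multiplies the activation, $X_i = 1$ keeps the neuron on and $X_i = 0$ turns it off: the neuron is turned off with probability $p = P(X_i = 0)$ and kept on with probability $q = P(X_i = 1)$, so the success parameter of the mask variable is the keep probability $q$.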

It can be useful to see the application of Dropout on a generic i-th neuron:

$$O_i = X_i\, a\left(\sum_{k=1}^{d_i} w_k x_k + b\right) = \begin{cases} a\left(\sum_{k=1}^{d_i} w_k x_k + b\right) & \text{if } X_i = 1 \\ 0 & \text{if } X_i = 0 \end{cases}$$

where $P(X_i = 0) = p$.

Since a neuron is kept on with probability $q$ during the training phase, during the testing phase we have to emulate the behavior of the ensemble of networks used in training.

To do this, the authors suggest scaling the activation function by a factor of $q$ during the test phase, so that the expected output produced in the training phase is used as the single output required in the test phase. Thus:

Train phase: $O_i = X_i\, a\left(\sum_{k=1}^{d_i} w_k x_k + b\right)$

Test phase: $O_i = q\, a\left(\sum_{k=1}^{d_i} w_k x_k + b\right)$
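The two phases above can be sketched in NumPy. This is only an illustration under stated assumptions: the function name `direct_dropout`, the `train` flag and the `rng` argument are hypothetical and do not come from the original article. With every activation set to 1, the mean of the masked training output should be close to $q$, which is the value the test phase produces deterministically.

```python
import numpy as np

def direct_dropout(a, p, train, rng):
    """Direct Dropout on activations `a` with drop probability `p` (illustrative helper)."""
    q = 1.0 - p  # keep probability
    if train:
        # X_i = 1 keeps the neuron, X_i = 0 drops it, with P(X_i = 0) = p
        mask = rng.random(a.shape) >= p
        return mask * a
    # Test phase: no mask, scale by q to match the expected training output
    return q * a

rng = np.random.default_rng(0)
a = np.ones(100_000)  # pretend every activation equals 1
p = 0.3
train_out = direct_dropout(a, p, train=True, rng=rng)
test_out = direct_dropout(a, p, train=False, rng=rng)
print(train_out.mean(), test_out.mean())
```

The training mean is a random estimate of $q = 0.7$, while the test output is exactly $q$ times the activation.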

Inverted Dropout

A slightly different approach is to use Inverted Dropout. This approach consists in the scaling of the activations during the training phase, leaving the test phase untouched.

The scale factor is the inverse of the keep probability, $\frac{1}{1 - p} = \frac{1}{q}$, thus:

Train phase: $O_i = \frac{1}{q} X_i\, a\left(\sum_{k=1}^{d_i} w_k x_k + b\right)$

Test phase: $O_i = a\left(\sum_{k=1}^{d_i} w_k x_k + b\right)$

Inverted Dropout is how Dropout is implemented in practice in the various deep learning frameworks because it helps to define the model once and just change a parameter (the keep/drop probability) to run train and test on the same model.

Direct Dropout, instead, forces you to modify the network during the test phase: if you don’t multiply the output by $q$, the neuron will produce values that are higher than those expected by the subsequent neurons (which can therefore saturate or explode). That’s why Inverted Dropout is the more common implementation.
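As a rough sketch of the inverted variant, assuming the same NumPy setup (the name `inverted_dropout` and its arguments are again hypothetical, not a framework API): the surviving activations are divided by $q$ at training time, and the test phase returns the activations unchanged.

```python
import numpy as np

def inverted_dropout(a, p, train, rng):
    """Inverted Dropout on activations `a` with drop probability `p` (illustrative helper)."""
    q = 1.0 - p  # keep probability
    if not train:
        return a  # test phase is left untouched
    mask = rng.random(a.shape) >= p  # True (kept) with probability q
    return mask * a / q              # scale survivors by 1/q during training

rng = np.random.default_rng(0)
a = np.full(100_000, 2.0)  # pretend every activation equals 2
p = 0.5
train_out = inverted_dropout(a, p, train=True, rng=rng)
test_out = inverted_dropout(a, p, train=False, rng=rng)
print(train_out.mean(), test_out.mean())
```

Each training output is either 0 or $2 / q = 4$, so its average stays close to the unscaled test value of 2, and only the `train` flag changes between the two phases.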

Dropout of a set of neurons

A layer $h$ with $n = d_h$ neurons, in a single training step, can be seen as an ensemble of $n$ Bernoulli experiments, each with probability of success equal to $p$, where a success means that the neuron is dropped.

Thus, the number of dropped neurons in the output of layer $h$ is (recall that $X_i = 0$ marks a dropped neuron, so $1 - X_i$ counts it):

$$Y = \sum_{i=1}^{d_h} (1 - X_i)$$

Since every neuron is modeled as a Bernoulli random variable and all these random variables are independent and identically distributed, the total number of dropped neurons is a random variable too, and it follows a Binomial distribution:

$$Y \sim Bi(d_h, p)$$

Where the probability of getting exactly $k$ successes in $n$ trials is given by the probability mass distribution:

$$f(k; n, p) = \binom{n}{k} p^k (1 - p)^{n - k}$$

This formula can be easily explained:

  • $p^k (1 - p)^{n - k}$ is the probability of getting a single sequence of $k$ successes in $n$ trials, and therefore $n - k$ failures.
  • $\binom{n}{k}$ is the binomial coefficient, which counts the number of possible sequences of $k$ successes.

We can now use this distribution to analyze the probability of dropping a specified number of neurons.

When using Dropout, we define a fixed Dropout probability $p$ for a chosen layer and we expect that a proportional number of neurons are dropped from it.

For example, if the layer we apply Dropout to has $n = 1024$ neurons and $p = 0.5$, we expect that 512 get dropped. Let’s verify this statement:

$$Y = \sum_{i=1}^{1024} (1 - X_i) \sim Bi(1024, 0.5)$$

$$P(Y = 512) = \binom{1024}{512}\, 0.5^{512}\, (1 - 0.5)^{1024 - 512} \approx 0.025$$

Thus, the probability of dropping out exactly $np = 512$ neurons is only $0.025$!

A Python 3 script can help us to visualize how neurons are dropped for different values of $p$ and a fixed value of $n$. The code is commented.
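A minimal standard-library sketch of such a script, not the author's original: it prints a short summary instead of drawing a plot, and `binom_pmf` is a hypothetical helper name. The probability mass is computed in log space so that the large binomial coefficients for $n = 1024$ do not overflow.

```python
from math import exp, lgamma, log

def binom_pmf(k, n, p):
    """P(Y = k) for Y ~ Bi(n, p), computed in log space to avoid overflow."""
    log_coeff = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_coeff + k * log(p) + (n - k) * log(1 - p))

n = 1024  # fixed number of neurons in the layer
for p in (0.1, 0.25, 0.5, 0.75, 0.9):
    # Full distribution of the number of dropped neurons for this p
    pmf = [binom_pmf(k, n, p) for k in range(n + 1)]
    # Most likely number of dropped neurons
    mode = max(range(n + 1), key=pmf.__getitem__)
    print(f"p={p:4.2f}  n*p={n * p:7.1f}  most likely k={mode:4d}  P(Y=k)={pmf[mode]:.4f}")
```

Calling `binom_pmf(512, 1024, 0.5)` reproduces the value of about 0.025 computed above.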

[Figure: Binomial distribution. The binomial distribution is very peaked around $np$.]

As we can see from the image above, no matter what the value of $p$ is, the number of neurons dropped on average is $np$; in fact:

$$E[Bi(n, p)] = np$$
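The expectation can also be checked empirically. The sketch below is a simulation under assumed settings (2000 simulated training steps, a fixed seed, three arbitrary values of $p$), none of which come from the article: it draws an independent drop decision for every neuron and compares the average number of dropped neurons with $np$.

```python
import numpy as np

rng = np.random.default_rng(42)
n, trials = 1024, 2000  # layer size and number of simulated training steps

for p in (0.2, 0.5, 0.8):
    # Each row is one training step; True marks a dropped neuron (probability p)
    dropped = rng.random((trials, n)) < p
    counts = dropped.sum(axis=1)  # number of dropped neurons per step
    print(f"p={p}: mean dropped={counts.mean():.1f}, n*p={n * p:.1f}, std={counts.std():.1f}")
```

The per-step counts fluctuate, but their mean should stay within a fraction of a neuron of $np$ for each value of $p$.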

Moreover, we can notice that the distribution of values is almost symmetric around $p = 0.5$ and that the probability of dropping exactly $np$ neurons increases as the distance from $p = 0.5$ increases.

The scaling factor has been added by the authors to compensate the activation values, because during the training phase only a fraction $1 - p$ of the neurons is expected to be kept on. During the testing phase, instead, $100\%$ of the neurons are kept on, thus the value should be scaled down accordingly.

Dropout & other regularizers

Dropout is often used together with L2 regularization and other parameter constraint techniques (such as Max Norm [1]), and this is no coincidence. These regularizers help to keep the values of the model parameters low, so that a parameter can’t grow too much.

In brief, L2 regularization (for example) adds a term to the loss, where $\lambda \in [0, 1]$ is a hyper-parameter called regularization strength, $F(W; x)$ is the model and $\mathcal{E}$ is the error function between the real $y$ and the predicted $\hat{y}$ value.

$$\mathcal{L}(y, \hat{y}) = \mathcal{E}(y, F(W; x)) + \frac{\lambda}{2} W^2$$

It’s easy to understand that this additional term, when doing back-propagation via gradient descent, reduces the update amount. If $\eta$ is the learning rate, the update of the parameter $w \in W$ is

$$w \leftarrow w - \eta\left(\frac{\partial F(W; x)}{\partial w} + \lambda w\right)$$
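A small sketch of this update rule, with illustrative values only: `sgd_step_l2` is a hypothetical helper, and `grad` stands in for the derivative term of the formula above. Setting the gradient to zero isolates the effect of the regularization term, which shrinks every weight by the factor $1 - \eta\lambda$.

```python
import numpy as np

def sgd_step_l2(w, grad, eta, lam):
    """One update w <- w - eta * (grad + lam * w)."""
    return w - eta * (grad + lam * w)

w = np.array([2.0, -3.0])
grad = np.zeros_like(w)  # zero data gradient isolates the L2 term
eta, lam = 0.1, 0.5      # assumed learning rate and regularization strength
w_new = sgd_step_l2(w, grad, eta, lam)
print(w_new)
```

With these assumed values each weight is multiplied by $1 - 0.1 \cdot 0.5 = 0.95$, so large parameters are pulled back toward zero at every step.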

Dropout alone, instead, does not have any way to prevent parameter values from becoming too large during this update phase. Moreover, the inverted implementation makes the update steps bigger, as shown below.

Inverted Dropout and other regularizers

Since Dropout does not prevent parameters from growing and overwhelming each other, applying L2 regularization (or any other regularization technique that constrains the parameter values) can help.

Making the scaling factor explicit, the previous equation becomes:

$$w \leftarrow w - \eta\left(\frac{1}{q}\frac{\partial F(W; x)}{\partial w} + \lambda w\right)$$

It can be easily seen that, when using Inverted Dropout, the learning rate applied to the gradient term is scaled by a factor of $\frac{1}{q}$. Since $q$ takes values in $]0, 1]$, the ratio between $\eta$ and $q$ can vary between:

$$r(q) = \frac{\eta}{q} \in \left[\eta = \lim_{q \to 1} r(q),\ +\infty = \lim_{q \to 0} r(q)\right[$$

For this reason, from now on we’ll call $q$ the boosting factor, because dividing by it boosts the learning rate. Moreover, we’ll call $r(q)$ the effective learning rate.

The effective learning rate is thus higher than the chosen learning rate: for this reason, regularizers that constrain the parameter values can help to simplify the learning rate selection process.
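To get a feel for the size of this effect, the following sketch tabulates $r(q) = \eta / q$ for a few keep probabilities. The helper name `effective_learning_rate` and the chosen values of $\eta$ and $q$ are assumptions made for illustration.

```python
def effective_learning_rate(eta, q):
    """r(q) = eta / q for a keep probability q in ]0, 1]."""
    if not 0.0 < q <= 1.0:
        raise ValueError("q must lie in ]0, 1]")
    return eta / q

eta = 0.01  # assumed nominal learning rate
for q in (1.0, 0.8, 0.5, 0.1):
    print(f"q={q:.1f}  r(q)={effective_learning_rate(eta, q):.4f}")
```

With $q = 0.5$ the gradient term is already applied with twice the nominal learning rate, and the factor grows without bound as $q$ approaches 0.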

Summary

  1. Dropout exists in two versions: direct (not commonly implemented) and inverted
  2. Dropout on a single neuron can be modeled using a Bernoulli random variable
  3. Dropout on a set of neurons can be modeled using a Binomial random variable
  4. Even if the probability of dropping exactly $np$ neurons is low, $np$ neurons are dropped on average in a layer of $n$ neurons.
  5. Inverted Dropout boosts the learning rate
  6. Inverted Dropout should be used together with other regularization techniques that constrain the parameter values, in order to simplify the learning rate selection procedure
  7. Dropout helps to prevent overfitting in deep neural networks.

[1] Max Norm imposes a constraint on the size of the parameters: once a value for the hyper-parameter $c$ is chosen, it imposes the constraint $|w| \leq c$.

