Low-Level Queries for Diagnosing ORA-04031

This post records the investigation of an ORA-04031 error caused by a shared pool memory shortage in an Oracle database, and how it was resolved: querying the dynamic distribution of the memory components, analyzing the trace files, decoding the error with the oerr utility, and finally resizing the shared pool.

Errors in file D:\APP\ADMINISTRATOR\diag\rdbms\primary\tree\trace\tree_j001_8540.trc (incident=520695):

ORA-04031: unable to allocate 4136 bytes of shared memory ("shared pool","unknown object","sga heap(1,0)","modification ")
ORA-02063: preceding line from MINI

The instance uses Automatic Memory Management (AMM).
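
Before digging into the heaps, it helps to confirm how memory is actually configured. A minimal sketch checking the standard 11g parameters (nothing here is specific to this incident):

    -- A nonzero MEMORY_TARGET confirms AMM is managing SGA + PGA together.
    show parameter memory_target
    show parameter memory_max_target
    -- Under AMM, these act as minimum floors when explicitly set.
    show parameter sga_target
    show parameter shared_pool_size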

 

 

Check the dynamic distribution of the memory components:

    set head off
    select chr(10) from dual;
    set head on
    prompt "Dynamic memory component allocation:"
    col component for a25
    col LAST_OPER_TYPE for a10
    SELECT COMPONENT,
           CURRENT_SIZE / 1024 / 1024 / 1024 AS CURRENT_GB,
           MIN_SIZE / 1024 / 1024 / 1024 AS MIN_GB,
           MAX_SIZE / 1024 / 1024 / 1024 AS MAX_GB,
           USER_SPECIFIED_SIZE / 1024 / 1024 / 1024 AS SPECIFIED_GB,
           OPER_COUNT,
           LAST_OPER_TYPE,
           LAST_OPER_MODE,
           LAST_OPER_TIME,
           GRANULE_SIZE / 1024 / 1024 AS GRANULE_MB
      FROM V$MEMORY_DYNAMIC_COMPONENTS;

 

COMPONENT                 CURRENT_GB  MIN_GB  MAX_GB SPECIFIED_GB OPER_COUNT LAST_OPER_TYPE LAST_OPER_MODE LAST_OPER_TIME GRANULE_MB
------------------------- ---------- ------- ------- ------------ ---------- -------------- -------------- -------------- ----------
shared pool                    14.75      13   14.75        14.75          4 GROW           MANUAL         30-JUN-16             256
large pool                        .5      .5      .5            0          0 STATIC                                              256
java pool                         .5      .5      .5            0          0 STATIC                                              256
streams pool                      .5     .25      .5            0          1 SHRINK         DEFERRED       24-JUN-16             256
SGA Target                        28   27.25      30            0         28 GROW           DEFERRED       27-JUN-16             256
DEFAULT buffer cache           11.25   11.25   14.75            0          8 SHRINK         MANUAL         30-JUN-16             256
KEEP buffer cache                  0       0       0            0          0 STATIC                                              256
RECYCLE buffer cache               0       0       0            0          0 STATIC                                              256
DEFAULT 2K buffer cache            0       0       0            0          0 STATIC                                              256
DEFAULT 4K buffer cache            0       0       0            0          0 STATIC                                              256
DEFAULT 8K buffer cache            0       0       0            0          0 STATIC                                              256
DEFAULT 16K buffer cache           0       0       0            0          0 STATIC                                              256
DEFAULT 32K buffer cache           0       0       0            0          0 STATIC                                              256
Shared IO Pool                     0       0       0            0          0 STATIC                                              256
PGA Target                        50      48   50.75           30         31 SHRINK         MANUAL         30-JUN-16             256
ASM Buffer Cache                   0       0       0            0          0 STATIC                                              256

 

shared pool: 13 GB minimum (currently grown to its 14.75 GB maximum)
PGA target: 50 GB

 

Look up the error with the oerr utility:

[oracle@node1 ~]$ oerr ora 04031

04031, 00000, "unable to allocate %s bytes of shared memory (\"%s\",\"%s\",\"%s\",\"%s\")"
// *Cause:  More shared memory is needed than was allocated in the shared
//          pool.
// *Action: If the shared pool is out of memory, either use the
//          DBMS_SHARED_POOL package to pin large packages,
//          reduce your use of shared memory, or increase the amount of
//          available shared memory by increasing the value of the
//          initialization parameters SHARED_POOL_RESERVED_SIZE and
//          SHARED_POOL_SIZE.
//          If the large pool is out of memory, increase the initialization
//          parameter LARGE_POOL_SIZE.
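
The action text above suggests pinning large packages with DBMS_SHARED_POOL so they are not repeatedly aged out and reloaded. A sketch (the package may first need to be installed via ?/rdbms/admin/dbmspool.sql; the object name below is purely illustrative, pick your own heavy objects from V$DB_OBJECT_CACHE):

    -- Pin a large, frequently used package in the shared pool ('P' = package).
    -- SYS.DBMS_OUTPUT is only an example object.
    EXEC DBMS_SHARED_POOL.KEEP('SYS.DBMS_OUTPUT', 'P');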

 

SHARED_POOL_RESERVED_SIZE and SHARED_POOL_SIZE can be set manually.
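
Under AMM, an explicit SHARED_POOL_SIZE acts as a floor below which automatic tuning will not shrink the pool. A minimal sketch; the sizes are assumptions for illustration, not a recommendation for this system:

    -- Set a 15G floor for the shared pool (example value only).
    ALTER SYSTEM SET shared_pool_size = 15G SCOPE = BOTH;
    -- SHARED_POOL_RESERVED_SIZE is static, so it needs SPFILE scope and a restart.
    ALTER SYSTEM SET shared_pool_reserved_size = 768M SCOPE = SPFILE;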

 

 

The error arguments ("shared pool","unknown object","sga heap(1,0)","modification ") point at shared pool subheap 1. Inspect the chunk classes with the following SQL:

SYS@tree> select KSMCHCLS, count(*), sum(KSMCHSIZ)/1024/1024 from x$ksmsp group by KSMCHCLS
  2  /

 

KSMCHCLS   COUNT(*) SUM(KSMCHSIZ)/1024/1024
-------- ---------- -----------------------
recr        1869925              4066.75346
freeabl     2457879               6377.9045
R-freea         640              4.02889252
perm             35              1702.39667
R-free           79              738.409416
free         317037              2214.50211

 

6 rows selected.
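
Note that a full scan of X$KSMSP takes the shared pool latches and can itself stall a busy instance, so run it in a quiet window. A less intrusive first look at free memory is available from the documented V$SGASTAT view (a sketch):

    -- Free memory in the shared pool; safe to run on a live system.
    SELECT pool, name, ROUND(bytes / 1024 / 1024) AS mb
      FROM v$sgastat
     WHERE pool = 'shared pool'
       AND name = 'free memory';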

 

SYS@tree> select KSMCHIDX "SubPool", 'sga heap('||KSMCHIDX||',0)' sga_heap,
  2         decode(round(ksmchsiz/1000),0,'0-1K',1,'1-2K',2,'2-3K',3,'3-4K',
  3                4,'4-5K',5,'5-6k',6,'6-7k',7,'7-8k',8,'8-9k',9,'9-10k','> 10K') "size",
  4         count(*), ksmchcls Status, sum(ksmchsiz) Bytes
  5    from x$ksmsp
  6   where KSMCHCOM = 'free memory'
  7   group by ksmchidx, ksmchcls, 'sga heap('||KSMCHIDX||',0)', ksmchcom,
  8            decode(round(ksmchsiz/1000),0,'0-1K',1,'1-2K',2,'2-3K',3,'3-4K',
  9                   4,'4-5K',5,'5-6k',6,'6-7k',7,'7-8k',8,'8-9k',9,'9-10k','> 10K')
 10   order by 1, 2
 11  /

 

   SubPool SGA_HEAP        size    COUNT(*) STATUS      BYTES
---------- --------------- ----- ---------- ------ ----------
         1 sga heap(1,0)   2-3K           1 R-free       1920
         1 sga heap(1,0)   3-4K           1 R-free       2560
         1 sga heap(1,0)   4-5K           2 R-free       8320
         1 sga heap(1,0)   8-9k           1 R-free       8320
         1 sga heap(1,0)   > 10K          9 R-free  106727248
         1 sga heap(1,0)   0-1K       32126 free      4339536
         1 sga heap(1,0)   1-2K        7852 free      7086152
         1 sga heap(1,0)   2-3K        2032 free      3979752
         1 sga heap(1,0)   3-4K        2149 free      6440488
         1 sga heap(1,0)   4-5K        5491 free     21125784
         1 sga heap(1,0)   5-6k           8 free        42048
         1 sga heap(1,0)   6-7k           4 free        25344
         1 sga heap(1,0)   7-8k           5 free        34280
         1 sga heap(1,0)   8-9k          13 free       107136
         1 sga heap(1,0)   9-10k         55 free       504192
         1 sga heap(1,0)   > 10K       1002 free    347343216
         2 sga heap(2,0)   > 10K          7 R-free   93915472
         2 sga heap(2,0)   0-1K       26778 free      3903320
         2 sga heap(2,0)   1-2K        3049 free      2899776
         2 sga heap(2,0)   2-3K        1456 free      2965168
         2 sga heap(2,0)   3-4K        2043 free      6095696
         2 sga heap(2,0)   4-5K        5392 free     20821520
         2 sga heap(2,0)   5-6k         187 free       938728
         2 sga heap(2,0)   6-7k         219 free      1315544
         2 sga heap(2,0)   7-8k         162 free      1125608
         2 sga heap(2,0)   8-9k         257 free      2089288
         2 sga heap(2,0)   9-10k        131 free      1172800
         2 sga heap(2,0)   > 10K       2007 free    249414944
         3 sga heap(3,0)   > 10K          9 R-free  120195904
         3 sga heap(3,0)   0-1K       25981 free      3822576
         3 sga heap(3,0)   1-2K        2066 free      2273720
         3 sga heap(3,0)   2-3K        1520 free      3085896
         3 sga heap(3,0)   3-4K        2046 free      6133600
         3 sga heap(3,0)   4-5K        5519 free     21246048
         3 sga heap(3,0)   5-6k          12 free        59376
         3 sga heap(3,0)   6-7k           3 free        18760
         3 sga heap(3,0)   7-8k           1 free         6696
         3 sga heap(3,0)   8-9k           1 free         8256
         3 sga heap(3,0)   > 10K         22 free    225517176
         4 sga heap(4,0)   > 10K         10 R-free  120529960
         4 sga heap(4,0)   0-1K       19989 free      3509144
         4 sga heap(4,0)   1-2K        4328 free      3722568
         4 sga heap(4,0)   2-3K        1422 free      2859536
         4 sga heap(4,0)   3-4K        2049 free      6125832
         4 sga heap(4,0)   4-5K        5786 free     22357232
         4 sga heap(4,0)   5-6k           1 free         4536
         4 sga heap(4,0)   6-7k           1 free         6160
         4 sga heap(4,0)   7-8k           1 free         7168
         4 sga heap(4,0)   8-9k          17 free       139328
         4 sga heap(4,0)   > 10K         16 free    191409560
         5 sga heap(5,0)   > 10K          9 R-free  106635744
         5 sga heap(5,0)   0-1K       27716 free      4342656
         5 sga heap(5,0)   1-2K        3501 free      3261120
         5 sga heap(5,0)   2-3K        1309 free      2687240
         5 sga heap(5,0)   3-4K        2112 free      6350792
         5 sga heap(5,0)   4-5K        5987 free     23133896
         5 sga heap(5,0)   6-7k           1 free         6344
         5 sga heap(5,0)   8-9k           2 free        16256
         5 sga heap(5,0)   9-10k          1 free         8544
         5 sga heap(5,0)   > 10K         68 free    310485960
         6 sga heap(6,0)   0-1K           2 R-free        656
         6 sga heap(6,0)   1-2K           1 R-free        768
         6 sga heap(6,0)   3-4K           1 R-free       2680
         6 sga heap(6,0)   4-5K           3 R-free      11864
         6 sga heap(6,0)   > 10K         10 R-free  119459712
         6 sga heap(6,0)   0-1K       36813 free      5460272
         6 sga heap(6,0)   1-2K       13091 free     12010656
         6 sga heap(6,0)   2-3K        3477 free      6737480
         6 sga heap(6,0)   3-4K        2987 free      8926000
         6 sga heap(6,0)   4-5K       11191 free     43775688
         6 sga heap(6,0)   5-6k         586 free      2923912
         6 sga heap(6,0)   6-7k         475 free      2852320
         6 sga heap(6,0)   7-8k         328 free      2280096
         6 sga heap(6,0)   8-9k        1022 free      8315944
         6 sga heap(6,0)   9-10k        162 free      1455032
         6 sga heap(6,0)   > 10K        813 free    340146992
         7 sga heap(7,0)   1-2K           2 R-free       2240
         7 sga heap(7,0)   2-3K           1 R-free       2368
         7 sga heap(7,0)   5-6k           1 R-free       5168
         7 sga heap(7,0)   > 10K          9 R-free  106687488
         7 sga heap(7,0)   0-1K       30075 free      4337392
         7 sga heap(7,0)   1-2K        2704 free      2660512
         7 sga heap(7,0)   2-3K        1680 free      3379600
         7 sga heap(7,0)   3-4K        2103 free      6300992
         7 sga heap(7,0)   4-5K        5824 free     22423800
         7 sga heap(7,0)   5-6k           1 free         4888
         7 sga heap(7,0)   6-7k           7 free        44280
         7 sga heap(7,0)   8-9k           5 free        41368
         7 sga heap(7,0)   9-10k          7 free        64184
         7 sga heap(7,0)   > 10K        660 free    320295616

 

90 rows selected.

Check the ORA-04031 count per shared pool subpool (the KGHLUNFU column in X$KGHLU):

 

 select indx, kghlurcr, kghlutrn, kghlufsh, kghluops, kghlunfu, kghlunfs
  2    from sys.x$kghlu where inst_id = userenv('Instance');

 

      INDX   KGHLURCR   KGHLUTRN   KGHLUFSH   KGHLUOPS   KGHLUNFU   KGHLUNFS
---------- ---------- ---------- ---------- ---------- ---------- ----------
         0      49371      49932    5522200   30304101          0          0
         1      53367      68424    5303976   22297579          0          0
         2      50916      57568    5509344   18520250          0          0
         3      51572      69747    5460528   18050945          0          0
         4      52240      70093    5491960   15437329          0          0
         5      41644      47627    5578368   21132753          0          0
         6      43783      52091    5537920   24707939          0          0

 

7 rows selected.

 

SYS@tree> spool off

The trace (.trc) file:

 

Trace file D:\APP\ADMINISTRATOR\diag\rdbms\primary\tree\trace\tree_j003_9780.trc
Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
Windows NT Version V6.1 Service Pack 1
CPU                 : 32 - type 8664, 16 Physical Cores
Process Affinity    : 0x 0x0000000000000000
Memory (Avail/Total): Ph:92980M/131026M, Ph+PgF:102497M/141264M
Instance name: tree
Redo thread mounted by this instance: 1
Oracle process number: 38
Windows thread id: 9780, image: ORACLE.EXE (J003)

 

 

*** 2016-06-30 16:10:03.637
*** SESSION ID:(300.24353) 2016-06-30 16:10:03.637
*** CLIENT ID:() 2016-06-30 16:10:03.637
*** SERVICE NAME:(SYS$USERS) 2016-06-30 16:10:03.637
*** MODULE NAME:(DBMS_SCHEDULER) 2016-06-30 16:10:03.637
*** ACTION NAME:(JOB_MINI_SX_DSF) 2016-06-30 16:10:03.637

DDE: Problem Key 'ORA 4031' was flood controlled (0x2) (incident: 520573)
ORA-04031: unable to allocate 4136 bytes of shared memory ("shared pool","unknown object","sga heap(1,0)","modification ")
ORA-02063: preceding line from MINI
DDE: Problem Key 'ORA 4031' was flood controlled (0x2) (incident: 520574)

*** 2016-06-30 16:16:00.940
*** SESSION ID:(300.24455) 2016-06-30 16:16:00.940
*** CLIENT ID:() 2016-06-30 16:16:00.940
*** SERVICE NAME:(SYS$USERS) 2016-06-30 16:16:00.940
*** MODULE NAME:(DBMS_SCHEDULER) 2016-06-30 16:16:00.940
*** ACTION NAME:(JOB_MINI_SX_DSF) 2016-06-30 16:16:00.940

ORA-04031: unable to allocate 4136 bytes of shared memory ("shared pool","unknown object","sga heap(1,0)","modification ")
ORA-02063: preceding line from MINI

 

 

If the root cause still can't be found, dump the shared pool free lists with a heap dump:

alter session set events 'immediate trace name heapdump level 2';

 

 SELECT d.value || '/' || lower(rtrim(i.instance, chr(0))) || '_ora_' || p.spid || '.trc' trace_file_name
   from
    (select p.spid
       from v$mystat m,
            v$session s,
            v$process p
      where m.statistic# = 1
        and s.sid = m.sid
        and p.addr = s.paddr) p,
    (select t.instance
       from v$thread t,
            v$parameter v
      where v.name = 'thread'
        and (v.value = 0 or t.thread# = to_number(v.value))) i,
    (select value
       from v$parameter
      where name = 'user_dump_dest') d;

 

alter session set events 'trace name heapdump off';
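
The same dump can also be driven through ORADEBUG as SYSDBA, which reports the resulting trace file name directly (the standard ORADEBUG commands, shown as a sketch):

    SQL> oradebug setmypid
    SQL> oradebug dump heapdump 2
    SQL> oradebug tracefile_name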

 

 

 


