In information theory, entropy is a measure of uncertainty. The greater the uncertainty, the larger the entropy and the more information the outcome carries; the smaller the uncertainty, the smaller the entropy and the less information it carries.
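This can be seen directly from Shannon's formula H = −Σ p·log₂(p). A minimal sketch (the helper `shannon_entropy` is illustrative, not part of the implementation below):

```python
import math

def shannon_entropy(probs):
    """Shannon entropy H = sum(p * log2(1/p)), in bits."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

# A fair coin is maximally uncertain; a biased coin less so;
# a certain outcome carries no information at all.
print(shannon_entropy([0.5, 0.5]))  # 1.0 bit
print(shannon_entropy([0.9, 0.1]))  # ~0.469 bits
print(shannon_entropy([1.0]))       # 0.0
```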
Based on these properties, the entropy value can be used to judge how random and disordered an event is, and likewise to measure the dispersion of an indicator: the greater an indicator's dispersion, the larger its influence (weight) on the composite evaluation. For example, if every sample takes the same value on some indicator, that indicator contributes nothing to the overall evaluation and its weight is 0.
The entropy weight method is therefore an objective weighting method, since it depends only on the dispersion of the data itself. That said, weights determined by the entropy method are not always particularly reasonable.
Python implementation of the entropy weight method:
# -*- coding:utf-8 -*-
"""
@author: 1
@file: entropy_method3.py
@time: 2020/3/8 11:08
"""
import math

import numpy as np
import pandas as pd


def cal_weight(x):
    """Calculate entropy weights for each column of DataFrame x.

    @param x: DataFrame of raw indicator values (rows = samples, columns = indicators)
    @return: df, one weight per indicator
    """
    # Min-max standardization, column by column
    # (assumes no column is constant, otherwise max - min is 0)
    x = x.apply(lambda col: (col - np.min(col)) / (np.max(col) - np.min(col)))
    rows = x.index.size
    cols = x.columns.size
    # k = 1 / ln(n), natural log to match math.log(p) below,
    # so that each column's entropy falls in [0, 1]
    k = 1.0 / math.log(rows)

    # Information entropy terms -k * p_ij * ln(p_ij)
    x = np.array(x)
    lnf = np.zeros((rows, cols))
    col_sums = x.sum(axis=0)
    for i in range(rows):
        for j in range(cols):
            if x[i][j] == 0:
                lnfij = 0.0  # 0 * ln(0) is taken as 0
            else:
                p = x[i][j] / col_sums[j]
                lnfij = math.log(p) * p * (-k)
            lnf[i][j] = lnfij
    E = pd.DataFrame(lnf)

    # Redundancy (degree of diversification): d_j = 1 - e_j
    d = 1 - E.sum(axis=0)

    # Weight of each indicator: w_j = d_j / sum(d)
    w = pd.DataFrame(d / d.sum())
    return w
if __name__ == '__main__':
    data_end = pd.read_csv('py_challenge/data_end.csv', index_col=0)
    df = data_end[['pagerank_value', 'NA', 'FA']]
    w = cal_weight(df)
    w.index = df.columns
    w.columns = ['weight']
    print(w)
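As a cross-check, the same computation can be written in vectorized pandas/NumPy. The sketch below uses made-up toy data (the column names `a`, `b`, `c` are purely illustrative) and should agree with the loop-based version above:

```python
import numpy as np
import pandas as pd

def entropy_weights(df):
    """Vectorized entropy weight method: min-max normalize, compute
    per-column entropy e_j = -k * sum_i p_ij * ln(p_ij) with k = 1/ln(n),
    then weight each column by its redundancy d_j = 1 - e_j."""
    x = (df - df.min()) / (df.max() - df.min())  # assumes no constant column
    p = x / x.sum(axis=0)                        # column-wise proportions
    k = 1.0 / np.log(len(df))
    # replace p = 0 by 1 inside the log so that 0 * ln(0) contributes 0
    e = -k * (p * np.log(p.where(p > 0, 1.0))).sum(axis=0)
    d = 1 - e                                    # redundancy per indicator
    return d / d.sum()                           # weights sum to 1

# Toy example: three indicators with different dispersion (synthetic data)
toy = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [10, 10, 10, 10, 11],
    'c': [2, 4, 8, 16, 32],
})
print(entropy_weights(toy))
```

Note how column `b`, where only a single sample differs, ends up with a very low entropy and hence a large weight — an illustration of why weights from the entropy method are not always reasonable.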