Daily Java, 300 Lines: Striking While the Iron Is Hot (02), M-distance-Based Recommendation

Article link: https://blog.youkuaiyun.com/qq_44761856/article/details/124575166

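This is a variant of kNN that predicts a rating from the ratings the user has already given. It is evaluated with leave-one-out testing: every rating in the dataset is predicted once, with that rating held out. Because this takes a lot of computation, it calls for either an efficient algorithm or a fast machine.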

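Today's dataset is a plain txt file, so Weka cannot manage the data for us and we have to read it by hand with Java file I/O. The data form a two-dimensional array with three columns: the first is the user index, the second is the movie index, and the third is the user's rating of that movie.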
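As a minimal sketch of this input format, the snippet below parses one line into a user/item/rating triple. It assumes comma-separated integer values, as the constructor's `split(",")` and `Integer.parseInt` calls suggest. The class name `RatingLineDemo`, the method `parseLine`, and the sample line `0,5,4` are made up for illustration and are not part of the original code.

```java
class RatingLineDemo {
    /** Parse one "user,item,rating" line into an int triple. */
    static int[] parseLine(String paraLine) {
        String[] tempParts = paraLine.trim().split(",");
        if (tempParts.length != 3) {
            throw new IllegalArgumentException("Expected 3 comma-separated values: " + paraLine);
        } // Of if

        int[] resultTriple = new int[3];
        for (int i = 0; i < 3; i++) {
            resultTriple[i] = Integer.parseInt(tempParts[i].trim());
        } // Of for i
        return resultTriple;
    }// Of parseLine

    public static void main(String[] args) {
        // Hypothetical sample line: user 0 rated item 5 with 4 points.
        int[] tempTriple = parseLine("0,5,4");
        System.out.println("user=" + tempTriple[0] + ", item=" + tempTriple[1]
                + ", rating=" + tempTriple[2]);
    }// Of main
}// Of class RatingLineDemo
```

Validating the number of fields up front gives a clearer error than an `ArrayIndexOutOfBoundsException` on a malformed line.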

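First comes the constructor, whose main job is to read the file and store its contents in arrays. The txt file holds the rating matrix in compressed form, so besides the two-dimensional array that receives the data there are several auxiliary arrays that give the data its meaning, a bit like a "decompression" step. Compressing this matrix does not feel easy to me; I had only done simple compression before.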

    /**
     *************************
     * Construct the rating matrix.
     *
     * @param paraFilename
     *            the rating filename.
     * @param paraNumUsers
     *            number of users
     * @param paraNumItems
     *            number of items
     * @param paraNumRatings
     *            number of ratings
     *************************
     */
    public MBR(String paraFilename, int paraNumUsers, int paraNumItems, int paraNumRatings) throws Exception {
        numItems = paraNumItems;
        numUsers = paraNumUsers;
        numRatings = paraNumRatings;

        userDegrees = new int[numUsers];
        userStartingIndices = new int[numUsers + 1];
        userAverageRatings = new double[numUsers];
        itemDegrees = new int[numItems];
        compressedRatingMatrix = new int[numRatings][3];
        itemAverageRatings = new double[numItems];

        predictions = new double[numRatings];

        System.out.println("Reading " + paraFilename);
        
        File tempFile = new File(paraFilename);
        if (!tempFile.exists()) {
            System.out.println("File " + paraFilename + " does not exist.");
            System.exit(0);
        } // Of if
        BufferedReader tempBufReader = new BufferedReader(new FileReader(tempFile));
        String tempString;
        String[] tempStrArray;
        int tempIndex = 0;
        userStartingIndices[0] = 0;
        userStartingIndices[numUsers] = numRatings;
        while ((tempString = tempBufReader.readLine()) != null) {
            // Each line has three values
            tempStrArray = tempString.split(",");
            compressedRatingMatrix[tempIndex][0] = Integer.parseInt(tempStrArray[0]);
            compressedRatingMatrix[tempIndex][1] = Integer.parseInt(tempStrArray[1]);
            compressedRatingMatrix[tempIndex][2] = Integer.parseInt(tempStrArray[2]);

            userDegrees[compressedRatingMatrix[tempIndex][0]]++;
            itemDegrees[compressedRatingMatrix[tempIndex][1]]++;

            if (tempIndex > 0) {
                if (compressedRatingMatrix[tempIndex][0] != compressedRatingMatrix[tempIndex - 1][0]) {
                    userStartingIndices[compressedRatingMatrix[tempIndex][0]] = tempIndex;
                } // Of if
            } // Of if
            tempIndex++;
        } // Of while
        tempBufReader.close();

        double[] tempUserTotalScore = new double[numUsers];
        double[] tempItemTotalScore = new double[numItems];
        for (int i = 0; i < numRatings; i++) {
            tempUserTotalScore[compressedRatingMatrix[i][0]] += compressedRatingMatrix[i][2];
            tempItemTotalScore[compressedRatingMatrix[i][1]] += compressedRatingMatrix[i][2];
        } // Of for i

        for (int i = 0; i < numUsers; i++) {
            userAverageRatings[i] = tempUserTotalScore[i] / userDegrees[i];
        } // Of for i
        for (int i = 0; i < numItems; i++) {
            itemAverageRatings[i] = tempItemTotalScore[i] / itemDegrees[i];
        } // Of for i
    }// Of the first constructor
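To make the role of `userStartingIndices` concrete, here is a small sketch on made-up data. Like the constructor, it assumes the rows are sorted by user, so user `u` owns the rows from `start[u]` up to (but not including) `start[u + 1]`. Unlike the constructor, it builds the indices with a prefix sum over the user degrees; that is a swapped-in technique chosen because it also gives sensible indices for a user with no ratings. The class and method names are hypothetical.

```java
class CompressedMatrixDemo {
    /** Build starting indices so that user u owns rows [start[u], start[u + 1]). */
    static int[] buildUserStartingIndices(int[][] paraTriples, int paraNumUsers) {
        // Count how many ratings each user has.
        int[] tempDegrees = new int[paraNumUsers];
        for (int[] tempRow : paraTriples) {
            tempDegrees[tempRow[0]]++;
        } // Of for tempRow

        // Prefix sum: each user starts where the previous one ends.
        int[] resultIndices = new int[paraNumUsers + 1];
        for (int i = 0; i < paraNumUsers; i++) {
            resultIndices[i + 1] = resultIndices[i] + tempDegrees[i];
        } // Of for i
        return resultIndices;
    }// Of buildUserStartingIndices

    public static void main(String[] args) {
        // Made-up data: {user, item, rating}, sorted by user.
        int[][] tempTriples = { { 0, 0, 5 }, { 0, 2, 3 }, { 1, 1, 4 },
                { 2, 0, 2 }, { 2, 1, 1 }, { 2, 2, 5 } };
        int[] tempIndices = buildUserStartingIndices(tempTriples, 3);
        System.out.println(java.util.Arrays.toString(tempIndices)); // [0, 2, 3, 6]

        // "Decompress" the ratings of user 2 by walking its row range.
        for (int j = tempIndices[2]; j < tempIndices[3]; j++) {
            System.out.println("item " + tempTriples[j][1] + " -> " + tempTriples[j][2]);
        } // Of for j
    }// Of main
}// Of class CompressedMatrixDemo
```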

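What we store is the average rating of each movie. The next method sets the distance: here it is the allowed difference between average ratings, and any movie within that range is treated as a neighbor.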

    /**
     *************************
     * Set the radius (delta).
     *
     * @param paraRadius
     *            The given radius.
     *************************
     */
    public void setRadius(double paraRadius) {
        if (paraRadius > 0) {
            radius = paraRadius;
        } else {
            radius = 0.1;
        } // Of if
    }// Of setRadius

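Next is the core of the leave-one-out test. It iterates over every rating and recomputes the average rating of the corresponding movie with the current rating excluded. That updated average is compared against the stored averages of the other movies the same user has rated; any movie whose average differs by less than the radius counts as a neighbor, so the number of neighbors is not fixed. The prediction is the mean of the user's ratings on those neighbors, or DEFAULT_RATING if there are none, and it is written into the predictions array.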

    public void leaveOneOutPrediction() {
        double tempItemAverageRating;
        int tempUser, tempItem, tempRating;
        System.out.println("\r\nLeaveOneOutPrediction for radius " + radius);

        numNonNeighbors = 0;
        for (int i = 0; i < numRatings; i++) {
            tempUser = compressedRatingMatrix[i][0];
            tempItem = compressedRatingMatrix[i][1];
            tempRating = compressedRatingMatrix[i][2];
            tempItemAverageRating = (itemAverageRatings[tempItem] * itemDegrees[tempItem] - tempRating)
                    / (itemDegrees[tempItem] - 1);
            int tempNeighbors = 0;
            double tempTotal = 0;
            int tempComparedItem;
            for (int j = userStartingIndices[tempUser]; j < userStartingIndices[tempUser + 1]; j++) {
                tempComparedItem = compressedRatingMatrix[j][1];
                if (tempItem == tempComparedItem) {
                    continue;
                } // Of if

                if (Math.abs(tempItemAverageRating - itemAverageRatings[tempComparedItem]) < radius) {
                    tempTotal += compressedRatingMatrix[j][2];
                    tempNeighbors++;
                } // Of if
            } // Of for j
            if (tempNeighbors > 0) {
                predictions[i] = tempTotal / tempNeighbors;
            } else {
                predictions[i] = DEFAULT_RATING;
                numNonNeighbors++;
            } // Of if
        } // Of for i
    }// Of leaveOneOutPrediction
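The leave-one-out update of the item average is easiest to check on numbers. The sketch below mirrors the expression used in the loop, with one addition that is not in the original: an explicit guard for an item that has only one rating. In the original code that case should evaluate to NaN through a floating-point division by zero, which makes every neighbor comparison fail and falls back to DEFAULT_RATING. The class name, method name, and sample values are made up.

```java
class LeaveOneOutDemo {
    /**
     * Average of an item's ratings after removing one of them.
     * Returns NaN if the removed rating was the only one.
     */
    static double leaveOneOutAverage(double paraAverage, int paraDegree, int paraRating) {
        if (paraDegree <= 1) {
            return Double.NaN;
        } // Of if
        // Restore the total score, subtract the held-out rating, and re-average.
        return (paraAverage * paraDegree - paraRating) / (paraDegree - 1);
    }// Of leaveOneOutAverage

    public static void main(String[] args) {
        // Made-up item with ratings 5, 3, 4: average 4.0, degree 3.
        // Holding out the 5 leaves (12 - 5) / 2 = 3.5.
        System.out.println(leaveOneOutAverage(4.0, 3, 5));
    }// Of main
}// Of class LeaveOneOutDemo
```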

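In the end every rating has a prediction. We compute the distance (error) between the predictions and the actual ratings; in theory, the smaller this value, the better the chosen radius. Here are the run results: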
[Image: screenshot of the run results]
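Of the radii tried, 0.4 gives the best result (though I think the gap is small; the values only differ a few places after the decimal point).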
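The post does not show how the distance between predictions and actual ratings is computed. One common choice is the mean absolute error (MAE); the sketch below assumes that metric, and the class name, method name, and sample values are made up for illustration.

```java
class ErrorMetricDemo {
    /** Mean absolute error between the predictions and the actual ratings. */
    static double computeMae(double[] paraPredictions, int[] paraActual) {
        double tempTotalError = 0;
        for (int i = 0; i < paraPredictions.length; i++) {
            tempTotalError += Math.abs(paraPredictions[i] - paraActual[i]);
        } // Of for i
        return tempTotalError / paraPredictions.length;
    }// Of computeMae

    public static void main(String[] args) {
        // Made-up values: absolute errors are 0.5, 0.0 and 1.0.
        double[] tempPredictions = { 3.5, 4.0, 2.0 };
        int[] tempActual = { 4, 4, 1 };
        System.out.println(computeMae(tempPredictions, tempActual)); // 0.5
    }// Of main
}// Of class ErrorMetricDemo
```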
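Summary: compressing the dataset matrix is rather unpleasant, but it has to be done to save space. While setting the radius I started to wonder how to find the best value. Setting it by hand is crude, and even a sweep only picks the best of a few coarse candidates; if the best radius has many decimal places, the sweep will never hit it.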
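One way to get past a coarse sweep is to sweep twice: first with a large step, then with a smaller step around the best coarse candidate. The sketch below illustrates that idea only. It is not from the original code, it assumes the error has a single minimum over the searched range, and it uses a made-up stand-in error curve; in practice the error function would call setRadius, run leaveOneOutPrediction, and compute the error. All names are hypothetical.

```java
class RadiusSweepDemo {
    /** Return the radius in [paraLow, paraHigh], stepping by paraStep, with the smallest error. */
    static double sweep(java.util.function.DoubleUnaryOperator paraError, double paraLow,
            double paraHigh, double paraStep) {
        double resultBest = paraLow;
        double tempBestError = Double.MAX_VALUE;
        // Use an integer counter so rounding errors do not accumulate.
        int tempSteps = (int) Math.round((paraHigh - paraLow) / paraStep);
        for (int i = 0; i <= tempSteps; i++) {
            double tempRadius = paraLow + i * paraStep;
            double tempError = paraError.applyAsDouble(tempRadius);
            if (tempError < tempBestError) {
                tempBestError = tempError;
                resultBest = tempRadius;
            } // Of if
        } // Of for i
        return resultBest;
    }// Of sweep

    public static void main(String[] args) {
        // Stand-in error curve with a made-up minimum at 0.37.
        java.util.function.DoubleUnaryOperator tempError = r -> (r - 0.37) * (r - 0.37);

        // Coarse pass with step 0.1, then a finer pass around the coarse winner.
        double tempCoarse = sweep(tempError, 0.1, 1.0, 0.1);
        double tempFine = sweep(tempError, tempCoarse - 0.1, tempCoarse + 0.1, 0.01);
        System.out.println("coarse: " + tempCoarse + ", fine: " + tempFine);
    }// Of main
}// Of class RadiusSweepDemo
```

The refinement can be repeated with ever smaller steps, but given how small the differences between radii are in the results, a couple of decimal places is probably enough.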

Also, there seems to be some redundant code here:

    for (int i = 0; i < numUsers; i++) {
        userAverageRatings[i] = tempUserTotalScore[i] / userDegrees[i];
    } // Of for i

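It computes each user's average rating over the movies they have rated. This row-wise average should not be needed, since none of the code that follows uses it.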
