820. Short Encoding of Words

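This article examines an algorithm for computing the minimum total encoding length of all words in a string array. Each word is reversed and the reversed words are sorted, so that a word which is a suffix of another becomes a prefix of a neighbouring entry. A custom comparison function then drops duplicates and words contained in other words, which minimizes the encoding length.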
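import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class Solution {
    public int minimumLengthEncoding(String[] words) {
        if (words.length == 0) {
            return 0;
        }
        // Reverse every word so that "is a suffix of" becomes "is a prefix of".
        String[] reversed = new String[words.length];
        for (int i = 0; i < words.length; ++i) {
            reversed[i] = new StringBuilder(words[i]).reverse().toString();
        }
        // After sorting, a word sits directly before the words that extend it.
        Arrays.sort(reversed);

        // kept holds the reversed words that must appear in the encoding.
        List<String> kept = new ArrayList<>();
        kept.add(reversed[0]);
        for (int i = 1; i < reversed.length; ++i) {
            String last = kept.get(kept.size() - 1);
            int cmp = comparePrefix(last, reversed[i]);
            if (cmp == -2) {
                // No prefix relation: the new word needs its own slot.
                kept.add(reversed[i]);
            } else if (cmp == -1) {
                // last is a proper prefix of the new word: the longer word replaces it.
                kept.set(kept.size() - 1, reversed[i]);
            }
            // cmp == 0 is a duplicate; cmp == 1 cannot occur in sorted order.
        }

        // Each kept word contributes its length plus one '#' terminator.
        int res = kept.size();
        for (String word : kept) {
            res += word.length();
        }
        return res;
    }

    /**
     * Compares a and b over their common length.
     * Returns -2 if they differ there; otherwise 0 if the lengths are equal,
     * 1 if a is longer, and -1 if b is longer.
     */
    private int comparePrefix(String a, String b) {
        int len = Math.min(a.length(), b.length());
        for (int i = 0; i < len; ++i) {
            if (a.charAt(i) != b.charAt(i)) {
                return -2;
            }
        }
        return Integer.compare(a.length(), b.length());
    }
}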
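The reverse-and-sort idea can also be sketched more compactly. The version below is a minimal sketch, not the author's code: it assumes lowercase words as in the LeetCode problem, the class name `ShortEncodingSketch` is hypothetical, and it replaces the custom comparison function with `String.startsWith`. It relies on the fact that, in sorted order, a word that is a prefix of any other word is also a prefix of its immediate successor.

```java
import java.util.Arrays;

// Hypothetical demo class; the name is not from the original post.
class ShortEncodingSketch {
    // Returns the length of the shortest reference string, e.g. "time#bell#".
    static int minimumLengthEncoding(String[] words) {
        // Reverse each word so that suffix relations become prefix relations.
        String[] reversed = new String[words.length];
        for (int i = 0; i < words.length; i++) {
            reversed[i] = new StringBuilder(words[i]).reverse().toString();
        }
        Arrays.sort(reversed);

        int total = 0;
        for (int i = 0; i < reversed.length; i++) {
            // A word is covered if its successor starts with it
            // (this also handles exact duplicates).
            boolean covered = i + 1 < reversed.length
                    && reversed[i + 1].startsWith(reversed[i]);
            if (!covered) {
                total += reversed[i].length() + 1; // +1 for the trailing '#'
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // "me" is a suffix of "time", so only "time#bell#" is needed.
        System.out.println(minimumLengthEncoding(new String[] {"time", "me", "bell"})); // 10
        System.out.println(minimumLengthEncoding(new String[] {"t"}));                  // 2
    }
}
```

For `["time", "me", "bell"]` the reversed, sorted array is `["em", "emit", "lleb"]`: `em` is covered by `emit`, while `emit` and `lleb` each contribute their length plus one, giving 5 + 5 = 10.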