Thanks to your help on the vocabulary management system, your tutor Jueqing can now mark assignment essays automatically based on word usage.
In this writing assignment, students are required to submit a short essay of at least 300 words and up to 500 words on a given topic. Essays are marked on relevance to the topic and writing style. By looking at which words a student chooses, we can gain insight into the nature of their writing. Rare and diverse words often suggest richer expression, while overuse of filler words may suggest a weaker style.
You are required to implement an EssayScorer class such that:
Its constructor receives a TextProcessor object as input (see the scaffold).
It has a method score_essay(self, prob_statement, file_path) that receives a short problem statement as input, reads an essay from a .txt file, and returns a dictionary containing the 4 component scores, the penalty, and the total score (rounded to 2 decimal places), for example:
{
'length': 0.0,
'relevance': 26.67,
'rarity': 23.75,
'variety': 13.33,
'penalty': -10.0,
'total_score': 53.75
}
The essay file should be preprocessed using the same text processing rules as in Tasks 3B and 4 (except that stopwords are not removed):
converting all words to lowercase
processing punctuation and contractions
filtering out numbers and words composed entirely of digits
discarding words with a length less than 2
You may write additional helper functions or methods as needed.
Note that the essay must be processed prior to counting the words.
Scoring criteria
The essay is scored out of 100 marks, split into 4 components plus a possible penalty. No component score can go below 0.
1. Length check (max 10 marks)
Essays between 300 and 500 words (inclusive) get the full 10 marks.
If the essay is shorter than 300 words or longer than 500 words:
Apply a 10% deduction of the length mark (1 point) for every 20 words of under- or overshoot.
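As a rough illustration of the length rule, the sketch below assumes the deduction is applied proportionally (1 point per 20 words, so 10 words over costs 0.5 points), which is how the worked example in this document arrives at 13.8. The function name length_score is hypothetical and not part of the scaffold.

```python
def length_score(word_count: int) -> float:
    """Length component (max 10 marks), never below 0."""
    min_words, max_words, full_mark = 300, 500, 10.0
    if word_count < min_words:
        miss = min_words - word_count      # undershoot
    elif word_count > max_words:
        miss = word_count - max_words      # overshoot
    else:
        miss = 0                           # within the allowed range
    # Assumption: proportional deduction of 1 point per 20 words missed.
    return max(0.0, full_mark - miss / 20)
```

Under this reading, an essay of 280 words would receive 9 marks, and any essay 200 or more words outside the range would receive 0.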
2. Relevance (max 40 marks)
The set of all non-stopwords in the problem statement is referred to as the topic words. How often the topic words appear in the essay is a good indicator of the essay's relevance to the given topic.
If all topic words appear at least 3 times, award the full 40 marks.
If some appear fewer than 3 times, give partial credit:
We cap the appearance count of each topic word at 3 and compute the total frequency of all topic words.
The relevance score is computed as follows:
$$\text{relevance} = 40 \times \frac{\sum_{\text{topic words}} \min(3,\ \text{topic word appearance})}{\text{total topic words} \times 3}$$
If no topic words appear, award 0 marks.
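The relevance rule can be sketched as follows. This is a minimal illustration, assuming the topic words and the processed essay words are already available as plain Python collections. The function name relevance_score is hypothetical, and returning 0 for an empty topic set is an added assumption to avoid dividing by zero.

```python
from collections import Counter

def relevance_score(topic_words, essay_words) -> float:
    """Relevance component (max 40 marks)."""
    topic = set(topic_words)
    if not topic:
        return 0.0  # assumption: no topic words means no relevance marks
    counts = Counter(essay_words)
    # Cap each topic word's appearance count at 3 before summing.
    capped_total = sum(min(3, counts[word]) for word in topic)
    return 40 * capped_total / (3 * len(topic))
```

When every topic word appears at least 3 times the capped total equals 3 times the number of topic words, so the formula yields the full 40 marks without a separate branch. Likewise, if no topic word appears the sum is 0.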
3. Word rarity (max 30 marks)
Score each unique word in the essay (excluding stopwords) based on its frequency in the words_freq:
| Word frequency | Points |
|---|---|
| 0 | -1 (penalty for using an unknown word) |
| 1-3 | 5 (rare word) |
| 4-20 | 4 |
| 21-50 | 3 |
| 51-100 | 2 |
| > 100 | 1 |
Let U be the number of unique words (excluding stopwords). The rarity score is computed by normalizing the sum of rarity points over all unique words to the scale [0, 30]:
$$\text{rarity} = \min\left(30,\ 30 \times \frac{\text{sum of word rarity points}}{3 \times U}\right)$$
Students are encouraged to use academic words (rarity level 3) and are awarded a bonus for rare words (levels 4-5).
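A possible sketch of the rarity computation is shown below. It assumes words_freq behaves like a dictionary mapping each word to its frequency, with missing words treated as frequency 0. The actual structure comes from the earlier tasks and may differ. The helper names rarity_points and rarity_score are hypothetical.

```python
def rarity_points(freq: int) -> int:
    """Map a word's frequency in words_freq to its rarity points."""
    if freq <= 0:
        return -1   # unknown word
    if freq <= 3:
        return 5    # rare word
    if freq <= 20:
        return 4
    if freq <= 50:
        return 3
    if freq <= 100:
        return 2
    return 1

def rarity_score(unique_words, words_freq) -> float:
    """Rarity component (max 30 marks), never below 0."""
    unique = set(unique_words)
    if not unique:
        return 0.0  # assumption: an essay with no non-stopwords earns 0
    total = sum(rarity_points(words_freq.get(word, 0)) for word in unique)
    # Normalize so that an average of 3 points per word gives the full 30.
    return max(0.0, min(30.0, 30 * total / (3 * len(unique))))
```

The outer max reflects the rule that no component score can go below 0, which matters when most words are unknown.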
4. Variety score (max 20 marks)
This component encourages students to use many different words.
Let U be the number of unique words (excluding stopwords) and L be the total number of words (excluding stopwords). The variety score is computed as follows:
$$\text{variety} = 20 \times \frac{U}{L}$$
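The variety ratio is simple to compute once the stopwords have been filtered out. The sketch below assumes content_words is the list of processed essay words with stopwords already removed. The function name variety_score is hypothetical, and returning 0 for an empty list is an added assumption.

```python
def variety_score(content_words) -> float:
    """Variety component (max 20 marks): unique words over total words."""
    if not content_words:
        return 0.0  # assumption: avoid dividing by zero on an empty essay
    unique_count = len(set(content_words))   # U
    total_count = len(content_words)         # L
    return 20 * unique_count / total_count
```

Since U can never exceed L, the result always stays within [0, 20] and needs no extra clamping.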
5. Filler penalty (up to -10 marks)
If more than 50% of the essay words are stopwords, subtract 10 points from the total.
The total score is the sum of all 4 component scores and the penalty (if applicable), rounded to 2 decimal places. If the total score is negative, it is capped at zero instead.
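Putting the pieces together might look like the sketch below. It assumes the four component scores have already been computed, reads "more than 50%" as a strict comparison, and sums the unrounded components before rounding (the specification does not say which order to use). The function name combine_scores and its parameters are hypothetical.

```python
def combine_scores(length, relevance, rarity, variety,
                   stopword_count, word_count) -> dict:
    """Apply the filler penalty and build the result dictionary."""
    # Strict comparison: penalty only when stopwords exceed half of all words.
    too_many_fillers = word_count > 0 and stopword_count / word_count > 0.5
    penalty = -10.0 if too_many_fillers else 0.0
    # A negative total is capped at zero.
    total = max(0.0, length + relevance + rarity + variety + penalty)
    return {
        'length': round(length, 2),
        'relevance': round(relevance, 2),
        'rarity': round(rarity, 2),
        'variety': round(variety, 2),
        'penalty': penalty,
        'total_score': round(total, 2),
    }
```

The keys match the dictionary format shown at the top of this task.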
Example
Problem statement: "The impact of technology on education."
Topic words after stopword removal: "impact", "technology", "education"
Essay:
"Technology is rapidly changing education. The impact of technology can be seen in online education technology. However, not all impacts of technology are positive."
Length check:
24 words -> 276 words short of the 300-word minimum. Deduction of 10% (1 point) for every 20 words missing: 276 / 20 = 13.8 points
Final length mark: 10 - 13.8 = -3.8 -> 0 marks
Relevance:
Frequency of topic words in the essay: "technology":4, "education":2, "impact":1
Not all topic words appear at least 3 times in the essay.
Total capped topic word appearances: 3 + 2 + 1 = 6 (the count for "technology" is capped at 3).
Relevance score: 40 * 6 / (3 * 3) = 26.67
Rarity:
Suppose the essay has 8 unique non-stopword words, and the total rarity points are 19.
Rarity score: min(30, 30 * 19 / (8*3)) = 23.75
Variety:
Suppose the essay has 8 unique non-stopword words, and the total number of non-stopword words is 12.
Variety score: 20 * 8 / 12 = 13.33
Filler penalty:
Suppose the essay has 12 stopwords ~ 50%
Penalty: -10
The final score is: 0 + 26.67 + 23.75 + 13.33 - 10 = 53.75