Data Science for Schools, Part 2: Student Electives Allocation with Python

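Original: towardsdatascience.com/data-science-for-schools-part-2-student-electives-allocation-with-python-9102830fff9c

Imagine the following scenario: you're a teacher, and you've been asked to help set up an extracurricular "options/electives" program for 200 students.

Each student picks their top four preferences, and you need to allocate students in a way that maximizes student satisfaction while respecting various constraints (for example, an elective needs at least 5 students to run).

How would you do it?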

https://github.com/OpenDocCN/towardsdatascience-blog-zh-2024/raw/master/docs/img/fbf0c5f0cd611a7cbcdd07b7c4299716.png

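Image by author

Data Science for Schools

In this article, I'll show you how to use data science (specifically, a technique called linear programming) to solve this problem.

This is part of my series showing how schools can apply data science and AI to improve things such as:

Timetabling
Assessment
Lesson planning

My aim is to help you automate the boring stuff and free up teachers' time to focus on what they do best: teaching.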

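If I'm not a teacher, why should I care?

As I've argued before, data scientists are too fixated on machine learning:

Stop Overusing Scikit-Learn and Try OR-Tools Instead

In this article, we'll use a technique called linear programming, which is a neat way of taking constraints such as:

"Each student must be allocated to one of their top four choices"
"If a student is allocated to football in the autumn term, they are not eligible to be allocated to football in the spring term"

…and converting them into mathematical equations that can be solved programmatically.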
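As a rough illustration of what "converting to equations" means, here is one way the two example constraints could be written. The notation is illustrative and not taken from the original: $x_{ij}$ is 1 if student $i$ is allocated to elective $j$ and 0 otherwise, $P_i$ is the set of student $i$'s four chosen electives, and the second line assumes a separate variable for each term.

```latex
\sum_{j \in P_i} x_{ij} = 1 \quad \text{for every student } i

x^{\text{autumn}}_{i,\,\text{football}} + x^{\text{spring}}_{i,\,\text{football}} \le 1 \quad \text{for every student } i
```

Both lines are linear in the 0/1 variables, which is what lets a solver handle them.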

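Linear programming is a very valuable skill to have in your data science toolkit. If you want to build a distinctive portfolio and learn a high-value (but low-supply) data skill, it's a great place to start.

My previous guide started things off by showing how to use Python to automatically arrange cover for lessons whose regular teacher is absent.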

https://github.com/OpenDocCN/towardsdatascience-blog-zh-2024/raw/master/docs/img/4a13c49492dc0ad944edfb489eb863e4.png

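Image from my previous article, Data Science for Schools, Part 1

In this guide, we'll build on that and introduce some more complex constraints.

Part 1: Creating the data

Before we start optimizing, we need to set up our environment and create some data.

First, we'll import the required libraries: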

import numpy as np
import pandas as pd
from collections import Counter
from faker import Faker
from ortools.sat.python import cp_model
from ortools.linear_solver import pywraplp

Next, let's create a DataFrame called options, listing the electives on offer and their maximum capacities:

# Define dict of options
# Format: {option: max_capacity}
options_dict = {
    "Badminton": 20,
    "Football 1": 28,
    "Football 2": 28,
    "Basketball": 15,
    "Cricket": 22,
    "Basket Weaving": 20,
    "Lego Engineering": 20,
    "History Guided Reading": 20,
    "Learn a Brass Instrument": 10,
    "Public Speaking": 15,
    "Knitting": 15,
    "Karate": 15,
}

# Convert to DataFrame and insert id column
options = pd.DataFrame(list(options_dict.items()), columns=['Option', 'Maximum capacity'])
options.insert(0, 'option_id', range(1, len(options)+1))
options

https://github.com/OpenDocCN/towardsdatascience-blog-zh-2024/raw/master/docs/img/b9a1b2efe7484f3a68a372e5702e14d2.png

Image by author

Next, we'll use the Faker library to create 200 fake students, generating realistic names programmatically:

# Create 200 student IDs
student_ids = list(range(1, 201))

# Create 200 student names
fake = Faker()
student_names = [f"{fake.first_name()} {fake.last_name()}" for _ in student_ids]

# Store in a DataFrame
students = pd.DataFrame({
    'student_id': student_ids,
    'student_name': student_names,
})

students

https://github.com/OpenDocCN/towardsdatascience-blog-zh-2024/raw/master/docs/img/ad23b8a11290afe24b1267da8f14892a.png

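Image by author

Now, let's create the students' preferences and add these to the students DataFrame. For now, we'll assume these preferences are random (later, we'll modify this to be more realistic):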

# Create preferences
preferences_dict = {student: np.random.choice(options['option_id'], size=4, replace=False) for student in student_ids}
preferences = pd.DataFrame.from_dict(preferences_dict, orient='index', columns=[f'Preference {i}' for i in range(1, 4+1)])
preferences.index.name = 'student_id'
preferences.reset_index(inplace=True)

# Merge
students = pd.merge(students, preferences, on='student_id')
students

https://github.com/OpenDocCN/towardsdatascience-blog-zh-2024/raw/master/docs/img/74ed5baae41f3037f59fa7f934066610.png

Image by author
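Because np.random.choice is called without a seed, every run produces different preferences, and so different results later on. If reproducibility matters, one option is a seeded generator. The sketch below assumes NumPy's Generator API and an arbitrary seed of 42, neither of which appears in the original code; it also uses plain lists in place of the DataFrame columns.

```python
import numpy as np

option_ids = list(range(1, 13))      # the 12 option_id values
student_ids = list(range(1, 201))    # the 200 student_id values

rng = np.random.default_rng(42)      # seed value is arbitrary (an assumption)

# Four distinct electives per student, in preference order
preferences_dict = {
    student: rng.choice(option_ids, size=4, replace=False)
    for student in student_ids
}

print(len(preferences_dict))           # 200
print(len(set(preferences_dict[1])))   # 4
```

Re-running this with the same seed yields the same preferences, which makes the solver output further down repeatable.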

Finally, we'll add a quick check to make sure there's enough capacity across the electives for every student to be allocated:

num_students = students.shape[0]
num_electives = options.shape[0]

print(f"Number of students: {num_students}")
print(f"Number of electives: {num_electives}")

# Check that there's enough capacity for all the students
assert sum(options_dict.values()) >= len(students)

# Number of students: 200
# Number of electives: 12

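Part 2: Simple allocation

We've got our data; now we can begin the allocation process.

We'll start by building a simple model that allocates each student to one elective while maximizing student satisfaction (i.e., the model will give as many students as possible their top choice and try to avoid allocating students to their lower-ranked choices).

Here are the assumptions:

Each student will be allocated to exactly one elective.
Each student must be allocated to one of their top four choices.
Each option can only be chosen once, but there are no dependencies or incompatible combinations. (If a student puts "Football 1" as their first choice, that doesn't stop them from also choosing other options, such as "Football 2".)

Later, we'll add some more complex considerations. But this will make a good start.

Initialize the model

First, initialize the model using OR-Tools: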

solver = pywraplp.Solver.CreateSolver('SCIP')

Create the decision variables

Next, we'll create 2,400 Boolean decision variables, one for each student-elective combination (200 students * 12 electives = 2,400 variables):

# Decision variables
decis_vars = {}
for i in students['student_id']:
    for j in options['option_id']:
        decis_vars[f'student{i}_elective{j}'] = solver.BoolVar(f'student{i}_elective{j}')

print(f"Number of decision variables = {solver.NumVariables()}")

For example, for student 1 ("David Gilbert"), there will be 12 decision variables, each representing one possible elective allocation:

[k for k in decis_vars.keys() if k.startswith('student1_')]

# Result
['student1_elective1',
 'student1_elective2',
 'student1_elective3',
 'student1_elective4',
 'student1_elective5',
 'student1_elective6',
 'student1_elective7',
 'student1_elective8',
 'student1_elective9',
 'student1_elective10',
 'student1_elective11',
 'student1_elective12']

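If David is allocated to elective 1 ("Badminton"), the variable student1_elective1 will be set to 1 and the other 11 variables will be set to 0. If David is allocated to elective 2 ("Football 1"), the variable student1_elective2 will be set to 1 and the other 11 will be set to 0. And so on.

(Remember: we don't know in advance which elective David will be allocated to. The goal of our optimization workflow is to work out the best way to allocate all the students. For now, we're initializing these as Boolean variables, which just means "these decision variables might equal 0, or they might equal 1".)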

Add the constraints

Next, we'll add constraints to our model:

# Define the constraints

# Constraint #1: Each student can only be assigned to one elective option
for student_id in students['student_id']:

    # Collect the decision variables related to that student
    relevant_variables = []
    for d in decis_vars.keys():
        if d.startswith(f'student{student_id}_'):
            relevant_variables.append(decis_vars[d])

    # Add the constraint
    solver.Add(sum(relevant_variables) == 1)

# Constraint #2: Each student must be assigned to one of their top 4 preferences
for student_id in students['student_id']:

    # Identify the student's top preferences
    preferences = list(preferences_dict[student_id])

    relevant_variables = []
    for p in preferences:
        for d in decis_vars.keys():
            if d.startswith(f'student{student_id}_') and d.endswith(f'elective{p}'):
                relevant_variables.append(decis_vars[d])

    # Add the constraint
    solver.Add(sum(relevant_variables) == 1)

# Constraint #3: Each elective has a max upper limit
for option in options['option_id']:

    max_num_students = options[options['option_id']==option]['Maximum capacity'].values[0]
    relevant_variables = []
    for d in decis_vars.keys():
        if d.endswith(f'elective{option}'):
            relevant_variables.append(decis_vars[d])

    # Add the constraint
    solver.Add(sum(relevant_variables) <= max_num_students)
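One detail worth noting in the string matching above: the trailing underscore in f'student{student_id}_' is what stops student 1 from also matching students 10 to 19 and 100 to 199. A small sketch with plain strings (no solver needed) shows the difference:

```python
# Same naming scheme as decis_vars, built here as plain strings
keys = [f"student{i}_elective{j}" for i in range(1, 201) for j in range(1, 13)]

with_underscore = [k for k in keys if k.startswith("student1_")]
without_underscore = [k for k in keys if k.startswith("student1")]

# endswith is safe here: "elective11" does not end with "elective1"
elective_1 = [k for k in keys if k.endswith("elective1")]

print(len(with_underscore))     # 12
print(len(without_underscore))  # 1332
print(len(elective_1))          # 200
```

Without the underscore, the filter picks up 111 students' worth of variables (students 1, 10 to 19 and 100 to 199) instead of one.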

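Define the objective function

The objective function is the thing we want to maximize (or minimize).

In plain English, our objective function is: "allocate students in a way that maximizes student satisfaction while satisfying all the constraints."

We can also express this mathematically: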

https://github.com/OpenDocCN/towardsdatascience-blog-zh-2024/raw/master/docs/img/5a13a9206d8ea74306bbbd8b09ea0e05.png

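Image by author

If this looks like gibberish, don't worry! One reason linear programming is less well known than (for example) machine learning is that it hides behind intimidating mathematical notation that can put newcomers off. But you don't need to understand the notation to grasp the principles and write the code: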

# Define the objective function
objective = solver.Objective()

for index, row in students.iterrows():

    # Fetch the student id
    student_id = row['student_id']

    # Identify the student's top 4 preferences
    for pref_rank in range(1, 5):
        preference = row[f'Preference {pref_rank}']

        # Find the option that corresponds to the preference
        for option_index, option_row in options.iterrows():
            option_id = option_row['option_id']

            if option_id == preference:

                # Find the decision variable for this student and option
                var_name = f'student{student_id}_elective{option_id}'
                if var_name in decis_vars:
                    var = decis_vars[var_name]
                    # Add to the objective function with a weight based on preference rank
                    objective.SetCoefficient(var, 5 - pref_rank)  # Higher preference ranks are weighted more positively

objective.SetMaximization()
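Read as a formula, the loop above gives each decision variable a weight of 5 minus its preference rank. In illustrative notation (not necessarily that of the figure above), where $p_{i,r}$ is the elective that student $i$ ranked $r$-th and $x_{i,j}$ is the decision variable for student $i$ and elective $j$, the objective being set up is roughly:

```latex
\max \sum_{i=1}^{200} \sum_{r=1}^{4} (5 - r)\, x_{i,\,p_{i,r}}
```

So a first choice scores 4, a second choice 3, a third choice 2 and a fourth choice 1, and the best possible total for 200 students is 800.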

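Solve

We've initialized the decision variables, defined the constraints and specified the objective function. Now, let's find the optimal solution: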

status = solver.Solve()

# Check if the problem has an optimal solution
if status == pywraplp.Solver.OPTIMAL:
    print('Optimal solution found!')

    # Print optimal objective value
    print('Objective value =', solver.Objective().Value())

# Optimal solution found!
# Objective value = 785.0

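Display the solution

Our algorithm has successfully allocated all students in a way that satisfies our constraints and maximizes student satisfaction. Woohoo! Now we need to look at the solution.

We could just inspect the value of each decision variable: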

for var in decis_vars.values():
    print(var.solution_value())

…but that would get tedious! Let's define some helper functions to print the results and pull out some useful statistics:

def get_relevant_vars(student_id):
    """Find the (12) decision variables related to a given `student_id`."""
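    # Match on the full 'student{id}_' prefix so that student 1 does not also match students 10, 100, etc.
    return {k: v for k, v in decis_vars.items() if k.startswith(f'student{student_id}_')}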

def show_elective(list_of_vars):
    """For a given list of decision variables, find the one(s) equal to 1."""
    # Solver values are floats, so compare against 0.5 rather than testing == 1
    vars_equal_to_1 = [k.split('elective')[1] for k, v in list_of_vars.items() if v.solution_value() > 0.5]
    return vars_equal_to_1[0]

def get_final_allocations(list_of_student_ids=students['student_id']):
    """For a given `list_of_student_ids`, retrieve the final allocations."""
    final_allocations = {student: show_elective(get_relevant_vars(student)) for student in list_of_student_ids}
    return final_allocations

def count_final_allocations(allocations_dict):
    """Count the number of students allocated to each elective."""
    return Counter(allocations_dict.values())

def count_num_students_who_got_each_choice():
    """Count how many students got their 1st choice, how many got their second, etc."""       

    satisfaction = {}
    for student in students['student_id']:
        allocated_elective = final_allocations[student]

        for i in range(1, 5):
            choice = students[students['student_id']==student][f'Preference {i}'].values[0]

            if allocated_elective == str(choice):
                satisfaction[student] = i

    return satisfaction

This lets us produce three things: (1) a dictionary showing where each student has been allocated:

final_allocations = get_final_allocations()
final_allocations

# {1: '10',
#  2: '6',
#  3: '12',
#  4: '4',
#  5: '3',
#  6: '1',
#  7: '5',
#  ...
#  200: '9'}

…(2) the total number of students allocated to each elective:

count_final_allocations(final_allocations)

# Result
Counter({'5': 22,
         '6': 20,
         '8': 20,
         '7': 20,
         '1': 19,
         '3': 16,
         '10': 15,
         '12': 15,
         '4': 15,
         '11': 15,
         '2': 13,
         '9': 10})

…(3) the number of students who got their first choice, second choice, and so on:

performance = Counter(count_num_students_who_got_each_choice().values())

for i in range(1, 5):
    print(f"The number of students who got their #{i} choice was {performance[i]}")

# Result
# The number of students who got their #1 choice was 176
# The number of students who got their #2 choice was 24
# The number of students who got their #3 choice was 0
# The number of students who got their #4 choice was 0

From this we can see that our algorithm has done pretty well: 176 of the 200 students got their first choice, and the remaining 24 got their second. Not bad!
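As a sanity check, the objective value can be recomputed by hand from these counts, since each choice of rank r contributes 5 - r points. The sketch below simply redoes that arithmetic with the numbers printed above:

```python
# Points per preference rank, matching the 5 - pref_rank coefficients
weights = {rank: 5 - rank for rank in range(1, 5)}   # {1: 4, 2: 3, 3: 2, 4: 1}

# Counts taken from the printed result above
performance = {1: 176, 2: 24, 3: 0, 4: 0}

total = sum(weights[rank] * count for rank, count in performance.items())
best_possible = weights[1] * sum(performance.values())

print(total)          # 776
print(best_possible)  # 800
```

These counts give 776 rather than the 785.0 printed by the solver earlier. Since the preferences are generated without a random seed, the two outputs presumably come from different runs, so expect your own numbers to differ as well.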

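Part 3: More realistic voting preferences

So far, we've assumed students' preferences are random, as if a student were just as likely to pick "Football" as "Basket Weaving".

In reality, though, some options are likely to be oversubscribed while others are undersubscribed.

Let's update the students' preferences to reflect this. Instead of this line: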

preferences_dict = {student: np.random.choice(options['option_id'], size=4, replace=False) for student in student_ids}

we'll write:

preferences_dict = {student: np.random.choice(options['option_id'], size=4, replace=False, p=[0.05, 0.25, 0.25, 0.1, 0.05, 0.05, 0.05, 0.05, 0.1, 0.01, 0.01, 0.03]) for student in student_ids}

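We use the p parameter to specify the probability of each option being picked. I've given "Football 1" and "Football 2" the highest weights, which means (unsurprisingly) that these options will receive the most votes: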

temp_df = students['Preference 1'].value_counts().to_frame()
temp_df.index.name = 'option_id'
temp_df.reset_index(inplace=True)

pd.merge(
    temp_df,
    options,
    on='option_id'
)

https://github.com/OpenDocCN/towardsdatascience-blog-zh-2024/raw/master/docs/img/1b33953438fb62e0ee5b43fe90f49aab.png

Image by author
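To see why the football options end up oversubscribed, it helps to compare expected first-choice demand with capacity. The sketch below assumes each student's first preference follows the p weights directly, so expected demand is roughly 200 * p. It is back-of-the-envelope arithmetic, not output from the article's code.

```python
capacities = {
    "Badminton": 20, "Football 1": 28, "Football 2": 28, "Basketball": 15,
    "Cricket": 22, "Basket Weaving": 20, "Lego Engineering": 20,
    "History Guided Reading": 20, "Learn a Brass Instrument": 10,
    "Public Speaking": 15, "Knitting": 15, "Karate": 15,
}
p = [0.05, 0.25, 0.25, 0.1, 0.05, 0.05, 0.05, 0.05, 0.1, 0.01, 0.01, 0.03]
num_students = 200

oversubscribed = []
for (name, capacity), prob in zip(capacities.items(), p):
    expected = num_students * prob   # expected number of first-choice votes
    if expected > capacity:
        oversubscribed.append(name)
    print(f"{name:26s} expected {expected:5.1f}  capacity {capacity:3d}")

print(oversubscribed)
```

Under this assumption, each football option expects about 50 first-choice votes against a capacity of 28, so some students inevitably end up with a lower-ranked choice.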

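Now, if we run the optimization code again, we see that the algorithm still finds an optimal solution, but fewer students are allocated to their first choice: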

performance = Counter(count_num_students_who_got_each_choice().values())

for i in range(1, 5):
    print(f"The number of students who got their #{i} choice was {performance[i]}")

# Result
# The number of students who got their #1 choice was 143
# The number of students who got their #2 choice was 51
# The number of students who got their #3 choice was 6
# The number of students who got their #4 choice was 0

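Part 4: Additional constraints

The model looks good, but it's still too simple to be of much use in the real world.

Let's fix that by adding an additional constraint:

Each elective needs at least 5 students to run. If fewer than 5 students sign up, the elective won't run and those students need to be allocated elsewhere.

To make this change, add the following lines to the constraints section of your code (before we define the objective function):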

# Constraint #4: Each elective needs a minimum of 5 students to run
min_num_students = 5
for option in options['option_id']:

    relevant_variables = []
    for d in decis_vars.keys():
        if d.endswith(f'elective{option}'):
            relevant_variables.append(decis_vars[d])

    # Add the constraint
    solver.Add(sum(relevant_variables) >= min_num_students)