MoviesLens.py

This article walks through the preprocessing pipeline for the MovieLens dataset: merging the user rating data, reading the movie information, computing a popularity ranking for the movies, vectorizing the genre labels, and extracting release years, laying the groundwork for the subsequent training of the recommender system.


The dataset used in this project is the MovieLens dataset, which contains user ratings, movie information, and related data. MoviesLens.py is the data preprocessing module; below is a walkthrough of its code.
First, the libraries involved are as follows:

import csv
import re
import pandas as pd
from surprise import Reader
from surprise import Dataset
from collections import defaultdict
import os
import sys

The least familiar of these is probably the surprise library. Its Reader and Dataset classes are used here only to adapt the data to the format expected by the later training process, so the library itself is not covered in detail; if you want to learn about the parameter configuration of these two classes, refer to the surprise documentation for Reader and Dataset.
In addition, the defaultdict class from collections is used to create dictionaries with default values, e.g. a = defaultdict(int).
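
Below is a minimal sketch (with toy movie IDs, not taken from the real dataset) of the defaultdict behavior that getPopularityRanks and getGenres rely on later: a missing key is created on first access with the factory's default value, so no explicit initialization is needed.

from collections import defaultdict

rating_counts = defaultdict(int)      # missing key -> 0
for movie_id in [1, 1, 2, 3, 1]:      # toy movie IDs
    rating_counts[movie_id] += 1      # no KeyError on the first occurrence
print(dict(rating_counts))            # {1: 3, 2: 1, 3: 1}

genres = defaultdict(list)            # missing key -> []
genres[1].append("Adventure")
genres[1].append("Comedy")
print(dict(genres))                   # {1: ['Adventure', 'Comedy']}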

The loadMovieLensDataset function

This function loads the user rating data and also builds the two dictionaries mapping between movie IDs and movie names.

    def loadMovieLensDataset(self):

        ratingsDataset = 0
        self.movieId_to_movieName = {}  # maps movie ID -> movie name
        self.movieName_to_movieId = {}  # maps movie name -> movie ID
        # Read the user ratings csv and the author's own ratings csv,
        # skipping the original header row and assigning our own column names
        # (header=None keeps the first data row from being consumed as a header).
        df1 = pd.read_csv(self.rating_file_location, skiprows=1, header=None)
        df1.columns = ['user', 'item', 'rating', 'timestamp']
        df2 = pd.read_csv(self.my_rating_file_location, skiprows=1, header=None)
        df2.columns = ['user', 'item', 'rating', 'timestamp']
        frame = [df1, df2]
        ratingsData = pd.concat(frame, ignore_index=True)  # merge the two DataFrames
        print(ratingsData.head())
        ratingsData.to_csv("ratings-Data.csv", index=False)

        ratingsData = ratingsData.astype({'user': str, 'item': str, 'rating': str, 'timestamp': str})
        print(ratingsData.tail())  # print the last five rows
        # Convert the data into the form used by surprise
        reader = Reader(line_format='user item rating timestamp')
        ratingsDataset = Dataset.load_from_df(ratingsData[['user', 'item', 'rating']], reader)

        # Open the movies file; 'ISO-8859-1' encoding avoids decoding errors.
        # If csvfile is a file object, it should be opened with newline=''.
        # Build the two dictionaries mapping between movie IDs and movie names.
        with open(self.movies_file_location, newline='', encoding='ISO-8859-1') as csv_file:
            # csv.reader returns a reader object that iterates over the rows of the file
            movie_Reader = csv.reader(csv_file)
            next(movie_Reader)  # skip the header line
            for row in movie_Reader:
                movieID = int(row[0])  # the movie's ID
                movieName = row[1]  # the movie's title
                # Fill both lookup dictionaries
                self.movieId_to_movieName[movieID] = movieName
                self.movieName_to_movieId[movieName] = movieID
        # Return the merged, converted ratings dataset
        return ratingsDataset
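
As a usage sketch (the constructor name MoviesLens() is hypothetical, since the article does not show the class definition; the surprise calls themselves are standard), the Dataset returned here can be turned into a full trainset and fed to an algorithm such as SVD:

from surprise import SVD

ml = MoviesLens()  # hypothetical constructor for the class defined in MoviesLens.py
ratingsDataset = ml.loadMovieLensDataset()

trainset = ratingsDataset.build_full_trainset()  # full surprise Trainset
algo = SVD()
algo.fit(trainset)

# Raw IDs are strings here, matching the astype(str) conversion above;
# '1' and '31' are toy example IDs.
prediction = algo.predict('1', '31')
print(prediction.est)  # estimated rating of movie '31' by user '1'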

The getMovieName function

This function returns the movie name for a given movie ID.

    def getMovieName(self, movieID):
        if movieID in self.movieId_to_movieName:
            return self.movieId_to_movieName[movieID]
        else:
            return ""

The getPopularityRanks function

Builds a popularity ranking of the movies (ranked by how many ratings each movie has received) from the user rating data.

    def getPopularityRanks(self):
        ratings = defaultdict(int)  # movie ID -> number of ratings received
        rankings = defaultdict(int)  # movie ID -> popularity rank
        with open(self.new_rating_csv, newline='') as csvfile:
            ratingReader = csv.reader(csvfile)
            next(ratingReader)  # skip the header line
            for row in ratingReader:
                movieID = int(row[1])
                ratings[movieID] += 1  # each occurrence of a movie ID counts as one more rating
        rank = 1  # the best rank starts at 1
        # Sort the movies by number of ratings in descending order
        for movieID, _ in sorted(ratings.items(), key=lambda x: x[1], reverse=True):
            # Assign ranks in sorted order
            rankings[movieID] = rank
            rank += 1
        return rankings
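
A toy illustration of the ranking step (the rating counts below are made up): the most-rated movie gets rank 1, the next gets rank 2, and so on.

from collections import defaultdict

ratings = defaultdict(int, {50: 120, 318: 97, 296: 153})  # toy rating counts per movie ID
rankings = defaultdict(int)
rank = 1
for movieID, _ in sorted(ratings.items(), key=lambda x: x[1], reverse=True):
    rankings[movieID] = rank
    rank += 1
print(dict(rankings))  # {296: 1, 50: 2, 318: 3}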

The getGenres function

Converts each movie's genre labels into a vectorized (0/1) representation.

    def getGenres(self):
        genres = defaultdict(list)  # movie ID -> list of genre IDs (later replaced by a bit vector)
        genreIDs = {}  # genre name -> integer genre ID
        maxGenreID = 0  # number of distinct genres seen so far
        with open(self.movies_file_location, newline='', encoding='ISO-8859-1') as csvfile:
            movieReader = csv.reader(csvfile)
            next(movieReader)  # skip the header line
            for row in movieReader:
                movieID = int(row[0])
                genreList = row[2].split('|')  # a movie can carry several genre labels
                genreIDList = []  # the genre IDs belonging to this movie
                for genre in genreList:
                    if genre in genreIDs:  # the genre already has an assigned ID
                        genreID = genreIDs[genre]
                    else:  # first time this genre appears
                        genreID = maxGenreID  # assign it the next free integer ID
                        genreIDs[genre] = genreID  # record the new genre in the lookup table
                        maxGenreID += 1  # one more distinct genre
                    genreIDList.append(genreID)  # collect this movie's genre IDs
                genres[movieID] = genreIDList  # store the per-movie genre ID list
        # Convert each movie's genre ID list into a 0/1 bit vector
        for (movieID, genreIDList) in genres.items():
            bitfield = [0] * maxGenreID  # one slot per distinct genre
            for genreID in genreIDList:  # set the slots for the movie's genres to 1
                bitfield[genreID] = 1
            genres[movieID] = bitfield  # replace the ID list with the bit vector

        return genres
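
A toy walk-through of the two passes (genre strings made up in the movies.csv format): the first pass assigns each new genre an integer ID, the second expands every movie's ID list into a 0/1 vector of length maxGenreID.

raw = {1: "Adventure|Comedy", 2: "Comedy|Drama"}  # toy movieID -> genre string

genreIDs, maxGenreID = {}, 0
genres = {}
for movieID, genreString in raw.items():
    genreIDList = []
    for genre in genreString.split('|'):
        if genre not in genreIDs:        # first time this genre appears
            genreIDs[genre] = maxGenreID
            maxGenreID += 1
        genreIDList.append(genreIDs[genre])
    genres[movieID] = genreIDList        # {1: [0, 1], 2: [1, 2]}

for movieID, genreIDList in genres.items():  # second pass: one-hot encode
    bitfield = [0] * maxGenreID
    for genreID in genreIDList:
        bitfield[genreID] = 1
    genres[movieID] = bitfield
print(genres)  # {1: [1, 1, 0], 2: [0, 1, 1]}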

The getYears function

Extracts the release year from each movie title.

    def getYears(self):
        expToMatch = re.compile(r"(?:\((\d{4})\))?\s*$")  # regex that captures the trailing year
        years = defaultdict(int)
        with open(self.movies_file_location, newline='', encoding='ISO-8859-1') as csvfile:
            movieReader = csv.reader(csvfile)
            next(movieReader)  # skip the header line
            for row in movieReader:
                movieId = int(row[0])
                title = row[1]
                rawYear = expToMatch.search(title)  # look for a year at the end of the title
                year = rawYear.group(1)  # contents of the first capturing group (the four digits)
                if year:
                    years[movieId] = int(year)
        return years
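
A quick check of the regex on sample titles (the titles below follow the MovieLens naming convention): the four-digit year in the trailing parentheses is captured by group 1, and a title without a year yields None, which the if year: guard filters out.

import re

expToMatch = re.compile(r"(?:\((\d{4})\))?\s*$")
for title in ["Toy Story (1995)", "Heat (1995)", "Untitled Movie"]:
    match = expToMatch.search(title)
    print(title, "->", match.group(1))
# Toy Story (1995) -> 1995
# Heat (1995) -> 1995
# Untitled Movie -> None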