Python Machine Learning Basics, Notes 3: Loading Data (Cookbook)

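This post covers how to load datasets when doing machine learning in Python. It walks through scikit-learn's built-in datasets, CSV files (from a URL and from a local file), Excel files, JSON files, and SQL database access.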


Loading datasets

# Load scikit-learn's datasets
from sklearn import datasets

# Load digits dataset (handwritten digits)
digits = datasets.load_digits()

# Create features matrix
features = digits.data

# Create target vector
target = digits.target
# View first observation
features[0]


Some of the available datasets:

load_boston
Contains 506 observations on Boston housing prices. It is a good dataset for
exploring regression algorithms. Note that load_boston was deprecated in
scikit-learn 1.0 and removed in 1.2, so it is only available in older versions.
load_iris
Contains 150 observations on the measurements of Iris flowers. It is a good
dataset for exploring classification algorithms.
load_digits
Contains 1,797 observations from images of handwritten digits. It is a good
dataset for teaching image classification.
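
As a small sketch of how the other loaders in the list behave, the snippet below loads the Iris dataset and checks the shapes of the features matrix and target vector. It assumes scikit-learn is installed; the variable names are only illustrative.

```python
from sklearn import datasets

# Load the Iris dataset; return_X_y=True returns (features, target) directly
features, target = datasets.load_iris(return_X_y=True)

# 150 observations with 4 measurements each, and one class label per observation
print(features.shape)  # (150, 4)
print(target.shape)    # (150,)
```

The same pattern (a `data` matrix plus a `target` vector) applies to load_digits as shown above.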

CSV file

From a URL:

# Load library
import pandas as pd

# Create URL
url = 'https://tinyurl.com/simulated_data'

# Load dataset
dataframe = pd.read_csv(url)

# View first two rows
dataframe.head(2)

From a local file:

dataframe = pd.read_csv(r'path')
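
Real CSV files often use a different separator or lack a header row. The sketch below illustrates the `sep`, `header`, and `names` parameters of `pd.read_csv`; it reads from in-memory strings instead of a real path so it can run anywhere, and the column names and values are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical semicolon-separated data with a header row
csv_with_header = "id;value\n1;0.5\n2;0.7\n"
dataframe = pd.read_csv(io.StringIO(csv_with_header), sep=";")

# The same data without a header row: supply the column names ourselves
csv_no_header = "1;0.5\n2;0.7\n"
no_header = pd.read_csv(
    io.StringIO(csv_no_header), sep=";", header=None, names=["id", "value"]
)

print(dataframe.head(2))
print(no_header.head(2))
```

With a file on disk, the `io.StringIO(...)` argument would simply be replaced by the file path.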

EXCEL

# Load library
import pandas as pd

# Create URL
url = 'https://tinyurl.com/simulated_excel'

# Load data
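# Note: the parameter is sheet_name; the old spelling sheetname is no longer accepted by current pandas
dataframe = pd.read_excel(url, sheet_name=0, header=1)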

# View the first two rows
dataframe.head(2)

PS: sheet_name (spelled sheetname in old pandas releases) accepts both strings
containing the name of the sheet and integers pointing to sheet positions
(zero-indexed). If we need to load multiple sheets, include them as a list. For
example, sheet_name=[0, 1, 2, "Monthly Sales"] will return a dictionary of pandas
DataFrames containing the first, second, and third sheets and the sheet named
Monthly Sales.
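
To make the multi-sheet behavior concrete, here is a minimal sketch that builds a two-sheet workbook in memory and reads it back with a list passed to `sheet_name`. It assumes the `openpyxl` package is installed for reading and writing .xlsx files; the sheet names and data are hypothetical.

```python
import io
import pandas as pd

# Build a small two-sheet workbook in memory (illustrative data only)
buffer = io.BytesIO()
with pd.ExcelWriter(buffer, engine="openpyxl") as writer:
    pd.DataFrame({"a": [1, 2]}).to_excel(writer, sheet_name="First", index=False)
    pd.DataFrame({"b": [3, 4, 5]}).to_excel(
        writer, sheet_name="Monthly Sales", index=False
    )
buffer.seek(0)

# Passing a list returns a dict keyed by the entries of that list
sheets = pd.read_excel(buffer, sheet_name=[0, "Monthly Sales"], engine="openpyxl")

print(sheets[0])                # the first sheet, selected by position
print(sheets["Monthly Sales"])  # the sheet selected by name
```

The dictionary keys are exactly the values passed in the list, so positional and named selections can be mixed.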

JSON file

# Load library
import pandas as pd

# Create URL
url = 'https://tinyurl.com/simulated_json'

# Load data
dataframe = pd.read_json(url, orient='columns')

# View the first two rows
dataframe.head(2)

Note: the orient parameter tells pandas how the JSON file is structured.
It might take some experimenting to figure out which argument (split, records,
index, columns, or values) is the right one. Another helpful tool pandas offers
is json_normalize, which can help convert semistructured JSON data into a pandas
DataFrame.
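
The sketch below shows two of the orient layouts side by side and a small `pd.json_normalize` call on nested records. The JSON strings are invented for illustration and are read from memory rather than from a URL.

```python
import io
import pandas as pd

# orient="records": a list of row objects
records_json = '[{"id": 1, "score": 0.5}, {"id": 2, "score": 0.7}]'
df_records = pd.read_json(io.StringIO(records_json), orient="records")

# orient="split": separate "columns", "index", and "data" entries
split_json = (
    '{"columns": ["id", "score"], "index": [0, 1], '
    '"data": [[1, 0.5], [2, 0.7]]}'
)
df_split = pd.read_json(io.StringIO(split_json), orient="split")

# json_normalize flattens nested objects into dotted column names
nested = [
    {"id": 1, "user": {"name": "Ann", "city": "Oslo"}},
    {"id": 2, "user": {"name": "Bob", "city": "Lima"}},
]
flat = pd.json_normalize(nested)

print(df_records)
print(df_split)
print(flat.columns.tolist())
```

Both `read_json` calls describe the same table; only the layout of the JSON text differs, which is what orient has to match.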

SQL database access

# Load libraries
import pandas as pd
from sqlalchemy import create_engine

# Create a connection to the database
database_connection = create_engine('sqlite:///sample.db')

# Load data
dataframe = pd.read_sql_query('SELECT * FROM data', database_connection)

# View first two rows
dataframe.head(2)
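
The example above needs an existing sample.db file. As a self-contained sketch, the snippet below creates an in-memory SQLite database with the standard-library `sqlite3` module, which `pd.read_sql_query` also accepts as a connection, and runs a parameterized query. The table contents and column names are hypothetical.

```python
import sqlite3
import pandas as pd

# Create an in-memory SQLite database with a small illustrative table
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE data (first_name TEXT, age INTEGER)")
connection.executemany(
    "INSERT INTO data VALUES (?, ?)",
    [("Jason", 42), ("Molly", 52), ("Tina", 36)],
)
connection.commit()

# Use a placeholder instead of string formatting to pass query values
dataframe = pd.read_sql_query(
    "SELECT * FROM data WHERE age > ? ORDER BY age", connection, params=(40,)
)
connection.close()

print(dataframe)
```

Selecting only the needed rows and columns in SQL, rather than loading everything with SELECT *, keeps the resulting DataFrame small.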