The Missing Number (1144)

This post discusses an efficient way to find the smallest positive integer that does not appear in a given list of integers. Using a scan-and-mark strategy, the problem is solved in O(n) time, which scales to large inputs. The post also compares how hard the approach is to implement in C++ versus Python and gives C++ code examples.


Description

Given N integers, you are supposed to find the smallest positive integer that is NOT in the given list.

Input Specification

Each input file contains one test case. For each case, the first line gives a positive integer N (≤10^5). Then N integers are given in the next line, separated by spaces. All the numbers are in the range of int.

Output Specification

Print in a line the smallest positive integer that is missing from the input list.

Sample Input

10
5 -25 9 6 1 3 4 2 5 17

Sample Output

7

Analysis

The task boils down to finding the smallest positive integer not present in the list. One solution found online takes an approach that is, how to put it, a little quirky but very effective: mark every positive value below a fixed bound in an array, then scan upward for the first unmarked index. The key observation is that with at most 10^5 input numbers, the answer can never exceed 10^5 + 1, so any value above that bound can simply be ignored:

#include <bits/stdc++.h>
using namespace std;
const int maxn = 100000 + 10;  // the answer is at most n + 1 <= 100001, so this bound is safe
int n;
int vis[maxn];                 // vis[x] = 1 means x appeared in the input
int main() {
    scanf("%d", &n);
    for (int i = 0; i < n; i++) {
        int a;
        scanf("%d", &a);
        if (a > 0 && a < maxn) vis[a] = 1;  // ignore non-positive values and values too large to matter
    }
    for (int i = 1; i < maxn; i++) {
        if (vis[i] == 0) {
            printf("%d\n", i);
            break;
        }
    }
    return 0;
}

This is harder to pull off in Python, though: Python only offers the list type, and I found it awkward to initialize such a large array (or perhaps my grasp of Python just isn't deep enough; a Python sketch of the mark-and-scan idea is given at the end of this post). My own idea is shown below. I originally wrote it in Python but accidentally deleted that version, so only the C++ port is left, and it has a bug somewhere that I never managed to pin down, which is frustrating.

#include <iostream>
#include <algorithm>

using namespace std;

int main()
{
	int n;
	cin >> n;
	int in[100005];  // N can be up to 10^5, so the original size of 11000 would overflow
	for (int i = 0; i < n; i++) {
		cin >> in[i];
	}
	sort(in, in + n);
	int left_p = 0;
	int right_p = 1;

	for (int i = 0; i < n - 1; i++) {
		if (in[left_p] < 0) {  // skip negative values at the front
			left_p += 1;
			right_p += 1;
			if (right_p == n) {  // the whole array was consumed while skipping
				if (in[n - 1] <= 0) {
					cout << 1;  // no positive number at all
				}
				else
				{
					cout << in[n - 1] + 1;
				}
			}
		}
		// consecutive (difference 1) or duplicate (difference 0): keep walking
		else if (in[right_p] - in[left_p] == 1 || in[right_p] - in[left_p] == 0) {
			left_p += 1;
			right_p += 1;
			if (right_p == n) {
				cout << in[left_p] + 1;
				break;
			}
		}
		else  // gap found: the answer sits right after in[left_p]
		{
			cout << in[left_p] + 1;
			break;
		}
	}
	// note: nothing above verifies that the first positive value is actually 1,
	// so an input such as "-1 2 3" walks the 2..3 run and prints 4 instead of 1;
	// the n == 1 case never enters the loop and prints nothing at all
	return 0;
}
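
As a footnote to the Python concern above: a marking list of this size is easy to create in Python with list multiplication, so the mark-and-scan idea from the first solution carries over almost directly. Below is a minimal sketch of that approach, assuming only the bound argued earlier (the answer is at most n + 1); it is not the lost original, and the variable names are just illustrative.

import sys

def main():
    data = sys.stdin.read().split()
    n = int(data[0])
    nums = map(int, data[1:1 + n])
    limit = n + 2                  # the smallest missing positive is at most n + 1
    seen = [False] * limit         # a plain Python list works fine as the marking array
    for x in nums:
        if 0 < x < limit:
            seen[x] = True
    for i in range(1, limit):
        if not seen[i]:
            print(i)
            break

if __name__ == "__main__":
    main()

Because the answer is bounded by n + 1, no sorting is needed and the whole pass stays O(n).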