Import Data from TXT or CSV Files into MySQL Database Tables

This article shows how to use MySQL's LOAD DATA INFILE feature to bulk-load data from a txt file into a database table. A concrete example walks through connecting to the database and running the import, and a complete Java implementation is provided.

Mysql-connector-java-3.1.10 is the JDBC driver for MySQL. MySQL provides the LOAD DATA INFILE statement to bulk-import data from plain-text files, such as tab-delimited TXT or comma-separated CSV files, into database tables.

The example below imports data from a .txt file into a table.

temp.txt is a tab-separated file:

"1 string"      100
"2 string" 102
"3 string" 104
"4 string" 106
testtable structure:

CREATE TABLE testtable (
    id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    text VARCHAR(45) NOT NULL,
    price INT NOT NULL
);
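
Before automating the import from Java, it can help to test the statement directly in the mysql client. A minimal sketch, assuming temp.txt has been copied to /tmp/temp.txt on the database server (the path is only an example):

LOAD DATA INFILE '/tmp/temp.txt'
INTO TABLE testtable (text, price);

Because the fields in temp.txt are separated by tabs, which is the LOAD DATA default, no FIELDS clause is needed. The Java program below builds and runs the same statement through JDBC.
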
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class automateImport
{
    public static void main(String[] args)
    {
        // connect to the local "test" database and import the file
        // whose path is given on the command line
        DBase db = new DBase();
        Connection conn = db.connect(
                "jdbc:mysql://localhost:3306/test", "root", "caspian");
        db.importData(conn, args[0]);
    }

}

class DBase
{
    public DBase()
    {
    }

    public Connection connect(String db_connect_str,
            String db_userid, String db_password)
    {
        Connection conn;
        try 
        {
            // load and register the MySQL JDBC driver
            Class.forName("com.mysql.jdbc.Driver").newInstance();

            conn = DriverManager.getConnection(db_connect_str,
                    db_userid, db_password);
        
        }
        catch(Exception e)
        {
            e.printStackTrace();
            conn = null;
        }

        return conn;    
    }
    
    public void importData(Connection conn,String filename)
    {
        Statement stmt;
        String query;

        try
        {
            stmt = conn.createStatement(
                    ResultSet.TYPE_SCROLL_SENSITIVE,
                    ResultSet.CONCUR_UPDATABLE);

            // fields in temp.txt are tab-separated, which is the default
            // field terminator for LOAD DATA, so no FIELDS clause is needed
            query = "LOAD DATA INFILE '" + filename +
                    "' INTO TABLE testtable (text,price)";

            stmt.executeUpdate(query);
                
        }
        catch(Exception e)
        {
            e.printStackTrace();
            stmt = null;
        }
    }
}
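
Note that LOAD DATA INFILE (without LOCAL) reads the file on the database server, so the MySQL account needs the FILE privilege and the path passed in args[0] must be visible to the server process. If the file lives on the client machine instead, the LOCAL variant can be used, assuming local-infile is enabled on both the client and the server; only the query line changes:

query = "LOAD DATA LOCAL INFILE '" + filename +
        "' INTO TABLE testtable (text,price)";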

If you want to import a CSV file, you can use the following query:

query = "LOAD DATA INFILE '"+filename+"' INTO TABLE testtable  FIELDS
TERMINATED BY ',' (text,price)";
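
If the CSV file additionally has a header row and values enclosed in double quotes, a query along the following lines should work (a sketch, not part of the original example; note the escaping required inside the Java string literal):

query = "LOAD DATA INFILE '" + filename +
        "' INTO TABLE testtable FIELDS TERMINATED BY ',' ENCLOSED BY '\"'" +
        " LINES TERMINATED BY '\\n' IGNORE 1 LINES (text,price)";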
""" title: SQL Server Access author: MENG author_urls: - https://github.com/mengvision description: A tool for reading database information and executing SQL queries, supporting multiple databases such as MySQL, PostgreSQL, SQLite, and Oracle. It provides functionalities for listing all tables, describing table schemas, and returning query results in CSV format. A versatile DB Agent for seamless database interactions. required_open_webui_version: 0.5.4 requirements: pymysql, sqlalchemy, cx_Oracle version: 0.1.6 licence: MIT # Changelog ## [0.1.6] - 2025-03-11 ### Added - Added `get_table_indexes` method to retrieve index information for a specific table, supporting MySQL, PostgreSQL, SQLite, and Oracle. - Enhanced metadata capabilities by providing detailed index descriptions (e.g., index name, columns, and type). - Improved documentation to include the new `get_table_indexes` method and its usage examples. - Updated error handling in `get_table_indexes` to provide more detailed feedback for unsupported database types. ## [0.1.5] - 2025-01-20 ### Changed - Updated `list_all_tables` and `table_data_schema` methods to accept `db_name` as a function parameter instead of using `self.valves.db_name`. - Improved flexibility by decoupling database name from class variables, allowing dynamic database selection at runtime. ## [0.1.4] - 2025-01-17 ### Added - Added support for Oracle database using `cx_Oracle` driver. - Added dynamic engine creation in each method to ensure fresh database connections for every operation. - Added support for Oracle-specific queries in `list_all_tables` and `table_data_schema` methods. ### Changed - Moved `self._get_engine()` from `__init__` to individual methods for better flexibility and tool compatibility. - Updated `_get_engine` method to support Oracle database connection URL. - Improved `table_data_schema` method to handle Oracle-specific column metadata. ### Fixed - Fixed potential connection issues by ensuring each method creates its own database engine. - Improved error handling for Oracle-specific queries and edge cases. ## [0.1.3] - 2025-01-17 ### Added - Added support for multiple database types (e.g., MySQL, PostgreSQL, SQLite) using SQLAlchemy. - Added configuration flexibility through environment variables or external configuration files. - Enhanced query security with stricter validation and SQL injection prevention. - Improved error handling with detailed exception messages for better debugging. ### Changed - Replaced `pymysql` with SQLAlchemy for broader database compatibility. - Abstracted database connection logic into a reusable `_get_engine` method. - Updated `table_data_schema` method to support multiple database types. ### Fixed - Fixed potential SQL injection vulnerabilities in query execution. - Improved handling of edge cases in query validation and execution. ## [0.1.2] - 2025-01-16 ### Added - Added support for specifying the database port with a default value of `3306`. - Abstracted database connection logic into a reusable `_get_connection` method. ## [0.1.1] - 2025-01-16 ### Added - Support for additional read-only query types: `SHOW`, `DESCRIBE`, `EXPLAIN`, and `USE`. - Enhanced query validation to block sensitive keywords (e.g., `INSERT`, `UPDATE`, `DELETE`, `CREATE`, `DROP`, `ALTER`). ### Fixed - Improved handling of queries starting with `WITH` (CTE queries). - Fixed case sensitivity issues in query validation. 
## [0.1.0] - 2025-01-09 ### Initial Release - Basic functionality for listing tables, describing table schemas, and executing `SELECT` queries. - Query results returned in CSV format. """ import os from typing import List, Dict, Any from pydantic import BaseModel, Field import re from sqlalchemy import create_engine, text from sqlalchemy.engine.base import Engine from sqlalchemy.exc import SQLAlchemyError class Tools: class Valves(BaseModel): db_host: str = Field( default="localhost", description="The host of the database. Replace with your own host.", ) db_user: str = Field( default="admin", description="The username for the database. Replace with your own username.", ) db_password: str = Field( default="admin", description="The password for the database. Replace with your own password.", ) db_name: str = Field( default="db", description="The name of the database. Replace with your own database name.", ) db_port: int = Field( default=3306, # Oracle 默认端口 description="The port of the database. Replace with your own port.", ) db_type: str = Field( default="mysql", description="The type of the database (e.g., mysql, postgresql, sqlite, oracle).", ) def __init__(self): """ Initialize the Tools class with the credentials for the database. """ print("Initializing database tool class") self.citation = True self.valves = Tools.Valves() def _get_engine(self) -> Engine: """ Create and return a database engine using the current configuration. """ if self.valves.db_type == "mysql": db_url = f"mysql+pymysql://{self.valves.db_user}:{self.valves.db_password}@{self.valves.db_host}:{self.valves.db_port}/{self.valves.db_name}" elif self.valves.db_type == "postgresql": db_url = f"postgresql://{self.valves.db_user}:{self.valves.db_password}@{self.valves.db_host}:{self.valves.db_port}/{self.valves.db_name}" elif self.valves.db_type == "sqlite": db_url = f"sqlite:///{self.valves.db_name}" elif self.valves.db_type == "oracle": db_url = f"oracle+cx_oracle://{self.valves.db_user}:{self.valves.db_password}@{self.valves.db_host}:{self.valves.db_port}/?service_name={self.valves.db_name}" else: raise ValueError(f"Unsupported database type: {self.valves.db_type}") return create_engine(db_url) def list_all_tables(self, db_name: str) -> str: """ List all tables in the database. :param db_name: The name of the database. :return: A string containing the names of all tables. """ print("Listing all tables in the database") engine = self._get_engine() # 动态创建引擎 try: with engine.connect() as conn: if self.valves.db_type == "mysql": result = conn.execute(text("SHOW TABLES;")) elif self.valves.db_type == "postgresql": result = conn.execute( text( "SELECT table_name FROM information_schema.tables WHERE table_schema = 'public';" ) ) elif self.valves.db_type == "sqlite": result = conn.execute( text("SELECT name FROM sqlite_master WHERE type='table';") ) elif self.valves.db_type == "oracle": result = conn.execute(text("SELECT table_name FROM user_tables;")) else: return "Unsupported database type." tables = [row[0] for row in result.fetchall()] if tables: return ( "Here is a list of all the tables in the database:\n\n" + "\n".join(tables) ) else: return "No tables found." except SQLAlchemyError as e: return f"Error listing tables: {str(e)}" def get_table_indexes(self, db_name: str, table_name: str) -> str: """ Get the indexes of a specific table in the database. :param db_name: The name of the database. :param table_name: The name of the table. :return: A string describing the indexes of the table. 
""" print(f"Getting indexes for table: {table_name}") engine = self._get_engine() try: key, cloumn = 0, 1 with engine.connect() as conn: if self.valves.db_type == "mysql": query = text(f"SHOW INDEX FROM {table_name}") key, cloumn = 2, 4 elif self.valves.db_type == "postgresql": query = text( """ SELECT indexname, indexdef FROM pg_indexes WHERE tablename = :table_name; """ ) elif self.valves.db_type == "sqlite": query = text( """ PRAGMA index_list(:table_name); """ ) elif self.valves.db_type == "oracle": query = text( """ SELECT index_name, column_name FROM user_ind_columns WHERE table_name = :table_name; """ ) else: return "Unsupported database type." result = conn.execute(query) indexes = result.fetchall() if not indexes: return f"No indexes found for table: {table_name}" description = f"Indexes for table '{table_name}':\n" for index in indexes: description += f"- {index[key]}: {index[cloumn]}\n" return description # result = conn.execute(query) # description = result.fetchall() # if not description: # return f"No indexes found for table: {table_name}" # column_names = result.keys() # description = f"Query executed successfully. Below is the actual result of the query {query} running against the database in CSV format:\n\n" # description += ",".join(column_names) + "\n" # for row in description: # description += ",".join(map(str, row)) + "\n" # return description except SQLAlchemyError as e: return f"Error getting indexes: {str(e)}" def table_data_schema(self, db_name: str, table_name: str) -> str: """ Describe the schema of a specific table in the database, including column comments. :param db_name: The name of the database. :param table_name: The name of the table to describe. :return: A string describing the data schema of the table. """ print(f"Database: {self.valves.db_name}") print(f"Describing table: {table_name}") engine = self._get_engine() # 动态创建引擎 try: with engine.connect() as conn: if self.valves.db_type == "mysql": query = text( " SELECT COLUMN_NAME, COLUMN_TYPE, IS_NULLABLE, COLUMN_KEY, COLUMN_COMMENT " " FROM INFORMATION_SCHEMA.COLUMNS " f" WHERE TABLE_SCHEMA = '{self.valves.db_name}' AND TABLE_NAME = '{table_name}';" ) elif self.valves.db_type == "postgresql": query = text( """ SELECT column_name, data_type, is_nullable, column_default, '' FROM information_schema.columns WHERE table_name = :table_name; """ ) elif self.valves.db_type == "sqlite": query = text("PRAGMA table_info(:table_name);") elif self.valves.db_type == "oracle": query = text( """ SELECT column_name, data_type, nullable, data_default, comments FROM user_tab_columns LEFT JOIN user_col_comments ON user_tab_columns.table_name = user_col_comments.table_name AND user_tab_columns.column_name = user_col_comments.column_name WHERE user_tab_columns.table_name = :table_name; """ ) else: return "Unsupported database type." 
# result = conn.execute( # query, {"db_name": db_name, "table_name": table_name} # ) result = conn.execute(query) columns = result.fetchall() if not columns: return f"No such table: {table_name}" description = ( f"Table '{table_name}' in the database has the following columns:\n" ) for column in columns: if self.valves.db_type == "sqlite": column_name, data_type, is_nullable, _, _, _ = column column_comment = "" elif self.valves.db_type == "oracle": ( column_name, data_type, is_nullable, data_default, column_comment, ) = column else: ( column_name, data_type, is_nullable, column_key, column_comment, ) = column description += f"- {column_name} ({data_type})" if is_nullable == "YES" or is_nullable == "Y": description += " [Nullable]" if column_key == "PRI": description += " [Primary Key]" if column_comment: description += f" [Comment: {column_comment}]" description += "\n" return description except SQLAlchemyError as e: return f"Error describing table: {str(e)}" def execute_read_query(self, query: str) -> str: """ Execute a read query and return the result in CSV format. :param query: The SQL query to execute. :return: A string containing the result of the query in CSV format. """ print(f"Executing query: {query}") normalized_query = query.strip().lower() if not re.match( r"^\s*(select|with|show|describe|desc|explain|use)\s", normalized_query ): return "Error: Only read-only queries (SELECT, WITH, SHOW, DESCRIBE, EXPLAIN, USE) are allowed. CREATE, DELETE, INSERT, UPDATE, DROP, and ALTER operations are not permitted." sensitive_keywords = [ "insert", "update", "delete", "create", "drop", "alter", "truncate", "grant", "revoke", "replace", ] for keyword in sensitive_keywords: if re.search(rf"\b{keyword}\b", normalized_query): return f"Error: Query contains a sensitive keyword '{keyword}'. Only read operations are allowed." engine = self._get_engine() # 动态创建引擎 try: with engine.connect() as conn: result = conn.execute(text(query)) rows = result.fetchall() if not rows: return "No data returned from query." column_names = result.keys() csv_data = f"Query executed successfully. Below is the actual result of the query {query} running against the database in CSV format:\n\n" csv_data += ",".join(column_names) + "\n" for row in rows: csv_data += ",".join(map(str, row)) + "\n" return csv_data except SQLAlchemyError as e: return f"Error executing query: {str(e)}" 将上面的工具连接到Microsoft sql server
07-23
import xbot from xbot import web from xbot import excel from xbot import print, sleep from .import package from .package import variables as glv import pymysql import os, shutil import pdfplumber import docx import time, hashlib from pathlib import Path import csv import threading from concurrent.futures import ThreadPoolExecutor, as_completed import queue import requests from urllib.parse import urljoin, urlparse # 全局变量用于存储爬取结果 title_link_queue = queue.Queue() content_queue = queue.Queue() ###############################循环抓取文章的标题和url############################## def get_web(): title_list = [] link_list = [] # 使用多线程同时抓取多页 def fetch_page(i): url = f"https://db.yaozh.com/policies?p={i}" try: web_object = web.create(url, 'edge', load_timeout=200) # 更新XPath以匹配实际页面结构 url_list = web_object.find_all_by_xpath('//div[@class="responsive-table"]/table/tbody/tr/th/a') page_titles = [] page_links = [] for item in url_list: title = item.get_text() link = item.get_attribute('href') # 处理相对链接 if link.startswith('/'): link = urljoin("https://db.yaozh.com", link) # 过滤条件:标题不为空且链接有效 if title and link: page_titles.append(title.strip()) page_links.append(link) web_object.close() return page_titles, page_links except Exception as e: print(f"获取页面 {i} 时出错: {str(e)}") return [], [] # 使用线程池,最多同时运行2个线程(即同时打开2个网页) with ThreadPoolExecutor(max_workers=2) as executor: futures = [executor.submit(fetch_page, i) for i in range(1, 3)] # 提交第1、2页任务 for future in as_completed(futures): page_titles, page_links = future.result() title_list.extend(page_titles) link_list.extend(page_links) return title_list, link_list ###########################将文章的标题和url存储进入到CSV里######################### def url_insert_csv(title_list, link_list): csv_file_path = r'D:\作业\yaozhi_url.csv' # 检查父目录是否存在 directory = os.path.dirname(csv_file_path) if not os.path.exists(directory): os.makedirs(directory) # 自动创建目录 existing_links = set() # 尝试读取已有数据(若文件存在且可读) try: if os.path.exists(csv_file_path): with open(csv_file_path, mode='r', encoding='utf-8-sig') as file: reader = csv.DictReader(file) for row in reader: if '链接' in row: existing_links.add(row['链接']) except PermissionError: print(f"警告:无法读取历史数据,文件可能被占用:{csv_file_path}") return # 或抛出异常提示用户 # 过滤新数据 new_data = [[t, l] for t, l in zip(title_list, link_list) if l not in existing_links] # 写入文件 try: file_mode = 'a' if existing_links else 'w' with open(csv_file_path, mode=file_mode, newline='', encoding='utf-8-sig') as file: writer = csv.writer(file) if file_mode == 'w': writer.writerow(['标题', '链接']) writer.writerows(new_data) print(f"新增 {len(new_data)} 条记录。") except PermissionError: print(f"[错误] 无法写入文件,请检查:\n1. 是否用Excel打开了'{csv_file_path}'\n2. 
是否有写入权限") ####################################获取单篇文章内容################################# def get_single_article_content(item, article_title): try: web_object_content = web.create(item, 'edge', load_timeout=200) # 尝试多种XPath来获取表格数据 # 尝试多个可能的XPath table_elements = web_object_content.find_all_by_xpath('//*[contains(@class, "body")]//div[contains(@class, "manual")]//div[@class="content hbg" or @class="content"]') dept = "未知" text_sign = "未知" level = "未知" public_date = "未知" keep_time = "未知" title_content = "未知" # 提取基本信息 if table_elements: try: if len(table_elements) > 5: dept = table_elements[0].get_text()[6:].strip() if len(table_elements[0].get_text()) > 6 else "未知" text_sign = table_elements[1].get_text()[6:].strip() if len(table_elements[1].get_text()) > 6 else "未知" level = table_elements[2].get_text()[6:].strip() if len(table_elements[2].get_text()) > 6 else "未知" public_date = table_elements[3].get_text()[-10:].strip() if len(table_elements[3].get_text()) >= 10 else "未知" keep_time = table_elements[4].get_text()[6:].strip() if len(table_elements[4].get_text()) > 6 else "未知" title_content = table_elements[5].get_text().strip() else: dept = table_elements[0].get_text()[6:].strip() if len(table_elements[0].get_text()) > 6 else "未知" level = table_elements[1].get_text()[6:].strip() if len(table_elements[1].get_text()) > 6 else "未知" public_date = table_elements[2].get_text()[-10:].strip() if len(table_elements[2].get_text()) >= 10 else "未知" keep_time = table_elements[3].get_text()[6:].strip() if len(table_elements[3].get_text()) > 6 else "未知" title_content = table_elements[4].get_text().strip() except Exception as e: print(f"提取基本信息时出错: {str(e)}") # 获取正文内容 text_content = "未知" try: text_elements = web_object_content.find_all_by_xpath('//*[contains(@class, "content")]//div[contains(@class, "new_detail_content")]//p') if text_elements: text_content = ''.join(element.get_text().strip() for element in text_elements if element.get_text().strip()) else: # 尝试其他可能的正文内容XPath text_elements = web_object_content.find_all_by_xpath('//div[@class="content"]/p | //div[@class="content"]/div | //div[@class="article-content"]//p') if text_elements: text_content = '\n'.join(element.get_text().strip() for element in text_elements if element.get_text().strip()) except: pass # 如果正文内容为空,使用标题内容 if not text_content.strip(): text_content = title_content # 下载附件 result_list = download_new(web_object_content, public_date) path_file = result_list[0] all_text_content = result_list[1] target_url = item target_id = url_to_md5(target_url) web_object_content.close() # 合并文章内容和附件内容 if all_text_content != "无附件内容" and all_text_content.strip() != "": # 如果有附件内容,则合并文章内容和附件内容 combined_content = f"{text_content}\n\n附件内容:\n{all_text_content}" else: # 如果无附件内容,则只使用文章内容 combined_content = text_content # 返回数据(按照数据库表结构,添加文章标题字段) data = ["李耀铭", article_title, dept, text_sign, level, public_date, keep_time, combined_content, path_file, item, target_id] return data except Exception as e: print(f"处理文章 {item} 时出错: {str(e)}") try: web_object_content.close() except: pass return None ####################################获取文章内容##################################### def get_content(): # 读取CSV文件中的URL和标题 csv_file_path = 'D:\\作业\\yaozhi_url.csv' execl_url_list = [] execl_title_list = [] with open(csv_file_path, mode='r', encoding='utf-8-sig') as file: reader = csv.reader(file) next(reader) # 跳过表头 for row in reader: execl_title_list.append(row[0]) # 标题在第一列 execl_url_list.append(row[1]) # 链接在第二列 print(f"开始处理 {len(execl_url_list)} 篇文章...") data_lines = [] 
processed_count = 0 # 使用线程池并行处理文章 max_workers = 2 # 减少并发数以避免浏览器崩溃 with ThreadPoolExecutor(max_workers=max_workers) as executor: # 提交所有任务 future_to_url = {executor.submit(get_single_article_content, url, title): (url, title) for url, title in zip(execl_url_list, execl_title_list)} # 收集结果 for future in as_completed(future_to_url): url, title = future_to_url[future] try: result = future.result() if result: data_lines.append(result) processed_count += 1 print(f"已处理第 {processed_count} 篇文章: {result[2][:20]}...") # 每5篇文章上传到数据库和CSV if len(data_lines) >= 5: print(f"已收集 {len(data_lines)} 篇文章,正在批量上传到数据库和CSV...") upload_to_mysql(data_lines) set_csv(data_lines) # 保存批量数据到CSV # 清空已上传的数据 data_lines = [] print(f"前 {processed_count} 篇文章已批量上传到数据库和CSV") except Exception as e: print(f"处理URL {url} 时出错: {str(e)}") # 上传剩余的数据(如果不足5篇) if data_lines: print(f"正在上传剩余的 {len(data_lines)} 篇文章到数据库和CSV...") upload_to_mysql(data_lines) set_csv(data_lines) # 保存剩余数据到CSV print("剩余文章已上传到数据库和CSV") print(f"总共处理了 {processed_count} 篇文章") return data_lines #######################################附件下载##################################### def download_new(web_object_content, public_date): # 初始化结果变量 str_download_path = "无附件" all_text = "无附件内容" download_list = [] try: # 获取页面中的所有下载链接(包含Word、PDF等) # 尝试多种可能的附件链接XPath download_files = web_object_content.find_all_by_xpath('//a[contains(@href, ".pdf") or contains(@href, ".doc") or contains(@href, ".docx") or contains(@href, ".xls") or contains(@href, ".xlsx")]') if not download_files: download_files = web_object_content.find_all_by_xpath('//span[@class="fileLink"]//a | //div[@class="attachment"]//a') if not download_files: print("未找到附件链接") return [str_download_path, all_text] # 创建根目录和日期文件夹 root_dir = r"D:\\作业\\yaozhi" os.makedirs(root_dir, exist_ok=True) # 确保根目录存在 date_folder = os.path.join(root_dir, public_date if public_date != "未知" else "no_date") os.makedirs(date_folder, exist_ok=True) # 自动创建文件夹(已存在则不报错) for item in download_files: try: # 获取文件信息 file_title = item.get_text().strip() # 获取链接文本作为文件名 file_url = item.get_attribute('href').strip() # 处理相对链接 if file_url.startswith('/'): file_url = urljoin("https://db.yaozh.com", file_url) elif not file_url.startswith(('http://', 'https://')): current_url = web_object_content.get_current_url() file_url = urljoin(current_url, file_url) # 提取原始文件名(从URL中)和文件类型 origin_file_name = os.path.basename(file_url) if '.' not in origin_file_name: # 如果URL中没有扩展名,尝试从链接文本推断 if 'pdf' in file_title.lower(): origin_file_name += '.pdf' elif 'doc' in file_title.lower(): origin_file_name += '.doc' elif 'docx' in file_title.lower(): origin_file_name += '.docx' else: continue # 跳过无法识别的文件 file_type = origin_file_name.split('.')[-1].lower() if '.' 
in origin_file_name else 'unknown' # 如果文件名为空,使用链接文本作为文件名 if not file_title or file_title == origin_file_name: file_title = origin_file_name elif not file_title.endswith(f'.{file_type}'): file_title = f"{file_title}.{file_type}" # 确保标题带扩展名 print(f"准备下载: {file_title} from {file_url}") # 下载文件 response = requests.get(file_url, stream=True) if response.status_code == 200: # 目标路径(日期文件夹内) target_path = os.path.join(date_folder, file_title) # 处理重名:如果目标文件已存在,添加序号 counter = 1 original_target_path = target_path while os.path.exists(target_path): name, ext = os.path.splitext(file_title) target_path = os.path.join(date_folder, f"{name}_{counter}{ext}") counter += 1 # 保存文件 with open(target_path, 'wb') as f: for chunk in response.iter_content(chunk_size=8192): f.write(chunk) print(f"文件下载成功:{target_path}") # 记录相对路径 relative_path = os.path.relpath(target_path, root_dir) download_list.append(relative_path) # 提取文件内容(根据类型处理) if file_type == 'pdf': try: with pdfplumber.open(target_path) as pdf: page_texts = [] for page in pdf.pages: text = page.extract_text() if text: page_texts.append(text.strip()) extracted_text = "\n\n".join(page_texts) all_text = extracted_text if extracted_text else all_text except Exception as e: print(f"PDF提取错误: {str(e)}") elif file_type in ['docx', 'doc']: # 处理docx和doc try: if file_type == 'docx': doc = docx.Document(target_path) content = [] # 提取段落 for para in doc.paragraphs: para_text = para.text.strip() if para_text: content.append(para_text) # 提取表格 for table in doc.tables: for row in table.rows: for cell in row.cells: cell_text = cell.text.strip() if cell_text: content.append(cell_text) extracted_text = "\n".join(content) all_text = extracted_text if extracted_text else all_text except Exception as e: print(f"Word文档提取错误: {str(e)}") else: print(f"下载失败,状态码: {response.status_code}") except Exception as e: print(f"处理下载链接时出错: {str(e)}") continue # 拼接所有下载路径 if download_list: str_download_path = ",".join(download_list) else: str_download_path = "无附件" # 确保无附件时返回正确的值 except Exception as e: print(f"下载附件时出错: {str(e)}") str_download_path = "无附件" # 出错时也返回正确的值 return [str_download_path, all_text] ##############################platform_id字段的生成############################ def url_to_md5(url): # 创建MD5哈希对象 md5_hash = hashlib.md5() # 将URL编码为UTF-8字节流(哈希函数需要字节输入) md5_hash.update(url.encode('utf-8')) # 获取16进制表示的哈希结果(32位) return md5_hash.hexdigest() ###########################将所需数据插入到指定CSV里######################### def set_csv(data_lines): # 定义CSV文件路径 csv_file_path = 'D:\\作业\\yaozhi' # 检查文件是否存在,以决定是否写入表头 file_exists = os.path.isfile(csv_file_path) # 以追加模式写入CSV文件 with open(csv_file_path, mode='a', newline='', encoding='utf-8-sig') as file: writer = csv.writer(file) # 如果文件不存在,写入表头 if not file_exists: writer.writerow(['学生', '文章标题', '发布部门', '字号', '效力级别', '发布日期', '有效时间', '内容', '附件路径', '原文链接', '平台ID']) # 写入数据 for data in data_lines: writer.writerow(data) def upload_to_mysql(data_lines): """上传数据到MySQL数据库""" config = { "host": "192.168.100.18", "port": 3306, "user": "yiyaoqixie", "password": "123456", "database": "yiyaoqixie", # 原为 "db" -> 应使用 "database" "charset": "utf8mb4", "cursorclass": pymysql.cursors.DictCursor } connection = None cursor = None try: conn = get_mysql_conn() cursor = conn.cursor() cursor.execute(""" INSERT INTO yaozh_policies ( student, title, cat, code, public_date, content, attachments, url, platform_id ) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s) ON DUPLICATE KEY UPDATE title = VALUES(title), cat = VALUES(cat), code = VALUES(code), public_date = VALUES(public_date), content = 
VALUES(content), attachments = VALUES(attachments), url = VALUES(url) """, ( data["student"], data["title"], data["cat"] or "", data["code"] or "", data["public_date"] or "", data["content"] or "", data["attachments"] or "", data["url"], data["platform_id"] )) conn.commit() if cursor.rowcount == 1: print(f"✅ 数据入库成功:{data['title'][:20]}") elif cursor.rowcount == 2: print(f"🔄 数据更新成功:{data['title'][:20]}") cursor.close() conn.close() return True except Exception as e: if 'conn' in locals(): conn.rollback() print(f"❌ 数据入库失败[{data['title'][:20]}]: {str(e)[:60]}") return False finally: # 安全关闭资源 if cursor: cursor.close() if connection: connection.close() #########################################主函数####################################### def main(args): print("开始获取网页数据...") result1, result2 = get_web() print(f"获取到 {len(result1)} 个标题和 {len(result2)} 个链接") if result1 and result2: url_insert_csv(result1, result2) print("开始获取文章内容...") get_content() print("爬取完成!") else: print("未能获取到数据,请检查网络连接和XPath选择器") if __name__ == "__main__": import sys main(sys.argv) 完善一下插入数据库数据片段,并给我完整代码