pyspark 1.6.2 API http://spark.apache.org/docs/1.6.2/api/python/pyspark.sql.html?highlight=jdbc#pyspark.sql.DataFrameWriter.jdbc
1. Using MySQL as the example database
url = "jdbc:mysql://localhost:3306/test"
table = "test"
properties = {"user": "root", "password": "111111"}
# properties must be passed as a keyword argument: the third positional
# parameter of read.jdbc is column, and of write.jdbc is mode
df = sqlContext.read.jdbc(url, table, properties=properties)  # read
df.write.jdbc(url, table, properties=properties)              # write
# When writing, an RDD must first be converted to a DataFrame, e.g. with rdd.toDF()
# If the imported data contains Chinese text:
#   1) set the MySQL table's encoding to utf8:
#      ALTER TABLE tablename CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;
#   2) add the encoding parameters to the URL:
#      url = "jdbc:mysql://127.0.0.1:3306/goddness?useUnicode=true&characterEncoding=utf-8"
#   3) prefix Chinese string literals with u, e.g. u'汉字'
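The URL-building step above can be sketched as a small helper. This is an illustration I wrote, not part of the PySpark API; the host, database name, and credentials are placeholders, and the Spark calls are shown commented out because they need a live SQLContext and MySQL server:

```python
# Hypothetical helper: build a MySQL JDBC URL, optionally appending the
# encoding parameters so Chinese (UTF-8) text round-trips correctly.
def mysql_jdbc_url(host, port, db, utf8=True):
    url = "jdbc:mysql://%s:%d/%s" % (host, port, db)
    if utf8:
        url += "?useUnicode=true&characterEncoding=utf-8"
    return url

url = mysql_jdbc_url("127.0.0.1", 3306, "test")
properties = {"user": "root", "password": "111111"}
# With a running Spark 1.6 SQLContext these would read and write the table:
# df = sqlContext.read.jdbc(url, "test", properties=properties)
# df.write.jdbc(url, "test", properties=properties)
```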
2. The jdbc function is shown below. It mainly takes three arguments — url, table, and properties — where properties is a dict of string key/value pairs
def jdbc(self, url, table, mode=None, properties=None):
"""Saves the content of the :class:`DataFrame` to a external database table via JDBC.
.. note:: Don't create too many partitions in parallel on a large cluster;\
otherwise Spark might crash your external database systems.
:param url: a JDBC URL of the form ``jdbc:subprotocol:subname``
:param table: Name of the table in the external database.
:param mode: specifies the behavior of the save operation when data already exists.
* ``append``: Append contents of this :class:`DataFrame` to existing data.
* ``overwrite``: Overwrite existing data.
* ``ignore``: Silently ignore this operation if data already exists.
* ``error`` (default case): Throw an exception if data already exists.
:param properties: JDBC database connection arguments, a list of
arbitrary string tag/value. Normally at least a
"user" and "password" property should be included."""
3. Open question:
What kind of connection does Spark use when it talks to the database — a connection pool, or a plain connection opened per use?
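As far as I know, Spark 1.6's JDBC data source does not pool connections: each task opens a plain connection through Java's DriverManager for its partition and closes it when the partition finishes, so a partitioned read with N partitions opens N separate connections. The sketch below approximates how a numeric column is split into partition ranges for such a read; it is my own illustration of the idea, not Spark's actual `columnPartition` code:

```python
# Approximate sketch of splitting [lower, upper) on a numeric column into
# partition bounds; each resulting partition is scanned over its own
# plain JDBC connection (no pooling). None means an unbounded side.
def column_partitions(lower, upper, num_partitions):
    stride = (upper - lower) // num_partitions
    bounds = []
    current = lower
    for i in range(num_partitions):
        lo = None if i == 0 else current
        current += stride
        hi = None if i == num_partitions - 1 else current
        bounds.append((lo, hi))
    return bounds

# A partitioned read in Spark 1.6 would look like this (needs a live SQLContext):
# df = sqlContext.read.jdbc(url, "test", column="id", lowerBound=0,
#                           upperBound=1000, numPartitions=4,
#                           properties=properties)
```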