Interacting with MySQL from Spark
2016-09-27 15:26
Background
After Spark finishes a computation, RDD-format results usually land in HDFS, Hive, or HBase. Non-RDD results, on the other hand, often go into a traditional RDBMS such as MySQL. The write path, however, fails in all sorts of ways, and Stack Overflow has plenty of threads about it:
Error writing spark dataframe to mysql table
JDBC batch insert performance
Some other related posts:
- Using Apache Spark and MySQL for Data Analysis
- spark 1.3.0: writing DataFrame data into a Hive partitioned table
- Four ways for Spark to read from a database (MySQL)
- Complete JDBC database-connection code and steps for Java development
- Spark pitfalls: databases (HBase + MySQL)
Non-RDD data

For non-RDD data, a Spark application can read and write MySQL directly over a JDBC connection:
```scala
import java.sql.{Connection, DriverManager, PreparedStatement}

/*
 tableName = "tempTableName"
 columns   = [key : String, value : Int]
 DBIP      = 10.10.10.10
 DBPort    = 8888
 DB        = tempDB
*/
def connection2Mysql() = {
  var conn: Connection = null
  var ps: PreparedStatement = null
  val userName = "admin"
  val passwd = "admin"
  val key = "Tom"
  val value = 1024
  val sql = "INSERT INTO tempTableName(key,value) values (?,?)"
  try {
    Class.forName("com.mysql.jdbc.Driver").newInstance
    conn = DriverManager.getConnection("jdbc:mysql://10.10.10.10:8888/tempDB", userName, passwd)
    ps = conn.prepareStatement(sql)
    ps.setString(1, key) // key is a String, so setString rather than setDate
    ps.setInt(2, value)  // value is an Int
    ps.executeUpdate()
  } catch {
    case e: Exception => println("----> Exception!\t:\t" + e + "<-----")
  } finally {
    if (ps != null) ps.close()
    if (conn != null) conn.close()
  }
}
```
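The code above sends one row per `executeUpdate()` round-trip. The "JDBC batch insert performance" thread cited earlier boils down to grouping rows with `addBatch()`/`executeBatch()` instead. Here is a minimal sketch of the chunking step only; the `chunk` helper, its batch size, and the class name are illustrative assumptions, not part of the original code:

```java
import java.util.ArrayList;
import java.util.List;

public class BatchChunk {
    // Split rows into fixed-size batches; each batch would then be sent
    // with one ps.addBatch() per row followed by a single ps.executeBatch(),
    // amortizing the network round-trip over batchSize rows.
    static <T> List<List<T>> chunk(List<T> rows, int batchSize) {
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < rows.size(); i += batchSize) {
            batches.add(rows.subList(i, Math.min(i + batchSize, rows.size())));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < 10; i++) rows.add(i);
        // 10 rows at batch size 4 -> batches of 4, 4 and 2
        System.out.println(chunk(rows, 4).size());
    }
}
```

With MySQL Connector/J, adding `rewriteBatchedStatements=true` to the JDBC URL lets the driver collapse a batch into a single multi-row INSERT, which usually helps further.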
One problem you may hit with this approach: running the client locally for a functional check works fine, but after submitting the job to YARN it fails with
java.sql.SQLException: No suitable driver found for jdbc:mysql://10.10.10.10:8888/tempDB
or
Error:java.lang.ClassNotFoundException: com.mysql.jdbc.Driver
This happens because the driver class referenced in the code cannot be found, i.e. the line `Class.forName("com.mysql.jdbc.Driver").newInstance` fails. Its purpose is to register the driver; the Driver source shows why:
```java
/**
 * The Java SQL framework allows for multiple database drivers. Each driver
 * should supply a class that implements the Driver interface.
 *
 * The DriverManager will try to load as many drivers as it can find and then
 * for any given connection request, it will ask each driver in turn to try to
 * connect to the target URL.
 *
 * It is strongly recommended that each Driver class should be small and
 * standalone so that the Driver class can be loaded and queried without
 * bringing in vast quantities of supporting code.
 *
 * When a Driver class is loaded, it should create an instance of itself and
 * register it with the DriverManager. This means that a user can load and
 * register a driver by doing Class.forName("foo.bah.Driver")
 */
public class Driver extends NonRegisteringDriver implements java.sql.Driver {
    // Register ourselves with the DriverManager
    static {
        try {
            java.sql.DriverManager.registerDriver(new Driver());
        } catch (SQLException E) {
            throw new RuntimeException("Can't register driver!");
        }
    }

    /**
     * Construct a new driver and register it with DriverManager
     *
     * @throws SQLException if a database error occurs.
     */
    public Driver() throws SQLException {
        // Required for Class.forName().newInstance()
    }
}
```
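The first error is easy to reproduce without Spark at all: when no registered driver accepts a `jdbc:mysql://` URL, `DriverManager` itself raises it. A minimal stand-alone check, assuming the MySQL connector jar is absent from the classpath (host and port are the placeholder values from the example above):

```java
import java.sql.DriverManager;
import java.sql.SQLException;

public class DriverCheck {
    public static void main(String[] args) {
        try {
            // With no MySQL driver registered, no driver accepts this URL,
            // so DriverManager throws before any network I/O happens.
            DriverManager.getConnection("jdbc:mysql://10.10.10.10:8888/tempDB");
        } catch (SQLException e) {
            System.out.println(e.getMessage());
            // → No suitable driver found for jdbc:mysql://10.10.10.10:8888/tempDB
        }
    }
}
```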
There are several possible causes:
1. Adding the MySQL jar only via the `--jars` option does not help. The driver's classpath must be set with the `--driver-class-path` option:
$ bin/spark-submit --master local[2] --driver-class-path lib/mysql-connector-java-5.1.38.jar --class <main-class> temp.jar
The reason is that the two options distribute jars to different nodes (link):
- `--driver-class-path`: jars the driver depends on; separate multiple entries with colons (:)
- `--jars`: jars both the driver and the executors need; separate multiple entries with commas (,)
2. The connector jar was not bundled when packaging the application with its dependencies.
When packaging with dependencies, the connector jar must end up inside the final jar; only then can the class name be resolved and the driver registered successfully.
For example, after adding the dependency in build.sbt:
```scala
// https://mvnrepository.com/artifact/mysql/mysql-connector-java
libraryDependencies += "mysql" % "mysql-connector-java" % "5.1.38"
```
you still need to add the corresponding jar under Artifacts (when building the jar from the IDE).
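If the jar is built with sbt itself rather than from the IDE, the sbt-assembly plugin is a common way to produce a fat jar that already contains the connector. A sketch, assuming an sbt project layout; the plugin version below is illustrative, pick one matching your sbt release:

```scala
// project/plugins.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")
```

Running `sbt assembly` then produces a single jar with all dependencies merged in, so `com.mysql.jdbc.Driver` is on the classpath on every node.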