Spark basics exercises (unfinished)
2015-10-14 00:06
1、filter
val rdd = sc.parallelize(List(1,2,3,4,5))
val mappedRDD = rdd.map(2*_)
mappedRDD.collect
val filteredRDD = mappedRDD.filter(_ > 4)
filteredRDD.collect
(The above written as a single chained expression:)
val filteredRDDAgain = sc.parallelize(List(1,2,3,4,5)).map(2 * _).filter(_ > 4).collect
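A quick sanity check of the pipeline (a minimal sketch; the result is what the spark-shell should print for this sample data):
val check = sc.parallelize(List(1, 2, 3, 4, 5)).map(2 * _).filter(_ > 4)
check.collect  // expected: Array(6, 8, 10) -- each element doubled, then only values > 4 kept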
2、wordcount
val rdd = sc.textFile("/data/README.md")
rdd.count
rdd.cache
val wordcount = rdd.flatMap(_.split(' ')).map((_, 1)).reduceByKey(_ + _)
wordcount.collect
wordcount.saveAsTextFile("/data/result")
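Note that saveAsTextFile writes /data/result as a directory of part-NNNNN files, one per partition, not a single file. To inspect a few counts without writing anything out (a minimal sketch):
wordcount.take(5).foreach(println)  // prints up to five (word, count) pairs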
3、sort
rdd.flatMap(_.split(' ')).map((_, 1)).reduceByKey(_ + _).map(x => (x._2, x._1)).sortByKey(false).map(x => (x._2, x._1)).saveAsTextFile("/data/resultsorted")
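Assuming Spark 1.0+, RDD.sortBy gives an equivalent formulation that avoids swapping the pair twice (a sketch, not from the original post; the output path is hypothetical):
rdd.flatMap(_.split(' ')).map((_, 1)).reduceByKey(_ + _).sortBy(_._2, ascending = false).saveAsTextFile("/data/resultsorted2")  // sort by count, descending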
4、union
val rdd1 = sc.parallelize(List(('a',1),('b',1)))
val rdd2 = sc.parallelize(List(('c',1),('d',1)))
val result = rdd1 union rdd2
result.collect
(join works the same way; a sketch follows)
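For example, a minimal join sketch (rdd3 is hypothetical, with keys overlapping rdd1; join pairs up the values that share a key):
val rdd3 = sc.parallelize(List(('a', 2), ('b', 3)))
rdd1.join(rdd3).collect  // expected: Array((a,(1,2)), (b,(1,3))) -- order may vary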
5、Connect to MySQL and create a DataFrame
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.sql.{SaveMode, DataFrame}
import org.apache.spark.sql.hive.HiveContext
val mySQLUrl = "jdbc:mysql://localhost:3306/yangsy?user=root&password=yangsiyi"
val people_DDL = s"""
  |CREATE TEMPORARY TABLE PEOPLE
  |USING org.apache.spark.sql.jdbc
  |OPTIONS (
  |  url '${mySQLUrl}',
  |  dbtable 'person'
  |)""".stripMargin
sqlContext.sql(people_DDL)
val person = sqlContext.sql("SELECT * FROM PEOPLE").cache()
val name = "name"
val targets = person.filter("name = '" + name + "'").collect()
for (line <- targets) {
  val target_name = line(0)
  println(target_name)
  val target_age = line(1)
  println(target_age)
}
![](http://images2015.cnblogs.com/blog/820234/201510/820234-20151026132011591-1168324071.png)
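As an alternative to the temporary-table DDL, assuming Spark 1.4+, the JDBC DataFrame reader achieves the same result (a sketch, not from the original post):
val personDF = sqlContext.read.format("jdbc").options(Map("url" -> mySQLUrl, "dbtable" -> "person")).load()
personDF.filter("name = '" + name + "'").show()  // same filter as above, via the DataFrame API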
6、Manually set the number of Spark SQL tasks
sqlContext.setConf("spark.sql.shuffle.partitions", "10")
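To confirm the setting took effect (getConf is available on SQLContext in Spark 1.x):
sqlContext.getConf("spark.sql.shuffle.partitions")  // expected: "10" after the call above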