Windows: Developing Spark Programs in Java with Eclipse [1]
Published: 2019-06-19



Prerequisites: a local environment with JDK 1.8 and Hadoop 2.8.1 (matching the Hadoop version on the server).

Step 1:

    Download the Windows builds of the files in Hadoop's bin directory and overwrite the bin folder of your local Hadoop installation with them. The download URL is:

Also note: the downloaded native libraries are 64-bit, so they must be run on a 64-bit Windows system.
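A small sketch (not from the original post) of one way to wire this up before creating a SparkContext: Hadoop's Windows shim resolves winutils.exe through the HADOOP_HOME environment variable or the hadoop.home.dir system property. The path used here is only an example; adjust it to your install.

import java.io.File;

public class HadoopHomeCheck {
    public static void main(String[] args) {
        // Example path only -- point this at the local Hadoop install
        // whose bin directory was overwritten with the Windows binaries.
        System.setProperty("hadoop.home.dir", "D:\\hadoop-2.8.1");

        // Hadoop's Windows support looks for bin\winutils.exe under this
        // directory; check it exists before running any Spark job.
        File winutils = new File(System.getProperty("hadoop.home.dir"), "bin\\winutils.exe");
        System.out.println("winutils.exe present: " + winutils.exists());
    }
}

If the file is missing, Spark jobs on Windows typically fail at startup with a "Could not locate executable ... winutils.exe" error.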

Step 2:

    Create a new Maven project and copy the configuration files core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml from the server's etc/hadoop directory into the project's resources directory.
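For reference, the setting that matters most in these files is the default filesystem, which is what lets ctx.textFile("/README.md") later resolve to HDFS. Below is a minimal core-site.xml sketch; the hdfs://master:9000 address matches the logs later in this post, but the file copied from your server is authoritative.

<configuration>
    <property>
        <!-- NameNode address; the logs below show hdfs://master:9000 -->
        <name>fs.defaultFS</name>
        <value>hdfs://master:9000</value>
    </property>
</configuration>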

Step 3:

Add the Maven pom dependencies:

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.8.1</version>
    <exclusions>
        <exclusion>
            <groupId>javax.servlet</groupId>
            <artifactId>servlet-api</artifactId>
        </exclusion>
    </exclusions>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.3.1</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.11</artifactId>
    <version>2.3.1</version>
</dependency>
Note:

    1. Local debugging starts the Jetty container embedded in the Spark jars, which requires servlet-api 3.0 or later; Hadoop pulls in servlet-api 2.5, so it must be excluded (hence the exclusion in the pom above). One way to verify the exclusion is shown below.

    2. The Hadoop and Spark versions must match the ones you actually run. For the exact version mapping, look it up yourself or check the guidance on the Spark website; my server runs Spark 2.2.0, and the site recommends the Scala 2.11 builds.
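To confirm the old servlet-api is actually gone after adding the exclusion, one option (a sketch; run it from the project root) is to filter Maven's dependency tree for that artifact. It should print no matches once the exclusion is in place:

mvn dependency:tree -Dincludes=javax.servlet:servlet-api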

Step 4:

    Create the test class:

import java.io.IOException;
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class SparkTest {
    public static void main(String[] args) throws IOException {
        SparkConf sparkConf = new SparkConf().setAppName("SparkTest1").setMaster("local[2]");
        JavaSparkContext ctx = new JavaSparkContext(sparkConf);
        // /README.md resolves to HDFS via fs.defaultFS from core-site.xml.
        JavaRDD<String> jpr = ctx.textFile("/README.md");
        // Split each line into words.
        JavaRDD<String> words = jpr.flatMap(line -> Arrays.asList(line.split(" ")).iterator());
        // Pair each word with 1, then sum the counts per word.
        JavaPairRDD<String, Integer> counts = words.mapToPair(w -> new Tuple2<String, Integer>(w, 1))
                .reduceByKey((x, y) -> x + y);
        List<Tuple2<String, Integer>> output = counts.collect();
        for (Tuple2<String, Integer> tuple : output) {
            System.out.println(tuple._1() + " : " + tuple._2());
        }
        ctx.close();
    }
}

    

Console output:

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties

18/09/20 16:45:53 INFO SparkContext: Running Spark version 2.3.1
18/09/20 16:45:53 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/09/20 16:45:53 INFO SparkContext: Submitted application: SparkTest1
18/09/20 16:45:53 INFO SecurityManager: Changing view acls to: yaotang.zhang
18/09/20 16:45:53 INFO SecurityManager: Changing modify acls to: yaotang.zhang
18/09/20 16:45:53 INFO SecurityManager: Changing view acls groups to: 
18/09/20 16:45:53 INFO SecurityManager: Changing modify acls groups to: 
18/09/20 16:45:53 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(yaotang.zhang); groups with view permissions: Set(); users  with modify permissions: Set(yaotang.zhang); groups with modify permissions: Set()
18/09/20 16:45:54 INFO Utils: Successfully started service 'sparkDriver' on port 55765.
18/09/20 16:45:54 INFO SparkEnv: Registering MapOutputTracker
18/09/20 16:45:54 INFO SparkEnv: Registering BlockManagerMaster
18/09/20 16:45:54 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
18/09/20 16:45:54 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
18/09/20 16:45:54 INFO DiskBlockManager: Created local directory at C:\Users\yaotang.zhang\AppData\Local\Temp\blockmgr-af3da18a-9562-4aa7-92ee-e8c46139b185
18/09/20 16:45:54 INFO MemoryStore: MemoryStore started with capacity 898.5 MB
18/09/20 16:45:54 INFO SparkEnv: Registering OutputCommitCoordinator
18/09/20 16:45:54 INFO Utils: Successfully started service 'SparkUI' on port 4040.
18/09/20 16:45:54 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://zhangyaotang:4040
18/09/20 16:45:54 INFO Executor: Starting executor ID driver on host localhost
18/09/20 16:45:54 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 55782.
18/09/20 16:45:54 INFO NettyBlockTransferService: Server created on zhangyaotang:55782
18/09/20 16:45:54 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
18/09/20 16:45:54 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, zhangyaotang, 55782, None)
18/09/20 16:45:54 INFO BlockManagerMasterEndpoint: Registering block manager zhangyaotang:55782 with 898.5 MB RAM, BlockManagerId(driver, zhangyaotang, 55782, None)
18/09/20 16:45:54 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, zhangyaotang, 55782, None)
18/09/20 16:45:54 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, zhangyaotang, 55782, None)
18/09/20 16:45:55 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 317.7 KB, free 898.2 MB)
18/09/20 16:45:55 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 27.5 KB, free 898.2 MB)
18/09/20 16:45:55 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on zhangyaotang:55782 (size: 27.5 KB, free: 898.5 MB)
18/09/20 16:45:55 INFO SparkContext: Created broadcast 0 from textFile at SparkTest.java:18
18/09/20 16:45:56 INFO FileInputFormat: Total input files to process : 1
18/09/20 16:45:56 INFO SparkContext: Starting job: collect at SparkTest.java:24
18/09/20 16:45:56 INFO DAGScheduler: Registering RDD 3 (mapToPair at SparkTest.java:22)
18/09/20 16:45:56 INFO DAGScheduler: Got job 0 (collect at SparkTest.java:24) with 2 output partitions
18/09/20 16:45:56 INFO DAGScheduler: Final stage: ResultStage 1 (collect at SparkTest.java:24)
18/09/20 16:45:56 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 0)
18/09/20 16:45:56 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 0)
18/09/20 16:45:56 INFO DAGScheduler: Submitting ShuffleMapStage 0 (MapPartitionsRDD[3] at mapToPair at SparkTest.java:22), which has no missing parents
18/09/20 16:45:56 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 5.9 KB, free 898.2 MB)
18/09/20 16:45:56 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 3.2 KB, free 898.2 MB)
18/09/20 16:45:56 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on zhangyaotang:55782 (size: 3.2 KB, free: 898.5 MB)
18/09/20 16:45:56 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1039
18/09/20 16:45:56 INFO DAGScheduler: Submitting 2 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[3] at mapToPair at SparkTest.java:22) (first 15 tasks are for partitions Vector(0, 1))
18/09/20 16:45:56 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
18/09/20 16:45:56 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, executor driver, partition 0, ANY, 7864 bytes)
18/09/20 16:45:56 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, executor driver, partition 1, ANY, 7864 bytes)
18/09/20 16:45:56 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
18/09/20 16:45:56 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
18/09/20 16:45:56 INFO HadoopRDD: Input split: hdfs://master:9000/README.md:0+1904
18/09/20 16:45:56 INFO HadoopRDD: Input split: hdfs://master:9000/README.md:1904+1905
18/09/20 16:45:56 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 1197 bytes result sent to driver
18/09/20 16:45:56 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 1197 bytes result sent to driver
18/09/20 16:45:56 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 509 ms on localhost (executor driver) (1/2)
18/09/20 16:45:56 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 528 ms on localhost (executor driver) (2/2)
18/09/20 16:45:56 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 
18/09/20 16:45:56 INFO DAGScheduler: ShuffleMapStage 0 (mapToPair at SparkTest.java:22) finished in 0.592 s
18/09/20 16:45:56 INFO DAGScheduler: looking for newly runnable stages
18/09/20 16:45:56 INFO DAGScheduler: running: Set()
18/09/20 16:45:56 INFO DAGScheduler: waiting: Set(ResultStage 1)
18/09/20 16:45:56 INFO DAGScheduler: failed: Set()
18/09/20 16:45:56 INFO DAGScheduler: Submitting ResultStage 1 (ShuffledRDD[4] at reduceByKey at SparkTest.java:23), which has no missing parents
18/09/20 16:45:56 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 3.7 KB, free 898.2 MB)
18/09/20 16:45:56 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 2.1 KB, free 898.1 MB)
18/09/20 16:45:56 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on zhangyaotang:55782 (size: 2.1 KB, free: 898.5 MB)
18/09/20 16:45:56 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1039
18/09/20 16:45:56 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 1 (ShuffledRDD[4] at reduceByKey at SparkTest.java:23) (first 15 tasks are for partitions Vector(0, 1))
18/09/20 16:45:56 INFO TaskSchedulerImpl: Adding task set 1.0 with 2 tasks
18/09/20 16:45:56 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 2, localhost, executor driver, partition 0, ANY, 7649 bytes)
18/09/20 16:45:56 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 3, localhost, executor driver, partition 1, ANY, 7649 bytes)
18/09/20 16:45:56 INFO Executor: Running task 0.0 in stage 1.0 (TID 2)
18/09/20 16:45:56 INFO Executor: Running task 1.0 in stage 1.0 (TID 3)
18/09/20 16:45:56 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
18/09/20 16:45:56 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
18/09/20 16:45:57 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 7 ms
18/09/20 16:45:57 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 7 ms
18/09/20 16:45:57 INFO Executor: Finished task 1.0 in stage 1.0 (TID 3). 4579 bytes result sent to driver
18/09/20 16:45:57 INFO Executor: Finished task 0.0 in stage 1.0 (TID 2). 4723 bytes result sent to driver
18/09/20 16:45:57 INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID 3) in 53 ms on localhost (executor driver) (1/2)
18/09/20 16:45:57 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 53 ms on localhost (executor driver) (2/2)
18/09/20 16:45:57 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool 
18/09/20 16:45:57 INFO DAGScheduler: ResultStage 1 (collect at SparkTest.java:24) finished in 0.065 s
18/09/20 16:45:57 INFO DAGScheduler: Job 0 finished: collect at SparkTest.java:24, took 0.718616 s
package : 1
this : 1
Version"](http://spark.apache.org/docs/latest/building-spark.html#specifying-the-hadoop-version) : 1
Because : 1
Python : 2
page](http://spark.apache.org/documentation.html). : 1
cluster. : 1
its : 1
[run : 1
general : 3
have : 1
pre-built : 1
YARN, : 1
locally : 2
changed : 1
locally. : 1
sc.parallelize(1 : 1
only : 1
several : 1
This : 2
basic : 1
Configuration : 1
learning, : 1
documentation : 3
first : 1
graph : 1
Hive : 2
info : 1
["Specifying : 1
"yarn" : 1
[params]`. : 1
[project : 1
prefer : 1
SparkPi : 2
<http://spark.apache.org/> : 1
engine : 1
version : 1
file : 1
documentation, : 1
MASTER : 1
example : 3
["Parallel : 1
are : 1
params : 1
scala> : 1
DataFrames, : 1
provides : 1
refer : 2
configure : 1
Interactive : 2
R, : 1
can : 7
build : 4
when : 1
easiest : 1
Apache : 1
systems. : 1
thread : 1
how : 3
package. : 1
1000).count() : 1
Note : 1
Data. : 1
>>> : 1
Scala : 2
Alternatively, : 1
tips, : 1
variable : 1
submit : 1
Testing : 1
Streaming : 1
module, : 1
Developer : 1
thread, : 1
rich : 1
them, : 1
detailed : 2
stream : 1
GraphX : 1
distribution : 1
review : 1
Please : 4
return : 2
is : 6
Thriftserver : 1
same : 1
start : 1
built : 1
one : 3
with : 4
Spark](#building-spark). : 1
Spark"](http://spark.apache.org/docs/latest/building-spark.html). : 1
data : 1
Contributing : 1
using : 5
talk : 1
Shell : 2
class : 2
Tools"](http://spark.apache.org/developer-tools.html). : 1
README : 1
computing : 1
Python, : 2
example: : 1
## : 9
from : 1
set : 2
building : 2
N : 1
Hadoop-supported : 1
other : 1
Example : 1
analysis. : 1
runs. : 1
Building : 1
higher-level : 1
need : 1
Big : 1
fast : 1
guide, : 1
Java, : 1
<class> : 1
uses : 1
SQL : 2
will : 1
information : 1
IDE, : 1
requires : 1
get : 1
 : 71
guidance : 2
Documentation : 1
web : 1
cluster : 2
using: : 1
MLlib : 1
contributing : 1
shell: : 2
Scala, : 1
supports : 2
built, : 1
tests](http://spark.apache.org/developer-tools.html#individual-tests). : 1
./dev/run-tests : 1
build/mvn : 1
sample : 1
For : 3
Programs : 1
Spark : 16
particular : 2
The : 1
than : 1
processing. : 1
APIs : 1
computation : 1
Try : 1
[Configuration : 1
./bin/pyspark : 1
A : 1
through : 1
# : 1
library : 1
following : 2
More : 1
which : 2
also : 4
storage : 1
should : 2
To : 2
for : 12
Once : 1
["Useful : 1
setup : 1
mesos:// : 1
Maven](http://maven.apache.org/). : 1
latest : 1
processing, : 1
the : 24
your : 1
not : 1
different : 1
distributions. : 1
given. : 1
About : 1
if : 4
instructions. : 1
be : 2
do : 2
Tests : 1
no : 1
project. : 1
./bin/run-example : 2
programs, : 1
including : 4
`./bin/run-example : 1
Spark. : 1
Versions : 1
started : 1
HDFS : 1
by : 1
individual : 1
spark:// : 1
It : 2
Maven : 1
an : 4
programming : 1
-T : 1
machine : 1
run: : 1
environment : 1
clean : 1
1000: : 2
And : 1
guide](http://spark.apache.org/contributing.html) : 1
developing : 1
run : 7
./bin/spark-shell : 1
URL, : 1
"local" : 1
MASTER=spark://host:7077 : 1
on : 7
You : 4
threads. : 1
against : 1
[Apache : 1
help : 1
print : 1
tests : 2
examples : 2
at : 2
in : 6
-DskipTests : 1
3"](https://cwiki.apache.org/confluence/display/MAVEN/Parallel+builds+in+Maven+3). : 1
development : 1
Maven, : 1
graphs : 1
downloaded : 1
versions : 1
usage : 1
builds : 1
online : 1
Guide](http://spark.apache.org/docs/latest/configuration.html) : 1
abbreviated : 1
comes : 1
directory. : 1
overview : 1
[building : 1
`examples` : 2
optimized : 1
Many : 1
Running : 1
way : 1
use : 3
Online : 1
site, : 1
running : 1
[Contribution : 1
find : 1
sc.parallelize(range(1000)).count() : 1
contains : 1
project : 1
you : 4
Pi : 1
that : 2
protocols : 1
a : 8
or : 3
high-level : 1
name : 1
Hadoop, : 2
to : 17
available : 1
(You : 1
core : 1
more : 1
see : 3
of : 5
tools : 1
"local[N]" : 1
programs : 2
option : 1
package.) : 1
["Building : 1
instance: : 1
must : 1
and : 9
command, : 2
system : 1
Hadoop : 3
18/09/20 16:45:57 INFO SparkUI: Stopped Spark web UI at http://zhangyaotang:4040
18/09/20 16:45:57 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
18/09/20 16:45:57 INFO MemoryStore: MemoryStore cleared
18/09/20 16:45:57 INFO BlockManager: BlockManager stopped
18/09/20 16:45:57 INFO BlockManagerMaster: BlockManagerMaster stopped
18/09/20 16:45:57 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
18/09/20 16:45:57 INFO SparkContext: Successfully stopped SparkContext
18/09/20 16:45:57 INFO ShutdownHookManager: Shutdown hook called
18/09/20 16:45:57 INFO ShutdownHookManager: Deleting directory C:\Users\yaotang.zhang\AppData\Local\Temp\spark-2053c47d-4005-4ee5-9335-b8b618a54a7f
 

The README.md file had been uploaded to the HDFS cluster in advance. With that, this simple demo of operating Spark from Java is complete.
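As a possible follow-up (not part of the original demo), the same job can write its results back to HDFS instead of collecting them to the driver. A sketch, assuming the output directory /output/wordcount does not exist yet (saveAsTextFile fails if it does):

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class SparkSaveTest {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("SparkSaveTest").setMaster("local[2]");
        // JavaSparkContext is Closeable, so try-with-resources cleans up.
        try (JavaSparkContext ctx = new JavaSparkContext(conf)) {
            ctx.textFile("/README.md")
               .flatMap(line -> Arrays.asList(line.split(" ")).iterator())
               .mapToPair(w -> new Tuple2<>(w, 1))
               .reduceByKey((x, y) -> x + y)
               // Writes part-* files to HDFS; the target must not already exist.
               .saveAsTextFile("/output/wordcount");
        }
    }
}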

Reprinted from: https://my.oschina.net/u/3398895/blog/2088103
