Prerequisites: a local environment with JDK 1.8 and Hadoop 2.8.1 (the same Hadoop version as on the server).
Step 1:
Download the Windows builds of the files under Hadoop's bin directory (winutils.exe and the accompanying native libraries) and use them to replace the files in the bin directory of your local Hadoop installation. The download URL is:
Also note: the downloaded native libraries are 64-bit, so they must be run on a 64-bit Windows system. Copy the bin directory from the downloaded package over the bin folder of your local Hadoop installation.
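If the job later fails because winutils.exe cannot be found, a common workaround is to point Hadoop at the local installation explicitly, either through the HADOOP_HOME environment variable or with a line at the top of main() before the SparkContext is created. A minimal sketch (the path is only an example):

    // Example only -- point this at the local Hadoop 2.8.1 directory that contains the Windows bin files
    System.setProperty("hadoop.home.dir", "D:\\hadoop-2.8.1");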
Step 2:
Create a new Maven project and copy the configuration files core-site.xml, hdfs-site.xml, mapred-site.xml and yarn-site.xml from the etc/hadoop directory of the server-side Hadoop installation into the project's resources directory.
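With those files on the classpath, the Hadoop client picks up the cluster addresses automatically. As an optional sanity check, a small sketch like the following (the class name HdfsCheck is purely illustrative) prints fs.defaultFS and verifies that the HDFS file used later in this post is reachable:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsCheck {
        public static void main(String[] args) throws Exception {
            // core-site.xml / hdfs-site.xml in resources are loaded automatically from the classpath
            Configuration conf = new Configuration();
            System.out.println("fs.defaultFS = " + conf.get("fs.defaultFS"));
            try (FileSystem fs = FileSystem.get(conf)) {
                System.out.println("/README.md exists: " + fs.exists(new Path("/README.md")));
            }
        }
    }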
Step 3:
Add the Maven dependencies to pom.xml:
    <dependencies>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>2.8.1</version>
            <exclusions>
                <exclusion>
                    <groupId>javax.servlet</groupId>
                    <artifactId>servlet-api</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>2.3.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.11</artifactId>
            <version>2.3.1</version>
        </dependency>
    </dependencies>
Notes:
1. Local debugging starts the Jetty container embedded in the Spark jars, which requires servlet-api 3.0 or later, so the servlet-api 2.5 that Hadoop pulls in must be excluded (a quick classpath check is sketched after these notes).
2. The Hadoop and Spark versions must match the versions you are actually running; look up the exact version compatibility yourself, or check the corresponding notes on the Spark website.
My server uses Spark 2.2.0, and the website recommends the Scala 2.11 builds.
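To confirm the exclusion took effect, one option is to print which jar the javax.servlet classes are loaded from, for example with a throwaway line like this (a sketch, not part of the demo):

    // If the exclusion works, this should point at a Servlet 3.x jar brought in by Spark,
    // not the servlet-api 2.5 jar from the Hadoop dependency tree.
    System.out.println(javax.servlet.ServletRequest.class
            .getProtectionDomain().getCodeSource().getLocation());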
Step 4:
Create a test class:
    import java.io.IOException;
    import java.util.Arrays;
    import java.util.List;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    import scala.Tuple2;

    public class SparkTest {
        public static void main(String[] args) throws IOException {
            // Run locally with two worker threads
            SparkConf sparkConf = new SparkConf().setAppName("SparkTest1").setMaster("local[2]");
            JavaSparkContext ctx = new JavaSparkContext(sparkConf);
            // The path is resolved against fs.defaultFS from core-site.xml, i.e. the file on HDFS
            JavaRDD<String> jpr = ctx.textFile("/README.md");
            JavaRDD<String> words = jpr.flatMap(line -> Arrays.asList(line.split(" ")).iterator());
            JavaPairRDD<String, Integer> counts = words
                    .mapToPair(w -> new Tuple2<>(w, 1))
                    .reduceByKey((x, y) -> x + y);
            List<Tuple2<String, Integer>> output = counts.collect();
            for (Tuple2<String, Integer> tuple : output) {
                System.out.println(tuple._1() + " : " + tuple._2());
            }
            ctx.close();
        }
    }
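Note that the bare path "/README.md" resolves against fs.defaultFS from the core-site.xml on the classpath, which is why the log below reads the input split from hdfs://master:9000/README.md; writing the full URI, ctx.textFile("hdfs://master:9000/README.md"), would work just as well.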
Console output:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties 18/09/20 16:45:53 INFO SparkContext: Running Spark version 2.3.1 18/09/20 16:45:53 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 18/09/20 16:45:53 INFO SparkContext: Submitted application: SparkTest1 18/09/20 16:45:53 INFO SecurityManager: Changing view acls to: yaotang.zhang 18/09/20 16:45:53 INFO SecurityManager: Changing modify acls to: yaotang.zhang 18/09/20 16:45:53 INFO SecurityManager: Changing view acls groups to: 18/09/20 16:45:53 INFO SecurityManager: Changing modify acls groups to: 18/09/20 16:45:53 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(yaotang.zhang); groups with view permissions: Set(); users with modify permissions: Set(yaotang.zhang); groups with modify permissions: Set() 18/09/20 16:45:54 INFO Utils: Successfully started service 'sparkDriver' on port 55765. 18/09/20 16:45:54 INFO SparkEnv: Registering MapOutputTracker 18/09/20 16:45:54 INFO SparkEnv: Registering BlockManagerMaster 18/09/20 16:45:54 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information 18/09/20 16:45:54 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up 18/09/20 16:45:54 INFO DiskBlockManager: Created local directory at C:\Users\yaotang.zhang\AppData\Local\Temp\blockmgr-af3da18a-9562-4aa7-92ee-e8c46139b185 18/09/20 16:45:54 INFO MemoryStore: MemoryStore started with capacity 898.5 MB 18/09/20 16:45:54 INFO SparkEnv: Registering OutputCommitCoordinator 18/09/20 16:45:54 INFO Utils: Successfully started service 'SparkUI' on port 4040. 18/09/20 16:45:54 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://zhangyaotang:4040 18/09/20 16:45:54 INFO Executor: Starting executor ID driver on host localhost 18/09/20 16:45:54 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 55782. 
18/09/20 16:45:54 INFO NettyBlockTransferService: Server created on zhangyaotang:55782 18/09/20 16:45:54 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy 18/09/20 16:45:54 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, zhangyaotang, 55782, None) 18/09/20 16:45:54 INFO BlockManagerMasterEndpoint: Registering block manager zhangyaotang:55782 with 898.5 MB RAM, BlockManagerId(driver, zhangyaotang, 55782, None) 18/09/20 16:45:54 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, zhangyaotang, 55782, None) 18/09/20 16:45:54 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, zhangyaotang, 55782, None) 18/09/20 16:45:55 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 317.7 KB, free 898.2 MB) 18/09/20 16:45:55 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 27.5 KB, free 898.2 MB) 18/09/20 16:45:55 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on zhangyaotang:55782 (size: 27.5 KB, free: 898.5 MB) 18/09/20 16:45:55 INFO SparkContext: Created broadcast 0 from textFile at SparkTest.java:18 18/09/20 16:45:56 INFO FileInputFormat: Total input files to process : 1 18/09/20 16:45:56 INFO SparkContext: Starting job: collect at SparkTest.java:24 18/09/20 16:45:56 INFO DAGScheduler: Registering RDD 3 (mapToPair at SparkTest.java:22) 18/09/20 16:45:56 INFO DAGScheduler: Got job 0 (collect at SparkTest.java:24) with 2 output partitions 18/09/20 16:45:56 INFO DAGScheduler: Final stage: ResultStage 1 (collect at SparkTest.java:24) 18/09/20 16:45:56 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 0) 18/09/20 16:45:56 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 0) 18/09/20 16:45:56 INFO DAGScheduler: Submitting ShuffleMapStage 0 (MapPartitionsRDD[3] at mapToPair at SparkTest.java:22), which has no missing parents 18/09/20 16:45:56 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 5.9 KB, free 898.2 MB) 18/09/20 16:45:56 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 3.2 KB, free 898.2 MB) 18/09/20 16:45:56 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on zhangyaotang:55782 (size: 3.2 KB, free: 898.5 MB) 18/09/20 16:45:56 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1039 18/09/20 16:45:56 INFO DAGScheduler: Submitting 2 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[3] at mapToPair at SparkTest.java:22) (first 15 tasks are for partitions Vector(0, 1)) 18/09/20 16:45:56 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks 18/09/20 16:45:56 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, executor driver, partition 0, ANY, 7864 bytes) 18/09/20 16:45:56 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, executor driver, partition 1, ANY, 7864 bytes) 18/09/20 16:45:56 INFO Executor: Running task 0.0 in stage 0.0 (TID 0) 18/09/20 16:45:56 INFO Executor: Running task 1.0 in stage 0.0 (TID 1) 18/09/20 16:45:56 INFO HadoopRDD: Input split: hdfs://master:9000/README.md:0+1904 18/09/20 16:45:56 INFO HadoopRDD: Input split: hdfs://master:9000/README.md:1904+1905 18/09/20 16:45:56 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 1197 bytes result sent to driver 18/09/20 16:45:56 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 
1197 bytes result sent to driver 18/09/20 16:45:56 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 509 ms on localhost (executor driver) (1/2) 18/09/20 16:45:56 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 528 ms on localhost (executor driver) (2/2) 18/09/20 16:45:56 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 18/09/20 16:45:56 INFO DAGScheduler: ShuffleMapStage 0 (mapToPair at SparkTest.java:22) finished in 0.592 s 18/09/20 16:45:56 INFO DAGScheduler: looking for newly runnable stages 18/09/20 16:45:56 INFO DAGScheduler: running: Set() 18/09/20 16:45:56 INFO DAGScheduler: waiting: Set(ResultStage 1) 18/09/20 16:45:56 INFO DAGScheduler: failed: Set() 18/09/20 16:45:56 INFO DAGScheduler: Submitting ResultStage 1 (ShuffledRDD[4] at reduceByKey at SparkTest.java:23), which has no missing parents 18/09/20 16:45:56 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 3.7 KB, free 898.2 MB) 18/09/20 16:45:56 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 2.1 KB, free 898.1 MB) 18/09/20 16:45:56 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on zhangyaotang:55782 (size: 2.1 KB, free: 898.5 MB) 18/09/20 16:45:56 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1039 18/09/20 16:45:56 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 1 (ShuffledRDD[4] at reduceByKey at SparkTest.java:23) (first 15 tasks are for partitions Vector(0, 1)) 18/09/20 16:45:56 INFO TaskSchedulerImpl: Adding task set 1.0 with 2 tasks 18/09/20 16:45:56 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 2, localhost, executor driver, partition 0, ANY, 7649 bytes) 18/09/20 16:45:56 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 3, localhost, executor driver, partition 1, ANY, 7649 bytes) 18/09/20 16:45:56 INFO Executor: Running task 0.0 in stage 1.0 (TID 2) 18/09/20 16:45:56 INFO Executor: Running task 1.0 in stage 1.0 (TID 3) 18/09/20 16:45:56 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks 18/09/20 16:45:56 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks 18/09/20 16:45:57 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 7 ms 18/09/20 16:45:57 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 7 ms 18/09/20 16:45:57 INFO Executor: Finished task 1.0 in stage 1.0 (TID 3). 4579 bytes result sent to driver 18/09/20 16:45:57 INFO Executor: Finished task 0.0 in stage 1.0 (TID 2). 4723 bytes result sent to driver 18/09/20 16:45:57 INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID 3) in 53 ms on localhost (executor driver) (1/2) 18/09/20 16:45:57 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 53 ms on localhost (executor driver) (2/2) 18/09/20 16:45:57 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool 18/09/20 16:45:57 INFO DAGScheduler: ResultStage 1 (collect at SparkTest.java:24) finished in 0.065 s 18/09/20 16:45:57 INFO DAGScheduler: Job 0 finished: collect at SparkTest.java:24, took 0.718616 s
package : 1 this : 1 Version"](http://spark.apache.org/docs/latest/building-spark.html#specifying-the-hadoop-version) : 1 Because : 1 Python : 2 page](http://spark.apache.org/documentation.html). : 1 cluster. : 1 its : 1 [run : 1 general : 3 have : 1 pre-built : 1 YARN, : 1 locally : 2 changed : 1 locally. : 1 sc.parallelize(1 : 1 only : 1 several : 1 This : 2 basic : 1 Configuration : 1 learning, : 1 documentation : 3 first : 1 graph : 1 Hive : 2 info : 1 ["Specifying : 1 "yarn" : 1 [params]`. : 1 [project : 1 prefer : 1 SparkPi : 2 <http://spark.apache.org/> : 1 engine : 1 version : 1 file : 1 documentation, : 1 MASTER : 1 example : 3 ["Parallel : 1 are : 1 params : 1 scala> : 1 DataFrames, : 1 provides : 1 refer : 2 configure : 1 Interactive : 2 R, : 1 can : 7 build : 4 when : 1 easiest : 1 Apache : 1 systems. : 1 thread : 1 how : 3 package. : 1 1000).count() : 1 Note : 1 Data. : 1 >>> : 1 Scala : 2 Alternatively, : 1 tips, : 1 variable : 1 submit : 1 Testing : 1 Streaming : 1 module, : 1 Developer : 1 thread, : 1 rich : 1 them, : 1 detailed : 2 stream : 1 GraphX : 1 distribution : 1 review : 1 Please : 4 return : 2 is : 6 Thriftserver : 1 same : 1 start : 1 built : 1 one : 3 with : 4 Spark](#building-spark). : 1 Spark"](http://spark.apache.org/docs/latest/building-spark.html). : 1 data : 1 Contributing : 1 using : 5 talk : 1 Shell : 2 class : 2 Tools"](http://spark.apache.org/developer-tools.html). : 1 README : 1 computing : 1 Python, : 2 example: : 1 ## : 9 from : 1 set : 2 building : 2 N : 1 Hadoop-supported : 1 other : 1 Example : 1 analysis. : 1 runs. : 1 Building : 1 higher-level : 1 need : 1 Big : 1 fast : 1 guide, : 1 Java, : 1 <class> : 1 uses : 1 SQL : 2 will : 1 information : 1 IDE, : 1 requires : 1 get : 1 : 71 guidance : 2 Documentation : 1 web : 1 cluster : 2 using: : 1 MLlib : 1 contributing : 1 shell: : 2 Scala, : 1 supports : 2 built, : 1 tests](http://spark.apache.org/developer-tools.html#individual-tests). : 1 ./dev/run-tests : 1 build/mvn : 1 sample : 1 For : 3 Programs : 1 Spark : 16 particular : 2 The : 1 than : 1 processing. : 1 APIs : 1 computation : 1 Try : 1 [Configuration : 1 ./bin/pyspark : 1 A : 1 through : 1 # : 1 library : 1 following : 2 More : 1 which : 2 also : 4 storage : 1 should : 2 To : 2 for : 12 Once : 1 ["Useful : 1 setup : 1 mesos:// : 1 Maven](http://maven.apache.org/). : 1 latest : 1 processing, : 1 the : 24 your : 1 not : 1 different : 1 distributions. : 1 given. : 1 About : 1 if : 4 instructions. : 1 be : 2 do : 2 Tests : 1 no : 1 project. : 1 ./bin/run-example : 2 programs, : 1 including : 4 `./bin/run-example : 1 Spark. : 1 Versions : 1 started : 1 HDFS : 1 by : 1 individual : 1 spark:// : 1 It : 2 Maven : 1 an : 4 programming : 1 -T : 1 machine : 1 run: : 1 environment : 1 clean : 1 1000: : 2 And : 1 guide](http://spark.apache.org/contributing.html) : 1 developing : 1 run : 7 ./bin/spark-shell : 1 URL, : 1 "local" : 1 MASTER=spark://host:7077 : 1 on : 7 You : 4 threads. : 1 against : 1 [Apache : 1 help : 1 print : 1 tests : 2 examples : 2 at : 2 in : 6 -DskipTests : 1 3"](https://cwiki.apache.org/confluence/display/MAVEN/Parallel+builds+in+Maven+3). : 1 development : 1 Maven, : 1 graphs : 1 downloaded : 1 versions : 1 usage : 1 builds : 1 online : 1 Guide](http://spark.apache.org/docs/latest/configuration.html) : 1 abbreviated : 1 comes : 1 directory. 
: 1 overview : 1 [building : 1 `examples` : 2 optimized : 1 Many : 1 Running : 1 way : 1 use : 3 Online : 1 site, : 1 running : 1 [Contribution : 1 find : 1 sc.parallelize(range(1000)).count() : 1 contains : 1 project : 1 you : 4 Pi : 1 that : 2 protocols : 1 a : 8 or : 3 high-level : 1 name : 1 Hadoop, : 2 to : 17 available : 1 (You : 1 core : 1 more : 1 see : 3 of : 5 tools : 1 "local[N]" : 1 programs : 2 option : 1 package.) : 1 ["Building : 1 instance: : 1 must : 1 and : 9 command, : 2 system : 1 Hadoop : 3
18/09/20 16:45:57 INFO SparkUI: Stopped Spark web UI at http://zhangyaotang:4040
18/09/20 16:45:57 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
18/09/20 16:45:57 INFO MemoryStore: MemoryStore cleared
18/09/20 16:45:57 INFO BlockManager: BlockManager stopped
18/09/20 16:45:57 INFO BlockManagerMaster: BlockManagerMaster stopped
18/09/20 16:45:57 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
18/09/20 16:45:57 INFO SparkContext: Successfully stopped SparkContext
18/09/20 16:45:57 INFO ShutdownHookManager: Shutdown hook called
18/09/20 16:45:57 INFO ShutdownHookManager: Deleting directory C:\Users\yaotang.zhang\AppData\Local\Temp\spark-2053c47d-4005-4ee5-9335-b8b618a54a7f
The README.md file had been uploaded to the HDFS cluster in advance. With that, this simple demo of driving Spark from Java is complete.
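For completeness, that upload can also be done from Java instead of from the server shell; a minimal sketch using the Hadoop FileSystem API (the class name and local source path are purely illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class UploadReadme {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();   // reads core-site.xml from the classpath
            try (FileSystem fs = FileSystem.get(conf)) {
                // Adjust the local path to wherever README.md lives on your machine
                fs.copyFromLocalFile(new Path("D:/tmp/README.md"), new Path("/README.md"));
            }
        }
    }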