Still being polished; all course content will be updated by the end of May.
Configuring standalone Hadoop, Hive, and HBase on Windows 11
The latest Hadoop release at the moment is 3.3.6. (It starts fine for me on JDK 11; with JDK 17 there is an exception when browsing files that I have not been able to solve, so downgrade to JDK 11 and it starts normally.)
Download -- extract -- apply the patch -- configure
Configure environment variables
HADOOP_HOME
the directory Hadoop was extracted to
# add to Path
%HADOOP_HOME%\bin
Apply the patch
Hadoop itself does not run on Windows out of the box; it needs an external patch, which can be downloaded from GitHub: https://github.com/cdarlint/winutils
Clone the repository and copy the two files hadoop.dll and winutils.exe into the %HADOOP_HOME%/bin directory.
Edit the configuration files (all of the files below must be saved with UTF-8 encoding)
Edit core-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<!-- Put site-specific property overrides in this file. -->
<configuration>
<!-- Address and port used by the HDFS API (fs.default.name is the deprecated alias of fs.defaultFS) -->
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/D:/Hadoop3/hadoop/TmpData</value>
</property>
</configuration>
Edit hdfs-site.xml
Adjust the paths below (such as /D:/Hadoop3/hadoop/dfs/name) to match your own directories
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/D:/Hadoop3/hadoop/dfs/name</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/D:/Hadoop3/hadoop/dfs/data</value>
</property>
<!-- NameNode web UI address -->
<property>
<name>dfs.namenode.http-address</name>
<value>localhost:9870</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
</configuration>
Edit mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<!-- Put site-specific property overrides in this file. -->
<configuration>
<!-- MapReduce runtime framework: run on YARN -->
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
<description>The runtime framework for executing MapReduce jobs</description>
</property>
</configuration>
Edit yarn-site.xml
<?xml version="1.0"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<configuration>
<!-- Site specific YARN configuration properties -->
<!-- How MapReduce fetches data during the shuffle phase -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>1024</value>
</property>
<property>
<name>yarn.nodemanager.resource.cpu-vcores</name>
<value>1</value>
</property>
<!-- ResourceManager hostname (optional) -->
<!-- <property> -->
<!-- <name>yarn.resourcemanager.hostname</name> -->
<!-- <value>hadoop3a</value> -->
<!-- </property> -->
</configuration>
Testing the environment... omitted
Start Hadoop
Format the NameNode first (this "format" initializes the Hadoop environment; it does not format your disk)
bin\hdfs.cmd namenode -format
Start DFS and YARN
sbin\start-all.cmd
The startup output looks like this:
Web console (NameNode UI at http://localhost:9870)
At this point, standalone Hadoop 3.3.6 on Windows 11 is done.
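To double-check that the standalone HDFS is reachable on the port set in core-site.xml, a small Java probe using the Hadoop FileSystem API can be run from an IDE. This is only a sketch, assuming the hadoop-client dependency (see the Maven section further below) is on the classpath; the paths used here are placeholders:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsSmokeTest {
    public static void main(String[] args) throws Exception {
        // If HADOOP_HOME is not picked up from the environment on Windows,
        // hadoop.home.dir can be set programmatically (point it at the directory holding winutils).
        // System.setProperty("hadoop.home.dir", "D:/Hadoop3/hadoop");
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000"); // same value as core-site.xml
        try (FileSystem fs = FileSystem.get(conf)) {
            fs.mkdirs(new Path("/tmp/smoke-test"));          // create a scratch directory
            for (FileStatus status : fs.listStatus(new Path("/"))) {
                System.out.println(status.getPath());        // list the HDFS root
            }
        }
    }
}
If this lists the root directory without errors, the NameNode and the winutils patch are working.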
Hive installation
As you can see, this uses the latest version at the time, 3.1.3
Preparation: two Hive packages are needed, the latest release plus an older release that still ships the cmd scripts
The latest version has dropped the *.cmd scripts, so an older Hive release has to be downloaded to fill them back in before Hive can be used normally
Official download page: Downloads (apache.org)
2.2.0 download page: Index of /dist/hive/hive-2.2.0 (apache.org)
After downloading, extract it, configure the environment variables, and edit hive-site.xml and hive-env.sh
Edit hive-site.xml (cp hive-default.xml.template hive-site.xml)
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/user/hive/warehouse</value>
<description>location of default database for the warehouse</description>
</property>
<!-- Hive's temporary data directory; this path is a directory on HDFS -->
<property>
<name>hive.exec.scratchdir</name>
<value>/tmp/hive</value>
<description>HDFS root scratch dir for Hive jobs which gets created with write all (733) permission. For each connecting user, an HDFS scratch dir: ${hive.exec.scratchdir}/<username> is created, with ${hive.scratch.dir.permission}.</description>
</property>
<!-- Temporary files Hive uses for each query's intermediate data, normally cleaned up by the Hive client when the query finishes; this scratchdir is a local directory -->
<property>
<name>hive.exec.local.scratchdir</name>
<value>D:/Hadoop3/Hive/apache-hive-3.1.3-bin/tmp/${system:user.name}</value>
<!-- <value>D:/bigdata/apache-hive-3.1.3-bin/my_hive/scratch_dir</value> -->
<description>Local scratch space for Hive jobs</description>
</property>
<!-- Temporary local directory for resources added to Hive (resources_dir) -->
<property>
<name>hive.downloaded.resources.dir</name>
<value>D:/Hadoop3/Hive/apache-hive-3.1.3-bin/tmp/${hive.session.id}_resources</value>
<!-- <value>D:/bigdata/apache-hive-3.1.3-bin/my_hive/resources_dir/${hive.session.id}_resources</value> -->
<description>Temporary local directory for added resources in the remote file system.</description>
</property>
<!-- Location of Hive's structured runtime log files (querylog, a local directory) -->
<property>
<name>hive.querylog.location</name>
<value>D:/Hadoop3/Hive/apache-hive-3.1.3-bin/tmp/${system:user.name}</value>
<!-- <value>D:/bigdata/apache-hive-3.1.3-bin/my_hive/querylog_dir</value> -->
<description>Location of Hive run time structured log file</description>
</property>
<!-- Verifies that the metastore schema and the Hive jars match; defaults to true. With false, a version mismatch after a Hive upgrade only produces a warning -->
<!-- Fixes: Caused by: MetaException(message:Version information not found in metastore.) -->
<property>
<name>hive.metastore.schema.verification</name>
<value>false</value>
<description>
Enforce metastore schema version consistency.
True: Verify that version information stored in is compatible with one from Hive jars. Also disable automatic
schema migration attempt. Users are required to manually migrate schema after Hive upgrade which ensures
proper metastore schema migration. (Default)
False: Warn if the version information stored in metastore doesn't match with one from in Hive jars.
</description>
</property>
<!-- Use MySQL as the Hive metastore database -->
<property>
<name>javax.jdo.option.ConnectionURL</name>
<!-- <value>jdbc:mysql://localhost:3306/metastore?createDatabaseIfNotExist=true&characterEncoding=latin1&useSSL=false</value> -->
<value>jdbc:mysql://localhost:3306/metastore?createDatabaseIfNotExist=true&amp;useSSL=false</value>
<description>JDBC connect string for a JDBC metastore.</description>
</property>
<!-- MySQL JDBC driver class -->
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
<description>Driver class name for a JDBC metastore</description>
</property>
<!-- Username for connecting to the MySQL server -->
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
<description>Username to use against metastore database</description>
</property>
<!-- Password for connecting to the MySQL server -->
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>root</value>
<description>password to use against metastore database</description>
</property>
<!-- HiveServer2 Thrift bind host -->
<property>
<name>hive.server2.thrift.bind.host</name>
<value>localhost</value>
</property>
<!-- HiveServer2 Thrift port -->
<property>
<name>hive.server2.thrift.port</name>
<value>10000</value>
</property>
<!-- Host and port of the Thrift metastore server (default is <value/>) -->
<property>
<name>hive.metastore.uris</name>
<value>thrift://localhost:9083</value>
</property>
<!-- Top-level local directory (operation_logs) where operation logs are stored if operation logging is enabled -->
<property>
<name>hive.server2.logging.operation.log.location</name>
<value>D:/Hadoop3/Hive/apache-hive-3.1.3-bin/tmp/${system:user.name}/operation_logs</value>
<!-- <value>D:/bigdata/apache-hive-3.1.3-bin/my_hive/operation_logs_dir</value> -->
<description>Top level directory where operation logs are stored if logging functionality is enabled</description>
</property>
<!-- Auto-create everything -->
<!-- Automatically create the schema when initializing the database -->
<!-- For the error: Required table missing : "DBS" in Catalog "" Schema "" -->
<!-- <property>
<name>datanucleus.schema.autoCreateAll</name>
<value>true</value>
<description>Auto creates necessary schema on a startup if one doesn't exist. Set this to false, after creating it once.To enable auto create also set hive.metastore.schema.verification=false. Auto creation is not recommended for production use cases, run schematool command instead.</description>
</property> -->
</configuration>
Edit hive-env.sh
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Set Hive and Hadoop environment variables here. These variables can be used
# to control the execution of Hive. It should be used by admins to configure
# the Hive installation (so that users do not have to set environment variables
# or set command line parameters to get correct behavior).
#
# The hive service being invoked (CLI etc.) is available via the environment
# variable SERVICE
# Hive Client memory usage can be an issue if a large number of clients
# are running at the same time. The flags below have been useful in
# reducing memory usage:
#
# if [ "$SERVICE" = "cli" ]; then
# if [ -z "$DEBUG" ]; then
# export HADOOP_OPTS="$HADOOP_OPTS -XX:NewRatio=12 -Xms10m -XX:MaxHeapFreeRatio=40 -XX:MinHeapFreeRatio=15 -XX:+UseParNewGC -XX:-UseGCOverheadLimit"
# else
# export HADOOP_OPTS="$HADOOP_OPTS -XX:NewRatio=12 -Xms10m -XX:MaxHeapFreeRatio=40 -XX:MinHeapFreeRatio=15 -XX:-UseGCOverheadLimit"
# fi
# fi
# The heap size of the jvm started by hive shell script can be controlled via:
#
# export HADOOP_HEAPSIZE=1024
#
# Larger heap size may be required when running queries over large number of files or partitions.
# By default hive shell scripts use a heap size of 256 (MB). Larger heap size would also be
# appropriate for hive server.
# Set HADOOP_HOME to point to a specific hadoop install directory
# HADOOP_HOME=${bin}/../../hadoop
# Hive Configuration Directory can be controlled by:
# export HIVE_CONF_DIR=
# Folder containing extra libraries required for hive compilation/execution can be controlled by:
# export HIVE_AUX_JARS_PATH=
# Heap size for the JVM started by the Hive shell
export HADOOP_HEAPSIZE=2048
# Hadoop installation directory
HADOOP_HOME=%HADOOP_HOME%
# Hive configuration directory
export HIVE_CONF_DIR=%HIVE_CONF_DIR%
# Hive auxiliary jars (lib) directory
export HIVE_AUX_JARS_PATH=D:\env\apache\hadoop3\apache-hive-3.1.3-bin\lib
Restore the missing scripts in the bin directory
Extract the 2.2.0 package and copy its files into the corresponding 3.1.3 directories (copy all, replace all) so the missing cmd scripts are filled in
Initialize the database
Initialize the Hive metadata in the MySQL database
Import hive-schema-3.0.0.mysql.sql from the %HIVE_HOME%\scripts\metastore\upgrade\mysql directory into MySQL
hive --service schematool -dbType mysql -initSchema --verbose
Start Hive
1. Start the Hive metastore
hive --service metastore
2. Start the HiveServer2 service
hive --service hiveserver2
3. Start the Hive command line
hive
Note: Hive's log4j jar conflicts with Hadoop's log4j jar
Note: HDFS must be running before Hive is initialized
Note: the Hive metastore database uses latin1 encoding
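Once hiveserver2 is listening on port 10000 (hive.server2.thrift.port above), it can also be reached from Java over JDBC. A minimal sketch, assuming the hive-jdbc driver is on the classpath; the username here is just illustrative and may need adjusting (see the impersonation note in the error-log section further below):
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcSmokeTest {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver; requires the hive-jdbc jar on the classpath
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // URL matches hive.server2.thrift.bind.host / hive.server2.thrift.port configured above
        try (Connection conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "root", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SHOW DATABASES")) {
            while (rs.next()) {
                System.out.println(rs.getString(1)); // print each database name
            }
        }
    }
}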
HBase installation
Download the latest 2.5.7 release and make the following configuration changes
Edit hbase-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
-->
<configuration>
<!--
The following properties are set for running HBase as a single process on a
developer workstation. With this configuration, HBase is running in
"stand-alone" mode and without a distributed file system. In this mode, and
without further configuration, HBase and ZooKeeper data are stored on the
local filesystem, in a path under the value configured for `hbase.tmp.dir`.
This value is overridden from its default value of `/tmp` because many
systems clean `/tmp` on a regular basis. Instead, it points to a path within
this HBase installation directory.
Running against the `LocalFileSystem`, as opposed to a distributed
filesystem, runs the risk of data integrity issues and data loss. Normally
HBase will refuse to run in such an environment. Setting
`hbase.unsafe.stream.capability.enforce` to `false` overrides this behavior,
permitting operation. This configuration is for the developer workstation
only and __should not be used in production!__
See also https://hbase.apache.org/book.html#standalone_dist
-->
<property>
<!-- HBase root directory (here, the install directory) -->
<name>hbase.rootdir</name>
<value>file:///D:/env/apache/hadoop3/hbase-2.5.7</value>
</property>
<!-- HBase temporary file directory -->
<property>
<name>hbase.tmp.dir</name>
<value>D:/Hadoop3/hbase_tmp</value>
</property>
<!-- Port for the HBase web UI -->
<property>
<name>hbase.master.info.port</name>
<value>60010</value>
</property>
<!-- ZooKeeper quorum host IP (reserved for later) -->
<property>
<name>hbase.zookeeper.quorum</name>
<value>127.0.0.1</value>
</property>
<!-- ZooKeeper data directory (reserved for later) -->
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>D:/Hadoop3/hbase-2.5.7/zoo</value>
</property>
<!-- HBase deployment mode: false means standalone or pseudo-distributed, true means fully distributed -->
<property>
<name>hbase.cluster.distributed</name>
<value>false</value>
</property>
<!-- Must be set in (pseudo-)distributed setups, otherwise HMaster often fails to start -->
<property>
<name>hbase.unsafe.stream.capability.enforce</name>
<value>false</value>
</property>
</configuration>
Edit hbase-env.cmd and append the following at the end
set JAVA_HOME=%JAVA_HOME%
@rem do not use the bundled ZooKeeper for now
set HBASE_MANAGES_ZK=false
set HBASE_CLASSPATH=%HBASE_HOME%/conf
Start and test
Fill in the missing dependencies
# The first three jars can be found in share\hadoop\common\lib of a Hadoop 3.x.x release; the fourth comes straight from the Maven repository
htrace-core4-4.1.0-incubating.jar, slf4j-api-1.7.25.jar, slf4j-log4j12-1.7.25.jar, jansi-1.4.jar
Maven coordinates for jansi-1.4.jar
<dependency>
<groupId>org.fusesource.jansi</groupId>
<artifactId>jansi</artifactId>
<version>1.4</version>
</dependency>
Copy the jars above into %HBASE_HOME%\lib.
Start HBase:
cd %HBASE_HOME%
bin\start-hbase.cmd
The test cases are in the HBase assignment further below
The Linux cluster is as follows:
Environment: docker (a pseudo-cluster is possible), or a VM ==> ubuntu-server 20.04
A pseudo-cluster image I have already configured:
My setup: master is the main node, slave1 and slave2 are the workers
In order:
hadoop:
VM installation steps; Windows would also work in principle (but it is painful)
Two files are needed now:
For hadoop-3.3.4, either JDK 8 or JDK 11 works; these are the two packages
Hadoop can be downloaded from the Tsinghua mirror
Download the JDK manually from the Oracle website
After downloading them to the server, extract each one
Create a module folder under /opt
sudo mkdir /opt/module
Grant the current user access to it; the hadoop user is used as the example here
sudo chown hadoop:hadoop /opt/module
Edit the profile
sudo vim /etc/profile.d/my_env.sh
# add JAVA_HOME and HADOOP_HOME
#JAVA_HOME
export JAVA_HOME=/opt/module/jdk1.8.0_361/
export PATH=$PATH:$JAVA_HOME/bin
#HADOOP_HOME
export HADOOP_HOME=/opt/module/hadoop-3.3.4
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
# save, exit, then reload the profile
. /etc/profile
java -version
hadoop version
# both commands should print a version number
Test with the official example to verify everything works (local / standalone mode)
cd $HADOOP_HOME
mkdir wcinput
cd wcinput
vim word.txt
hadoop yarn
hadoop mapreduce
chinasoft
chinasoft
# save and quit, then go back to $HADOOP_HOME before running the job
cd $HADOOP_HOME
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.4.jar wordcount wcinput wcoutput
Configure passwordless SSH between the cluster nodes (ssh-keygen on each node, then ssh-copy-id to every other node)
hbase:
start-hbase
hbase shell (entering the shell works much like entering the mysql client)
create "table_name","column_family_1","column_family_2","...column_family_n..."
list (show the tables in the database)
describe "table_name" (show the table structure)
put "table_name","row_key","column_family:qualifier","value" (insert and update)
scan "table_name" (scan the table data)
get "table_name","row_key","column_family:qualifier" (read a specific cell)
get "table_name","row_key",{COLUMN=>"column_family:qualifier",TIMESTAMP=>timestamp}
deleteall "table_name","row_key" (delete all data for a row key: table name + row key)
delete "table_name","row_key","column_family:qualifier"
truncate "table_name" (empty the table)
A table must be disabled before it can be dropped
disable "table_name"
drop "table_name"
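The same operations are also available through the HBase Java client. A minimal sketch, assuming the hbase-client dependency is on the classpath, the quorum address matches hbase-site.xml, and a table such as the student table created in the assignment below already exists:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "127.0.0.1"); // same as hbase-site.xml
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("student"))) {
            // shell equivalent: put "student","1001","info:name","zhangsan"
            Put put = new Put(Bytes.toBytes("1001"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("zhangsan"));
            table.put(put);
            // shell equivalent: get "student","1001","info:name"
            Result result = table.get(new Get(Bytes.toBytes("1001")));
            System.out.println(Bytes.toString(result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
        }
    }
}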
Assignment 1:
Statements
create "student","info","Mon","Tue","Wed","Thur","Fri"
# zhangsan
put "student","1001","info:name","zhangsan"
put "student","1001","info:groupnumber","0"
put "student","1001","info:groupleader","zhangsan"
put "student","1001","info:sex","male"
put "student","1001","info:tel","19978250000"
put "student","1001","Mon:E","Spark And Hadoop"
put "student","1001","Tue:A","MapReduce"
put "student","1001","Tue:B","SpringCloud"
put "student","1001","Tue:D","MicroServices"
put "student","1001","Wed:C","Vue"
put "student","1001","Thur:B","H5"
# lisi
put "student","1002","info:name","lisi"
put "student","1002","info:groupnumber","0"
put "student","1002","info:groupleader","zhangsan"
put "student","1002","info:sex","male"
put "student","1002","info:tel","19978251111"
put "student","1002","Mon:E","Spark And Hadoop"
put "student","1002","Tue:A","MapReduce"
put "student","1002","Tue:B","SpringCloud"
put "student","1002","Tue:D","MicroServices"
put "student","1002","Wed:C","Vue"
put "student","1002","Thur:B","H5"
# wangwu
put "student","1003","info:name","wangwu"
put "student","1003","info:groupnumber","0"
put "student","1003","info:groupleader","zhangsan"
put "student","1003","info:sex","female"
put "student","1003","info:tel","19978252222"
put "student","1003","Mon:E","Spark And Hadoop"
put "student","1003","Tue:A","MapReduce"
put "student","1003","Tue:B","SpringCloud"
put "student","1003","Tue:D","MicroServices"
put "student","1003","Wed:C","Vue"
put "student","1003","Thur:B","H5"
# zhaoliu
put "student","1004","info:name","zhaoliu"
put "student","1004","info:groupnumber","0"
put "student","1004","info:groupleader","zhangsan"
put "student","1004","info:sex","male"
put "student","1004","info:tel","19978253333"
put "student","1004","Mon:E","Spark And Hadoop"
put "student","1004","Tue:A","MapReduce"
put "student","1004","Tue:B","SpringCloud"
put "student","1004","Tue:D","MicroServices"
put "student","1004","Wed:C","Vue"
put "student","1004","Thur:B","H5"
# sunqi
put "student","1005","info:name","sunqi"
put "student","1005","info:groupnumber","0"
put "student","1005","info:groupleader","zhangsan"
put "student","1005","info:sex","female"
put "student","1005","info:tel","19978254444"
put "student","1005","Mon:E","Spark And Hadoop"
put "student","1005","Mon:A","Hadoop"
put "student","1005","Tue:A","MapReduce"
put "student","1005","Tue:B","SpringCloud"
put "student","1005","Tue:D","MicroServices"
put "student","1005","Wed:C","Vue"
put "student","1005","Thur:B","H5"
2.
alter "student","Sun"
3.
deleteall "student","1001"
4.
scan "student",{COLUMNS=>['Mon','Tue','Wed','Thur','Fri','Sun'], FILTER=>"PrefixFilter('1001')"}
5.
scan "student",{COLUMN=>"Mon"}
Environment built on Linux
ZooKeeper cluster
# extract and move
tar -zxvf apache-zookeeper-3.7.0.tar.gz
mv apache-zookeeper-3.7.0 /usr/local
# configure zookeeper
# 1. copy the sample file to zoo.cfg
cp zoo_sample.cfg zoo.cfg
# 2. edit zoo.cfg and change the following
vi zoo.cfg
# zookeeper data directory
dataDir=/usr/local/zookeeper/data/zkData
# append at the end: the zookeeper cluster hosts and ports (the number of machines must be odd)
server.1=master:2888:3888
server.2=slave1:2888:3888
server.3=slave2:2888:3888
server.4=slave3:2888:3888
# 3. create the data directory
mkdir /usr/local/zookeeper/data/zkData/ -p
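Once the ZooKeeper nodes are started, a quick connectivity check can be done from Java with the ZooKeeper client. A sketch only: it assumes the zookeeper client jar is on the classpath, the default clientPort 2181, and the master/slave1/slave2 hostnames used above:
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class ZkQuorumCheck {
    public static void main(String[] args) throws Exception {
        // 2181 is the default clientPort; adjust if zoo.cfg uses a different one
        String connect = "master:2181,slave1:2181,slave2:2181";
        CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper(connect, 30000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown(); // session established
            }
        });
        connected.await();
        // list the children of the root znode as a simple liveness check
        System.out.println(zk.getChildren("/", false));
        zk.close();
    }
}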
Spark cluster setup
Reference articles ==>> https://blog.csdn.net/m0_53317797/article/details/127216100
https://www.cnblogs.com/liugp/p/16153043.html#3yarn%E6%8E%A8%E8%8D%90
Configuration files
vim spark-env.sh
export JAVA_HOME=/opt/module/jdk1.8.0_361
export SPARK_DIST_CLASSPATH=$(/opt/module/hadoop-3.3.4/bin/hadoop classpath)
export HADOOP_CONF_DIR=/opt/module/hadoop-3.3.4/etc/hadoop
export SPARK_MASTER_IP=master # this is the address of the master host
export SPARK_MASTER_HOST=master
export SPARK_HISTORY_OPTS="-Dspark.history.ui.port=18080
-Dspark.history.retainedApplications=5
-Dspark.history.fs.logDirectory=hdfs://master:9000/spark-app-history"
export HADOOP_HOME=/opt/module/hadoop-3.3.4
export SPARK_MASTER_PORT=7077
vim spark-defaults.conf
spark.eventLog.enabled           true
spark.eventLog.dir               hdfs://master:9000/spark-app-history
spark.eventLog.compress          true
spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.driver.memory              1g
spark.yarn.historyServer.address master:18080
spark.history.ui.port            18080
spark.yarn.jars                  hdfs:///spark-yarn/jars/*.jar
spark.master                     spark://master:7077
Rename workers.template to workers
cp workers.template workers
vim workers
master
slave1
slave2
Start the Hadoop cluster
Create the event log directory
hdfs dfs -mkdir /spark-app-history
Sync all the modified configuration files to the slave machines
Start the Spark cluster:
cd $SPARK_HOME && ./sbin/start-all.sh
The Spark master web UI listens on port 8080 by default
Final result:
Run the test examples:
cd $SPARK_HOME
spark-submit --class org.apache.spark.examples.SparkPi examples/jars/spark-examples_2.12-3.3.2.jar 1000
spark-submit --master yarn \
--deploy-mode client \
--class org.apache.spark.examples.SparkPi \
$SPARK_HOME/examples/jars/spark-examples_2.12-3.3.2.jar 2000
Shell start/stop script
#!/bin/bash
if [ $# -lt 1 ]
then
echo "No Args Input..."
exit ;
fi
case $1 in
"start")
echo " =================== starting the hadoop cluster ==================="
echo " --------------- starting hdfs ---------------"
ssh master "/opt/module/hadoop-3.3.4/sbin/start-dfs.sh"
echo " --------------- starting yarn ---------------"
ssh slave1 "/opt/module/hadoop-3.3.4/sbin/start-yarn.sh"
echo " =================== starting the spark cluster ==================="
ssh master "/opt/module/spark-3.3.2-bin-hadoop3/sbin/start-all.sh"
echo " --------------- starting historyserver ---------------"
ssh master "/opt/module/hadoop-3.3.4/bin/mapred --daemon start historyserver"
ssh master "/opt/module/spark-3.3.2-bin-hadoop3/sbin/start-history-server.sh"
;;
"stop")
echo " =================== stopping the hadoop cluster ==================="
echo " --------------- stopping historyserver ---------------"
ssh master "/opt/module/spark-3.3.2-bin-hadoop3/sbin/stop-history-server.sh"
ssh master "/opt/module/hadoop-3.3.4/bin/mapred --daemon stop historyserver"
echo " =================== stopping the spark cluster ==================="
ssh master "/opt/module/spark-3.3.2-bin-hadoop3/sbin/stop-all.sh"
echo " --------------- stopping yarn ---------------"
ssh slave1 "/opt/module/hadoop-3.3.4/sbin/stop-yarn.sh"
echo " --------------- stopping hdfs ---------------"
ssh master "/opt/module/hadoop-3.3.4/sbin/stop-dfs.sh"
;;
*)
echo "Input Args Error..."
;;
esac
mapreduce:
omitted for now
hive
Install the MySQL database and edit the Hive configuration files
Scala syntax:
var declares a variable, val declares a constant (once a val has been assigned, it cannot be reassigned)
Basic operations once the Spark cluster is up; some Scala knowledge is required
The RDD concept (Resilient Distributed Dataset)
An RDD is an immutable, distributed collection of objects; each RDD is split into multiple partitions, and those partitions run on different nodes of the cluster
Spark offers two ways to create an RDD: read an external dataset, or parallelize a collection that already lives in memory
Since I manage my files with HDFS, I did not realize at first that the input file has to be referenced through HDFS,
so printing the results kept failing with "HDFS input file does not exist"... Also, on Linux, sc.textFile needs to be replaced with spark.sparkContext.textFile,
otherwise a relative-path exception is thrown
Fixing the URI exception when Spark reads Hadoop files: see the article below
Spark读取和存储HDFS上的数据 - 腾讯云开发者社区-腾讯云 (tencent.com)
After making the changes described in that article, the results come out correctly
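A minimal sketch of the same point from compiled code, using Spark's Java API (spark-shell itself uses Scala): it shows the two RDD creation routes mentioned above and the explicit hdfs:// URI that avoids the "input file does not exist" problem. The host and path are placeholders; submit it with spark-submit as shown earlier:
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RddCreateDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("rdd-create-demo");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // 1) from an external dataset: note the explicit hdfs:// URI;
            //    a bare relative path is resolved against the default filesystem and may not exist
            JavaRDD<String> lines = sc.textFile("hdfs://master:9000/input/word.txt");
            System.out.println("line count: " + lines.count());
            // 2) by parallelizing an in-memory collection
            JavaRDD<Integer> nums = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));
            System.out.println("sum: " + nums.reduce(Integer::sum));
        }
    }
}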
Assignment 2:
------
------
Error logs / troubleshooting
exception in thread main org.apache.spark.SparkException: A master URL must be set in your configuration
# The message means Spark cannot find the master to run against, so the master URL has to be configured. The master URL passed to Spark can be one of the following:
local: run locally with a single thread
local[K]: run locally with K threads (K cores)
local[*]: run locally using all available cores
spark://HOST:PORT: connect to the given Spark standalone cluster master; the port must be specified.
mesos://HOST:PORT: connect to the given Mesos cluster; the port must be specified.
yarn-client (client mode): connect to a YARN cluster; HADOOP_CONF_DIR must be configured.
yarn-cluster (cluster mode): connect to a YARN cluster; HADOOP_CONF_DIR must be configured.
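In code this corresponds to calling setMaster on the SparkConf (or passing --master to spark-submit). A tiny hedged sketch:
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class MasterUrlDemo {
    public static void main(String[] args) {
        // local[*] for local testing; use spark://master:7077 or yarn when running on the cluster
        SparkConf conf = new SparkConf().setAppName("master-url-demo").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            System.out.println("master = " + conf.get("spark.master"));
        }
    }
}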
Hive error log
Hive "User: root is not allowed to impersonate xxx" problem
Fix: add the following to Hadoop's core-site.xml and restart HDFS
<property>
<name>hadoop.proxyuser.xxx.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.xxx.groups</name>
<value>*</value>
</property>
-------
To allow insert/update/delete on tables, also set:
<property>
<name>hive.server2.enable.doAs</name>
<value>false</value>
</property>
Adding a regular user to the HDFS superuser group
Check the permissions with hdfs dfsadmin -report
If the permission is missing, add it
groupadd supergroup
# This adds the root user; to add another user, replace root with the user you need
usermod -a -G supergroup root
# After the change, sync the mapping to HDFS
hdfs dfsadmin -refreshUserToGroupsMappings
# Verify again with
hdfs dfsadmin -report
Course content:
Write a WordCount program
Make sure the Hadoop environment above works; JDK 8 or 11 (in practice pick whichever JDK you prefer)
The code follows directly below.
Maven dependencies
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<maven.compiler.source>1.8</maven.compiler.source>
<maven.compiler.target>1.8</maven.compiler.target>
<!-- hdfs version-->
<hadoop.hdfs.version>3.3.4</hadoop.hdfs.version>
</properties>
<dependencies>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<version>${hadoop.hdfs.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-hdfs</artifactId>
<version>${hadoop.hdfs.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>${hadoop.hdfs.version}</version>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.13.2</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
<version>1.7.30</version>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.8.1</version>
<configuration>
<source>${maven.compiler.source}</source>
<target>${maven.compiler.target}</target>
<encoding>UTF-8</encoding>
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-surefire-plugin</artifactId>
<version>2.9</version>
<configuration>
<skipTests>true</skipTests>
</configuration>
</plugin>
<plugin>
<artifactId>maven-assembly-plugin</artifactId>
<configuration>
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
</configuration>
<executions>
<execution>
<id>make-assembly</id>
<phase>package</phase>
<goals>
<goal>single</goal>
</goals>
</execution>
</executions>
</plugin>
</plugins>
<resources>
<resource>
<directory>src/main/java</directory>
<includes>
<include>**/*.xml</include>
</includes>
</resource>
<resource>
<directory>src/main/resources</directory>
<includes>
<include>**/*</include>
</includes>
</resource>
</resources>
</build>
First, three classes are needed: a Mapper, a Reducer, and a driver
Driver class
package xyz.leeyangy.hdfs.hdfs;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
/**
* @Author liyangyang
* @Date: 2023/05/11 20:56
* @Package xyz.leeyangy.hdfs.hdfs
* @Version 1.0
* @Description:
*/
public class MapReduceWordCountDriver {
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
// 1. Get the configuration and the Job instance
Configuration conf = new Configuration();
Job job = Job.getInstance(conf);
// 2. Set the jar for this driver
job.setJarByClass(MapReduceWordCountDriver.class);
// 3. Set the Mapper and Reducer classes
job.setMapperClass(MapReduceWordCountMapper.class);
job.setReducerClass(MapReduceWordCountReduce.class);
// 4. Set the Mapper output key/value types
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
// 5. Set the final output key/value types
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
// 6. Set the input and output paths
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
// 7. Submit the job
boolean result = job.waitForCompletion(true);
System.exit(result ? 0 : 1);
}
}
Mapper
package xyz.leeyangy.hdfs.hdfs;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
/**
* @Author liyangyang
* @Date: 2023/05/11 19:17
* @Package xyz.leeyangy.hdfs.hdfs
* @Version 1.0
* @Description:
*/
public class MapReduceWordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
Text k = new Text();
IntWritable v = new IntWritable(1);
/**
* @param key
* @param value
* @param context
* @throws IOException
* @throws InterruptedException
*/
@Override
protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, IntWritable>.Context context) throws IOException, InterruptedException {
// Read one line
String line = value.toString();
// Split it into words
String[] words = line.split(" ");
// Emit each word with a count of 1
for (String word : words) {
k.set(word);
context.write(k,v);
}
}
}
Reducer
package xyz.leeyangy.hdfs.hdfs;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
/**
* @Author liyangyang
* @Date: 2023/05/11 19:50
* @Package xyz.leeyangy.hdfs.hdfs
* @Version 1.0
* @Description:
*/
public class MapReduceWordCountReduce extends Reducer<Text, IntWritable, Text, IntWritable> {
// running total for a word
int sum;
IntWritable v = new IntWritable();
/**
* @param key
* @param values
* @param context
* @throws IOException
* @throws InterruptedException
*/
@Override
protected void reduce(Text key, Iterable<IntWritable> values, Reducer<Text, IntWritable, Text, IntWritable>.Context context) throws IOException, InterruptedException {
// Sum up the counts
sum = 0;
for (IntWritable count:values){
sum += count.get();
}
v.set(sum);
context.write(key,v);
}
}
After packaging with Maven, upload the compiled jar to the server
The content of /input/word.txt is as follows (the file has already been uploaded to HDFS)
hadoop yarn
hadoop mapreduce
chinasoft
chinasoft
hadoop jar /home/hadoop/temp/hadoop_demo-1.0-SNAPSHOT.jar xyz.leeyangy.hdfs.hdfs.MapReduceWordCountDriver /input/word.txt /user/hadoop/output
Network traffic statistics
package xyz.leeyangy.hdfs.flow;
import org.apache.hadoop.io.Writable;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
/**
* @Author liyangyang
* @Date: 2023/05/13 1:57
* @Package xyz.leeyangy.hdfs.mybean
* @Version 1.0
* @Description: Traffic statistics
*/
public class FlowBean implements Writable {
// Upstream traffic
private Long upFlow;
// Downstream traffic
private Long downFlow;
// Total traffic
private Long sumFlow;
// Deserialization uses reflection to call the no-argument constructor, so one must be provided
public FlowBean() {
}
public Long getUpFlow() {
return upFlow;
}
public void setUpFlow(Long upFlow) {
this.upFlow = upFlow;
}
public Long getDownFlow() {
return downFlow;
}
public void setDownFlow(Long downFlow) {
this.downFlow = downFlow;
}
public Long getSumFlow() {
return sumFlow;
}
public void setSumFlow(Long sumFlow) {
this.sumFlow = sumFlow;
}
// Serialization and deserialization methods
/**
* @param dataOutput
* @throws IOException
*/
@Override
public void write(DataOutput dataOutput) throws IOException {
dataOutput.writeLong(upFlow);
dataOutput.writeLong(downFlow);
dataOutput.writeLong(sumFlow);
}
/**
* @param dataInput
* @throws IOException
*/
@Override
public void readFields(DataInput dataInput) throws IOException {
this.upFlow = dataInput.readLong();
this.downFlow = dataInput.readLong();
this.sumFlow = dataInput.readLong();
}
@Override
public String toString() {
return "FlowBean{" +
"upFlow=" + upFlow +
", downFlow=" + downFlow +
", sumFlow=" + sumFlow +
'}';
}
}
driver
package xyz.leeyangy.hdfs.flow;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
/**
* @Author liyangyang
* @Date: 2023/05/13 21:23
* @Package xyz.leeyangy.hdfs.flow
* @Version 1.0
* @Description:
*/
public class FlowDriver {
public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
// 1. Get the Job instance
Configuration conf = new Configuration();
Job job = Job.getInstance(conf);
// 2. Set the driver class
job.setJarByClass(FlowDriver.class);
// 3. Set the Mapper and Reducer
job.setMapperClass(FlowMapper.class);
job.setReducerClass(FlowReducer.class);
// 4. Set the output key/value types (FlowBean implements Writable, so it can be serialized)
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(FlowBean.class);
// 5. Set the input and output paths
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
// 6. Submit the job
boolean b = job.waitForCompletion(true);
System.exit(b ? 0 : 1);
}
}
mapper
package xyz.leeyangy.hdfs.flow;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
/**
* @Author liyangyang
* @Date: 2023/05/13 16:03
* @Package xyz.leeyangy.hdfs.mybean
* @Version 1.0
* @Description:
*/
public class FlowMapper extends Mapper<LongWritable, Text, Text, FlowBean> {
private Text outK = new Text();
private FlowBean outV = new FlowBean();
/**
* @param key
* @param value
* @param context
* @throws IOException
* @throws InterruptedException
*/
@Override
protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, FlowBean>.Context context) throws IOException, InterruptedException {
// Read one line of data
String line = value.toString();
// Split it on tabs
String[] split = line.split("\t");
// Get the phone number, the upstream traffic and the downstream traffic
String phone = split[1];
String up = split[split.length - 3];
String down = split[split.length - 2];
// Fill in outK and outV
outK.set(phone);
outV.setUpFlow(Long.parseLong(up));
outV.setDownFlow(Long.parseLong(down));
outV.setSumFlow(Long.parseLong(up) + Long.parseLong(down));
// Write out outK and outV
context.write(outK, outV);
}
}
reducer
package xyz.leeyangy.hdfs.flow;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
/**
* @Author liyangyang
* @Date: 2023/05/13 16:10
* @Package xyz.leeyangy.hdfs.flow
* @Version 1.0
* @Description:
*/
public class FlowReducer extends Reducer<Text, FlowBean, Text, FlowBean> {
private FlowBean outV = new FlowBean();
/**
* @param key
* @param values
* @param context
* @throws IOException
* @throws InterruptedException
*/
@Override
protected void reduce(Text key, Iterable<FlowBean> values, Reducer<Text, FlowBean, Text, FlowBean>.Context context) throws IOException, InterruptedException {
Long totalUp = 0L;
Long totalDown = 0L;
// 1. Iterate over the values and accumulate the upstream and downstream traffic separately
for (FlowBean flowBean : values) {
totalUp += flowBean.getUpFlow();
totalDown += flowBean.getDownFlow();
}
// 2. Fill in outV
outV.setUpFlow(totalUp);
outV.setDownFlow(totalDown);
outV.setSumFlow(totalUp + totalDown);
// 3. Write out key and outV
context.write(key, outV);
}
}
Input data to process
1 13736230513 192.196.100.1 www.baidu.com 2481 24681 200
2 13846544121 192.196.100.2 264 0 200
3 13956435636 192.196.100.3 132 1512 200
4 13966251146 192.168.100.1 240 0 404
5 18271575951 192.168.100.2 www.baidu.com 1527 2106 200
6 84188413 192.168.100.3 www.baidu.com 4116 1432 200
7 13590439668 192.168.100.4 1116 954 200
8 15910133277 192.168.100.5 www.hao123.com 3156 2936 200
9 13729199489 192.168.100.6 240 0 200
10 13630577991 192.168.100.7 www.shouhu.com 6960 690 200
11 15043685818 192.168.100.8 www.baidu.com 3659 3538 200
12 15959002129 192.168.100.9 www.baidu.com 1938 180 500
13 13560439638 192.168.100.10 918 4938 200
14 13470253144 192.168.100.11 180 180 200
15 13682846555 192.168.100.12 www.qq.com 1938 2910 200
16 13992314666 192.168.100.13 www.gaga.com 3008 3720 200
17 13509468723 192.168.100.14 www.qinghua.com 7335 110349 404
18 18390173782 192.168.100.15 www.sogou.com 9531 2412 200
19 13975057813 192.168.100.16 www.baidu.com 11058 48243 200
20 13768778790 192.168.100.17 120 120 200
21 13568436656 192.168.100.18 www.alibaba.com 2481 24681 200
22 13568436656 192.168.100.19 1116 954 200
hadoop jar /home/hadoop/temp/hadoop_demo-1.0-SNAPSHOT.jar xyz.leeyangy.hdfs.flow.FlowDriver /input/phone_data.txt /user/hadoop/outputs
Find each student's highest score
Input data to process
Future 684
chinasoft 265
Bed 543
Mary 341
Adair 345
Chad 664
Colin 464
Eden 154
Grover 630
Future 340
chinasoft 367
Bed 567
Mary 367
Adair 664
Chad 543
Colin 574
Eden 663
Grover 614
Future 312
chinasoft 513
Bed 641
Mary 467
Adair 613
Chad 697
Colin 271
Eden 463
Grover 452
Future 548
Alex 285
Bed 554
Mary 596
Adair 681
Chad 584
Colin 699
Eden 708
Grover 345
driver
mapper
reducer
bean
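The driver/mapper/reducer/bean entries above are left as placeholders for the assignment. A minimal sketch of one possible solution, following the same pattern as the WordCount and traffic examples (class names and paths are only illustrative): the mapper emits (name, score) pairs and the reducer keeps the maximum score per name, so no extra bean is strictly needed for this simple version.
package xyz.leeyangy.hdfs.maxscore;

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxScoreDriver {

    public static class MaxScoreMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final Text outK = new Text();
        private final IntWritable outV = new IntWritable();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // each input line looks like: "Future 684" (name, whitespace, score)
            String[] fields = value.toString().trim().split("\\s+");
            if (fields.length < 2) {
                return; // skip malformed lines
            }
            outK.set(fields[0]);
            outV.set(Integer.parseInt(fields[1]));
            context.write(outK, outV);
        }
    }

    public static class MaxScoreReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable outV = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // keep the largest score seen for this student
            int max = Integer.MIN_VALUE;
            for (IntWritable v : values) {
                max = Math.max(max, v.get());
            }
            outV.set(max);
            context.write(key, outV);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration());
        job.setJarByClass(MaxScoreDriver.class);
        job.setMapperClass(MaxScoreMapper.class);
        job.setReducerClass(MaxScoreReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
It would be run the same way as the earlier jobs, for example: hadoop jar <your-jar> xyz.leeyangy.hdfs.maxscore.MaxScoreDriver /input/score.txt /user/hadoop/output_max (paths illustrative).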