
The Hadoop Ecosystem

Posted by LEEYANGY on 2023/03/06 00:09

Still a work in progress; all of the course content will be updated by the end of May.


Configuring single-node Hadoop, Hive, and HBase on Windows 11

The latest Hadoop release at the moment is 3.3.6. (It starts fine with the JDK 11 I use; with JDK 17 an exception occurs when browsing files, which I could not resolve, so downgrading to JDK 11 makes it start normally.)

Download -- extract -- patch -- configure

Configure environment variables

HADOOP_HOME
<the path where Hadoop was extracted>
# entry to add to Path
%HADOOP_HOME%\bin
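
For example, from an administrator command prompt (a minimal sketch, assuming Hadoop was extracted to D:\Hadoop3\hadoop; editing the variables through the system dialog works just as well):

rem a sketch, assuming Hadoop was extracted to D:\Hadoop3\hadoop
setx HADOOP_HOME "D:\Hadoop3\hadoop"
rem appending to the user Path this way has length limits; the environment variables dialog is safer
setx PATH "%PATH%;D:\Hadoop3\hadoop\bin"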


Apply the patch

  Hadoop itself does not support running on Windows; an external patch is needed, which can be downloaded from GitHub: https://github.com/cdarlint/winutils

Clone the repository locally, then copy the two files hadoop.dll and winutils.exe into the %HADOOP_HOME%\bin directory.
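
For example, from a command prompt (a sketch, assuming the repository was cloned to D:\winutils and that you pick the version directory closest to your Hadoop release, e.g. hadoop-3.3.5):

copy D:\winutils\hadoop-3.3.5\bin\hadoop.dll %HADOOP_HOME%\bin\
copy D:\winutils\hadoop-3.3.5\bin\winutils.exe %HADOOP_HOME%\bin\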


Modify the configuration files (the following files must be saved with UTF-8 encoding)

Edit core-site.xml

  

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<!-- Put site-specific property overrides in this file. -->
<configuration>
<!-- Port used by the HDFS API -->
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/D:/Hadoop3/hadoop/TmpData</value>
</property>
</configuration>


Modify hdfs-site.xml

Adjust the following paths (e.g. /D:/Hadoop3/hadoop/dfs/name) to your own setup

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/D:/Hadoop3/hadoop/dfs/name</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/D:/Hadoop3/hadoop/dfs/data</value>
</property>
<!-- NameNode web UI port -->
<property>
<name>dfs.namenode.http.address</name>
<value>localhost:9870</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
</configuration>


Modify mapred-site.xml


<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<!-- Put site-specific property overrides in this file. -->
<configuration>
<!-- Run MapReduce on YARN -->
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
<description>The runtime framework for executing MapReduce jobs</description>
</property>
</configuration>



Modify yarn-site.xml


<?xml version="1.0"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<configuration>
<!-- Site specific YARN configuration properties -->
<!-- How MapReduce fetches data during the shuffle -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>1024</value>
</property>
<property>
<name>yarn.nodemanager.resource.cpu-vcores</name>
<value>1</value>
</property>
<!-- ResourceManager hostname (optional) -->
<!-- <property> -->
<!-- <name>yarn.resourcemanager.hostname</name> -->
<!-- <value>hadoop3a</value> -->
<!-- </property> -->
</configuration>


Testing the environment ... omitted



Start Hadoop

Format first (this "format" initializes the Hadoop environment; it does not format your disk)

bin\hdfs.cmd namenode -format


Start DFS and YARN

sbin\start-all.cmd

The startup result looks like this (screenshot omitted):
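
You can also confirm that the daemons are running with jps (a quick check; the exact list can vary slightly between setups):

jps
rem expect NameNode, DataNode, ResourceManager and NodeManager in the output (PIDs will differ)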


Web UI (http://localhost:9870, screenshot omitted)


At this point, single-node Hadoop 3.3.6 on Windows 11 is complete.


Hive installation

As you can see, the latest version at the time, 3.1.3, is used.

Preparation: two Hive packages are needed, the latest release and an older release that still contains the .cmd scripts.

The latest release no longer ships the *.cmd scripts, so you need to download an older Hive release and copy them over before Hive can run on Windows.

Official download page: Downloads (apache.org)

Hive 2.2.0 download: Index of /dist/hive/hive-2.2.0 (apache.org)

After downloading, extract it, configure the environment variables, and edit the configuration files hive-site.xml and hive-env.sh.

Modify hive-site.xml (cp hive-default.xml.template hive-site.xml)

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/user/hive/warehouse</value>
<description>location of default database for the warehouse</description>
</property>
<!-- Hive's scratch directory, located on HDFS -->
<property>
<name>hive.exec.scratchdir</name>
<value>/tmp/hive</value>
<description>HDFS root scratch dir for Hive jobs which gets created with write all (733) permission. For each connecting user, an HDFS scratch dir: ${hive.exec.scratchdir}/&lt;username&gt; is created, with ${hive.scratch.dir.permission}.</description>
</property>
<!-- Local scratch directory for the temporary/intermediate data of each query, normally cleaned up by the Hive client when the query completes -->
<property>
<name>hive.exec.local.scratchdir</name>
<value>D:/Hadoop3/Hive/apache-hive-3.1.3-bin/tmp/${system:user.name}</value>
<!-- <value>D:/bigdata/apache-hive-3.1.3-bin/my_hive/scratch_dir</value> -->
<description>Local scratch space for Hive jobs</description>
</property>
<!-- Local temporary directory for resources added to Hive -->
<property>
<name>hive.downloaded.resources.dir</name>
<value>D:/Hadoop3/Hive/apache-hive-3.1.3-bin/tmp/${hive.session.id}_resources</value>
<!-- <value>D:/bigdata/apache-hive-3.1.3-bin/my_hive/resources_dir/${hive.session.id}_resources</value> -->
<description>Temporary local directory for added resources in the remote file system.</description>
</property>
<!-- Local directory for Hive's structured runtime query logs -->
<property>
<name>hive.querylog.location</name>
<value>D:/Hadoop3/Hive/apache-hive-3.1.3-bin/tmp/${system:user.name}</value>
<!-- <value>D:/bigdata/apache-hive-3.1.3-bin/my_hive/querylog_dir</value> -->
<description>Location of Hive run time structured log file</description>
</property>
<!-- Verifies that the metastore schema version matches the Hive jars; default is true. With false, a version mismatch only produces a warning -->
<!-- Works around: Caused by: MetaException(message:Version information not found in metastore. ) -->
<property>
<name>hive.metastore.schema.verification</name>
<value>false</value>
<description>
Enforce metastore schema version consistency.
True: Verify that version information stored in is compatible with one from Hive jars. Also disable automatic
schema migration attempt. Users are required to manually migrate schema after Hive upgrade which ensures
proper metastore schema migration. (Default)
False: Warn if the version information stored in metastore doesn't match with one from in Hive jars.
</description>
</property>
<!-- Use MySQL as the Hive metastore database -->
<property>
<name>javax.jdo.option.ConnectionURL</name>
<!-- <value>jdbc:mysql://localhost:3306/metastore?createDatabaseIfNotExist=true&characterEncoding=latin1&useSSL=false</value> -->
<value>jdbc:mysql://localhost:3306/metastore?createDatabaseIfNotExist=true&amp;useSSL=false</value>
<description>JDBC connect string for a JDBC metastore.</description>
</property>
<!-- MySQL JDBC driver class -->
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
<description>Driver class name for a JDBC metastore</description>
</property>
<!-- Username for the MySQL server -->
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
<description>Username to use against metastore database</description>
</property>
<!-- Password for the MySQL server -->
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>root</value>
<description>password to use against metastore database</description>
</property>
<!-- HiveServer2 Thrift bind host -->
<property>
<name>hive.server2.thrift.bind.host</name>
<value>localhost</value>
</property>
<!-- HiveServer2 Thrift port -->
<property>
<name>hive.server2.thrift.port</name>
<value>10000</value>
</property>
<!-- Host and port of the Thrift metastore server; the default is an empty <value/> -->
<property>
<name>hive.metastore.uris</name>
<value>thrift://localhost:9083</value>
</property>
<!-- Top-level local directory for operation logs, if operation logging is enabled -->
<property>
<name>hive.server2.logging.operation.log.location</name>
<value>D:/Hadoop3/Hive/apache-hive-3.1.3-bin/tmp/${system:user.name}/operation_logs</value>
<!-- <value>D:/bigdata/apache-hive-3.1.3-bin/my_hive/operation_logs_dir</value> -->
<description>Top level directory where operation logs are stored if logging functionality is enabled</description>
</property>
<!-- Auto-create everything -->
<!-- Automatically create the schema when initializing the database -->
<!-- Fixes the error: Required table missing : "DBS" in Catalog "" Schema "" -->
<!-- <property>
<name>datanucleus.schema.autoCreateAll</name>
<value>true</value>
<description>Auto creates necessary schema on a startup if one doesn't exist. Set this to false, after creating it once.To enable auto create also set hive.metastore.schema.verification=false. Auto creation is not recommended for production use cases, run schematool command instead.</description>
</property> -->
</configuration>


Modify hive-env.sh

# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Set Hive and Hadoop environment variables here. These variables can be used
# to control the execution of Hive. It should be used by admins to configure
# the Hive installation (so that users do not have to set environment variables
# or set command line parameters to get correct behavior).
#
# The hive service being invoked (CLI etc.) is available via the environment
# variable SERVICE


# Hive Client memory usage can be an issue if a large number of clients
# are running at the same time. The flags below have been useful in
# reducing memory usage:
#
# if [ "$SERVICE" = "cli" ]; then
# if [ -z "$DEBUG" ]; then
# export HADOOP_OPTS="$HADOOP_OPTS -XX:NewRatio=12 -Xms10m -XX:MaxHeapFreeRatio=40 -XX:MinHeapFreeRatio=15 -XX:+UseParNewGC -XX:-UseGCOverheadLimit"
# else
# export HADOOP_OPTS="$HADOOP_OPTS -XX:NewRatio=12 -Xms10m -XX:MaxHeapFreeRatio=40 -XX:MinHeapFreeRatio=15 -XX:-UseGCOverheadLimit"
# fi
# fi

# The heap size of the jvm stared by hive shell script can be controlled via:
#
# export HADOOP_HEAPSIZE=1024
#
# Larger heap size may be required when running queries over large number of files or partitions.
# By default hive shell scripts use a heap size of 256 (MB). Larger heap size would also be
# appropriate for hive server.


# Set HADOOP_HOME to point to a specific hadoop install directory
# HADOOP_HOME=${bin}/../../hadoop

# Hive Configuration Directory can be controlled by:
# export HIVE_CONF_DIR=

# Folder containing extra libraries required for hive compilation/execution can be controlled by:
# export HIVE_AUX_JARS_PATH=

# Heap size of the JVM started by the Hive shell scripts
export HADOOP_HEAPSIZE=2048
# Hadoop installation directory
HADOOP_HOME=%HADOOP_HOME%
# Hive configuration directory
export HIVE_CONF_DIR=%HIVE_CONF_DIR%
# Hive lib directory (auxiliary jars)
export HIVE_AUX_JARS_PATH=D:\env\apache\hadoop3\apache-hive-3.1.3-bin\lib


Fill in the missing files under the bin directory

Extract the 2.2.0 package, select everything under its bin directory, copy it, and paste it into the 3.1.3 bin directory, replacing all existing files.


Initialize the database

Initialize the Hive metastore in the MySQL database

Import hive-schema-3.0.0.mysql.sql from the %HIVE_HOME%\scripts\metastore\upgrade\mysql directory into MySQL
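
If you import it manually, the command looks roughly like this (a sketch, assuming the metastore database already exists and the root/root credentials configured above; alternatively, the schematool command below creates the schema by itself):

mysql -u root -p metastore < %HIVE_HOME%\scripts\metastore\upgrade\mysql\hive-schema-3.0.0.mysql.sql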

hive --service schematool -dbType mysql -initSchema --verbose


Start Hive

1. Start the Hive metastore

hive --service metastore

2. Start the HiveServer2 service

hive --service hiveserver2

3. Start the Hive command line

hive
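
Optionally, connect through Beeline as well to confirm that HiveServer2 is reachable (a sketch, assuming the default port 10000 configured above):

beeline -u "jdbc:hive2://localhost:10000" -n root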



Hive's log4j jar conflicts with Hadoop's log4j jar
HDFS must be running before Hive is initialized
The Hive metastore database uses latin1 encoding for its metadata



HBase installation

Download the latest 2.5.7 release and make the following configuration changes

Modify hbase-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
-->
<configuration>
<!--
The following properties are set for running HBase as a single process on a
developer workstation. With this configuration, HBase is running in
"stand-alone" mode and without a distributed file system. In this mode, and
without further configuration, HBase and ZooKeeper data are stored on the
local filesystem, in a path under the value configured for `hbase.tmp.dir`.
This value is overridden from its default value of `/tmp` because many
systems clean `/tmp` on a regular basis. Instead, it points to a path within
this HBase installation directory.

Running against the `LocalFileSystem`, as opposed to a distributed
filesystem, runs the risk of data integrity issues and data loss. Normally
HBase will refuse to run in such an environment. Setting
`hbase.unsafe.stream.capability.enforce` to `false` overrides this behavior,
permitting operation. This configuration is for the developer workstation
only and __should not be used in production!__

See also https://hbase.apache.org/book.html#standalone_dist
-->
<property>
<!-- HBase root directory (here the install directory) -->
<name>hbase.rootdir</name>
<value>file:///D:/env/apache/hadoop3/hbase-2.5.7</value>
</property>
<!-- HBase temporary directory -->
<property>
<name>hbase.tmp.dir</name>
<value>D:/Hadoop3/hbase_tmp</value>
</property>
<!-- Port for the HBase master web UI -->
<property>
<name>hbase.master.info.port</name>
<value>60010</value>
</property>
<!-- Reserved: ZooKeeper host IP -->
<property>
<name>hbase.zookeeper.quorum</name>
<value>127.0.0.1</value>
</property>
<!-- Reserved: ZooKeeper data directory -->
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>D:/Hadoop3/hbase-2.5.7/zoo</value>
</property>
<!-- HBase deployment mode: false means standalone or pseudo-distributed, true means fully distributed -->
<property>
<name>hbase.cluster.distributed</name>
<value>false</value>
</property>
<!-- Must be set in a distributed setup, otherwise HMaster may fail to start -->
<property>
<name>hbase.unsafe.stream.capability.enforce</name>
<value>false</value>
</property>
</configuration>


Modify hbase-env.cmd and append the following at the end

set JAVA_HOME=%JAVA_HOME%

@rem Do not use the bundled ZooKeeper for now
set HBASE_MANAGES_ZK=false
set HBASE_CLASSPATH=%HBASE_HOME%/conf


Startup test

Fill in the missing dependencies

# The first three can be found under share\hadoop\common\lib of a Hadoop 3.x.x distribution; get the fourth directly from the Maven repository

htrace-core4-4.1.0-incubating.jar, slf4j-api-1.7.25.jar, slf4j-log4j12-1.7.25.jar, jansi-1.4.jar


Get jansi-1.4.jar via a Maven dependency

<dependency>
<groupId>org.fusesource.jansi</groupId>
<artifactId>jansi</artifactId>
<version>1.4</version>
</dependency>

Once you have these jars, copy them into %HBASE_HOME%\lib.


Start:

cd %HBASE_HOME%
bin\start-hbase.cmd
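
Once the daemons are up, a quick sanity check is to open the HBase shell and ask for the cluster status (a sketch; status and version are built-in shell commands):

bin\hbase.cmd shell
# then, inside the shell:
status
version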




The test cases are in the HBase assignment below


The Linux cluster:

Environment: Docker (a pseudo-cluster works), or a VM running ubuntu-server 20.04

My pre-configured pseudo-cluster image:


My layout: master is the primary node, slave1 and slave2 are the workers


In order:

hadoop:

    VM installation process; Windows works too, actually (but it is torture)

    Two files need to be prepared now:

            For hadoop-3.3.4, either JDK 8 or JDK 11 is fine, so it is these two packages

    Hadoop can be downloaded from the Tsinghua mirror

    Download the JDK manually from the Oracle website


    After downloading them to the server, extract each one

    Create a module folder under /opt

    sudo mkdir /opt/module

    Grant the current user access; the hadoop user is used as the example here

    sudo chown hadoop:hadoop /opt/module

    Edit the profile file

    sudo vim /etc/profile.d/my_env.sh

# Add JAVA_HOME and HADOOP_HOME

#JAVA_HOME
export JAVA_HOME=/opt/module/jdk1.8.0_361/
export PATH=$PATH:$JAVA_HOME/bin

#HADOOP_HOME
export HADOOP_HOME=/opt/module/hadoop-3.3.4
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin

# save and exit, then reload the profile and verify
. /etc/profile
java -version
hadoop version
# both commands should print a version number

Use the official example to check that everything works (local / standalone mode)

cd $HADOOP_HOME
mkdir wcinput
cd wcinput
vim word.txt
hadoop yarn
hadoop mapreduce
chinasoft
chinasoft
# save and exit vim

hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.4.jar wordcount wcinput wcoutput



Configure passwordless SSH login between the cluster nodes
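
A minimal sketch of the usual approach, assuming the hadoop user and the master/slave1/slave2 hosts from the layout above (run on every node):

ssh-keygen -t rsa
ssh-copy-id hadoop@master
ssh-copy-id hadoop@slave1
ssh-copy-id hadoop@slave2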

 


hbase:

    start-hbase

    hbase shell (entering it is similar to the mysql client)


create "tableName","columnFamily1","columnFamily2","...columnFamilyN..."
list                                                    # list the tables in the database
describe "tableName"                                    # show the table structure
put "tableName","rowKey","columnFamily:column","value"  # insert or update
scan "tableName"                                        # scan the table data

get "tableName","rowKey","columnFamily:column"          # read a specific cell

get "tableName","rowKey",{COLUMN=>"columnFamily:qualifier",TIMESTAMP=>timestamp}

deleteall "tableName","rowKey"                          # delete all data under a rowKey (table name + rowKey)
delete "tableName","rowKey","columnFamily:column"       # delete a single cell
truncate "tableName"                                    # empty the table
# a table must be disabled before it can be dropped
disable "tableName"
drop "tableName"

   Assignment 1:

Statements

create "student","info","Mon","Tue","Wed","Thur","Fri"
# zhangsan
put "student","1001","info:name","zhangsan"
put "student","1001","info:groupnumber","0"
put "student","1001","info:groupleader","zhangsan"
put "student","1001","info:sex","male"
put "student","1001","info:tel","19978250000"
put "student","1001","Mon:E","Spark And Hadoop"
put "student","1001","Tue:A","MapReduce"
put "student","1001","Tue:B","SpringCloud"
put "student","1001","Tue:D","MicroServices"
put "student","1001","Wed:C","Vue"
put "student","1001","Thur:B","H5"
# lisi
put "student","1002","info:name","lisi"
put "student","1002","info:groupnumber","0"
put "student","1002","info:groupleader","zhangsan"
put "student","1002","info:sex","male"
put "student","1002","info:tel","19978251111"
put "student","1002","Mon:E","Spark And Hadoop"
put "student","1002","Tue:A","MapReduce"
put "student","1002","Tue:B","SpringCloud"
put "student","1002","Tue:D","MicroServices"
put "student","1002","Wed:C","Vue"
put "student","1002","Thur:B","H5"
# wangwu
put "student","1003","info:name","wangwu"
put "student","1003","info:groupnumber","0"
put "student","1003","info:groupleader","zhangsan"
put "student","1003","info:sex","female"
put "student","1003","info:tel","19978252222"
put "student","1003","Mon:E","Spark And Hadoop"
put "student","1003","Tue:A","MapReduce"
put "student","1003","Tue:B","SpringCloud"
put "student","1003","Tue:D","MicroServices"
put "student","1003","Wed:C","Vue"
put "student","1003","Thur:B","H5"
# zhaoliu
put "student","1004","info:name","zhaoliu"
put "student","1004","info:groupnumber","0"
put "student","1004","info:groupleader","zhangsan"
put "student","1004","info:sex","male"
put "student","1004","info:tel","19978253333"
put "student","1004","Mon:E","Spark And Hadoop"
put "student","1004","Tue:A","MapReduce"
put "student","1004","Tue:B","SpringCloud"
put "student","1004","Tue:D","MicroServices"
put "student","1004","Wed:C","Vue"
put "student","1004","Thur:B","H5"
# sunqi
put "student","1005","info:name","sunqi"
put "student","1005","info:groupnumber","0"
put "student","1005","info:groupleader","zhangsan"
put "student","1005","info:sex","female"
put "student","1005","info:tel","19978254444"
put "student","1005","Mon:E","Spark And Hadoop"
put "student","1005","Mon:A","Hadoop"
put "student","1005","Tue:A","MapReduce"
put "student","1005","Tue:B","SpringCloud"
put "student","1005","Tue:D","MicroServices"
put "student","1005","Wed:C","Vue"
put "student","1005","Thur:B","H5"

2.
alter "student","Sun"

3.
deleteall "student","1001"

5.
scan "student",{COLUMN=>"Mon"}

4.
scan "student",{COLUMNS=>['Mon','Tue','Wed','Thur','Fri','Sun'], FILTER=>"PrefixFilter('1001')"}


The environment set up on Linux


ZooKeeper cluster

# extract and move

tar -zxvf apache-zookeeper-3.7.0.tar.gz
mv apache-zookeeper-3.7.0 /usr/local

# Configure ZooKeeper

# 1. copy the sample file to zoo.cfg
cp zoo_sample.cfg zoo.cfg
# 2. edit zoo.cfg and change the following
vi zoo.cfg

# ZooKeeper data directory
dataDir=/usr/local/zookeeper/data/zkData
# append at the end: the cluster hosts and ports; the number of machines must be odd
server.1=master:2888:3888
server.2=slave1:2888:3888
server.3=slave2:2888:3888

# 3. create the data directory
mkdir /usr/local/zookeeper/data/zkData/ -p
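
Each node also needs a myid file whose number matches its server.N entry before ZooKeeper will start (a sketch for the master node, assuming the install was renamed to /usr/local/zookeeper to match the dataDir above; use 2, 3, ... on the other nodes):

echo 1 > /usr/local/zookeeper/data/zkData/myid
/usr/local/zookeeper/bin/zkServer.sh start
/usr/local/zookeeper/bin/zkServer.sh status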



Spark cluster setup

Reference articles:
    https://blog.csdn.net/m0_53317797/article/details/127216100
    https://www.cnblogs.com/liugp/p/16153043.html#3yarn%E6%8E%A8%E8%8D%90

Configuration files

vim spark-env.sh

export JAVA_HOME=/opt/module/jdk1.8.0_361
export SPARK_DIST_CLASSPATH=$(/opt/module/hadoop-3.3.4/bin/hadoop classpath)
export HADOOP_CONF_DIR=/opt/module/hadoop-3.3.4/etc/hadoop
export SPARK_MASTER_IP=master  # this is the address of the master host
export SPARK_MASTER_HOST=master
export SPARK_HISTORY_OPTS="-Dspark.history.ui.port=18080
-Dspark.history.retainedApplications=5
-Dspark.history.fs.logDirectory=hdfs://master:9000/spark-app-history"
export HADOOP_HOME=/opt/module/hadoop-3.3.4
export SPARK_MASTER_PORT=7077


vim spark-defaults.conf

spark.eventLog.enabled           true
spark.eventLog.dir hdfs://master:9000/spark-app-history
spark.eventLog.compress true
spark.yarn.historyServer.address master:18080
spark.yarn.jars hdfs:///spark-yarn/jars/*.jar

spark.master spark://master:7077
spark.eventLog.enabled true
spark.eventLog.dir hdfs://master:8020/spark-eventlog
spark.eventLog.compress true
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.driver.memory 1g

spark.yarn.historyServer.address master:18080
spark.history.ui.port 18080
spark.yarn.jars hdfs:///spark-yarn/jars/*.jar


Copy workers.template to workers

cp workers.template workers
vim workers

master
slave1
slave2



Start the Hadoop cluster


Create the directories

hdfs dfs -mkdir /spark-app-history
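
The spark.yarn.jars setting above also expects the Spark jars to be available on HDFS; a sketch of uploading them (assuming $SPARK_HOME points at the Spark installation):

hdfs dfs -mkdir -p /spark-yarn/jars
hdfs dfs -put $SPARK_HOME/jars/*.jar /spark-yarn/jars/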


Sync all of the modified configuration to the slave machines, for example:
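
A sketch with rsync, assuming the same /opt/module layout exists on every node:

for host in slave1 slave2; do
  rsync -av /opt/module/spark-3.3.2-bin-hadoop3/conf/ $host:/opt/module/spark-3.3.2-bin-hadoop3/conf/
  rsync -av /opt/module/hadoop-3.3.4/etc/hadoop/ $host:/opt/module/hadoop-3.3.4/etc/hadoop/
done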


Start the Spark cluster:

cd $SPARK_HOME && ./sbin/start-all.sh


The Spark master web UI listens on port 8080 by default.

Final result (screenshot omitted):


Run the test examples:

cd $SPARK_HOME
spark-submit --class org.apache.spark.examples.SparkPi examples/jars/spark-examples_2.12-3.3.2.jar 1000


spark-submit --master yarn \
--deploy-mode client \
--class org.apache.spark.examples.SparkPi \
$SPARK_HOME/examples/jars/spark-examples_2.12-3.1.2.jar 2000


Shell start/stop script

#!/bin/bash
if [ $# -lt 1 ]
then
echo "No Args Input..."
exit ;
fi

case $1 in
"start")
echo " =================== Starting the hadoop cluster ==================="
echo " --------------- Starting hdfs ---------------"
ssh master "/opt/module/hadoop-3.3.4/sbin/start-dfs.sh"
echo " --------------- Starting yarn ---------------"
ssh slave1 "/opt/module/hadoop-3.3.4/sbin/start-yarn.sh"
echo " =================== Starting the spark cluster ==================="
ssh master "/opt/module/spark-3.3.2-bin-hadoop3/sbin/start-all.sh"
echo " --------------- Starting the historyserver ---------------"
ssh master "/opt/module/hadoop-3.3.4/bin/mapred --daemon start historyserver"
ssh master "/opt/module/spark-3.3.2-bin-hadoop3/sbin/start-history-server.sh"
;;

"stop")
echo " =================== Stopping the hadoop cluster ==================="
echo " --------------- Stopping the historyserver ---------------"
ssh master "/opt/module/spark-3.3.2-bin-hadoop3/sbin/stop-history-server.sh"
ssh master "/opt/module/hadoop-3.3.4/bin/mapred --daemon stop historyserver"
echo " =================== Stopping the spark cluster ==================="
ssh master "/opt/module/spark-3.3.2-bin-hadoop3/sbin/stop-all.sh"
echo " --------------- Stopping yarn ---------------"
ssh slave1 "/opt/module/hadoop-3.3.4/sbin/stop-yarn.sh"
echo " --------------- Stopping hdfs ---------------"
ssh master "/opt/module/hadoop-3.3.4/sbin/stop-dfs.sh"
;;
*)
echo "Input Args Error..."
;;
esac


mapreduce:

    Omitted for now


hive

Install the MySQL database and modify the Hive configuration files


Scala syntax:

var declares a variable, val a constant (once a val has been assigned at its definition, it cannot be reassigned)


Some basic operations once the Spark cluster is up; a working knowledge of Scala syntax is needed

The RDD concept (Resilient Distributed Dataset)


An RDD is an immutable, distributed collection of objects; each RDD is split into partitions, which run on different nodes of the cluster.

Spark provides two ways to create an RDD: read an external dataset, or parallelize a collection that already exists in memory.

Since I manage files with HDFS, I did not notice at first that the input file has to be referenced the HDFS way, so printing the result kept failing with "input file does not exist" on HDFS. Also, on Linux, sc.textFile needs to be replaced with spark.sparkContext.textFile, otherwise a relative-path exception is thrown.

How to fix the URI exception when Spark reads Hadoop files: see the article below

Spark读取和存储HDFS上的数据 - 腾讯云开发者社区-腾讯云 (tencent.com)

After making the changes described in that article, the results came out correct; a quick check is sketched below.
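
A sketch of that check, assuming the word.txt file from earlier and the hdfs://master:9000 NameNode address used in the configuration above:

hdfs dfs -mkdir -p /input
hdfs dfs -put word.txt /input/word.txt
spark-shell --master spark://master:7077
# then, inside spark-shell:
#   spark.sparkContext.textFile("hdfs://master:9000/input/word.txt").count()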


Assignment 2:

------

------




Error logs / troubleshooting

exception in thread main org.apache.spark.sparkexception:A master URL must be set in your

# The message means the master that the program should run against cannot be found, so it has to be configured. The master URL passed to Spark can be one of the following:

local               run locally with a single thread
local[K]            run locally with K threads (K cores)
local[*]            run locally using all available cores
spark://HOST:PORT   connect to the given Spark standalone cluster master; the port must be specified
mesos://HOST:PORT   connect to the given Mesos cluster; the port must be specified
yarn-client         client mode: connect to a YARN cluster; HADOOP_CONF_DIR must be configured
yarn-cluster        cluster mode: connect to a YARN cluster; HADOOP_CONF_DIR must be configured


Hive error log

The "Hive User: root is not allowed to impersonate xxx" problem

Solution: add the following to Hadoop's core-site.xml and restart HDFS

<property>
<name>hadoop.proxyuser.xxx.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.xxx.groups</name>
<value>*</value>
</property>


-------

To allow insert, update, and delete operations on tables:

<property>
<name>hive.server2.enable.doAs</name>
<value>false</value>
</property>



Add a regular user to the HDFS superuser group

Use hdfs dfsadmin -report to check permissions

If the user lacks permission, add it as follows


groupadd supergroup
# add the root user; for a different user, replace root with the user you want to add
usermod -a -G supergroup root

# after the change, sync the group information to HDFS

hdfs dfsadmin -refreshUserToGroupsMappings

# verify again with the command

hdfs dfsadmin -report





Course content:

Write a WordCount word-counting program

Make sure the Hadoop environment above is working; JDK 8 or 11 (in practice, pick whichever JDK suits you)

The code follows directly below

Maven dependencies

<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<maven.compiler.source>1.8</maven.compiler.source>
<maven.compiler.target>1.8</maven.compiler.target>
<!-- hdfs version-->
<hadoop.hdfs.version>3.3.4</hadoop.hdfs.version>
</properties>

<dependencies>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<version>${hadoop.hdfs.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-hdfs</artifactId>
<version>${hadoop.hdfs.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>${hadoop.hdfs.version}</version>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.13.2</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
<version>1.7.30</version>
</dependency>
</dependencies>

<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.8.1</version>
<configuration>
<source>${maven.compiler.source}</source>
<target>${maven.compiler.target}</target>
<encoding>UTF-8</encoding>
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-surefire-plugin</artifactId>
<version>2.9</version>
<configuration>
<skipTests>true</skipTests>
</configuration>
</plugin>
<plugin>
<artifactId>maven-assembly-plugin</artifactId>
<configuration>
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
</configuration>
<executions>
<execution>
<id>make-assembly</id>
<phase>package</phase>
<goals>
<goal>single</goal>
</goals>
</execution>
</executions>
</plugin>

</plugins>
<resources>
<resource>
<directory>src/main/java</directory>
<includes>
<include>**/*.xml</include>
</includes>
</resource>
<resource>
<directory>src/main/resources</directory>
<includes>
<include>**/*</include>
</includes>
</resource>
</resources>
</build>


First, three classes are needed: a Mapper, a Reducer, and a driver


Driver class

package xyz.leeyangy.hdfs.hdfs;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
* @Author liyangyang
* @Date: 2023/05/11 20:56
* @Package xyz.leeyangy.hdfs.hdfs
* @Version 1.0
* @Description:
*/
public class MapReduceWordCountDriver {
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
// 1. Get the configuration and the Job object
Configuration conf = new Configuration();
Job job = Job.getInstance(conf);
// 2. Associate this driver's jar
job.setJarByClass(MapReduceWordCountDriver.class);
// 3. Set the Mapper and Reducer classes
job.setMapperClass(MapReduceWordCountMapper.class);
job.setReducerClass(MapReduceWordCountReduce.class);
// 4. Set the Mapper output key/value types
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
// 5. Set the final output key/value types
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
// 6. Set the input and output paths
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
// 7. Submit the job
boolean result = job.waitForCompletion(true);
System.exit(result ? 0 : 1);
}
}

Mapper

package xyz.leeyangy.hdfs.hdfs;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
* @Author liyangyang
* @Date: 2023/05/11 19:17
* @Package xyz.leeyangy.hdfs.hdfs
* @Version 1.0
* @Description:
*/


public class MapReduceWordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

Text k = new Text();

IntWritable v = new IntWritable(1);

/**
* @param key
* @param value
* @param context
* @throws IOException
* @throws InterruptedException
*/
@Override
protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, IntWritable>.Context context) throws IOException, InterruptedException {

// Read one line
String line = value.toString();

// Split into words
String[] words = line.split(" ");

// Emit each word with a count of 1
for (String word : words) {
k.set(word);
context.write(k,v);
}
}
}

Reducer

package xyz.leeyangy.hdfs.hdfs;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

/**
* @Author liyangyang
* @Date: 2023/05/11 19:50
* @Package xyz.leeyangy.hdfs.hdfs
* @Version 1.0
* @Description:
*/
public class MapReduceWordCountReduce extends Reducer<Text, IntWritable, Text, IntWritable> {

// Running word count
int sum;

IntWritable v = new IntWritable();

/**
* @param key
* @param values
* @param context
* @throws IOException
* @throws InterruptedException
*/
@Override
protected void reduce(Text key, Iterable<IntWritable> values, Reducer<Text, IntWritable, Text, IntWritable>.Context context) throws IOException, InterruptedException {
// Accumulate the sum
sum = 0;
for (IntWritable count:values){
sum += count.get();
}
v.set(sum);
context.write(key,v);
}
}

After packaging with Maven, upload the built jar to the server.

/input/word.txt has the following content (the file has already been uploaded to HDFS)

hadoop yarn
hadoop mapreduce
chinasoft
chinasoft

hadoop jar /home/hadoop/temp/hadoop_demo-1.0-SNAPSHOT.jar  xyz.leeyangy.hdfs.hdfs.MapReduceWordCountDriver /input/word.txt /user/hadoop/output



Internet traffic statistics

package xyz.leeyangy.hdfs.flow;

import org.apache.hadoop.io.Writable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

/**
* @Author liyangyang
* @Date: 2023/05/13 1:57
* @Package xyz.leeyangy.hdfs.mybean
* @Version 1.0
* @Description: Traffic statistics
*/


public class FlowBean implements Writable {
// Upstream traffic
private Long upFlow;

// Downstream traffic
private Long downFlow;

// Total traffic
private Long sumFlow;

// Deserialization invokes the no-arg constructor via reflection, so it must be present
public FlowBean() {
}

public Long getUpFlow() {
return upFlow;
}

public void setUpFlow(Long upFlow) {
this.upFlow = upFlow;
}

public Long getDownFlow() {
return downFlow;
}

public void setDownFlow(Long downFlow) {
this.downFlow = downFlow;
}

public Long getSumFlow() {
return sumFlow;
}

public void setSumFlow(Long sumFlow) {
this.sumFlow = sumFlow;
}


// Implement the serialization and deserialization methods

/**
* @param dataOutput
* @throws IOException
*/
@Override
public void write(DataOutput dataOutput) throws IOException {
dataOutput.writeLong(upFlow);
dataOutput.writeLong(downFlow);
dataOutput.writeLong(sumFlow);
}

/**
* @param dataInput
* @throws IOException
*/
@Override
public void readFields(DataInput dataInput) throws IOException {
this.upFlow = dataInput.readLong();
this.downFlow = dataInput.readLong();
this.sumFlow = dataInput.readLong();
}

@Override
public String toString() {
return "FlowBean{" +
"upFlow=" + upFlow +
", downFlow=" + downFlow +
", sumFlow=" + sumFlow +
'}';
}
}

driver

package xyz.leeyangy.hdfs.flow;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

/**
* @Author liyangyang
* @Date: 2023/05/13 21:23
* @Package xyz.leeyangy.hdfs.flow
* @Version 1.0
* @Description:
*/
public class FlowDriver {
public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
// 1. Get the Job object
Configuration conf = new Configuration();
Job job = Job.getInstance(conf);

// 2. Associate the Driver class
job.setJarByClass(FlowDriver.class);

// 3. Set the Mapper and Reducer
job.setMapperClass(FlowMapper.class);
job.setReducerClass(FlowReducer.class);

// 4. Set the output key/value types
job.setOutputKeyClass(Text.class);
// this bean implements Writable (it is serializable)
job.setOutputValueClass(FlowBean.class);


// 5. Input/output paths
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
// 6. Submit the Job
boolean b = job.waitForCompletion(true);
System.exit(b ? 0 : 1);
}
}

mapper

package xyz.leeyangy.hdfs.flow;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

/**
* @Author liyangyang
* @Date: 2023/05/13 16:03
* @Package xyz.leeyangy.hdfs.mybean
* @Version 1.0
* @Description:
*/
public class FlowMapper extends Mapper<LongWritable, Text, Text, FlowBean> {
private Text outK = new Text();
private FlowBean outV = new FlowBean();

/**
* @param key
* @param value
* @param context
* @throws IOException
* @throws InterruptedException
*/
@Override
protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, FlowBean>.Context context) throws IOException, InterruptedException {
// Read one line of data
String line = value.toString();

// Split the line
String[] split = line.split("\t");
// Get the phone number, upstream traffic and downstream traffic
String phone = split[1];
String up = split[split.length - 3];
String down = split[split.length - 2];

// Populate outK and outV
outK.set(phone);
outV.setUpFlow(Long.parseLong(up));
outV.setDownFlow(Long.parseLong(down));
outV.setSumFlow(Long.parseLong(up) + Long.parseLong(down));
// Write out outK and outV
context.write(outK, outV);
}
}

reducer

package xyz.leeyangy.hdfs.flow;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

/**
* @Author liyangyang
* @Date: 2023/05/13 16:10
* @Package xyz.leeyangy.hdfs.flow
* @Version 1.0
* @Description:
*/
public class FlowReducer extends Reducer<Text, FlowBean, Text, FlowBean> {

private FlowBean outV = new FlowBean();

/**
* @param key
* @param values
* @param context
* @throws IOException
* @throws InterruptedException
*/
@Override
protected void reduce(Text key, Iterable<FlowBean> values, Reducer<Text, FlowBean, Text, FlowBean>.Context context) throws IOException, InterruptedException {

Long totalUp = 0L;
Long totalDown = 0L;

// 1. Iterate over the values and accumulate the upstream and downstream traffic separately
for (FlowBean flowBean : values) {
totalUp += flowBean.getUpFlow();
totalDown += flowBean.getDownFlow();
}
// 2. Populate outV
outV.setUpFlow(totalUp);
outV.setDownFlow(totalDown);
outV.setSumFlow(totalUp + totalDown);
// 3. Write out the key and outV
context.write(key, outV);

}
}

Data to process

1	13736230513	192.196.100.1	www.baidu.com	2481	24681	200
2 13846544121 192.196.100.2 264 0 200
3 13956435636 192.196.100.3 132 1512 200
4 13966251146 192.168.100.1 240 0 404
5 18271575951 192.168.100.2 www.baidu.com 1527 2106 200
6 84188413 192.168.100.3 www.baidu.com 4116 1432 200
7 13590439668 192.168.100.4 1116 954 200
8 15910133277 192.168.100.5 www.hao123.com 3156 2936 200
9 13729199489 192.168.100.6 240 0 200
10 13630577991 192.168.100.7 www.shouhu.com 6960 690 200
11 15043685818 192.168.100.8 www.baidu.com 3659 3538 200
12 15959002129 192.168.100.9 www.baidu.com 1938 180 500
13 13560439638 192.168.100.10 918 4938 200
14 13470253144 192.168.100.11 180 180 200
15 13682846555 192.168.100.12 www.qq.com 1938 2910 200
16 13992314666 192.168.100.13 www.gaga.com 3008 3720 200
17 13509468723 192.168.100.14 www.qinghua.com 7335 110349 404
18 18390173782 192.168.100.15 www.sogou.com 9531 2412 200
19 13975057813 192.168.100.16 www.baidu.com 11058 48243 200
20 13768778790 192.168.100.17 120 120 200
21 13568436656 192.168.100.18 www.alibaba.com 2481 24681 200
22 13568436656 192.168.100.19 1116 954 200

Run it with:

hadoop jar /home/hadoop/temp/hadoop_demo-1.0-SNAPSHOT.jar  xyz.leeyangy.hdfs.flow.FlowDriver /input/phone_data.txt /user/hadoop/outputs


Find each student's highest score

Data to process

Future 684
chinasoft 265
Bed 543
Mary 341
Adair 345
Chad 664
Colin 464
Eden 154
Grover 630
Future 340
chinasoft 367
Bed 567
Mary 367
Adair 664
Chad 543
Colin 574
Eden 663
Grover 614
Future 312
chinasoft 513
Bed 641
Mary 467
Adair 613
Chad 697
Colin 271
Eden 463
Grover 452
Future 548
Alex 285
Bed 554
Mary 596
Adair 681
Chad 584
Colin 699
Eden 708
Grover 345

driver


mapper


reducer


bean




