Hive tables can be populated from four kinds of data sources:
HBase
HDFS
the local filesystem
other Hive tables
Hive tables themselves come in two kinds:
internal (managed) tables and external tables.
Data stored in HBase can be exposed in Hive through a corresponding external table (see the notes on Hive/HBase integration).
Internal and external tables
The difference: when dropped, an internal table deletes its data on Hadoop, while an external table does not. Its data lives in external storage; the Hive table is only a view over that data, which is read in from outside at query time.
Internal table: CREATE TABLE tab (column1 STRING, column2 STRING);
External table: use the EXTERNAL keyword, and specify the following clauses at the end of the definition:
CREATE EXTERNAL TABLE tab (
column1 STRING,
column2 STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION 'hdfs://namenode/tmp/lmj/tab/';
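The drop behavior is easy to check directly; a minimal sketch (assuming both tables above were created, with the external one renamed tab_ext here to avoid the name clash):
-- internal: removes the metastore entry AND the data files under the warehouse directory
DROP TABLE tab;
-- external: removes only the metastore entry; the files under
-- hdfs://namenode/tmp/lmj/tab/ are left untouched on HDFS
DROP TABLE tab_ext;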
Specifying the delimiter
There are two ways:
(1) The DELIMITED form:
ROW FORMAT DELIMITED
[FIELDS TERMINATED BY char]
[COLLECTION ITEMS TERMINATED BY char]
[MAP KEYS TERMINATED BY char]
[LINES TERMINATED BY char]
(2) The SERDE form:
SERDE serde_name [WITH SERDEPROPERTIES (property_name=property_value, property_name=property_value, ...)]
Here, the ROW FORMAT DELIMITED clause sets the column delimiters honored when data is loaded into the table;
if neither ROW FORMAT nor ROW FORMAT DELIMITED is given, Hive falls back to its built-in default SerDe.
Also, when creating a table the user specifies the columns and may specify a custom SerDe; Hive uses the SerDe to determine the actual column data of the table.
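As a sketch of the SERDE form, one of Hive's bundled SerDes, RegexSerDe, parses each line with a regular expression (the class path may vary by Hive version; the table name, columns, and regex below are illustrative, and RegexSerDe requires all columns to be STRING):
CREATE TABLE web_log (host STRING, request STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES ("input.regex" = "([^ ]*) (.*)")
STORED AS TEXTFILE;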
Examples:
Internal table
create table user_info (user_id int, cid string, ckid string, username string)
row format delimited
fields terminated by '\t'
lines terminated by '\n';
External table
CREATE EXTERNAL TABLE test_1 (id INT, name STRING, city STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION 'hdfs://..';
File storage formats
As above, the [STORED AS file_format] clause sets the file type for the loaded data; the default is [STORED AS TEXTFILE]. The main formats are:
STORED AS
SEQUENCEFILE
| TEXTFILE
| RCFILE
| INPUTFORMAT input_format_classname OUTPUTFORMAT output_format_classname
STORED AS TEXTFILE stores plain text files. If the data needs compression, use STORED AS SEQUENCEFILE. The file formats Hive itself supports are Text File and Sequence File.
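A sketch of a compressed SequenceFile table (the SET parameters are the classic MapReduce-era names; the table and columns are illustrative):
SET hive.exec.compress.output=true;
SET mapred.output.compression.type=BLOCK;
CREATE TABLE t_seq (k STRING, v STRING) STORED AS SEQUENCEFILE;
-- rows written via INSERT ... SELECT into t_seq are now stored block-compressed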
Data import and export:
I. Data import:
1.1 Importing into an internal table
(1) From the local filesystem or from HDFS:
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
The difference is the LOCAL keyword: with it, the file is read from a local path; without it, from HDFS.
The target table must exist before the import. If there is no table yet, create it first, then load:
CREATE TABLE myword (id STRING, counts INT, dt STRING) row format delimited fields terminated by '\t';
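A usage sketch for both variants of LOAD DATA into this table (the file paths are hypothetical):
-- local file: the file is copied into the table's directory
LOAD DATA LOCAL INPATH '/home/wizad/myword.txt' OVERWRITE INTO TABLE myword;
-- HDFS file: the file is moved, not copied, into the table's directory
LOAD DATA INPATH '/user/wizad/myword.txt' INTO TABLE myword;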
(2) From the SELECT result of another Hive table:
INSERT OVERWRITE TABLE T1 SELECT * FROM T2;
Here INSERT OVERWRITE TABLE overwrites, deleting the original data,
while INSERT INTO TABLE appends incrementally, keeping the original data.
Related commands:
drop a table: drop table if exists T1;
empty a table: truncate table T1;
1.2 Importing into an external table:
Specify the data source directly when creating the table (a local file cannot be used; it must be an HDFS path):
(1) HDFS data into Hive:
CREATE EXTERNAL TABLE wizad_mdm_dev_lmj_edition_20141120 (cookie_id STRING, guid STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE LOCATION '/user/wizad/test/lmj/edition_compare/';
The full path also works: LOCATION 'hdfs://namenode/user/wizad/test/lmj/edition_compare/';
(2) HBase data into a Hive table:
First set the parameters:
SET mapred.job.queue.name=queue3;
SET hbase.client.scanner.caching=5000;
SET hbase.zookeeper.quorum=datanode06,datanode07,datanode08;
SET zookeeper.znode.parent=/hbase;
When there is a map-typed structure, the table definition must declare it:
CREATE EXTERNAL TABLE lxw2 (
key string,
value map<STRING,STRING>
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,fixeddim:")
TBLPROPERTIES ("hbase.table.name" = "wizad_mdm_task_geely_hk_20141014");
Query the result:
SELECT key, dim_name, dim_value FROM lxw2
LATERAL VIEW explode(value) myTable1 AS dim_name, dim_value
-- WHERE key = '000000054153796'
Reading HBase data this way may fail when the scan runs too long; the lease timeout can be raised:
set hbase.regionserver.lease.period=180000;
When an HBase-backed table is joined with a native Hive table, the query may start and then wait indefinitely. The cause:
II. Data export:
There are three kinds of export:
(1) To another Hive table:
overwrite: INSERT OVERWRITE TABLE t1 select * from t2;
append: INSERT INTO TABLE t1 select * from t2;
Note that Hive rejects creating a table directly from a subquery result here: create table t1 as select * from t2; fails (CTAS is restricted; for example, the target table cannot be external or partitioned).
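When CTAS is rejected, a two-step equivalent works (a sketch reusing t1 and t2 from above):
CREATE TABLE t1 LIKE t2;                    -- copies only the schema of t2
INSERT OVERWRITE TABLE t1 SELECT * FROM t2; -- then copies the data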
(2) To the local filesystem or to HDFS (with or without LOCAL):
INSERT OVERWRITE [LOCAL] DIRECTORY directory1 select_statement1
Note the asymmetry:
exporting to the local filesystem lets you specify a column delimiter,
while exporting to HDFS does not; Hive uses its default \001 (^A).
An HDFS export cannot name a column delimiter:
adding ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' is rejected as unrecognized syntax.
For example:
INSERT OVERWRITE DIRECTORY '/user/wizad/tmp/mytest'
select * from wizad_mdm_dev_lmj_edition_insterest;
When Hive exports to HDFS it uses ^A as the column delimiter by default, corresponding to octal 001. From the official documentation: "Data written to the filesystem is serialized as text with columns separated by ^A." (So in Python, split each line with line.split('\x01') or line.split('\001').)
Data in this form cannot be read into Pig, however: '\001', '\\001', and '^A' all fail as load delimiters.
Workaround: load each line as a single field, then split it with STRSPLIT on '\\001' (using '^A' has no effect). This returns a tuple-like result such as ((a,b)). The Pig code:
%default interestFlie /user/wizad/tmp/mytest/*
-- does not work: interest_data = LOAD '$interestFlie' USING PigStorage('\\001')
-- does not work: interest_data = LOAD '$interestFlie' USING PigStorage('^A')
interest_data = LOAD '$interestFlie'
AS (cookie_id:chararray
-- guid:chararray,
-- dimkey:chararray,
-- dimvalue:chararray
);
test2 = foreach interest_data generate STRSPLIT(cookie_id, '\\001');
DUMP test2;
describe test2;
The result (describe prints the schema as {(null)}):
((B2BEF56E09EC115456E221,352751019523267,interest_11,161))
((B2BEF56E09EC115456E221,352751019523267,interest_13,102))
((EC15860E09D5EA545700E9,352751019523267,interest_11,161))
((EC15860E09D5EA545700E9,352751019523267,interest_13,102))
((daca0487-5424ef1a0dc3fa-11535481,352751019523267,interest_4,61))
((daca0487-5424ef1a0dc3fa-11535481,352751019523267,interest_21,21))
((daca0487-5424ef1a0dc3fa-11535481,352751019523267,interest_11,161))
((daca0487-5424ef1a0dc3fa-11535481,352751019523267,interest_13,102))
Specifying the full export path does not help either:
INSERT OVERWRITE DIRECTORY 'hdfs://namenode/user/wizad/tmp/interest2/'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
select * from wizad_mdm_dev_lmj_edition_insterest;
2. To the local filesystem, a column delimiter can be specified:
INSERT OVERWRITE local DIRECTORY '/home/wizad/lmj/inserest2'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
select * from wizad_mdm_dev_lmj_edition_insterest;
Exporting to the local filesystem can also be done directly with hive -e, tab-delimited by default:
hive -e 'use wizad;
select * from wizad_mdm_dev_lmj_edition_insterest;' >> mytest
The query result uses \t as the column delimiter; in mytest:
3531 3631 3730 3631 3931 3635 34360969 51617061916546.i
In vim's hex view (:%!xxd), every two hex digits correspond to one character; the '.' maps to 09, which in the ASCII table is the tab character.
hive -f works as well:
[wyp@master ~/local]$ cat wyp.sql
select * from wyp
[wyp@master ~/local]$ hive -f wyp.sql >> local/wyp2.txt
http://blog.csdn.net/longshenlmj/article/details/41519503