Importing data into and exporting data out of Hive (local filesystem, HDFS, HBase), and setting column delimiters

A Hive table can draw its data from four kinds of sources:

HBase

HDFS

the local filesystem

other Hive tables

 

Hive tables themselves come in two kinds:

managed (internal) tables and external tables.

For data stored in HBase, a corresponding external table can be created in Hive (see the notes on Hive/HBase integration).

 

Managed and external tables

Difference: when dropped, a managed table also deletes its data on Hadoop, while an external table does not; its data lives in external storage, and the Hive table is only a way of viewing that data, which is read from the external location at query time (see the sketch after the CREATE statements below):

Managed table: CREATE TABLE tab (column1 STRING, column2 STRING);
External table: use the EXTERNAL keyword, and specify the following clauses at the end of the table definition

CREATE EXTERNAL TABLE tab (
  column1 STRING,
  column2 STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION 'hdfs://namenode/tmp/lmj/tab/';
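As a minimal sketch of that drop-time difference (the _managed/_external table names are illustrative, not from the original post):

-- Managed table: files live under Hive's warehouse directory.
CREATE TABLE tab_managed (column1 STRING, column2 STRING);

-- External table: files live at the user-supplied LOCATION.
CREATE EXTERNAL TABLE tab_external (column1 STRING, column2 STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION 'hdfs://namenode/tmp/lmj/tab/';

-- Deletes the metadata AND the files under the warehouse directory.
DROP TABLE tab_managed;

-- Deletes only the metadata; the files under the LOCATION remain.
DROP TABLE tab_external;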

 

Specifying the delimiter

There are two ways to do this.

(1) The DELIMITED form:

ROW FORMAT DELIMITED

[FIELDS TERMINATED BY char]

[COLLECTION ITEMS TERMINATED BY char]

[MAP KEYS TERMINATED BY char]

[LINES TERMINATED BY char]

(2) The SERDE form (a sketch appears after the two CREATE TABLE examples below):

SERDE serde_name [WITH SERDEPROPERTIES (property_name=property_value, property_name=property_value, ...)]

Here, the ROW FORMAT DELIMITED clause sets the column delimiter that will be used for data loaded into the table;

if neither ROW FORMAT nor ROW FORMAT DELIMITED is specified, Hive automatically uses its built-in SerDe.

In addition, when creating a table the user specifies the columns and may also specify a custom SerDe; Hive uses that SerDe to determine the table's actual column data.

Examples:

Managed table:

create table user_info (user_id int, cid string, ckid string, username string)
row format delimited
fields terminated by '\t'
lines terminated by '\n';
External table:
CREATE EXTERNAL TABLE test_1 (id INT, name STRING, city STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION 'hdfs://http://www.cnblogs.com/..';
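For the SERDE form in (2), the following is only a sketch using Hive's built-in RegexSerDe; the table name, columns, and regular expression are illustrative assumptions rather than part of the original examples:

-- One capturing group per column; lines that do not match produce NULLs.
CREATE TABLE access_log (
  host STRING,
  request STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES ("input.regex" = "([^ ]*) (.*)")
STORED AS TEXTFILE;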

File storage formats

As shown above, the STORED AS file_format clause sets the file type for the data being loaded; the default is STORED AS TEXTFILE. The main formats are:

STORED AS
  SEQUENCEFILE
  | TEXTFILE
  | RCFILE
  | INPUTFORMAT input_format_classname OUTPUTFORMAT output_format_classname

STORED AS TEXTFILE stores plain text files. If the data needs to be compressed, use STORED AS SEQUENCEFILE. Out of the box, Hive's own file formats are Text File and Sequence File (plus RCFile, as listed above).
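As a brief, hedged illustration (the table and column names are made up for this sketch), the storage format is chosen simply by swapping the STORED AS clause, and a SequenceFile table is typically populated from a text-format table with an INSERT:

-- Plain text; this is also the default when STORED AS is omitted.
CREATE TABLE events_text (event_id STRING, payload STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- Binary container format that can be compressed and split.
CREATE TABLE events_seq (event_id STRING, payload STRING)
STORED AS SEQUENCEFILE;

-- Convert by selecting from the text table into the SequenceFile table.
INSERT OVERWRITE TABLE events_seq SELECT * FROM events_text;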

 

Data import and export

I. Data import

1.1 Importing into a managed table
(1) Import from the local filesystem or from HDFS:

        LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2, ...)]

The difference is the LOCAL keyword: with LOCAL the file is read from a local path, without LOCAL the path refers to HDFS (both variants are sketched below).
The target table must exist before loading. If there is no table yet, create it first and then load:

        CREATE TABLE myword (id STRING, counts INT, dt STRING) row format delimited fields terminated by '\t';
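Both load variants against the myword table above, as a minimal sketch (the file paths are illustrative assumptions):

-- LOCAL: copies a file from the machine running the Hive client.
LOAD DATA LOCAL INPATH '/home/wizad/data/words.txt' INTO TABLE myword;

-- No LOCAL: moves a file already on HDFS into the table's directory,
-- replacing existing data because of OVERWRITE.
LOAD DATA INPATH '/user/wizad/data/words.txt' OVERWRITE INTO TABLE myword;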
(2) Import using the SELECT result of another Hive table:

INSERT OVERWRITE TABLE T1 SELECT * FROM T2;

Here INSERT OVERWRITE TABLE overwrites, i.e. deletes the existing data;

INSERT INTO TABLE appends incrementally and keeps the existing data.

In addition:

Drop a table: drop table if exists T1;

Truncate (empty) a table: truncate table T1;

 

1.2 Importing into an external table:
Specify the data source directly when creating the table (a local file cannot be used; the location must be an HDFS path):

(1) Importing HDFS data into Hive:

CREATE EXTERNAL TABLE wizad_mdm_dev_lmj_edition_20141120 (
  cookie_id STRING,
  guid STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/user/wizad/test/lmj/edition_compare/';

Alternatively, the full path can be given: LOCATION 'hdfs://namenode/user/wizad/test/lmj/edition_compare/';

(2) Importing HBase data into a Hive table:

First set the relevant parameters:

SET mapred.job.queue.name=queue3;
SET hbase.client.scanner.caching=5000;
SET hbase.zookeeper.quorum=datanode06,datanode07,datanode08;
SET zookeeper.znode.parent=/hbase;

If the data has a map-typed structure, declare it when creating the table:

CREATE EXTERNAL TABLE lxw2 (
  key string,
  value map<STRING,STRING>
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,fixeddim:")
TBLPROPERTIES ("hbase.table.name" = "wizad_mdm_task_geely_hk_20141014");

 

Query the result:

SELECT KEY, dim_name, dim_value FROM lxw2
LATERAL VIEW explode(VALUE) myTable1 AS dim_name, dim_value
-- WHERE KEY = '000000054153796'

 

Reading data from HBase this way may fail because the scan takes too long; the lease period can be increased:

SET hbase.regionserver.lease.period=180000;

Also, when joining an HBase-backed table with a local table, the query may start and then wait indefinitely.

II. Data export

There are three kinds of export:

(1) Export into another Hive table:

Overwrite: INSERT OVERWRITE TABLE t1 select * from t2;

Append: INSERT INTO TABLE t1 select * from t2;

Note: creating a table directly from a subquery, create table t1 as select * from t2; (CTAS), may be rejected depending on the Hive version and table type; Hive does support CTAS, but with restrictions (for example, the target cannot be an external table). If it is rejected, create the table first and then insert, as sketched below.
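A minimal sketch of the create-then-insert workaround, reusing t1 and t2 from the statements above:

-- Copy t2's schema (no data), then populate the new table.
CREATE TABLE t1 LIKE t2;
INSERT OVERWRITE TABLE t1 SELECT * FROM t2;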

(2) Export to the local filesystem or to HDFS (with or without LOCAL):

INSERT OVERWRITE [LOCAL] DIRECTORY directory1 select_statement1

Note the difference:

when exporting to the local filesystem a column delimiter can be specified,

but when exporting to HDFS it cannot: Hive's default \001 (^A) is used.

A column delimiter cannot be specified for an HDFS export; adding ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' raises an error because the clause is not recognized.

For example:

INSERT OVERWRITE DIRECTORY '/user/wizad/tmp/mytest'
select * from wizad_mdm_dev_lmj_edition_insterest;

This is because when Hive exports to HDFS it uses ^A, i.e. \001, as the column delimiter by default. The official documentation states: "Data written to the filesystem is serialized as text with columns separated by ^A." (So in Python such lines can be split with line.split('\x01') or line.split('\001').)
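For what it is worth, Hive 0.11.0 and later accept a row format on directory exports, so on newer versions the delimiter can be set directly; a hedged sketch reusing the path and table above:

-- Requires Hive 0.11.0+; the version used in this post rejected ROW FORMAT here.
INSERT OVERWRITE DIRECTORY '/user/wizad/tmp/mytest'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT * FROM wizad_mdm_dev_lmj_edition_insterest;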

Data exported with the default ^A delimiter cannot be read into Pig, however: neither '\001' nor '\\001' nor '^A' works as a PigStorage delimiter.

Workaround: load each line as a single field and then split it with STRSPLIT on '\\001' (using '^A' here does not work either); this returns a nested tuple such as ((a,b)). The Pig code:

%default interestFlie '/user/wizad/tmp/mytest/*'

-- does not work: interest_data = LOAD '$interestFlie' USING PigStorage('\\001')
-- does not work: interest_data = LOAD '$interestFlie' USING PigStorage('^A')

interest_data = LOAD '$interestFlie'
AS (cookie_id:chararray
--  guid:chararray,
--  dimkey:chararray,
--  dimvalue:chararray
);

test2 = FOREACH interest_data GENERATE STRSPLIT(cookie_id, '\\001');

DUMP test2;
DESCRIBE test2;

Result (the described schema is {(null)}):

((B2BEF56E09EC115456E221,352751019523267,interest_11,161))

((B2BEF56E09EC115456E221,352751019523267,interest_13,102))

((EC15860E09D5EA545700E9,352751019523267,interest_11,161))

((EC15860E09D5EA545700E9,352751019523267,interest_13,102))

((daca0487-5424ef1a0dc3fa-11535481,352751019523267,interest_4,61))

((daca0487-5424ef1a0dc3fa-11535481,352751019523267,interest_21,21))

((daca0487-5424ef1a0dc3fa-11535481,352751019523267,interest_11,161))

((daca0487-5424ef1a0dc3fa-11535481,352751019523267,interest_13,102))

 

 

Specifying the full export path does not help either; the following still fails:

INSERT OVERWRITE DIRECTORY 'hdfs://namenode/user/wizad/tmp/interest2/'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
select * from wizad_mdm_dev_lmj_edition_insterest;

 

Exporting to the local filesystem, where a column delimiter can be specified:

INSERT OVERWRITE local DIRECTORY '/home/wizad/lmj/inserest2'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
select * from wizad_mdm_dev_lmj_edition_insterest;

 

(3) Exporting to a local file can also be done directly with the -e option; the default delimiter is \t:

hive -e 'use wizad;
select * from wizad_mdm_dev_lmj_edition_insterest;' >> mytest

The query result uses \t as the column delimiter. An excerpt of mytest in vim's hex view:

3531 3631 3730 3631 3931 3635 34360969  51617061916546.i

In vim's hex view (:%!xxd), every two hex digits represent one character; the "." shown corresponds to 09, and 09 in the ASCII table is the tab character.

 

hive -f can also be used:

[wyp@master ~/local]$ cat wyp.sql
select * from wyp
[wyp@master ~/local]$ hive -f wyp.sql >> local/wyp2.txt

 

 

http://blog.csdn.net/longshenlmj/article/details/41519503
