
Common Hadoop Errors and Their Fixes

2015-11-27 15:44
Reposted from http://www.sharpcloud.cn/thread-4927-1-1.html

Error 1: java.io.IOException: Incompatible clusterIDs (typically appears after the namenode is reformatted)

2014-04-29 14:32:53,877 FATAL org.apache.hadoop.hdfs.server.datanode.DataNode: Initialization failed for block pool Block pool BP-1480406410-192.168.1.181-1398701121586 (storage id DS-167510828-192.168.1.191-50010-1398750515421) service to hadoop-master/192.168.1.181:9000
java.io.IOException: Incompatible clusterIDs in /data/dfs/data: namenode clusterID = CID-d1448b9e-da0f-499e-b1d4-78cb18ecdebb; datanode clusterID = CID-ff0faa40-2940-4838-b321-98272eb0dee3
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.doTransition(DataStorage.java:391)
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:191)
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:219)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:837)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:808)
        at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:280)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:222)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:664)
        at java.lang.Thread.run(Thread.java:722)
2014-04-29 14:32:53,885 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Ending block pool service for: Block pool BP-1480406410-192.168.1.181-1398701121586 (storage id DS-167510828-192.168.1.191-50010-1398750515421) service to hadoop-master/192.168.1.181:9000
2014-04-29 14:32:53,889 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Removed Block pool BP-1480406410-192.168.1.181-1398701121586 (storage id DS-167510828-192.168.1.191-50010-1398750515421)
2014-04-29 14:32:55,897 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Exiting Datanode

Cause: every namenode format creates a new clusterID, but the data directory still holds the ID from the previous format. Formatting clears the namenode's data without touching the datanodes' data, so startup fails. The thing to do is clear all the directories under the data directory before each format.


Fix: stop the cluster and delete everything under the data directory of the affected node, i.e. the dfs.data.dir directory configured in hdfs-site.xml. Then reformat the namenode.


A less disruptive alternative: stop the cluster, then edit /dfs/data/current/VERSION on each datanode and change its clusterID to match the namenode's.
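The VERSION edit can be scripted. A minimal sketch, assuming the usual current/VERSION layout and GNU sed; the example paths are placeholders for whatever dfs.name.dir / dfs.data.dir point to in your hdfs-site.xml:

```shell
#!/bin/sh
# Sketch: copy the namenode's clusterID into a datanode's VERSION file.
# Run with the datanodes stopped. Both paths are examples -- point them
# at your actual current/VERSION files.
sync_cluster_id() {
    nn_version="$1"    # e.g. /data/dfs/name/current/VERSION
    dn_version="$2"    # e.g. /data/dfs/data/current/VERSION
    cid=$(grep '^clusterID=' "$nn_version" | cut -d= -f2)
    [ -n "$cid" ] || { echo "no clusterID in $nn_version" >&2; return 1; }
    # rewrite the datanode's clusterID line in place (GNU sed -i)
    sed -i "s/^clusterID=.*/clusterID=$cid/" "$dn_version"
}
```

After running it on each datanode, start the cluster again; no reformat is needed.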

Error 2: org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start container

14/04/29 02:45:07 INFO mapreduce.Job: Job job_1398704073313_0021 failed with state FAILED due to: Application application_1398704073313_0021 failed 2 times due to Error launching appattempt_1398704073313_0021_000002. Got exception: org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start container.
This token is expired. current time is 1398762692768 found 1398711306590
        at sun.reflect.GeneratedConstructorAccessor30.newInstance(Unknown Source)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:525)
        at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.instantiateException(SerializedExceptionPBImpl.java:152)
        at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.deSerialize(SerializedExceptionPBImpl.java:106)
        at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:122)
        at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:249)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:722)
. Failing the application.
14/04/29 02:45:07 INFO mapreduce.Job: Counters: 0

Cause: clock skew between the namenode and the datanodes.


Fix: synchronize every datanode's clock with the namenode. On each server run: ntpdate time.nist.gov, and confirm the sync succeeded.

Better still, add this line to /etc/crontab on every server:

0 2 * * * root ntpdate time.nist.gov && hwclock -w

Error: java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write

2014-05-06 14:28:09,386 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: hadoop-datanode1:50010:DataXceiver error processing READ_BLOCK operation src: /192.168.1.191:48854 dest: /192.168.1.191:50010
java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/192.168.1.191:50010 remote=/192.168.1.191:48854]
        at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
        at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:172)
        at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:220)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:546)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:710)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:340)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:101)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:65)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:221)
        at java.lang.Thread.run(Thread.java:722)

Cause: I/O timeout.

Fix: edit hdfs-site.xml and add settings for the two properties dfs.datanode.socket.write.timeout and dfs.socket.timeout:

<property>
  <name>dfs.datanode.socket.write.timeout</name>
  <value>6000000</value>
</property>
<property>
  <name>dfs.socket.timeout</name>
  <value>6000000</value>
</property>

Note: these timeouts are in milliseconds; 0 means no limit.

Error: DataXceiver error processing WRITE_BLOCK operation

2014-05-06 15:21:30,378 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: hadoop-datanode1:50010:DataXceiver error processing WRITE_BLOCK operation src: /192.168.1.193:34147 dest: /192.168.1.191:50010
java.io.IOException: Premature EOF from inputStream
        at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:194)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:435)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:693)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:569)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:115)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:68)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:221)
        at java.lang.Thread.run(Thread.java:722)

Cause: the file operation outlived its lease; in practice, the file was deleted while the data stream was still operating on it.

Fix: edit hdfs-site.xml (this applies to 2.x; in 1.x the property is named dfs.datanode.max.xcievers):

<property>
  <name>dfs.datanode.max.transfer.threads</name>
  <value>8192</value>
</property>

Copy the file to every datanode and restart the datanodes.

Error: java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try.

2014-05-07 12:21:41,820 WARN [Thread-115] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Graceful stop failed
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[192.168.1.191:50010, 192.168.1.192:50010], original=[192.168.1.191:50010, 192.168.1.192:50010]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
        at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.handleEvent(JobHistoryEventHandler.java:514)
        at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.serviceStop(JobHistoryEventHandler.java:332)
        at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
        at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
        at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
        at org.apache.hadoop.service.CompositeService.stop(CompositeService.java:159)
        at org.apache.hadoop.service.CompositeService.serviceStop(CompositeService.java:132)
        at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
        at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.shutDownJob(MRAppMaster.java:548)
        at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobFinishEventHandler$1.run(MRAppMaster.java:599)
Caused by: java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[192.168.1.191:50010, 192.168.1.192:50010], original=[192.168.1.191:50010, 192.168.1.192:50010]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:860)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:925)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1031)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:823)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:475)

Cause: writes can no longer succeed. My environment has 3 datanodes and the replication factor is set to 3, so a write pipeline uses all 3 machines. The default replace-datanode-on-failure policy is DEFAULT: when the cluster has 3 or more datanodes, the client tries to find another datanode to copy to. With only 3 machines total there is no spare to swap in, so as soon as one datanode has a problem, every write keeps failing.


Fix: edit hdfs-site.xml and add or change these two properties:

<property>
  <name>dfs.client.block.write.replace-datanode-on-failure.enable</name>
  <value>true</value>
</property>
<property>
  <name>dfs.client.block.write.replace-datanode-on-failure.policy</name>
  <value>NEVER</value>
</property>

dfs.client.block.write.replace-datanode-on-failure.enable controls whether the client applies a replacement policy at all when a write fails; the default of true is fine as-is.

With dfs.client.block.write.replace-datanode-on-failure.policy at its DEFAULT, the client attempts to swap in a new datanode when there are 3 or more replicas, but with 2 replicas it simply keeps writing without a replacement. On a cluster with exactly 3 datanodes, a single unresponsive node breaks every write, so it is reasonable to turn replacement off by setting the policy to NEVER.
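The arithmetic behind that can be sketched as a rough model (this simplifies the real DEFAULT policy, which also considers pipeline state): a replacement only helps when the replication factor is at least 3 and the cluster has a spare datanode not already in the pipeline.

```shell
#!/bin/sh
# Rough model of the replace-datanode-on-failure decision: a
# replacement is attempted when replication >= 3, and can only
# succeed when there are more datanodes than replicas needed.
can_replace() {
    total_datanodes="$1"; replication="$2"
    [ "$replication" -ge 3 ] && [ "$total_datanodes" -gt "$replication" ]
}

# 3 datanodes, replication 3: no spare node, so every write fails.
can_replace 3 3 || echo "no spare datanode"
# 4 datanodes, replication 3: a replacement can be found.
can_replace 4 3 && echo "replacement possible"
```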
Error: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for

14/05/08 18:24:59 INFO mapreduce.Job: Task Id : attempt_1399539856880_0016_m_000029_2, Status : FAILED
Error: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for attempt_1399539856880_0016_m_000029_2_spill_0.out
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:398)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:150)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:131)
        at org.apache.hadoop.mapred.YarnOutputFiles.getSpillFileForWrite(YarnOutputFiles.java:159)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1573)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1467)
        at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:699)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:769)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
Container killed by the ApplicationMaster.

Cause: one of two possibilities: either hadoop.tmp.dir or the data directories ran out of space.


Fix: a look at my dfs status showed data usage under 40%, so I guessed hadoop.tmp.dir had run out of space, preventing the job's temporary files from being created. It turned out core-site.xml had no hadoop.tmp.dir configured, so the default /tmp directory was in use; data there is also lost whenever the server reboots, so it needs to change anyway. Add:

<property>
  <name>hadoop.tmp.dir</name>
  <value>/data/tmp</value>
</property>

Then reformat the namenode (hadoop namenode -format) and restart.

Error: java.io.IOException: Spill failed

2014-06-19 10:00:32,181 INFO [org.apache.hadoop.mapred.MapTask] - Ignoring exception during close for org.apache.hadoop.mapred.MapTask$NewOutputCollector@17bda0f2
java.io.IOException: Spill failed
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.checkSpillException(MapTask.java:1540)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1447)
        at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:699)
        at org.apache.hadoop.mapred.MapTask.closeQuietly(MapTask.java:1997)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:773)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339)
        at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:235)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
        at java.util.concurrent.FutureTask.run(FutureTask.java:166)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
        at java.lang.Thread.run(Thread.java:722)
Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for output/spill0.out
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:398)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:150)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:131)
        at org.apache.hadoop.mapred.MROutputFiles.getSpillFileForWrite(MROutputFiles.java:146)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1573)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$900(MapTask.java:852)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1510)

Cause: the local disk ran out of space, not HDFS (I was debugging the program in MyEclipse and the local tmp directory filled up).

Fix: free up or add disk space.

2014-06-23 10:21:01,479 INFO [IPC Server handler 3 on 45207] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1403488126955_0002_m_000000_0 is : 0.30801716
2014-06-23 10:21:01,512 FATAL [IPC Server handler 2 on 45207] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1403488126955_0002_m_000000_0 - exited : java.io.IOException: Spill failed
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.checkSpillException(MapTask.java:1540)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1063)
        at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:691)
        at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
        at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
        at com.mediadc.hadoop.MediaIndex$SecondMapper.map(MediaIndex.java:180)
        at com.mediadc.hadoop.MediaIndex$SecondMapper.map(MediaIndex.java:1)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for attempt_1403488126955_0002_m_000000_0_spill_53.out
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:398)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:150)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:131)
        at org.apache.hadoop.mapred.YarnOutputFiles.getSpillFileForWrite(YarnOutputFiles.java:159)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1573)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$900(MapTask.java:852)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1510)
[The same Spill failed stack trace is then repeated twice in "Diagnostics report from attempt_1403488126955_0002_m_000000_0" messages.]
2014-06-23 10:21:01,516 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1403488126955_0002_m_000000_0 TaskAttempt Transitioned from RUNNING to FAIL_CONTAINER_CLEANUP

The error itself is clear enough: out of disk space. The baffling part was that logging into each node showed disk usage below 40%, with plenty of space free.

It took a long while to figure out: one map task produced unusually large output, and while it ran, disk usage climbed steadily until it hit 100% and the task errored out. The failed task then released its space and was reassigned to another node. Because the space had already been freed, the disks looked half empty by the time I checked, even though the error complained about space.
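The lesson: check free space on the task-local directories while the job is running, not after it fails. A minimal helper sketch (the directory and threshold in the usage comment are examples):

```shell
#!/bin/sh
# Return success when the filesystem holding "dir" has at least
# "need_kb" kilobytes available, per df -P (POSIX output format,
# column 4 of the data row is available space in KB).
min_free_kb_ok() {
    dir="$1"; need_kb="$2"
    avail=$(df -Pk "$dir" | awk 'NR==2 {print $4}')
    [ "$avail" -ge "$need_kb" ]
}

# e.g. poll every few seconds while the job runs:
#   while sleep 5; do min_free_kb_ok /data/tmp 1048576 || echo "low space"; done
```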