Recently, I have wanted to parse websites, use BeautifulSoup to filter out what I want, and write the result to a CSV file in HDFS.
I am currently at the stage of filtering the website code with BeautifulSoup.
I want to execute it with the MapReduce streaming method:
hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.3.0-mr1-cdh5.0.2.jar
-mapper /pytemp/filter.py
-input /user/root/py/input/
-output /user/root/py/output40/
The input file contains key-value pairs, one per line: (key, value) = (url, content)
By content, I mean:
<html><head><title>...</title></head><body>...</body></html>
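For illustration, a single input line might look like the sketch below (the URL and markup are made up). Splitting with maxsplit=1 keeps any commas inside the HTML in the content part:

```python
# Hypothetical sample of one streaming input line: "url,content"
line = 'http://example.com,<html><head><title>Hi</title></head><body>a,b</body></html>'

# maxsplit=1 so commas inside the HTML stay in `content`
key, content = line.split(',', 1)
print(key)      # http://example.com
print(content)  # the full HTML, commas included
```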
filter.py file:
#!/usr/bin/env python
# coding: utf-8
from bs4 import BeautifulSoup
import sys

for line in sys.stdin:
    line = line.strip()
    key, content = line.split(",")
    # if the following two lines do not exist, the program will execute successfully
    soup = BeautifulSoup(content)
    output = soup.find()
    print("Start-----------------")
    print("End------------------")
BTW, I think I do not need a reduce.py for my job.
However, I got this error message:
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:320)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:533)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
Here is a reply saying it is a memory issue, but my input file is just 3 MB: http://grokbase.com/t/gg/rhadoop/13924fs4as/972-getting-error-pipemapred-waitoutputthreads-while-running-mapreduce-program-for-40mb-of-sizedataset
I have no idea what my problem is. I have searched a lot, but it still does not work.
My environment is:
CentOS 6, Python 2.7, Cloudera CDH5

I will appreciate your help with this situation.
EDIT on 2016/06/24
First of all, I checked the error log and found the problem was "too many values to unpack". (Thanks also to @kynan's answer.)
Here is an example of why it happened:
<font color="#0000FF">
SomeText1
<font color="#0000FF">
SomeText2
</font>
</font>
If part of the content looks like the above and I call soup.find("font", color="#0000FF") and assign the result to output, two font tags get assigned to one output, and that is why I got the error "too many values to unpack".
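The same error can be reproduced without Hadoop or BeautifulSoup at all; it is plain Python tuple unpacking. The list below is a stand-in for the two matched font tags:

```python
# Stand-in for the two <font> tags the soup would match
matches = ['SomeText1', 'SomeText2']
try:
    (output,) = matches  # one target name, two values
except ValueError as e:
    print(e)  # e.g. "too many values to unpack"
```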
Solution
Just change output = soup.find()
to (Var1, Var2, ...) = soup.find_all("font", color="#0000FF", limit=AmountOfVar)
and it works well :)
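If the number of matches can vary between pages, fixed-arity unpacking is still fragile. A small helper that pads or truncates the find_all() result is one safer sketch (the helper name and the stand-in list are my own, not from the original post):

```python
def take_exactly(matches, n):
    # Truncate to n items and pad with None so unpacking
    # never raises "too many values to unpack"
    matches = list(matches)[:n]
    return matches + [None] * (n - len(matches))

# Stand-in for soup.find_all("font", color="#0000FF", limit=2)
var1, var2 = take_exactly(['SomeText1', 'SomeText2', 'extra'], 2)
print(var1, var2)  # SomeText1 SomeText2
```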
This error usually means that the mapper process died. To find out why, check the user logs in $HADOOP_PREFIX/logs/userlogs: there is one directory per job and, inside it, one directory per container. Each container directory contains a file stderr with the output that was sent to stderr, i.e. the error messages.
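To locate those stderr files programmatically, a short walk over the userlogs tree is enough. This is a sketch assuming the usual job/container nesting; the root path is whatever $HADOOP_PREFIX resolves to on your install:

```python
import os

def find_stderr_logs(userlogs_root):
    # Collect every container-level stderr file under
    # <root>/<job_id>/<container_id>/stderr
    paths = []
    for dirpath, _dirnames, filenames in os.walk(userlogs_root):
        if 'stderr' in filenames:
            paths.append(os.path.join(dirpath, 'stderr'))
    return sorted(paths)
```

For example, `find_stderr_logs('/usr/lib/hadoop/logs/userlogs')` on a CDH-style layout; a missing root simply yields an empty list.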