Simple web crawler

Popularity: 536 Published: 2022-09-16 Tags: python-2.7 beautifulsoup

Question

I wrote the program below in Python as a very simple web crawler, but when I run it, it fails with the error 'NoneType' object is not callable. Could you please help me?

import BeautifulSoup
import urllib2
def union(p,q):
    for e in q:
        if e not in p:
            p.append(e)

def crawler(SeedUrl):
    tocrawl=[SeedUrl]
    crawled=[]
    while tocrawl:
        page=tocrawl.pop()
        pagesource=urllib2.urlopen(page)
        s=pagesource.read()
        soup=BeautifulSoup.BeautifulSoup(s)
        links=soup('a')        
        if page not in crawled:
            union(tocrawl,links)
            crawled.append(page)

    return crawled
crawler('http://www.princeton.edu/main/')

Solution

soup('a') returns the complete HTML tags, not just their URLs. For example:

<a href="http://itunes.apple.com/us/store">Buy Music Now</a>

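The crawler therefore passes tag objects, not URL strings, to urlopen, which fails with the error 'NoneType' object is not callable. This is most likely because urllib2 treats a non-string argument as a request object and calls methods on it, while BeautifulSoup resolves an unknown attribute name to a child-tag lookup that returns None. You need to extract only the URL, that is, the value of the href attribute: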

links=soup.findAll('a',href=True)
for l in links:
    print(l['href'])
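
To illustrate the difference between a whole tag and its href value without depending on a particular BeautifulSoup version, here is a minimal sketch that uses only the Python 3 standard library. It is a substitute technique, not the code from this answer: the names HrefCollector and extract_hrefs and the sample HTML are made up for the example.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects the href value of every <a> tag that has one (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_hrefs(html):
    """Return the href strings found in the given HTML text."""
    collector = HrefCollector()
    collector.feed(html)
    return collector.hrefs


# Sample markup: one anchor with an href, one without.
sample = ('<p><a href="http://itunes.apple.com/us/store">Buy Music Now</a> '
          '<a name="top">no link</a></p>')
print(extract_hrefs(sample))
```

Only the first anchor contributes a URL; the anchor without an href is skipped, which is the same effect as passing href=True to findAll.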

You need to validate the URL too. Refer to the following answers:

How do you validate a URL with a regular expression in Python?

Python - How to validate a url in python ? (Malformed or not)
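
As a lighter alternative to a regular expression, a rough validity check can be sketched with the standard URL parser. This assumes Python 3, where the function lives in urllib.parse (in Python 2 it is in the urlparse module). The helper name looks_like_http_url is hypothetical, and the check is deliberately loose: it only confirms an http or https scheme and a non-empty host part.

```python
from urllib.parse import urlparse


def looks_like_http_url(url):
    """Rough check: absolute http(s) URL with a host part (hypothetical helper)."""
    try:
        parts = urlparse(url)
    except ValueError:
        # urlparse rejects a few malformed inputs, e.g. a broken IPv6 host
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)


for candidate in ("http://www.princeton.edu/main/",
                  "/main/news/",
                  "mailto:someone@example.com"):
    print(candidate, looks_like_http_url(candidate))
```

Relative links and non-HTTP schemes such as mailto are rejected, so they never reach urlopen.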

I would also suggest using Python sets instead of lists, so that duplicate URLs are easily left out when you add new ones.

http://docs.python.org/2/library/sets.html
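
A minimal sketch of the set-based bookkeeping, assuming the built-in set type rather than the older sets module. To keep it runnable without network access, the pages come from a made-up in-memory dictionary (FAKE_LINKS), and the function name crawl and its get_links parameter are hypothetical.

```python
# Hypothetical in-memory "web": page URL -> links found on that page.
FAKE_LINKS = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/", "http://example.com/b"],
    "http://example.com/b": ["http://example.com/a"],
}


def crawl(seed, get_links):
    """Visit every reachable page once; get_links(url) returns that page's links."""
    to_crawl = [seed]
    seen = {seed}  # every URL ever queued; membership tests are fast on a set
    order = []
    while to_crawl:
        page = to_crawl.pop()
        order.append(page)
        for link in get_links(page):
            if link not in seen:  # duplicates are skipped here
                seen.add(link)
                to_crawl.append(link)
    return order


visited = crawl("http://example.com/", lambda url: FAKE_LINKS.get(url, []))
print(visited)
```

Because a URL is added to the set at the moment it is queued, no page is queued or visited twice, even though the three sample pages link to each other in a cycle.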

Try the following code:

import re
import urllib2
import BeautifulSoup

regex = re.compile(
        r'^(?:http|ftp)s?://' # http://, https://, ftp:// or ftps://
        r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}\.?)|' #domain...
        r'localhost|' #localhost...
        r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})' # ...or ip
        r'(?::\d+)?' # optional port
        r'(?:/?|[/?]\S+)$', re.IGNORECASE)

def isValidUrl(url):
    # True when the URL matches the pattern compiled above
    return regex.match(url) is not None

def crawler(SeedUrl):
    tocrawl=[SeedUrl]
    crawled=[]
    while tocrawl:
        page=tocrawl.pop()
        # skip pages that were already fetched, before downloading them again
        if page in crawled:
            continue
        print 'Crawling: '+page
        pagesource=urllib2.urlopen(page)
        s=pagesource.read()
        soup=BeautifulSoup.BeautifulSoup(s)
        # only anchors that actually carry an href attribute
        links=soup.findAll('a',href=True)
        for l in links:
            # queue absolute, well-formed URLs only; relative links are dropped
            if isValidUrl(l['href']):
                tocrawl.append(l['href'])
        crawled.append(page)
    return crawled
crawler('http://www.princeton.edu/main/')
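
One limitation of the validation step is that relative links such as news/ or /admission/ do not match the pattern and are silently dropped. A possible way around this, sketched here with the Python 3 standard library (in Python 2 the same functions live in the urlparse module), is to resolve each href against the page it was found on before validating it. The helper name absolutize is hypothetical and the sample hrefs are made up.

```python
from urllib.parse import urljoin, urldefrag


def absolutize(page_url, href):
    """Resolve href against the page it was found on and drop any #fragment."""
    absolute = urljoin(page_url, href)
    # urldefrag returns (url_without_fragment, fragment)
    return urldefrag(absolute)[0]


base = "http://www.princeton.edu/main/"
for href in ("news/", "/admission/", "http://itunes.apple.com/us/store", "#top"):
    print(absolutize(base, href))
```

Dropping the fragment keeps links like #top from being queued as if they were separate pages.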
