Multithreading in Python / BeautifulSoup scraping doesn't speed things up at all

Popularity: 953  Published: 2022-09-16  Tags: multithreading web-scraping parallel-processing python-2.7 beautifulsoup

Problem description

I have a csv file ("SomeSiteValidURLs.csv") which lists all the links I need to scrape. The code works: it goes through the urls in the csv, scrapes the information and records/saves it in another csv file ("Output.csv"). However, since I am planning to do this for a large portion of the site (>10,000,000 pages), speed is important. For each link, it takes about 1s to crawl and save the info into the csv, which is too slow for the magnitude of the project. So I incorporated the threading module, and to my surprise it doesn't speed things up at all; it still takes about 1s per link. Did I do something wrong? Is there another way to speed up the processing?

Without multithreading:

import urllib2
import csv
from bs4 import BeautifulSoup
import threading

def crawlToCSV(FileName):

    with open(FileName, "rb") as f:
        for URLrecords in f:

            OpenSomeSiteURL = urllib2.urlopen(URLrecords)
            Soup_SomeSite = BeautifulSoup(OpenSomeSiteURL, "lxml")
            OpenSomeSiteURL.close()

            tbodyTags = Soup_SomeSite.find("tbody")
            trTags = tbodyTags.find_all("tr", class_="result-item ")

            placeHolder = []

            for trTag in trTags:
                tdTags = trTag.find("td", class_="result-value")
                tdTags_string = tdTags.string
                placeHolder.append(tdTags_string)

            with open("Output.csv", "ab") as f:
                writeFile = csv.writer(f)
                writeFile.writerow(placeHolder)

crawlToCSV("SomeSiteValidURLs.csv")

With multithreading:

import urllib2
import csv
from bs4 import BeautifulSoup
import threading

def crawlToCSV(FileName):

    with open(FileName, "rb") as f:
        for URLrecords in f:

            OpenSomeSiteURL = urllib2.urlopen(URLrecords)
            Soup_SomeSite = BeautifulSoup(OpenSomeSiteURL, "lxml")
            OpenSomeSiteURL.close()

            tbodyTags = Soup_SomeSite.find("tbody")
            trTags = tbodyTags.find_all("tr", class_="result-item ")

            placeHolder = []

            for trTag in trTags:
                tdTags = trTag.find("td", class_="result-value")
                tdTags_string = tdTags.string
                placeHolder.append(tdTags_string)

            with open("Output.csv", "ab") as f:
                writeFile = csv.writer(f)
                writeFile.writerow(placeHolder)

fileName = "SomeSiteValidURLs.csv"

if __name__ == "__main__":
    t = threading.Thread(target=crawlToCSV, args=(fileName, ))
    t.start()
    t.join()

Solution

You're not parallelizing this properly. What you actually want to do is have the work being done inside your for loop happen concurrently across many workers. Right now you're moving all the work into one background thread, which does the whole thing synchronously. That's not going to improve performance at all (it will just slightly hurt it, actually).

Here's an example that uses a ThreadPool to parallelize the network operation and the parsing. It's not safe to write to the csv file from many threads at once, so instead each worker returns the data it would have written, and the parent writes all the results to the file at the end.

import urllib2
import csv
from bs4 import BeautifulSoup
from multiprocessing.dummy import Pool  # This is a thread-based Pool
from multiprocessing import cpu_count

def crawlToCSV(URLrecord):
    OpenSomeSiteURL = urllib2.urlopen(URLrecord)
    Soup_SomeSite = BeautifulSoup(OpenSomeSiteURL, "lxml")
    OpenSomeSiteURL.close()

    tbodyTags = Soup_SomeSite.find("tbody")
    trTags = tbodyTags.find_all("tr", class_="result-item ")

    placeHolder = []

    for trTag in trTags:
        tdTags = trTag.find("td", class_="result-value")
        tdTags_string = tdTags.string
        placeHolder.append(tdTags_string)

    return placeHolder


if __name__ == "__main__":
    fileName = "SomeSiteValidURLs.csv"
    pool = Pool(cpu_count() * 2)  # Creates a Pool with cpu_count * 2 threads.
    with open(fileName, "rb") as f:
        results = pool.map(crawlToCSV, f)  # results is a list of all the placeHolder lists returned from each call to crawlToCSV
    with open("Output.csv", "ab") as f:
        writeFile = csv.writer(f)
        for result in results:
            writeFile.writerow(result)
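
If you did want each worker thread to write its own row instead of returning it, the writes would have to be serialized with a lock. Below is a minimal sketch of that variant; the writeLock, outputFile and crawlAndWrite names are additions for illustration, and it reuses the crawlToCSV worker defined above. Returning the rows to the parent, as in the example above, is usually the simpler option.

import threading
import csv

writeLock = threading.Lock()             # serializes access to the shared csv writer
outputFile = open("Output.csv", "ab")    # opened once, shared by every worker thread
writeFile = csv.writer(outputFile)

def crawlAndWrite(URLrecord):
    placeHolder = crawlToCSV(URLrecord)  # scrape and parse exactly as before
    with writeLock:                      # only one thread may write a row at a time
        writeFile.writerow(placeHolder)

# pool.map(crawlAndWrite, f) would then write rows as they finish;
# remember to close (or at least flush) outputFile once the pool is done.
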

Note that in Python, threads only actually speed up I/O operations - because of the GIL, CPU-bound operations (like the parsing/searching BeautifulSoup is doing) can't actually be done in parallel via threads, because only one thread can do CPU-based operations at a time. So you still may not see the speed up you were hoping for with this approach. When you need to speed up CPU-bound operations in Python, you need to use multiple processes instead of threads. Luckily, you can easily see how this script performs with multiple processes instead of multiple threads; just change from multiprocessing.dummy import Pool to from multiprocessing import Pool. No other changes are required.
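
For reference, this is the only line that changes between the two versions (a sketch; everything else in the script above stays exactly as written):

# Thread-based pool (what the example above uses):
from multiprocessing.dummy import Pool
# Process-based pool (sidesteps the GIL for the CPU-bound parsing):
from multiprocessing import Pool

One caveat with the process-based Pool: work is handed to the child processes by pickling, so crawlToCSV has to remain a module-level function (it already is here), and the if __name__ == "__main__" guard becomes mandatory on Windows.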

Edit:

If you need to scale this up to a file with 10,000,000 lines, you're going to need to adjust this code a bit - Pool.map converts the iterable you pass into it to a list prior to sending it off to your workers, which obviously isn't going to work very well with a 10,000,000 entry list; having that whole thing in memory is probably going to bog down your system. Same issue with storing all the results in a list. Instead, you should use Pool.imap:

imap(func, iterable[, chunksize])

A lazier version of map().

The chunksize argument is the same as the one used by the map() method. For very long iterables using a large value for chunksize can make the job complete much faster than using the default value of 1.

if __name__ == "__main__":
    fileName = "SomeSiteValidURLs.csv"
    FILE_LINES = 10000000
    NUM_WORKERS = cpu_count() * 2
    chunksize = FILE_LINES // (NUM_WORKERS * 4)  # Try to get a good chunksize (a few tasks per worker). You're probably going to have to tweak this, though. Try larger and smaller values and see how performance changes.
    pool = Pool(NUM_WORKERS)

    with open(fileName, "rb") as f:
        result_iter = pool.imap(crawlToCSV, f, chunksize)
        with open("Output.csv", "ab") as outputFile:
            writeFile = csv.writer(outputFile)
            for result in result_iter:  # lazily iterate over results; keep the input file open until the iteration finishes.
                writeFile.writerow(result)

With imap, we never put all of f into memory at once, nor do we store all the results in memory at once. The most we ever have in memory is chunksize lines of f, which should be more manageable.
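
As a rough illustration of the chunksize arithmetic above (the 4-core machine below is just an assumed example):

FILE_LINES = 10000000
NUM_WORKERS = 8                              # cpu_count() * 2 on a 4-core machine
chunksize = FILE_LINES // (NUM_WORKERS * 4)  # 312500 lines per task, i.e. 4 tasks per worker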
