Multithreading in Python / BeautifulSoup scraping doesn't speed things up at all

Popularity: 953  Published: 2022-09-16  Tags: multithreading web-scraping parallel-processing python-2.7 beautifulsoup

Problem description

I have a csv file ("SomeSiteValidURLs.csv") which lists all the links I need to scrape. The code works: it goes through the urls in the csv, scrapes the information and records/saves it in another csv file ("Output.csv"). However, since I am planning to do this for a large portion of the site (>10,000,000 pages), speed is important. For each link, it takes about 1s to crawl and save the info into the csv, which is too slow for the magnitude of the project. So I incorporated the threading module, and to my surprise it doesn't speed things up at all; it still takes about 1s per link. Did I do something wrong? Is there another way to speed up the processing?

Without multithreading:

import urllib2
import csv
from bs4 import BeautifulSoup
import threading

def crawlToCSV(FileName):

    with open(FileName, "rb") as f:
        for URLrecords in f:

            OpenSomeSiteURL = urllib2.urlopen(URLrecords)
            Soup_SomeSite = BeautifulSoup(OpenSomeSiteURL, "lxml")
            OpenSomeSiteURL.close()

            tbodyTags = Soup_SomeSite.find("tbody")
            trTags = tbodyTags.find_all("tr", class_="result-item ")

            placeHolder = []

            for trTag in trTags:
                tdTags = trTag.find("td", class_="result-value")
                tdTags_string = tdTags.string
                placeHolder.append(tdTags_string)

            with open("Output.csv", "ab") as f:
                writeFile = csv.writer(f)
                writeFile.writerow(placeHolder)

crawlToCSV("SomeSiteValidURLs.csv")

With multithreading:

import urllib2
import csv
from bs4 import BeautifulSoup
import threading

def crawlToCSV(FileName):

    with open(FileName, "rb") as f:
        for URLrecords in f:

            OpenSomeSiteURL = urllib2.urlopen(URLrecords)
            Soup_SomeSite = BeautifulSoup(OpenSomeSiteURL, "lxml")
            OpenSomeSiteURL.close()

            tbodyTags = Soup_SomeSite.find("tbody")
            trTags = tbodyTags.find_all("tr", class_="result-item ")

            placeHolder = []

            for trTag in trTags:
                tdTags = trTag.find("td", class_="result-value")
                tdTags_string = tdTags.string
                placeHolder.append(tdTags_string)

            with open("Output.csv", "ab") as f:
                writeFile = csv.writer(f)
                writeFile.writerow(placeHolder)

fileName = "SomeSiteValidURLs.csv"

if __name__ == "__main__":
    t = threading.Thread(target=crawlToCSV, args=(fileName, ))
    t.start()
    t.join()

Solution

You're not parallelizing this properly. What you actually want to do is have the work being done inside your for loop happen concurrently across many workers. Right now you're moving all the work into one background thread, which does the whole thing synchronously. That's not going to improve performance at all (it will just slightly hurt it, actually).

Here's an example that uses a ThreadPool to parallelize the network operation and the parsing. It's not safe to write to the csv file from many threads at once, so instead each worker returns the data it would have written, and the parent writes all the results to the file at the end.

import urllib2
import csv
from bs4 import BeautifulSoup
from multiprocessing.dummy import Pool  # This is a thread-based Pool
from multiprocessing import cpu_count

def crawlToCSV(URLrecord):
    OpenSomeSiteURL = urllib2.urlopen(URLrecord)
    Soup_SomeSite = BeautifulSoup(OpenSomeSiteURL, "lxml")
    OpenSomeSiteURL.close()

    tbodyTags = Soup_SomeSite.find("tbody")
    trTags = tbodyTags.find_all("tr", class_="result-item ")

    placeHolder = []

    for trTag in trTags:
        tdTags = trTag.find("td", class_="result-value")
        tdTags_string = tdTags.string
        placeHolder.append(tdTags_string)

    return placeHolder


if __name__ == "__main__":
    fileName = "SomeSiteValidURLs.csv"
    pool = Pool(cpu_count() * 2)  # Creates a Pool with cpu_count * 2 threads.
    with open(fileName, "rb") as f:
        results = pool.map(crawlToCSV, f)  # results is a list of all the placeHolder lists returned from each call to crawlToCSV
    with open("Output.csv", "ab") as f:
        writeFile = csv.writer(f)
        for result in results:
            writeFile.writerow(result)
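
If you did want each worker thread to write its own row instead of returning it, the writes would have to be serialized with a lock. Below is a minimal sketch of that variant; the writeLock, outputFile and crawlAndWrite names are additions for illustration, and it reuses the crawlToCSV worker defined above. Returning the rows to the parent, as in the example above, is usually the simpler option.

import threading
import csv

writeLock = threading.Lock()             # serializes access to the shared csv writer
outputFile = open("Output.csv", "ab")    # opened once, shared by every worker thread
writeFile = csv.writer(outputFile)

def crawlAndWrite(URLrecord):
    placeHolder = crawlToCSV(URLrecord)  # scrape and parse exactly as before
    with writeLock:                      # only one thread may write a row at a time
        writeFile.writerow(placeHolder)

# pool.map(crawlAndWrite, f) would then write rows as they finish;
# remember to close (or at least flush) outputFile once the pool is done.
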

Note that in Python, threads only actually speed up I/O operations - because of the GIL, CPU-bound operations (like the parsing/searching BeautifulSoup is doing) can't actually be done in parallel via threads, because only one thread can do CPU-based operations at a time. So you still may not see the speed up you were hoping for with this approach. When you need to speed up CPU-bound operations in Python, you need to use multiple processes instead of threads. Luckily, you can easily see how this script performs with multiple processes instead of multiple threads; just change from multiprocessing.dummy import Pool to from multiprocessing import Pool. No other changes are required.
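
For reference, this is the only line that changes between the two versions (a sketch; everything else in the script above stays exactly as written):

# Thread-based pool (what the example above uses):
from multiprocessing.dummy import Pool
# Process-based pool (sidesteps the GIL for the CPU-bound parsing):
from multiprocessing import Pool

One caveat with the process-based Pool: work is handed to the child processes by pickling, so crawlToCSV has to remain a module-level function (it already is here), and the if __name__ == "__main__" guard becomes mandatory on Windows.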

Edit:

If you need to scale this up to a file with 10,000,000 lines, you're going to need to adjust this code a bit - Pool.map converts the iterable you pass into it to a list prior to sending it off to your workers, which obviously isn't going to work very well with a 10,000,000 entry list; having that whole thing in memory is probably going to bog down your system. Same issue with storing all the results in a list. Instead, you should use Pool.imap:

imap(func, iterable[, chunksize])

A lazier version of map().

The chunksize argument is the same as the one used by the map() method. For very long iterables using a large value for chunksize can make the job complete much faster than using the default value of 1.

if __name__ == "__main__":
    fileName = "SomeSiteValidURLs.csv"
    FILE_LINES = 10000000
    NUM_WORKERS = cpu_count() * 2
    chunksize = FILE_LINES // (NUM_WORKERS * 4)  # Try to get a good chunksize (a few tasks per worker). You're probably going to have to tweak this, though. Try larger and smaller values and see how performance changes.
    pool = Pool(NUM_WORKERS)

    with open(fileName, "rb") as f:
        result_iter = pool.imap(crawlToCSV, f, chunksize)
        with open("Output.csv", "ab") as outputFile:
            writeFile = csv.writer(outputFile)
            for result in result_iter:  # lazily iterate over results; keep the input file open until the iteration finishes.
                writeFile.writerow(result)

With imap, we never put all of f into memory at once, nor do we store all the results in memory at once. The most we ever have in memory is chunksize lines of f, which should be more manageable.
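
As a rough illustration of the chunksize arithmetic above (the 4-core machine below is just an assumed example):

FILE_LINES = 10000000
NUM_WORKERS = 8                              # cpu_count() * 2 on a 4-core machine
chunksize = FILE_LINES // (NUM_WORKERS * 4)  # 312500 lines per task, i.e. 4 tasks per worker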
