Reading random lines from a huge CSV file

Views: 766 Published: 2022-10-16 Tags: file python csv random

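Question

I have a very large CSV file (15 GB) and I need to read about 1 million random lines from it. As far as I can see, and from what I have implemented, the csv utilities in Python only allow iterating through the file sequentially.

Reading the whole file into memory in order to make a random selection consumes a lot of memory, and walking through the whole file while discarding some values and keeping others is very time-consuming. So is there any way to pick some random lines from the CSV file and read only those lines?

I tried the following, without success: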

import csv

with open('linear_e_LAN2A_F_0_435keV.csv') as file:
    reader = csv.reader(file)
    print reader[someRandomInteger]

Example of the CSV file:

331.093,329.735
251.188,249.994
374.468,373.782
295.643,295.159
83.9058,0
380.709,116.221
352.238,351.891
183.809,182.615
257.277,201.302
61.4598,40.7106

Recommended answer

import os
import random

path = 'really_big_file'
filesize = os.path.getsize(path)    # size of the really big file, in bytes
offset = random.randrange(filesize)

f = open(path, 'rb')                # binary mode, so any byte offset is a valid seek target
f.seek(offset)                      # go to random position
f.readline()                        # discard - bound to be partial line
random_line = f.readline()          # bingo! (a bytes object in binary mode)

# extra to handle last/first line edge cases
if len(random_line) == 0:           # we have hit the end
    f.seek(0)
    random_line = f.readline()      # so we'll grab the first line instead

f.close()
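Since the question asks for about a million lines, the steps above can be wrapped in a loop. The following is only a sketch: the function name `random_rows` is made up for illustration, it assumes every CSV record fits on a single physical line (no quoted fields containing newlines), and it samples with replacement, so the same row can come back more than once.

```python
import csv
import os
import random

def random_rows(path, n):
    """Return n parsed CSV rows picked by seeking to random byte offsets.

    Hypothetical helper: rows may repeat, and the selection carries the
    same bias as the single-line version it is built on.
    """
    filesize = os.path.getsize(path)
    rows = []
    with open(path, "rb") as f:           # binary mode: byte offsets are exact
        for _ in range(n):
            f.seek(random.randrange(filesize))
            f.readline()                  # discard the (probably partial) line
            line = f.readline()
            if not line:                  # ran off the end: use the first line
                f.seek(0)
                line = f.readline()
            # Parse the single line with the csv module.
            rows.append(next(csv.reader([line.decode()])))
    return rows

# Example call (file name taken from the question):
# rows = random_rows('linear_e_LAN2A_F_0_435keV.csv', 1_000_000)
```

Each pick costs one seek and two short reads, so the running time depends on the number of lines requested rather than on the 15 GB file size.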

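As @AndreBoos pointed out, this approach leads to a biased selection: a line is picked whenever the random offset lands anywhere in the line before it, so its chance of being picked is proportional to the length of that previous line. If you know the minimum and maximum line length, you can remove this bias as follows: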

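Let's assume (in this case) we have min=3 and max=15.

1) Find the length (Lp) of the previous line.

If Lp=3, the line is the most biased against, so we should keep it 100% of the time. If Lp=15, the line is the most biased towards, so we should keep it only 20% of the time, because it is 5 times more likely to be selected.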
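The "5 times more likely" figure can be checked on a tiny in-memory file by trying every possible offset instead of a random one. This is a small sketch; the sample lines and the helper name `pick_line` are invented for the demonstration, with line lengths (newline included) of 3, 15, 3 and 5 bytes.

```python
import io
from collections import Counter

def pick_line(f, offset):
    """Apply the seek-and-skip steps for one fixed offset."""
    f.seek(offset)
    f.readline()              # discard the rest of the current line
    line = f.readline()
    if not line:              # hit the end: fall back to the first line
        f.seek(0)
        line = f.readline()
    return line

# Made-up sample data; lengths in bytes are 3, 15, 3 and 5.
lines = [b"ab\n", b"cdefghijklmnop\n", b"qr\n", b"stuv\n"]
data = b"".join(lines)
f = io.BytesIO(data)

# Count how many of the possible offsets select each line.
counts = Counter(pick_line(f, off) for off in range(len(data)))

# Each line is selected once per byte of the line that precedes it
# (the first line is preceded, cyclically, by the last one).
previous = [lines[-1]] + lines[:-1]
for prev, line in zip(previous, lines):
    print(line, "picked by", counts[line], "offsets; previous length", len(prev))
```

Here the line that follows the 15-byte line is selected by 15 offsets, while a line that follows a 3-byte line is selected by only 3.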

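To compensate, keep the picked line only with probability X, where:

X = min / Lp

If the line is not kept, we make another random pick, and repeat until the dice roll comes out well. :-)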
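The correction described above might be sketched as follows. The function name `uniform_random_line` and its parameters are hypothetical, not part of the original answer. It assumes `min_len` and `max_len` are valid bounds on the line length in bytes with the newline included, and that the file is opened in binary mode. Because the random offset usually lands in the middle of the previous line, the sketch looks back up to `max_len` bytes to find where that line starts, so that its full length Lp can be measured.

```python
import os
import random

def uniform_random_line(f, filesize, min_len, max_len):
    """Return one line from binary file object f, each line equally likely.

    min_len / max_len: assumed bounds on line length in bytes,
    newline included (hypothetical parameters for this sketch).
    """
    while True:
        offset = random.randrange(filesize)
        # Look back at most max_len bytes to find the start of the
        # line that contains `offset`.
        window_start = max(0, offset - max_len)
        f.seek(window_start)
        window = f.read(offset - window_start)
        # rfind returns -1 when no newline is found, which makes
        # line_start equal to window_start (the start of the file).
        line_start = window_start + window.rfind(b"\n") + 1
        f.seek(line_start)
        prev_line = f.readline()   # the whole previous line, length Lp
        line = f.readline()        # the candidate line
        if not line:               # previous line was the last: wrap around
            f.seek(0)
            line = f.readline()
        # Keep the candidate with probability X = min / Lp.
        if random.random() < min_len / len(prev_line):
            return line

# Example call (file name is a placeholder from the answer):
# with open('really_big_file', 'rb') as f:
#     size = os.path.getsize('really_big_file')
#     line = uniform_random_line(f, size, min_len=3, max_len=15)
```

A candidate is reached with probability Lp / filesize and kept with probability min / Lp, so every line ends up with the same overall chance, min / filesize, per attempt. On average no more than max / min attempts are needed per returned line, which is 5 with the example bounds.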
