Create a new csv file in Google Cloud Storage from a cloud function

Views: 981 Published: 2022-10-16 Tags: python csv google-cloud-platform google-cloud-storage google-cloud-functions

Question

First time working with Google Cloud Storage. Below I have a cloud function which is triggered whenever a csv file gets uploaded to my-folder inside my bucket. My goal is to create a new csv file in the same folder, read the contents of the uploaded csv, and convert each line to a URL that will go into the newly created csv. The problem is that I'm having trouble just creating the new csv in the first place, let alone actually writing to it.

My code:

import os.path
import csv
import sys
import json
from csv import reader, DictReader, DictWriter
from google.cloud import storage
from io import StringIO

def generate_urls(data, context):
    if context.event_type == 'google.storage.object.finalize':
        storage_client = storage.Client()
        bucket_name = data['bucket']
        bucket = storage_client.get_bucket(bucket_name)
        folder_name = 'my-folder'
        file_name = data['name']

        if not file_name.endswith('.csv'):
            return

These next few lines came from an example in GCP's GitHub repo. This is where I would expect the new csv to be created, but nothing happens.

        # Prepend 'URL_' to the uploaded file name for the name of the new csv
        destination = bucket.blob(bucket_name + '/' + file_name[:14] + 'URL_' + file_name[14:])
        destination.content_type = 'text/csv'
        sources = [bucket.get_blob(file_name)]
        destination.compose(sources)
        output = bucket_name + '/' + file_name[:14] + 'URL_' + file_name[14:]


        # Transform uploaded csv to string - this was recommended on a similar SO post, not sure if this works or is the right approach...
        blob = bucket.blob(file_name)
        blob = blob.download_as_string()
        blob = blob.decode('utf-8')
        blob = StringIO(blob)

        input_csv = csv.reader(blob)

Getting an error on the next line: No such file or directory: 'myProjectId/my-folder/URL_my_file.csv'

        with open(output, 'w') as output_csv:
            csv_dict_reader = csv.DictReader(input_csv, )
            csv_writer = csv.DictWriter(output_csv, fieldnames=['URL'], delimiter=',', quotechar='"', quoting=csv.QUOTE_ALL)
            csv_writer.writeheader()
            line_count = 0
            for row in csv_dict_reader:
                line_count += 1
                url = ''
                ...
                # code that converts each line
                ...
                csv_writer.writerow({'URL': url})
            print(f'Total rows: {line_count}')

If anyone has any suggestions on how I could get this to create the new csv and then write to it, it would be a huge help. Thank you!

Recommended answer

I have a few questions about the code and the design of the solution:

As I understand it, on one hand the cloud function is triggered by a finalize event (see Google Cloud Storage Triggers); on the other hand you would like to save the newly created file into the same bucket. If that succeeds, the appearance of a new object in that bucket will trigger another instance of your cloud function. Is that the intended behaviour? Is your cloud function ready for that?

Strictly speaking, there is no such thing as a folder in GCS. Thus in this code:

        folder_name = 'my-folder'
        file_name = data['name']

the first line is a bit redundant, unless you would like to use that variable and value for something else. And file_name gets the object name including all prefixes (you may consider them as "folders").
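To illustrate the point about prefixes, here is a minimal sketch of splitting an object name into its prefix and base name with the standard library, instead of slicing at a fixed character position. The helper names split_object_name and prefixed_object_name are hypothetical, not part of any Google library.

```python
import posixpath


def split_object_name(object_name):
    """Split a GCS object name into its prefix ("folder") and base name."""
    # GCS object names always use "/" as the separator, so posixpath is
    # the right module regardless of the operating system.
    prefix, base_name = posixpath.split(object_name)
    return prefix, base_name


def prefixed_object_name(object_name, marker="URL_"):
    """Return the object name with `marker` prepended to the base name only."""
    prefix, base_name = split_object_name(object_name)
    return posixpath.join(prefix, marker + base_name)


print(split_object_name("my-folder/my_file.csv"))     # ('my-folder', 'my_file.csv')
print(prefixed_object_name("my-folder/my_file.csv"))  # my-folder/URL_my_file.csv
```

This keeps working if the prefix changes length or the object is uploaded to the bucket root.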

The example you refer to - storage_compose_file.py - is about how several objects in GCS can be composed into one. I am not sure that example is relevant to your case, unless you have some additional requirements.

Now, let's have a look at this snippet:

        destination = bucket.blob(bucket_name + '/' + file_name[:14] + 'URL_' + file_name[14:])
        destination.content_type = 'text/csv'
        sources = [bucket.get_blob(file_name)]
        destination.compose(sources)

a. bucket.blob is a factory constructor - see the API buckets description. It only creates a local Blob object, without making a request to GCS, and its argument is the object name inside the bucket. I am not sure you really want to use bucket_name as an element of that argument...

b. sources becomes a list with only one element - a reference to the existing object in the GCS bucket.

c. destination.compose(sources) - is it an attempt to make a copy of the existing object? If successful, it may trigger another instance of your cloud function.
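Because any object written into the watched bucket fires the function again, one possible guard (if the output really has to stay in the same bucket) is to skip objects that the function itself produced. This is only a sketch: the URL_ marker comes from the question, while OUTPUT_MARKER and should_process are hypothetical names.

```python
import posixpath

# Marker the question prepends to generated files; treated here as the
# sign that an object is an output, not an upload.
OUTPUT_MARKER = "URL_"


def should_process(object_name):
    """Return True only for uploaded csv files, not for generated ones."""
    base_name = posixpath.basename(object_name)
    if not base_name.endswith(".csv"):
        return False
    # Skip files this function wrote, otherwise each output would
    # trigger another run on itself.
    return not base_name.startswith(OUTPUT_MARKER)
```

Inside the function this would be called as should_process(data['name']) before doing any work. Writing to a different bucket avoids the problem without needing such a check.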

About the type change:

        blob = bucket.blob(file_name)
        blob = blob.download_as_string()

After the first line the blob variable has the type google.cloud.storage.blob.Blob. After the second it is bytes. Python allows rebinding a name like that... but would you really like it? A separate name for each stage is easier to read. BTW, the download_as_string method is deprecated in favour of download_as_bytes - see the Blobs / Objects API.
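A minimal sketch of the same read step with one name per stage and the non-deprecated download_as_bytes call. The function name read_csv_rows is hypothetical, and it assumes the uploaded file is UTF-8 with a header row, as the question's use of DictReader suggests.

```python
import csv
import io


def read_csv_rows(source_blob):
    """Download a csv object and return its rows as a list of dicts.

    `source_blob` is expected to behave like google.cloud.storage.Blob,
    i.e. to provide download_as_bytes().
    """
    raw_bytes = source_blob.download_as_bytes()
    # Assumption: the uploaded file is UTF-8 encoded.
    text = raw_bytes.decode("utf-8")
    # newline="" leaves line endings untouched so the csv module can
    # handle them itself.
    reader = csv.DictReader(io.StringIO(text, newline=""))
    return list(reader)


# Inside the cloud function this could be used as:
#     rows = read_csv_rows(bucket.blob(file_name))
```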

About the output:

   output = bucket_name + '/' + file_name[:14] + 'URL_' + file_name[14:]
    
   with open(output, 'w') as output_csv:

Bear in mind that open() works on the local filesystem of the cloud function, which has nothing to do with GCS buckets or blobs. If you would like to use temporary files within a cloud function, you have to put them in the /tmp directory - see Write temporary files from Google Cloud Function. I would guess that you get the error because of this issue.
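As a sketch of the /tmp approach, the output csv can be written to the temporary directory first and uploaded afterwards. The function name write_urls_to_tmp and the default file name are hypothetical; tempfile.gettempdir() is used on the assumption that it resolves to /tmp in the function's runtime.

```python
import csv
import os
import tempfile


def write_urls_to_tmp(urls, file_name="URL_output.csv"):
    """Write one URL per row to a csv in the temp directory; return its path."""
    tmp_path = os.path.join(tempfile.gettempdir(), file_name)
    # newline="" is what the csv module expects for files it writes to.
    with open(tmp_path, "w", newline="") as output_csv:
        writer = csv.DictWriter(output_csv, fieldnames=["URL"], quoting=csv.QUOTE_ALL)
        writer.writeheader()
        for url in urls:
            writer.writerow({"URL": url})
    return tmp_path


# The local file could then be uploaded and cleaned up, for example:
#     destination_blob.upload_from_filename(tmp_path)
#     os.remove(tmp_path)
```

Files in /tmp are held in the function's memory, so removing them after the upload is worth the extra line.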

=> Coming to some suggestions.

You probably would like to download the object into the cloud function's memory (into the /tmp directory). Then you would like to process the source file and save the result next to it in /tmp. Then you would like to upload the result to another (not the source) bucket. If my assumptions are correct, I would suggest implementing those things step by step, and checking that you get the desired result at each step.
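Put together, those steps might look like the sketch below. It is a variant that keeps the data in memory instead of using files in /tmp, which should be fine for small csv files. The names build_url_csv and convert_object are hypothetical, and row_to_url stands for the conversion logic elided in the question, so it is passed in as a parameter. The bucket arguments are expected to behave like google.cloud.storage.Bucket objects.

```python
import csv
import io
import posixpath


def build_url_csv(source_bytes, row_to_url):
    """Turn the uploaded csv (as bytes) into the text of the URL csv."""
    # Assumption: the source file is UTF-8 and has a header row.
    reader = csv.DictReader(io.StringIO(source_bytes.decode("utf-8"), newline=""))
    output = io.StringIO(newline="")
    writer = csv.DictWriter(output, fieldnames=["URL"], quoting=csv.QUOTE_ALL)
    writer.writeheader()
    for row in reader:
        writer.writerow({"URL": row_to_url(row)})
    return output.getvalue()


def convert_object(source_bucket, destination_bucket, object_name, row_to_url):
    """Download one csv, convert it, and upload the result to another bucket."""
    # Step 1: download the uploaded object.
    source_bytes = source_bucket.blob(object_name).download_as_bytes()
    # Step 2: process it.
    url_csv = build_url_csv(source_bytes, row_to_url)
    # Step 3: upload under the same prefix with 'URL_' before the base name.
    prefix, base_name = posixpath.split(object_name)
    destination_name = posixpath.join(prefix, "URL_" + base_name)
    destination_bucket.blob(destination_name).upload_from_string(
        url_csv, content_type="text/csv"
    )
    return destination_name


# Inside the cloud function (bucket name 'my-output-bucket' is made up):
#     client = storage.Client()
#     convert_object(client.bucket(data['bucket']),
#                    client.bucket('my-output-bucket'),
#                    data['name'], row_to_url)
```

Since build_url_csv only deals with bytes and text, it can be checked locally before anything is deployed, which fits the step-by-step approach.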
