Python Basics: 7 Tips for Faster File Operations in Python

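Introduction

Handling files efficiently and safely is an important skill in Python programming. This article looks at several ways to improve file handling: using the `with` statement, processing files in batches, setting the buffer size, using binary mode, speeding up work with multiple threads or processes, and using dedicated modules such as `pickle` and `csv`. Each method is introduced below along with the scenarios it suits.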

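1. Use the `with` statement to handle files safely

Opening files with the `with` statement is a best practice in Python. It manages opening and closing the file automatically, and the file is closed correctly even if an exception is raised during the file operation.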

Code example:

```python
# Safely open and read a file using the with statement
filename = 'example.txt'

with open(filename, mode='r', encoding='utf-8') as file:
    content = file.read()

print(content)
```

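Explanation:

- `open()` opens the file.
- `'r'` opens the file in read-only mode.
- `encoding='utf-8'` specifies that the file is encoded as UTF-8.
- The `with` statement ensures the file is closed automatically once the block has finished with it.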
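To make that guarantee concrete, here is a minimal sketch showing that the file is closed even when the block raises. The temporary file and the deliberately raised `ValueError` are illustrative assumptions, not part of the original example.

```python
import os
import tempfile

# Create a throwaway file so the sketch is self-contained (illustrative only).
fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
with open(path, mode='w', encoding='utf-8') as f:
    f.write('hello')

handle = None
try:
    with open(path, mode='r', encoding='utf-8') as file:
        handle = file          # keep a reference so it can be inspected afterwards
        content = file.read()
        raise ValueError('simulated failure while processing')
except ValueError:
    pass

# The with block closed the file on the way out, despite the exception.
closed_after_error = handle.closed
print(closed_after_error)

os.remove(path)
```

The same behavior could be written by hand with `try`/`finally` and an explicit `file.close()`, but the `with` form is shorter and harder to get wrong.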

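2. Process files in batches

When a large number of files must be processed, handle them in batches rather than loading everything at once. This avoids running out of memory and keeps any single processing step from taking too long.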

Code example:

```python
import os

directory = 'path/to/directory'
batch_size = 1000  # number of files per batch

files = os.listdir(directory)

for i in range(0, len(files), batch_size):
    batch = files[i:i + batch_size]

    for filename in batch:
        filepath = os.path.join(directory, filename)

        # os.listdir() also returns subdirectories; skip anything that is not a file
        if not os.path.isfile(filepath):
            continue

        with open(filepath, mode='r', encoding='utf-8') as file:
            content = file.read()

        # Process the file content
        print(content)
```

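Explanation:

- `os.listdir()` returns the names of all entries in the directory.
- `range(0, len(files), batch_size)` generates the starting index of each batch.
- `files[i:i + batch_size]` slices out the file names for one batch.
- The loops then process each batch in turn.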
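The slicing approach above can also be written as a small reusable helper. The sketch below is one possible version; `iter_batches` is a hypothetical name introduced here, and it relies on `itertools.islice` so that it works with any iterable, not only lists.

```python
from itertools import islice

def iter_batches(items, batch_size):
    """Yield successive lists of at most batch_size items from any iterable."""
    if batch_size < 1:
        raise ValueError('batch_size must be at least 1')
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

# Example with made-up file names
names = [f'file_{n}.txt' for n in range(5)]
batches = list(iter_batches(names, 2))
print(batches)
```

Because the helper consumes an iterator lazily, it could also be fed the result of `os.scandir()` so that the full directory listing never has to be held in a list.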

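3. Use buffering to speed up reads and writes

Setting the buffer size of the file object can improve read and write speed, because a larger buffer means fewer system calls. The actual gain depends on the workload, and Python's default buffer size is already a reasonable choice in many cases.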

Code example:

```python
buffer_size = 4096  # buffer size

with open('large_file.txt', mode='r', encoding='utf-8', buffering=buffer_size) as file:
    while True:
        chunk = file.read(buffer_size)

        if not chunk:
            break

        # Process the chunk
        print(chunk)
```

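Explanation:

- `buffering=buffer_size` sets the buffer size.
- `file.read(buffer_size)` reads a chunk of at most the given size per call; in text mode the size counts characters, not bytes.
- `if not chunk:` checks whether the end of the file has been reached.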
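The read loop above can be wrapped in a generator so the chunking logic is written once and reused. This is a minimal sketch: `read_chunks` is a hypothetical helper name, and the default chunk size of `io.DEFAULT_BUFFER_SIZE` is an assumption chosen for illustration rather than a tuned value.

```python
import io
import os
import tempfile

def read_chunks(path, chunk_size=io.DEFAULT_BUFFER_SIZE):
    """Yield text chunks of at most chunk_size characters from a UTF-8 file."""
    with open(path, mode='r', encoding='utf-8') as file:
        while True:
            chunk = file.read(chunk_size)
            if not chunk:
                break
            yield chunk

# Self-contained demo on a temporary file
fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
with open(path, mode='w', encoding='utf-8') as f:
    f.write('abcdefghij')

chunks = list(read_chunks(path, chunk_size=4))
print(chunks)

os.remove(path)
```

Callers then write `for chunk in read_chunks(path): ...` and only one chunk is held in memory at a time.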

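4. Use binary mode for large files

For very large files, consider reading in binary mode (`'rb'`). Binary mode skips text decoding and newline translation, so the content can be processed faster; the data arrives as `bytes` rather than `str`.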

Code example:

```python
with open('large_binary_file.bin', mode='rb', buffering=4096) as file:
    while True:
        chunk = file.read(4096)

        if not chunk:
            break

        # Process the binary chunk
        print(chunk)
```

Explanation:

- `'rb'` opens the file for reading in binary mode.
- `file.read(4096)` reads a chunk of up to 4096 bytes per call.
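A common practical use of chunked binary reads is hashing a file without loading it all into memory. The sketch below assumes SHA-256 as the digest and uses a hypothetical helper name, `sha256_of_file`; neither comes from the original article.

```python
import hashlib
import os
import tempfile

def sha256_of_file(path, chunk_size=4096):
    """Return the SHA-256 hex digest of a file, reading it in binary chunks."""
    digest = hashlib.sha256()
    with open(path, mode='rb') as file:
        # iter() with a sentinel keeps calling file.read() until it returns b''
        for chunk in iter(lambda: file.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()

# Self-contained demo: 16 KiB of sample bytes in a temporary file
payload = bytes(range(256)) * 64
fd, path = tempfile.mkstemp(suffix='.bin')
os.close(fd)
with open(path, mode='wb') as f:
    f.write(payload)

file_digest = sha256_of_file(path)
print(file_digest)

os.remove(path)
```

The digest matches what hashing the whole payload at once would produce, while memory use stays bounded by `chunk_size`.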

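5. Use multiple threads or processes to speed up file processing

For time-consuming file processing tasks, multiple threads or processes can speed things up. In CPython, threads suit I/O-bound work such as reading many files, while CPU-bound processing generally benefits more from processes.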

Code example:

```python
import concurrent.futures
import os  # needed for os.listdir() and os.path.join() below

def process_file(filepath):
    with open(filepath, mode='r', encoding='utf-8') as file:
        content = file.read()

    # Process the file content
    print(content)

directory = 'path/to/directory'
files = os.listdir(directory)

with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    executor.map(process_file, [os.path.join(directory, f) for f in files])
```

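Explanation:

- `concurrent.futures.ThreadPoolExecutor` creates a thread pool.
- `executor.map()` runs the `process_file` function concurrently over the file paths.
- `max_workers=4` sets the maximum number of threads to 4.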
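When the worker returns a value or may fail, it helps to collect results per file. The following sketch uses `executor.submit()` with `concurrent.futures.as_completed()`; the helper names `count_lines` and `count_lines_in_files`, and the choice to store the exception object for failed files, are assumptions made for illustration.

```python
import concurrent.futures
import os
import tempfile

def count_lines(filepath):
    """Return the number of lines in a UTF-8 text file."""
    with open(filepath, mode='r', encoding='utf-8') as file:
        return sum(1 for _ in file)

def count_lines_in_files(filepaths, max_workers=4):
    """Map each path to its line count, or to the OSError raised while reading it."""
    results = {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_path = {executor.submit(count_lines, p): p for p in filepaths}
        for future in concurrent.futures.as_completed(future_to_path):
            path = future_to_path[future]
            try:
                results[path] = future.result()
            except OSError as exc:
                results[path] = exc  # record the failure instead of losing it
    return results

# Self-contained demo with three temporary files of 1, 2 and 3 lines
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for n in range(3):
        p = os.path.join(tmp, f'sample_{n}.txt')
        with open(p, mode='w', encoding='utf-8') as f:
            f.write('line\n' * (n + 1))
        paths.append(p)
    line_counts = count_lines_in_files(paths)
    counts_in_order = [line_counts[p] for p in paths]

print(counts_in_order)
```

For CPU-heavy processing, `concurrent.futures.ProcessPoolExecutor` offers the same `submit()` interface, though the worker function and its arguments must then be picklable.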

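6. Use the `pickle` module for efficient serialization

For Python objects that need to be read and written frequently, the `pickle` module can serialize and deserialize them efficiently in a binary format. Only unpickle data from sources you trust, because loading a pickle can execute arbitrary code.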

Code example:

```python
import pickle

data = {'name': 'Alice', 'age': 30, 'city': 'New York'}

# Serialize the object and write it to a file
with open('data.pickle', 'wb') as file:
    pickle.dump(data, file)

# Read the file and deserialize the object
with open('data.pickle', 'rb') as file:
    loaded_data = pickle.load(file)

print(loaded_data)
```

Explanation:

- `pickle.dump(data, file)` serializes the object and writes it to the file.
- `pickle.load(file)` reads from the file and deserializes the object.
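The same round trip can be done in memory with `pickle.dumps()` and `pickle.loads()`, which is convenient for caches or for sending objects over a socket. The sketch below also passes `protocol=pickle.HIGHEST_PROTOCOL`; picking that protocol is an assumption made here for illustration, and it trades compatibility with older Python versions for a newer encoding.

```python
import pickle

data = {'name': 'Alice', 'age': 30, 'city': 'New York'}

# Serialize to a bytes object instead of a file
blob = pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL)

# Deserialize from bytes (only do this with data you trust)
restored = pickle.loads(blob)

print(type(blob).__name__, restored == data)
```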

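7. Use the `csv` module to handle CSV files efficiently

For CSV files, the `csv` module reads and writes data more efficiently and more reliably than splitting lines by hand, since it handles quoting and embedded delimiters.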

Code example:

```python
import csv

# Write a CSV file
data = [
    ['Name', 'Age', 'City'],
    ['Alice', 30, 'New York'],
    ['Bob', 25, 'Los Angeles']
]

with open('data.csv', mode='w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerows(data)

# Read the CSV file (newline='' lets the csv module handle line endings itself)
with open('data.csv', mode='r', newline='', encoding='utf-8') as file:
    reader = csv.reader(file)
    for row in reader:
        print(row)
```

Explanation:

- `csv.writer(file)` creates a CSV writer.
- `writer.writerows(data)` writes multiple rows at once.
- `csv.reader(file)` creates a CSV reader.
- The loop reads the data one row at a time.
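When the file has a header row, `csv.DictWriter` and `csv.DictReader` let each row be handled as a dictionary keyed by column name. This is a minimal sketch using a temporary file; the sample rows mirror the example above, and everything read back is a string because CSV carries no type information.

```python
import csv
import os
import tempfile

rows = [
    {'Name': 'Alice', 'Age': 30, 'City': 'New York'},
    {'Name': 'Bob', 'Age': 25, 'City': 'Los Angeles'},
]

fd, path = tempfile.mkstemp(suffix='.csv')
os.close(fd)

# Write the header and rows using column names
with open(path, mode='w', newline='', encoding='utf-8') as file:
    writer = csv.DictWriter(file, fieldnames=['Name', 'Age', 'City'])
    writer.writeheader()
    writer.writerows(rows)

# Read the rows back as dictionaries
with open(path, mode='r', newline='', encoding='utf-8') as file:
    loaded = list(csv.DictReader(file))

os.remove(path)
print(loaded[0]['Name'], loaded[0]['Age'])
```

Numeric columns such as `Age` need an explicit conversion, for example `int(row['Age'])`, after reading.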

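Case study: log file analysis

Suppose there is a large log file and we need to count how many times each error type appears in it. The techniques above can be combined to handle this task efficiently.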

Sample log file content:

```
[ERROR] - User Alice tried to access unauthorized resource.
[WARNING] - Disk space is running low.
[ERROR] - Database connection failed.
[INFO] - User Bob logged in successfully.
...
```

Code example:

```python
# Counter for each error type
error_counts = {}

# Buffer size
buffer_size = 4096

# Path to the log file
log_file_path = 'path/to/logfile.log'

# Safely open the file using the with statement
with open(log_file_path, mode='r', encoding='utf-8', buffering=buffer_size) as log_file:
    # Iterate line by line: reading fixed-size chunks and calling splitlines()
    # would cut lines at chunk boundaries and miscount them.
    for line in log_file:
        line = line.strip()
        if not line.startswith('['):
            continue  # skip blank or malformed lines

        # Extract the error type
        error_type = line.split(']')[0].strip('[')

        # Update the counter
        error_counts[error_type] = error_counts.get(error_type, 0) + 1

# Print the results
for error_type, count in error_counts.items():
    print(f"{error_type}: {count}")
```

Explanation:

- `buffer_size = 4096` sets the buffer size.
- `with open(log_file_path, mode='r', encoding='utf-8', buffering=buffer_size)` safely opens the file using the `with` statement.
- `for line in log_file:` reads one line at a time through the buffer, so no line is ever split across two reads.
- `error_type = line.split(']')[0].strip('[')` extracts the error type.
- `error_counts[error_type] = error_counts.get(error_type, 0) + 1` updates the counter.
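The counting step can also be expressed with `collections.Counter` and a regular expression. The sketch below is an alternative under stated assumptions: `count_levels` and `LEVEL_PATTERN` are hypothetical names, and the pattern assumes every entry starts with a bracketed word such as `[ERROR]`, as in the sample log.

```python
import re
from collections import Counter

# Matches a leading bracketed word, e.g. "[ERROR]" (assumed log format)
LEVEL_PATTERN = re.compile(r'^\[(\w+)\]')

def count_levels(lines):
    """Count entries per bracketed type across an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = LEVEL_PATTERN.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts

sample = [
    '[ERROR] - User Alice tried to access unauthorized resource.',
    '[WARNING] - Disk space is running low.',
    '[ERROR] - Database connection failed.',
    '[INFO] - User Bob logged in successfully.',
    '',
]
level_counts = count_levels(sample)
print(level_counts.most_common(1))
```

Since an open text file is itself an iterable of lines, the function can be called as `count_levels(log_file)` inside a `with open(...)` block without reading the whole file into memory.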

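Summary

This article covered several ways to improve file handling in Python: using the `with` statement, processing files in batches, setting the buffer size, using binary mode, speeding up work with multiple threads or processes, and using the `pickle` and `csv` modules. Applied where they fit, these methods can improve both the speed and the safety of file handling. The case study showed how to apply them to count the error types in a log file, reinforcing the material covered.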
