Working With Large CSV Files Using Python

Ankush kunwar
3 min read · Jan 9, 2023


There are several ways to work with large CSV files in Python, depending on your needs and the resources available to you. Here are a few options:

  1. Use a generator: read the file one line at a time rather than loading the entire file into memory at once. This is a good option if you don't need all of the data at once and want to keep memory usage low.
  2. Use the csv module: iterate over the rows with the csv.reader function, which handles delimiters and quoted fields correctly so you don't have to parse each line yourself.
  3. Use pandas: read the file into a pandas DataFrame with the pandas.read_csv function, which is convenient for analysis. The chunksize parameter lets you read the file in chunks rather than all at once.
  4. Use a database: if you need to store and query the data from the CSV file, consider loading it into a database. This can be more efficient than working with the data in a flat file, especially if you need to filter or aggregate the data in complex ways (a minimal sketch of this option follows this list).
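For the fourth option, here is a minimal sketch that loads a CSV into a local SQLite database using only the standard library. The database file name, the table name, and the all-TEXT column types are assumptions made for illustration, not a prescribed schema:

import csv
import sqlite3

def load_csv_into_sqlite(file_name, db_name="data.db", table="records"):
    # Open (or create) a local SQLite database file.
    conn = sqlite3.connect(db_name)
    with open(file_name, "r", newline="") as f:
        reader = csv.reader(f)
        header = next(reader)  # the first row holds the column names
        # Create a table whose columns match the CSV header (all TEXT for simplicity).
        columns = ", ".join(f'"{name}" TEXT' for name in header)
        conn.execute(f"CREATE TABLE IF NOT EXISTS {table} ({columns})")
        placeholders = ", ".join("?" for _ in header)
        # executemany consumes the reader row by row, so memory use stays bounded.
        conn.executemany(f"INSERT INTO {table} VALUES ({placeholders})", reader)
    conn.commit()
    conn.close()

load_csv_into_sqlite("large_file.csv")

Once the data is in the database, you can filter and aggregate it with ordinary SQL queries instead of scanning the whole file in Python.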

First Method

Working with a large CSV file using a generator

Using a generator can be a good way to work with large CSV files in Python, especially if you don’t need to access all of the data in the file at once. Generators allow you to work with the data in the file one line at a time, rather than reading the entire file into memory at once.

Here is an example of how you might use a generator to read a large CSV file in Python:

def read_csv(file_name):
    with open(file_name, "r") as f:
        for line in f:
            # Strip the trailing newline before splitting on commas.
            yield line.rstrip("\n").split(",")

for row in read_csv("large_file.csv"):
    process_row(row)

This code opens the file and iterates over its lines, yielding each line as a list of values split on the comma delimiter. You can then process each row as you iterate over the generator, working with the file one row at a time rather than reading it all into memory at once. Note that a naive split on commas does not handle quoted fields that contain commas themselves; the csv module used in the next method takes care of that for you.

Second Method

Using the csv module with a generator

You can also use the csv module to read CSV files more efficiently. Here is an example of how you might do that:

import csv

def read_csv(file_name):
    # newline="" lets the csv module handle line endings and quoted newlines correctly.
    with open(file_name, "r", newline="") as f:
        reader = csv.reader(f)
        for row in reader:
            yield row

for row in read_csv("large_file.csv"):
    process_row(row)

This code uses the csv.reader function to iterate over the rows in the file and yields each row as a list of values. This is more robust than reading the file line by line and splitting the values yourself, because csv.reader correctly handles quoted fields, embedded commas, and other CSV edge cases.
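If it is more convenient to look up fields by column name, csv.DictReader works the same way and yields each row as a dictionary keyed by the header row. A minimal sketch, where the "price" column is a hypothetical placeholder:

import csv

def read_csv_dicts(file_name):
    with open(file_name, "r", newline="") as f:
        # DictReader uses the header row as the keys for each row dictionary.
        for row in csv.DictReader(f):
            yield row

for row in read_csv_dicts("large_file.csv"):
    print(row["price"])  # "price" is a hypothetical column name, for illustration only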

Third Method

Using pandas with a generator

Pandas is a powerful library for working with large datasets in Python. One way to use pandas with large CSV files is to use the pandas.read_csv function in combination with a generator.

Here is an example of how you might use a generator with pandas.read_csv to process a large CSV file:

import pandas as pd

def read_csv(file_name):
    # chunksize makes read_csv yield DataFrames of 10,000 rows each.
    for chunk in pd.read_csv(file_name, chunksize=10000):
        yield chunk

for df in read_csv("large_file.csv"):
    process_dataframe(df)

This code will read the CSV file in chunks of 10000 rows at a time and yield each chunk as a pandas DataFrame. You can then process each DataFrame as you iterate over the generator. This allows you to work with the file one chunk at a time, rather than reading the entire file into memory at once.
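As a usage sketch, here is one way the per-chunk processing might look if you want to keep a running aggregate across chunks; the "amount" column name is a hypothetical placeholder:

import pandas as pd

total_rows = 0
total_amount = 0.0

# Only one 10,000-row chunk is held in memory at any time.
for chunk in pd.read_csv("large_file.csv", chunksize=10000):
    total_rows += len(chunk)
    total_amount += chunk["amount"].sum()  # "amount" is a hypothetical column

print(total_rows, total_amount)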

You can also call pandas.read_csv with the iterator parameter set to True, which returns a TextFileReader object that lets you pull a chosen number of rows at a time with its get_chunk method:

import pandas as pd

def read_csv(file_name, rows_per_chunk=1000):
    reader = pd.read_csv(file_name, iterator=True)
    while True:
        try:
            yield reader.get_chunk(rows_per_chunk)  # the next batch of rows
        except StopIteration:
            break

for chunk in read_csv("large_file.csv"):
    process_dataframe(chunk)

Here, iterator=True makes pandas.read_csv return a TextFileReader instead of a full DataFrame, and each call to get_chunk reads the next batch of rows from the file. This gives you explicit control over how many rows are loaded at a time, while still avoiding reading the entire file into memory at once.

Thank you for reading!

If you enjoyed this article and would like to Buy Me a Coffee, please click here.

You can connect with me on LinkedIn.
