Simple Python profiling
For a small side project I’m working on with a couple of friends, we need to parse large amounts of data from the Finnish Parliament open data repository. While it’s great to be able to access all the data, such as votes, proposals, meeting minutes and so on, the format of the data often leaves quite a lot to be desired. In particular, one of the available datasets (named Vaski) is a table with a single significant column: raw XML strings of documents of over 70 different types. In total, the dataset weighs nearly 7 gigabytes, but at this point we are only interested in some of the document types.
While developing pipelines for extracting interesting data, parsing through the
whole dataset exported as a .tsv file is quite cumbersome, especially when
trying to troubleshoot any errors in the extraction processes. We therefore
wanted to split the huge file into subfiles containing only a single document
type each. The initial script for this splitting worked well, but with little
consideration for performance, it took around 3 minutes to run. That’s not a
long time for a preprocessing task that isn’t supposed to be run frequently, but
as I wanted to add pull request automation that would run it way more often, the
runtime started to annoy me.
To start optimizing a piece of code like this, it’s important to first identify
where the actual bottlenecks are. Python ships with a built-in profiler,
cProfile, but its output is
quite low level and verbose. As a more high-level alternative,
line_profiler seemed to suit my
needs much better. The workflow is simple: install the package, add a
@profile decorator to the function you’re interested in, and run the script
with kernprof -l -v script_name.py.
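As a minimal sketch of that setup (with a hypothetical toy function standing in for my parser):

# pip install line_profiler
# kernprof injects a `profile` decorator into builtins when the script is run
# with: kernprof -l -v toy_script.py
# (-l enables line-by-line profiling, -v prints the results after the run)

@profile  # no import needed; kernprof provides this at runtime
def toy_function():
    total = 0
    for i in range(1_000_000):
        total += i * i  # per-line hit counts and timings get recorded here
    return total

if __name__ == "__main__":
    toy_function()

Doing the same to my parser function and running it through kernprof, I got the following output: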
Timer unit: 1e-06 s
Total time: 185.304 s
File: pipes/vaski_parser.py
Function: vaski_parser at line 9
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     9                                           @profile
    10                                           def vaski_parser():
    11         1         81.1     81.1      0.0      pd.options.mode.copy_on_write = True
    12
    13         1         15.5     15.5      0.0      vaski_fname = os.path.join("data", "raw", "VaskiData.tsv")
    14
    15        32   73644115.8  2.3e+06     39.7      for vaski_data in tqdm(pd.read_csv(vaski_fname, sep="\t", chunksize=10000), total=31):
    16        62   59057615.8 952542.2     31.9          vaski_data['Xml'] = vaski_data["XmlData"].apply(
    17        31         37.9      1.2      0.0              lambda x: etree.fromstring(x)
    18                                                   )
    19        62      57218.3    922.9      0.0          vaski_data = vaski_data[  # Only keep finnish files
    20        62     471448.0   7604.0      0.3              vaski_data['Xml'].apply(
    21        31         28.8      0.9      0.0                  lambda x: x[0][0].text.endswith('_fi')
    22                                                       )
    23                                                   ]
    24        62     341058.9   5500.9      0.2          vaski_data['doctype'] = vaski_data['Xml'].apply(
    25        31         28.2      0.9      0.0              lambda x: x[0][0].text[13:]  # All doctypes start with VASKI_JULKVP_
    26                                                   )
    27
    28      1423      17664.2     12.4      0.0          for doctype in vaski_data['doctype'].unique():
    29      2784    3080048.6   1106.3      1.7              sub_vaski = vaski_data[vaski_data['doctype'] == doctype].drop(
    30      1392       1062.6      0.8      0.0                  columns=["doctype", 'Xml']
    31                                                       )
    32      1392      20919.4     15.0      0.0              f_path = os.path.join("data", "raw", "vaski", f"{doctype}.tsv")
    33      1392      21747.1     15.6      0.0              if not os.path.exists(f_path):  # If no file, write header
    34                                                           sub_vaski.to_csv(f_path, sep="\t", header=True, index=False)
    35                                                       else:  # If file, no header and append
    36      1392   48591207.5  34907.5     26.2                  sub_vaski.to_csv(f_path, sep="\t", header=False, index=False, mode='a')
The simplest way to read this output is to look at the % Time column and see
which lines proportionally take up the most of the script’s execution time.
Our culprits, lines 15, 16 and 36, account for nearly 98% of the execution
time of the whole script.
The most obvious place for improvement is line 16, where we map over a
pandas.DataFrame
column, converting its string contents into an
XML tree structure. Now that we know what data
we actually want from the XML, namely the language and the doctype, which luckily
appear at the same spot in every document, we can avoid parsing the whole tree
and instead just match a
regex at the right location.
This also has the added benefit of making the code much more succinct.
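In pandas terms, the change boils down to something like this (a sketch based on the profiled lines; it assumes import pandas as pd, from lxml import etree, and vaski_data being one chunk from read_csv):

# Before: build a full XML tree for every row, just to read one nested text field
vaski_data["Xml"] = vaski_data["XmlData"].apply(lambda x: etree.fromstring(x))
vaski_data = vaski_data[vaski_data["Xml"].apply(lambda x: x[0][0].text.endswith("_fi"))]
vaski_data["doctype"] = vaski_data["Xml"].apply(lambda x: x[0][0].text[13:])

# After: pull the doctype straight out of the raw string with one vectorized regex
vaski_data["doctype"] = vaski_data["XmlData"].str.extract(r"(VASKI_JULKVP_\w+)")
vaski_data = vaski_data[vaski_data["doctype"].str.endswith("_fi")]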
After the change, we run the profiler again (this time, for brevity, I’ve
omitted all columns from the output except Line #, % Time and
Line Contents):
Timer unit: 1e-06 s
Total time: 110.598 s
File: pipes/vaski_parser.py
Function: vaski_parser at line 9
Line #  % Time  Line Contents
=============================
     9          @profile
    10          def vaski_parser():
    11     0.0      pd.options.mode.copy_on_write = True
    12     0.0      vaski_fname = os.path.join("data", "raw", "VaskiData.tsv")
    14    51.3      for vaski_data in tqdm(pd.read_csv(vaski_fname, sep="\t", chunksize=10000), total=31):
    16     0.9          vaski_data['doctype'] = vaski_data["XmlData"].str.extract(r"(VASKI_JULKVP_\w+)")
    17     0.3          vaski_data = vaski_data[vaski_data['doctype'].str.endswith('_fi')]
    18     0.0          for doctype in vaski_data['doctype'].unique():
    20     2.8              sub_vaski = vaski_data[vaski_data['doctype'] == doctype].drop(columns=["doctype"])
    21     0.0              f_path = os.path.join("data", "raw", "vaski", f"{doctype}.tsv")
    22     0.0              if not os.path.exists(f_path):
    23     2.7                  sub_vaski.to_csv(f_path, sep="\t", header=True, index=False)
    24                     else:
    25    41.9                  sub_vaski.to_csv(f_path, sep="\t", header=False, index=False, mode='a')
Awesome, we can now run the whole thing in 110 seconds instead of 180!
Now we see that almost all of the runtime is taken up by
pandas.read_csv()
and
DataFrame.to_csv().
There isn’t much room to tweak these functions; I tried a couple of different
values for the chunksize argument of read_csv without seeing any significant
difference. I do, however, know that 7 gigabytes isn’t actually that much
data, especially considering that the
SSD in my desktop PC
should be capable of a theoretical maximum throughput of over 2 GB/s for both
reads and writes.
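As a quick back-of-the-envelope check: reading 7 GB once at 2 GB/s takes about
3.5 seconds, and writing roughly the same amount back out takes another ~3.5
seconds, so the hardware floor for this task should sit somewhere around 7
seconds. At 110 seconds, we’re more than an order of magnitude away from it.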
With not much left to do in pandas, I wanted to try the new kid on the block,
Polars, which
claims performance improvements of up to 50x over pandas.
I had heard much praise for it, so I wanted to put those claims to the test.
The DataFrame APIs of the two libraries are quite similar, so porting the code
over was quite simple.
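Roughly, the translation looks like this (an illustrative sketch of the equivalent calls rather than my exact script; note the header flag is computed before open(), since open() itself creates a missing file):

import os

import polars as pl

vaski_fname = os.path.join("data", "raw", "VaskiData.tsv")

# pandas' chunksize has no direct equivalent: polars reads the file eagerly,
# and we then iterate over fixed-size slices of the resulting DataFrame
for chunk in pl.read_csv(vaski_fname, separator="\t").iter_slices(n_rows=10_000):
    # pandas: chunk["XmlData"].str.extract(r"(VASKI_JULKVP_\w+)")
    chunk = chunk.with_columns(
        pl.col("XmlData").str.extract(r"VASKI_JULKVP_(\w+)").alias("doctype")
    )
    # pandas: chunk[chunk["doctype"].str.endswith("_fi")]
    chunk = chunk.filter(pl.col("doctype").str.ends_with("_fi"))
    # pandas: loop over unique() and boolean-filter; polars groups directly
    for (doctype,), sub in chunk.group_by("doctype"):
        f_path = os.path.join("data", "raw", "vaski", f"{doctype}.tsv")
        write_header = not os.path.exists(f_path)  # decide before open() creates the file
        with open(f_path, mode="a") as f:  # "a" also creates the file if missing
            sub.drop("doctype").write_csv(f, separator="\t", include_header=write_header)

Profiling the ported version gave the following: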
Wrote profile results to 'vaski_parser.py.lprof'
Timer unit: 1e-06 s
Total time: 6.87593 s
File: pipes/vaski_parser.py
Function: vaski_parser at line 10
Line #  % Time  Line Contents
=============================
    10          @profile
    11          def vaski_parser():
    12     0.0      vaski_fname = os.path.join("data", "raw", "VaskiData.tsv")
    13
    14    29.6      for vaski_data in tqdm(pl.read_csv(vaski_fname, separator="\t", has_header=True).iter_slices(n_rows=10_000), total=31):
    15     1.5          vaski_data = vaski_data.with_columns(
    16     0.0              pl.col("XmlData").str.extract(r"VASKI_JULKVP_(\w+)").alias("doctype")
    17                 )
    18     0.3          vaski_data = vaski_data.filter(pl.col("doctype").str.ends_with("_fi"))
    19
    20     4.3          for (doctype,), sub_vaski in vaski_data.group_by("doctype"):
    21     0.4              f_path = os.path.join("data", "raw", "vaski", f"{doctype}.tsv")
    22     5.5              sub_vaski = sub_vaski.drop("doctype")
    23     1.1              with open(f_path, mode="a" if os.path.exists(f_path) else "w") as f:
    24    57.1                  sub_vaski.write_csv(f, separator="\t", include_header=not os.path.exists(f_path))
Hot damn, it really is fast! Not quite the advertised 50x, but dropping from 110 seconds to just 7 (almost 16x) is a remarkable improvement for such a small effort. It also puts the execution time right around the back-of-the-envelope I/O floor of my hardware, so at this point I don’t see a reason to try to optimize further.
Overall, line_profiler worked
great for me in quickly identifying the bottlenecks in the code, and the
surprise(?) winner is surely polars, which parses and writes .{c,t}sv files
so much faster than the incumbent pandas.