Reading Large Files With xmltodict

We recently ran into an issue reading large XML files while syncing information between a third-party CMS and our system. xmltodict is a Python library that makes it easy to parse XML files and, according to its documentation, makes it "feel like you are working with JSON".
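
As a rough illustration of that (a tiny made-up document, not the CMS feed), parsing returns nested dictionaries, with element attributes prefixed with '@':

import xmltodict

doc = xmltodict.parse("""
<catalog>
  <item id="1">
    <name>Widget</name>
  </item>
</catalog>
""")

# Elements become nested dictionary keys; attributes get an '@' prefix
print(doc['catalog']['item']['@id'])   # '1'
print(doc['catalog']['item']['name'])  # 'Widget'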

Downloading the File

First we need to download the file we're going to sync. Since it's a large file we don't want to hold it all in memory, so we use requests' streaming mode here as well. [1]

import requests

with requests.get(url, stream=True) as r:
    try:
        r.raise_for_status()
        with open('/tmp/largefile.xml', 'wb') as f:
            # Write the response to disk in chunks instead of loading it all into memory
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)
    except requests.exceptions.HTTPError as e:
        print(e)

Using xmltodict Streaming Mode

The streaming mode in xmltodict works by providing a callback function that will be called every time the parser encounters an item at the specified depth. Make sure you read the file in bytes mode or else it will throw the following error: read() did not return a bytes object (type=str).

import xmltodict

def handle_item(_, item):
    print(item)
    return True  # the callback must return a truthy value or xmltodict stops parsing

with open('/tmp/largefile.xml', 'rb') as f:
    xmltodict.parse(f, item_depth=2, encoding='utf-8', item_callback=handle_item)
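
To make the callback arguments concrete, here is a small sketch with a made-up document (not the real feed): the first argument is the path of (tag, attributes) pairs from the root down to the current element, and the second is the parsed item itself.

import xmltodict

sample = b"""
<artists>
  <artist><name>Nina Simone</name></artist>
  <artist><name>Miles Davis</name></artist>
</artists>
"""

def print_artist(path, artist):
    # path is roughly [('artists', None), ('artist', None)]
    # artist is roughly {'name': 'Nina Simone'}
    print(path, artist)
    return True

# item_depth=2 means each <artist> element is handed to the callback
xmltodict.parse(sample, item_depth=2, item_callback=print_artist)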

I also found out xmltodict will throw not well-formed (invalid token): line 1, column 8 when passing just the filename, e.g. xmltodict.parse('/tmp/largefile.xml'). That's because parse() expects XML content or a file-like object, so it tries to parse the path string itself as XML. Those were the errors I encountered, since the example in the documentation uses a GzipFile instance [2]; other than that it's a great library.

from gzip import GzipFile

xmltodict.parse(GzipFile('discogs_artists.xml.gz'), item_depth=2, item_callback=handle_artist)
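
Putting the two steps together, the whole sync might look something like the sketch below; sync_item is a placeholder for whatever your system does with each record, and the item depth and temp path are assumptions.

import requests
import xmltodict

def sync_item(_, item):
    # placeholder: push the record into our own system here
    print(item)
    return True

def sync(url):
    # stream the export to disk first...
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        with open('/tmp/largefile.xml', 'wb') as f:
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)

    # ...then stream-parse it one item at a time
    with open('/tmp/largefile.xml', 'rb') as f:
        xmltodict.parse(f, item_depth=2, item_callback=sync_item)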

References

[1] https://stackoverflow.com/a/16696317

[2] https://github.com/martinblech/xmltodict#streaming-mode