Reading Large Files With xmltodict

We recently ran into an issue reading large XML files while syncing information between a third-party CMS and our system. xmltodict is a Python library that makes it easy to parse XML files and, according to its documentation, makes it "feel like you are working with JSON".

Downloading the File

First we need to download the file that we're going to be syncing. Since it is a large file we don't want to hold it all in memory, so we'll use requests' streaming mode for the download as well. [1]

import requests

try:
    with requests.get(url, stream=True) as r:
        r.raise_for_status()  # surface HTTP errors instead of writing an error page to disk
        with open('/tmp/largefile.xml', 'wb') as f:
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)
except requests.exceptions.HTTPError as e:
    # handle the failed download here
    raise

Using xmltodict Streaming Mode

The streaming mode in xmltodict works by providing a callback function that will be called every time it encounters an item at the specified depth. Make sure you open the file in bytes mode, or it will throw the following error: read() did not return a bytes object (type=str)

import xmltodict

def handle_item(_, item):
    # process the item here
    return True  # the callback must return a truthy value to keep parsing

with open('/tmp/largefile.xml', 'rb') as f:
    xmltodict.parse(f, item_depth=2, encoding='utf-8', item_callback=handle_item)
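One detail worth noting about the callback contract: if the callback returns a falsy value, xmltodict raises ParsingInterrupted, which also gives you a way to stop parsing early. A minimal sketch against an in-memory document (the XML content and names here are made up for illustration):

```python
import io
import xmltodict

XML = b"""<artists>
  <artist><name>A</name></artist>
  <artist><name>B</name></artist>
  <artist><name>C</name></artist>
</artists>"""

collected = []

def handle_artist(path, item):
    # path is the list of (tag, attrs) pairs leading to this item;
    # item is the dict built from the element at item_depth
    collected.append(item['name'])
    return len(collected) < 2  # returning False stops the parse

try:
    xmltodict.parse(io.BytesIO(XML), item_depth=2, item_callback=handle_artist)
except xmltodict.ParsingInterrupted:
    pass  # expected once the callback returns False

print(collected)  # ['A', 'B']
```

Since each item is discarded after the callback returns, memory usage stays flat no matter how large the file is.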

I also found out xmltodict will throw not well-formed (invalid token): line 1, column 8 when passing just the filename, e.g. xmltodict.parse('/tmp/largefile.xml'): parse() expects the XML content itself (a string, bytes, or a file-like object), so the path string gets parsed as XML. Those are the errors I encountered, since the examples in their documentation use a GzipFile instance [2]; other than that it's a great library.

from gzip import GzipFile

xmltodict.parse(GzipFile('discogs_artists.xml.gz'), item_depth=2, item_callback=handle_artist)
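To make the filename pitfall above concrete, here's a small sketch (the sample XML is made up): passing a path string makes expat try to parse the path itself, while passing the content or an open file works.

```python
import xmltodict
from xml.parsers.expat import ExpatError

# A bare path string is treated as XML content, so expat rejects it.
try:
    xmltodict.parse('/tmp/largefile.xml')
except ExpatError as e:
    print('failed as expected:', e)

# Passing the XML content itself (bytes here) parses fine.
doc = xmltodict.parse(b'<root><item>ok</item></root>')
print(doc['root']['item'])  # ok
```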