Reading Large Files With xmltodict
We recently encountered an issue reading large XML files in order to sync information between a 3rd party CMS and our system.
xmltodict is a Python library that makes it easy to parse XML files and which, according to its documentation, makes you "feel like you are working with JSON".
Downloading the File
First we need to download the file that we're going to be syncing. Since this is a large file, we don't want to hold it all in memory, so we use requests' streaming mode and write it to disk in chunks.
try:
    with requests.get(url, stream=True) as r:
        r.raise_for_status()  # raises HTTPError on 4xx/5xx responses
        with open('/tmp/largefile.xml', 'wb') as f:
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)
except requests.exceptions.HTTPError as e:
    print(e)
Using xmltodict Streaming Mode
The streaming mode in xmltodict works by providing a callback function that will be called every time the parser encounters an item at the specified depth. Make sure you open the file in bytes mode, or else it will throw the following error:
read() did not return a bytes object (type=str)
import xmltodict

def handle_item(_, item):
    # process the item here, then return True to keep parsing
    return True

with open('/tmp/largefile.xml', 'rb') as f:
    xmltodict.parse(f, item_depth=2, encoding='utf-8', item_callback=handle_item)
I also found out that xmltodict throws

not well-formed (invalid token): line 1, column 8

when passing just the filename, e.g. xmltodict.parse('/tmp/largefile.xml'). This is because parse() treats any string it receives as the XML content itself, not as a path, so it chokes on the path text. Those are the errors I encountered, since the examples in the documentation use a GzipFile instance; other than that, it's a great library.
from gzip import GzipFile
xmltodict.parse(GzipFile('discogs_artists.xml.gz'), item_depth=2, item_callback=handle_artist)
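Putting the pieces together, here is a self-contained sketch of that gzip pattern; the gzipped XML is built in memory to stand in for the real dump, and the structure and names are made up for illustration:

```python
import gzip
import io
import xmltodict

# Build a small gzipped XML in memory (illustrative stand-in for the dump).
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode='wb') as gz:
    gz.write(b'<artists>'
             b'<artist><name>A</name></artist>'
             b'<artist><name>B</name></artist>'
             b'</artists>')
buf.seek(0)

names = []
def handle_artist(_, artist):
    names.append(artist['name'])
    return True  # return True to keep streaming; a falsy value stops the parse

# GzipFile yields bytes when read, which is exactly what the parser needs.
xmltodict.parse(gzip.GzipFile(fileobj=buf), item_depth=2,
                item_callback=handle_artist)
print(names)  # ['A', 'B']
```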