Reading large XML files

Have you ever had to read and deserialize a large XML file? Say, 500 MB: too large to simply read to the end and parse in one go. I faced this problem with an XML file that looks like the following:

<?xml version="1.0" encoding="UTF-8"?>
<record id="0" Text="Some text for record 0" />
<record id="1" Text="Some text for record 1" />
<record id="2" Text="Some text for record 2" />
<record id="3" Text="Some text for record 3" />
...
<record id="1000000" Text="Some text for record 1000000" />

We can’t just read the whole file into a string and pass it to the deserializer, because it is too large. Moreover, the file has no root element, which breaks the deserializer.

Additionally, we probably don’t want to load all the parsed elements into memory; we need to produce an IEnumerable so that the elements can be processed one by one.

And lastly, it would be great to have a generic version of the code :)

First, let’s define the record class:

using System.Xml.Serialization;

[XmlRoot(ElementName = "record")]
public class XmlRecord
{
    [XmlAttribute("id")]
    public int Id { get; set; }

    [XmlAttribute("Text")]
    public string Text { get; set; }
}

Make sure the class is marked with XmlRoot, because we are going to read the file element by element and treat each record as a whole XML document.
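To illustrate what "treating each record as a whole XML document" means, here is a minimal sketch that deserializes a single record element from a string. The `SingleRecordDemo` class and its `Parse` method are hypothetical names used only for this illustration; the record class is repeated so the snippet stands on its own.

```csharp
using System;
using System.IO;
using System.Xml.Serialization;

[XmlRoot(ElementName = "record")]
public class XmlRecord
{
    [XmlAttribute("id")]
    public int Id { get; set; }

    [XmlAttribute("Text")]
    public string Text { get; set; }
}

// Hypothetical helper, for illustration only.
public static class SingleRecordDemo
{
    // Deserializes one <record ... /> element passed as a complete XML string.
    public static XmlRecord Parse(string xml)
    {
        var serializer = new XmlSerializer(typeof(XmlRecord));
        using (var reader = new StringReader(xml))
        {
            // The root element name must match the XmlRoot attribute ("record").
            return (XmlRecord)serializer.Deserialize(reader);
        }
    }
}
```

Calling `SingleRecordDemo.Parse("<record id=\"7\" Text=\"Some text for record 7\" />")` should give an object with `Id` equal to 7 and the matching `Text`.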

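To read the file fragment by fragment, we can use XmlReader with the appropriate settings. ConformanceLevel.Fragment lets the reader accept input that has several top-level elements instead of a single root: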

var settings = new XmlReaderSettings
               {
                   ConformanceLevel = ConformanceLevel.Fragment,
                   IgnoreWhitespace = true
               };

using (var reader = XmlReader.Create(xmlFilePath, settings))
{
    // ...
}

We can start by moving the cursor to the beginning of the content and then read the elements one by one until the end of the file:

reader.MoveToContent();
while (!reader.EOF)
{
    // reader.ReadSubtree();
    // ...
}
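The cursor movement can be tried on a small in-memory fragment before touching a real file. The sketch below is only an illustration under that assumption: `FragmentWalkDemo` and `ReadIds` are hypothetical names, and it collects the `id` attribute of each top-level element instead of deserializing anything.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;

// Hypothetical helper, for illustration only.
public static class FragmentWalkDemo
{
    // Returns the "id" attribute of every top-level element in a rootless fragment.
    public static List<string> ReadIds(string xmlFragment)
    {
        var settings = new XmlReaderSettings
        {
            ConformanceLevel = ConformanceLevel.Fragment,
            IgnoreWhitespace = true
        };

        var ids = new List<string>();
        using (var reader = XmlReader.Create(new StringReader(xmlFragment), settings))
        {
            reader.MoveToContent();
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element)
                {
                    ids.Add(reader.GetAttribute("id"));
                    reader.Skip(); // move past this element and all of its children
                }
                else
                {
                    reader.Read(); // step over comments and other non-element nodes
                }
            }
        }
        return ids;
    }
}
```

Skip() plays the role that ReadSubtree() followed by Read() plays in the loop above: it consumes the current element and leaves the cursor on whatever comes next.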

Let’s put everything together:

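// Required namespaces:
using System.Collections.Generic;
using System.Xml;
using System.Xml.Linq;
using System.Xml.Serialization;

public static IEnumerable<T> ReadLargeXml<T>(string xmlFilePath)
{
    var settings = new XmlReaderSettings
                   {
                       // Fragment mode accepts several top-level elements (no single root).
                       ConformanceLevel = ConformanceLevel.Fragment,
                       IgnoreWhitespace = true
                   };

    using (var reader = XmlReader.Create(xmlFilePath, settings))
    {
        var serializer = new XmlSerializer(typeof(T));

        // Skip the XML declaration, if any, and stop at the first element.
        reader.MoveToContent();
        while (!reader.EOF)
        {
            // Load only the current element; ReadSubtree requires the reader
            // to be positioned on an element node.
            var element = XElement.Load(reader.ReadSubtree());
            var record = (T)serializer.Deserialize(element.CreateReader());
            yield return record;

            // Advance to the next top-level node.
            reader.Read();
        }
    }
}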

This method is generic. It returns a lazy enumeration of elements, so we can process them one by one. The input XML file stays open until all elements have been read or the enumerator is disposed. The method also correctly handles the case where the file is empty or contains only the XML declaration.
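One limitation worth noting: ReadSubtree() throws if the reader is not positioned on an element, so a comment between two records would stop the method above. The following is a possible variant, not the method from this post, that steps over non-element nodes and deserializes straight from the subtree reader without the intermediate XElement. The names `TolerantXmlReader` and `ReadRecords` are hypothetical, and the variant takes a TextReader instead of a file path so it can be tried on a string.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Serialization;

[XmlRoot(ElementName = "record")]
public class XmlRecord
{
    [XmlAttribute("id")]
    public int Id { get; set; }

    [XmlAttribute("Text")]
    public string Text { get; set; }
}

// Hypothetical variant, for illustration only.
public static class TolerantXmlReader
{
    public static IEnumerable<T> ReadRecords<T>(TextReader input)
    {
        var settings = new XmlReaderSettings
        {
            ConformanceLevel = ConformanceLevel.Fragment,
            IgnoreWhitespace = true
        };
        var serializer = new XmlSerializer(typeof(T));

        using (var reader = XmlReader.Create(input, settings))
        {
            reader.MoveToContent();
            while (!reader.EOF)
            {
                if (reader.NodeType != XmlNodeType.Element)
                {
                    reader.Read(); // step over comments and other non-element nodes
                    continue;
                }

                T record;
                using (var subtree = reader.ReadSubtree())
                {
                    // The subtree reader exposes only the current element.
                    record = (T)serializer.Deserialize(subtree);
                }
                yield return record;

                reader.Read(); // advance past the element that was just consumed
            }
        }
    }
}
```

Either version is consumed the same way, for example `foreach (var record in TolerantXmlReader.ReadRecords<XmlRecord>(File.OpenText(path))) { ... }`, where `path` is whatever file you want to read. Because the enumeration is lazy, nothing is opened or parsed until the loop starts, and breaking out of the loop early disposes the reader.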
