Example using IRS MeF XML Files (In demo Directory)

This example demonstrates loading a sample IRS Modernized e-File (MeF) tax return using a Joost STX transformation. The data is a single, complex XML file.

The U.S. Internal Revenue Service (IRS) made a significant commitment to XML and specifies its use in its Modernized e-File (MeF) system. In MeF, each tax return is an XML document with a deep hierarchical structure that closely reflects the particular form of the underlying tax code.

XML, XML Schema, and stylesheets all play a role in the IRS's data representation and business workflow. The actual XML data is extracted from a ZIP file attached to a MIME “transmission file” message. For more information about MeF, see Modernized e-File (Overview) on the IRS web site.

The sample XML document, RET990EZ_2006.xml, is about 350 KB in size and contains two elements:

  • ReturnHeader
  • ReturnData

The <ReturnHeader> element contains general details about the tax return such as the taxpayer’s name, the tax year of the return, and the preparer. The <ReturnData> element contains multiple sections with specific details about the tax return and associated schedules.

The following is an abridged sample of the XML file.

  <?xml version="1.0" encoding="UTF-8"?>
  <Return returnVersion="2006v2.0"
      xmlns="http://www.irs.gov/efile"
      xmlns:efile="http://www.irs.gov/efile"
      xsi:schemaLocation="http://www.irs.gov/efile"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <ReturnHeader binaryAttachmentCount="1">
      <ReturnId>AAAAAAAAAAAAAAAAAAAA</ReturnId>
      <Timestamp>1999-05-30T12:01:01+05:01</Timestamp>
      <ReturnType>990EZ</ReturnType>
      <TaxPeriodBeginDate>2005-01-01</TaxPeriodBeginDate>
      <TaxPeriodEndDate>2005-12-31</TaxPeriodEndDate>
      <Filer>
        <EIN>011248772</EIN>
        ... more data ...
      </Filer>
      <Preparer>
        <Name>Percy Polar</Name>
        ... more data ...
      </Preparer>
      <TaxYear>2005</TaxYear>
    </ReturnHeader>
    ... more data ...
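
To make the document structure concrete, the following is a minimal Python sketch that reads the header fields from a cut-down return. It is only an illustration and is not part of the demo: the SAMPLE string, the summarize function, and the local_name helper are hypothetical names, and the sketch assumes that ReturnHeader and ReturnData are direct children of the Return element and that all elements live in the http://www.irs.gov/efile namespace shown above.

```python
import xml.etree.ElementTree as ET

# Default namespace declared on the <Return> element in the sample.
EFILE_NS = "http://www.irs.gov/efile"

# Hypothetical, heavily trimmed stand-in for RET990EZ_2006.xml.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<Return returnVersion="2006v2.0"
    xmlns="http://www.irs.gov/efile"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <ReturnHeader binaryAttachmentCount="1">
    <ReturnId>AAAAAAAAAAAAAAAAAAAA</ReturnId>
    <ReturnType>990EZ</ReturnType>
    <Filer>
      <EIN>011248772</EIN>
    </Filer>
    <TaxYear>2005</TaxYear>
  </ReturnHeader>
  <ReturnData/>
</Return>
"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.split("}", 1)[-1]


def summarize(xml_text):
    """Return (names of the root's children, leaf fields of ReturnHeader)."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    sections = [local_name(child.tag) for child in root]
    header = root.find(f"{{{EFILE_NS}}}ReturnHeader")
    if header is None:
        raise ValueError("document has no ReturnHeader element")
    # Keep only simple leaf elements; nested ones such as Filer are skipped.
    fields = {
        local_name(el.tag): (el.text or "").strip()
        for el in header
        if len(el) == 0
    }
    return sections, fields


sections, fields = summarize(SAMPLE)
print(sections)
print(fields["ReturnId"], fields["ReturnType"], fields["TaxYear"])
```

Because the sample declares a default namespace, every lookup has to be namespace-qualified; an unqualified search for ReturnHeader would find nothing.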

The goal is to import all the data into a HAWQ database. First, convert the XML document into a single line of delimited text with newlines “escaped”. The line has two columns: the ReturnId, followed by a final column that holds the entire MeF tax return. For example:

  AAAAAAAAAAAAAAAAAAAA|<Return returnVersion="2006v2.0"...
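
The shape of that conversion can be sketched in Python. This is a stand-in for illustration only, not the Joost STX transformation that the demo uses. The to_row function and the SAMPLE string are hypothetical, and the sketch assumes backslash escaping with a pipe delimiter; whatever convention is used must match the delimiter and escape settings of the table that receives the data.

```python
import xml.etree.ElementTree as ET

# Default namespace declared on the <Return> element in the sample.
EFILE_NS = "http://www.irs.gov/efile"

# Hypothetical, heavily trimmed stand-in for RET990EZ_2006.xml.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<Return returnVersion="2006v2.0"
    xmlns="http://www.irs.gov/efile">
  <ReturnHeader binaryAttachmentCount="1">
    <ReturnId>AAAAAAAAAAAAAAAAAAAA</ReturnId>
    <Preparer>
      <Name>Percy Polar</Name>
    </Preparer>
  </ReturnHeader>
  <ReturnData/>
</Return>
"""


def to_row(xml_text, delimiter="|"):
    """Turn one return into a single 'ReturnId|escaped-document' line."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    return_id = root.findtext(
        f"{{{EFILE_NS}}}ReturnHeader/{{{EFILE_NS}}}ReturnId"
    )
    if return_id is None:
        raise ValueError("document has no ReturnHeader/ReturnId element")

    # Drop the XML declaration so the second column starts at <Return ...>.
    body = xml_text
    if body.lstrip().startswith("<?xml"):
        body = body.split("?>", 1)[1]
    body = body.strip()

    # Assumed convention: escape the backslash first, then line breaks
    # and the column delimiter, so the whole document fits on one line.
    escaped = (
        body.replace("\\", "\\\\")
        .replace("\r", "\\r")
        .replace("\n", "\\n")
        .replace(delimiter, "\\" + delimiter)
    )
    return f"{return_id}{delimiter}{escaped}"


row = to_row(SAMPLE)
print(row[:60])
```

Escaping the backslash before anything else keeps the output unambiguous: a literal backslash in the source can never be mistaken for the start of an escape sequence added by a later step.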

Load the data into HAWQ.