Help/Opinions on Parsing OFX Files?

innerweb asks: "I am looking for help and advice on using and parsing the OFX (Open Financial Exchange) file spec using C/C++ and/or Perl. I have read the standards, downloaded the DTD (ofx version 2), and tried to parse several files from different banks. They have all failed in my normal parsers (commercial and OSS), yet they load fine in Microsoft Money. It is not so complicated that I can not hand roll my own, and I have much of it working that way as a proof, but I would rather stick with something that is standards based, as this is a standard that in my opinion ought to work with standards based tools. Am I missing something here, or is this truly a file format that is broken as a feature?"
"I know the files are malformed when they come down, as they are missing the normal XML and SGML file headers ?XML or !DOCTYPE to define the dtd to use to parse the file. I know that the document is not 'well formed' as I understand it, as most of the tags in the datafile are not closed (open tag, but no corresponding closing tag). When I fix these errors, the files seem to parse. Yet, from what I have seen, MS Money takes in the same raw data and parses it. Microsoft lists the OFX file format as XML in some places and SGML in others. The OFX website seems to be saying this is SGML, not XML (XML is a subset of SGML in most cases, but the way it is *used* sometimes it is not really SGML at all.)

I have been reading like mad for a few weeks on OFX format files and usage, but not getting much useful information. I have worked with SGML in the past and XML, so I am at least familiar with these *conventions*. I need to be pointed in the right direction, and/or told what I am doing wrong or overlooking. I know it is probably something obvious, but somehow I am not getting it.

Thanks in advance for any help that you can throw my way."

  • What makes you think that CheckFree, Intuit and Microsoft would make it easy for a bunch of OSS developers to work with "their" standard?
  • by ciroknight ( 601098 ) on Wednesday February 02, 2005 @05:58PM (#11555411)
    Fix the XML, then parse. Microsoft's parser is probably broken in a way that it doesn't look for closing tags (think, regular expression matching for the open tag, then extracting data until it hits the next tag). It's not beyond Microsoft to break their own software just to make it difficult for others to use the format, but it does make it harder on us developers to keep going.

    So, I'd try finding a way to rebuild the XML file before parsing it, but only after detecting if it needs it.
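One way to approach the "detect if it needs it" step is to sniff the start of the download. This is only a minimal sketch, assuming that OFX 1.x files begin with plain `KEY:VALUE` header lines such as `OFXHEADER:100` and `DATA:OFXSGML`, and that OFX 2.x files begin with an XML declaration or an `<?OFX ...?>` processing instruction. `detect_ofx_flavor` is a hypothetical helper name, and real bank downloads may omit or mangle these headers, in which case it reports "unknown".

```python
import re


def detect_ofx_flavor(text: str) -> str:
    """Guess whether an OFX payload is 1.x SGML or 2.x XML.

    Returns "sgml", "xml", or "unknown". Hypothetical helper, not part of
    any OFX library.
    """
    # Only the first few hundred characters matter; skip a BOM and blank lines.
    head = text.lstrip("\ufeff \t\r\n")[:512]

    # Assumption: OFX 2.x starts with <?xml ...?> and/or <?OFX ...?>.
    if head.startswith("<?xml") or "<?OFX" in head:
        return "xml"

    # Assumption: OFX 1.x starts with KEY:VALUE lines, e.g. OFXHEADER:100.
    if re.match(r"OFXHEADER\s*:\s*\d+", head):
        return "sgml"
    if re.search(r"^DATA\s*:\s*OFXSGML", head, re.MULTILINE):
        return "sgml"

    return "unknown"
```

Only payloads reported as "sgml" (or "unknown") would then go through a repair step before being handed to an XML parser.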
    • Microsoft's parser is probably broken in a way that it doesn't look for closing tags

      Or, you could say that Microsoft's parser is much more robust in how it deals with malformed documents.

      Seriously, you can't blame MS for writing a better XML parser. Just because a parser knows that

      <foo>this<bar>that</foo>

      ...is not valid XML, does not mean it needs to choke and die. MSXML can easily validate a document as well as parse invalid ones.

      • Well, yeah you can blame them.

        When a company with monopoly power writes software that accepts broken, non-standards-compliant input, it's creating a new standard, one that isn't published, since people will generally only work on their software until it produces an output that works with the monopoly's software.

        It's just a trick to take what was an open standard and turn it into one with secret formats and rules only MS knows.

        Robust would mean it wouldn't crash on broken input. That's fine. It should
        • When a company with monopoly power writes software that accepts broken, non-standards-compliant input, it's creating a new standard, one that isn't published, since people will generally only work on their software until it produces an output that works with the monopoly's software.

          This is a non sequitur. Forget the fact that probably 60%+ of the internet is still using HTML 4.0 (which is non-compliant) - people are going to create invalid input regardless.

          Also, I don't really even get what you are asking

          • I don't want a browser to throw an error when it can't parse a page, I would have liked it if way back when, it wouldn't have rendered a page that was broken. That way, the web developers could instantly tell something was wrong, and fix their code.

            MS embraces broken standards by making an XML parser that parses XML files that are non-standard compliant, and in this case (MS Money), I think it should throw an error, instead of parsing. That would have instantly made people call up their financial institu
          • This is all about context and intent.

            The browser should output a warning or error when fed bad HTML, but it could be in some debug window that isn't normally shown. I guarantee if browsers had this function, the web would be much more compliant than it is today, since instead of just "playing with the tag soup until it looks right on IE", they could actually see where they misnested an element and fix it, or whatever they need to do to make it valid.

            I'll use a company that is not much higher in MS in mos
            • The site recommended:

              Store (encrypted) information in cookies even before transfer to the server, so information is preserved from all but the most serious "melt-downs."

              GigsVT responded:

              This seems useless to me. If my computer loses power right now, I will lose this very long message I have typed into this very small box on this form on slashdot, and there's nothing the Slash devs can do to help that.

              I think you missed the point. There is something the Slash devs can do as with every other web

              • Regarding your last comment, after more people replied I did see that the article was misleading, but since this thread already went down this path I decided to continue.

                There are other places MS has done crap like the article accuses, even if this isn't one of them.
        • by Knights who say 'INT ( 708612 ) on Wednesday February 02, 2005 @09:25PM (#11557660) Journal
          Tsk. It's called the "be tolerant in what you accept and strict in what you send" rule.
          • It was meant for general user input.

            It's all well and good except when you have a feedback loop where the more tolerant you are, the worse the input gets as users get more and more sloppy.
            I don't want a 'tolerant' xhtml parser for instance. Otherwise we end up with the stupidity of html again.
      • <foo>this<bar>that</foo>

        ...is not a problem, as that is guessable from a generic parser's perspective, but something like:

        <foo>this<bar>that<bar2>thatagain</foo>

        presents problems with where to place the bar2. Is it a sibling to bar, or a child? By eyeballing the data, you can make an intuitive guess, and probably be right. Putting code into play that must always make the right decision on the other hand can not rely on intuition. Especially when it com

        • BTW, MS's own XML parser chokes on these docs as well. Money can parse it, but, the msxml parser chokes on it.

          I suspect that if you use the MSXML SAX parser, or any other SAX parser for that matter, you can ignore the exceptions it throws up and carry on parsing. Perhaps that is what MS Money is doing.

          I definitely wouldn't expect a DOM parser to handle such a broken document.
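The "event parser that just carries on" idea can be illustrated without a SAX library. The sketch below swaps in Python's standard `html.parser.HTMLParser`, which is a tolerant, event-driven tokenizer rather than an XML or SAX parser, and which lowercases tag names. `EventLogger` and `tolerant_events` are hypothetical names for illustration, and nothing here is a claim about what MS Money actually does.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Record start/end/data events without enforcing any nesting rules."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Ignore whitespace-only runs between tags.
        if data.strip():
            self.events.append(("data", data.strip()))


def tolerant_events(markup: str):
    """Return the event stream for markup, even if tags are left unclosed."""
    parser = EventLogger()
    parser.feed(markup)
    parser.close()  # flush any buffered trailing text
    return parser.events
```

Feeding it `<foo>this<bar>that</foo>` yields a start event for `bar` with no matching end event; deciding what that means is left entirely to the caller, which is where a DOM parser has no such freedom.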

        • This is probably because these aren't XML documents, and Money doesn't use the XML parser to read them.
      • Seriously, you can't blame MS for writing a better XML parser.

        It is not better to silently ignore the fact that you are dealing with what looks like corrupt data.

    • Indeed. The same thing happens with QIF. The spec is open, and after reading the first page you find out that most banks produce files that don't adhere to the spec. Solution: adjust the parser-- it's a simple fix with QIF. The problem is probably that they all use the same software to create the slightly defective QIF, and MS is aware of this (more likely is that they use MS software to produce the QIFs).
    • Yeah, I was kind of hoping that the obvious answer was the wrong answer. The stuff is not too complicated, there is just a lot of it (DTD). I have a parser now that does not choke on the data. It needs much testing and time to make sure it really always works right. There is no room for error when it comes to people's money.

      It is also another piece of code that has to be maintained and updated as updates on the OFX standard come out.

      InnerWeb

  • Gnucash. (Score:2, Informative)

    by Anonymous Coward
    You did check Gnucash's importers to see how they did it, Right?
    • LibOFX (Score:4, Informative)

      by Noksagt ( 69097 ) on Wednesday February 02, 2005 @08:14PM (#11557039) Homepage
      They use LibOFX [sourceforge.net], available under the GPL
  • If you have written code that works better than the open source code you have tried, but you'd rather use said open source code, isn't it obvious that you should send some patches? That's how open source works, you know.
  • http://freshmeat.net/search/?q=ofx&section=projects&Go.x=0&Go.y=0 Also, have you tried any SGML parsers? Despite XML being a 'subset' of SGML, SGML allows many constructs not allowed in XML.
  • by joto ( 134244 ) on Wednesday February 02, 2005 @06:24PM (#11555658)
    OFX is based upon SGML and, like XML, it is an attempt to take the best features of SGML and remove much of the associated complexity. OFX is not technically an XML application. The syntax of OFX differs from that set out for XML applications in that OFX omits end-tags

    It's SGML, not XML. Unless you insist on doing it the hard way with a real SGML parser, I can't see what's wrong with using your own hand-rolled one. As you've already recognized yourself, it shouldn't be too hard.

    Other alternatives would be to have a preprocessor that converts it to XML, or maybe use some too-tolerant XML-parser. On the other hand, if the file format isn't XML, I can't see why it would be easier to treat it as if it were.

    Am I missing something here, or is this truly a file format that is broken as a feature?

    Mu [watson-net.com]. Yes, you are missing the distinction between XML and SGML. No, it's not broken as a feature, it just predates XML.
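The preprocessor route mentioned above could look roughly like the following. It is a sketch under stated assumptions, not a complete OFX converter: it assumes the header lines have already been stripped so the input starts at `<OFX>`, that only data-carrying (leaf) elements omit their end tags while containers are closed explicitly, and that tags carry no attributes. `close_leaf_tags` is a hypothetical helper name.

```python
import re
import xml.etree.ElementTree as ET

# A token is either a tag (<NAME> or </NAME>) or a run of text.
_TOKEN = re.compile(r"<(/?)([A-Za-z0-9_.]+)>|([^<]+)")


def close_leaf_tags(body: str) -> str:
    """Insert the end tags that leaf elements omit, producing XML text."""
    out = []
    prev_open = None   # most recent start tag, if only text has followed it
    has_text = False   # whether that start tag has received non-blank text
    for match in _TOKEN.finditer(body):
        slash, name, text = match.groups()
        if text is not None:
            data = text.strip()
            if data:
                # Escape bare ampersands; leave existing entities alone.
                out.append(re.sub(r"&(?!#?\w+;)", "&amp;", data))
                has_text = prev_open is not None
            continue
        # A tag arrived: close the pending leaf unless this is its own end tag.
        if prev_open is not None and has_text and not (slash and name == prev_open):
            out.append(f"</{prev_open}>")
        out.append(f"<{slash}{name}>")
        prev_open = None if slash else name
        has_text = False
    if prev_open is not None and has_text:
        out.append(f"</{prev_open}>")
    return "".join(out)


sample = "<OFX><STMTTRN><TRNTYPE>DEBIT\n<TRNAMT>-12.50\n<NAME>A & B\n</STMTTRN></OFX>"
root = ET.fromstring(close_leaf_tags(sample))
print(root.find("STMTTRN/TRNAMT").text)
```

The rule "a start tag followed by text is a leaf" sidesteps the sibling-or-child ambiguity only as long as the files never put text directly inside a container element; that would need checking against real bank downloads.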

    • The link the asker posted actually specifically states this - it's not XML, it's similar to XML but omits end tags to conserve space. Roll your own parser, buddy.
    • That is true. And on MS's site, it is called XML. No matter, which it is, by XML standards it is broken, and by SGML standards it is broken. When you create a DTD for SGML, you have to actually note in the DTD whether or not a closing (or opening) tag is optional [virginia.edu]. They have not marked any tag in the DTD I have (from www.ofx.org) as optional.

      If they had marked tag closings as optional then in the DTD you should see something like this:

      <!ELEMENT SEVERITY - O %SEVERITYENUM;>

      The - mean the openi

    • It's SGML, not XML. Unless you insist on doing it the hard way with a real SGML parser, I can't see what's wrong with using your own hand-rolled one. As you've already recognized yourself, it shouldn't be too hard.

      A lot gets wrong with a hand-rolled one: SGML is a huge standard and there are a number of constructs that can be used while still adhering to the DTD.

      A better suggestion would be to use e.g. J. Clark's superb-complete sp SGML parser (nsgmls) to read the DTDs and parse the SGML files: nsgmls

    • It's SGML, not XML

      Really? I was browsing what I thought was the official OFX site and ran across this:
      Since 2000, with the 2.0 specification, OFX has become XML 1.0 compliant and has added 1098, 1099 and W2 tax form download capabilities. (

      Did it move from SGML to XML with 2.0 (5 years ago), or has it always been "XML" in spirit only?

    • All very excellent points, but if you need an SGML parser to 'do it the hard way' (?) just grab a copy of James Clark's SP parser from here [jclark.com].

      You could probably use that as a starting point if your XML parsers don't like the doc format.

      Cheers.

  • check out libofx (Score:2, Informative)

    by UncleBoy ( 45706 )
    Get someone to show you how to use google.com while you're at it.
    • I did check it out. It looks promising, but it is GPL. The businesses that want the solution to include OFX do not want to have their internal code mixed with anything GPL.

      If it was just for me, or something I could release without worry, I would not hesitate to use it, but it would not be right to use it, distribute it and then not supply source code (which would be a breach of contract for me).

      Get someone to show you how to use google.com while you're at it

      Hmm. Nice.

      InnerWeb

  • The link included with the article says, "OFX is based upon SGML and, like XML, it is an attempt to take the best features of SGML and remove much of the associated complexity."

    Based on that, I'm not even sure if it's even an SGML application, or if it's just something that looks like SGML?

    Open Financial Exchange Specification 1.0.2, Section 2.3.1:

    SGML is the basis for Open Financial Exchange. A DTD formally defines the SGML wire format for Open Financial Exchange. However, Open Financial Exchange is no

    • by innerweb ( 721995 ) on Thursday February 03, 2005 @01:59AM (#11559554)
      I will be the first to admit it has been a while since I worked with SGML, but IIRC, in the DTD for an SGML doc, you mark optional tags with an O, so that it would look like this:
      <!ELEMENT elemname - O (#PCDATA) >
      where the - means the opening tag is required, and the O means the closing tag is not required.

      Whether or not I think I remember it, OFX has sent me back to books I have had in boxes for almost a decade now. Normally when I pull old books out like that, they are picture albums for the family, not old programming and data manuals.

      InnerWeb
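Whether a given DTD marks any end tags as omissible can be checked mechanically rather than by eye. The sketch below scans DTD text for `<!ELEMENT name - O ...>` style declarations and reports the two omission flags for each element. It is a rough regex pass, not an SGML DTD parser: it ignores parameter-entity expansion, name groups, and comments, and `tag_omission` is a hypothetical helper name.

```python
import re

# Matches declarations of the form: <!ELEMENT name  -|O  -|O  content-model>
_ELEMENT_DECL = re.compile(
    r"<!ELEMENT\s+(?P<name>[^\s>]+)\s+(?P<start>[-Oo])\s+(?P<end>[-Oo])\s+[^>]*>"
)


def tag_omission(dtd_text: str) -> dict:
    """Map element name -> (start_tag_omissible, end_tag_omissible).

    Declarations without omission flags are skipped.
    """
    flags = {}
    for match in _ELEMENT_DECL.finditer(dtd_text):
        flags[match.group("name")] = (
            match.group("start") in "Oo",
            match.group("end") in "Oo",
        )
    return flags


sample_dtd = """
<!ELEMENT SEVERITY - O %SEVERITYENUM;>
<!ELEMENT elemname - O (#PCDATA) >
<!ELEMENT STATUS - - (CODE, SEVERITY)>
<!ELEMENT PLAIN (#PCDATA)>
"""
for name, (start_opt, end_opt) in tag_omission(sample_dtd).items():
    print(name, "end tag optional" if end_opt else "end tag required")
```

Running this over the downloaded DTD would show at a glance whether any element is declared with `- O`; an empty result would support the claim that no tag is marked optional.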

  • I built Checkfree's first OFX parser, and yes, we rolled our own. After looking at the DTDs and finding numerous mistakes, we decided that building our own would be easier, and that if the DTDs ever got cleaned up, we could use an SGML parser in a later version.

    Sounds like the DTDs haven't gotten cleaned up.
  • You could try running it through the HTML tidy [sourceforge.net] program and see what happens.

    Graham
