Help/Opinions on Parsing OFX Files?


Posted by Cliff on Wednesday February 02, 2005 @05:54PM from the not-so-standard-standards dept.

innerweb asks: "I am looking for help and advice on using and parsing the OFX (Open Financial Exchange) file spec using C/C++ and/or Perl. I have read the standards, downloaded the DTD (ofx version 2), and tried to parse several files from different banks. They have all failed in my normal parsers (commercial and OSS), yet they load fine in Microsoft Money. It is not so complicated that I can not hand roll my own, and I have much of it working that way as a proof, but I would rather stick with something that is standards based, as this is a standard that in my opinion ought to work with standards based tools. Am I missing something here, or is this truly a file format that is broken as a feature?"

"I know the files are malformed when they come down, as they are missing the normal XML and SGML file headers ?XML or !DOCTYPE to define the dtd to use to parse the file. I know that the document is not 'well formed' as I understand it, as most of the tags in the datafile are not closed (open tag, but no corresponding closing tag). When I fix these errors, the files seem to parse. yet, I know that from what I have seen, MS Money takes in the same raw data and parses it. Microsoft lists the OFX file format as XML in some places and SGML in others. The OFX website seems to be saying this is SGML, not XML (XML is a subset of SGML in most cases, but the way it is *used* sometimes it is not really SGML at all.)

I have been reading like mad for a few weeks on OFX format files and usage, but not getting much useful information. I have worked with SGML and XML in the past, so I am at least familiar with these *conventions*. I need to be pointed in the right direction, and/or told what I am doing wrong or overlooking. I know it is probably something obvious, but somehow I am not getting it.

Thanks in advance for any help that you can throw my way."


This discussion has been archived. No new comments can be posted.


You're surprised? (Score:1, Troll)

by Smallpond ( 221300 ) writes:

What makes you think that CheckFree, Intuit and Microsoft would make it easy for a bunch of OSS developers to work with "their" standard?
- Or that mainstream banks... (Score:2)
  
  by leonbrooks ( 8043 ) writes:
  
  ...would do anything straightforwardly, if they had a chance to "improve" it?
I think you found your answer (Score:3, Interesting)

by ciroknight ( 601098 ) writes: on Wednesday February 02, 2005 @05:58PM (#11555411)

Fix the XML, then parse. Microsoft's parser is probably broken in a way that it doesn't look for closing tags (think, regular expression matching for the open tag, then extracting data until it hits the next tag). It's not beyond Microsoft to break their own software just to make it difficult for others to use the format, but it does make it harder on us developers to keep going.

So, I'd try finding a way to rebuild the XML file before parsing it, but only after detecting if it needs it.
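
One way to read "rebuild the file before parsing, but only if it needs it" is sketched below in Python (picked only for brevity; the question asks about C/C++ and Perl). It assumes the common OFX 1.x layout, where data-bearing elements appear as `<TAG>value` with no end tag and aggregates are closed explicitly. The names `needs_repair` and `close_leaf_tags` are hypothetical, and the sketch does not strip any non-markup header lines that precede `<OFX>` or escape stray `&` characters.

```python
import re
import xml.etree.ElementTree as ET

# "<TAG>value": an opening tag followed on the same line by non-blank text.
_LEAF = re.compile(r"<([A-Za-z0-9_.]+)>[ \t]*([^<\s][^<\r\n]*)")


def needs_repair(body: str) -> bool:
    """Return True if the body does not parse as well-formed XML as it stands."""
    try:
        ET.fromstring(body)
        return False
    except ET.ParseError:
        return True


def close_leaf_tags(body: str) -> str:
    """Append an end tag to every data-bearing element that lacks one."""
    out = []
    pos = 0
    for m in _LEAF.finditer(body):
        tag, value = m.group(1), m.group(2).rstrip()
        out.append(body[pos:m.start()])
        out.append(f"<{tag}>{value}")
        # Leave elements alone when the file already closes them.
        already_closed = re.compile(r"\s*</" + re.escape(tag) + ">").match(body, m.end())
        if not already_closed:
            out.append(f"</{tag}>")
        pos = m.end()
    out.append(body[pos:])
    return "".join(out)


sample = "<OFX>\n<STATUS>\n<CODE>0\n<SEVERITY>INFO\n</STATUS>\n</OFX>\n"
fixed = close_leaf_tags(sample) if needs_repair(sample) else sample
root = ET.fromstring(fixed)
print(root.find("STATUS/SEVERITY").text)
```

Running the repair on an already well-formed body should leave it unchanged, so the detection step is an optimisation rather than a safety requirement in this sketch.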

- Re:I think you found your answer (Score:3, Insightful)
  
  by brunes69 ( 86786 ) writes:
  
  Microsoft's parser is probably broken in a way that it doesn't look for closing tags
  Or, you could say that Microsoft's parser is much more robust in how it deals with malformed documents.
  Seriously, you can't blame MS for writing a better XML parser. Just because a parser knows that
  
  <foo>this<bar>that</foo>
  
  ...is not valid XML, does not mean it needs to choke and die. MSXML can easily validate a document as well as parse invalid ones.
  - Re:I think you found your answer (Score:1)
    
    by GigsVT ( 208848 ) writes:
    
    Well, yeah you can blame them.
    
    When a company with monopoly power writes software that accepts broken, non-standards-compliant input, it's creating a new standard, one that isn't published, since people will generally only work on their software until it produces an output that works with the monopoly's software.
    
    It's just a trick to take what was an open standard and turn it into one with secret formats and rules only MS knows.
    
    Robust would mean it wouldn't crash on broken input. That's fine. It should
    - Re:I think you found your answer (Score:2)
      
      by brunes69 ( 86786 ) writes:
      
      When a company with monopoly power writes software that accepts broken, non-standards-compliant input, it's creating a new standard, one that isn't published, since people will generally only work on their software until it produces an output that works with the monopoly's software.
      This is a non sequitur. Forget the fact that probably 60%+ of the internet is still using HTML 4.0 (which is non-compliant) - people are going to create invalid input regardless.
      Also, I don't really even get what you are asking
      - Re:I think you found your answer (Score:2)
        
        by ciroknight ( 601098 ) writes:
        
        I don't want a browser to throw an error when it can't parse a page, I would have liked it if way back when, it wouldn't have rendered a page that was broken. That way, the web developers could instantly tell something was wrong, and fix their code.
        
        MS embraces broken standards by making an XML parser that parses XML files that are non-standard compliant, and in this case (MS Money), I think it should throw an error, instead of parsing. That would have instantly made people call up their financial institu
      - Re:I think you found your answer (Score:1)
        
        by GigsVT ( 208848 ) writes:
        
        This is all about context and intent.
        
        The browser should output a warning or error when fed bad HTML, but it could be in some debug window that isn't normally shown. I guarantee if browsers had this function, the web would be much more compliant than it is today, since instead of just "playing with the tag soup until it looks right on IE", they could actually see where they misnested an element and fix it, or whatever they need to do to make it valid.
        
        I'll use a company that is not much higher in MS in mos
        
        Quick save (OT) (Score:2)
        
        by OldMiner ( 589872 ) writes:
        
        The site recommended:
        
        Store (encrypted) information in cookies even before transfer to the server, so information is preserved from all but the most serious "melt-downs."
        
        GigsVT responded:
        This seems useless to me. If my computer loses power right now, I will lose this very long message I have typed into this very small box on this form on slashdot, and there's nothing the Slash devs can do to help that.
        I think you missed the point. There is something the Slash devs can do as with every other web
        
        Re:Quick save (OT) (Score:1)
        
        by GigsVT ( 208848 ) writes:
        
        Regarding your last comment, after more people replied I did see that the article was misleading, but since this thread already went down this path I decided to continue.
        
        There are other places MS has done crap like article accuses, even if this isn't one of them.
    - Re:I think you found your answer (Score:4, Insightful)
      
      by Knights who say 'INT ( 708612 ) writes: on Wednesday February 02, 2005 @09:25PM (#11557660) Journal
      
      Tsk. It's called the "be tolerant in what you accept and strict in what you send" rule.
      
      - Re:I think you found your answer (Score:2)
        
        by JohnFluxx ( 413620 ) writes:
        
        It was meant for general user input.
        
        It's all well and good except when you have a feedback loop where the more tolerant you are, the worse the input gets as users get more and more sloppy.
        I don't want a 'tolerant' xhtml parser for instance. Otherwise we end up with the stupidity of html again.
  - Re:I think you found your answer (Score:2)
    
    by innerweb ( 721995 ) writes:
    
    <foo>this<bar>that</foo>
    
    ...is not a problem, as that is guessable from a generic parser's perspective, but something like:
    <foo>this<bar>that<bar2>thatagain</foo>
    
    presents problems with where to place the bar2. Is it a sibling to bar, or a child? By eyeballing the data, you can make an intuitive guess, and probably be right. Putting code into play that must always make the right decision on the other hand can not rely on intuition. Especially when it com
    - Re:I think you found your answer (Score:2)
      
      by jrumney ( 197329 ) writes:
      
      BTW, MS's own XML parser chokes on these docs as well. Money can parse it, but, the msxml parser chokes on it.
      I suspect that if you use the MSXML SAX parser, or any other SAX parser for that matter, you can ignore the exceptions it throws up and carry on parsing. Perhaps that is what MS Money is doing.
      I definitely wouldn't expect a DOM parser to handle such a broken document.
    - Re:I think you found your answer (Score:2)
      
      by arkanes ( 521690 ) writes:
      
      This is probably because these aren't XML documents, and Money doesn't use the XML parser to read them.
  - Re:I think you found your answer (Score:2)
    
    by jrumney ( 197329 ) writes:
    
    Seriously, you can't blame MS for writing a better XML parser.
    It is not better to silently ignore the fact that you are dealing with what looks like corrupt data.
- QIF (Score:2)
  
  by twistedcubic ( 577194 ) writes:
  
  Indeed. The same thing happens with QIF. The spec is open, and after reading the first page you find out that most banks produce files that don't adhere to the spec. Solution: adjust the parser-- it's a simple fix with QIF. The problem is probably that they all use the same software to create the slightly defective QIF, and MS is aware of this (more likely is that they use MS software to produce the QIFs).
- Re:I think you found your answer (Score:2)
  
  by innerweb ( 721995 ) writes:
  
  Yeah, I was kind of hoping that the obvious answer was the wrong answer. The stuff is not too complicated, there is just a lot of it (DTD). I have a parser now that does not choke on the data. It needs much testing and time to make sure it really always works right. There is no room for error when it comes to people's money.
  
  It is also another piece of code that has to be maintained and updated as updates on the OFX standard come out.
  
  InnerWeb
Gnucash. (Score:2, Informative)

by Anonymous Coward writes:

You did check Gnucash's importers to see how they did it, Right?
  - Re:Gnucash. (Score:2)
    
    by ophix ( 680455 ) writes:
    
    worst. analogy. ever.
  - Re:Gnucash. (Score:2)
    
    by ComputerSlicer23 ( 516509 ) writes:
    
    Are you just being silly now or what? There are two things to consider. First, you could use the GPL'ed code as a filter. Take in said crappy file, output good working file. More than likely, you could take the GNUcash code and write some "plumbing" or "glue" code to get this accomplished, where 99% of the real work is done by the already-written GPL'ed code. You just write the 1% to give you access to the interface the way you want it. Thus you wouldn't integrate all of that into your
  - Re:Gnucash. (Score:3, Informative)
    
    by GigsVT ( 208848 ) writes:
    
    In addition to the other excellent replies, it's a misconception that code ever unwittingly "becomes GPL".
    
    If you use GPL code in your application in a way that violates the GPL, you have violated copyright law. That's it. The GPL doesn't control the remedies, the legal system does. Generally the remedy would be in the form of a cash settlement to the copyright owner of the software you violated the copyright on.
  - He may not want to release his software .. (Score:1)
    
    by torpor ( 458 ) writes:
    
    .. it may just be for his own personal use .. he may not release it to the wild, under any terms ..
    
    or, to put it your way "who cares if you have herpes if you're a wanker, anyway?"
- LibOFX (Score:4, Informative)
  
  by Noksagt ( 69097 ) writes: on Wednesday February 02, 2005 @08:14PM (#11557039) Homepage
  
  They use LibOFX [sourceforge.net], available under the GPL
  
Isn’t it obvious? (Score:2, Insightful)

by Pan T. Hose ( 707794 ) writes:

If you have written code that works better than the open source code you have tried, but you'd rather use said open source code, isn't it obvious that you should send some patches? That's how open source works, you know.
Did you bother searching? (Score:1)

by rrsipov ( 648218 ) writes:

http://freshmeat.net/search/?q=ofx&section=projects&Go.x=0&Go.y=0 Also, have you tried any SGML parsers? Despite XML being a 'subset' of SGML, SGML allows many constructs not allowed in XML.
- Re:Did you bother searching? (Score:4, Funny)
  
  by Rick the Red ( 307103 ) writes: <Rick DOT The DOT Red AT gmail DOT com> on Wednesday February 02, 2005 @06:37PM (#11555798) Journal
  
  Despite XML being a 'subset' of SGML, SGML allows many constructs not allowed in XML.
  
  I don't think that word [onelook.com] means what you think it means.
  Or is this word [onelook.com] the one you don't understand?
  
  - Re:Did you bother searching? (Score:1)
    
    by ricochet81 ( 707864 ) writes:
    
    inconceivable
Um, the answer is in the link you posted. (Score:5, Informative)

by joto ( 134244 ) writes: on Wednesday February 02, 2005 @06:24PM (#11555658)

OFX is based upon SGML and, like XML, it is an attempt to take the best features of SGML and remove much of the associated complexity. OFX is not technically an XML application. The syntax of OFX differs from that set out for XML applications in that OFX omits end-tags
It's SGML, not XML. Unless you insist on doing it the hard way with a real SGML parser, I can't see what's wrong with using your own hand-rolled one. As you've already recognized yourself, it shouldn't be too hard.
Other alternatives would be to have a preprocessor that converts it to XML, or maybe use some too-tolerant XML-parser. On the other hand, if the file format isn't XML, I can't see why it would be easier to treat it as if it were.
Am I missing something here, or is this truly a file format that is broken as a feature?
Mu [watson-net.com]. Yes, you are missing the distinction between XML and SGML. No, it's not broken as a feature, it just predates XML.
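
A hand-rolled reader along these lines might look like the following Python sketch. It is not a conforming SGML parser: it assumes the convention quoted above (elements that carry data omit their end tags, aggregates do not), and the `Node` class and `parse_ofx_body` function are hypothetical names. An element with neither text nor an end tag would confuse it, so treat it as a starting point only.

```python
import re
from dataclasses import dataclass, field

# One token = an opening or closing tag plus any text that follows it.
_TOKEN = re.compile(r"<(/?)([A-Za-z0-9_.]+)>([^<]*)")


@dataclass
class Node:
    tag: str
    text: str = ""
    children: list = field(default_factory=list)


def parse_ofx_body(body: str) -> Node:
    """Build a tree, treating any opening tag followed by text as a closed leaf."""
    root = Node("#document")
    stack = [root]
    for m in _TOKEN.finditer(body):
        is_close, tag, text = m.group(1) == "/", m.group(2), m.group(3).strip()
        top = stack[-1]
        if not is_close:
            node = Node(tag, text)
            top.children.append(node)
            if not text:
                # No data follows, so assume an aggregate that will be closed later.
                stack.append(node)
        elif len(stack) > 1 and top.tag == tag:
            stack.pop()
        elif (top.children and top.children[-1].tag == tag
              and top.children[-1].text and not top.children[-1].children):
            # Tolerate files that do close their leaf elements explicitly.
            continue
        else:
            raise ValueError(f"unexpected </{tag}> inside <{top.tag}>")
    if len(stack) != 1:
        raise ValueError(f"<{stack[-1].tag}> was never closed")
    return root


doc = parse_ofx_body(
    "<OFX><STATUS><CODE>0<SEVERITY>INFO</STATUS><BANKID>123</BANKID></OFX>"
)
status = doc.children[0].children[0]
print([(c.tag, c.text) for c in status.children])
```

Raising on an unexpected end tag, rather than guessing, is a deliberate choice here: with financial data a loud failure seems preferable to a silently misplaced element.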

- Re:Um, the answer is in the link you posted. (Score:2)
  
  by arkanes ( 521690 ) writes:
  
  The link the asker posted actually specifically states this - it's not XML, it's similar to XML but omits end tags to conserve space. Roll your own parser, buddy.
- Re:Um, the answer is in the link you posted. (Score:2)
  
  by innerweb ( 721995 ) writes:
  
  That is true. And on MS's site, it is called XML. No matter which it is, by XML standards it is broken, and by SGML standards it is broken. When you create a DTD for SGML, you have to actually note in the DTD whether or not a closing (or opening) tag is optional [virginia.edu]. They have not marked any tag in the DTD I have (from www.ofx.org) as optional.
  If they had marked tag closings as optional, then in the DTD you should see something like this:
  <!ELEMENT SEVERITY - O %SEVERITYENUM;>
  
  The - mean the openi
- Re:Um, the answer is in the link you posted. (Score:1)
  
  by yason ( 249474 ) writes:
  
  It's SGML, not XML. Unless you insist on doing it the hard way with a real SGML parser, I can't see what's wrong with using your own hand-rolled one. As you've already recognized yourself, it shouldn't be too hard.
  
  A lot gets wrong with a hand-rolled one: SGML is a huge standard and there are a number of constructs that can be used while still adhering to the DTD.
  A better suggestion would be to use e.g. J. Clark's superb-complete sp SGML parser (nsgmls) to read the DTDs and parse the SGML files: nsgmls
- Re:Um, the answer is in the link you posted. (Score:2)
  
  by dschuetz ( 10924 ) writes:
  
  It's SGML, not XML
  
  Really? I was browsing what I thought was the official OFX site and ran across this:
  Since 2000, with the 2.0 specification, OFX has become XML 1.0 compliant and has added 1098, 1099 and W2 tax form download capabilities. (http://www.ofx.net/ofx/ab_main.asp [ofx.net])
  
  Did it move from SGML to XML with 2.0 (5 years ago), or has it always been "XML" in spirit only?
- Re:Um, the answer is in the link you posted. (Score:2)
  
  by gstoddart ( 321705 ) writes:
  
  All very excellent points, but if you need an SGML parser to 'do it the hard way' (?) just grab a copy of James Clark's SP parser from here [jclark.com].
  
  You could probably use that as a starting point if your XML parsers don't like the doc format.
  
  Cheers.
- Re:Heh (Score:2)
  
  by HotNeedleOfInquiry ( 598897 ) writes:
  
  To which I would add "Welcome to the Real World"...
check out libofx (Score:2, Informative)

by UncleBoy ( 45706 ) writes:

Get someone to show you how to use google.com while you're at it.
- Re:check out libofx (Score:2)
  
  by innerweb ( 721995 ) writes:
  
  I did check it out. It looks promising, but it is GPL. The businesses that want the solution to include OFX do not want to have their internal code mixed with anything GPL.
  
  If it was just for me, or something I could release without worry, I would not hesitate to use it, but it would not be right to use it, distribute it and then not supply source code (which would be a breach of contract for me).
  Get someone to show you how to use google.com while you're at it
  Hmm. Nice.
  InnerWeb
Is it an SGML application? (Score:2)

by Karma Farmer ( 595141 ) writes:

The link included with the article says, "OFX is based upon SGML and, like XML, it is an attempt to take the best features of SGML and remove much of the associated complexity."

Based on that, I'm not even sure if it's even an SGML application, or if it's just something that looks like SGML?

Open Financial Exchange Specification 1.0.2, Section 2.3.1:
SGML is the basis for Open Financial Exchange. A DTD formally defines the SGML wire format for Open Financial Exchange. However, Open Financial Exchange is no
- Re:Is it an SGML application? (Score:4, Informative)
  
  by innerweb ( 721995 ) writes: on Thursday February 03, 2005 @01:59AM (#11559554)
  
  I will be the first to admit it has been a while since I worked with SGML, but IIRC, in the DTD for an SGML doc, you mark optional tags with an O, so that it would look like this:
  
  <!ELEMENT elemname - O (#PCDATA) >
  
  where the - means the opening tag is required, and the O means the closing tag is not required.
  Whether or not I think I remember it, OFX has sent me back to books I have had in boxes for almost a decade now. Normally when I pull old books out like that, they are picture albums for the family, not old programming and data manuals.
  InnerWeb
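
To check a DTD for the markers described above, a small scan of its element declarations may be enough. The Python sketch below assumes declarations of the simple form `<!ELEMENT name - O ...>` with a single element name; name groups, parameter entities in the name position, and declarations without the two minimization fields are not handled. `tag_omission` is a hypothetical helper, not part of any SGML toolkit.

```python
import re

# <!ELEMENT name START END ...> where START/END are "-" (required) or "O" (omissible).
_ELEMENT = re.compile(r"<!ELEMENT\s+([A-Za-z][\w.\-]*)\s+([-Oo])\s+([-Oo])\s")


def tag_omission(dtd_text: str) -> dict:
    """Map element name -> (start tag omissible, end tag omissible)."""
    return {
        m.group(1): (m.group(2) in "Oo", m.group(3) in "Oo")
        for m in _ELEMENT.finditer(dtd_text)
    }


dtd = """
<!ELEMENT SEVERITY - O %SEVERITYENUM;>
<!ELEMENT elemname - O (#PCDATA) >
<!ELEMENT STATUS - - (CODE, SEVERITY)>
"""
flags = tag_omission(dtd)
# Elements whose end tag the DTD allows to be left out.
print(sorted(name for name, (_, end_ok) in flags.items() if end_ok))
```

If the list comes back empty for the published OFX DTD, that would support the reading that the files banks send do not match the DTD as written.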
  
Not surprising (Score:1)

by Undertaker43017 ( 586306 ) writes:

I built Checkfree's first OFX parser, and yes we rolled our own. After looking at the DTDs, and finding numerous mistakes, we decided that building our own would be easier, and IF the DTDs ever got cleaned up, we could use an SGML parser in a later version.

Sounds like the DTDs haven't gotten cleaned up.
HTML tidy (Score:1)

by gfim ( 452121 ) writes:

You could try running it through the HTML tidy [sourceforge.net] program and see what happens.

Graham
