Help/Opinions on Parsing OFX Files?
innerweb asks: "I am looking for help and advice on using and parsing the OFX (Open Financial Exchange) file spec using C/C++ and/or Perl. I have read the standards, downloaded the DTD (ofx version 2), and tried to parse several files from different banks. They have all failed in my normal parsers (commercial and OSS), yet they load fine in Microsoft Money. It is not so complicated that I can not hand roll my own, and I have much of it working that way as a proof, but I would rather stick with something that is standards based, as this is a standard that in my opinion ought to work with standards based tools. Am I missing something here, or is this truly a file format that is broken as a feature?"
"I know the files are malformed when they come down, as they are missing the normal XML and SGML file headers ?XML or !DOCTYPE to define the dtd to use to parse the file. I know that the document is not 'well formed' as I understand it, as most of the tags in the datafile are not closed (open tag, but no corresponding closing tag). When I fix these errors, the files seem to parse. yet, I know that from what I have seen, MS Money takes in the same raw data and parses it. Microsoft lists the OFX file format as XML in some places and SGML in others. The OFX website seems to be saying this is SGML, not XML (XML is a subset of SGML in most cases, but the way it is *used* sometimes it is not really SGML at all.)
I have been reading like mad for a few weeks on OFX format files and usage, but not getting much useful information. I have worked with SGML in the past and XML, so I am at least familiar with these *conventions*. I need to be pointed in the right direction, and/or told what I am doing wrong or overlooking. I know it is probably something obvious, but somehow I am not getting it.
Thanks in advance for any help that you can throw my way."
You're surprised? (Score:1, Troll)
Or that mainstream banks... (Score:2)
I think you found your answer (Score:3, Interesting)
So, I'd try finding a way to rebuild the XML file before parsing it, but only after detecting if it needs it.
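A rough sketch of the "detect first" step, in Python. This is only one way to do it: the function name `needs_rebuild` is made up for illustration, and the approach (skip any plain-text header before the first `<`, then see whether a strict XML parser accepts the rest) is an assumption on my part, not something from the OFX spec.

```python
import xml.etree.ElementTree as ET


def needs_rebuild(raw: str) -> bool:
    """Return True if the markup part of `raw` is not well-formed XML.

    `needs_rebuild` is a hypothetical helper name. Anything before the
    first '<' (for example a KEY:VALUE header block) is ignored.
    """
    start = raw.find("<")
    if start == -1:
        return True  # no markup at all, so it cannot be parsed as-is
    try:
        ET.fromstring(raw[start:])
    except ET.ParseError:
        return True  # strict parser rejected it: unclosed tags, etc.
    return False
```

Only files for which this returns True would go through the rebuild step; anything already well-formed is passed to the normal parser untouched.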
Re:I think you found your answer (Score:3, Insightful)
Microsoft's parser is probably broken in a way that it doesn't look for closing tags
Or, you could say that Microsoft's parser is much more robust in how it deals with malformed documents.
Seriously, you can't blame MS for writing a better XML parser. Just because a parser knows that
Re:I think you found your answer (Score:1)
When a company with monopoly power writes software that accepts broken, non-standards-compliant input, it's creating a new standard, one that isn't published, since people will generally only work on their software until it produces an output that works with the monopoly's software.
It's just a trick to take what was an open standard and turn it into one with secret formats and rules only MS knows.
Robust would mean it wouldn't crash on broken input. That's fine. It should
Re:I think you found your answer (Score:2)
When a company with monopoly power writes software that accepts broken, non-standards-compliant input, it's creating a new standard, one that isn't published, since people will generally only work on their software until it produces an output that works with the monopoly's software.
This is a non sequitur. Forget the fact that probably 60%+ of the internet is still using HTML 4.0 (which is non-compliant) - people are going to create invalid input regardless.
Also, I don't really even get what you are asking
Re:I think you found your answer (Score:2)
MS embraces broken standards by making an XML parser that parses XML files that are non-standard compliant, and in this case (MS Money), I think it should throw an error, instead of parsing. That would have instantly made people call up their financial institu
Re:I think you found your answer (Score:1)
The browser should output a warning or error when fed bad HTML, but it could be in some debug window that isn't normally shown. I guarantee if browsers had this function, the web would be much more compliant than it is today, since instead of just "playing with the tag soup until it looks right on IE", they could actually see where they misnested an element and fix it, or whatever they need to do to make it valid.
I'll use a company that is not much higher in MS in mos
Quick save (OT) (Score:2)
The site recommended:
GigsVT responded:
I think you missed the point. There is something the Slash devs can do as with every other web
Re:Quick save (OT) (Score:1)
There are other places MS has done crap like the article accuses, even if this isn't one of them.
Re:I think you found your answer (Score:4, Insightful)
Re:I think you found your answer (Score:2)
It's all well and good except when you have a feedback loop where the more tolerant you are, the worse the input gets as users get more and more sloppy.
I don't want a 'tolerant' xhtml parser for instance. Otherwise we end up with the stupidity of html again.
Re:I think you found your answer (Score:2)
...is not a problem, as that is guessable from a generic parser's perspective, but something like:
presents problems with where to place the bar2. Is it a sibling to bar, or a child? By eyeballing the data, you can make an intuitive guess, and probably be right. Putting code into play that must always make the right decision on the other hand can not rely on intuition. Especially when it com
Re:I think you found your answer (Score:2)
I suspect that if you use the MSXML SAX parser, or any other SAX parser for that matter, you can ignore the exceptions it throws up and carry on parsing. Perhaps that is what MS Money is doing.
I definitely wouldn't expect a DOM parser to handle such a broken document.
Re:I think you found your answer (Score:2)
Re:I think you found your answer (Score:2)
It is not better to silently ignore the fact that you are dealing with what looks like corrupt data.
QIF (Score:2)
Re:I think you found your answer (Score:2)
Yeah, I was kind of hoping that the obvious answer was the wrong answer. The stuff is not too complicated, there is just a lot of it (DTD). I have a parser now that does not choke on the data. It needs much testing and time to make sure it really always works right. There is no room for error when it comes to people's money.
It is also another piece of code that has to be maintained and updated as updates on the OFX standard come out.
InnerWeb
Gnucash. (Score:2, Informative)
Re:Gnucash. (Score:2)
Re:Gnucash. (Score:2)
Re:Gnucash. (Score:3, Informative)
If you use GPL code in your application in a way that violates the GPL, you have violated copyright law. That's it. The GPL doesn't control the remedies, the legal system does. Generally the remedy would be in the form of a cash settlement to the copyright owner of the software you violated the copyright on.
He may not want to release his software .. (Score:1)
or, to put it your way "who cares if you have herpes if you're a wanker, anyway?"
LibOFX (Score:4, Informative)
Isnt it obvious? (Score:2, Insightful)
Did you bother searching? (Score:1)
Re:Did you bother searching? (Score:4, Funny)
Or is this word [onelook.com] the one you don't understand?
Re:Did you bother searching? (Score:1)
Um, the answer is in the link you posted. (Score:5, Informative)
It's SGML, not XML. Unless you insist on doing it the hard way with a real SGML parser, I can't see what's wrong with using your own hand-rolled one. As you've already recognized yourself, it shouldn't be too hard.
Other alternatives would be to have a preprocessor that converts it to XML, or maybe use some too-tolerant XML-parser. On the other hand, if the file format isn't XML, I can't see why it would be easier to treat it as if it were.
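For what the preprocessor route might look like, here is a minimal Python sketch. It assumes the only problem is leaf elements whose end tags were left off (`<TAG>value` with no `</TAG>`), and that anything before the first `<` is a header to be dropped. The name `sgml_to_xml` and the sample data are invented for illustration. It does not escape bare `&` characters or handle other SGML shortcuts, so treat it as a starting point rather than a real SGML parser.

```python
import re
import xml.etree.ElementTree as ET

# A start tag followed on the same line by text data, optionally followed
# by its own end tag. Aggregates (start tag followed by another tag or a
# newline) do not match and are left alone.
_LEAF = re.compile(
    r"<([A-Za-z0-9_.]+)>[ \t]*([^<\s][^<\r\n]*)(?:\s*</\1>)?"
)


def sgml_to_xml(raw: str) -> str:
    """Close unclosed leaf tags so a strict XML parser can read the body.

    `sgml_to_xml` is a hypothetical helper, not part of any OFX library.
    """
    start = raw.find("<")
    if start == -1:
        raise ValueError("no markup found")
    body = raw[start:]  # drop any plain-text header before the markup
    return _LEAF.sub(
        lambda m: f"<{m.group(1)}>{m.group(2).strip()}</{m.group(1)}>",
        body,
    )


# Made-up sample in the style described in the question.
sample = (
    "OFXHEADER:100\nDATA:OFXSGML\n\n"
    "<OFX>\n<STMTTRN>\n<TRNAMT>-12.50\n<NAME>Coffee shop\n"
    "<MEMO>already closed</MEMO>\n</STMTTRN>\n</OFX>\n"
)
root = ET.fromstring(sgml_to_xml(sample))
```

After the conversion, any ordinary XML tool can walk the tree, e.g. `root.find("STMTTRN/TRNAMT").text`.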
Am I missing something here, or is this truly a file format that is broken as a feature?
Mu [watson-net.com]. Yes, you are missing the distinction between XML and SGML. No, it's not broken as a feature, it just predates XML.
Re:Um, the answer is in the link you posted. (Score:2)
Re:Um, the answer is in the link you posted. (Score:2)
That is true. And on MS's site, it is called XML. No matter which it is, by XML standards it is broken, and by SGML standards it is broken. When you create a DTD for SGML, you have to actually note in the DTD whether or not a closing (or opening) tag is optional [virginia.edu]. They have not marked any tag in the DTD I have (from www.ofx.org) as optional.
If they had marked tag closings as optional, then in the DTD you should see something like this:
The - mean the openi
Re:Um, the answer is in the link you posted. (Score:1)
A lot gets wrong with a hand-rolled one: SGML is a huge standard and there are a number of constructs that can be used while still adhering to the DTD.
A better suggestion would be to use e.g. J. Clark's superb-complete sp SGML parser (nsgmls) to read the DTDs and parse the SGML files: nsgmls
Re:Um, the answer is in the link you posted. (Score:2)
Really? I was browsing what I thought was the official OFX site and ran across this:
Did it move from SGML to XML with 2.0 (5 years ago), or has it always been "XML" in spirit only?
Re:Um, the answer is in the link you posted. (Score:2)
All very excellent points, but if you need an SGML parser to 'do it the hard way' (?) just grab a copy of James Clark's SP parser from here [jclark.com].
You could probably use that as a starting point if your XML parsers don't like the doc format.
Cheers.
Re:Heh (Score:2)
check out libofx (Score:2, Informative)
Re:check out libofx (Score:2)
I did check it out. It looks promising, but it is GPL. The businesses that want the solution to include OFX do not want to have their internal code mixed with anything GPL.
If it was just for me, or something I could release without worry, I would not hesitate to use it, but it would not be right to use it, distribute it and then not supply source code (which would be a breach of contract for me).
Get someone to show you how to use google.com while you're at it
Hmm. Nice.
InnerWeb
Is it an SGML application? (Score:2)
Based on that, I'm not even sure if it's even an SGML application, or if it's just something that looks like SGML?
Open Financial Exchange Specification 1.0.2, Section 2.3.1:
Re:Is it an SGML application? (Score:4, Informative)
where the - means the opening tag is required, and the O means the closing tag is not required.
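To check a DTD for those flags without reading it by eye, something like the following Python sketch could work. The two declarations in `sample_dtd` are hypothetical and are not copied from the real OFX DTD, and `end_tag_optional` is a made-up name. The regex only handles the simple form with a single element name followed by the two omission flags; name groups and parameter entities would need a real SGML parser.

```python
import re

# <!ELEMENT name start-flag end-flag content-model>
# '-' means the tag is required, 'O' means it may be omitted.
_ELEMENT = re.compile(
    r"<!ELEMENT\s+(\S+)\s+([-Oo])\s+([-Oo])\s", re.IGNORECASE
)


def end_tag_optional(dtd_text: str) -> dict:
    """Map element name -> True if the DTD lets its end tag be omitted."""
    return {
        name: end.upper() == "O"
        for name, _start, end in _ELEMENT.findall(dtd_text)
    }


# Hypothetical declarations, for illustration only.
sample_dtd = """
<!ELEMENT STMTTRN - - (TRNAMT, NAME)>
<!ELEMENT TRNAMT  - O (#PCDATA)>
"""
flags = end_tag_optional(sample_dtd)
```

A declaration with no omission flags at all (the XML style) does not match and is simply absent from the result, which is itself a hint that the DTD never grants permission to drop end tags.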
Whether or not I think I remember it, OFX has sent me back to books I have had in boxes for almost a decade now. Normally when I pull old books out like that, they are picture albums for the family, not old programming and data manuals.
InnerWeb
Not surprising (Score:1)
Sounds like the DTDs haven't gotten cleaned up.
HTML tidy (Score:1)
Graham