
HTML Tags For Academic Printing? 338

meketrefi writes "It's been quite a while since I got interested in the idea of using HTML (instead of .doc or .odf) as a standard for saving documents, including the more official ones like academic papers. The problem is using HTML to create pages with a stable size that would deal with bibliographical references, page breaks, different printers, etc. Does anyone think it is possible to develop a decent tag like 'div,' but called 'page,' specially for this? Something that would make no use of CSS? Maybe something with attributes as follows: {page size="A4" borders="2.5cm,2.5cm,2cm,2cm" page_numbering="bottomleft,startfrom0"} You get the idea... {/page} I guess you would not be able to tell when the page would be full, so the browser would have to be in charge of breaking the content into multiple pages when needed. Bibliographical references would probably need a special tag as well, positioned inside the tag ..." Is this such a crazy idea? What would you advise?
  • LaTeX (Score:5, Informative)

    by Anonymous Coward on Thursday July 02, 2009 @10:40PM (#28567623)
    You seem to be talking about LaTeX. It already exists. Don't reinvent it.
  • PDF? (Score:5, Informative)

    by sys.stdout.write ( 1551563 ) on Thursday July 02, 2009 @10:44PM (#28567645)
    As much as I hate Adobe, there's a reason why PDF files dominate academia.
  • CSS3 is the solution (Score:5, Informative)

    by Tiles ( 993306 ) on Thursday July 02, 2009 @10:46PM (#28567673)
    This is exactly what CSS is designed for, presentation. The CSS3 Paged Media [w3.org] module already defines a number of the properties and settings you're going for. It even includes positions such as @bottom-center to allow you to position footnotes and references. The only thing missing is a way to mark this up in HTML, which could easily be done with anchors and the longdesc attribute, coupled with the CSS content: property. What you're looking for is a CSS3 enabled browser, not a new specification.
  • Re:LaTeX (Score:5, Informative)

    by The Snowman ( 116231 ) on Thursday July 02, 2009 @10:47PM (#28567679)

    You seem to be talking about LaTeX. It already exists. Don't reinvent it.

    Another alternative is RTF, which is a sister SGML language of HTML. While it may have drawbacks, it would accomplish most if not all of what is required.

  • by sandford ( 1577271 ) on Thursday July 02, 2009 @10:49PM (#28567697)
    Is there a reason you don't want to use CSS? There are already CSS extensions that do exactly what you want. The book Cascading Style Sheets: Designing for the Web was written using only HTML and CSS and prepped for printing using PrinceXML. The PrinceXML web site has a bunch of similar HTML+CSS samples, including academic papers.
  • by caffiend666 ( 598633 ) on Thursday July 02, 2009 @10:53PM (#28567729) Homepage

    Static configurations are available already, not the intelligent ones being requested. Has sufficed for what I needed:

    To have print page break add: <p style="page-break-before: always">

    Also, to hide odd font and underline for links:

    <STYLE TYPE="text/css" MEDIA=print> <!-- A { text-decoration: none; color: black } --> </STYLE>

    Yes, they have to be massaged a little.

  • Re:Nope (Score:4, Informative)

    by nlawalker ( 804108 ) on Thursday July 02, 2009 @11:06PM (#28567785)

    I should have been more clear: HTML describes the *structure* of a document, of which pages are not a part.

    As many have said above, you could use CSS if you really wanted to, since page specifications are presentational aspects of the document. Or, you could use LaTeX, which is designed for this kind of use.
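
    As a rough illustration of the LaTeX route, here is a minimal sketch of what the question's hypothetical page tag asks for: A4 paper, fixed margins, and a page number at the bottom left starting from 0. It assumes the geometry and fancyhdr packages are installed; which of the four "borders" values belongs to which side is a guess, not something the question specifies.

    ```latex
    \documentclass[11pt]{article}

    % Page size and margins. Mapping the four "borders" values to
    % left/right/top/bottom in this order is an assumption.
    \usepackage[a4paper,left=2.5cm,right=2.5cm,top=2cm,bottom=2cm]{geometry}

    % Header and footer control.
    \usepackage{fancyhdr}
    \pagestyle{fancy}
    \fancyhf{}                          % clear the default header and footer
    \renewcommand{\headrulewidth}{0pt}  % no rule under the (empty) header
    \fancyfoot[L]{\thepage}             % page number at the bottom left

    \setcounter{page}{0}                % start numbering from 0

    \begin{document}

    Some text...

    \end{document}
    ```

    Page breaking is left to LaTeX, which is the behaviour the question wanted the browser to take over.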

  • Re:LaTeX (Score:3, Informative)

    by Anonymous Coward on Thursday July 02, 2009 @11:07PM (#28567795)
    No one really writes RTF by hand though. DocBook [wikipedia.org] is more like what this person is suggesting.
  • Re:LaTeX (Score:5, Informative)

    by fermion ( 181285 ) on Thursday July 02, 2009 @11:09PM (#28567809) Homepage Journal
    LaTeX is really the solution. There is no reason to reinvent the wheel. In fact, reinventing the wheel might cause problems when submitting papers. From what I have seen, many academic journals prefer .tex and .eps files. I can't imagine what they would do with HTML.

    The nice thing about LaTeX is that, like HTML, it is a pure markup language, but it is a markup language that understands typesetting, so one tends to get a good page layout no matter what. OTOH, HTML merely identifies blocks of text as various generic types, and really does not have a context for the types. The render engine is free to visualize, make a sound, or do whatever it wishes to represent the blocks. CSS is what imposes a consistent visual framework, so what one needs to duplicate LaTeX is in fact CSS.

    Use LaTeX. Except for the often limited fonts, it is vastly superior to a word processor, because a word processor is not the right tool to create real documents. We have known that for many years. That is why people bought PageMaker. And I think the lack of fonts forces people to create compelling content. LaTeX is free, there are many good books, and if you do have a hankering to code, you can always play with TeX.

  • by jipn4 ( 1367823 ) on Thursday July 02, 2009 @11:11PM (#28567823)

    If you want to save the source form or markup, use a language designed for it: LaTeX. LaTeX lets you represent all the things you would want to represent in an academic paper, it's fairly readable, very widespread, and has tons of tools. And LaTeX converts to both HTML and PDF.

    If you want to display on the web, use HTML. It's meant for the web. It's not a good representation for paged media. If you must represent paged media, you need to use CSS or XSL, but you probably don't want to.

    If you want archival quality paged representations, PDF is the only game in town really. HTML with CSS doesn't come close. But it doesn't make sense to save your own papers only in PDF because PDF is not really editable and doesn't have the semantic information.

  • Don't use HTML (Score:3, Informative)

    by emandres ( 857332 ) on Thursday July 02, 2009 @11:13PM (#28567837)
    You wouldn't want to use HTML for something like this, especially with newer versions of HTML. There has been a steady transition in HTML away from specification of the aesthetic appearance of a page. For this reason tags like <font> and <center> are now considered nonstandard, mostly because CSS does a way better (and cleaner) job of it.
  • Re:LaTeX (Score:3, Informative)

    by Liquidrage ( 640463 ) on Thursday July 02, 2009 @11:16PM (#28567849)
    Sadly I have. I've spit out RTF docs from websites many moons ago back when we had lots of printing issues on the web (still sucks, but at least now it's manageable).

    In hindsight there was probably a better way...but I was young, and string manipulation is easy.
  • Re:LaTeX (Score:5, Informative)

    by Anonymous Coward on Thursday July 02, 2009 @11:18PM (#28567861)

    Actually, the font problem is solved by using XeLaTeX (which uses XeTeX [wikipedia.org]).

    Full OpenType support. Looks amazing.

  • Re:LaTeX (Score:3, Informative)

    by dangitman ( 862676 ) on Thursday July 02, 2009 @11:26PM (#28567923)

    LaTeX is a solution for certain (usually niche) purposes, but has its drawbacks, like everything else. The problem is that we are dealing with an online world, where things are published in several different formats, online and offline. LaTeX doesn't translate easily or cleanly into HTML, or vice-versa. And good luck getting people outside of math and science academia to use LaTeX.

    This is a real problem, and shouldn't simply be brushed aside with "use this" comments. There currently is no workable format, no lingua franca for multipurpose documents. A solution should at least be attempted to make the web, word processing, page layout and typography interoperable.

    Don't get me wrong, I actually like LaTeX, but it's an exercise in frustration trying to integrate it with the rest of the world.

  • XSL:FO (Score:5, Informative)

    by Roxton ( 73137 ) <roxton@g[ ]l.com ['mai' in gap]> on Thursday July 02, 2009 @11:30PM (#28567959) Homepage Journal

    There's a little-used standard that came out of the W3C along with XSLTs called XSL:FO. You write your document in XSL:FO markup, and then use one of any number of processors like XEP [renderx.com] to convert it into PDF or what have you.

    http://www.w3schools.com/xslfo/default.asp [w3schools.com]

    One of the original purposes of it was so that you could use XSLTs to transform the same XML data into both XHTML or XSL:FO for publishing. The standard never took off though. XSL:FO just doesn't have enough options to be typographically interesting, compared to SVG.

    Of course, the right answer is LaTeX, but you might want to give XSL:FO a try for familiarity's sake.

  • Re:LaTeX (Score:5, Informative)

    by femto ( 459605 ) on Thursday July 02, 2009 @11:34PM (#28567981) Homepage
    LyX [lyx.org]. I wrote a thesis in it and didn't have to resort to any manual interventions in the generated LaTeX. Couple it with [lyx.org] SVG diagrams, generated by inkscape [inkscape.org], and you have a seamless authoring system that handles both text and graphics. SVG means there is no messy task of keeping source and postscript output synchronised (just right click a diagram within LyX to edit the SVG source with inkscape). Use gnuplot [gnuplot.info] to generate your (postscript) graphs and you have pretty well a complete authoring system. A few years ago, LyX and inkscape were too immature to use seriously, but they have matured. I recommend the combination.
  • Re:LaTeX (Score:2, Informative)

    by Anonymous Coward on Thursday July 02, 2009 @11:46PM (#28568075)
    I think you are overstating the enjoyment possible when using LaTeX :)
  • Use DITA (Score:4, Informative)

    by wooden pickle ( 1006975 ) on Thursday July 02, 2009 @11:47PM (#28568087)

    Someone mentioned XML/XSL/FO. Don't try to write your content in XSL-FO. You'll hate every minute of it.

    I'd look in to using DITA (Darwin Information Typing Architecture). It's a set of canned XML structures, plus a specification for how to process and customize those structures. It includes tags for stuff like footnotes...I bet it covers a lot of your use cases. There are some good intros to how these XML structures work here: http://dita.xml.org/book/dita-wiki-knowledgebase [xml.org]

    As DITA is XML, you can convert it to HTML and whatever else you feel like, pretty easily. There's an open-source implementation of the DITA spec called the DITA Open Toolkit (http://sourceforge.net/projects/dita-ot/). The DITA Open Toolkit includes stylesheets/scripts to publish HTML and PDF, among other things. PDFs are published via XSL-FO. Just like HTML needs a web browser to render something useful, XSL-FO requires an FO processor to create a PDF. So, in the end you write DITA, XSLT and other scripts transform that DITA to XSL-FO, then an FO processor consumes the XSL-FO and spits out a PDF. The DITA Open Toolkit comes with an open-source FO processor (Apache FOP). FOP doesn't fulfill everyone's needs, but it might work very well for you.

    Unfortunately, working with the Open Toolkit and customizing its output can be a bit unwieldy. http://groups.yahoo.com/search?query=dita+users [yahoo.com] is a pretty good place to look for help.

  • Re:LaTeX (Score:5, Informative)

    by plasticsquirrel ( 637166 ) on Thursday July 02, 2009 @11:59PM (#28568159)

    Use LaTeX. Except for the often limited fonts, it is vastly superior to a word processor, because a word processor is not the right tool to create real documents. We have known that for many years. That is why people bought PageMaker. And I think the lack of fonts forces people to create compelling content. LaTeX is free, there are many good books, and if you do have a hankering to code, you can always play with TeX.

    LaTeX has limited fonts, but if you use XeTeX [wikipedia.org] (which uses LaTeX), you can not only use all the LaTeX stuff, but also any TrueType or OpenType font, and native unicode support as well. This is a godsend for typesetting anything that includes words or characters not in English, or just for people who are picky about typography. My personal favorite font is the open source SIL Gentium [sil.org] family, which is not only much more beautiful and readable than Times, but contains a fuller character set that makes it compatible with many more languages. Once you start writing documents with XeTeX and nicer fonts, you see how lacking word processors are for good typography and well-structured documents, and how self-limiting the concept is.

    For newcomers to LaTeX and XeTeX, including packages and specifying options can be a bit time-consuming when you just want to get started with a basic A4-sized document. Here is the basic XeTeX file template I use for simple stuff. I'm picky about margins, line spacing, fonts, etc. so you know that it's a safe place to start out.
    -

    \documentclass[11pt]{article}

    %%
    % XeTeX packages
    %%
    \usepackage{fontspec}
    \usepackage{xunicode}
    \usepackage{xltxtra}

    %%
    % Formatting packages
    %%
    \usepackage{setspace}
    \usepackage[vcentering,dvips]{geometry}
    \usepackage{fancyhdr}

    \geometry{papersize={8.5in,11in},total={6.5in,8.8in}}

    \pdfpagewidth 8.5in
    \pdfpageheight 11in

    % 10pt font: 1.15
    % 11pt font: 1.1
    \setstretch{1.1}

    \setmainfont[Mapping=tex-text]{Gentium Basic}

    \begin{document}

    Some text...

    \end{document}

  • by Anonymous Coward on Friday July 03, 2009 @12:05AM (#28568205)

    Why not use a proper PDF viewer, like one derived from the OSS poppler stack? I find xpdf to be a valuable tool on Linux, and prefer it with PDF docs to my old (very old!) standard of ghostview with PS docs. (PostScript being what we academics used to exchange print-media formatted papers before PDF appeared.)

    I rather like the newer LaTeX modules that can be enabled when producing PDF outputs, so that tables of contents and references are hyperlinked in the PDF document, while still providing a print-quality typeset document for the audience.

    Even for less academic technical documents, I find most reading/reviewing collaborations regarding documents are much better served by "let's now discuss page 74" rather than trying to navigate everyone via section labels, paragraph counts, etc. while they look on in their own re-flowed views.

  • by crmartin ( 98227 ) on Friday July 03, 2009 @12:12AM (#28568233)

    See, as someone has already pointed out, there's at least one such tool that's in wide use already: TeX and LaTeX. If you don't like that one, it turns out that HTML, with CSS and a little bit of Javascript, is perfectly capable of doing all the things you want, too. You just have to learn how. Have a look at Lie's Cascading Style Sheets: Designing for the Web [amazon.com] (written and typeset in HTML/CSS) and at Prince XML [princexml.com] for detailed examples.

  • Re:LaTeX (Score:3, Informative)

    by thefringthing ( 1502177 ) on Friday July 03, 2009 @12:25AM (#28568321)
    Most distributions of LaTeX come with some way to compile it and export to PDF, which is both portable and readable.
  • by mellon ( 7048 ) on Friday July 03, 2009 @12:27AM (#28568327) Homepage

    The ability to cite an HTML document is something that would indeed be useful. The ability to hard code page numbers into an HTML document isn't. The reason why academia and the press have been so resistant to HTML, historically, is that you don't get any control over page layout. Which means that you can't refer to things by page number.

    The solution isn't to fix HTML so that you can number pages. It is to fix the bibliographic references to not use page numbers. Generally speaking, it's not hard to number documents by section, and you can make the numbering fine-grained enough for bibliographic references. Then refer to the chapter and section, rather than the page number in your bibliography, and you're done. No need to "fix" HTML.

    It might make sense to ID paragraphs in HTML, so that you could simply refer to the paragraph ID in your bibliography. If this were simply document metadata, and didn't have anything to do with layout, it would work pretty well. As a bonus, you wouldn't need to renumber, because the ID would just be an arbitrary cookie, and wouldn't need to make sense to a human.

    Of course, with hypertext, there's really no need for a bibliography anyway. Just link to the text you're referencing... But I realize that that's impractical in academia at the moment. I'm just saying...

  • Re:LaTeX (Score:5, Informative)

    by Nitewing98 ( 308560 ) on Friday July 03, 2009 @01:14AM (#28568529) Homepage

    Yes, not using CSS is crazy. Especially since CSS includes the concept of media types. A simple "media='print'" in the CSS spec should handle that.

  • Re:LaTeX (Score:3, Informative)

    by Idbar ( 1034346 ) on Friday July 03, 2009 @01:20AM (#28568555)
    It's the eternal fight between interpreted and compiled languages. You are arguing that HTML is easier to read. Well, it's not, and neither is LaTeX. The fact that you have an interpreter that is more common (a browser) is another thing. But don't be confused: reading plain HTML is as annoying as, or more annoying than, reading a TeX file.
    Please also remember that the "compiled" output of latex is dvi, not ps or pdf.
  • Re:LaTeX (Score:3, Informative)

    by amicusNYCL ( 1538833 ) on Friday July 03, 2009 @01:39AM (#28568633)

    Exactly. Until I can use a server scripting language to open a (binary) PDF template and do string find-replace, and not break the document, I'll stick with the ASCII-based RTF format.

    There are a lot of people who seem to be screaming "LaTeX" as the solution to this non-problem. This is specifically what the PDF format was created for: to create a portable document that renders and prints the same anywhere. HTML is a fluid layout to fit various resolutions, PDF is guaranteed to print the same on any machine. That's exactly what it's for. There's no reason to bring LaTeX into the picture when you can export a regular word processing document to a PDF. Regardless of whether or not you can export to PDF from LaTeX, there's no reason to use LaTeX specifically vs. Word or Open Office or Acrobat or whatever else. The OP didn't even mention that the document should be editable by anyone other than the author.

    It's been quite a while since I got interested in the idea of using HTML (instead of .doc or .odf) as a standard for saving documents

    PDF: you can export from your word processor, you can generate one easily from PHP, you can use Adobe's bloated software, whatever you want to use. It supports images, links, page size and margins, and it's guaranteed to print the same way on any printer. That's specifically what it's for.

  • Re:LaTeX (Score:1, Informative)

    by Anonymous Coward on Friday July 03, 2009 @02:26AM (#28568857)

    I've written few LaTeX documents that will work out of the box on most distributions, let alone all of them. Realistically, I can't send a LaTeX document to someone else and expect them to be able to edit it and read it, even if they have LaTeX installed.

    Then you're doing something very wrong. I frequent freenode#latex and we exchange files constantly across platforms and distributions with the expectation of exactly-the-same performance. And we get it.

  • Re:texexplorer (Score:3, Informative)

    by Yobgod Ababua ( 68687 ) on Friday July 03, 2009 @02:29AM (#28568871)

    That example sounds like it could be well rendered using MathML... http://www.w3.org/Math/ [w3.org]

  • Re:texexplorer (Score:1, Informative)

    by Anonymous Coward on Friday July 03, 2009 @02:55AM (#28568973)

    There is LatexMathML though.

    http://math.etsu.edu/LaTeXMathML/

  • Re:LaTeX (Score:1, Informative)

    by Anonymous Coward on Friday July 03, 2009 @03:39AM (#28569145)

    I've written few LaTeX documents that will work out of the box on most distributions, let alone all of them.

    That's strange... academic journals do that all the time. Every journal I've published in accepts LaTeX sources and postscript/TIFF figures. As long as you use their stylesheets and follow their rules (generally imposed to prevent problems when they have to process dozens of papers in one go), you should get the same result that they do.

  • Re:LaTeX (Score:3, Informative)

    by colinrichardday ( 768814 ) <colin.day.6@hotmail.com> on Friday July 03, 2009 @04:34AM (#28569367)

    Because HTML continues to evolve LaTex does not.

    Really? How long is it taking the W3C to release HTML 5?

    It may be powerful at layout (but not as powerful as something like InDesign or Quark) but what I'm talking about is something that encompasses page layout, web design, and semantic markup.

    LaTeX is working on semantic markup
    http://tug.ctan.org/cgi-bin/ctanPackageInformation.py?id=stex [ctan.org]

    and

    http://tug.ctan.org/cgi-bin/ctanPackageInformation.py?id=cool [ctan.org]

    As for web design, people are working on converting LaTeX to MathML.

    it's also to be able to use that file to create decent web pages without any modification, that works with content management systems and the like.

    And how do you presume to get all the browser vendors on board?

  • restructuredtext... (Score:3, Informative)

    by tyroneking ( 258793 ) on Friday July 03, 2009 @04:38AM (#28569387)

    is made for what you're talking about

    Documents are written in human-readable text format - good for storing in version control and using for diffs
    Python Docutils is used to convert to HTML and/or LaTeX and a few other formats
    rst2pdf is a tool that converts to beautiful PDF (easier than using Docutils + LaTeX)

  • Re:LaTeX (Score:4, Informative)

    by TheRaven64 ( 641858 ) on Friday July 03, 2009 @04:49AM (#28569435) Journal
    CSS has had attributes for page breaks since CSS 2.1. I've not played with them for a while, but Opera supported them correctly back in 2003, so I'd imagine that they work well now. You can find the relevant part of the specification here [w3.org]. It lets you specify margins, soft breaks, rules for two-sided printing and so on. I played with the idea of using HTML instead of LaTeX for a while, but decided not to for two reasons. First, LaTeX is easier to type and read, and second because LaTeX produces nicer-looking output.
  • Re:LaTeX (Score:4, Informative)

    by TheRaven64 ( 641858 ) on Friday July 03, 2009 @04:52AM (#28569441) Journal
    LaTeX is a macro system built on top of TeX. XeTeX is a TeX implementation. Whether any given LaTeX package will work with XeTeX (or any other TeX implementation, like pdftex) depends on whether it uses any implementation-specific features. Most don't.
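
    A minimal sketch of how a document can stay portable across TeX implementations: it picks font-handling packages depending on the engine that is running. It assumes the iftex package is available (it provides the \ifXeTeX test); the packages chosen in each branch are only illustrative.

    ```latex
    \documentclass{article}

    % iftex provides engine tests such as \ifXeTeX.
    \usepackage{iftex}

    \ifXeTeX
      % Only meaningful under XeTeX: OpenType/TrueType font selection.
      \usepackage{fontspec}
    \else
      % Fallback for other engines such as pdfTeX (illustrative choice).
      \usepackage[T1]{fontenc}
    \fi

    \begin{document}

    Some text...

    \end{document}
    ```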
  • Re:LaTeX (Score:2, Informative)

    by cronostitan ( 573676 ) on Friday July 03, 2009 @05:01AM (#28569479)

    I recently had to implement a proper print view with CSS covering several pages of print output. What I can tell you is that it is a pain in the a**, since a lot of print-specific CSS properties are not supported by current browser versions, Opera being the exception to the sad rule. For example, most browsers do not support the rule to keep divs intact and do the page break automatically before or after the div. We ended up having the user decide when he needs a page break.

    I honestly can understand that someone gets frustrated and wants to use a 'better' way.

  • Re:LaTeX (Score:5, Informative)

    by smallfries ( 601545 ) on Friday July 03, 2009 @06:06AM (#28569817) Homepage

    Further to the two AC posts: you are doing something wrong. As an academic I send / receive LaTeX source all the time and expect it to compile and reproduce the exact same results. How are you abusing TeX to get these kinds of problems?

  • Re:XSL:FO (Score:3, Informative)

    by styrotech ( 136124 ) on Friday July 03, 2009 @06:27AM (#28569899)

    You write your document in XSL:FO markup, and then use one of any number of processors like XEP to convert it into PDF or what have you.

    Ouch :)

    Hand writing XSL:FO is extremely painful - very fiddly and the embedded layout/styling gets tedious quickly. It's kinda like writing a very very long webpage using HTML 3.2 with all the nasty old embedded presentation tags (but worse).

    One of the original purposes of it was so that you could use XSLTs to transform the same XML data into both XHTML or XSL:FO for publishing.

    I have a feeling that is a bit backwards. The original standard was XSL and it was going to include everything related to transforms and publishing, but it got too large and complex so they split it into XSLT for the transforms and XSL:FO for the page description language. Much better that way, as XSLT has wider uses than publishing.

    I think XSL:FO was always intended to be generated via XSLT rather than hand written, and I don't think that has changed at all. That way if you only need to make a styling change, rather than making a zillion edits throughout the document you change the transform. It is analogous to how CSS makes styling changes much easier with HTML.

    Personally I'd rather use some other semantic format (eg Docbook, DITA etc) that can be transformed into XSL:FO via XSLT when required (eg on the way to PDF generation). That way you already get some handy XSLT starting points to work with. Making the occasional small tweak to XSLT isn't too bad, but writing a large complex set of transforms from scratch isn't something I'd want to do :)

  • Re:LaTeX (Score:1, Informative)

    by Anonymous Coward on Friday July 03, 2009 @07:26AM (#28570113)

    LaTeX also doesn't give you the benefit of hypertext. Yes, there are various hacks you can use to add anchors and links to PDFs, although these are mere hacks on top of a broken format.

    \usepackage{hyperref} gives a hyperlinked table of contents, hyperlinked references, and if you specify \url{http://slashdot.org} that is linked too.

    It may be a hack, you may need to use pdflatex, but it works and is simple from the user's standpoint. Implementation-wise it could surely be a hack... so what?
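
    For anyone who wants to try this, a minimal sketch follows. The package that provides these links is distributed on CTAN under the name hyperref; the section titles and label names below are made up for illustration. Run pdflatex twice so the table of contents and cross-references resolve.

    ```latex
    \documentclass{article}

    % hyperref turns the table of contents, \ref targets and \url
    % arguments into clickable links in the PDF output.
    \usepackage{hyperref}

    \begin{document}

    \tableofcontents

    \section{Introduction}\label{sec:intro}
    See Section~\ref{sec:results}, or visit \url{http://slashdot.org}.

    \section{Results}\label{sec:results}
    As promised in Section~\ref{sec:intro}.

    \end{document}
    ```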

  • Re:LaTeX (Score:1, Informative)

    by Anonymous Coward on Friday July 03, 2009 @07:33AM (#28570139)

    RTF is a "sister SGML language of HTML"? Huh???? It isn't even close! It's just a text version of the MS-Word doc file format and even the style definitions don't carry into the document body. Only in the most abstract sense is it related to SGML.

    If you want professional-quality documents, select PostScript, PDF (PostScript encapsulated, not meaning EPS) or one of the TeX flavors.

    Anything based on MS-Word and its ilk (including ODF), RTF (same thing, as I mentioned earlier) or HTML is dicey. All of these apps do their final typesetting by using the font metrics available on the computer where it was printed (NOT the computer where it was composed). It's not as noticeable these days thanks to extensive use of TrueType and similar soft fonts, but in the olden times when most of the fonts being used in document composition were the printer hardware fonts, you could get really badly messed-up line and page breaks just by composing on a machine with a different printer driver than the machine that the hardcopy was being produced on.

    PostScript and TeX are based on absolute metrics. HTML and the various word processors are not.
