Below is an overview of the inner workings of Bomjpacket. See
INSTALL for the administrative questions.
Bomjpacket converts an HTML page into WML in two steps:
HTML -> XHTML -> WML
The differences between HTML and XHTML are:
* matched and properly nested tags, e.g. <b><i>Test</i></b>
* all attributes are in quotes: <a href="...">Interesting page</a>
* no empty attributes: an attribute given without a value
is unacceptable
* the case of the opening and closing tags matches, i.e. <b>...</B> is invalid
The main difference between XHTML and WML is that the latter supports
only a subset of the former's tags.
We will describe the two steps in greater detail below.
First, however, we need a few preprocessing actions: we get rid of
Javascript and CSS, as they are different languages, and then of HTML
comments. After that, we need to take care of encoded characters
because the parser does not understand them. For example, &trade; gets
converted to TM.
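
The preprocessing pass can be sketched as follows. This is only an
illustration in Python, not Bomjpacket's own code: the function name
preprocess() and the replacement table are assumptions, and only the
"TM" mapping comes from the text above.

```python
import re

# Hypothetical replacement table; only the &trade; -> TM entry is taken
# from the description, the other entries are assumptions.
ENTITY_REPLACEMENTS = {
    "&trade;": "TM",
    "&copy;": "(c)",
    "&nbsp;": " ",
}

def preprocess(html: str) -> str:
    """Strip scripts, styles and comments, then replace encoded characters."""
    # Remove <script>...</script> and <style>...</style> together with their content.
    html = re.sub(r"(?is)<(script|style)\b[^>]*>.*?</\1\s*>", "", html)
    # Remove HTML comments.
    html = re.sub(r"(?s)<!--.*?-->", "", html)
    # Replace the known entities with plain-text equivalents.
    for entity, text in ENTITY_REPLACEMENTS.items():
        html = html.replace(entity, text)
    return html
```

Scripts and styles are removed before comments because script bodies
are often wrapped in comment markers themselves.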
HTML to XHTML
--------------------------------------------------------------------
Function need_quotes() adds quotes to attribute values wherever they are
missing. For example, <a href=...> gets converted to <a href="...">.
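
A minimal sketch of this quoting step, written in Python for
illustration only. The regular expressions are assumptions, not the
project's actual implementation, and the sketch assumes that attribute
values contain no ">" character.

```python
import re

# Matches name=value inside a tag. Already quoted values are consumed by
# the first two alternatives so that they can be passed through unchanged.
_ATTR = re.compile(r"""(\s[\w:-]+\s*=\s*)("[^"]*"|'[^']*'|[^\s"'>]+)""")

def _quote_attr(m):
    value = m.group(2)
    if value[0] in "\"'":
        return m.group(0)              # already quoted, leave as is
    return f'{m.group(1)}"{value}"'    # wrap the bare value in double quotes

def need_quotes(html: str) -> str:
    """Quote unquoted attribute values inside opening tags."""
    def fix_tag(tag_match):
        return _ATTR.sub(_quote_attr, tag_match.group(0))
    # Only look inside opening tags: "<", a letter, then everything up to ">".
    return re.sub(r"<[A-Za-z][^<>]*>", fix_tag, html)
```

Restricting the substitution to the inside of tags keeps ordinary text
such as "x = y" untouched.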
Then we need to fix unclosed quotes in tags and remove tags from
strings. Function fix_quotes_tags1() takes care of that. For example,
"Page1<...>Page2
gets converted to
"Page1"<...>Page2
using the following rules:
"a<..." -> "a"<.....
"a<.../>b" -> "a"<.../>"b"
".../>a" -> ..../>"a"
Now imagine an attribute value with quotes inside:
title="this is "OK" button"
We need to get rid of the internal quotes. The string opened by the first
quote is extended until the next tag. Function fix_quotes_text() takes
care of that.
Function tags_toupper($data) converts tag names to uppercase.
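
As an illustration, the uppercasing can be done with a single
substitution. The Python sketch below reuses the function name from the
text, but the body is an assumption; it also assumes that a literal "<"
in text has already been encoded.

```python
import re

def tags_toupper(data: str) -> str:
    """Uppercase tag names only; attributes and text are left untouched."""
    # Group 1 is "<" or "</", group 2 is the tag name that follows it.
    return re.sub(
        r"(</?)([A-Za-z][A-Za-z0-9]*)",
        lambda m: m.group(1) + m.group(2).upper(),
        data,
    )
```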
Function fix_tags($data) does a number of things:
* replace certain tags with line breaks (BR)
* eliminate all the tags except WML ones. We only need the following
tags:
'NOP', 'HTML', 'HEAD', 'TITLE', 'BODY', 'H1', 'H2', 'H3',
'H4', 'H5', 'H6', 'A', 'BR', 'B', 'I', 'EM', 'LI', 'ADDRESS',
'DIV', 'CODE', 'BLOCKQUOTE', 'TT', 'PRE', 'STRONG', 'SMALL',
'SUP', 'SUB'
* add the essential opening and closing HTML tags if they are missing:
<HTML>, <HEAD>, <BODY>, as we need them in each HTML document
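
The tag filtering can be sketched as below. This is Python for
illustration only: the helper name strip_unknown_tags() is hypothetical,
and the sketch covers only the whitelist step, not the line-break
replacement or the insertion of missing essential tags.

```python
import re

# The whitelist from the list above.
ALLOWED_TAGS = {
    "NOP", "HTML", "HEAD", "TITLE", "BODY", "H1", "H2", "H3",
    "H4", "H5", "H6", "A", "BR", "B", "I", "EM", "LI", "ADDRESS",
    "DIV", "CODE", "BLOCKQUOTE", "TT", "PRE", "STRONG", "SMALL",
    "SUP", "SUB",
}

def strip_unknown_tags(data: str) -> str:
    """Drop every tag whose name is not whitelisted; keep the text inside it."""
    def keep_or_drop(m):
        return m.group(0) if m.group(1).upper() in ALLOWED_TAGS else ""
    # Matches opening, closing and self-closing tags; group 1 is the tag name.
    return re.sub(r"</?([A-Za-z][A-Za-z0-9]*)[^<>]*>", keep_or_drop, data)
```

Only the tags themselves are removed, so the text of an unsupported
element still reaches the cellphone.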
We have already quoted the attribute values, but function fix_attrs()
does more:
* Eliminate the following attributes as they might contain
javascript: "onclick", "onchange", "onfocus", "onmouseover",
"onmouseout", "onmousedown", "onmouseup", "onkeyup",
"onkeydown", "onkeypress", "onsubmit"
* Remove empty attributes
* For the A, IMG, and DIV tags keep only certain attributes. We need
to filter attributes because the parser gets confused when the same
attribute is repeated, which might happen.
* Certain tags do not allow other tags nested into them, for example
<...>NO TAGS HERE</...>. Function filter_content($data, $tag)
takes care of that.
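
A possible shape of the attribute filtering, as a Python sketch. The
text does not say which attributes are kept for A, IMG, and DIV, so the
KEEP_ATTRS table is purely an assumption, as is treating both value-less
attributes and attributes with an empty value as "empty". The sketch
assumes values are already in double quotes and contain no angle
brackets.

```python
import re

EVENT_ATTRS = {
    "onclick", "onchange", "onfocus", "onmouseover", "onmouseout",
    "onmousedown", "onmouseup", "onkeyup", "onkeydown", "onkeypress",
    "onsubmit",
}
# Assumption: the kept attributes per tag are not listed in the text.
KEEP_ATTRS = {"A": {"href"}, "IMG": {"src", "alt"}, "DIV": {"id"}}

_TAG = re.compile(r"<([A-Za-z][A-Za-z0-9]*)\b([^<>]*?)(/?)>")
_ATTR = re.compile(r'([\w:-]+)\s*=\s*"([^"]*)"')

def fix_attrs(data: str) -> str:
    """Drop event handlers, empty, repeated and non-whitelisted attributes."""
    def fix_tag(m):
        name, attrs, slash = m.groups()
        allowed = KEEP_ATTRS.get(name.upper())   # None means "no restriction"
        seen, kept = set(), []
        # Only name="value" pairs are collected, so bare attributes vanish.
        for attr, value in _ATTR.findall(attrs):
            key = attr.lower()
            if key in EVENT_ATTRS or value == "" or key in seen:
                continue
            if allowed is not None and key not in allowed:
                continue
            seen.add(key)
            kept.append(f' {attr}="{value}"')
        return f"<{name}{''.join(kept)}{slash}>"
    return _TAG.sub(fix_tag, data)
```

When an attribute is repeated, the first occurrence wins; that choice
is arbitrary and only serves to keep the parser from seeing duplicates.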
Finally, function pair_tags($data) adds closing tags where they are
missing and removes redundant ones.
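
One common way to pair tags is to keep a stack of the currently open
elements. The Python sketch below illustrates that technique; it is not
necessarily how pair_tags() is implemented in Bomjpacket. It assumes
tag names are already uppercase, and the VOID_TAGS set is an assumption.

```python
import re

_TOKEN = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)\b([^<>]*?)(/?)>")
VOID_TAGS = {"BR"}  # assumption: tags that never have a closing tag

def pair_tags(data: str) -> str:
    """Add missing closing tags and drop closing tags that match nothing."""
    out, stack, pos = [], [], 0
    for m in _TOKEN.finditer(data):
        out.append(data[pos:m.start()])      # text before this tag
        pos = m.end()
        closing, name, attrs, self_closed = m.groups()
        if closing:
            if name not in stack:
                continue                      # redundant closing tag: drop it
            # Close everything that was opened after the matching opener.
            while stack:
                top = stack.pop()
                out.append(f"</{top}>")
                if top == name:
                    break
        elif self_closed:
            out.append(m.group(0))
        elif name in VOID_TAGS:
            out.append(f"<{name}{attrs}/>")   # make the void tag self-closing
        else:
            out.append(m.group(0))
            stack.append(name)
    out.append(data[pos:])
    # Close whatever is still open at the end of the document.
    while stack:
        out.append(f"</{stack.pop()}>")
    return "".join(out)
```

For instance, an I element left open inside a B element is closed just
before the closing B tag, and a closing tag that has no opener is
dropped.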
XHTML to WML
--------------------------------------------------------------------
Now we have a clean XHTML file, which is what the cellphone wants.
Presumably this is necessary because cellphones do not have the
processing power to fix an HTML file that is not properly formatted,
so the server-side component has to do it. But an XHTML file on its
own is not enough either. Typically it is a very long web page,
whereas a mobile's screen is very small. Therefore, we need to break
the XHTML file into a number of smaller files, and we also convert
them into WML, the markup language that cellphones understand. A few
of them understand XHTML as well, but WML is simpler.
The breakdown algorithm takes into account the layout of the original
HTML page. A typical page includes a number of DIV elements which are
nested into each other. For example, DIV id="wrapper" might have DIV
id="menu" and DIV id="content" inside. Therefore, we will place two
links on the first WML page:
MENU
CONTENT
When the user follows either link, the cellphone goes to the WML page
of that section. In this example there are 3 WML pages in total: the
first page plus one page per section.
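
The breakdown for the wrapper/menu/content example can be sketched in
Python as follows. Everything here is hypothetical: the function name
split_sections(), the "<id>.wml" naming of the section pages, and the
use of a standard XML parser, which is possible only because the input
is already well-formed XHTML with uppercase tag names.

```python
import xml.etree.ElementTree as ET

def split_sections(xhtml: str, wrapper_id: str = "wrapper"):
    """Return (links for the first page, text of each section page)."""
    root = ET.fromstring(xhtml)
    # Find the wrapper DIV; its direct DIV children become the sections.
    wrapper = next(el for el in root.iter("DIV") if el.get("id") == wrapper_id)
    links, pages = [], {}
    for n, child in enumerate(wrapper.findall("DIV"), start=1):
        section_id = child.get("id") or f"section{n}"
        # Hypothetical URL scheme: one WML page per section id.
        links.append(f'<a href="{section_id}.wml">{section_id.upper()}</a>')
        pages[section_id] = "".join(child.itertext()).strip()
    return links, pages

doc = ('<HTML><BODY><DIV id="wrapper">'
       '<DIV id="menu">Home News</DIV>'
       '<DIV id="content">Hello world</DIV>'
       '</DIV></BODY></HTML>')
links, pages = split_sections(doc)
```

Here links holds the MENU and CONTENT links for the first page, and
pages holds the text of the two section pages.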
Unfortunately, the names or IDs of the sections are not always
self-explanatory, so guessing what hides behind a given link is often
difficult. Therefore, instead of bare links to the sections we show
their previews, that is, we include a few lines from each section and
insert a link at the end that allows the user to expand that section:
HOME - NEWS - ARTICLES-...
This is a very interest...
Only when the user clicks on the "..." link does the cellphone go to
the appropriate section.
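
A preview of this kind can be produced by truncating the section text
and appending the "..." link. The Python sketch below is an
illustration under assumed names: the function preview(), the default
limit of 40 characters, and the href value are not taken from
Bomjpacket.

```python
def preview(text: str, href: str, limit: int = 40) -> str:
    """Return the first `limit` characters of text followed by a '...' link."""
    flat = " ".join(text.split())        # collapse whitespace and line breaks
    if len(flat) <= limit:
        return flat                       # short sections need no expand link
    return f'{flat[:limit]}<a href="{href}">...</a>'

line = preview("This is a very interesting article about WML.", "content.wml", 23)
```

A section that already fits within the limit is shown in full, so the
user does not have to follow a link just to read one more word.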