- Conversion rules
- Supported tags
- Managing the character set
In French: HTMLVersRTF
Converts an HTML string or an HTML buffer into a string in RTF format. The following operations are performed during the conversion:
- The HTML tags are deleted,
- The special HTML characters are converted,
- The CR characters (Carriage Return) are converted into space characters,
- Multiple consecutive spaces are collapsed into a single space.
The formatting is preserved as far as possible.
MyHTMLText is string = "<!--test-->&quot;Hello!&quot;"
Text is string = HTMLToRTF(MyHTMLText)
// Text contains the RTF code corresponding to the text: "Hello!"
// If MyHTMLText contains the following HTML document:
// <TITLE>This is a test for a Web page</TITLE>
// <META http-equiv="content-type" content="text/html; charset=UTF-8">
// <H2>This is an HTML page in English</H2>
// <A href="http://www.windev.com">This is a link</A>
Text = HTMLToRTF(MyHTMLText)
// Text will contain the RTF code corresponding to the following text:
// This is an HTML page in English
// This is a link
<Result> = HTMLToRTF(<Text in HTML Format> [, <Charset Used>])
<Result>: Character string
RTF text resulting from the HTML conversion. It is encoded with the current character set of WINDEV or WEBDEV.
<Text in HTML Format>: Character string or buffer (with quotes)
Text to convert.
<Charset Used>: Optional Integer constant
Constant identifying the character set in which <Text in HTML Format> is written. By default, the current character set of WINDEV or WEBDEV is used (charsetCurrent constant). If <Text in HTML Format> itself contains information about its character set, that information takes priority over this parameter.
See Correspondence between languages, sub-languages, character sets and nations for more details.
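The two forms of the syntax above can be sketched as follows. This is a minimal illustration, not taken from the original page: the variable names and the HTML fragment are hypothetical, and only HTMLToRTF and the charsetCurrent constant come from the description above.

```wlanguage
// Hypothetical HTML fragment used for illustration only
sHTML is string = "<P>Hello <B>world</B></P>"

// Form 1: <Charset Used> omitted, the current character set is assumed
sRTF1 is string = HTMLToRTF(sHTML)

// Form 2: <Charset Used> passed explicitly.
// charsetCurrent is the documented default, so both calls
// are expected to produce the same RTF string.
sRTF2 is string = HTMLToRTF(sHTML, charsetCurrent)

// Display the raw RTF code in the debug trace window
Trace(sRTF1)
```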
Conversion rules
- The HTML tags are analyzed in order to preserve the formatting of the output text as far as possible: line breaks (CR characters), spaces, tabs, bold, italic, colors, ...
- The following do not appear in the RTF output:
  - the HTML tags
  - the content of the "header" (information found in the <HEAD> tag)
  - the comments
  - the control texts
  - the scripts
  - the SSL definitions
  - the CSS styles (except the "color" attributes)
- Management of CR characters:
  - 2 CR characters are inserted to replace each of the following tags: <P>, <H1> to <H6>, <TABLE>, <UL> or <OL>
  - 1 CR character is inserted to replace each of the following tags: <BR>, <TR>, <LI>, <DD> or <DIV>
  - A single CR character is inserted if several identical tags (<TR>, <LI>, <DD> or <DIV>) are found one after another (this does not apply to <BR> tags)
- Management of tables:
  - A CR character is inserted for each table row (<TR> tag).
  - A tab character is inserted for each table column (<TD> tag).
- Management of special characters:
  A special character is a character defined in the HTML standard. For example, a space character can be written as "&nbsp;". The conversions defined by this standard are applied automatically.
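As an illustration of the table and special-character rules above, here is a short sketch. The HTML fragment and variable names are hypothetical, and the layout described in the comments is what the rules suggest, not an output captured from WINDEV.

```wlanguage
// Hypothetical two-row, two-column table with a special character (&quot;)
sHTML is string = "<TABLE>" + ...
	"<TR><TD>Name</TD><TD>Qty</TD></TR>" + ...
	"<TR><TD>&quot;Apple&quot;</TD><TD>3</TD></TR>" + ...
	"</TABLE>"

sRTF is string = HTMLToRTF(sHTML)

// According to the rules above, the text carried by sRTF should be
// laid out roughly as follows (the exact RTF control words are not
// described on this page, so none are shown here):
//   Name<tab>Qty
//   "Apple"<tab>3
// - each <TR> starts a new line,
// - the <TD> cells of a row are separated by a tab character,
// - &quot; is converted into a double quote.
Trace(sRTF)
```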
Supported tags
Unsupported tags are ignored: their content is treated as plain text.
The supported tags are as follows:
- <UL>: Line break + tab
- <OL>: Line break + tab
- <LI>: Tab
- <H1>: Line break before and after, bold and font size applied
- <H2>: Line break before and after, bold and font size applied
- <H3>: Line break before and after, bold and font size applied
- <H4>: Line break before and after, bold and font size applied
- <H5>: Line break before and after, bold and font size applied
- <H6>: Line break before and after, bold and font size applied
- <P>: Line break before and after
- <BR>: Line break
- <B>: Bold
- <STRONG>: Bold
- <I>: Italics
- <EM>: Italics
- <FONT>: Size and color
- <A HREF>: Hypertext link
- <SPAN>: Style: Color
- <DL>: Line break
- <DT>: Line break
- <DD>: Tab and line break
- <TABLE>: Line break
- <TR>: Line break
- <TD>: Elements separated by a tab
- <HEAD>: Content ignored, except for the parameters of the character set
- <STYLE>: Content ignored
- <SCRIPT>: Content ignored
- <!-- -->: Comments ignored
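The list above can be exercised with a small fragment that mixes supported and unsupported tags. This is only a sketch: the HTML content, the variable names and the EDT_Result control are hypothetical, and the comments describe the behavior expected from the list, not a verified output.

```wlanguage
// Hypothetical fragment: heading, bold, italic, colored text,
// a comment and an unsupported tag (<MARQUEE>)
sHTML is string = "<H2>Report</H2>" + ...
	"<P>Status: <B>done</B>, <I>verified</I>, " + ...
	"<FONT color=""#FF0000"">urgent</FONT></P>" + ...
	"<!-- internal note --><MARQUEE>Scrolling text</MARQUEE>"

sRTF is string = HTMLToRTF(sHTML)

// Expected from the list of supported tags:
// - "Report" in bold with a heading font size, on its own line,
// - "done" in bold, "verified" in italics, "urgent" in color,
// - the comment is dropped,
// - <MARQUEE> is not supported: the tag is ignored and
//   "Scrolling text" is kept as plain text.

// Assumption: EDT_Result is an edit control configured to display RTF
EDT_Result = sRTF
```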
Managing the character set
To find out which character set is used in the HTML text, HTMLToRTF uses the information found in the CONTENT attribute of a <META> tag.
If this tag is not found, the character set used to write the HTML text must be specified in <Charset Used>.
For example, if the HTML content uses an Arabic character set while WINDEV/WEBDEV uses a French character set by default, the output text will contain invalid characters.
- If the output text contains several "?" characters, some characters of the character set used in the HTML document cannot be represented in the character set of the current language.
- The UTF-8 character set is commonly used to encode Web pages.
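The two situations described above can be sketched as follows. The file path and variable names are hypothetical. The charsetArabic constant is assumed here to be the WLanguage constant for the Arabic character set, so check the list of character set constants before relying on it.

```wlanguage
// Case 1: the HTML text declares its character set in a <META> tag.
// The declaration takes priority, so <Charset Used> can be omitted.
sHTMLWithMeta is string = "<HEAD>" + ...
	"<META http-equiv=""content-type"" content=""text/html; charset=UTF-8"">" + ...
	"</HEAD><P>Hello</P>"
sRTF1 is string = HTMLToRTF(sHTMLWithMeta)

// Case 2: the HTML text has no <META> tag and was written with an
// Arabic character set. The character set must then be specified.
// Hypothetical path; the file is loaded as a buffer, which is one
// of the documented types for <Text in HTML Format>.
bufHTML is Buffer = fLoadBuffer("C:\Temp\page_without_meta.html")
// Assumption: charsetArabic identifies the Arabic character set
sRTF2 is string = HTMLToRTF(bufHTML, charsetArabic)

Trace(sRTF1)
Trace(sRTF2)
```

If the trace of the second result still shows "?" characters, the current character set of the application probably cannot represent the Arabic characters.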