Saturday, March 24, 2012

Stripping out HTML tags for plain text email

I was wondering what the best way is to strip the HTML tags off of a page and add vbCrLf and such, for plain-text emailing of web pages.

If you could point me in the right direction, it would be great.

To remove HTML tags from a well-formed document:
Converting HTML to Text
Awesome, what a life saver!
Now, how could I change the code below so that it replaces the </p> and <br> tags with a vbCrLf or another plain-text line break? Also, how can I get it to remove all of the dead space? I see a comment on the page you linked about doing it, but no one actually shows how.


Public Function StripTags(ByVal HTML As String) As String
    ' Removes tags from the passed HTML
    Return System.Text.RegularExpressions.Regex.Replace(HTML, "<[^>]*>", "")
End Function

Try the following:
Imports System.Text.RegularExpressions

Public Function StripTags(ByVal HTML As String) As String
    Dim cleanString As String = HTML
    Dim objRegEx As Regex

    ' First, collapse each run of two or more whitespace characters into a single space
    objRegEx = New Regex("\s{2,}")
    cleanString = objRegEx.Replace(cleanString, " ")

    ' Second, replace HTML line breaks (</p>, <br>, <br/>, <br />) with text line breaks
    objRegEx = New Regex("</p>|<br ?/?>", RegexOptions.IgnoreCase)
    cleanString = objRegEx.Replace(cleanString, System.Environment.NewLine)

    ' Third, strip spaces that follow a newline (or start the string);
    ' "$1" puts the captured newline back so only the spaces are removed
    objRegEx = New Regex("(^|\n) +")
    cleanString = objRegEx.Replace(cleanString, "$1")

    ' Finally, remove the remaining HTML tags
    objRegEx = New Regex("<[^>]*?>")
    cleanString = objRegEx.Replace(cleanString, String.Empty)

    Return cleanString
End Function
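
To try this out, here is a minimal console sketch. It is only an illustration: the module name StripTagsDemo and the sample markup are made up for this example, and the function body is a compact variant of the four steps (collapse whitespace, convert </p> and <br> to line breaks, trim spaces after a newline, strip the remaining tags) so that the snippet compiles on its own.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module StripTagsDemo
    ' Compact variant of the four steps, using the shared Regex.Replace overloads.
    Function StripTags(ByVal html As String) As String
        ' Collapse runs of two or more whitespace characters into one space
        Dim clean As String = Regex.Replace(html, "\s{2,}", " ")
        ' Turn </p> and <br> variants into plain-text line breaks
        clean = Regex.Replace(clean, "</p>|<br ?/?>", Environment.NewLine, RegexOptions.IgnoreCase)
        ' Drop spaces right after a newline, keeping the newline itself ($1)
        clean = Regex.Replace(clean, "(^|\n) +", "$1")
        ' Strip whatever tags are left
        Return Regex.Replace(clean, "<[^>]*?>", String.Empty)
    End Function

    Sub Main()
        ' Hypothetical sample markup, not taken from the original thread
        Dim sample As String = "<p>Hello,   <b>world</b>!</p>  <p>Second line<br/>Third line</p>"
        Console.WriteLine(StripTags(sample))
    End Sub
End Module
```

With that sample, the output should be three lines of text ("Hello, world!", "Second line", "Third line"). Note that a regex approach like this only suits simple, well-formed markup; it does not decode entities such as &amp;nbsp; or handle script and style blocks.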
