If you could point me in the right direction, it would be great.
To remove HTML tags from a well-formed document:
Converting HTML to Text
Awesome, what a lifesaver!
Now, how could I change the code below so that it replaces </p> and <br> with vbCrLf (a plain-text line break)? And how can I get it to remove all of the dead space? I see a comment on the page you linked asking about this, but no one actually shows how to do it.
Public Function StripTags(ByVal HTML As String) As String
' Removes tags from passed HTML
' Regex.Replace is a Shared method, so no Regex instance is needed
Return System.Text.RegularExpressions.Regex.Replace(HTML, "<[^>]*>", "")
End Function
Try the following:
Imports System.Text.RegularExpressions

Public Function StripTags(ByVal HTML As String) As String
    Dim cleanString As String = HTML

    ' First, collapse runs of two or more whitespace characters into a single space
    cleanString = Regex.Replace(cleanString, "\s{2,}", " ")

    ' Second, replace HTML line breaks (</p>, <br>, <br/>, <br />) with text line breaks
    cleanString = Regex.Replace(cleanString, "</p>|<br ?/?>", System.Environment.NewLine, RegexOptions.IgnoreCase)

    ' Third, remove spaces that follow a newline; "$1" puts the newline itself back
    cleanString = Regex.Replace(cleanString, "(^|\n) +", "$1")

    ' Finally, remove the remaining HTML tags
    cleanString = Regex.Replace(cleanString, "<[^>]*>", String.Empty)

    Return cleanString
End Function