Monday, March 26, 2012

StringBuilder Parsing Text: Need to remove square character

Hi everyone,

I am attempting to parse data from a webpage and want to remove any unnecessary junk from the string. I start by removing any tags from the text I have retrieved from the web page. Next, I split the text into an array. However, I end up with strings that contain a bunch of characters that look like small squares (□), sometimes many in a row (□□□□). I would like to remove these characters.

These characters are the only things left separating me from the data I want, but I am not sure how to remove them... Does anyone have an idea? It is interesting that when I display my string in a textbox, or using a label, the squares are not visible (I think they display as spaces) but when I debug I see them.

Thanks,

zoop

Those characters are probably line feeds, tabs, or some other control characters.

Try:

Dim junk As String = "bla bla bla....."

' Replace CrLf pairs first; once the Lf is gone, NewLine no longer matches
junk = junk.Replace(ControlChars.NewLine, "")

junk = junk.Replace(ControlChars.Lf, "")

junk = junk.Replace(ControlChars.Tab, "")


etc, etc...
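
Listing each ControlChars constant one by one gets long. As a rough sketch of the same idea, the loop below drops every character that Char.IsControl flags as a control character. RemoveControlChars and the sample string are made-up names for illustration and are not part of the code posted in this thread.

```vbnet
Imports System
Imports System.Text

Module ControlCharDemo
    ' Hypothetical helper: copies the string, skipping every character
    ' that Char.IsControl reports as a control character (tab, CR, LF, ...).
    Function RemoveControlChars(ByVal source As String) As String
        Dim sb As New StringBuilder(source.Length)
        For Each c As Char In source
            If Not Char.IsControl(c) Then sb.Append(c)
        Next
        Return sb.ToString()
    End Function

    Sub Main()
        ' Sample input containing a tab and a CR/LF pair
        Dim junk As String = "bla" & ControlChars.Tab & "bla" & ControlChars.CrLf & "bla"
        Console.WriteLine(RemoveControlChars(junk))
    End Sub
End Module
```

Note that this removes the characters outright rather than replacing them with a space, so words that were only separated by a tab or line break end up joined together.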


If I'm not mistaken, you are seeing the placeholder that the debugger shows for hidden whitespace characters. You can use a regular expression to remove them. Try something like this:

using System.Text.RegularExpressions;

private string StripSpaces(string inputString)
{
    return Regex.Replace(inputString, @"[^\s$]", "");
}


Thanks for the input fellas. I have tried the suggestions above along with some variations and came up short. Replacing the ControlChars does not seem to make a difference on the format of the string. Running the stripSpaces function removes all of the data I am interested in and returns only the chars I am attempting to remove (blanks, or squares). Perhaps I lost something when I translated it into VB...

Anyway, since there is not much code to go over I thought perhaps it would help if I posted what I have here. I am going to omit the section where I replace any ControlChars for brevity, as it does not seem to make a difference.

Private Sub btnSubmit_Click(ByVal sender As Object, ByVal e As System.EventArgs) Handles btnSubmit.Click
    '// Store Page Source in DOC
    Dim req As WebRequest = WebRequest.Create("http://world5.knightfight.co.uk/index.php?ac=highscore&vid=0")
    Dim resp As WebResponse = req.GetResponse

    Dim s As Stream = resp.GetResponseStream
    Dim sr As StreamReader = New StreamReader(s, Encoding.ASCII)
    Dim doc As String = sr.ReadToEnd

    '// Format text in DOC using StringBuilder Class
    Dim sb As StringBuilder = New StringBuilder(doc)
    Dim endIndex As Integer = sb.ToString.IndexOf("showuserid=")
    Dim newString() As String
    Dim strTest As String

    '// Remove Unwanted Text
    strTest = sb.Remove(0, endIndex).ToString

    '// Remove HTML Tags and Spaces
    strTest = StripTags(strTest)
    strTest = stripSpaces(strTest)

    '// TEMP - Display Preview in Multiline Textbox
    txtbWebData.Text = strTest

    '// Split String into Array
    newString = Split(strTest, " ")
End Sub

Public Function StripTags(ByVal HTML As String) As String
    ' Removes tags from passed HTML
    Dim objRegEx As System.Text.RegularExpressions.Regex
    Return objRegEx.Replace(HTML, "<[^>]*>", "*")
End Function

Public Function stripSpaces(ByVal inputString As String) As String
    Dim objRegEx As System.Text.RegularExpressions.Regex
    Return objRegEx.Replace(inputString, "[^\s$]", "")
End Function

I thought it would be an interesting exercise to try to capture some data from a highscores list on a website. The URL is in the code above, http://world5.knightfight.co.uk/index.php?ac=highscore&vid=0, so you could reproduce my results if you like. The end goal is to capture the data in the highscores list and format it in a way that makes sense. If you think there is a better approach, please let me know :).

Thanks,

zoop
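
Since the squares only show up in the debugger, one way to find out what they really are is to print the code point of every character that is not printable ASCII. The following is only a diagnostic sketch; DescribeOddChars and the sample string are hypothetical and do not come from the code above.

```vbnet
Imports System
Imports System.Text

Module InspectCharsDemo
    ' Hypothetical helper: reports the index and code point of each character
    ' that falls outside the printable ASCII range (32 to 126).
    Function DescribeOddChars(ByVal source As String) As String
        Dim sb As New StringBuilder()
        For i As Integer = 0 To source.Length - 1
            Dim code As Integer = AscW(source(i))
            If code < 32 OrElse code > 126 Then
                sb.AppendFormat("index {0}: U+{1:X4}", i, code).AppendLine()
            End If
        Next
        Return sb.ToString()
    End Function

    Sub Main()
        ' Sample input: a tab at index 2 and a non-breaking space (U+00A0) at index 4
        Dim sample As String = "ab" & ControlChars.Tab & "c" & ChrW(160)
        Console.Write(DescribeOddChars(sample))
    End Sub
End Module
```

Once the code points are known (for example U+0009 for a tab or U+000A for a line feed), it becomes clear which characters the replace step has to target.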


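Sorry about that. Replace Return objRegEx.Replace(inputString, "[^\s$]", "") with Return objRegEx.Replace(inputString, "[^\S$]", ""). \s matches any whitespace character and \S matches any non-whitespace character, so the original pattern stripped the non-whitespace characters (your data), while the corrected one strips the whitespace.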
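
One thing to watch: a pattern that removes all whitespace also removes the spaces that the later Split(strTest, " ") call relies on. A possible variation, sketched here under that assumption, collapses each run of whitespace into a single space instead of deleting it. CollapseWhitespace and the sample data are illustrative names only, not part of the original code.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module WhitespaceDemo
    ' Hypothetical helper: turns every run of whitespace (spaces, tabs,
    ' line breaks) into one space and trims the ends.
    Function CollapseWhitespace(ByVal source As String) As String
        Return Regex.Replace(source, "\s+", " ").Trim()
    End Function

    Sub Main()
        ' Made-up sample row: fields separated by a tab and by CR/LF plus spaces
        Dim sample As String = "1" & ControlChars.Tab & "zoop" & ControlChars.CrLf & "  1500"
        Dim parts() As String = CollapseWhitespace(sample).Split(" "c)
        Console.WriteLine(String.Join("|", parts))
    End Sub
End Module
```

With the sample above, the split yields three fields (1, zoop, 1500) with no empty entries in between.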
