Consider using

Registered by Tim Black on 2013-09-03

Consider using to create screen-scrapers.

Blueprint information

Not started
Needs approval

On 09/10/2013 05:16 AM, Engineering wrote:
> I've had a play around on the website you are extracting emails from, but I have had limited success.
> You could use two email columns to map the two different HTML locations. This means that you can get the email address back from any of the pages, but it could be in either column, which is a little messy. Using this method also brings back "Click here for online map" in the first email column when the second type of email is found on the page.
> Hope this workaround is somewhat useful.
Thanks for the workaround suggestion. It's feasible and not too hard to deal with: my application can do some data cleaning by applying a regex to both email columns to see which one contains an email address. But of course I'd prefer the data to be clean when it comes out of the software, and I think the software can be improved to accomplish this, as I've outlined below.
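As a rough illustration of the data-cleaning step described above, here is a minimal Python sketch. The function name `pick_email`, the sample rows, and the deliberately simple email pattern are all assumptions made for this example, not part of any real tool's output format.

```python
import re

# A deliberately simple email pattern (an assumption for illustration,
# not a full RFC 5322 validator).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def pick_email(col_a, col_b):
    """Return the first email address found in either column, or None."""
    for value in (col_a, col_b):
        match = EMAIL_RE.search(value or "")
        if match:
            return match.group(0)
    return None


# Hypothetical scraped rows: (first email column, second email column).
rows = [
    ("Click here for online map", "info@example.org"),
    ("sales@example.com", ""),
    ("Click here for online map", None),
]
cleaned = [pick_email(a, b) for a, b in rows]
```

Rows where neither column holds an address come back as `None`, so the stray "Click here for online map" text never reaches the cleaned data.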
> Keep an eye on our bug fixes when we update the software. We'll be fixing this issue as soon as we can!
I'm guessing the software records the xpath of the highlighted data's container element and then, if necessary, creates a regex which contains a bit of context surrounding the data. To fix the problem I'm having, I would recommend that the code record and try each of the following options to get successful matches on all pages of data, and rank the options according to which most often returns data the user reports as correct:

1. Regardless of whether the xpaths for the two conflicting data locations are identical, have the code expand the two regexes' contexts by one punctuation mark, word, or HTML tag in one or both directions each time the user retrains a column after a training problem occurs. (For example, this could pick up the fact that the email address is always preceded by the text "Email: " in my pages' data.) This could be done several times quickly in a loop for each page of example data: first expand by one item (punctuation, word, or tag) and see if the results match the user's selections; if not, expand by two items, and so on.

2. If the xpaths are not identical and the user has highlighted a complete HTML element (so ordinarily a sub-element level regex is not needed), diff (multi-line and within-line) two data examples and create a regex using a portion of the data which is identical between the two data examples (this might be a heading which is always included in the data, or a pattern of punctuation) but which is not identical in one of the xpaths on each page of the data examples. (It is easier to use identical lines from the multi-line diff, but if they can't be found, it is possible to find patterns that are identical in a within-line diff.) Then, for each page, iterate through testing each xpath, and each regex, to find the most matches the user approves.

3. Other combinations of the above ideas may be necessary--e.g., if the xpaths are not identical and the user has not highlighted a complete HTML element, then separately try option 1, then option 2, then a combination of 1 + 2, to get the most user-approved matches.

4. If the above options fail, I recommend offering the user the option to manually correct the regex, with a link to the regex documentation for JavaScript (or whatever language is used). This is the solution my own code uses for cases where it can't figure out by automated means how to match the user's data. I recognize that you don't want to require untechnical users to manually edit regexes, and it would be awesome if it were never necessary for the user to drop down to that level, but it seems to me that at times it really is necessary. You could offer the user two kinds of regexes the software could use: one regex to identify a pattern inside the highlighted data, and one regex to identify the data within its context.
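The context-expansion loop from option 1 above could look something like the following Python sketch. It is only an illustration under stated assumptions: it widens the context to the left only, takes the context from the first example page, and uses hypothetical names (`left_context`, `build_pattern`, `train`) and made-up sample pages. It also keeps the two kinds of regex from option 4 separate: `EMAIL` is the pattern inside the highlighted data, and the generated prefix is the surrounding context.

```python
import re

# Split text into the "items" named in option 1: HTML tags, words, punctuation.
TOKEN_RE = re.compile(r"<[^>]+>|\w+|[^\w\s]")

# Hypothetical inner pattern for the highlighted data (a simple email matcher).
EMAIL = r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"


def left_context(page, value, n_items):
    """Return the n_items tokens that immediately precede value on the page."""
    tokens = TOKEN_RE.findall(page[:page.index(value)])
    return tokens[-n_items:] if n_items else []


def build_pattern(context_tokens, value_pattern):
    """Build a regex: the context tokens (whitespace-tolerant), then the value."""
    prefix = "".join(re.escape(tok) + r"\s*" for tok in context_tokens)
    return re.compile(prefix + "(" + value_pattern + ")")


def train(examples, value_pattern, max_items=5):
    """Widen the context one item at a time until every example page matches.

    examples is a list of (page_html, value_the_user_selected) pairs.
    Returns (n_items, compiled_pattern), or None if no width up to
    max_items reproduces all of the user's selections.
    """
    first_page, first_value = examples[0]
    for n_items in range(max_items + 1):
        pattern = build_pattern(
            left_context(first_page, first_value, n_items), value_pattern
        )
        matches = [pattern.search(page) for page, _ in examples]
        if all(m and m.group(1) == value
               for m, (_, value) in zip(matches, examples)):
            return n_items, pattern
    return None


# Made-up example pages with the user's selection for each.
examples = [
    ("<p>Click here for online map</p><p>Email: info@example.org</p>",
     "info@example.org"),
    ("<div>Webmaster: webmaster@example.net</div>"
     "<div>Email: sales@example.com</div>",
     "sales@example.com"),
]
n_items, pattern = train(examples, EMAIL)
```

With these sample pages, zero or one item of context still picks up the webmaster address on the second page, so the loop only succeeds once the context has grown to the two items `Email` and `:`.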

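For option 2 above, one possible reading is sketched below in Python: diff two highlighted data examples line by line, turn a shared line into an anchor regex, and use that anchor to decide which xpath's text holds the wanted data on a given page. The helper names, the xpaths, and the sample text are hypothetical, and only the multi-line diff is shown; the within-line diff is left out.

```python
import difflib
import re


def shared_lines(example_a, example_b):
    """Return non-blank lines that a multi-line diff finds in both examples."""
    a, b = example_a.splitlines(), example_b.splitlines()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    shared = []
    for block in matcher.get_matching_blocks():
        shared.extend(a[block.a:block.a + block.size])
    return [line for line in shared if line.strip()]


def anchor_pattern(example_a, example_b):
    """Build a regex from the longest shared line, or None if none is shared."""
    shared = shared_lines(example_a, example_b)
    if not shared:
        return None
    return re.compile(re.escape(max(shared, key=len)))


def pick_xpath(candidates, pattern):
    """Return the first xpath whose text contains the anchor, or None."""
    for xpath, text in candidates.items():
        if pattern.search(text):
            return xpath
    return None


# Two made-up data examples the user highlighted on different pages.
example_a = "Contact details\nEmail: info@example.org"
example_b = "Contact details\nPhone: 555-0100\nEmail: sales@example.com"
pattern = anchor_pattern(example_a, example_b)

# Text found at each recorded xpath on a new page (hypothetical values).
page = {
    "/html/body/div[1]": "Click here for online map",
    "/html/body/div[2]": "Contact details\nEmail: sales@example.com",
}
chosen = pick_xpath(page, pattern)
```

Here the shared heading "Contact details" becomes the anchor, so the second xpath is chosen and the map link is ignored. A fuller version would try every xpath and regex combination and keep the one the user approves most often.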
