parsing data from html

keliko · Post by **keliko** » Wed Jun 08, 2022 5:28 pm

Hello Members,

I need your help.
How do I parse data from HTML?

This is an example of the HTML:

<div class="landing-title _3-XoE">!2@08rkjh5Cjfe9t3IGJnLjjo21RvdT6eWtj6Sue8V+1WjCn+bC1hNjtUE4pcen1NMKYXx0rmYSSI4q4A==,abKRWqczK08SB9C6yIsRh0t2YJFh43GqKsOqNvB1Whk=,gPo6XYdaMLln95RwHkf9FOhrMOwpGKIBY5lk8ckrWGA=,wrZN02dGbz/b0e9xcxb/bGrpWH+cgMjoEURGfgGv3L8=!!qrnya</div>

I want to get this value "!2@08rkjh5Cjfe9t3IGJnLjjo21RvdT6eWtj6Sue8V+1WjCn+bC1hNjtUE4pcen1NMKYXx0rmYSSI4q4A==,abKRWqczK08SB9C6yIsRh0t2YJFh43GqKsOqNvB1Whk=,gPo6XYdaMLln95RwHkf9FOhrMOwpGKIBY5lk8ckrWGA=,wrZN02dGbz/b0e9xcxb/bGrpWH+cgMjoEURGfgGv3L8=!!qrnya"

thanks

SWEdeAndy · Post by **SWEdeAndy** » Wed Jun 08, 2022 9:37 pm

This is one way to parse that particular line:

Code: Select all

put [your line of html] into tLine
set the itemDel to ">"
delete item 1 of tLine
set the itemDel to "<"
delete item -1 of tLine

tLine now contains your desired end result.

If the full html you're parsing is more varied than your example line, you may have to find other chunks of chars to use as item delimiters. But you get an idea here.
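
For reuse, the same two-delimiter steps could be wrapped in a function. This is only a sketch: the name textBetweenTags and its parameter are made up for illustration, and it assumes the line holds a single opening tag, the value, and a single closing tag, as in the example above.

```livecode
-- Hypothetical helper: returns what sits between the first ">" and the last "<"
function textBetweenTags pLine
   local tLine
   put pLine into tLine
   set the itemDelimiter to ">"
   delete item 1 of tLine    -- drops the opening tag
   set the itemDelimiter to "<"
   delete item -1 of tLine   -- drops the closing tag
   return tLine
end textBetweenTags
```

Calling textBetweenTags("<b>hello</b>") returns "hello". The itemDelimiter is reset when the handler ends, so the caller's delimiter is not affected.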

stam · Post by **stam** » Thu Jun 09, 2022 1:15 am

You could also use regex, depending on how handy you are with it.
For example, you can get the text of the line above with

Code: Select all

<div.*>(.*)<\/div>

In LiveCode you can use regex with matchText, replaceText and filter, but regex is a bit arcane...

It's easier if you know there are specific DIVs you want to parse and have a matchText for each of these, for example

Code: Select all

local R
get matchText(field "source", "<div class=" & quote & "landing-title _3-XoE" & quote & ">(.*)<\/div>", R )

R will contain whatever is within the parentheses in the expression above, i.e. the text between the <div class="landing-title _3-XoE"> and </div>.
I've used this type of approach when parsing long text outputs where I know there are specific patterns to search for.

S.
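
One thing to watch with the patterns above: ".*" is greedy, so when several divs sit on the same line the capture can run further than intended. A possible variant, shown here only as a sketch with made-up sample data and variable names, uses [^>]* to stay inside the opening tag and a lazy (.*?) to stop at the first closing tag.

```livecode
on mouseUp
   local tHtml, tValue
   -- Sample data invented for this sketch: two divs on one line
   put "<div class=" & quote & "landing-title _3-XoE" & quote & ">first</div><div>second</div>" into tHtml
   -- [^>]* stays inside the opening tag; (.*?) stops at the first </div>
   if matchText(tHtml, "<div[^>]*>(.*?)</div>", tValue) then
      put tValue   -- shows: first
   else
      put "no match"
   end if
end mouseUp
```

Checking the value returned by matchText before using the capture avoids working with an empty variable when the pattern is not found.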

keliko · Post by **keliko** » Thu Jun 09, 2022 5:04 pm

Thanks stam and SWEdeAndy

dunbarx · Post by **dunbarx** » Thu Jun 09, 2022 9:50 pm

That is already two distinct ways to do the job, using two different tools that LC offers. I far prefer the itemDelimiter method.

Here is another, using the "offset" function, assuming you have your full string in a field 1:

Code: Select all

on mouseUp
   get fld 1
   delete char 1 to offset(">",it) of it
   delete char offset("<",it) to 10000 of it
   put it into fld 2
end mouseup

The "10000" is just to make sure we go all the way to the end. Could be a million.

Easy to read, like the itemDelimiter option, and yet another powerful tool that LC has.

Craig

dunbarx · Post by **dunbarx** » Thu Jun 09, 2022 9:58 pm

@Stam.

I rarely use or need regex, though I appreciate its compactness and power.

In the two "ordinary" methods above, extra work would have to be done if there were many instances of either of those two chars, "<" and ">", and the OP wanted only a certain segment. In the case as presented, that does not come up.

But if it did, does regex have the same issue, or can it be configured to still work in a single line, knowing the particular segment required? In other words, could it pick out the "XYZ" in:

"aaa<bbb<ccc>ddd><><XYZ<eee<><fff><>ggg"

No cheating, now. I could do this easily since LC supports multi-char strings for the itemDelimiter. In fact, better if I had:

"aaa<aaa<aaa>aaa><><XYZ<aaa<><aaa><>aaa"

Craig
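
To make the challenge concrete, here is one way it might be sketched with both tools. It assumes the marker "><><" is known to sit immediately before the wanted segment and that the segment ends at the next "<"; the variable names are invented for the example.

```livecode
on mouseUp
   local tString, tViaItems, tViaRegex
   put "aaa<bbb<ccc>ddd><><XYZ<eee<><fff><>ggg" into tString

   -- Multi-char delimiter: everything after "><><" becomes item 2
   set the itemDelimiter to "><><"
   put item 2 of tString into tViaItems
   -- Keep only what precedes the next "<"
   set the itemDelimiter to "<"
   put item 1 of tViaItems into tViaItems

   -- The same extraction with a single regex capture group
   get matchText(tString, "><><([^<]*)<", tViaRegex)

   put tViaItems & return & tViaRegex   -- both lines read XYZ
end mouseUp
```

Both versions depend on knowing what surrounds the segment; neither can guess it from an unknown string.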

SparkOut · Post by **SparkOut** » Thu Jun 09, 2022 10:20 pm

Regex certainly does have the capability to extract complex patterns in a single match, but obviously it takes thought and understanding of the problem. I am not sure I understand the problem you set, Craig. Actually, I am sure I don't understand. What pattern are you looking to match?

dunbarx · Post by **dunbarx** » Thu Jun 09, 2022 10:25 pm

Just trying to form a string where it is not so easy to isolate the portion of interest, buried in a sea of similar chars. Again, being able to examine the whole string allows one to create an itemDelimiter that can extract the portion in two lines as above. But what if you did not have that string open for view, and just had to extract "XYZ" from some unknown swamp?

Craig

SparkOut · Post by **SparkOut** » Fri Jun 10, 2022 7:50 am

So you can either know the shape of the haystack parts immediately before and after the needle, or you can know the needle and see

Code: Select all

if "XYZ" is in tHaystack then...

is that right?
The latter is trivial (of course). The former requires a definition of the delimiters but is simple enough if you know what the delimiters are going to be to isolate the needle. You can get very much more sophisticated with it too, but then you are heading into Thierry Territory.

SWEdeAndy · Post by **SWEdeAndy** » Fri Jun 10, 2022 9:44 am

dunbarx wrote: ↑
Thu Jun 09, 2022 9:50 pm
The "10000" is just to make sure we go all the way to the end. Could be a million.

Or simply "to -1 of it". Which covers "all the way to the end" for any number of chars...
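
Put together, the offset handler with that change might look like the sketch below. The two guards are an addition not found in the original handler: offset returns 0 when the char is missing, so they skip the delete in that case. Field 1 and field 2 are the same assumptions as in the original.

```livecode
on mouseUp
   local tText, tStart, tEnd
   put field 1 into tText
   put offset(">", tText) into tStart
   -- offset returns 0 when ">" is not found, so only delete on a hit
   if tStart > 0 then delete char 1 to tStart of tText
   put offset("<", tText) into tEnd
   -- -1 addresses the last char, whatever the length of the string
   if tEnd > 0 then delete char tEnd to -1 of tText
   put tText into field 2
end mouseUp
```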

stam · Post by **stam** » Fri Jun 10, 2022 10:40 am

dunbarx wrote: ↑
Thu Jun 09, 2022 9:58 pm
@Stam.

I rarely use or need regex, though I appreciate its compactness and power.

In the two "ordinary" methods above, extra work has to be done if there were many instances of either of those two chars, "<", and ">" and the OP wants only a certain segment. The way presented, that does not come up.

But if it did, does regex have the same issue, or can it be configured to still work in a single line, knowing the particular segment required? In other words, could it pick out the "XYZ" in:
"aaa<bbb<ccc>ddd><><XYZ<eee<><fff><>ggg"
No cheating, now. I could do this easily since LC supports multi-char strings for the itemDelimiter. In fact, better if I had:
"aaa<aaa<aaa>aaa><><XYZ<aaa<><aaa><>aaa"
Craig

Rather than me “cheating” why don’t you try it?

The task was to retrieve the text inside a div.
To be clear this is not intended to remove HTML tags, just retrieve the string between the DIV start and end tags.

The regex above will return anything between a <div xyz> and </div> tag (where ‘xyz’ can be any string). No matter what else is included between those tags, even other tags.

And to be clear, the ‘xyz’ is not known or required for this to work, but is not expected to be a nested tag, as that would be invalid HTML - but that could be catered for easily if needed as well. Not sure what you think the problem is… ?

regex is the most powerful tool for searching for text patterns which is why it is dominant across practically all programming languages.

It is difficult to get into because of its compactness and I don’t profess to being more than amateur-to-intermediate at it but I use it frequently, it’s blindingly fast and optimised code. It is definitely worth gaining at least superficial familiarity with.

But not being familiar with it is not a reason to suggest it’s inferior to iterating verbosely in LiveCode.
And in spite of many here not considering it to be “LiveCode-y” enough, it is part of the language.

You probably would have benefited from this at some point but kludged it with verbose Iterations - but I’m a fan of using the right tool for the job, avoid using 5-10 lines of code if 1 will do and always remember the aphorism

when your only tool is a hammer, everything looks like a nail

But as my old boss used to say, there are many ways to god

If your solution works, it works.

This is just a discussion about the different ways of achieving the same goal and this would be my preferred way….

S.

stam · Post by **stam** » Fri Jun 10, 2022 11:09 am

dunbarx wrote: ↑
Thu Jun 09, 2022 10:25 pm
Just trying to form a string where it is not so easy to isolate the portion of interest, buried in a sea of similar chars. Again, being able to examine the whole string allows one to create an itemDelimiter that can extract the portion in two lines as above. But what if you did not have that string open for view, and just had to extract "XYZ" from some unknown swamp?

Craig

Craig, that is EXACTLY what regex was created for.
If you just want to search for

Code: Select all

XYZ

then that is the regex right there.

If you want to search for patterns where parts are unknown or missing, that is where regex excels. But it can become complex and requires upfront knowledge.

It has wide uses. For example, let’s take the problem of checking whether an entered email is syntactically correct, that is, it has an alphanumeric first part that may include a dot, underscore, plus sign or hyphen, followed by ‘@’ and then at least a seemingly valid domain (2 or more strings separated by dots):

Code: Select all

^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$

This will catch most of the common errors in entering an invalid email. Just plug that into a matchText function which will return a Boolean denoting if email is valid or not.
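
As a rough illustration of plugging it in, the pattern could be wrapped like this. The function name isPlausibleEmail is hypothetical, and the check is only as strict as the pattern above.

```livecode
-- Hypothetical wrapper around the short pattern from this post
function isPlausibleEmail pEmail
   -- LiveCode string literals do not treat "\" as an escape, so "\." reaches the regex engine unchanged
   return matchText(pEmail, "^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$")
end isPlausibleEmail
```

With this, isPlausibleEmail("user@example.com") returns true, while isPlausibleEmail("user@example") returns false because the domain part has no dot.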

A quick Googling will show any number of variations of this, as this problem has long since been solved in regex and you can have much more comprehensive solutions that guarantee even more that a valid email was entered. The most comprehensive I found (but do not use as the above suffices for me) is:

Code: Select all

(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t]
)+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:
\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(
?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ 
\t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\0
31]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\
](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+
(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:
(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z
|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)
?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\
r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[
 \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)
?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t]
)*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[
 \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*
)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t]
)+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)
*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+
|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r
\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:
\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t
]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031
]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](
?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?
:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?
:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?
:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?
[ \t]))*"(?:(?:\r\n)?[ \t])*)*:(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] 
\000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|
\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>
@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"
(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t]
)*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\
".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?
:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[
\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-
\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(
?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;
:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([
^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\"
.\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\
]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\
[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\
r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] 
\000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]
|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \0
00-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\
.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,
;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?
:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*
(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".
\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[
^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]
]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)(?:,\s*(
?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\
".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(
?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[
\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t
])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t
])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?
:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|
\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:
[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\
]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)
?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["
()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)
?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>
@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[
 \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,
;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t]
)*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\
".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?
(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".
\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:
\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\[
"()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])
*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])
+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\
.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z
|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(
?:\r\n)?[ \t])*))*)?;\s*)

But I’d agree that’s less helpful even if more accurate

But the point is you can google what you need and chances are you’ll find the regex for it…

mwieder · Post by **mwieder** » Sun Jun 12, 2022 1:25 am

Wow... that's an impressive corpus of... er... characters.
Gotta ask "why octal?" though.

stam · Post by **stam** » Sun Jun 12, 2022 1:29 am

mwieder wrote: ↑
Sun Jun 12, 2022 1:25 am
Wow... that's an impressive corpus of... er... characters.
Gotta ask "why octal?" though.

Don’t ask me - copied from stackoverflow

That’s all pure regex, catering for all fringe cases. Excessive for my use - I prefer my single line which I can actually understand;)
