Posts: 5
Threads: 3
Joined: Sep 2014
Has anyone built a script that enumerates local folders and HTML files, then generates an HTML page with a link to each file, using each file's HTML title as the link text?
For example, using Enumerate Files, I get excellent raw output:
C:\pub\1\index.html (Title: Miscellaneous Questions)*
C:\pub\2\index.html (Title: Key Questions)*
C:\pub\Overuse.html (Title: Over-using the System)*
* Enumerate Files does not show titles, only file names.
I would like the output to look like:
\1 Miscellaneous Questions
\2 Key Questions
\ Over-using the System
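For anyone reading along outside QM, here is a rough Python sketch of the whole task: walk a folder, read each file's title, and build a page of links. It is not QM code. The names `read_title` and `build_index` are made up for illustration, and the title lookup is a simple regular expression, so it will be fooled by a `<title>` that appears inside a comment or script.

```python
import html
import os
import re

# Simple title pattern; assumes the first <title>...</title> is the real one.
TITLE_RX = re.compile(r"<title.*?>(.+?)</title>", re.S | re.I)

def read_title(path):
    """Return the page title with whitespace collapsed, or None if absent."""
    with open(path, encoding="utf-8", errors="replace") as f:
        m = TITLE_RX.search(f.read())
    if not m:
        return None
    # Decode entities such as &amp; so they are not escaped twice later.
    return html.unescape(re.sub(r"\s+", " ", m.group(1)).strip())

def build_index(root):
    """Return an HTML page listing every .htm/.html file under root."""
    rows = []
    for folder, dirs, files in os.walk(root):
        dirs.sort()  # visit subfolders in a stable order
        for name in sorted(files):
            if not name.lower().endswith((".html", ".htm")):
                continue
            path = os.path.join(folder, name)
            # Relative link with forward slashes, usable as an href.
            rel = os.path.relpath(path, root).replace(os.sep, "/")
            title = read_title(path) or name  # fall back to the file name
            rows.append('<li><a href="%s">%s</a></li>'
                        % (html.escape(rel, quote=True), html.escape(title)))
    return "<html><body><ul>\n%s\n</ul></body></html>" % "\n".join(rows)
```

Calling `build_index(r"C:\pub")` returns the page as a string; writing it to a file outside the scanned folder avoids the index listing itself on the next run.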
Posts: 12,051
Threads: 140
Joined: Dec 2002
Function
GetTitleFromHTML
;/
function! $html str&title [flags] ;;flags: 1 html is file, 2 fast but unreliable
;Extracts title from <title> tag in HTML.
;Returns 1 if title found, 0 if not.
;html - HTML text. If flag 1 - full path of a local HTML file.
;title - variable that receives title text.
;flags:
;;;1 - html is HTML file path.
;;;2 - extract the title with a regular expression. Almost 1000 times faster, but unreliable, e.g. it can match a <title> inside comments or scripts. Without this flag, the HtmlDoc class is used to parse the HTML.
;EXAMPLE
;str title
;if(GetTitleFromHTML("c:\test\test.htm" title 1)) out title; else out "<NO TITLE>"
opt noerrorshere 1
if(flags&1) html=_s.getfile(html)
title.all
if flags&2
,if(findrx(html "(?si)<title.*?>(.+?)</title>" 0 0 title 1)<0) ret
,title.trim; title.replacerx("\s+" " ")
,ret 1
HtmlDoc d.InitFromText(html)
title=d.d.title
ret title.len!0
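To illustrate why flag 2 is described as unreliable, here is a small Python sketch (not QM) of the same two strategies: a regular expression versus a real parser. The function names are hypothetical, and Python's standard `html.parser` stands in for HtmlDoc, so treat it as an analogy rather than a description of how HtmlDoc works.

```python
import re
from html.parser import HTMLParser

# Same pattern as in the QM function above.
_TITLE_RX = re.compile(r"<title.*?>(.+?)</title>", re.S | re.I)

def title_by_regex(html):
    """Fast path: first textual match, even if it sits inside a comment."""
    m = _TITLE_RX.search(html)
    if not m:
        return None
    return re.sub(r"\s+", " ", m.group(1).strip())

class _TitleParser(HTMLParser):
    """Collects text of the first real <title> element."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)

def title_by_parser(html):
    """Slow path: the parser skips comments and script bodies."""
    p = _TitleParser()
    p.feed(html)
    p.close()
    text = re.sub(r"\s+", " ", "".join(p.parts).strip())
    return text or None
```

On a page whose head contains a commented-out title before the real one, the regex version returns the commented-out text while the parser version returns the real title.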
Posts: 5
Threads: 3
Joined: Sep 2014
Thanks Gintaras,
Sorry for the dim request. I thought someone might have something sitting on the shelf. My code is much uglier than yours, but I managed to extract almost everything.
Now I'm down to the tough part (for me). I have not yet figured out how to extract a string like "Home-Test System-Notes-Oct 1999" from markup like this:
<div class="nv">
<a class="nv" href="../index.html">Home</a>
-
<a class="nv" href="../10/index.html">Test System</a>
-
<a class="nv" href="../783/index.html">Notes</a>
- Oct 1999
</div>
The number of lines varies, but they are always inside <div class="nv"></div>, and the first character of each separator line is always "-".
Is there a clean way to do this with a regex?
Posts: 12,051
Threads: 140
Joined: Dec 2002
With a regex this is difficult, unless the HTML is quite simple.
Use HtmlDoc.GetText.
Macro
Macro2353
str html
;...
HtmlDoc d.InitFromText(html)
str text=d.GetText
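The same idea (let a parser strip the tags, then tidy the text) can be sketched in Python for comparison. This is not QM. `NavTextParser` and `nav_trail` are hypothetical names, and the sketch assumes the div's class attribute is exactly "nv" and that no link text contains " - " with spaces around the dash, since such dashes would be treated as separators.

```python
import re
from html.parser import HTMLParser

class NavTextParser(HTMLParser):
    """Collects the text found inside <div class="nv"> ... </div>."""
    def __init__(self):
        super().__init__()
        self.depth = 0   # <div> nesting depth while inside the nav block
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth:
            self.depth += 1          # nested div inside the nav block
        elif ("class", "nv") in attrs:
            self.depth = 1           # entering the nav block

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)

def nav_trail(html):
    """Return the nav text as 'Home-Test System-Notes-Oct 1999' style."""
    p = NavTextParser()
    p.feed(html)
    p.close()
    # Collapse line breaks and runs of spaces into single spaces.
    text = re.sub(r"\s+", " ", "".join(p.parts)).strip()
    # Remove the spaces left around each "-" separator.
    return re.sub(r"\s*-\s*", "-", text)
```

If a page has more than one such div, this version joins the text of all of them, so it would need a small change to stop after the first.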