Extracting a Particular Text Pattern from HTML Files?

I have more than 4000 HTML files, and I need only particular lines from them.

Each file contains some lines with the same pattern, and I would like to extract only those lines.

The pattern that I would like to extract from all the files looks like: javascript:listen('http://www.domainname.com','http://www.anotherdomainname.com/somefilename.ext',0)

Can anybody suggest a solution for this?

PS: I don't have a Linux system, so I need a batch file script or some software that runs on Windows to solve this problem.

Thanks in advance.

Hi

I understand from your question that you want to find all occurrences of the given pattern, with each occurrence potentially having a unique URL. I also assume that you are only interested in the URL, because you cut off the rest of the text in your example of the pattern. I then assume that you wish to manipulate or use the search results or URLs in some programmatic way.

Assuming then that you have some programming experience, I suggest that you write a program or script that uses regular expressions to find the occurrences. The following regular expression pattern should suffice:

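javascript:listen\('http://\w+\.\w+\.\w+[/\w.]*'

Note that the dots in the host name are escaped (\.) so that they match a literal dot rather than any character. After some brief testing, this pattern found occurrences like:
1) javascript:listen('http://www.domainname.com'
2) javascript:listen('http://www.mydomainname.com/test/page.svc'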
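
As a rough illustration of the pattern above, here is a minimal Python sketch (Python runs on Windows). It assumes the HTML source uses straight single quotes and that the dots in the host name should be matched literally; the sample strings are made up for the demonstration and are not taken from the real files.

```python
import re

# Escaped dots (\.) match a literal "."; the character class allows an
# optional path made of slashes, word characters and dots.
# Assumption: the HTML source uses straight single quotes.
LISTEN_PREFIX = re.compile(r"javascript:listen\('http://\w+\.\w+\.\w+[/\w.]*'")

# Hypothetical sample lines standing in for the real HTML files.
samples = [
    "<a href=\"javascript:listen('http://www.domainname.com','http://www.anotherdomainname.com/somefilename.ext',0)\">play</a>",
    "<a href=\"javascript:listen('http://www.mydomainname.com/test/page.svc',0)\">play</a>",
    "<a href=\"http://www.domainname.com\">no listen call here</a>",
]

# Collect the matched text from every sample line.
matches = [m.group(0) for s in samples for m in LISTEN_PREFIX.finditer(s)]
for found in matches:
    print(found)
```

The third sample is there only to show that a plain link without a listen call is not reported.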

If this helped you in any way, but is not sufficient, you may read more about Regular Expressions here:
http://www.regular-expressions.info/reference.html

There are many programming languages that support regular expressions, e.g. VBScript, C# and VB on Microsoft .NET, C++, etc.
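
To make the "write a program or script" suggestion more concrete, here is one possible sketch in Python, which is not one of the languages listed above but runs on Windows. The function name extract_listen_calls, the *.html file mask and the UTF-8 encoding are assumptions made for the example. The pattern here captures both quoted URLs and assumes the exact argument layout shown in the question.

```python
import re
from pathlib import Path

# Captures both quoted arguments of listen(...).
# Assumption: straight single quotes and the argument layout from the question.
LISTEN_CALL = re.compile(r"javascript:listen\('([^']*)','([^']*)',\d+\)")


def extract_listen_calls(folder):
    """Return (file name, first URL, second URL) for every listen() call found."""
    results = []
    # Assumption: the files of interest end in .html and sit directly in folder.
    for path in sorted(Path(folder).glob("*.html")):
        # errors="replace" keeps the scan going if a file is not valid UTF-8.
        text = path.read_text(encoding="utf-8", errors="replace")
        for match in LISTEN_CALL.finditer(text):
            results.append((path.name, match.group(1), match.group(2)))
    return results
```

Calling extract_listen_calls on the folder that holds the files returns a list that can be printed or written to a text file, one line per match.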

2 Responses to “Extracting a Particular Text Pattern from HTML Files?”

  1. Vyshali P Says:

    You have the following options:

    1. Get grep for Windows and use it to search for your expression in the files.
    2. Use the Windows find command. It works like grep but is not an exact equivalent: it matches literal strings only, not regular expressions.

    The following options are available with find:

    C:\>find /?
    Searches for a text string in a file or files.

    FIND [/V] [/C] [/N] [/I] [/OFF[LINE]] "string" [[drive:][path]filename[ ...]]

    /V         Displays all lines NOT containing the specified string.
    /C         Displays only the count of lines containing the string.
    /N         Displays line numbers with the displayed lines.
    /I         Ignores the case of characters when searching for the string.
    /OFF[LINE] Do not skip files with offline attribute set.
    "string"   Specifies the text string to find.
    [drive:][path]filename
               Specifies a file or files to search.

    If a path is not specified, FIND searches the text typed at the prompt
    or piped from another command.

    I hope this helps.
    For further help, send a mail to help@paijwar.com
    References :
    http://www.paijwar.com
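
For comparison with the find command described above, here is a small Python sketch of the same idea: a literal-string filter that reports matching lines with their line numbers, roughly in the spirit of the /N and /I switches. The function name find_lines and the sample lines are assumptions made for the example. This is not a reimplementation of find.

```python
def find_lines(needle, lines, ignore_case=False):
    """Return (line number, line) pairs for the lines that contain needle."""
    if ignore_case:
        needle = needle.lower()
    hits = []
    for number, line in enumerate(lines, start=1):
        haystack = line.lower() if ignore_case else line
        if needle in haystack:
            # Strip the trailing newline so the line prints cleanly.
            hits.append((number, line.rstrip("\n")))
    return hits


# Hypothetical sample content standing in for one of the HTML files.
sample = [
    "<html><body>\n",
    "<a href=\"javascript:listen('http://www.domainname.com','http://www.anotherdomainname.com/somefilename.ext',0)\">play</a>\n",
    "</body></html>\n",
]

for number, line in find_lines("javascript:listen(", sample):
    print(f"[{number}]{line}")
```

Because the search string is literal, this keeps every line containing "javascript:listen(" whole, which is closer to the original request for complete lines than extracting only the URL.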
