Hello,

I am working on a project where I'm using Python to parse HTML pages
and transform the data between certain tags. Currently I'm using the
HTMLParser class for this. In a nutshell, it's pretty simple: I feed
the contents of the HTML page to HTMLParser and override the
appropriate handle_* method to receive the extracted data. In that
method, I take the data and transform it into another string based on
some logic.

Now, what I would like to do here is take that transformed string and
put it "back into" the HTML document. Has anybody ever implemented
something like this with HTMLParser?

I'm thinking of having HTMLParser append everything it reads, except
the data between tags, to some kind of buffer. In my own handle_*
overrides I would then append the transformed data to that same
buffer. Once the page has finished parsing, ideally I could print the
contents of the buffer and the HTML would be identical except for the
string transformations.

I also need to make sure that all newlines, tags, spacing, etc. are
kept intact; this is a requirement for other reasons.
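
To make the idea concrete, here is a rough sketch of one way it might
look, assuming Python 3's html.parser module. The names DataRewriter,
rewrite and transform are made up for illustration. Instead of
re-emitting each tag into a buffer, it uses getpos() to note where
each text chunk sits in the original source and splices the
transformed text in afterwards, so everything else is copied verbatim.
Note that the transform would see every text chunk, including
whitespace-only ones and the contents of script or style elements.

```python
from html.parser import HTMLParser


class DataRewriter(HTMLParser):
    """Hypothetical helper: rewrite text between tags, copy the rest verbatim."""

    def __init__(self, transform):
        # convert_charrefs=False so text chunks arrive exactly as written
        # in the source (entities are reported separately and left alone).
        super().__init__(convert_charrefs=False)
        self.transform = transform
        self.source = ""
        self.line_starts = [0]
        self.edits = []  # list of (start, end, replacement)

    def rewrite(self, source):
        self.reset()  # restart line/column tracking for this document
        self.source = source
        # Absolute offset at which each line begins (lines split on "\n").
        self.line_starts = [0] + [
            i + 1 for i, ch in enumerate(source) if ch == "\n"
        ]
        self.edits = []
        self.feed(source)
        self.close()

        # Stitch the output: untouched slices plus the replacements.
        out, last = [], 0
        for start, end, new_text in self.edits:
            out.append(source[last:start])
            out.append(new_text)
            last = end
        out.append(source[last:])
        return "".join(out)

    def handle_data(self, data):
        # getpos() gives (line, column) of the chunk being handled.
        line, col = self.getpos()
        start = self.line_starts[line - 1] + col
        end = start + len(data)
        # Only replace when the chunk really matches the source slice;
        # otherwise leave the original text untouched.
        if self.source[start:end] == data:
            self.edits.append((start, end, self.transform(data)))


if __name__ == "__main__":
    page = "<p>Hello,  <B>world</B>\n &amp; more</p>\n"
    print(DataRewriter(str.upper).rewrite(page), end="")
```

With a transform that returns its input unchanged, the output should
be identical to the input, which is the property I need.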

Thanks!
