Defuddle is an open-source JS library I built to parse and extract the main content and metadata from web pages. It can also return the content as Markdown.
I built Defuddle while working on Obsidian Web Clipper[1] (also MIT-licensed) because Mozilla's Readability[2] appears to be mostly abandoned, and didn't work well for many sites.
It's still very much a work in progress, but I thought I'd share it today, in light of the announcement that Mozilla is shutting down Pocket. This library could be helpful to anyone building a read-it-later app.
Defuddle is also available as a CLI:
https://github.com/kepano/defuddle-cli
[1] https://github.com/obsidianmd/obsidian-clipper
[2] https://github.com/mozilla/readability
In the end I found the python trifatura library to extract the best quality content with accurate meta data.
You might want to compare your implementation to trifatura to see if there is room for improvement.
reply