Comments (4)
https://www.temu.com/privacy-and-cookie-policy.html
The same happens with this site: trafilatura doesn't scrape the entire page content.
from trafilatura.
The page https://kickstarter.mycaptain.in/privacy-policy puts content in <section> tags, which is rarely seen (<p> tags are expected).
The page https://www.shopify.com/legal/privacy uses several <article> tags within a <main> frame, which confuses the algorithm.
In both cases we need to further develop the capacity to detect text segments. Thank you for your feedback.
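To check whether another page falls into one of these two cases, it can help to measure which tags actually hold its text. The following is a minimal sketch using only the Python standard library; it is not trafilatura's own logic, and the `CONTAINERS` set, the `TextByContainer` class, and the `text_by_container` helper are hypothetical names chosen for illustration.

```python
from collections import Counter
from html.parser import HTMLParser

# Tags treated as candidate text containers (an assumption for this sketch).
CONTAINERS = {"p", "section", "article", "main", "div"}


class TextByContainer(HTMLParser):
    """Count visible characters by their nearest enclosing container tag."""

    def __init__(self):
        super().__init__()
        self.stack = []          # currently open container tags
        self.chars = Counter()   # tag name -> number of text characters

    def handle_starttag(self, tag, attrs):
        if tag in CONTAINERS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag, if there is one.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            self.chars[self.stack[-1]] += len(text)


def text_by_container(html):
    """Return a dict mapping container tag names to character counts."""
    parser = TextByContainer()
    parser.feed(html)
    parser.close()
    return dict(parser.chars)


# Hypothetical snippet: most of the text sits directly in <section>.
sample = "<main><section>Policy text here</section><p>Short</p></main>"
print(text_by_container(sample))
```

If most characters are attributed to `section` or `article` rather than `p`, the page likely resembles the cases described above.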
Appreciate your reply.
Same problem here, any update?