Comments (6)
Commands
# download the Alexa top 1m CSV (rank,domain); gunzip can read a single-member zip from stdin
curl -kLs https://s3.amazonaws.com/alexa-static/top-1m.csv.zip | gunzip > top-1m.csv
# top 1m in world
awk -F ',' '{print $2}' top-1m.csv > top-1m.txt
# top list in china
grep -Fx -f chinalist.txt top-1m.txt > top-cn.txt
# top 5k in china
head -5000 top-cn.txt > top-cn-5000.txt
# top 1k in china
head -1000 top-cn.txt > top-cn-1000.txt
# top 500 in china
head -500 top-cn.txt > top-cn-500.txt
from daily.
It seems that the Alexa top list is now restricted to paid customers, and the link is no longer updated and is incomplete (source).
The Cisco Umbrella Popularity List is based on DNS queries and reflects the most commonly connected domains (including those serving website resources), not just the "main domains".
I ran some tests and found that the Umbrella list shares more entries with both gfwlist and chinalist than the Alexa list does.
- Umbrella: 1661 with gfwlist, 8562 with chinalist
- Alexa: 1444 with gfwlist, 3029 with chinalist
In addition, domains that appear only in the Umbrella list look much more commonly used than those that appear only in the Alexa list. Therefore, I would suggest replacing the Alexa list with the Umbrella list. This could better serve the purpose of generating a shorter chinalist and/or gfwlist.
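The overlap counts above can be reproduced with a small helper. This is only a sketch: the function name `count_overlap` and the file names in the usage note are assumptions, and both input files are assumed to hold one bare domain per line (as `top-1m.txt` does after the `awk` step).

```shell
# Count how many lines of a ranked top list also appear in a domain list.
# Both files are assumed to contain one bare domain per line.
count_overlap() {
    toplist=$1
    domainlist=$2
    # -F: fixed strings, -x: whole-line match, -c: print only the number of matching lines
    grep -Fxc -f "$domainlist" "$toplist"
}
```

For example, `count_overlap top-1m.txt chinalist.txt` would print the number of top-list entries that are also in chinalist. Matching is exact, so a subdomain in one list does not match its parent domain in the other.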
> It seems that the Alexa top list is now restricted to paid customers, and the link is no longer updated and is incomplete (source).
https://s3.amazonaws.com/alexa-static/top-1m.csv.zip has never stopped updating, and the Umbrella list contains too many subdomains, so it is not applicable here.
> Umbrella list contains too many subdomains, so it is not applicable here.
Actually, that is one advantage of the Umbrella list, because both gfwlist and accelerated-domains.china.conf contain a large number of subdomains, e.g. where only a subdomain is blocked or only a subdomain has servers in CN. Compared with the Alexa toplist, the Umbrella list can therefore match more domains in gfwlist and accelerated-domains.china.conf.
The other thing is that the Alexa list mainly focuses on the main domain of a website and ignores (or puts much less weight on) domains serving static files for that website.
Here are some test chinalists sorted by the Umbrella toplist, sorted by the Alexa toplist, and sorted by both, generated today. As you can see, the Umbrella list has 3231 matches with accelerated-domains.china.conf, whereas the Alexa list has 2642 matches, and there are 5038 matches if both Umbrella and Alexa are used.
Personally, I'm using the Umbrella list only, and the generated gfwlist and chinalist work well (BTW, the matches can be used to generate a short chinalist for mobile apps). But maybe it is worth using both Umbrella and Alexa? Just for your consideration.
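One possible way to use both lists is to interleave them by rank and drop duplicates before filtering. The thread does not say how the "sorted by both" files were actually produced, so the interleaving strategy, the function name `merge_toplists`, and the file names in the usage note are all assumptions made for illustration.

```shell
# Interleave two ranked domain lists (one domain per line) and keep the
# first occurrence of each domain, so higher-ranked entries stay near the top.
merge_toplists() {
    # paste -d '\n' alternates lines from the two files;
    # NF skips the empty lines emitted once the shorter file runs out;
    # !seen[$0]++ keeps only the first occurrence of each line.
    paste -d '\n' "$1" "$2" | awk 'NF && !seen[$0]++'
}
```

A hypothetical run would be `merge_toplists umbrella-1m.txt top-1m.txt > top-merged.txt`, followed by the same `grep -Fx -f chinalist.txt` filter used for the Alexa list alone.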
I'm sorry, I haven't touched this project in a long time, and I know almost nothing about its logic now.
Maybe I will deal with it later if I have time.
Reopen.