
add alexa list (daily, open)

pexcn commented on May 18, 2024
add alexa list

from daily.

Comments (6)

pexcn commented on May 18, 2024

Commands

curl -kLs https://s3.amazonaws.com/alexa-static/top-1m.csv.zip | gunzip > top-1m.csv

# top 1m in world
awk -F ',' '{print $2}' top-1m.csv > top-1m.txt

# top list in china
grep -Fx -f chinalist.txt top-1m.txt > top-cn.txt

# top 5k in china
head -5000 top-cn.txt > top-cn-5000.txt

# top 1k in china
head -1000 top-cn.txt > top-cn-1000.txt

# top 500 in china
head -500 top-cn.txt > top-cn-500.txt
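
To make the `grep -Fx -f` step concrete, here is a minimal sketch on made-up data (the domains and ranks below are illustrative, not real Alexa entries). Because grep prints matching lines in the order they appear in its input file, top-cn.txt keeps the rank order of top-1m.txt, which is why the `head` calls above yield the top N.

```shell
# Illustrative sample data only; the real inputs are top-1m.csv and chinalist.txt.
workdir=$(mktemp -d) && cd "$workdir"

# rank,domain pairs in the same layout the awk command above expects
printf '1,google.com\n2,baidu.com\n3,youtube.com\n4,qq.com\n5,taobao.com\n' > top-1m.csv
# chinalist.txt: one domain per line, in arbitrary order
printf 'taobao.com\nbaidu.com\nqq.com\n' > chinalist.txt

# keep only the domain column
awk -F ',' '{print $2}' top-1m.csv > top-1m.txt

# -F: fixed strings, -x: whole-line match, -f: read patterns from a file
grep -Fx -f chinalist.txt top-1m.txt > top-cn.txt

# prints the matches in rank order, not in chinalist.txt order
cat top-cn.txt
```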


wyf88 commented on May 18, 2024

It seems that the Alexa top list is now restricted to paid customers, and the linked file is no longer updated and is incomplete (source).

The Cisco Umbrella Popularity List takes DNS queries into account and reflects the most commonly connected domains (including those serving website resources), so it is not restricted to "main domains".

I ran some tests and found that the Umbrella list shares more entries with both gfwlist and chinalist than the Alexa list does.

  • Umbrella: 1661 with gfwlist, 8562 with chinalist
  • Alexa: 1444 with gfwlist, 3029 with chinalist
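
Counts of this kind can be reproduced with a whole-line intersection. This is a minimal sketch, assuming each list has already been reduced to one domain per line; the file names umbrella.txt and alexa.txt and their contents are hypothetical stand-ins, not the data behind the numbers above.

```shell
# Hypothetical sample lists, one domain per line.
workdir=$(mktemp -d) && cd "$workdir"

printf 'baidu.com\nqq.com\ngoogle.com\ntaobao.com\n' > umbrella.txt
printf 'google.com\nbaidu.com\nyoutube.com\n' > alexa.txt
printf 'baidu.com\nqq.com\ntaobao.com\n' > chinalist.txt

# -c prints how many toplist lines exactly match some chinalist entry
umbrella_hits=$(grep -Fxc -f chinalist.txt umbrella.txt)
alexa_hits=$(grep -Fxc -f chinalist.txt alexa.txt)

echo "Umbrella: $umbrella_hits with chinalist"
echo "Alexa: $alexa_hits with chinalist"
```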

In addition, the domains that appear only in the Umbrella list look much more commonly used than those that appear only in the Alexa list. Therefore, I would suggest replacing the Alexa list with the Umbrella list. This could better serve the purpose of generating a shorter chinalist and/or gfwlist.


pexcn commented on May 18, 2024

> It seems that the Alexa top list is now restricted to paid customers, and the linked file is no longer updated and is incomplete (source).

https://s3.amazonaws.com/alexa-static/top-1m.csv.zip has never stopped updating. The Umbrella list also contains too many subdomains, so it is not applicable here.


wyf88 commented on May 18, 2024

> Umbrella list contains too many subdomains, so it is not applicable here.

Actually, that is one advantage of the Umbrella list: both gfwlist and accelerated-domains.china.conf contain a large number of subdomains, e.g. where only a subdomain is blocked, or only a subdomain has servers in CN. Compared with the Alexa toplist, the Umbrella list can therefore match more domains in gfwlist and accelerated-domains.china.conf.

The other thing is that the Alexa list mainly focuses on the main domain of a website and ignores (or puts much less weight on) domains serving static files for that website.
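
The subdomain point can be illustrated with a small sketch. It assumes exact whole-line matching (as with `grep -Fx` in the commands earlier in the thread); all domains and file names below are hypothetical.

```shell
workdir=$(mktemp -d) && cd "$workdir"

# a list entry that names only a subdomain (hypothetical)
printf 'static.example.cn\n' > chinalist.txt
# a toplist of main domains only vs. one that also records subdomains
printf 'example.cn\n' > main-domains.txt
printf 'example.cn\nstatic.example.cn\n' > with-subdomains.txt

# grep -c prints 0 but exits non-zero when nothing matches, hence "|| true"
main_hits=$(grep -Fxc -f chinalist.txt main-domains.txt || true)
sub_hits=$(grep -Fxc -f chinalist.txt with-subdomains.txt || true)

echo "main domains only: $main_hits"
echo "with subdomains: $sub_hits"
```

With exact matching, the subdomain entry is found only in the toplist that records subdomains.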

Here are some test chinalists sorted by the Umbrella toplist, by the Alexa toplist, and by both, all generated today. As you can see, the Umbrella list has 3231 matches with accelerated-domains.china.conf, whereas the Alexa list has 2642 matches, and there are 5038 matches if both Umbrella and Alexa are used.
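
The thread does not say how the "both" list was built. One plausible reading is a union of the two match sets; under that assumption the two sets would share 3231 + 2642 - 5038 = 835 domains. A minimal sketch of such a union, with hypothetical file names and sample data, keeping Umbrella order first:

```shell
workdir=$(mktemp -d) && cd "$workdir"

# hypothetical per-toplist match results, one domain per line
printf 'baidu.com\nstatic.example.cn\nqq.com\n' > cn-by-umbrella.txt
printf 'baidu.com\ntaobao.com\n' > cn-by-alexa.txt

# concatenate, then drop repeated lines while keeping first-seen order
cat cn-by-umbrella.txt cn-by-alexa.txt | awk '!seen[$0]++' > cn-by-both.txt

# number of distinct domains in the union
both_count=$(wc -l < cn-by-both.txt | tr -d ' ')
echo "both: $both_count"
```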

Personally, I'm using only the Umbrella list, and the generated gfwlist and chinalist work well (BTW, the matches can be used to generate a short chinalist for mobile apps). But maybe it is worth using both Umbrella and Alexa? Just for your consideration.


pexcn commented on May 18, 2024

I'm sorry, I haven't touched this project for a long time, and I know almost nothing about its logic now.
Maybe I will deal with it later if I have time.


pexcn commented on May 18, 2024

Reopen.

