danithaca / netizen
Automatically exported from code.google.com/p/netizen
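When converting HTML to XML, or XML to TXT, something odd happens: all of the text ends up in the
first node, so the first node contains the entire thread. This does not look right.
It is hard to reproduce. One example is tianya: 143469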
Original issue reported on code.google.com by [email protected]
on 4 Jun 2010 at 2:14
When there's an <img> tag in the HTML data, BeautifulSoup ignores the
rest of the text after the <img>. Need to fix somehow.
Original issue reported on code.google.com by [email protected]
on 4 Jun 2010 at 2:11
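One possible workaround for the <img> problem above, offered only as a sketch: remove the <img> tags from the raw HTML before it is handed to the parser. This assumes the text loss is triggered by the tag itself, which the issue does not confirm. The helper name `strip_img_tags` is hypothetical and not part of the project; the regular expression is a simple one that would not handle a `>` character inside an attribute value.

```python
import re

# Matches <img ...> and <img ... /> tags, case-insensitively.
# Assumption: attribute values do not contain a literal '>'.
IMG_TAG = re.compile(r"<img\b[^>]*>", re.IGNORECASE)


def strip_img_tags(html: str) -> str:
    """Return the HTML with every <img> tag removed (hypothetical helper)."""
    return IMG_TAG.sub("", html)


sample = '<p>before<img src="a.gif">after</p>'
cleaned = strip_img_tags(sample)
print(cleaned)
```

The cleaned string can then be passed to the existing parsing step unchanged. This drops the images entirely, which seems acceptable here because only the text is used.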
腐败 as a verb and 腐败 as a noun are treated as the same term in the current
TextNetwork.py script. This should be fine for our purpose, but I'm a
little concerned that it'll mess up the user dictionary.
Original issue reported on code.google.com by [email protected]
on 4 Jun 2010 at 2:48
People's Daily articles still have the title, author, page section (版面), article number (编号), etc. mixed in with
the contents. Need to clean/extract the metadata from the text contents.
Original issue reported on code.google.com by [email protected]
on 4 Jun 2010 at 2:37
Some considerations:
1. Replies sometimes quote earlier messages. The quoted messages should
be removed.
2. BeautifulSoup sometimes doesn't work very well (e.g., the <img> tag). Needs
more work.
Original issue reported on code.google.com by [email protected]
on 4 Jun 2010 at 2:35
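A minimal sketch of the quote removal mentioned in point 1 above. The issue does not say how quotes are marked up on the forum, so this assumes a quote is a verbatim, line-for-line copy of text from an earlier message in the same thread. The function name `remove_quoted_lines` is hypothetical.

```python
from typing import List


def remove_quoted_lines(reply: str, earlier_messages: List[str]) -> str:
    """Drop every line of `reply` that also appears in an earlier message.

    Assumption: quoted text is copied verbatim, one line at a time.
    """
    seen = set()
    for message in earlier_messages:
        for line in message.splitlines():
            stripped = line.strip()
            if stripped:  # ignore blank lines so they are never treated as quotes
                seen.add(stripped)
    kept = [line for line in reply.splitlines() if line.strip() not in seen]
    return "\n".join(kept)


earlier = ["The milk is unsafe.", "Who is responsible?"]
reply = "The milk is unsafe.\nI think so too."
print(remove_quoted_lines(reply, earlier))
```

Exact matching keeps the sketch simple, but it would also drop short replies that happen to repeat an earlier line, and it would miss quotes that were edited or truncated.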
In the tianya-milk network, the synonymous terms CCTV, 央视, and **电视台
should be collapsed into CCTV.
Original issue reported on code.google.com by [email protected]
on 4 Jun 2010 at 2:08
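The synonym collapsing described above could be sketched as a lookup table applied to the term list before the network is built. The table and the names `SYNONYMS`, `canonical`, and `collapse_terms` are assumptions for illustration, not part of the project. Only 央视 is mapped here; other variants would be added to the table in the same way.

```python
from collections import Counter
from typing import Iterable, List

# Hypothetical synonym table: variant term -> canonical term.
SYNONYMS = {
    "央视": "CCTV",
}


def canonical(term: str) -> str:
    """Return the canonical form of a term, or the term itself if unmapped."""
    return SYNONYMS.get(term, term)


def collapse_terms(terms: Iterable[str]) -> List[str]:
    """Replace every term with its canonical form, preserving order."""
    return [canonical(term) for term in terms]


terms = ["CCTV", "央视", "牛奶"]
collapsed = collapse_terms(terms)
counts = Counter(collapsed)
print(counts)
```

Applying the mapping before counting means the variants share one node in the network instead of splitting their weight across several.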
The current Chinese analyzer in Lucene uses the old ICTCLAS code. Perhaps
it can be set to use the new ICTCLAS library. This first requires wrapping
the ICTCLAS library in the StandardAnalyzer Lucene class.
However, this is not very important because Lucene is now used only to
search for terms in text. The network building process first outputs the
terms and then builds the network.
Original issue reported on code.google.com by [email protected]
on 4 Jun 2010 at 2:24
The current logic is error-prone and perhaps not optimized for code-reuse.
Think about refactoring it.
Original issue reported on code.google.com by [email protected]
on 4 Jun 2010 at 2:40