fra82 / twitter-crawler
REST and STREAMING crawlers of Twitter (java)
I am using the REST Keyword crawler, and I only want English tweets.
I followed the snippet in README.md, but I still got tweets in multiple languages. So I checked the file "crawler.properties" and changed:
####################################################################################
# REST Cralwer of Twitter - by keyword(s)
# Class: org.backingdata.twitter.crawler.rest.TwitterRESTKeywordSearchCrawler
# - Full path of the txt file to read terms from (one term ID per line)
tweetKeyword.fullPathKeywordList=keywords.txt
# - Full path of the output folder to store crawling results
tweetKeyword.fullOutputDirPath=./data/
# - Storage format: "json" to store one tweet per line as tweet JSON object or "tab" to store
# one tweet per line as TWEET_IDTWEET_TEXT
tweetID.outputFormat=json
# - If not empty, it is possible specify a language to retrieve only tweet of a specific language
# (en, es, it, etc.) - if empty all tweet are retrieved, indipendently from their language
# IMPORTANT: The language code may be formatted as ISO 639-1 alpha-2 (en), ISO 639-3 alpha-3 (msa), or ISO 639-1 alpha-2 combined with an ISO 3166-1 alpha-2 localization (zh-tw).
tweetID.languageFilter=en
into:
####################################################################################
# REST Cralwer of Twitter - by keyword(s)
# Class: org.backingdata.twitter.crawler.rest.TwitterRESTKeywordSearchCrawler
# - Full path of the txt file to read terms from (one term ID per line)
tweetKeyword.fullPathKeywordList=keywords.txt
# - Full path of the output folder to store crawling results
tweetKeyword.fullOutputDirPath=./data/
# - Storage format: "json" to store one tweet per line as tweet JSON object or "tab" to store
# one tweet per line as TWEET_IDTWEET_TEXT
tweetKeyword.outputFormat=json
# - If not empty, it is possible specify a language to retrieve only tweet of a specific language
# (en, es, it, etc.) - if empty all tweet are retrieved, indipendently from their language
# IMPORTANT: The language code may be formatted as ISO 639-1 alpha-2 (en), ISO 639-3 alpha-3 (msa), or ISO 639-1 alpha-2 combined with an ISO 3166-1 alpha-2 localization (zh-tw).
tweetKeywod.languageFilter=en
And it worked!
Thanks for your great tool; it is useful and has helped a lot!
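The change above comes down to the prefix of the property keys: the first snippet mixes `tweetKeyword.` and `tweetID.` keys inside the keyword crawler's section. A minimal sketch of a way to spot such mixed prefixes is shown below. The class `PropertyPrefixCheck` and its methods are hypothetical helpers, not part of twitter-crawler, and the sketch does not know which key names the crawler actually reads; it only groups the keys found in a section so that an unexpected prefix stands out.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper (not part of twitter-crawler): groups property keys by
// the text before the first dot, so mixed prefixes in one section are visible.
final class PropertyPrefixCheck {

    // Parses properties-file text; lines starting with '#' are comments.
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Maps each prefix to the sorted list of key suffixes that use it.
    static Map<String, List<String>> groupByPrefix(Properties props) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (String key : new TreeSet<>(props.stringPropertyNames())) {
            int dot = key.indexOf('.');
            String prefix = dot < 0 ? "" : key.substring(0, dot);
            String suffix = dot < 0 ? key : key.substring(dot + 1);
            groups.computeIfAbsent(prefix, p -> new ArrayList<>()).add(suffix);
        }
        return groups;
    }

    public static void main(String[] args) {
        // The keyword-crawler section as it looked before the change.
        String section = String.join("\n",
                "# REST crawler section (keys copied from the issue)",
                "tweetKeyword.fullPathKeywordList=keywords.txt",
                "tweetKeyword.fullOutputDirPath=./data/",
                "tweetID.outputFormat=json",
                "tweetID.languageFilter=en");
        groupByPrefix(parse(section)).forEach((prefix, suffixes) ->
                System.out.println(prefix + " -> " + suffixes));
    }
}
```

If one section reports more than one prefix, the keys with the odd prefix are the ones to compare against the names the crawler class actually looks up.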
I use the REST Keyword crawler to collect tweets that contain any of several keywords.
My keywords.txt looks like:
love
hello
sentiment
The issue is that it only crawls tweets containing one specific keyword, e.g. "love", instead of tweets containing "love" OR "hello" OR "sentiment".
But when it launches, the keyword list shows all of the keywords to crawl:
<><><><><><><><><><><><><><><><><><><>
List of keywords to crawl:
1 keyword: love
2 keyword: sentiment
3 keyword: hello
<><><><><><><><><><><><><><><><><><><>
How can I crawl tweets that contain keyword A or B or C... using this tool?
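One possible approach, sketched here under assumptions: Twitter's standard search syntax accepts `OR` between terms and double quotes around an exact phrase, so the keywords can be combined into a single query string. The class `OrQueryBuilder` is a hypothetical helper, not part of twitter-crawler. Whether the crawler treats one line of keywords.txt as a raw search query (so that a line such as `love OR hello OR sentiment` would work) is not confirmed here and would need to be checked against its source.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of twitter-crawler): joins keywords into one
// search query that matches tweets containing any of the terms.
final class OrQueryBuilder {

    static String build(List<String> keywords) {
        List<String> terms = new ArrayList<>();
        for (String raw : keywords) {
            String term = raw.trim();
            if (term.isEmpty()) {
                continue; // skip blank lines from the keyword file
            }
            // Quote multi-word entries so they are searched as an exact phrase.
            terms.add(term.contains(" ") ? "\"" + term + "\"" : term);
        }
        return String.join(" OR ", terms);
    }

    public static void main(String[] args) {
        // Keywords taken from the keywords.txt shown in the issue.
        System.out.println(build(Arrays.asList("love", "hello", "sentiment")));
    }
}
```

For the three keywords in the issue, `build` returns `love OR hello OR sentiment`. If the crawler runs one search per line instead, an alternative is to keep one keyword per line and merge the result files afterwards, removing duplicate tweet IDs.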