Comments (7)
So far, there is only the list of examples in the README with a short description of which data is extracted. In addition, every example shows command-line help when called with the option --help.
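For instance, a minimal sketch (assuming a local Spark installation and the server_count.py example from this repository):
$SPARK_HOME/bin/spark-submit ./server_count.py --help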
What exactly do you need?
- one example call for every example, similar to the one for counting server names (see the sketch after this list)
- one (maybe two) in-depth tutorials on how to run an example, possibly including some steps on how to customize it
- documentation of how everything works: class hierarchy, principles of processing WARC/WAT/WET input, etc. Most of the examples are quite small (~50 lines of code), so I hope they are easy to understand. But the whole design of the CCSparkJob may appear arcane to Spark newbies.
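(For illustration, a minimal sketch of such a call; the input file ./input/test_warc.txt listing WARC paths and the output table name servernames are assumptions:)
$SPARK_HOME/bin/spark-submit ./server_count.py \
  --num_input_partitions 1 \
  --num_output_partitions 1 \
  ./input/test_warc.txt servernames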
Let me know what you need! You may also ask for help and support on the Common Crawl forum.
@sebastian-nagel Thank you for the reply.
The first one you mentioned is what I imagined when I wrote the issue.
The second option is great, if you have time.
The third option seems too much for this repository.
Here is my story of struggle, and it is still going on. You may skip reading this part.
I am using Ubuntu 18.04.3 LTS. What I want to achieve is to extract monolingual text from Common Crawl. I started from the command-line help of cc_index_word_count.py. I had to search to find the path to the Common Crawl index table, and I also figured out that the optional query argument is not actually optional. I also needed to change the default Java version. Those were fine.
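(For reference, on Ubuntu I switched the default Java with something like the following, picking the appropriate alternative interactively:)
sudo update-alternatives --config java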
Then, I got an error about s3: No FileSystem for scheme: s3. So I searched the internet and found that I needed extra packages, so I added --packages com.amazonaws:aws-java-sdk-pom:1.10.34,org.apache.hadoop:hadoop-aws:2.7.2 to the command and changed the path to s3a://...
Now, it complained about AWS credentials, even though I had run "aws configure". My solution was to export the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables.
So, this is the current bash script:
#!/bin/bash
export AWS_ACCESS_KEY_ID=***
export AWS_SECRET_ACCESS_KEY=***
query="SELECT url, warc_filename, warc_record_offset, warc_record_length FROM ccindex LIMIT 10"
$SPARK_HOME/bin/spark-submit \
  --conf spark.hadoop.parquet.enable.dictionary=true \
  --conf spark.hadoop.parquet.enable.summary-metadata=false \
  --conf spark.sql.hive.metastorePartitionPruning=true \
  --conf spark.sql.parquet.filterPushdown=true \
  --conf spark.sql.parquet.mergeSchema=true \
  --conf spark.dynamicAllocation.maxExecutors=1 \
  --packages com.amazonaws:aws-java-sdk-pom:1.10.34,org.apache.hadoop:hadoop-aws:2.7.2 \
  --executor-cores 1 \
  --num-executors 1 \
  cc_index_word_count.py --query "${query}" \
  --num_input_partitions 1 \
  --num_output_partitions 1 \
  s3a://commoncrawl/cc-index/table/cc-main/warc/ word_count
And I'm getting org.apache.http.conn.ConnectionPoolTimeoutException. I tried to limit the executors (somebody on the internet suggested it), but it doesn't work as I expected. The exception happens at the line df = spark.read.load(table_path) in sparkcc.py.
Thank you for reading!
Hi @calee88, thanks for the careful report. I've opened #13 and #14 to improve documentation and command-line help.
When querying the columnar index (--query): the data is located in the AWS us-east-1 region (Northern Virginia). It can be accessed remotely, but this requires a reliable and fast internet connection. In case you have your own AWS account, there are two options to avoid timeouts:
- use Athena to execute the query, download the result, and pass it to the cc-pyspark job via --csv (see the sketch after this list)
- move the job execution closer to the data, ideally into the AWS cloud in the us-east-1 region
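(A minimal sketch of the first option, assuming the Athena result was exported to a local file athena_result.csv with the columns url, warc_filename, warc_record_offset and warc_record_length, and that the positional arguments stay as in your command above:)
$SPARK_HOME/bin/spark-submit \
  --packages com.amazonaws:aws-java-sdk-pom:1.10.34,org.apache.hadoop:hadoop-aws:2.7.2 \
  ./cc_index_word_count.py \
  --csv athena_result.csv \
  s3a://commoncrawl/cc-index/table/cc-main/warc/ word_count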
Let me know whether this works for you!
Thank you for the reply @sebastian-nagel! I'm using a reliable and fast internet connection, although I'm far from Northern Virginia, so I don't think the connection should be the problem here. Have you tried to access the index remotely using the script I posted? Were you successful?
Anyway, I'm going to try Athena or AWS as you suggested.
Hello @sebastian-nagel.
I am now able to query using Athena and use the csv file for the script.
I still cannot use the --query argument, but let me close this, as my original issue is summarized in #13.
Thanks, @calee88, for the feedback. #13 will get addressed soon. Yes, I'm able to run the script
- from Europe connected via fibre
- using Spark 2.4.4 and the following options:
/opt/spark/2.4.4/bin/spark-submit --executor-cores 1 --num-executors 1 \
--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3 \
./cc_index_word_count.py \
--query "SELECT url, warc_filename, warc_record_offset, warc_record_length FROM ccindex WHERE crawl = 'CC-MAIN-2019-51' AND subset = 'warc' AND url_host_tld = 'is' LIMIT 10" \
s3a://commoncrawl/cc-index/table/cc-main/warc/ ccindexwordcount
- looks like the Spark SQL engine isn't really efficient regarding LIMIT: without further restrictions in the WHERE part, it seems to look into every part of the table, which currently contains 200 billion rows. That's why I've put extra restrictions: only Dec 2019, only the "warc" subset and only Icelandic sites.
- the job took 10 min. and finally I've got the word counts:
df = sqlContext.read.parquet("spark-warehouse/ccindexwordcount")
for row in df.sort(df.val.desc()).take(10): print("%6i\t%6i\t%s" % (row['val']['tf'], row['val']['df'], row['key']))
...
245 8 hd
154 10 the
97 8 movies
76 10 of
71 8 2019
69 10 and
64 10 to
62 10 online
62 2 football
61 10 free
- because this looks pretty English, I've checked whether there are at least some Icelandic words (the letter ð is specific):
for row in df.filter(df['key'].contains('ð')).take(10): print("%6i\t%6i\t%s" % (row['val']['tf'], row['val']['df'], row['key']))
...
1 1 tónleikaferð
1 1 annað
1 1 aðalmynd
2 2 sláðu
6 2 vönduð
1 1 viðeigandi
5 2 iðnó
2 2 með
2 2 lesið
3 2 jólastuði
Thank you for the reply @sebastian-nagel. Athena seems much faster, so I'll just keep using it. I hope someone finds this thread helpful.