Wget Git mirror

License: GNU General Public License v3.0

Shell 3.04% Perl 10.11% C 68.74% Makefile 1.17% Python 12.15% Module Management System 2.59% Lex 0.23% DIGITAL Command Language 0.24% M4 1.73%

wget's Introduction

                                                          -*- text -*-
GNU Wget
========
                  Current Web home: https://www.gnu.org/software/wget/

GNU Wget is a free utility for non-interactive download of files from
the Web.  It supports HTTP, HTTPS, and FTP protocols, as well as
retrieval through HTTP proxies.

It can follow links in HTML pages and create local versions of remote
web sites, fully recreating the directory structure of the original
site.  This is sometimes referred to as "recursive downloading."
While doing that, Wget respects the Robot Exclusion Standard
(/robots.txt).  Wget can be instructed to convert the links in
downloaded HTML files to the local files for offline viewing.

Recursive downloading also works with FTP, where Wget can retrieve a
hierarchy of directories and files.

With both HTTP and FTP, Wget can check whether a remote file has
changed on the server since the previous run, and only download the
newer files.

Wget has been designed for robustness over slow or unstable network
connections; if a download fails due to a network problem, it will
keep retrying until the whole file has been retrieved.  If the server
supports regetting, it will instruct the server to continue the
download from where it left off.

If you are behind a firewall that requires the use of a socks style
gateway, you can get the socks library and compile wget with support
for socks.

Most of the features are configurable, either through command-line
options or via the initialization file .wgetrc.  Wget allows you to
install a global startup file (/usr/local/etc/wgetrc by default) for
site settings.

Wget works under almost all Unix variants in use today and, unlike
many of its historical predecessors, is written entirely in C, thus
requiring no additional software, such as Perl.  The external software
it does work with, such as OpenSSL, is optional.  Because Wget uses GNU
Autoconf, it is easily built on and ported to new Unix-like systems.
The installation procedure is described in the INSTALL file.

As with other GNU software, the latest version of Wget can be found at
the master GNU archive site ftp.gnu.org, and its mirrors.  Wget
resides at <ftp://ftp.gnu.org/pub/gnu/wget/>.

Please report bugs in Wget to <bug-wget@gnu.org>.

See the file `MAILING-LIST' for information about Wget mailing lists.
Wget's home page is at <https://www.gnu.org/software/wget/>.

If you would like to contribute code for Wget, please read
CONTRIBUTING.md.

Wget was originally written and maintained by Hrvoje Niksic.  Please see
the file AUTHORS for a list of major contributors, and the ChangeLogs
for a detailed listing of all contributions.


Copyright (C) 1995-2023 Free Software Foundation, Inc.

This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 3 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301
USA.

Additional permission under GNU GPL version 3 section 7

If you modify this program, or any covered work, by linking or
combining it with the OpenSSL project's OpenSSL library (or a
modified version of that library), containing parts covered by the
terms of the OpenSSL or SSLeay licenses, the Free Software Foundation
grants you additional permission to convey the resulting work.
Corresponding Source for a non-source form of such a combination
shall include the source code for the parts of OpenSSL used as well
as that of the covered work.

wget's People

Contributors

bagder, codervijo, cy6erbr4in, eli-zaretskii, feinorgh, filbranden, giuseppe, gnfalex, gvtulder, hubertta, jay, jcourreges, jff, juaristi, lifenjoiner, losgrandes, mehw, micahcowan, mllobet, modelrockettier, moparisthebest, nilsirl, rockdaboot, rohit89, sburford, stoeckmann, thozza, vapier, vyache76, yousong

wget's Issues

Segmentation fault in pure IPv4 wget

root@build-server-8:/home/smore/WGET/wget-1.18# ./src/wget -4 -np --limit-rate=100k --timeout=1 --delete-after --tries=1 --no-dns-cache --dns-servers=8.8.4.4 http://cpp.sh --bind-address=150.1.1.111 --bind-dns-address=150.1.1.111
--2020-06-13 23:12:27-- http://cpp.sh/
Resolving cpp.sh (cpp.sh)...
661 Total count = al1=0x2193d40, al2=0x600000077
662 Total count = al1=0x2193d40, al2=0x600000077Segmentation fault (core dumped)

root@build-server-8:/home/smore/WGET/wget-1.18# !gdb
gdb ./src/wget core
GNU gdb (Debian 7.12-6) 7.12.0.20161007-git
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
http://www.gnu.org/software/gdb/bugs/.
Find the GDB manual and other documentation resources online at:
http://www.gnu.org/software/gdb/documentation/.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./src/wget...(no debugging symbols found)...done.
[New LWP 5232]
Core was generated by `./src/wget -4 -np --limit-rate=100k --timeout=1 --delete-after --tries=1 --no-d'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x0000000000416a3f in merge_address_lists ()
(gdb) bt
#0 0x0000000000416a3f in merge_address_lists ()
#1 0x000000000041717c in lookup_host ()
#2 0x000000000040451b in connect_to_host ()
#3 0x000000000041e1d9 in establish_connection ()
#4 0x000000000041f3a9 in gethttp ()
#5 0x000000000042180a in http_loop ()
#6 0x0000000000430795 in retrieve_url ()
#7 0x00000000004298ba in main ()
(gdb)

The lookup_host function contains uninitialized struct address_list * pointers.
For IPv4-only queries on a Debian Jessie system, this causes the crash shown above.
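
One way to picture the kind of fix described here is a merge routine that tolerates a missing list, combined with a caller that sets any list it does not populate to NULL. The sketch below is illustrative only: addr_list_sketch and merge_lists_sketch are hypothetical stand-ins, not wget's actual struct address_list or merge_address_lists, and addresses are reduced to plain ints.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified stand-in for an address list (not wget's type). */
struct addr_list_sketch {
  int count;   /* number of entries in addrs */
  int *addrs;  /* entries; plain ints here for brevity */
};

/* Merge two lists into a newly allocated one.  Either argument may be
   NULL, which is treated as an empty list.  Returns NULL when both are
   empty or when allocation fails. */
struct addr_list_sketch *
merge_lists_sketch (const struct addr_list_sketch *a,
                    const struct addr_list_sketch *b)
{
  int na = a ? a->count : 0;
  int nb = b ? b->count : 0;
  struct addr_list_sketch *out;

  if (na + nb == 0)
    return NULL;

  out = malloc (sizeof *out);
  if (!out)
    return NULL;

  out->addrs = malloc ((size_t) (na + nb) * sizeof *out->addrs);
  if (!out->addrs)
    {
      free (out);
      return NULL;
    }

  if (na > 0)
    memcpy (out->addrs, a->addrs, (size_t) na * sizeof *out->addrs);
  if (nb > 0)
    memcpy (out->addrs + na, b->addrs, (size_t) nb * sizeof *out->addrs);
  out->count = na + nb;
  return out;
}
```

In an IPv4-only lookup, the caller would pass NULL for the list it never filled in, rather than an indeterminate pointer such as the al2=0x600000077 value printed in the output above.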

Build-Steps:

./configure PKG_CONFIG_PATH="/usr/lib/x86_64-linux-gnu/pkgconfig/" GNUTLS_CFLAGS="-I/usr/include/" --without-ssl --with-cares
make clean;make

A patch has been sent with the mail header "Patch: Segmentation fault in pure IPv4 wget #12".

Issue while downloading a dataset of videos.

Each video is saved into a separate, pre-existing folder with the right name. The names of all the recipes are stored in a text file.

  File "fs_download_videos.py", line 93, in <module>
    file_name = wget.download(video_link_src, curr_file_path + "/")
  File "/home/nsword/.local/lib/python3.8/site-packages/wget.py", line 534, in download
    filename = filename_fix_existing(filename)
  File "/home/nsword/.local/lib/python3.8/site-packages/wget.py", line 269, in filename_fix_existing
    name,ext = filename.rsplit('.', 1)
ValueError: not enough values to unpack (expected 2, got 1)

Integer overflows in parse_content_range() and gethttp()

Security Vulnerability Report

File: src/http.c

Functions: parse_content_range() and gethttp()

Vulnerability Type: Integer Overflow

Location: Lines 936, 942, 955 and 3739

Severity: High

Description:

In the parse_content_range() function, at lines 936, 942, 955, there exists a vulnerability related to an integer overflow. The vulnerability arises from the calculation of the variable num, which is assigned the value of

num = 10 * num + (*hdr - '0');

Because the input is not validated, both the multiplication and the addition can overflow and lead to unexpected behavior.

Furthermore, similarly to curl/curl#12983, the calculation of the contlen variable at line 3739 of gethttp() can also overflow:

contlen = last_byte_pos - first_byte_pos + 1;

Exploitation Scenario:

An attacker controlling a server may craft a malicious response with carefully chosen values in the Content-Range header, triggering an integer overflow during the calculation of num and contlen. This could potentially lead to various security issues, such as memory corruption, buffer overflows, or unexpected behavior, depending on how the num and contlen variables are subsequently used.

Impact:

The impact of this vulnerability could be severe, potentially leading to:

Memory Corruption: If the calculated num and contlen values are used to allocate memory or perform operations such as copying data, an integer overflow could result in memory corruption, leading to crashes or arbitrary code execution.

Security Bypass: In scenarios where the num and contlen values are used to enforce boundaries or permissions, an attacker may exploit the integer overflow to bypass security checks or gain unauthorized access to sensitive resources.

Denial of Service (DoS): A carefully crafted response exploiting the integer overflow could cause the application to enter an unexpected state or consume excessive resources, leading to a denial of service condition.

Recommendations:

Bounds Checking: Implement proper bounds checking to ensure that the values of num and contlen are within acceptable ranges before performing calculations.

Safe Arithmetic Operations: Consider using safer arithmetic operations or alternative calculation methods to prevent integer overflows, especially when dealing with potentially large or close-to-boundary values.

Input Validation: Validate input parameters to ensure they adhere to expected ranges and constraints before performing calculations.

Error Handling: Implement robust error handling mechanisms to gracefully handle scenarios where input parameters result in unexpected or invalid calculations.
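
The bounds-checking and safe-arithmetic recommendations could look roughly like the following. This is a sketch under stated assumptions, not a patch against src/http.c: it uses int64_t in place of whatever integer type the real code uses, and the helper names append_digit_checked and range_length_checked are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Append one decimal digit to *num, i.e. num = 10 * num + digit,
   but refuse (return false, leave *num unchanged) if the character
   is not a digit or the result would exceed INT64_MAX. */
bool
append_digit_checked (int64_t *num, char c)
{
  int digit;

  if (c < '0' || c > '9')
    return false;
  digit = c - '0';

  /* 10 * num + digit <= INT64_MAX  <=>  num <= (INT64_MAX - digit) / 10 */
  if (*num < 0 || *num > (INT64_MAX - digit) / 10)
    return false;

  *num = 10 * *num + digit;
  return true;
}

/* Compute last - first + 1 into *len.  Rejects negative or inverted
   ranges and the single case where the "+ 1" would overflow. */
bool
range_length_checked (int64_t first, int64_t last, int64_t *len)
{
  if (first < 0 || last < first)
    return false;
  /* With 0 <= first <= last the subtraction itself cannot overflow. */
  if (last - first == INT64_MAX)
    return false;

  *len = last - first + 1;
  return true;
}
```

A header parser built on these helpers would treat a false return as a malformed header and reject it, instead of continuing with a wrapped value.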

Severity Justification:

The presence of an integer overflow vulnerability at lines 936, 942, 955 and 3739 poses a high risk to the security and stability of the application. Exploitation of this vulnerability could lead to severe consequences, including memory corruption, security bypass, or denial of service conditions.

Affected Versions:

This vulnerability affects all versions of the application that include the vulnerable parse_content_range() and gethttp() functions.

References:

OWASP Integer Overflow
CWE-190: Integer Overflow or Wraparound
CERT Secure Coding - INT32-C

Conclusion:

The presence of an integer overflow vulnerability at lines 936, 942, 955 in the parse_content_range() function and line 3739 of gethttp() poses a high risk to the security and stability of the application. It is imperative to address this vulnerability promptly by implementing appropriate bounds checking and error handling mechanisms to prevent potential exploitation and associated security risks.

Add a security policy

Hey there!

I belong to an open source security research community, and a member (@Wninayyds) has found an issue, but doesn’t know the best way to disclose it.

If not a hassle, might you kindly add a SECURITY.md file with an email, or another contact method? GitHub recommends this best practice to ensure security issues are responsibly disclosed, and it would serve as a simple instruction for security researchers in the future.

Thank you for your consideration, and I look forward to hearing from you!

(cc @huntr-helper)

Currently wget does not support encrypted certificates. Can it be added?

So far, wget only supports client certificate files without a password:
--certificate=FILE client certificate file

Is it possible to add support for encrypted client certificate files with a password, for example like this? (As far as I know, curl supports this feature now.)
wget --cert <certificate[:password]>   Client certificate file and password

Having trouble downloading the new Llama 2 with 'wget' (I have a URL)

This is the error I keep getting, and I don't know enough to correct this issue myself. Can anyone help? This is what the issue looks like on my end when prompted, excluding the URL and other data:

C:/OMITTED MORE DATA HERE o>wget https://download2.llamameta.net REST OF URL WAS HERE
Warning: wildcards not supported in HTTP.
--2024-02-11 03:23:09-- URL WAS HERE
Resolving download2.llamameta.net (download2.llamameta.net)... 3.163.115.24, 3.163.115.8, 3.163.115.112, ...
Connecting to download2.llamameta.net (download2.llamameta.net)|3.163.115.24|:443... connected.
HTTP request sent, awaiting response... 403 Forbidden
2024-02-11 03:23:09 ERROR 403: Forbidden.

'Signature' is not recognized as an internal or external command,
operable program or batch file.
'Key-Pair-Id' is not recognized as an internal or external command,
operable program or batch file.
'Download-Request-ID' is not recognized as an internal or external command,
operable program or batch file.

Paths are truncated too early when `pathconf` is available

I'm running a wget --mirror operation on a fairly long URL (about 242 chars including the host folder) into another folder via --outdir, which adds another couple of chars.

This runs into the length limitation at wget/src/url.c, line 1523 in 9a35fe6:

if (max_length > 0 && outlen > max_length)

However, I think that limitation is too aggressive/sensitive: it takes the entire quoted path (i.e. at least the 242 chars) and compares it against pathconf(..., _PC_NAME_MAX) when that is available. See wget/src/utils.c, lines 2665 to 2671 in 9a35fe6:

#if HAVE_PATHCONF
ret = pathconf (*p ? p : ".", name);
if (!(ret < 0 && errno == ENOENT))
break;
#else
ret = PATH_MAX;
#endif

Note that _PC_NAME_MAX yields the maximum length of a filename, while PATH_MAX is the maximum length of a path. The former is 255 while the latter is 4096 on "usual" Linux systems. So the two code paths are not nearly identical!

Given that the "chomp buffer" size (19) is additionally subtracted, any output path is truncated at 255-19=236 chars. That isn't enough for use cases such as mine, which mirrors a larger hierarchy of folders (files at a depth of 10 folders).
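
The distinction drawn above can be sketched as a check that applies the filename limit to each path component and the path limit to the whole string. The function name path_fits and its parameters are hypothetical and not part of wget. A caller might obtain the two limits from pathconf (dir, _PC_NAME_MAX) and pathconf (dir, _PC_PATH_MAX).

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Return true if PATH respects both limits: the whole string is at
   most PATH_MAX_LEN bytes, and every '/'-separated component is at
   most NAME_MAX_LEN bytes.  Empty components (from "//") are allowed. */
bool
path_fits (const char *path, size_t name_max_len, size_t path_max_len)
{
  const char *p = path;

  if (strlen (path) > path_max_len)
    return false;

  while (*p)
    {
      /* Length of the current component, up to the next '/' or the end. */
      size_t len = strcspn (p, "/");
      if (len > name_max_len)
        return false;
      p += len;
      if (*p == '/')
        p++;
    }
  return true;
}
```

With limits of 255 and 4096, a 310-character path made of ten 30-character folders passes this check, whereas comparing its full length against 255 would reject it.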

HTTP request sent, awaiting response... 404 Not Found on existing page

The page exists but is not downloaded. I tried adding headers, without success. What should I do?
Is this a bug?
I am downloading wget2...

LANG=EN; wget --random-wait -4 --max-redirect 100 --level 100 --limit-rate=30K -r -nc -p -U Mozilla 'www.literotica.com/average-joe-ch-01?page=2'

URL transformed to HTTPS due to an HSTS policy
--2022-07-11 13:49:09-- https://www.literotica.com/average-joe-ch-01?page=2
Resolving www.literotica.com (www.literotica.com)... 216.150.65.200, 216.150.65.190
Connecting to www.literotica.com (www.literotica.com)|216.150.65.200|:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2022-07-11 13:49:10 ERROR 404: Not Found.

Please use gzip/gunzip when fetching webpages

More often than not I try recursively downloading a webpage using wget, only to have it download a single index.html.gz then stop. Obviously wget can't read gzipped files so it fails to find any links for recursive downloading... I ended up using this wget fork that was last updated 10 years ago and it works fine, however I find it odd that such a basic feature never made it into mainline wget.

Please add a feature for automatically detecting and uncompressing gzipped webpages before crawling them.

"sed -i" in bootstrap.conf is not platform-neutral

Commit 32e26dc causes bootstrap to fail on OS X with:

  • invoke gl_INIT in ./configure.ac.
    Creating lib/unicase/special-casing-table.h for gperf < 3.1
    sed: 1: "lib/unicase/special-cas ...": extra characters at the end of l command
    ./bootstrap: bootstrap_post_import_hook failed

Per the discussion at
https://stackoverflow.com/questions/5694228/sed-in-place-flag-that-works-both-on-mac-bsd-and-linux, the -i option is not part of POSIX Sed, and is apparently implemented differently on BSD/Apple OS X/macOS.

How to download filenames in Cyrillic and save them in Cyrillic

A year ago it worked fine on my previous Windows 10 PC. Now it saves gibberish as filenames.
--restrict-file-names does nothing.
--local-encoding=UTF-8 does not work:
unrecognised option `--local-encoding=UTF-8'

I'm using wget -m -np -r -R "index.html*" http://apache_listing.com/folder/folder_in_cyrillic/
