Download Fixed Dump Original
pages-articles.xml.bz2 and pages-articles-multistream.xml.bz2 both contain the same xml contents. So if you unpack either, you get the same data. But with multistream, it is possible to get an article from the archive without unpacking the whole thing. Your reader should handle this for you, if your reader doesn't support it it will work anyway since multistream and non-multistream contain the same xml. The only downside to multistream is that it is marginally larger. You might be tempted to get the smaller non-multistream archive, but this will be useless if you don't unpack it. And it will unpack to 5-10 times its original size. Penny wise, pound foolish. Get multistream.
Download Dump Original
NOTE THAT the multistream dump file contains multiple bz2 'streams' (bz2 header, body, footer) concatenated together into one file, in contrast to the vanilla file which contains one stream. Each separate 'stream' (or really, file) in the multistream dump contains 100 pages, except possibly the last one.
In the dumps.wikimedia.org directory you will find the latest SQL and XML dumps for the projects, not just English. The sub-directories are named for the language code and the appropriate project. Some other directories (e.g. simple, nostalgia) exist, with the same structure. These dumps are also available from the Internet Archive.
Images and other uploaded media are available from mirrors in addition to being served directly from Wikimedia servers. Bulk download is (as of September 2013) available from mirrors but not offered directly from Wikimedia servers. See the list of current mirrors. You should rsync from the mirror, then fill in the missing images from upload.wikimedia.org; when downloading from upload.wikimedia.org you should throttle yourself to 1 cache miss per second (you can check headers on a response to see if was a hit or miss and then back off when you get a miss) and you shouldn't use more than one or two simultaneous HTTP connections. In any case, make sure you have an accurate user agent string with contact info (email address) so ops can contact you if there's an issue. You should be getting checksums from the mediawiki API and verifying them. The API Etiquette page contains some guidelines, although not all of them apply (for example, because upload.wikimedia.org isn't MediaWiki, there is no maxlag parameter).
Unlike most article text, images are not necessarily licensed under the GFDL & CC-BY-SA-3.0. They may be under one of many free licenses, in the public domain, believed to be fair use, or even copyright infringements (which should be deleted). In particular, use of fair use images outside the context of Wikipedia or similar works may be illegal. Images under most licenses require a credit, and possibly other attached copyright information. This information is included in image description pages, which are part of the text dumps available from dumps.wikimedia.org. In conclusion, download these images at your own risk (Legal).
Compressed dump files are significantly compressed, thus after being decompressed will take up large amounts of drive space. A large list of decompression programs are described in comparison of file archivers. The following programs in particular can be used to decompress bzip2, .bz2, .zip, and .7z files.
Before starting a download of a large file, check the storage device to ensure its file system can support files of such a large size, and check the amount of free space to ensure that it can hold the downloaded file.
It is useful to check the MD5 sums (provided in a file in the download directory) to make sure the download was complete and accurate. This can be checked by running the "md5sum" command on the files downloaded. Given their sizes, this may take some time to calculate. Due to the technical details of how files are stored, file sizes may be reported differently on different filesystems, and so are not necessarily reliable. Also, corruption may have occurred during the download, though this is unlikely.
If you plan to download Wikipedia Dump files to one computer and use an external USB flash drive or hard drive to copy them to other computers, then you will run into the 4 GB FAT32 file size limit. To work around this limit, reformat the >4 GB USB drive to a file system that supports larger file sizes. If working exclusively with Windows computers, then reformat the USB drive to NTFS file system.
If you seem to be hitting the 2 GB limit, try using wget version 1.10 or greater, cURL version 7.11.1-1 or greater, or a recent version of lynx (using -dump). Also, you can resume downloads (for example wget -c).
You can do Hadoop MapReduce queries on the current database dump, but you will need an extension to the InputRecordFormat tohave each be a single mapper input. A working set of java methods (jobControl, mapper, reducer, and XmlInputRecordFormat) is available at Hadoop on the Wikipedia
As part of Wikimedia Enterprise a partial mirror of HTML dumps is made public. Dumps are produced for a specific set of namespaces and wikis, and then made available for public download. Each dump output file consists of a tar.gz archive which, when uncompressed and untarred, contains one file, with a single line per article, in json format. This is currently an experimental service.
MediaWiki 1.5 includes routines to dump a wiki to HTML, rendering the HTML with the same parser used on a live wiki. As the following page states, putting one of these dumps on the web unmodified will constitute a trademark violation. They are intended for private viewing in an intranet or desktop installation.
The wikiviewer plugin for rockbox permits viewing converted Wikipedia dumps on many Rockbox devices.It needs a custom build and conversion of the wiki dumps using the instructions available at . The conversion recompresses the file and splits it into 1 GB files and an index file which all need to be in the same folder on the device or micro sd card.
Instead of converting a database dump file to many pieces of static HTML, one can also use a dynamic HTML generator. Browsing a wiki page is just like browsing a Wiki site, but the content is fetched and converted from a local dump file on request from the browser.
XOWA is a free, open-source application that helps download Wikipedia to a computer. Access all of Wikipedia offline, without an internet connection!It is currently in the beta stage of development, but is functional. It is available for download here.
MzReader by Mun206 works with (though is not affiliated with) BzReader, and allows further rendering of wikicode into better HTML, including an interpretation of the monobook skin. It aims to make pages more readable. Requires Microsoft Visual Basic 6.0 Runtime, which is not supplied with the download. Also requires Inet Control and Internet Controls (Internet Explorer 6 ActiveX), which are packaged with the download.
WP-MIRROR is a free utility for mirroring any desired set of WMF wikis. That is, it builds a wiki farm that the user can browse locally. WP-MIRROR builds a complete mirror with original size media files. WP-MIRROR is available for download.
DocumentationNCBI SRA Download GuideSRA Toolkit documentation
SRA File Formats Guide
Command line help: Type the command followed by '-h'
fasterq-dump guideImportant NotesModule Name: sratoolkit (see the modules page for more information)
fastq-dump is being deprecated. Use fasterq-dump instead -- it is much faster and more efficient.
fasterq-dump uses temporary space while downloading, so you must make sure you have enough space
Do not run more than the default 6 threads on Helix.
To run trimgalore/cutadapt/trinity on these files, the quality header needs to be changed, e.g.sed -r 's/(^[\@\+]SRR\S+)/\1\/1/' SRR10724344_1.filter.fastqsed -r 's/(^[\@\+]SRR\S+)/\1\/2/' SRR10724344_2.filter.fastq fasterq-dump requires tmp space during the download. This temporary directory will use approximately the size of the final output file. On Biowulf, the SRAtoolkit module is set upto use local disk as the temporary directory. Therefore, if running SRAtoolkit on Biowulf, you must allocate local disk as in the examples below.SRA Source Repositories SRA Data currently reside in 3 NIH repositories: NCBI - Bethesda and Sterling
Amazon Web Services (= 'Amazon cloud' = AWS) Google Cloud Platform (GCP)Two versions of the data exist: the original (raw) submission, and a normalized (extract, transform, load [ETL]) version. NCBI maintains only ETL data online, while AWS and GCP have both ETL and original submission format. Users who want access to the original bams can only get them from AWS or GCP today.In the case of ETL data, Sratoolkit tools on Biowulf will always pull from NCBI, because it is obviously nearer and there are no fees.Most sratoolkit tools such as fasterq-dump will pull ETL data from NCBI. prefetch is the only SRAtoolkit tool that provides access to the original bams. If requesting "original submission" files in bam or cram or some other format, they can ONLY be obtained from AWS or GCP and will require that the user provide a cloud-billing account to pay for egress charges. See -tools/wiki/03.-Quick-Toolkit-Configuration and -tools/wiki/04.-Cloud-Credentials. The user needs to establish account information, register it with the toolkit, and authorize the toolkit to pass this information to AWS or GCP to pay for egress charges.If you attempt to download non-ETL SRA data from AWS or GCP without the account information, you will see an error message along these lines:Bucket is requester pays bucket but no user project provided.Errors during downloadsIt is not unusual for users to get errors while downloading SRA data with prefetch, fasterq-dump, or hisat2, because many people are constantly downloading data and the servers can get overwhelmed. Please see the NCBI SRA page Connection TimeoutsEstimating space requirementsfasterq-dump takes significantly more space than the old fastq-dump, as it requires temporary space in addition to the final output. As a rule of thumb, the fasterq-dump guide suggests getting the size of the accession using 'vdb-dump', then estimating 7x for the output and 6x for the temp files. For example: helix% vdb-dump --info SRR2048331acc : SRR2048331path : -downloadb.be-md.ncbi.nlm.nih.gov/sos1/sra-pub-run-5/SRR2048331/SRR2048331.2size : 657,343,309type : Tableplatf : SRA_PLATFORM_ILLUMINASEQ : 16,600,251SCHEMA : NCBI:SRA:Illumina:tbl:q1:v2#1.1TIME : 0x0000000056644e79 (12/06/2015 10:04)FMT : FastqFMTVER : 2.5.4LDR : fastq-load.2.5.4LDRVER : 2.5.4LDRDATE: Sep 16 2015 (9/16/2015 0:0)Based on the third line, you should have 650 MB * 7 = 4550 MB = 4.5 GB for the tmp files, and 4 GB for the output file(s). It is also recommended that the output file and temporary files be on different filesystems, as in the examples below. Downloading data from SRAYou can download SRA fastq files using the fasterq-dump tool, which will download the fastq file into your current working directory by default. (Note: the old fastq-dump is being deprecated). During the download, a temporary directory will be created in the location specified by the -t flag (in the example below, in /scratch/$USER) that will get deleted after the download is complete.For example, on Helix, the interactive data transfer system, you can download as in the example below. To download on Biowulf, don't run on the Biowulf login node; use a batch job or interactive job instead. sratoolkit versions : Do not download to the top level of /data/$USER or /home/$USER. Instead, you must download the data to a new subdirectory, e.g. /data/$USER/sra which has no other files. 041b061a72