[00:06] anyways looking at the size of archiveteam's dumps for thingiverse
[00:06] 38G per dump
[00:07] i'm now thinking my dumps may still be needed if possible
[00:07] if nothing else, just to be more downloadable
[00:14] i made my script faster by blocking /images/ui-
[00:14] they're images for the interface that 404 on me
[00:18] I think we also are getting everything with the warrior grab
[00:22] *** JesseW has joined #archiveteam-bs
[01:20] *** primus104 has quit IRC (Leaving.)
[02:11] *** JesseW has quit IRC (Read error: Operation timed out)
[02:25] i've uploaded english, french, russian, and japanese (demo) versions of the legoland game to IA: https://archive.org/search.php?query=subject%3A%22legoland%22%20AND%20subject%3A%22video%20game%22
[02:31] *** fie__ has quit IRC (Quit: Leaving)
[02:38] ooh, i had that game.
[02:39] *** Start has quit IRC (Read error: Connection reset by peer)
[02:40] *** Start has joined #archiveteam-bs
[02:56] *** zenguy_pc has quit IRC (Read error: Connection reset by peer)
[03:08] what sort of compression and detail should i use for the scan of CD cover inserts and CDs
[03:09] I'm guessing you'd rather me not upload 300mb BMP scans of them
[03:11] *** zenguy_pc has joined #archiveteam-bs
[03:22] *** JesseW has joined #archiveteam-bs
[03:35] Ctrl-S: As long as it's lossless, and as large as you have, I think any format is fine. The derive task should handle it. https://archive.org/help/derivatives.php
[03:36] okay
[03:37] The BMPs are probably just fine
[03:37] but I'm hardly knowledgeable about this.
[03:39] well, maybe do png or something like that at a minimum
[03:42] *** zenguy_pc has quit IRC (Read error: Connection reset by peer)
[03:43] is png lossless?
[03:46] It uses deflate, I believe.
[03:48] ah, it uses deflate with a prefilter. And yes, most libraries use the lossless mode by default.
[03:52] or you could use xpm, which is both lossless and ascii, so you can see the pictures in your text editor in 100 years ;)
[03:52] (a little large files though)
[03:52] I thought that was black and white only
[03:53] not at all
[03:53] oops, I mistook it for xbm
[03:53] also it's valid C code, just because
[03:54] oh nope, I was actually thinking of PBM
[03:54] so i scan at the highest res bitmap and you guys can convert it to something sane?
[03:58] *** zenguy_pc has joined #archiveteam-bs
[04:03] *** zenguy_pc has quit IRC (Read error: Connection reset by peer)
[04:06] *** aaaaaaaa_ has joined #archiveteam-bs
[04:06] *** aaaaaaaaa has quit IRC (Read error: Connection reset by peer)
[04:06] *** swebb sets mode: +o aaaaaaaa_
[04:07] *** aaaaaaaa_ is now known as aaaaaaaaa
[04:07] *** aaaaaaaaa has quit IRC (Client Quit)
[04:19] *** zenguy_pc has joined #archiveteam-bs
[06:10] *** PurpleSym has joined #archiveteam-bs
[07:05] *** lbft has quit IRC (Read error: Operation timed out)
[07:10] *** lbft has joined #archiveteam-bs
[07:11] *** Aranje has quit IRC (Quit: Three sheets to the wind)
[07:50] *** primus104 has joined #archiveteam-bs
[07:51] *** JesseW has quit IRC (Read error: Operation timed out)
[08:03] *** fie has joined #archiveteam-bs
[08:27] *** primus104 has quit IRC (Leaving.)
[09:25] I'd scan it in the highest resolution possible, don't worry about the size
[09:25] so bmp would be fine
[09:26] if IA doesn't derive the bmp images, I think you should also upload a converted version of the bmp as preview
[10:34] *** schbirid has joined #archiveteam-bs
[10:39] *** primus104 has joined #archiveteam-bs
[10:58] *** fie_ has joined #archiveteam-bs
[11:00] *** fie has quit IRC (Read error: Operation timed out)
[11:01] *** Infreq has quit IRC (Read error: Operation timed out)
[11:01] *** Baljem_ has joined #archiveteam-bs
[11:01] *** Infreq has joined #archiveteam-bs
[11:02] *** Baljem has quit IRC (Read error: Operation timed out)
[11:05] *** BlueMaxim has quit IRC (Read error: Connection reset by peer)
[11:21] *** DopefishJ has joined #archiveteam-bs
[11:21] *** swebb sets mode: +o DopefishJ
[11:22] *** DFJustin has quit IRC (Read error: Operation timed out)
[14:26] *** zenguy_pc has quit IRC (Ping timeout: 252 seconds)
[14:28] *** vitzli has joined #archiveteam-bs
[14:38] *** zenguy_pc has joined #archiveteam-bs
[14:48] *** Start has quit IRC (Quit: Disconnected.)
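Editor's aside on the "is png lossless?" exchange: the claim that deflate with a prefilter loses nothing can be checked with a small sketch. This is not a PNG encoder. It applies a Sub-style prefilter (each byte replaced by its difference from the previous byte, as in PNG filter type 1 for 8-bit grayscale), compresses with zlib's deflate, then reverses both steps. The scanline data and the function names are made up for illustration.

```python
import zlib


def sub_filter(row: bytes) -> bytes:
    """Replace each byte with its difference from the previous byte (mod 256)."""
    out = bytearray(len(row))
    prev = 0
    for i, value in enumerate(row):
        out[i] = (value - prev) & 0xFF
        prev = value
    return bytes(out)


def sub_unfilter(filtered: bytes) -> bytes:
    """Undo sub_filter by accumulating the differences (mod 256)."""
    out = bytearray(len(filtered))
    prev = 0
    for i, delta in enumerate(filtered):
        prev = (prev + delta) & 0xFF
        out[i] = prev
    return bytes(out)


# Hypothetical 8-bit grayscale scanline: a repeating smooth gradient.
row = bytes(i % 256 for i in range(4096))

# Prefilter, then deflate at the highest compression level.
compressed = zlib.compress(sub_filter(row), 9)

# Inflate, then undo the prefilter.
restored = sub_unfilter(zlib.decompress(compressed))

print("round trip is byte-identical:", restored == row)
print("raw bytes:", len(row), "compressed bytes:", len(compressed))
```

Both steps are exactly invertible, so the restored bytes match the input; the prefilter only changes how well deflate compresses, not what comes back out.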
[14:54] *** Smiley has quit IRC (Read error: Connection reset by peer)
[14:54] *** Stiletto has joined #archiveteam-bs
[14:54] *** RichardG_ has joined #archiveteam-bs
[14:54] *** logan has joined #archiveteam-bs
[14:55] *** Rai-chan has quit IRC (Ping timeout: 268 seconds)
[14:55] *** kniffy has quit IRC (Ping timeout: 268 seconds)
[14:55] *** Fusl has quit IRC (Ping timeout: 268 seconds)
[14:55] *** joepie91 has quit IRC (Ping timeout: 268 seconds)
[14:56] *** matthusby has quit IRC (Ping timeout: 268 seconds)
[14:56] *** RichardG has quit IRC (Ping timeout: 268 seconds)
[14:56] *** espes__ has quit IRC (Ping timeout: 268 seconds)
[14:56] *** jk[SVP] has quit IRC (Ping timeout: 268 seconds)
[14:56] *** logan2 has quit IRC (Ping timeout: 268 seconds)
[14:56] *** matthusby has joined #archiveteam-bs
[14:56] *** jk[SVP] has joined #archiveteam-bs
[14:57] *** edsu has joined #archiveteam-bs
[14:57] *** swebb sets mode: +o edsu
[14:57] *** ohhdemgir has quit IRC (Ping timeout: 268 seconds)
[14:57] *** zhongfu has quit IRC (Ping timeout: 268 seconds)
[14:57] *** edsu_ has quit IRC (Ping timeout: 268 seconds)
[14:58] *** wp494_ has joined #archiveteam-bs
[14:58] *** joepie91 has joined #archiveteam-bs
[14:59] *** zhongfu has joined #archiveteam-bs
[15:00] *** will- has joined #archiveteam-bs
[15:01] *** ohhdemgir has joined #archiveteam-bs
[15:01] *** Stilett0 has quit IRC (Ping timeout: 268 seconds)
[15:01] *** vtyl has quit IRC (Ping timeout: 268 seconds)
[15:01] *** pwnsrv has quit IRC (Ping timeout: 268 seconds)
[15:01] *** wp494 has quit IRC (Ping timeout: 268 seconds)
[15:02] *** vtyl has joined #archiveteam-bs
[15:02] *** goekesmi has quit IRC (Ping timeout: 268 seconds)
[15:02] *** no2pencil has quit IRC (Ping timeout: 268 seconds)
[15:02] *** will has quit IRC (Ping timeout: 268 seconds)
[15:02] *** will- is now known as will
[15:03] *** wednesda- has joined #archiveteam-bs
[15:03] *** primus104 has quit IRC (Leaving.)
[15:04] *** goekesmi has joined #archiveteam-bs
[15:04] *** Fusl has joined #archiveteam-bs
[15:05] *** no2pencil has joined #archiveteam-bs
[15:07] *** zenguy_pc has quit IRC (Ping timeout: 310 seconds)
[15:07] *** SimpBrain has joined #archiveteam-bs
[15:08] *** wacky_ has quit IRC (Ping timeout: 268 seconds)
[15:08] *** wednesday has quit IRC (Ping timeout: 268 seconds)
[15:15] *** zenguy_pc has joined #archiveteam-bs
[15:24] *** wacky_ has joined #archiveteam-bs
[15:32] *** espes__ has joined #archiveteam-bs
[15:36] *** vitzli has quit IRC (Quit: Leaving)
[15:36] *** Smiley has joined #archiveteam-bs
[15:39] *** Rye has joined #archiveteam-bs
[15:39] *** kniffy has joined #archiveteam-bs
[16:45] *** primus104 has joined #archiveteam-bs
[16:55] *** godane has quit IRC (Ping timeout: 492 seconds)
[17:10] *** RichardG_ is now known as RichardG
[17:23] *** JesseW has joined #archiveteam-bs
[17:37] *** godane has joined #archiveteam-bs
[17:41] *** JesseW has quit IRC (Read error: Operation timed out)
[17:42] *** Start has joined #archiveteam-bs
[17:50] *** aaaaaaaaa has joined #archiveteam-bs
[17:50] *** swebb sets mode: +o aaaaaaaaa
[18:34] *** JesseW has joined #archiveteam-bs
[19:04] *** wp494_ is now known as wp494
[19:04] *** garyrh has quit IRC (Remote host closed the connection)
[19:06] *** dashcloud has quit IRC (Read error: Operation timed out)
[19:11] *** dashcloud has joined #archiveteam-bs
[19:42] *** garyrh has joined #archiveteam-bs
[19:49] *** primus has joined #archiveteam-bs
[19:55] Hi all, I need help with archiving a website. The site is not big, so i thought i'd try downloading it with wget.
[19:56] Try using wpull instead.
[19:57] Here's the last wpull query I ran.
[19:57] wpull 'http://www.musictheory.net/lessons' --no-check-certificate --user-agent "wpull-web-archiver" --page-requisites --recursive --level inf --span-hosts-allow linked-pages,page-requisites --recursive --level inf --escaped-fragment --strip-session-id --sitemaps --reject-regex "/login\.php" --tries 3 --retry-connrefused --retry-dns-error --timeout 60 --session-timeout 21600 --database "music-theory.db" -np
[19:57] --header="Contact: acstubbins@openmailbox.org" --warc-file "music-theory" --warc-max-size 9999999999999999999 -np
[19:57] Thanks, does it also make a browsable local website?
[19:58] Yeah.
[19:58] Hmm, it doesn't seem to be available in yum for centos
[19:59] It isn't. You should install it with pip, or from the github.
[19:59] What is its main advantage over wget if you don't mind me asking?
[20:00] actually I was only looking for advice on whether the command from the wget tutorial would work for my purpose: wget --mirror -p --convert-links -P ./LOCAL-DIR WEBSITE-URL
[20:03] Okay. Let me go over the options slowly.
[20:03] --no-check-certificate means don't check the https certificate.
[20:04] --user-agent "wpull-web-archiver" really isn't necessary, but it essentially lets web masters know they're being archived, if they ever check their logs
[20:05] --page-requisites is so that it archives all the images and other stuff on the page.
[20:05] --recursive --level inf means to recurse infinitely.
[20:06] --span-hosts-allow linked-pages,page-requisites tells wpull to download elements from different hosts, as long as it's a linked page, or something that falls under page-requisites
[20:07] Actually, you might want to remove linked-pages.
[20:07] I forget what --escaped-fragment, --strip-session-id are for
[20:08] Wouldn't infinite level of recursion mean it falls into a loop?
[20:09] The site i'm going to archive is very badly maintained so i'm quite concerned about loops.
[20:16] primus: No. It won't archive anything twice.
[20:18] Thank you for all your advice, I appreciate it
[20:18] You're welcome.
[20:18] *** JesseW has quit IRC (Read error: Operation timed out)
[20:36] you know, the point of the --warc-max-size is to set it to something reasonable, not ~10 exabytes
[20:37] aaaaaaaaa: That was to work around an annoying bug in webarchiveplayer.
[20:39] primus: wpull supports some options that wget doesn't (and vice versa: https://wpull.readthedocs.org/en/master/differences.html) but the biggest benefit is if you spot a bug, the maintainer is on this channel.
[20:40] oh and it doesn't try to keep everything in memory as well.
[20:40] That's great. At the moment wget is already running. After it's finished I'll also run wpull, just to be on the safe side.
[20:48] *** JesseW has joined #archiveteam-bs
[20:58] *** BiggieJon has joined #archiveteam-bs
[21:11] *** PurpleSym has quit IRC (Remote host closed the connection)
[21:40] *** phiren has quit IRC (Ping timeout: 252 seconds)
[21:40] *** phiren has joined #archiveteam-bs
[21:45] *** JesseW has quit IRC (Leaving.)
[21:45] *** JesseW has joined #archiveteam-bs
[22:39] *** schbirid has quit IRC (Quit: Leaving)
[22:39] *** BlueMaxim has joined #archiveteam-bs
[23:58] *** JesseW has quit IRC (Read error: Operation timed out)
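Editor's aside on the wpull command quoted at 19:57: it repeats `--recursive --level inf` and `-np`, and the follow-up messages suggest dropping `linked-pages` and choosing a sensible `--warc-max-size`. A minimal sketch of assembling that command with those suggestions applied might look like the following. Every flag is copied from the command in the log and has not been checked against any particular wpull version; the function name `build_wpull_command`, its parameters, and the example values are hypothetical. `--escaped-fragment` and `--strip-session-id` are left out only because the log does not explain them.

```python
import shlex


def build_wpull_command(url, name, contact=None, warc_max_size=None):
    """Return an argv list for wpull using flags taken from the logged command.

    Each flag appears once, and linked-pages is dropped from
    --span-hosts-allow as suggested at 20:07 in the log.
    """
    args = [
        "wpull", url,
        "--no-check-certificate",                 # skip HTTPS certificate checks
        "--user-agent", "wpull-web-archiver",     # identify the crawl in server logs
        "--page-requisites",                      # images and other page assets
        "--recursive", "--level", "inf",          # follow links without a depth limit
        "-np",                                    # as in the logged command
        "--span-hosts-allow", "page-requisites",  # off-host fetches for assets only
        "--sitemaps",
        "--reject-regex", r"/login\.php",
        "--tries", "3",
        "--retry-connrefused",
        "--retry-dns-error",
        "--timeout", "60",
        "--session-timeout", "21600",
        "--database", f"{name}.db",
        "--warc-file", name,
    ]
    if contact is not None:
        # One argv element, equivalent to --header="Contact: ..." in a shell.
        args.append(f"--header=Contact: {contact}")
    if warc_max_size is not None:
        # Size in bytes; pick something reasonable rather than ~10 exabytes.
        args.extend(["--warc-max-size", str(warc_max_size)])
    return args


command = build_wpull_command(
    "http://www.musictheory.net/lessons",
    "music-theory",
    contact="someone@example.org",    # placeholder address
    warc_max_size=5 * 1024 ** 3,      # hypothetical 5 GiB limit per WARC file
)

# Print a copy-pasteable shell line; nothing is executed here.
print(shlex.join(command))
```

Building the command as a list keeps each flag visible and makes accidental duplicates easy to spot before the line is pasted into a shell.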