[00:32] *** pnJay has quit IRC (Leaving)
[01:09] *** tfgbd_znc has quit IRC (Ping timeout: 600 seconds)
[01:19] *** tfgbd_znc has joined #archiveteam-bs
[01:19] *** pizzaiolo has left
[01:26] *** kristian_ has quit IRC (Quit: Leaving)
[01:48] *** ndiddy has joined #archiveteam-bs
[02:36] *** tfgbd_znc has quit IRC (Ping timeout: 600 seconds)
[02:39] *** tfgbd_znc has joined #archiveteam-bs
[02:59] *** tfgbd_znc has quit IRC (Ping timeout: 600 seconds)
[03:06] *** tfgbd_znc has joined #archiveteam-bs
[03:09] *** Honno has joined #archiveteam-bs
[03:57] *** fie has quit IRC (Read error: Operation timed out)
[04:05] *** fie has joined #archiveteam-bs
[04:14] *** tammy_ has quit IRC (Ping timeout: 244 seconds)
[04:26] *** tammy_ has joined #archiveteam-bs
[04:44] *** Sk1d has quit IRC (Ping timeout: 194 seconds)
[05:51] *** Aranje has quit IRC (Read error: Operation timed out)
[06:55] *** kyounko has joined #archiveteam-bs
[07:05] *** odemg has quit IRC (Remote host closed the connection)
[07:53] *** kristian_ has joined #archiveteam-bs
[08:00] *** BlueMaxim has quit IRC (Read error: Operation timed out)
[08:01] *** BlueMaxim has joined #archiveteam-bs
[08:39] *** paparus has joined #archiveteam-bs
[08:44] *** flipflop has joined #archiveteam-bs
[08:47] *** prokuz has joined #archiveteam-bs
[08:48] *** GE has joined #archiveteam-bs
[08:56] *** paparus has quit IRC (Read error: Operation timed out)
[08:56] *** paparus has joined #archiveteam-bs
[08:57] *** fie has quit IRC (Read error: Connection reset by peer)
[08:59] *** flipflop has quit IRC (Read error: Operation timed out)
[09:04] *** prokuz has quit IRC (Read error: Operation timed out)
[09:16] *** fie has joined #archiveteam-bs
[09:17] *** JAA has joined #archiveteam-bs
[10:00] *** paparus has quit IRC (Quit: Leaving)
[10:40] *** pnJay has joined #archiveteam-bs
[10:43] *** GE has quit IRC (Remote host closed the connection)
[10:50] *** Silvan has joined #archiveteam-bs
[10:50] *** SilSte has quit IRC (Read
error: Connection reset by peer)
[11:37] the average size of the .torrent files is 49mb? that can't be right
[11:40] The average size of the data in the torrents is 50 MiB, yes (not the .torrent files themselves). It actually looks about right.
[11:41] oh that makes more sense
[11:41] The largest category is games, most of which are only a few MiB.
[11:41] The second largest is music, which is anywhere between a few and a few hundred MiB.
[11:42] I'm grabbing the .torrent files already (along with the entire website), and there's the ArchiveBot grab.
[11:43] But it might be worth grabbing the data inside the torrents as well. Mininova acted as a content distribution platform for the past few years. Not sure how much of that content was also distributed elsewhere by the publishers.
[11:44] ~90% of the torrents only have one seeder - most likely Mininova, which will stop seeding on 4 April.
[11:45] And contrary to previous statements in here, all torrents I've looked at only had one tracker - Mininova.
[11:46] *** pizzaiolo has joined #archiveteam-bs
[12:06] *** GE has joined #archiveteam-bs
[12:23] *** kristian_ has quit IRC (Quit: Leaving)
[12:54] *** tfgbd_znc has quit IRC (Ping timeout: 600 seconds)
[12:58] *** tfgbd_znc has joined #archiveteam-bs
[13:16] *** BlueMaxim has quit IRC (Read error: Connection reset by peer)
[13:37] *** sep332 has quit IRC (Read error: Operation timed out)
[13:38] *** sep332_ has joined #archiveteam-bs
[13:44] *** pizzaiolo has quit IRC (Ping timeout: 246 seconds)
[14:07] *** pizzaiolo has joined #archiveteam-bs
[14:22] *** Jonison has joined #archiveteam-bs
[14:23] *** C4K3 has quit IRC (Read error: Operation timed out)
[14:26] *** C4K3 has joined #archiveteam-bs
[15:18] *** C4K3 has quit IRC (Quit: leaving)
[15:40] *** kyounko has quit IRC (Read error: Connection reset by peer)
[15:44] anyone have exp running warrior scripts in the windows 10 bash prompt?
[15:51] pnJay: no but i can help set them up
[15:52] I used last year to run them on my linux server
[15:53] just note that if you plan on running them using something like systemd or upstart on WSL on windows that's not going to work
[16:21] *** Dark_Star has quit IRC (Ping timeout: 506 seconds)
[16:24] *** Simpbrain has joined #archiveteam-bs
[16:37] *** odemg has joined #archiveteam-bs
[16:40] *** Dark_Star has joined #archiveteam-bs
[17:15] *** brayden has quit IRC (Ping timeout: 633 seconds)
[17:38] *** Aranje has joined #archiveteam-bs
[17:50] *** Aranje has quit IRC (Quit: Three sheets to the wind)
[17:59] *** GE has quit IRC (Remote host closed the connection)
[18:29] *** odemg has quit IRC (Remote host closed the connection)
[18:32] *** odemg has joined #archiveteam-bs
[18:37] *** dashcloud has quit IRC (Read error: Operation timed out)
[18:40] *** dashcloud has joined #archiveteam-bs
[19:01] *** pizzaiolo has quit IRC (Remote host closed the connection)
[19:14] *** GE has joined #archiveteam-bs
[19:54] *** pnJay has quit IRC (Quit: Page closed)
[20:17] https://pastebin.com/raw/zhivLAGh
[20:21] *** fie has quit IRC (Read error: Operation timed out)
[20:37] Status updates: Mininova is at 160k URLs done, 302k left (growing), 600/hour; WunderBlogs at 471k done, 373k left (dropping), 6k/hour
[20:38] (Yes, Mininova will likely not finish in time.)
[20:45] *** Jonison has quit IRC (Read error: Connection reset by peer)
[20:46] it's a shame it's not a warrior project.
[20:49] *** pnJay has joined #archiveteam-bs
[20:55] ArchiveBot did grab it already. Unfortunately, it seems that over 10% of the pages are 500 Internal Server Error instead of the actual content.
[20:58] I guess in this case you could first run a grab ignoring all torrent pages (which should be relatively quick; the slowest part for me are the statistics pages), collect the torrent IDs, and then group those together to get the items.
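Editor's note: the [20:37] status update gives enough numbers to estimate how long each grab still needs at its current rate. The sketch below only divides the remaining URLs by the hourly rate; the helper name `hours_left` is made up for illustration. Because the Mininova queue is reported as still growing, its figure should be read as a lower bound.

```python
def hours_left(remaining_urls: int, urls_per_hour: float) -> float:
    """Hours needed to drain the queue if the rate and queue size stay constant."""
    return remaining_urls / urls_per_hour


# (URLs left, URLs per hour), taken from the [20:37] status update.
jobs = {
    "Mininova": (302_000, 600),
    "WunderBlogs": (373_000, 6_000),
}

for name, (left, rate) in jobs.items():
    hours = hours_left(left, rate)
    print(f"{name}: {hours:.0f} h (~{hours / 24:.1f} days)")

# Mininova: 503 h (~21.0 days)
# WunderBlogs: 62 h (~2.6 days)
```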
[21:02] I'm curious though whether there is a general solution to grab an entire website in a distributed way without manual pre-processing. The tracker would tell the client "here's a list of 500 URLs - go fetch them and all page requisites, extract links to other URLs we're interested in, then give me the WARC and the list of new URLs". Does this exist?
[21:07] (Actually, the page requisites should probably also just go in the list sent back to the server, otherwise there will be tons of duplicates.)
[21:08] *** Jonison has joined #archiveteam-bs
[21:14] JAA, mininova is SLOW AS!!!
[21:18] *** BlueMaxim has joined #archiveteam-bs
[21:18] Yeah, in this case a distributed effort would probably not really help. I suspect their servers are simply overloaded (or crappy).
[21:21] Maybe I'll ignore all individual torrent statistics pages for now; they're extremely slow and timing out frequently. Is there a way to instruct wpull to ignore a URL but still store it in the database so I can grab them later (if I manage to download the rest and there's still time)?
[21:21] yeah, I won't touch this, do your thing man
[21:27] I try :)
[21:28] Need to get some sleep now. If you have any clue about my questions, feel free to reply; I'll read the logs in the morning.
[21:29] *** JAA has quit IRC (Quit: Page closed)
[21:47] *** RichardG has quit IRC (Ping timeout: 633 seconds)
[21:50] *** RichardG has joined #archiveteam-bs
[21:57] *** Jonison has quit IRC (Read error: Connection reset by peer)
[22:03] *** bwn has quit IRC (Read error: Operation timed out)
[22:13] *** bwn has joined #archiveteam-bs
[23:30] *** odemg has quit IRC (Remote host closed the connection)
[23:31] *** odemg has joined #archiveteam-bs
[23:37] *** BlueMaxim has quit IRC (Read error: Operation timed out)
[23:38] *** Speck has joined #archiveteam-bs
[23:40] *** pnJay has quit IRC (Leaving)
[23:47] *** GE has quit IRC (Remote host closed the connection)
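Editor's note: the tracker/client loop floated at [21:02] and [21:07] can be illustrated with a small in-memory simulation. This is a sketch under stated assumptions, not an existing ArchiveTeam tool: `Tracker`, `run_client`, `crawl` and `FAKE_SITE` are hypothetical names, the "website" is a dictionary instead of real HTTP, and no WARC is written. It shows one idea from the discussion: clients send every discovered URL (links and page requisites alike) back to the tracker, which deduplicates them centrally before handing out the next batch.

```python
from collections import deque


class Tracker:
    """Hypothetical central queue: hands out URL batches and deduplicates reports."""

    def __init__(self, seeds, in_scope):
        self.in_scope = in_scope  # predicate deciding which URLs we want at all
        self.seen = set()         # every URL ever queued
        self.todo = deque()       # URLs waiting to be handed to a client
        self.done = set()         # URLs a client has reported as fetched
        self.add(seeds)

    def add(self, urls):
        # Queue each in-scope URL at most once, however many clients report it.
        for url in urls:
            if url not in self.seen and self.in_scope(url):
                self.seen.add(url)
                self.todo.append(url)

    def next_batch(self, size=500):
        batch = []
        while self.todo and len(batch) < size:
            batch.append(self.todo.popleft())
        return batch

    def report(self, fetched, discovered):
        self.done.update(fetched)
        self.add(discovered)


def run_client(batch, fetch_links):
    """Fetch each URL in the batch and return (fetched, discovered).

    A real client would also record the responses in a WARC; omitted here.
    """
    discovered = []
    for url in batch:
        discovered.extend(fetch_links(url))
    return batch, discovered


# Toy stand-in for a site: URL -> links and page requisites found on that page.
FAKE_SITE = {
    "http://example.org/": ["http://example.org/a", "http://example.org/style.css"],
    "http://example.org/a": ["http://example.org/b", "http://example.org/style.css",
                             "http://elsewhere.example/x"],
    "http://example.org/b": ["http://example.org/"],
    "http://example.org/style.css": [],
}


def crawl(batch_size=2):
    tracker = Tracker(["http://example.org/"],
                      lambda url: url.startswith("http://example.org/"))
    while batch := tracker.next_batch(batch_size):
        fetched, discovered = run_client(batch, lambda url: FAKE_SITE.get(url, []))
        tracker.report(fetched, discovered)
    return tracker


if __name__ == "__main__":
    for url in sorted(crawl().done):
        print(url)
```

In this toy run the stylesheet is linked from two pages but queued only once, which is the duplicate problem raised at [21:07]. The sketch runs a single client in a loop; handling several concurrent clients, retries for failed batches, and persistent state are left out.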