#archiveteam-bs 2017-03-27,Mon


Time Nickname Message
00:32 🔗 pnJay has quit IRC (Leaving)
01:09 🔗 tfgbd_znc has quit IRC (Ping timeout: 600 seconds)
01:19 🔗 tfgbd_znc has joined #archiveteam-bs
01:19 🔗 pizzaiolo has left
01:26 🔗 kristian_ has quit IRC (Quit: Leaving)
01:48 🔗 ndiddy has joined #archiveteam-bs
02:36 🔗 tfgbd_znc has quit IRC (Ping timeout: 600 seconds)
02:39 🔗 tfgbd_znc has joined #archiveteam-bs
02:59 🔗 tfgbd_znc has quit IRC (Ping timeout: 600 seconds)
03:06 🔗 tfgbd_znc has joined #archiveteam-bs
03:09 🔗 Honno has joined #archiveteam-bs
03:57 🔗 fie has quit IRC (Read error: Operation timed out)
04:05 🔗 fie has joined #archiveteam-bs
04:14 🔗 tammy_ has quit IRC (Ping timeout: 244 seconds)
04:26 🔗 tammy_ has joined #archiveteam-bs
04:44 🔗 Sk1d has quit IRC (Ping timeout: 194 seconds)
05:51 🔗 Aranje has quit IRC (Read error: Operation timed out)
06:55 🔗 kyounko has joined #archiveteam-bs
07:05 🔗 odemg has quit IRC (Remote host closed the connection)
07:53 🔗 kristian_ has joined #archiveteam-bs
08:00 🔗 BlueMaxim has quit IRC (Read error: Operation timed out)
08:01 🔗 BlueMaxim has joined #archiveteam-bs
08:39 🔗 paparus has joined #archiveteam-bs
08:44 🔗 flipflop has joined #archiveteam-bs
08:47 🔗 prokuz has joined #archiveteam-bs
08:48 🔗 GE has joined #archiveteam-bs
08:56 🔗 paparus has quit IRC (Read error: Operation timed out)
08:56 🔗 paparus has joined #archiveteam-bs
08:57 🔗 fie has quit IRC (Read error: Connection reset by peer)
08:59 🔗 flipflop has quit IRC (Read error: Operation timed out)
09:04 🔗 prokuz has quit IRC (Read error: Operation timed out)
09:16 🔗 fie has joined #archiveteam-bs
09:17 🔗 JAA has joined #archiveteam-bs
10:00 🔗 paparus has quit IRC (Quit: Leaving)
10:40 🔗 pnJay has joined #archiveteam-bs
10:43 🔗 GE has quit IRC (Remote host closed the connection)
10:50 🔗 Silvan has joined #archiveteam-bs
10:50 🔗 SilSte has quit IRC (Read error: Connection reset by peer)
11:37 🔗 SpaffGarg the average size of the .torrent files is 49 MB? that can't be right
11:40 🔗 JAA The average size of the data in the torrents is 50 MiB, yes (not the .torrent files themselves). It actually looks about right.
11:41 🔗 SpaffGarg oh that makes more sense
11:41 🔗 JAA The largest category is games, most of which are only a few MiB.
11:41 🔗 JAA The second largest is music, which is anywhere between a few and a few hundred MiB.
11:42 🔗 JAA I'm grabbing the .torrent files already (along with the entire website), and there's the ArchiveBot grab.
11:43 🔗 JAA But it might be worth grabbing the data inside the torrents as well. Mininova acted as a content distribution platform for the past few years. Not sure how much of that content was also distributed elsewhere by the publishers.
11:44 🔗 JAA ~90% of the torrents only have one seeder - most likely Mininova, which will stop seeding on 4 April.
11:45 🔗 JAA And contrary to previous statements in here, all torrents I've looked at only had one tracker - Mininova.
11:46 🔗 pizzaiolo has joined #archiveteam-bs
12:06 🔗 GE has joined #archiveteam-bs
12:23 🔗 kristian_ has quit IRC (Quit: Leaving)
12:54 🔗 tfgbd_znc has quit IRC (Ping timeout: 600 seconds)
12:58 🔗 tfgbd_znc has joined #archiveteam-bs
13:16 🔗 BlueMaxim has quit IRC (Read error: Connection reset by peer)
13:37 🔗 sep332 has quit IRC (Read error: Operation timed out)
13:38 🔗 sep332_ has joined #archiveteam-bs
13:44 🔗 pizzaiolo has quit IRC (Ping timeout: 246 seconds)
14:07 🔗 pizzaiolo has joined #archiveteam-bs
14:22 🔗 Jonison has joined #archiveteam-bs
14:23 🔗 C4K3 has quit IRC (Read error: Operation timed out)
14:26 🔗 C4K3 has joined #archiveteam-bs
15:18 🔗 C4K3 has quit IRC (Quit: leaving)
15:40 🔗 kyounko has quit IRC (Read error: Connection reset by peer)
15:44 🔗 pnJay anyone have experience running warrior scripts in the Windows 10 bash prompt?
15:51 🔗 luckcolor pnJay: no but i can help set them up
15:52 🔗 luckcolor I used to run them on my Linux server last year
15:53 🔗 luckcolor just note that if you plan on running them using something like systemd or upstart on WSL on Windows, that's not going to work
16:21 🔗 Dark_Star has quit IRC (Ping timeout: 506 seconds)
16:24 🔗 Simpbrain has joined #archiveteam-bs
16:37 🔗 odemg has joined #archiveteam-bs
16:40 🔗 Dark_Star has joined #archiveteam-bs
17:15 🔗 brayden has quit IRC (Ping timeout: 633 seconds)
17:38 🔗 Aranje has joined #archiveteam-bs
17:50 🔗 Aranje has quit IRC (Quit: Three sheets to the wind)
17:59 🔗 GE has quit IRC (Remote host closed the connection)
18:29 🔗 odemg has quit IRC (Remote host closed the connection)
18:32 🔗 odemg has joined #archiveteam-bs
18:37 🔗 dashcloud has quit IRC (Read error: Operation timed out)
18:40 🔗 dashcloud has joined #archiveteam-bs
19:01 🔗 pizzaiolo has quit IRC (Remote host closed the connection)
19:14 🔗 GE has joined #archiveteam-bs
19:54 🔗 pnJay has quit IRC (Quit: Page closed)
20:17 🔗 odemg https://pastebin.com/raw/zhivLAGh
20:21 🔗 fie has quit IRC (Read error: Operation timed out)
20:37 🔗 JAA Status updates: Mininova is at 160k URLs done, 302k left (growing), 600/hour; WunderBlogs at 471k done, 373k left (dropping), 6k/hour
20:38 🔗 JAA (Yes, Mininova will likely not finish in time.)
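
The estimate behind these two messages is a simple division of URLs left by URLs per hour. The sketch below is only a rough check under stated assumptions: the rates stay constant, the queues stop growing (Mininova's is reported as growing, so its figure is a lower bound), and the cutoff is taken to be midnight at the start of 4 April, the date given earlier for the end of seeding. The exact cutoff time and timezone are assumptions, not something stated in the log.

```python
from datetime import datetime


def hours_to_finish(urls_left, urls_per_hour):
    """Hours needed to drain the queue at a constant rate."""
    return urls_left / urls_per_hour


# Figures from the 20:37 status update.
mininova_hours = hours_to_finish(302_000, 600)
wunderblogs_hours = hours_to_finish(373_000, 6_000)

# Assumed cutoff: start of 4 April 2017, same timezone as the log.
status_time = datetime(2017, 3, 27, 20, 37)
assumed_cutoff = datetime(2017, 4, 4)
hours_available = (assumed_cutoff - status_time).total_seconds() / 3600

print(f"Mininova needs    ~{mininova_hours:.0f} h ({mininova_hours / 24:.1f} days)")
print(f"WunderBlogs needs ~{wunderblogs_hours:.0f} h ({wunderblogs_hours / 24:.1f} days)")
print(f"Time until assumed cutoff: ~{hours_available:.0f} h")
print("Mininova finishes in time:", mininova_hours <= hours_available)
```

At 600 URLs/hour the Mininova queue needs roughly three weeks, against about one week until the assumed cutoff, which matches the remark that it will likely not finish in time.
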
20:45 🔗 Jonison has quit IRC (Read error: Connection reset by peer)
20:46 🔗 tammy_ it's a shame it's not a warrior project.
20:49 🔗 pnJay has joined #archiveteam-bs
20:55 🔗 JAA ArchiveBot did grab it already. Unfortunately, it seems that over 10% of the pages are 500 Internal Server Error instead of the actual content.
20:58 🔗 JAA I guess in this case you could first run a grab ignoring all torrent pages (which should be relatively quick; the slowest part for me is the statistics pages), collect the torrent IDs, and then group those together to get the items.
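
The grouping step described here could look roughly like the following. This is a minimal sketch, not how any existing tracker builds its items: the function name `group_into_items` and the default item size are hypothetical, and the torrent IDs are assumed to be plain integers collected during the first pass.

```python
def group_into_items(torrent_ids, item_size=100):
    """Deduplicate and sort torrent IDs, then split them into fixed-size items.

    item_size is an illustrative value; a real project would tune it to how
    long one item should take a single worker.
    """
    ids = sorted(set(torrent_ids))
    return [ids[i:i + item_size] for i in range(0, len(ids), item_size)]


# Example: IDs as they might come out of the first pass (with a duplicate).
items = group_into_items([5, 3, 3, 9, 1], item_size=2)
print(items)  # [[1, 3], [5, 9]]
```
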
21:02 🔗 JAA I'm curious though whether there is a general solution to grab an entire website in a distributed way without manual pre-processing. The tracker would tell the client "here's a list of 500 URLs - go fetch them and all page requisites, extract links to other URLs we're interested in, then give me the WARC and the list of new URLs". Does this exist?
21:07 🔗 JAA (Actually, the page requisites should probably also just go in the list sent back to the server, otherwise there will be tons of duplicates.)
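
The protocol sketched in the two messages above can be illustrated with a small in-memory simulation. Everything here is hypothetical: the `Tracker` class, the `work` function and the fake site are invented for illustration and do not correspond to an existing tool. No network access or WARC writing happens; fetched bodies stand in for WARC records. Page requisites are reported back to the tracker like any other link, as suggested in the follow-up message, so each URL is fetched once.

```python
from collections import deque


class Tracker:
    """Hands out batches of URLs and deduplicates what workers report back."""

    def __init__(self, seeds, batch_size=500):
        seeds = list(dict.fromkeys(seeds))  # drop duplicate seeds, keep order
        self.queue = deque(seeds)
        self.seen = set(seeds)
        self.batch_size = batch_size

    def claim_batch(self):
        batch = []
        while self.queue and len(batch) < self.batch_size:
            batch.append(self.queue.popleft())
        return batch

    def report(self, discovered):
        # Only URLs never seen before are queued, including page requisites.
        for url in discovered:
            if url not in self.seen:
                self.seen.add(url)
                self.queue.append(url)


def work(batch, fetch):
    """Worker side: fetch each URL, keep the record, collect outgoing links."""
    records, discovered = [], []
    for url in batch:
        body, links = fetch(url)
        records.append((url, body))  # stand-in for a WARC record
        discovered.extend(links)
    return records, discovered


def crawl(seeds, fetch, batch_size=500):
    tracker = Tracker(seeds, batch_size)
    archive = []
    while True:
        batch = tracker.claim_batch()
        if not batch:
            break
        records, discovered = work(batch, fetch)
        archive.extend(records)
        tracker.report(discovered)
    return archive


# Fake site: URL -> links and page requisites found on that page.
SITE = {
    "/": ["/a", "/b", "/style.css"],
    "/a": ["/b", "/style.css"],
    "/b": ["/"],
    "/style.css": [],
}


def fake_fetch(url):
    return f"<body of {url}>", SITE.get(url, [])


archive = crawl(["/"], fake_fetch, batch_size=2)
print([url for url, _ in archive])
```

In this toy version a single loop plays both tracker and worker; in a real distributed setup the batches would go to separate clients, and the tracker would also need to handle batches that are claimed but never reported back.
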
21:08 🔗 Jonison has joined #archiveteam-bs
21:14 🔗 odemg JAA, mininova is SLOW AS!!!
21:18 🔗 BlueMaxim has joined #archiveteam-bs
21:18 🔗 JAA Yeah, in this case a distributed effort would probably not really help. I suspect their servers are simply overloaded (or crappy).
21:21 🔗 JAA Maybe I'll ignore all individual torrent statistics pages for now; they're extremely slow and timing out frequently. Is there a way to instruct wpull to ignore a URL but still store it in the database so I can grab them later (if I manage to download the rest and there's still time)?
21:21 🔗 odemg yeah, I won't touch this, do your thing man
21:27 🔗 JAA I try :)
21:28 🔗 JAA Need to get some sleep now. If you have any clue about my questions, feel free to reply; I'll read the logs in the morning.
21:29 🔗 JAA has quit IRC (Quit: Page closed)
21:47 🔗 RichardG has quit IRC (Ping timeout: 633 seconds)
21:50 🔗 RichardG has joined #archiveteam-bs
21:57 🔗 Jonison has quit IRC (Read error: Connection reset by peer)
22:03 🔗 bwn has quit IRC (Read error: Operation timed out)
22:13 🔗 bwn has joined #archiveteam-bs
23:30 🔗 odemg has quit IRC (Remote host closed the connection)
23:31 🔗 odemg has joined #archiveteam-bs
23:37 🔗 BlueMaxim has quit IRC (Read error: Operation timed out)
23:38 🔗 Speck has joined #archiveteam-bs
23:40 🔗 pnJay has quit IRC (Leaving)
23:47 🔗 GE has quit IRC (Remote host closed the connection)
