Time |
Nickname |
Message |
00:32
🔗
|
|
pnJay has quit IRC (Leaving) |
01:09
🔗
|
|
tfgbd_znc has quit IRC (Ping timeout: 600 seconds) |
01:19
🔗
|
|
tfgbd_znc has joined #archiveteam-bs |
01:19
🔗
|
|
pizzaiolo has left |
01:26
🔗
|
|
kristian_ has quit IRC (Quit: Leaving) |
01:48
🔗
|
|
ndiddy has joined #archiveteam-bs |
02:36
🔗
|
|
tfgbd_znc has quit IRC (Ping timeout: 600 seconds) |
02:39
🔗
|
|
tfgbd_znc has joined #archiveteam-bs |
02:59
🔗
|
|
tfgbd_znc has quit IRC (Ping timeout: 600 seconds) |
03:06
🔗
|
|
tfgbd_znc has joined #archiveteam-bs |
03:09
🔗
|
|
Honno has joined #archiveteam-bs |
03:57
🔗
|
|
fie has quit IRC (Read error: Operation timed out) |
04:05
🔗
|
|
fie has joined #archiveteam-bs |
04:14
🔗
|
|
tammy_ has quit IRC (Ping timeout: 244 seconds) |
04:26
🔗
|
|
tammy_ has joined #archiveteam-bs |
04:44
🔗
|
|
Sk1d has quit IRC (Ping timeout: 194 seconds) |
05:51
🔗
|
|
Aranje has quit IRC (Read error: Operation timed out) |
06:55
🔗
|
|
kyounko has joined #archiveteam-bs |
07:05
🔗
|
|
odemg has quit IRC (Remote host closed the connection) |
07:53
🔗
|
|
kristian_ has joined #archiveteam-bs |
08:00
🔗
|
|
BlueMaxim has quit IRC (Read error: Operation timed out) |
08:01
🔗
|
|
BlueMaxim has joined #archiveteam-bs |
08:39
🔗
|
|
paparus has joined #archiveteam-bs |
08:44
🔗
|
|
flipflop has joined #archiveteam-bs |
08:47
🔗
|
|
prokuz has joined #archiveteam-bs |
08:48
🔗
|
|
GE has joined #archiveteam-bs |
08:56
🔗
|
|
paparus has quit IRC (Read error: Operation timed out) |
08:56
🔗
|
|
paparus has joined #archiveteam-bs |
08:57
🔗
|
|
fie has quit IRC (Read error: Connection reset by peer) |
08:59
🔗
|
|
flipflop has quit IRC (Read error: Operation timed out) |
09:04
🔗
|
|
prokuz has quit IRC (Read error: Operation timed out) |
09:16
🔗
|
|
fie has joined #archiveteam-bs |
09:17
🔗
|
|
JAA has joined #archiveteam-bs |
10:00
🔗
|
|
paparus has quit IRC (Quit: Leaving) |
10:40
🔗
|
|
pnJay has joined #archiveteam-bs |
10:43
🔗
|
|
GE has quit IRC (Remote host closed the connection) |
10:50
🔗
|
|
Silvan has joined #archiveteam-bs |
10:50
🔗
|
|
SilSte has quit IRC (Read error: Connection reset by peer) |
11:37
🔗
|
SpaffGarg |
the average size of the .torrent files is 49mb? that can't be right |
11:40
🔗
|
JAA |
The average size of the data in the torrents is 50 MiB, yes (not the .torrent files themselves). It actually looks about right. |
11:41
🔗
|
SpaffGarg |
oh that makes more sense |
11:41
🔗
|
JAA |
The largest category is games, most of which are only a few MiB. |
11:41
🔗
|
JAA |
The second largest is music, which is anywhere between a few and a few hundred MiB. |
11:42
🔗
|
JAA |
I'm grabbing the .torrent files already (along with the entire website), and there's the ArchiveBot grab. |
11:43
🔗
|
JAA |
But it might be worth grabbing the data inside the torrents as well. Mininova acted as a content distribution platform for the past few years. Not sure how much of that content was also distributed elsewhere by the publishers. |
11:44
🔗
|
JAA |
~90% of the torrents only have one seeder - most likely Mininova, which will stop seeding on 4 April. |
11:45
🔗
|
JAA |
And contrary to previous statements in here, all torrents I've looked at only had one tracker - Mininova. |
11:46
🔗
|
|
pizzaiolo has joined #archiveteam-bs |
12:06
🔗
|
|
GE has joined #archiveteam-bs |
12:23
🔗
|
|
kristian_ has quit IRC (Quit: Leaving) |
12:54
🔗
|
|
tfgbd_znc has quit IRC (Ping timeout: 600 seconds) |
12:58
🔗
|
|
tfgbd_znc has joined #archiveteam-bs |
13:16
🔗
|
|
BlueMaxim has quit IRC (Read error: Connection reset by peer) |
13:37
🔗
|
|
sep332 has quit IRC (Read error: Operation timed out) |
13:38
🔗
|
|
sep332_ has joined #archiveteam-bs |
13:44
🔗
|
|
pizzaiolo has quit IRC (Ping timeout: 246 seconds) |
14:07
🔗
|
|
pizzaiolo has joined #archiveteam-bs |
14:22
🔗
|
|
Jonison has joined #archiveteam-bs |
14:23
🔗
|
|
C4K3 has quit IRC (Read error: Operation timed out) |
14:26
🔗
|
|
C4K3 has joined #archiveteam-bs |
15:18
🔗
|
|
C4K3 has quit IRC (Quit: leaving) |
15:40
🔗
|
|
kyounko has quit IRC (Read error: Connection reset by peer) |
15:44
🔗
|
pnJay |
anyone have experience running warrior scripts in the Windows 10 bash prompt? |
15:51
🔗
|
luckcolor |
pnJay: no but i can help set them up |
15:52
🔗
|
luckcolor |
I used to run them on my Linux server last year |
15:53
🔗
|
luckcolor |
just note that if you plan on running them via something like systemd or Upstart under WSL on Windows, that's not going to work |
16:21
🔗
|
|
Dark_Star has quit IRC (Ping timeout: 506 seconds) |
16:24
🔗
|
|
Simpbrain has joined #archiveteam-bs |
16:37
🔗
|
|
odemg has joined #archiveteam-bs |
16:40
🔗
|
|
Dark_Star has joined #archiveteam-bs |
17:15
🔗
|
|
brayden has quit IRC (Ping timeout: 633 seconds) |
17:38
🔗
|
|
Aranje has joined #archiveteam-bs |
17:50
🔗
|
|
Aranje has quit IRC (Quit: Three sheets to the wind) |
17:59
🔗
|
|
GE has quit IRC (Remote host closed the connection) |
18:29
🔗
|
|
odemg has quit IRC (Remote host closed the connection) |
18:32
🔗
|
|
odemg has joined #archiveteam-bs |
18:37
🔗
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
18:40
🔗
|
|
dashcloud has joined #archiveteam-bs |
19:01
🔗
|
|
pizzaiolo has quit IRC (Remote host closed the connection) |
19:14
🔗
|
|
GE has joined #archiveteam-bs |
19:54
🔗
|
|
pnJay has quit IRC (Quit: Page closed) |
20:17
🔗
|
odemg |
https://pastebin.com/raw/zhivLAGh |
20:21
🔗
|
|
fie has quit IRC (Read error: Operation timed out) |
20:37
🔗
|
JAA |
Status updates: Mininova is at 160k URLs done, 302k left (growing), 600/hour; WunderBlogs at 471k done, 373k left (dropping), 6k/hour |
20:38
🔗
|
JAA |
(Yes, Mininova will likely not finish in time.) |
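A rough check of the status figures above, as a minimal sketch: it divides the URLs left by the hourly rate, assuming the rate stays constant and the queue stops growing (which the Mininova queue does not, so its figure is a lower bound). The function name `hours_remaining` is made up for illustration.

```python
def hours_remaining(urls_left: int, urls_per_hour: float) -> float:
    """Naive ETA: assumes a constant rate and no newly discovered URLs."""
    return urls_left / urls_per_hour


# Figures quoted in the status update above.
mininova_hours = hours_remaining(302_000, 600)
wunderblogs_hours = hours_remaining(373_000, 6_000)

for name, hours in [("Mininova", mininova_hours), ("WunderBlogs", wunderblogs_hours)]:
    print(f"{name}: about {hours:.0f} hours ({hours / 24:.1f} days)")
```

At these rates Mininova needs roughly 500 more hours (about three weeks) and WunderBlogs roughly 62 hours.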
20:45
🔗
|
|
Jonison has quit IRC (Read error: Connection reset by peer) |
20:46
🔗
|
tammy_ |
it's a shame it's not a warrior project. |
20:49
🔗
|
|
pnJay has joined #archiveteam-bs |
20:55
🔗
|
JAA |
ArchiveBot did grab it already. Unfortunately, it seems that over 10% of the pages are 500 Internal Server Error instead of the actual content. |
20:58
🔗
|
JAA |
I guess in this case you could first run a grab ignoring all torrent pages (which should be relatively quick; the slowest part for me is the statistics pages), collect the torrent IDs, and then group those together to get the items. |
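The "collect the torrent IDs, then group them into items" step could look roughly like the sketch below. The `/tor/<id>` URL shape, the helper names, and the item size are assumptions for illustration only; none of them come from the log.

```python
import re
from typing import Iterable, List

# Assumed URL shape for a torrent page; adjust to the site's real layout.
TORRENT_URL = re.compile(r"/tor/(\d+)")


def torrent_ids(urls: Iterable[str]) -> List[int]:
    """Extract the unique torrent IDs found in a list of crawled URLs."""
    ids = set()
    for url in urls:
        match = TORRENT_URL.search(url)
        if match:
            ids.add(int(match.group(1)))
    return sorted(ids)


def group_into_items(ids: List[int], size: int) -> List[List[int]]:
    """Split the ID list into fixed-size work items (the last may be shorter)."""
    return [ids[i:i + size] for i in range(0, len(ids), size)]


crawled = [
    "http://example.org/tor/3",
    "http://example.org/tor/1",
    "http://example.org/cat/5",
    "http://example.org/tor/3",
]
items = group_into_items(torrent_ids(crawled), size=1)
```

Each resulting item is a small batch of torrent IDs that a worker could fetch independently.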
21:02
🔗
|
JAA |
I'm curious though whether there is a general solution to grab an entire website in a distributed way without manual pre-processing. The tracker would tell the client "here's a list of 500 URLs - go fetch them and all page requisites, extract links to other URLs we're interested in, then give me the WARC and the list of new URLs". Does this exist? |
21:07
🔗
|
JAA |
(Actually, the page requisites should probably also just go in the list sent back to the server, otherwise there will be tons of duplicates.) |
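The tracker side of that idea can be sketched in a few lines. This is only an illustration of the bookkeeping described above, not an existing tool: the class and method names are hypothetical, and WARC upload, lost batches, and retries are left out. Clients report every discovered URL (page requisites included) and the tracker deduplicates centrally.

```python
from collections import deque
from typing import Deque, Iterable, List, Set


class Tracker:
    """Hands out URL batches and queues newly reported URLs exactly once."""

    def __init__(self, seeds: Iterable[str], batch_size: int = 500) -> None:
        self.batch_size = batch_size
        self.seen: Set[str] = set()       # every URL ever queued
        self.todo: Deque[str] = deque()   # URLs not yet handed out
        self.add_urls(seeds)

    def add_urls(self, urls: Iterable[str]) -> int:
        """Queue URLs that have not been seen before; return how many were new."""
        added = 0
        for url in urls:
            if url not in self.seen:
                self.seen.add(url)
                self.todo.append(url)
                added += 1
        return added

    def claim_batch(self) -> List[str]:
        """Give a client up to batch_size URLs to fetch."""
        count = min(self.batch_size, len(self.todo))
        return [self.todo.popleft() for _ in range(count)]

    def report(self, discovered: Iterable[str]) -> int:
        """Client hands back the links and page requisites it extracted."""
        return self.add_urls(discovered)
```

A client loop would call `claim_batch()`, fetch the URLs into a WARC, and pass the extracted links to `report()`; the crawl is done when the queue is empty and no batches are outstanding.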
21:08
🔗
|
|
Jonison has joined #archiveteam-bs |
21:14
🔗
|
odemg |
JAA, mininova is SLOW AS!!! |
21:18
🔗
|
|
BlueMaxim has joined #archiveteam-bs |
21:18
🔗
|
JAA |
Yeah, in this case a distributed effort would probably not really help. I suspect their servers are simply overloaded (or crappy). |
21:21
🔗
|
JAA |
Maybe I'll ignore all individual torrent statistics pages for now; they're extremely slow and timing out frequently. Is there a way to instruct wpull to ignore a URL but still store it in the database so I can grab them later (if I manage to download the rest and there's still time)? |
21:21
🔗
|
odemg |
yeah, I won't touch this, do your thing man |
21:27
🔗
|
JAA |
I try :) |
21:28
🔗
|
JAA |
Need to get some sleep now. If you have any clue about my questions, feel free to reply; I'll read the logs in the morning. |
21:29
🔗
|
|
JAA has quit IRC (Quit: Page closed) |
21:47
🔗
|
|
RichardG has quit IRC (Ping timeout: 633 seconds) |
21:50
🔗
|
|
RichardG has joined #archiveteam-bs |
21:57
🔗
|
|
Jonison has quit IRC (Read error: Connection reset by peer) |
22:03
🔗
|
|
bwn has quit IRC (Read error: Operation timed out) |
22:13
🔗
|
|
bwn has joined #archiveteam-bs |
23:30
🔗
|
|
odemg has quit IRC (Remote host closed the connection) |
23:31
🔗
|
|
odemg has joined #archiveteam-bs |
23:37
🔗
|
|
BlueMaxim has quit IRC (Read error: Operation timed out) |
23:38
🔗
|
|
Speck has joined #archiveteam-bs |
23:40
🔗
|
|
pnJay has quit IRC (Leaving) |
23:47
🔗
|
|
GE has quit IRC (Remote host closed the connection) |