Time |
Nickname |
Message |
00:00
🔗
|
|
BlueMax has quit IRC (Quit: Leaving) |
00:01
🔗
|
|
fredgido has quit IRC (Remote host closed the connection) |
00:02
🔗
|
|
fredgido has joined #archiveteam-bs |
01:04
🔗
|
|
ZizzyDizz has joined #archiveteam-bs |
01:10
🔗
|
|
BlueMax has joined #archiveteam-bs |
01:22
🔗
|
|
killsushi has quit IRC (Read error: Connection reset by peer) |
01:26
🔗
|
|
DogsRNice has quit IRC (Read error: Connection reset by peer) |
01:38
🔗
|
|
nepeat has quit IRC (Read error: Connection reset by peer) |
01:39
🔗
|
|
nepeat has joined #archiveteam-bs |
01:46
🔗
|
|
nepeat has quit IRC (Quit: ZNC 1.7.4 - https://znc.in) |
01:47
🔗
|
|
nepeat has joined #archiveteam-bs |
01:55
🔗
|
|
m007a83_ is now known as m007a83 |
02:29
🔗
|
|
Smiley has quit IRC (Ping timeout: 252 seconds) |
02:32
🔗
|
|
Smiley has joined #archiveteam-bs |
02:44
🔗
|
|
qw3rty115 has joined #archiveteam-bs |
02:49
🔗
|
|
qw3rty114 has quit IRC (Read error: Operation timed out) |
03:39
🔗
|
|
larryv has quit IRC (Quit: larryv) |
03:42
🔗
|
|
qw3rty116 has joined #archiveteam-bs |
03:47
🔗
|
|
qw3rty115 has quit IRC (Ping timeout: 612 seconds) |
03:58
🔗
|
|
odemgi_ has joined #archiveteam-bs |
04:03
🔗
|
|
odemgi has quit IRC (Read error: Operation timed out) |
04:05
🔗
|
|
ZizzyDizz has quit IRC (Ping timeout: 260 seconds) |
04:42
🔗
|
|
omarroth has quit IRC (Remote host closed the connection) |
06:11
🔗
|
|
d5f4a3622 has quit IRC (Read error: Connection reset by peer) |
06:11
🔗
|
|
d5f4a3622 has joined #archiveteam-bs |
06:37
🔗
|
PurpleSym |
JAA, Sanqui: chromebot is able to capture these fragment-based sites, but it does not support recursion on them yet, see https://github.com/PromyLOPh/crocoite/issues/12 |
07:18
🔗
|
|
Mateon1 has joined #archiveteam-bs |
07:27
🔗
|
|
RichardG has quit IRC (Ping timeout: 360 seconds) |
09:28
🔗
|
|
BlueMax has quit IRC (Quit: Leaving) |
09:35
🔗
|
Sanqui |
It's tragic when an angelfire or tripod site tells you they've moved because it allows "more bandwidth and size" but the new website is gone completely. |
10:08
🔗
|
godane |
SketchCow: good news |
10:08
🔗
|
godane |
you're getting Processor Newspaper from 2012 to 2014 |
10:09
🔗
|
godane |
it's part of Sandhills Publishing |
10:09
🔗
|
godane |
i have also come up with my new script using the python ia command |
10:33
🔗
|
|
tuluu has quit IRC (Read error: Connection refused) |
10:33
🔗
|
|
tuluu has joined #archiveteam-bs |
10:38
🔗
|
godane |
so debian is doing an update on its own |
11:49
🔗
|
|
RichardG has joined #archiveteam-bs |
12:59
🔗
|
|
killsushi has joined #archiveteam-bs |
14:00
🔗
|
JAA |
Fucking hell, Disqus is even more awful than I thought. |
14:02
🔗
|
kiska |
XD |
14:02
🔗
|
kiska |
That's js for ya |
14:05
🔗
|
JAA |
Yeah |
14:05
🔗
|
JAA |
On the plus side, I'm almost ready for starting the Channels archival. |
14:05
🔗
|
JAA |
And it's about time since that'll go down either today or tomorrow. |
14:15
🔗
|
Sanqui |
I thought Disqus works without JS |
14:19
🔗
|
JAA |
Nope. Are you confusing it with Discourse? |
14:27
🔗
|
JAA |
Soo, I can't guarantee that I'll grab all the relevant URLs that would in theory allow direct playback. There are just too many things playing into what is requested exactly. In particular, there is an "embed comments" URL which includes the thread URL (fine), the thread title (fine), another thread title (uhh), and *another* thread title but this time merged together from all <h1> tags on the page. |
14:27
🔗
|
JAA |
Yup, really. The sort order also seems to be configured per forum/channel or something like that, but I can't figure out where it comes from. |
15:20
🔗
|
|
n00buser has joined #archiveteam-bs |
15:40
🔗
|
|
h3ndr1k_ has quit IRC (Ping timeout: 252 seconds) |
16:02
🔗
|
|
n00buser has quit IRC (Ping timeout: 246 seconds) |
16:46
🔗
|
JAA |
Disqus Channels archival is started. Let's see how long it takes until I get banned. |
16:47
🔗
|
Raccoon |
nothing good shall come of this |
16:47
🔗
|
Fusl |
ryz be very happy about this |
16:48
🔗
|
JAA |
I didn't have any issues on my tests, but I only ran it with 20 concurrent connections there. |
16:48
🔗
|
JAA |
I was doing ~5k requests per minute though. |
17:18
🔗
|
|
h3ndr1k has joined #archiveteam-bs |
17:38
🔗
|
JAA |
Well, not making much progress because Disqus is really slow. |
17:39
🔗
|
JAA |
I'm only able to push about 2.5k requests per minute now. |
17:40
🔗
|
JAA |
But at least something is being saved. |
17:43
🔗
|
Raccoon |
"ONLy 2.5k REquEsTs pER miNuTe" |
17:44
🔗
|
Fusl |
:D |
17:44
🔗
|
JAA |
:-) |
17:44
🔗
|
JAA |
I can easily do 600 requests per second with this code, so yeah, *only*. |
17:45
🔗
|
Raccoon |
you should call your 'Discus' scraper, "Hammer Throw" |
18:17
🔗
|
JAA |
About 44k discussions archived now. |
18:21
🔗
|
JAA |
The channels with 10k or more followers have a total of about 943k discussions. |
18:23
🔗
|
JAA |
So that'd be roughly 32 hours to cover those. |
18:23
🔗
|
JAA |
(It doesn't actually proceed in that order though.) |
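The "roughly 32 hours" figure can be reproduced with a simple linear extrapolation. The sketch below is only an illustration: it assumes the crawl began at the 16:46 announcement, that the 44k count applies at 18:17, and that the rate stays constant. The function name `eta_hours` is hypothetical and is not part of qwarc.

```python
from datetime import datetime


def eta_hours(start: str, now: str, done: int, total: int) -> float:
    """Hours needed to cover `total` items at the average rate seen so far.

    `start` and `now` are same-day "HH:MM" timestamps taken from the log.
    """
    fmt = "%H:%M"
    elapsed = datetime.strptime(now, fmt) - datetime.strptime(start, fmt)
    elapsed_h = elapsed.total_seconds() / 3600
    rate_per_hour = done / elapsed_h  # assumes a constant crawl rate
    return total / rate_per_hour


# Figures quoted in the log: 44k discussions done, 943k in the large channels.
hours = eta_hours("16:46", "18:17", 44_000, 943_000)
print(f"{hours:.1f} hours")
```

With these inputs the estimate lands a little above 32 hours, which matches the number quoted above. Since the crawl does not process channels in follower order, this is a rough bound rather than a schedule.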
18:25
🔗
|
JAA |
No issues so far with the retrieval though, other than it being slow. Just a handful of timeouts. |
18:59
🔗
|
|
h3ndr1k has quit IRC (Quit: ) |
19:03
🔗
|
|
h3ndr1k has joined #archiveteam-bs |
19:41
🔗
|
|
ShellyRol has quit IRC (Ping timeout: 745 seconds) |
19:52
🔗
|
|
ShellyRol has joined #archiveteam-bs |
20:25
🔗
|
JAA |
My crawl just passed 100k archived discussions. |
20:26
🔗
|
JAA |
It discovered 348 channels/forums/whatever in total, 191 are done from what I can see. |
20:40
🔗
|
markedL |
oh JAA's on the case, yeah, good hands there. |
20:51
🔗
|
|
omarroth has joined #archiveteam-bs |
20:57
🔗
|
|
Hani111 has joined #archiveteam-bs |
21:06
🔗
|
|
Ryz has joined #archiveteam-bs |
21:06
🔗
|
|
Fusl__ sets mode: +o Ryz |
21:06
🔗
|
|
Fusl sets mode: +o Ryz |
21:06
🔗
|
|
Fusl_ sets mode: +o Ryz |
21:07
🔗
|
Ryz |
Yessss, JAA, yessssss, grab as much loot as possibleeeeeeeeee~ |
21:07
🔗
|
JAA |
If Disqus's servers weren't so bad, I'd have a full copy already. |
21:08
🔗
|
Ryz |
Any idea how much left? |
21:08
🔗
|
|
Hani has quit IRC (Ping timeout: 745 seconds) |
21:08
🔗
|
|
Hani111 is now known as Hani |
21:09
🔗
|
JAA |
Not really. A lot for sure. |
21:09
🔗
|
JAA |
124k discussions grabbed, and there are at least a million. |
21:10
🔗
|
|
closure has quit IRC (Read error: Operation timed out) |
21:10
🔗
|
JAA |
My server could easily go four times as fast as it does now, but well... |
21:10
🔗
|
|
closure_ has joined #archiveteam-bs |
21:32
🔗
|
Raccoon |
Only 1/8 way there. Labor Day weekend. Nobody will notice till Tuesday after you've locked the doors behind you. |
21:33
🔗
|
JAA |
They announced the shutdown for 1 September though. |
21:33
🔗
|
Raccoon |
While everyone's getting drunk with family? doubtful |
21:40
🔗
|
Ryz |
JAA, can you run qwarc on more than one website? |
21:41
🔗
|
|
K4k_ has quit IRC (Ping timeout: 252 seconds) |
21:42
🔗
|
JAA |
Ryz: Uh, yeah? I think you misunderstand what qwarc is though. |
22:05
🔗
|
|
fredgido has quit IRC (Read error: Connection reset by peer) |
22:05
🔗
|
|
fredgido has joined #archiveteam-bs |
22:36
🔗
|
JAA |
Some stats after almost 6 hours: 187k discussions done, 1.15M requests (= 50 req/s on average), rx ~51 GB, 7.74 GiB WARCs |
22:37
🔗
|
JAA |
115 of the 348 channels are still being retrieved. |
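The stats above can be cross-checked with a few lines of arithmetic. This is a sketch under stated assumptions: the start time is taken as the 16:46 announcement, "GB" is read as decimal gigabytes and "GiB" as binary, and all variable names are illustrative.

```python
from datetime import datetime

FMT = "%H:%M"

# Elapsed time between the crawl announcement and the stats message.
elapsed_s = (
    datetime.strptime("22:36", FMT) - datetime.strptime("16:46", FMT)
).total_seconds()

requests = 1_150_000      # "1.15M requests"
discussions = 187_000     # "187k discussions done"
rx_bytes = 51e9           # "rx ~51 GB", assumed decimal
warc_bytes = 7.74 * 2**30  # "7.74 GiB WARCs", binary

req_per_s = requests / elapsed_s
req_per_discussion = requests / discussions
compression_ratio = rx_bytes / warc_bytes

print(f"average rate:            {req_per_s:.1f} req/s")
print(f"requests per discussion: {req_per_discussion:.1f}")
print(f"rx to WARC size ratio:   {compression_ratio:.1f}x")
```

Under these assumptions the average rate comes out slightly above the quoted 50 req/s, each discussion costs about six requests, and the WARCs are roughly a sixth of the received volume.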
22:39
🔗
|
Ryz |
Mm, might not make it in time at this rate? :c |
22:40
🔗
|
Ryz |
It'll still be a good amount of coverage; not a complete one ideally~ |
22:42
🔗
|
|
BlueMax has joined #archiveteam-bs |
22:55
🔗
|
|
larryv has joined #archiveteam-bs |
23:00
🔗
|
|
ShellyRol has quit IRC (Read error: Operation timed out) |
23:05
🔗
|
|
ShellyRol has joined #archiveteam-bs |
23:28
🔗
|
|
qw3rty117 has joined #archiveteam-bs |
23:33
🔗
|
|
RichardG_ has joined #archiveteam-bs |
23:33
🔗
|
|
RichardG has quit IRC (Read error: Connection reset by peer) |
23:34
🔗
|
|
qw3rty116 has quit IRC (Ping timeout: 612 seconds) |