#archiveteam-bs 2019-08-31,Sat


Time Nickname Message
00:00 🔗 BlueMax has quit IRC (Quit: Leaving)
00:01 🔗 fredgido has quit IRC (Remote host closed the connection)
00:02 🔗 fredgido has joined #archiveteam-bs
01:04 🔗 ZizzyDizz has joined #archiveteam-bs
01:10 🔗 BlueMax has joined #archiveteam-bs
01:22 🔗 killsushi has quit IRC (Read error: Connection reset by peer)
01:26 🔗 DogsRNice has quit IRC (Read error: Connection reset by peer)
01:38 🔗 nepeat has quit IRC (Read error: Connection reset by peer)
01:39 🔗 nepeat has joined #archiveteam-bs
01:46 🔗 nepeat has quit IRC (Quit: ZNC 1.7.4 - https://znc.in)
01:47 🔗 nepeat has joined #archiveteam-bs
01:55 🔗 m007a83_ is now known as m007a83
02:29 🔗 Smiley has quit IRC (Ping timeout: 252 seconds)
02:32 🔗 Smiley has joined #archiveteam-bs
02:44 🔗 qw3rty115 has joined #archiveteam-bs
02:49 🔗 qw3rty114 has quit IRC (Read error: Operation timed out)
03:39 🔗 larryv has quit IRC (Quit: larryv)
03:42 🔗 qw3rty116 has joined #archiveteam-bs
03:47 🔗 qw3rty115 has quit IRC (Ping timeout: 612 seconds)
03:58 🔗 odemgi_ has joined #archiveteam-bs
04:03 🔗 odemgi has quit IRC (Read error: Operation timed out)
04:05 🔗 ZizzyDizz has quit IRC (Ping timeout: 260 seconds)
04:42 🔗 omarroth has quit IRC (Remote host closed the connection)
06:11 🔗 d5f4a3622 has quit IRC (Read error: Connection reset by peer)
06:11 🔗 d5f4a3622 has joined #archiveteam-bs
06:37 🔗 PurpleSym JAA, Sanqui: chromebot is able to capture these fragment-based sites, but it does not support recursion on them yet, see https://github.com/PromyLOPh/crocoite/issues/12
07:18 🔗 Mateon1 has joined #archiveteam-bs
07:27 🔗 RichardG has quit IRC (Ping timeout: 360 seconds)
09:28 🔗 BlueMax has quit IRC (Quit: Leaving)
09:35 🔗 Sanqui It's tragic when an angelfire or tripod site tells you they've moved because it allows "more bandwidth and size" but the new website is gone completely.
10:08 🔗 godane SketchCow: good news
10:08 🔗 godane you're getting Processor Newspaper from 2012 to 2014
10:09 🔗 godane it's part of Sandhills Publishing
10:09 🔗 godane I have also come up with a new script using the Python ia command
10:33 🔗 tuluu has quit IRC (Read error: Connection refused)
10:33 🔗 tuluu has joined #archiveteam-bs
10:38 🔗 godane so Debian is doing an update on its own
11:49 🔗 RichardG has joined #archiveteam-bs
12:59 🔗 killsushi has joined #archiveteam-bs
14:00 🔗 JAA Fucking hell, Disqus is even more awful than I thought.
14:02 🔗 kiska XD
14:02 🔗 kiska That's js for ya
14:05 🔗 JAA Yeah
14:05 🔗 JAA On the plus side, I'm almost ready for starting the Channels archival.
14:05 🔗 JAA And it's about time since that'll go down either today or tomorrow.
14:15 🔗 Sanqui I thought Disqus works without JS
14:19 🔗 JAA Nope. Are you confusing it with Discourse?
14:27 🔗 JAA Soo, I can't guarantee that I'll grab all the relevant URLs that would in theory allow direct playback. There are just too many things playing into what is requested exactly. In particular, there is an "embed comments" URL which includes the thread URL (fine), the thread title (fine), another thread title (uhh), and *another* thread title but this time merged together from all <h1> tags on the page.
14:27 🔗 JAA Yup, really. The sort order also seems to be configured per forum/channel or something like that, but I can't figure out where it comes from.
15:20 🔗 n00buser has joined #archiveteam-bs
15:40 🔗 h3ndr1k_ has quit IRC (Ping timeout: 252 seconds)
16:02 🔗 n00buser has quit IRC (Ping timeout: 246 seconds)
16:46 🔗 JAA Disqus Channels archival is started. Let's see how long it takes until I get banned.
16:47 🔗 Raccoon nothing good shall come of this
16:47 🔗 Fusl ryz be very happy about this
16:48 🔗 JAA I didn't have any issues on my tests, but I only ran it with 20 concurrent connections there.
16:48 🔗 JAA I was doing ~5k requests per minute though.
17:18 🔗 h3ndr1k has joined #archiveteam-bs
17:38 🔗 JAA Well, not making much progress because Disqus is really slow.
17:39 🔗 JAA I'm only able to push about 2.5k requests per minute now.
17:40 🔗 JAA But at least something is being saved.
17:43 🔗 Raccoon "ONLy 2.5k REquEsTs pER miNuTe"
17:44 🔗 Fusl :D
17:44 🔗 JAA :-)
17:44 🔗 JAA I can easily do 600 requests per second with this code, so yeah, *only*.
17:45 🔗 Raccoon you should call your 'Discus' scraper, "Hammer Throw"
18:17 🔗 JAA About 44k discussions archived now.
18:21 🔗 JAA The channels with 10k or more followers have a total of about 943k discussions.
18:23 🔗 JAA So that'd be roughly 32 hours to cover those.
18:23 🔗 JAA (It doesn't actually proceed in that order though.)
18:25 🔗 JAA No issues so far with the retrieval though, other than it being slow. Just a handful of timeouts.
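The "roughly 32 hours" figure above can be reproduced from numbers in this log, assuming it is a plain extrapolation of the average rate so far: 44k discussions between the 16:46 start message and the 18:17 count, applied to the 943k discussions in the large channels. The log does not say how the estimate was actually computed, so the function below (`eta_hours` is a hypothetical name) is only a sketch of that assumption.

```python
from datetime import datetime


def eta_hours(done, total, started, now):
    """Hours needed to fetch `total` items at the average rate observed so far."""
    elapsed_h = (now - started).total_seconds() / 3600
    rate_per_h = done / elapsed_h
    return total / rate_per_h


# Timestamps are taken from the log lines (minute resolution only).
started = datetime(2019, 8, 31, 16, 46)  # "Disqus Channels archival is started."
now = datetime(2019, 8, 31, 18, 17)      # "About 44k discussions archived now."

hours = eta_hours(done=44_000, total=943_000, started=started, now=now)
print(f"{hours:.1f} h")  # prints "32.5 h"
```

This treats the rate as constant and ignores that the crawl does not proceed in follower order, so it is a rough bound rather than a schedule.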
18:59 🔗 h3ndr1k has quit IRC (Quit: )
19:03 🔗 h3ndr1k has joined #archiveteam-bs
19:41 🔗 ShellyRol has quit IRC (Ping timeout: 745 seconds)
19:52 🔗 ShellyRol has joined #archiveteam-bs
20:25 🔗 JAA My crawl just passed 100k archived discussions.
20:26 🔗 JAA It discovered 348 channels/forums/whatever in total, 191 are done from what I can see.
20:40 🔗 markedL oh JAA's on the case, yeah, good hands there.
20:51 🔗 omarroth has joined #archiveteam-bs
20:57 🔗 Hani111 has joined #archiveteam-bs
21:06 🔗 Ryz has joined #archiveteam-bs
21:06 🔗 Fusl__ sets mode: +o Ryz
21:06 🔗 Fusl sets mode: +o Ryz
21:06 🔗 Fusl_ sets mode: +o Ryz
21:07 🔗 Ryz Yessss, JAA, yessssss, grab as much loot as possibleeeeeeeeee~
21:07 🔗 JAA If Disqus's servers weren't so bad, I'd have a full copy already.
21:08 🔗 Ryz Any idea how much left?
21:08 🔗 Hani has quit IRC (Ping timeout: 745 seconds)
21:08 🔗 Hani111 is now known as Hani
21:09 🔗 JAA Not really. A lot for sure.
21:09 🔗 JAA 124k discussions grabbed, and there are at least a million.
21:10 🔗 closure has quit IRC (Read error: Operation timed out)
21:10 🔗 JAA My server could easily go four times as fast as it does now, but well...
21:10 🔗 closure_ has joined #archiveteam-bs
21:32 🔗 Raccoon Only 1/8 way there. Labor Day weekend. Nobody will notice till Tuesday after you've locked the doors behind you.
21:33 🔗 JAA They announced the shutdown for 1 September though.
21:33 🔗 Raccoon While everyone's getting drunk with family? doubtful
21:40 🔗 Ryz JAA, can you run qwarc on more than one website?
21:41 🔗 K4k_ has quit IRC (Ping timeout: 252 seconds)
21:42 🔗 JAA Ryz: Uh, yeah? I think you misunderstand what qwarc is though.
22:05 🔗 fredgido has quit IRC (Read error: Connection reset by peer)
22:05 🔗 fredgido has joined #archiveteam-bs
22:36 🔗 JAA Some stats after almost 6 hours: 187k discussions done, 1.15M requests (= 50 req/s on average), rx ~51 GB, 7.74 GiB WARCs
22:37 🔗 JAA 115 of the 348 channels are still being retrieved.
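As a rough cross-check of the 22:36 stats, the averages can be derived from the quoted totals. This sketch assumes the crawl started at 16:46 (the start message), that "GB" means 10^9 bytes and "GiB" means 2^30 bytes, and that the quoted counts are rounded; `summarize` is a hypothetical helper, not part of qwarc.

```python
def summarize(requests, discussions, elapsed_s, rx_bytes, warc_bytes):
    """Derive simple averages from the crawl totals quoted in the log."""
    return {
        "req_per_s": requests / elapsed_s,
        "req_per_discussion": requests / discussions,
        "rx_to_warc_ratio": rx_bytes / warc_bytes,
    }


stats = summarize(
    requests=1_150_000,
    discussions=187_000,
    elapsed_s=5 * 3600 + 50 * 60,  # 16:46 to 22:36 ("almost 6 hours")
    rx_bytes=51 * 10**9,           # assumption: GB = 10^9 bytes
    warc_bytes=7.74 * 2**30,       # GiB = 2^30 bytes
)

for name, value in stats.items():
    print(f"{name}: {value:.1f}")
```

With these inputs the average comes out slightly above the quoted 50 req/s, which is consistent with that figure being a round number; the received-to-WARC ratio of about 6 presumably reflects compressed WARC output, though the log does not say so.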
22:39 🔗 Ryz Mm, might not make it in time at this rate? :c
22:40 🔗 Ryz It'll still be a good amount of coverage; not a complete one ideally~
22:42 🔗 BlueMax has joined #archiveteam-bs
22:55 🔗 larryv has joined #archiveteam-bs
23:00 🔗 ShellyRol has quit IRC (Read error: Operation timed out)
23:05 🔗 ShellyRol has joined #archiveteam-bs
23:28 🔗 qw3rty117 has joined #archiveteam-bs
23:33 🔗 RichardG_ has joined #archiveteam-bs
23:33 🔗 RichardG has quit IRC (Read error: Connection reset by peer)
23:34 🔗 qw3rty116 has quit IRC (Ping timeout: 612 seconds)
