[00:00] *** BartoCH has joined #archiveteam-bs
[00:01] Ah crap, fbo.gov has some crawler detection: https://www.fbo.gov/?s=main&rrltd=1
[00:02] But that page links to an FTP server that allegedly has all the data: ftp://ftp.fbo.gov/
[00:02] Doesn't seem to have the files though.
[00:32] *** killsushi has joined #archiveteam-bs
[00:39] thanks jodizzle
[00:58] Ah, nice, that crawler detection is awful. :-)
[01:41] *** britmob has quit IRC (Read error: Connection reset by peer)
[02:06] As is the rest of the site. It's very slow and breaks in a variety of ways.
[02:16] It's essentially possible to DoS most of the site with a couple thousand requests. Amazing.
[02:16] As in, render it useless for an extended period of time.
[02:20] I'll try again in the morning and hope the broken cache entries get flushed out by then.
[02:20] I'm grabbing the FTP, by the way.
[03:24] *** manjaro-u has quit IRC (Read error: Operation timed out)
[03:49] *** sotty has left
[04:11] *** DogsRNice has quit IRC (Read error: Connection reset by peer)
[04:31] *** odemgi_ has joined #archiveteam-bs
[04:34] *** Stiletto has quit IRC ()
[04:35] *** odemgi has quit IRC (Read error: Operation timed out)
[04:37] *** qw3rty2 has joined #archiveteam-bs
[04:41] *** qw3rty has quit IRC (Ping timeout: 745 seconds)
[04:43] *** dxrt_ has quit IRC (Read error: Operation timed out)
[04:44] *** dxrt_ has joined #archiveteam-bs
[04:44] *** dxrt sets mode: +o dxrt_
[04:44] *** Fusl__ sets mode: +o dxrt_
[04:44] *** Fusl_ sets mode: +o dxrt_
[04:44] *** Fusl sets mode: +o dxrt_
[05:32] *** Stiletto has joined #archiveteam-bs
[05:55] *** m007a83 has quit IRC (Read error: Connection reset by peer)
[06:00] *** Zeryl has quit IRC (Read error: Connection reset by peer)
[06:04] *** m007a83 has joined #archiveteam-bs
[07:12] *** deevious has joined #archiveteam-bs
[08:02] *** RichardG_ has joined #archiveteam-bs
[08:05] *** icedice has quit IRC (Quit: Leaving)
[08:07] *** RichardG_ has quit IRC (Ping timeout: 258 seconds)
[08:08] *** RichardG has quit IRC (Read error: Operation timed out)
[09:01] *** omglolba- has quit IRC (Read error: Connection reset by peer)
[09:06] *** omglolbah has joined #archiveteam-bs
[09:27] *** katocala has quit IRC (Read error: Operation timed out)
[09:27] *** katocala has joined #archiveteam-bs
[09:28] *** antomatic has quit IRC (Read error: Operation timed out)
[09:47] *** RichardG has joined #archiveteam-bs
[10:22] *** antomatic has joined #archiveteam-bs
[10:42] *** Panasonic has quit IRC (Read error: Operation timed out)
[10:58] *** BlueMax has quit IRC (Read error: Connection reset by peer)
[12:21] *** IAmbience has joined #archiveteam-bs
[12:33] *** slyphic has quit IRC (Read error: Operation timed out)
[12:35] *** slyphic has joined #archiveteam-bs
[12:45] *** killsushi has quit IRC (Quit: Leaving)
[13:03] *** Mateon1 has quit IRC (Ping timeout: 255 seconds)
[13:04] *** Mateon1 has joined #archiveteam-bs
[13:48] *** antomatic has quit IRC (Ping timeout: 745 seconds)
[13:52] *** antomatic has joined #archiveteam-bs
[14:03] *** kiskabak has joined #archiveteam-bs
[14:03] *** Fusl sets mode: +o kiskabak
[14:03] *** Fusl__ sets mode: +o kiskabak
[14:03] *** Fusl_ sets mode: +o kiskabak
[14:07] FBO's pagination is still broken, surprise, surprise. Working better now than when I shredded it yesterday evening though. Hopefully I'll be able to get most of the entries anyway.
[14:07] Their scraping detection is simply a rate limit of a bit under 1 request per second, but that's per session, not per IP. So I'm just running ten sessions in parallel now. :-)
[15:04] *** RichardG has quit IRC (Read error: Connection reset by peer)
[15:05] *** RichardG has joined #archiveteam-bs
[15:05] *** RichardG_ has joined #archiveteam-bs
[15:06] *** RichardG has quit IRC (Read error: Connection reset by peer)
[15:37] *** SmileyG has joined #archiveteam-bs
[15:37] *** Smiley has quit IRC (Read error: Operation timed out)
[15:48] *** akierig has joined #archiveteam-bs
[16:08] Grabbing the SuperiorPics forums https://www.superiorpics.com/c/ now. http://forums.superiorpics.com/ubbthreads/ubbthreads.php/topics/5486588
[16:10] JAA, would you also grab the old version of their forums? Like http://forums.superiorpics.com/ubbthreads/ubbthreads.php/forums/10//From_The_Webmaster ? It's locked behind a login, but most of the navigation pages can be accessed
[16:11] That's actually what I'm grabbing.
[16:11] That other thing is just a new interface for the same forums.
[16:11] ...Oh it does oO;
[16:11] My crawl starts from the category pages, e.g. http://forums.superiorpics.com/ubbthreads/ubbthreads.php/category/3/General_Comments
[16:12] Would that include the old version of the navigation pages?
[16:12] I'm grabbing those category pages for the three existing categories, then all forums mentioned there, then all threads in those.
[16:12] Including pagination obviously.
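The crawl order described at 16:11 and 16:12 (category pages, then the forums linked from them, then the threads in those forums, with pagination followed at each level) can be sketched roughly as below. This is only an illustration, not the tool actually used for the grab. The `/category/`, `/forums/` and `/topics/` markers are taken from the URLs quoted above; `LEVELS`, `level_of`, `crawl` and the `extract_links` callback are hypothetical names, and treating a same-level link as pagination is an assumption.

```python
from collections import deque
from typing import Callable, Iterable, List

# Path markers seen in the URLs quoted in the log, ordered from the
# top of the hierarchy (categories) down to individual threads.
LEVELS = ("/category/", "/forums/", "/topics/")


def level_of(url: str) -> int:
    """Return the index of the crawl level a URL belongs to, or -1."""
    for i, marker in enumerate(LEVELS):
        if marker in url:
            return i
    return -1


def crawl(seeds: Iterable[str],
          extract_links: Callable[[str], Iterable[str]]) -> List[str]:
    """Breadth-first walk: categories, then forums, then threads.

    extract_links is a hypothetical callback that fetches a page and
    returns the URLs it links to.
    """
    seeds = list(seeds)
    seen = set(seeds)
    queue = deque(seeds)
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        cur = level_of(url)
        if cur < 0:
            continue  # not part of the forum hierarchy, do not expand
        for link in extract_links(url):
            nxt = level_of(link)
            # Assumption: a link on the same level is pagination, and a
            # link one level down is a child (forum or thread).
            if link not in seen and nxt in (cur, cur + 1):
                seen.add(link)
                queue.append(link)
    return order
```

Passing the link extractor in as a callback keeps the traversal logic separate from fetching, so the order can be checked against a small in-memory link map before pointing it at a real site.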
[16:13] *** Atom-- has joined #archiveteam-bs
[16:13] Ah, it's too bad it seems the main page of the old version of the forums isn't viewable (or it's entirely replaced)
[16:14] Going to http://forums.superiorpics.com/ubbthreads/ubbthreads.php/forum_summary would just redirect to https://www.superiorpics.com/c/
[16:14] Yup
[16:17] *** Atom__ has quit IRC (Read error: Operation timed out)
[16:22] *** manjaro-u has joined #archiveteam-bs
[16:35] *** katocala has quit IRC ()
[16:57] *** katocala has joined #archiveteam-bs
[17:15] *** katocala has quit IRC ()
[17:17] *** katocala has joined #archiveteam-bs
[17:31] *** icedice has joined #archiveteam-bs
[17:57] *** MrRadar has quit IRC (Read error: Operation timed out)
[18:10] *** akierig has quit IRC (Quit: later_gator)
[18:13] *** MrRadar has joined #archiveteam-bs
[18:17] *** systwi_ has joined #archiveteam-bs
[18:22] *** systwi has quit IRC (Read error: Operation timed out)
[18:36] *** Pixi has quit IRC (Quit: Pixi)
[18:37] *** katocala has quit IRC (Read error: Operation timed out)
[18:37] *** katocala has joined #archiveteam-bs
[18:45] *** Pixi has joined #archiveteam-bs
[19:07] *** wyatt8740 has joined #archiveteam-bs
[19:29] *** systwi has joined #archiveteam-bs
[19:34] *** akierig has joined #archiveteam-bs
[19:35] *** systwi_ has quit IRC (Read error: Operation timed out)
[19:40] *** icedice has quit IRC (Quit: Leaving)
[19:40] *** icedice has joined #archiveteam-bs
[19:42] *** icedice has quit IRC (Client Quit)
[19:42] *** icedice has joined #archiveteam-bs
[19:43] *** icedice has quit IRC (Client Quit)
[19:44] *** icedice has joined #archiveteam-bs
[19:44] *** icedice has quit IRC (Connection closed)
[19:44] *** icedice has joined #archiveteam-bs
[19:49] *** Quirk8 has quit IRC (END OF LINE)
[19:55] *** X-Scale` has joined #archiveteam-bs
[19:57] *** X-Scale has quit IRC (Ping timeout: 252 seconds)
[19:57] *** X-Scale` is now known as X-Scale
[20:29] *** X-Scale` has joined #archiveteam-bs
[20:30] *** X-Scale has quit IRC (Ping timeout: 252 seconds)
[20:30] *** X-Scale` is now known as X-Scale
[20:53] SketchCow, not managed to catch you since the betaarchive thing, did you get my pm?
[21:23] *** Dash has quit IRC (ZNC 1.6.6+deb1ubuntu0.2 - http://znc.in)
[21:50] *** Ravenloft has joined #archiveteam-bs
[21:56] *** Flashfire has quit IRC (Remote host closed the connection)
[21:56] *** kiska has quit IRC (Remote host closed the connection)
[21:57] *** Flashfire has joined #archiveteam-bs
[21:57] *** kiska has joined #archiveteam-bs
[21:57] *** Fusl__ sets mode: +o kiska
[21:57] *** Fusl sets mode: +o kiska
[21:57] *** Fusl_ sets mode: +o kiska
[21:57] *** BlueMax has joined #archiveteam-bs
[22:00] Surprisingly, the FBO pagination is still working correctly it seems.
[22:00] It's not even half-way done though. :-|
[22:00] Average response time is ~20 seconds currently...
[22:01] And it increases with the page number, obviously.
[22:03] The SuperiorPics forum grab is running well except for some encoding issues. The site claims it's serving UTF-8, but apparently it's ISO-8859-1 instead. That only affects the logging of images and outlinks though, so I'll just let it keep throwing errors. If necessary, I can try to extract those URLs again in the end from the WARC.
[22:04] Also, broken HTML. So much broken HTML...
[22:06] *** ibachandl has joined #archiveteam-bs
[22:11] *** akierig has quit IRC (Quit: later_gator)
[23:05] *** tuluu has quit IRC (Ping timeout: 258 seconds)
[23:07] *** katocala has quit IRC (Read error: Operation timed out)
[23:08] *** tuluu has joined #archiveteam-bs
[23:08] *** katocala has joined #archiveteam-bs
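The encoding mismatch mentioned at 22:03 (pages declared as UTF-8 but apparently encoded as ISO-8859-1) is the kind of problem a decode-with-fallback step can work around when re-extracting URLs later. The following is a minimal sketch under that assumption, not what the crawler in the log does; `decode_body` and its parameters are hypothetical names.

```python
def decode_body(raw: bytes,
                declared: str = "utf-8",
                fallback: str = "iso-8859-1") -> str:
    """Decode a response body, distrusting the declared charset.

    Tries the charset the server claims first. If the bytes are not
    valid in that charset (or the charset name is unknown), falls back
    to ISO-8859-1, which maps every byte value and so cannot fail.
    """
    try:
        return raw.decode(declared)
    except (UnicodeDecodeError, LookupError):
        return raw.decode(fallback)
```

Trying UTF-8 first matters: almost any byte sequence decodes "successfully" as ISO-8859-1, so using it first would silently garble pages that really are UTF-8.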