Time |
Nickname |
Message |
00:00
🔗
|
|
BartoCH has joined #archiveteam-bs |
00:01
🔗
|
JAA |
Ah crap, fbo.gov has some crawler detection: https://www.fbo.gov/?s=main&rrltd=1 |
00:02
🔗
|
JAA |
But that page links to an FTP server that allegedly has all the data: ftp://ftp.fbo.gov/ |
00:02
🔗
|
JAA |
Doesn't seem to have the files though. |
00:32
🔗
|
|
killsushi has joined #archiveteam-bs |
00:39
🔗
|
paul2520 |
thanks jodizzle |
00:58
🔗
|
JAA |
Ah, nice, that crawler detection is awful. :-) |
01:41
🔗
|
|
britmob has quit IRC (Read error: Connection reset by peer) |
02:06
🔗
|
JAA |
As is the rest of the site. It's very slow and breaks in a variety of ways. |
02:16
🔗
|
JAA |
It's essentially possible to DoS most of the site with a couple thousand requests. Amazing. |
02:16
🔗
|
JAA |
As in, render it useless for an extended period of time. |
02:20
🔗
|
JAA |
I'll try again in the morning and hope the broken cache entries get flushed out by then. |
02:20
🔗
|
JAA |
I'm grabbing the FTP, by the way. |
03:24
🔗
|
|
manjaro-u has quit IRC (Read error: Operation timed out) |
03:49
🔗
|
|
sotty has left |
04:11
🔗
|
|
DogsRNice has quit IRC (Read error: Connection reset by peer) |
04:31
🔗
|
|
odemgi_ has joined #archiveteam-bs |
04:34
🔗
|
|
Stiletto has quit IRC () |
04:35
🔗
|
|
odemgi has quit IRC (Read error: Operation timed out) |
04:37
🔗
|
|
qw3rty2 has joined #archiveteam-bs |
04:41
🔗
|
|
qw3rty has quit IRC (Ping timeout: 745 seconds) |
04:43
🔗
|
|
dxrt_ has quit IRC (Read error: Operation timed out) |
04:44
🔗
|
|
dxrt_ has joined #archiveteam-bs |
04:44
🔗
|
|
dxrt sets mode: +o dxrt_ |
04:44
🔗
|
|
Fusl__ sets mode: +o dxrt_ |
04:44
🔗
|
|
Fusl_ sets mode: +o dxrt_ |
04:44
🔗
|
|
Fusl sets mode: +o dxrt_ |
05:32
🔗
|
|
Stiletto has joined #archiveteam-bs |
05:55
🔗
|
|
m007a83 has quit IRC (Read error: Connection reset by peer) |
06:00
🔗
|
|
Zeryl has quit IRC (Read error: Connection reset by peer) |
06:04
🔗
|
|
m007a83 has joined #archiveteam-bs |
07:12
🔗
|
|
deevious has joined #archiveteam-bs |
08:02
🔗
|
|
RichardG_ has joined #archiveteam-bs |
08:05
🔗
|
|
icedice has quit IRC (Quit: Leaving) |
08:07
🔗
|
|
RichardG_ has quit IRC (Ping timeout: 258 seconds) |
08:08
🔗
|
|
RichardG has quit IRC (Read error: Operation timed out) |
09:01
🔗
|
|
omglolba- has quit IRC (Read error: Connection reset by peer) |
09:06
🔗
|
|
omglolbah has joined #archiveteam-bs |
09:27
🔗
|
|
katocala has quit IRC (Read error: Operation timed out) |
09:27
🔗
|
|
katocala has joined #archiveteam-bs |
09:28
🔗
|
|
antomatic has quit IRC (Read error: Operation timed out) |
09:47
🔗
|
|
RichardG has joined #archiveteam-bs |
10:22
🔗
|
|
antomatic has joined #archiveteam-bs |
10:42
🔗
|
|
Panasonic has quit IRC (Read error: Operation timed out) |
10:58
🔗
|
|
BlueMax has quit IRC (Read error: Connection reset by peer) |
12:21
🔗
|
|
IAmbience has joined #archiveteam-bs |
12:33
🔗
|
|
slyphic has quit IRC (Read error: Operation timed out) |
12:35
🔗
|
|
slyphic has joined #archiveteam-bs |
12:45
🔗
|
|
killsushi has quit IRC (Quit: Leaving) |
13:03
🔗
|
|
Mateon1 has quit IRC (Ping timeout: 255 seconds) |
13:04
🔗
|
|
Mateon1 has joined #archiveteam-bs |
13:48
🔗
|
|
antomatic has quit IRC (Ping timeout: 745 seconds) |
13:52
🔗
|
|
antomatic has joined #archiveteam-bs |
14:03
🔗
|
|
kiskabak has joined #archiveteam-bs |
14:03
🔗
|
|
Fusl sets mode: +o kiskabak |
14:03
🔗
|
|
Fusl__ sets mode: +o kiskabak |
14:03
🔗
|
|
Fusl_ sets mode: +o kiskabak |
14:07
🔗
|
JAA |
FBO's pagination is still broken, surprise, surprise. Working better now than when I shredded it yesterday evening though. Hopefully I'll be able to get most of the entries anyway. |
14:07
🔗
|
JAA |
Their scraping detection is simply a rate limit of a bit under 1 request per second, but that's per session, not per IP. So I'm just running ten sessions in parallel now. :-) |
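A rough sketch of the per-session throttling JAA describes: each session waits a little over one second between its own requests, and the page URLs are spread over ten sessions. The log does not show JAA's actual tooling; `ThrottledSession`, `assign_round_robin`, and the 1.1-second interval are all assumptions made for illustration.

```python
import itertools
import time


class ThrottledSession:
    """Tracks when one (hypothetical) session may next send a request."""

    def __init__(self, name, min_interval=1.1, clock=time.monotonic):
        self.name = name
        self.min_interval = min_interval  # assumed: just over 1 s per request
        self._clock = clock
        self._next_allowed = 0.0

    def wait_time(self):
        """Seconds to sleep before this session may send again."""
        return max(0.0, self._next_allowed - self._clock())

    def mark_request(self):
        """Record that a request was just sent on this session."""
        self._next_allowed = self._clock() + self.min_interval


def assign_round_robin(urls, sessions):
    """Spread URLs evenly over the sessions, one queue per session."""
    queues = {s.name: [] for s in sessions}
    for url, session in zip(urls, itertools.cycle(sessions)):
        queues[session.name].append(url)
    return queues


# Ten sessions, as in the log; each worker would sleep for
# session.wait_time() before fetching the next URL from its own queue.
sessions = [ThrottledSession(f"s{i}") for i in range(10)]
queues = assign_round_robin([f"page{i}" for i in range(25)], sessions)
```

Because the limit is per session, each queue can be drained independently at its own pace.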
15:04
🔗
|
|
RichardG has quit IRC (Read error: Connection reset by peer) |
15:05
🔗
|
|
RichardG has joined #archiveteam-bs |
15:05
🔗
|
|
RichardG_ has joined #archiveteam-bs |
15:06
🔗
|
|
RichardG has quit IRC (Read error: Connection reset by peer) |
15:37
🔗
|
|
SmileyG has joined #archiveteam-bs |
15:37
🔗
|
|
Smiley has quit IRC (Read error: Operation timed out) |
15:48
🔗
|
|
akierig has joined #archiveteam-bs |
16:08
🔗
|
JAA |
Grabbing the SuperiorPics forums https://www.superiorpics.com/c/ now. http://forums.superiorpics.com/ubbthreads/ubbthreads.php/topics/5486588 |
16:10
🔗
|
Ryz |
JAA, would you also grab the old version of their forums? Like http://forums.superiorpics.com/ubbthreads/ubbthreads.php/forums/10//From_The_Webmaster ? It's locked behind a login, but most of the navigation pages can be accessed |
16:11
🔗
|
JAA |
That's actually what I'm grabbing. |
16:11
🔗
|
JAA |
That other thing is just a new interface for the same forums. |
16:11
🔗
|
Ryz |
...Oh it does oO; |
16:11
🔗
|
JAA |
My crawl starts from the category pages, e.g. http://forums.superiorpics.com/ubbthreads/ubbthreads.php/category/3/General_Comments |
16:12
🔗
|
Ryz |
Would that include the old version of the navigation pages? |
16:12
🔗
|
JAA |
I'm grabbing those category pages for the three existing categories, then all forums mentioned there, then all threads in those. |
16:12
🔗
|
JAA |
Including pagination obviously. |
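The crawl order JAA outlines (category pages, then the forums linked from them, then the threads, following pagination links along the way) amounts to a breadth-first traversal. A minimal sketch, assuming a hypothetical `get_links` callback that returns the URLs found on a page; the toy `site` mapping and its short paths are invented for illustration and are not the real forum URLs.

```python
from collections import deque


def crawl(seed_urls, get_links):
    """Breadth-first crawl; get_links(url) returns URLs found on that page."""
    seeds = list(dict.fromkeys(seed_urls))  # drop duplicate seeds, keep order
    seen = set(seeds)
    queue = deque(seeds)
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        for link in get_links(url):
            if link not in seen:  # never queue the same page twice
                seen.add(link)
                queue.append(link)
    return order


# Toy stand-in for the site: category -> forum -> forum page 2 -> threads.
site = {
    "category/3": ["forums/10"],
    "forums/10": ["forums/10/page2", "topics/1"],
    "forums/10/page2": ["topics/2", "forums/10"],
}
order = crawl(["category/3"], lambda url: site.get(url, []))
```

The `seen` set is what keeps pagination links that point back to earlier pages from looping the crawl.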
16:13
🔗
|
|
Atom-- has joined #archiveteam-bs |
16:13
🔗
|
Ryz |
Ah, it's too bad it seems the main page of the old version of the forums aren't showable (or it's entirely replaced) |
16:14
🔗
|
Ryz |
Going to http://forums.superiorpics.com/ubbthreads/ubbthreads.php/forum_summary would just redirect to https://www.superiorpics.com/c/ |
16:14
🔗
|
JAA |
Yup |
16:17
🔗
|
|
Atom__ has quit IRC (Read error: Operation timed out) |
16:22
🔗
|
|
manjaro-u has joined #archiveteam-bs |
16:35
🔗
|
|
katocala has quit IRC () |
16:57
🔗
|
|
katocala has joined #archiveteam-bs |
17:15
🔗
|
|
katocala has quit IRC () |
17:17
🔗
|
|
katocala has joined #archiveteam-bs |
17:31
🔗
|
|
icedice has joined #archiveteam-bs |
17:57
🔗
|
|
MrRadar has quit IRC (Read error: Operation timed out) |
18:10
🔗
|
|
akierig has quit IRC (Quit: later_gator) |
18:13
🔗
|
|
MrRadar has joined #archiveteam-bs |
18:17
🔗
|
|
systwi_ has joined #archiveteam-bs |
18:22
🔗
|
|
systwi has quit IRC (Read error: Operation timed out) |
18:36
🔗
|
|
Pixi has quit IRC (Quit: Pixi) |
18:37
🔗
|
|
katocala has quit IRC (Read error: Operation timed out) |
18:37
🔗
|
|
katocala has joined #archiveteam-bs |
18:45
🔗
|
|
Pixi has joined #archiveteam-bs |
19:07
🔗
|
|
wyatt8740 has joined #archiveteam-bs |
19:29
🔗
|
|
systwi has joined #archiveteam-bs |
19:34
🔗
|
|
akierig has joined #archiveteam-bs |
19:35
🔗
|
|
systwi_ has quit IRC (Read error: Operation timed out) |
19:40
🔗
|
|
icedice has quit IRC (Quit: Leaving) |
19:40
🔗
|
|
icedice has joined #archiveteam-bs |
19:42
🔗
|
|
icedice has quit IRC (Client Quit) |
19:42
🔗
|
|
icedice has joined #archiveteam-bs |
19:43
🔗
|
|
icedice has quit IRC (Client Quit) |
19:44
🔗
|
|
icedice has joined #archiveteam-bs |
19:44
🔗
|
|
icedice has quit IRC (Connection closed) |
19:44
🔗
|
|
icedice has joined #archiveteam-bs |
19:49
🔗
|
|
Quirk8 has quit IRC (END OF LINE) |
19:55
🔗
|
|
X-Scale` has joined #archiveteam-bs |
19:57
🔗
|
|
X-Scale has quit IRC (Ping timeout: 252 seconds) |
19:57
🔗
|
|
X-Scale` is now known as X-Scale |
20:29
🔗
|
|
X-Scale` has joined #archiveteam-bs |
20:30
🔗
|
|
X-Scale has quit IRC (Ping timeout: 252 seconds) |
20:30
🔗
|
|
X-Scale` is now known as X-Scale |
20:53
🔗
|
odemgi_ |
SketchCow, not managed to catch you since the betaarchive thing, did you get my pm? |
21:23
🔗
|
|
Dash has quit IRC (ZNC 1.6.6+deb1ubuntu0.2 - http://znc.in) |
21:50
🔗
|
|
Ravenloft has joined #archiveteam-bs |
21:56
🔗
|
|
Flashfire has quit IRC (Remote host closed the connection) |
21:56
🔗
|
|
kiska has quit IRC (Remote host closed the connection) |
21:57
🔗
|
|
Flashfire has joined #archiveteam-bs |
21:57
🔗
|
|
kiska has joined #archiveteam-bs |
21:57
🔗
|
|
Fusl__ sets mode: +o kiska |
21:57
🔗
|
|
Fusl sets mode: +o kiska |
21:57
🔗
|
|
Fusl_ sets mode: +o kiska |
21:57
🔗
|
|
BlueMax has joined #archiveteam-bs |
22:00
🔗
|
JAA |
Surprisingly, the FBO pagination is still working correctly it seems. |
22:00
🔗
|
JAA |
It's not even half-way done though. :-| |
22:00
🔗
|
JAA |
Average response time is ~20 seconds currently... |
22:01
🔗
|
JAA |
And it increases with the page number, obviously. |
22:03
🔗
|
JAA |
The SuperiorPics forum grab is running well except for some encoding issues. The site claims it's serving UTF-8, but apparently it's ISO-8859-1 instead. That only affects the logging of images and outlinks though, so I'll just let it keep throwing errors. If necessary, I can try to extract those URLs again in the end from the WARC. |
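One common way to cope with a page that declares UTF-8 but actually serves ISO-8859-1 is to try the declared encoding first and fall back on a decode error. This is only a sketch of that idea, not what JAA's crawler does; `decode_body` is a hypothetical helper.

```python
def decode_body(raw, declared="utf-8", fallback="iso-8859-1"):
    """Decode bytes with the declared charset, falling back if that fails.

    Returns the decoded text together with the encoding that worked.
    ISO-8859-1 maps every byte to a character, so the fallback never raises.
    """
    try:
        return raw.decode(declared), declared
    except UnicodeDecodeError:
        return raw.decode(fallback), fallback


# A Latin-1 byte string that is not valid UTF-8.
text, used = decode_body(b"caf\xe9")
```

Text that happens to be valid in both encodings would still be read as UTF-8, so the fallback only helps when the bytes are actually invalid.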
22:04
🔗
|
JAA |
Also, broken HTML. So much broken HTML... |
22:06
🔗
|
|
ibachandl has joined #archiveteam-bs |
22:11
🔗
|
|
akierig has quit IRC (Quit: later_gator) |
23:05
🔗
|
|
tuluu has quit IRC (Ping timeout: 258 seconds) |
23:07
🔗
|
|
katocala has quit IRC (Read error: Operation timed out) |
23:08
🔗
|
|
tuluu has joined #archiveteam-bs |
23:08
🔗
|
|
katocala has joined #archiveteam-bs |