Time |
Nickname |
Message |
00:09
🔗
|
wp494 |
reminder that d-day for roblox forums is the 27th |
00:10
🔗
|
|
odemg has joined #archiveteam-bs |
00:10
🔗
|
|
ReimuHaku has quit IRC (Ping timeout: 245 seconds) |
00:13
🔗
|
|
ReimuHaku has joined #archiveteam-bs |
00:17
🔗
|
|
BlueMaxim has quit IRC (Quit: Leaving) |
00:18
🔗
|
|
bitBaron has joined #archiveteam-bs |
00:19
🔗
|
|
qw3rty2 has quit IRC (Read error: Operation timed out) |
00:22
🔗
|
|
qw3rty has joined #archiveteam-bs |
00:24
🔗
|
|
bitBaron has quit IRC (Read error: Operation timed out) |
00:24
🔗
|
|
GLaDOS has quit IRC (Remote host closed the connection) |
00:26
🔗
|
|
tsuckow has joined #archiveteam-bs |
00:31
🔗
|
|
GLaDOS has joined #archiveteam-bs |
00:35
🔗
|
GLaDOS |
..so i figured out why i wasnt able to log into my wiki account |
00:35
🔗
|
GLaDOS |
my password manager fucking urlencoded the password when it saved it |
00:36
🔗
|
|
tsuckow has quit IRC (Ping timeout: 268 seconds) |
00:41
🔗
|
xmc |
hahahaha |
00:41
🔗
|
xmc |
computers are just great aren't they |
01:20
🔗
|
GLaDOS |
"what's that? this string has to be kept the exact same? well i better encode it then!" |
01:23
🔗
|
|
bitBaron has joined #archiveteam-bs |
01:33
🔗
|
|
bitBaron has quit IRC (Read error: Operation timed out) |
01:39
🔗
|
|
username1 has joined #archiveteam-bs |
01:41
🔗
|
|
schbirid2 has quit IRC (Read error: Operation timed out) |
02:02
🔗
|
|
GLaDOS has quit IRC (Remote host closed the connection) |
02:08
🔗
|
|
Odd0002 has quit IRC (Remote host closed the connection) |
02:21
🔗
|
|
GLaDOS has joined #archiveteam-bs |
02:28
🔗
|
|
bitBaron has joined #archiveteam-bs |
02:36
🔗
|
|
bitBaron has quit IRC (Read error: Operation timed out) |
02:41
🔗
|
|
GLaDOS has quit IRC (Remote host closed the connection) |
02:42
🔗
|
|
GLaDOS has joined #archiveteam-bs |
02:50
🔗
|
|
pizzaiolo has quit IRC (pizzaiolo) |
03:14
🔗
|
|
qw3rty2 has joined #archiveteam-bs |
03:21
🔗
|
|
qw3rty has quit IRC (Read error: Operation timed out) |
03:31
🔗
|
|
bitBaron has joined #archiveteam-bs |
03:40
🔗
|
|
bitBaron has quit IRC (Read error: Operation timed out) |
03:43
🔗
|
|
Stiletto has quit IRC (Read error: Operation timed out) |
04:15
🔗
|
|
BlueMaxim has joined #archiveteam-bs |
04:37
🔗
|
|
bitBaron has joined #archiveteam-bs |
04:48
🔗
|
|
bitBaron has quit IRC (Ping timeout: 633 seconds) |
04:49
🔗
|
|
Sk1d has quit IRC (Ping timeout: 250 seconds) |
04:56
🔗
|
|
Sk1d has joined #archiveteam-bs |
05:40
🔗
|
|
bitBaron has joined #archiveteam-bs |
05:51
🔗
|
|
bitBaron has quit IRC (Ping timeout: 633 seconds) |
05:54
🔗
|
|
Yoshimura has joined #archiveteam-bs |
06:05
🔗
|
|
j08nY has joined #archiveteam-bs |
06:29
🔗
|
|
Fletcher| has quit IRC (Remote host closed the connection) |
06:38
🔗
|
|
j08nY has quit IRC (Quit: Leaving) |
06:45
🔗
|
|
bitBaron has joined #archiveteam-bs |
06:45
🔗
|
|
Fletcher has joined #archiveteam-bs |
06:52
🔗
|
|
bitBaron has quit IRC (Read error: Operation timed out) |
08:33
🔗
|
|
zhongfu has quit IRC (Ping timeout: 260 seconds) |
08:41
🔗
|
|
BlueMaxim has quit IRC (Read error: Operation timed out) |
08:51
🔗
|
|
bitBaron has joined #archiveteam-bs |
09:01
🔗
|
|
Stiletti has joined #archiveteam-bs |
09:02
🔗
|
|
Stiletti is now known as Stiletto |
09:02
🔗
|
|
bitBaron has quit IRC (Read error: Operation timed out) |
09:56
🔗
|
|
bitBaron has joined #archiveteam-bs |
10:08
🔗
|
|
bitBaron has quit IRC (Read error: Operation timed out) |
10:43
🔗
|
|
zhongfu has joined #archiveteam-bs |
11:03
🔗
|
|
bitBaron has joined #archiveteam-bs |
11:40
🔗
|
|
pizzaiolo has joined #archiveteam-bs |
12:01
🔗
|
|
odemg has quit IRC (Ping timeout: 260 seconds) |
12:37
🔗
|
|
odemg has joined #archiveteam-bs |
12:56
🔗
|
|
ld1 has quit IRC (Quit: WeeChat 1.9) |
13:02
🔗
|
|
brayden has joined #archiveteam-bs |
13:02
🔗
|
|
swebb sets mode: +o brayden |
13:02
🔗
|
|
brayden has quit IRC (Connection closed) |
13:03
🔗
|
|
brayden has joined #archiveteam-bs |
13:03
🔗
|
|
swebb sets mode: +o brayden |
13:10
🔗
|
|
ld1 has joined #archiveteam-bs |
13:10
🔗
|
|
ld1 has quit IRC (Client Quit) |
13:13
🔗
|
|
ld1 has joined #archiveteam-bs |
13:59
🔗
|
|
ld1 has quit IRC (Quit: WeeChat 1.9) |
14:14
🔗
|
|
ld1 has joined #archiveteam-bs |
14:14
🔗
|
|
qw3rty has joined #archiveteam-bs |
14:17
🔗
|
|
qw3rty2 has quit IRC (Ping timeout: 600 seconds) |
14:33
🔗
|
|
qw3rty2 has joined #archiveteam-bs |
14:41
🔗
|
|
qw3rty has quit IRC (Read error: Operation timed out) |
14:58
🔗
|
|
ZexaronS has quit IRC (Quit: Leaving) |
15:30
🔗
|
JAA |
arkiver: For example, there would be an item forumpage:13_186637 which translates to retrieving https://forum.roblox.com/Forum/ShowForum.aspx?ForumID=13&PageIndex=186637 and all /Forum/ShowPost.aspx?PostID=\d+$ links on that page plus the relevant &PageIndex= URLs for multi-paged threads. |
15:33
🔗
|
JAA |
(In this particular case, there are no multi-paged threads though.) |
15:34
🔗
|
JAA |
For forumpage:13_186638, it would for example fetch https://forum.roblox.com/Forum/ShowPost.aspx?PostID=24086454 and https://forum.roblox.com/Forum/ShowPost.aspx?PostID=24086454&PageIndex=2 |
15:37
🔗
|
JAA |
I can give you my code for extracting the post IDs and maximum page index for each thread on a forum index page, if you want. |
15:38
🔗
|
|
j08nY has joined #archiveteam-bs |
16:11
🔗
|
JAA |
Also, forumpage:N_1 should retrieve https://forum.roblox.com/Forum/ShowForum.aspx?ForumID=N, i.e. no PageIndex parameter. |
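The item naming scheme JAA describes above can be sketched roughly as follows. This is only an illustration based on the messages in this log, not the code that ended up in the warrior project; the function names `forum_page_url` and `thread_urls` are hypothetical.

```python
from urllib.parse import urlencode

FORUM_BASE = "https://forum.roblox.com/Forum/ShowForum.aspx"
POST_BASE = "https://forum.roblox.com/Forum/ShowPost.aspx"


def forum_page_url(item_name: str) -> str:
    """Translate an item such as 'forumpage:13_186637' into its forum index URL."""
    prefix, _, rest = item_name.partition(":")
    if prefix != "forumpage":
        raise ValueError(f"unexpected item type: {item_name!r}")
    forum_id, _, page = rest.partition("_")
    params = {"ForumID": int(forum_id)}
    # Page 1 is fetched without a PageIndex parameter, as noted in the log.
    if int(page) != 1:
        params["PageIndex"] = int(page)
    return f"{FORUM_BASE}?{urlencode(params)}"


def thread_urls(post_id: int, max_page: int) -> list[str]:
    """List the URLs of a thread: the bare PostID URL plus &PageIndex=2..max_page."""
    urls = [f"{POST_BASE}?{urlencode({'PostID': post_id})}"]
    for page in range(2, max_page + 1):
        urls.append(f"{POST_BASE}?{urlencode({'PostID': post_id, 'PageIndex': page})}")
    return urls
```

For the two-page thread mentioned above, `thread_urls(24086454, 2)` yields the two ShowPost URLs JAA lists for forumpage:13_186638.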
16:12
🔗
|
JAA |
We might miss some things if people post on a forum while we're grabbing it, but we should be able to capture most of it. |
16:12
🔗
|
JAA |
My wpull index grab is only at roughly 1/4 by the way... |
16:22
🔗
|
arkiver |
ok |
16:22
🔗
|
arkiver |
JAA: can you please prepare the items list |
16:23
🔗
|
arkiver |
preparing the warrior project now |
16:23
🔗
|
arkiver |
with this idea |
16:26
🔗
|
JAA |
Will do |
16:28
🔗
|
JAA |
Oh dear |
16:29
🔗
|
JAA |
Can anyone please find and kill the guy who produced this bullshit forum software? |
16:29
🔗
|
username1 |
may i propose lennart poettering? |
16:29
🔗
|
|
username1 is now known as schbirid |
16:29
🔗
|
JAA |
https://forum.roblox.com/Forum/ShowForum.aspx?ForumID=46&PageIndex=69161 returns threads although there are only 69160 pages in that forum. |
16:30
🔗
|
JAA |
Also, each page is supposed to contain 25 threads, but that's rarely the case. |
16:30
🔗
|
JAA |
And yeah, we can go after Lennart as well. |
16:42
🔗
|
|
pikhq has quit IRC (Read error: Operation timed out) |
16:45
🔗
|
JAA |
arkiver: Done. It's a 19.8 MiB text file with one item name per line. Compresses down to 2.66 MiB with gzip. Where do you want me to put it? |
16:45
🔗
|
JAA |
Slightly over 1 million items, by the way. |
16:45
🔗
|
arkiver |
1 million items is good |
16:50
🔗
|
|
pikhq has joined #archiveteam-bs |
16:53
🔗
|
JAA |
arkiver: Do you want my code for determining the number of thread pages? |
16:54
🔗
|
arkiver |
sure |
16:54
🔗
|
JAA |
(It's really just a few string searches in the forum listing HTML.) |
16:55
🔗
|
arkiver |
yeah, I'm almost done anyway, but show me |
16:55
🔗
|
arkiver |
haha |
16:55
🔗
|
arkiver |
there's one really good way to make a website not playable by the wayback machine |
16:55
🔗
|
arkiver |
make the website one big POST request and it will not be in the wayback machine :P |
16:56
🔗
|
JAA |
I first did it with an HTML parser and XPath. More beautiful and way slower. |
16:56
🔗
|
JAA |
Haha yeah. |
16:56
🔗
|
arkiver |
ah yeah, not using that |
16:56
🔗
|
JAA |
I'll remember that in case I ever want to make a page unarchivable. |
16:56
🔗
|
arkiver |
:P |
17:01
🔗
|
JAA |
arkiver: https://gist.github.com/JustAnotherArchivist/85ef5c0e9d874791ee485fa69d08ac62#file-hook-py-L61-L82 |
18:07
🔗
|
|
ld1 has quit IRC (Quit: WeeChat 1.9) |
18:09
🔗
|
|
ld1 has joined #archiveteam-bs |
18:14
🔗
|
|
ld1 has quit IRC (Quit: WeeChat 1.9) |
18:20
🔗
|
|
ld1 has joined #archiveteam-bs |
18:32
🔗
|
|
odemg has quit IRC (Read error: Operation timed out) |
19:19
🔗
|
|
ZexaronS has joined #archiveteam-bs |
19:38
🔗
|
|
odemg has joined #archiveteam-bs |
20:26
🔗
|
|
cog has joined #archiveteam-bs |
20:36
🔗
|
|
qw3rty has joined #archiveteam-bs |
20:39
🔗
|
|
qw3rty3 has joined #archiveteam-bs |
20:39
🔗
|
|
qw3rty2 has quit IRC (Read error: Operation timed out) |
20:46
🔗
|
|
qw3rty has quit IRC (Read error: Operation timed out) |
20:52
🔗
|
JAA |
arkiver: How is it going? We need to get this up and running ASAP... |
20:54
🔗
|
wp494 |
^^^ |
20:54
🔗
|
wp494 |
we have a bit under 48 hrs to go |
20:59
🔗
|
|
qw3rty3 has quit IRC (Read error: Operation timed out) |
21:01
🔗
|
|
Sue has quit IRC (Quit: leaving) |
21:01
🔗
|
|
Sue_ is now known as Sue |
21:10
🔗
|
arkiver |
almost done |
21:10
🔗
|
arkiver |
hold on for a little bit longer |
21:10
🔗
|
arkiver |
when was it announced this would go away? |
21:11
🔗
|
|
odemg has quit IRC (Read error: Operation timed out) |
21:11
🔗
|
|
odemg has joined #archiveteam-bs |
21:16
🔗
|
JAA |
arkiver: Ten days ago or so. |
21:29
🔗
|
JAA |
arkiver: Any ETA? I have to leave soonish. |
21:29
🔗
|
arkiver |
under an hour hopefully |
21:30
🔗
|
arkiver |
please send me the item list some way so I have that in case you're offline |
21:31
🔗
|
JAA |
Right. How? |
21:32
🔗
|
arkiver |
I guess you can put it on github |
21:32
🔗
|
arkiver |
not sure where :P |
21:32
🔗
|
JAA |
I'll try Gist. |
21:33
🔗
|
arkiver |
i'm using the <span class="normalTextSmallBold">Page 1 of 3</span> to get max page |
21:35
🔗
|
arkiver |
some final testing now |
21:38
🔗
|
JAA |
arkiver: https://gist.github.com/JustAnotherArchivist/9cdc31f75e995e228014c97876d29350/raw/50df623f504addf9eed79cc113de439cd926cbe5/roblox-items.txt |
21:40
🔗
|
arkiver |
do we also want pageindex=1 |
21:40
🔗
|
arkiver |
let's skip that |
21:41
🔗
|
JAA |
Agreed |
21:41
🔗
|
JAA |
There are links to that on the forum index, but since you get the same content as without the parameter, there is no point. |
21:41
🔗
|
arkiver |
I'll have a log for you to check in a few minutes |
21:45
🔗
|
arkiver |
JAA: https://paste.fedoraproject.org/paste/zsD2uqj8jDV7RiZRxXLGmA/raw |
21:47
🔗
|
arkiver |
JAA: another one https://paste.fedoraproject.org/paste/cpjj7-uF62HgF2TzpJsIgg/raw |
21:47
🔗
|
arkiver |
looks like new threads are still being created, so the threads on a page shift over time |
21:48
🔗
|
JAA |
Yeah, and some old threads might become active again. We'll certainly miss some threads with this strategy, but I don't think anything else is feasible in this time frame and without massive data duplication. |
21:49
🔗
|
arkiver |
we'll miss the newest threads |
21:49
🔗
|
arkiver |
uh nvm, we're not going through items sequentially |
21:49
🔗
|
arkiver |
(wanted to say we're archiving some double) |
21:49
🔗
|
arkiver |
does this look good? |
21:54
🔗
|
JAA |
Yeah, looks good. Does your code for the max page retrieval account for number formatting? I believe they print numbers over 1000 as "1,000". At least they do on the pager in the forums. |
21:54
🔗
|
arkiver |
no, do you have an example? |
21:54
🔗
|
JAA |
I haven't come across a thread yet, but I'm sure they exist. |
21:54
🔗
|
arkiver |
ah, simply see bottom of https://forum.roblox.com/Forum/ShowForum.aspx?ForumID=13&PageIndex=186638 |
21:54
🔗
|
arkiver |
I'll add support for that |
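A minimal sketch of the max-page extraction discussed here, assuming the pager markup is exactly the `<span class="normalTextSmallBold">Page 1 of 3</span>` quoted by arkiver and that large numbers carry a thousands separator ("1,000") as JAA suspects. The function name `max_page` and the fallback to 1 when no pager is found are assumptions, not taken from the project code.

```python
import re

# Matches e.g. 'Page 1 of 3' or 'Page 186,638 of 186,640' inside the pager span.
PAGER_RE = re.compile(
    r'<span class="normalTextSmallBold">Page\s+([\d,]+)\s+of\s+([\d,]+)</span>'
)


def max_page(html: str) -> int:
    """Return the last page number shown in the pager, ignoring thousands separators."""
    match = PAGER_RE.search(html)
    if match is None:
        # Assumption: no pager means the thread or forum has a single page.
        return 1
    return int(match.group(2).replace(",", ""))
```

Stripping the comma before calling `int()` is the whole fix; without it `int("1,000")` raises a `ValueError`.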
21:55
🔗
|
JAA |
Yeah. I *think* it will look the same for threads, but it'd be nice to check. |
21:55
🔗
|
JAA |
But of course that stupid forum software doesn't allow for sorting threads by post number. |
21:59
🔗
|
JAA |
arkiver: Found one: https://forum.roblox.com/Forum/ShowPost.aspx?PostID=80624889 |
21:59
🔗
|
JAA |
You can test it at forumpage:23_1. |
22:00
🔗
|
arkiver |
yep, testing now |
22:02
🔗
|
arkiver |
it's doing this, so I guess it's working https://paste.fedoraproject.org/paste/CKurQ~alUWo8D5G8mpPHzw/raw |
22:03
🔗
|
JAA |
https://forum.roblox.com/Forum/ShowForum.aspx?ForumID=23&PageIndex=1 << I'd remove the PageIndex parameter here as well. |
22:03
🔗
|
JAA |
Otherwise, looks good. :-) |
22:03
🔗
|
arkiver |
it's just a few, let's save those without pageindex with save page now |
22:03
🔗
|
arkiver |
I'm starting the project |
22:05
🔗
|
arkiver |
actually added it anyway |
22:06
🔗
|
arkiver |
JAA this good? https://paste.fedoraproject.org/paste/NbFqTlMLR2CLO5syAaYIqQ/raw |
22:06
🔗
|
arkiver |
doing both URLs |
22:06
🔗
|
JAA |
Perfect |
22:07
🔗
|
|
svchfoo3 has quit IRC (Quit: Closing) |
22:07
🔗
|
arkiver |
it's online https://github.com/ArchiveTeam/roblox-grab |
22:08
🔗
|
JAA |
Hooray |
22:08
🔗
|
arkiver |
in the warrior now |
22:10
🔗
|
|
svchfoo3 has joined #archiveteam-bs |
22:11
🔗
|
|
svchfoo1 sets mode: +o svchfoo3 |
22:14
🔗
|
arkiver |
project is started |
22:14
🔗
|
arkiver |
it's warrior default now |
22:17
🔗
|
ld1 |
Limits for concurrency? |
22:18
🔗
|
mls |
Getting 500s atm |
22:18
🔗
|
mls |
For Roblox |
22:18
🔗
|
arkiver |
I'm not getting 500 |
22:18
🔗
|
arkiver |
all working fine here |
22:18
🔗
|
JAA |
Me neither |
22:18
🔗
|
mls |
I'll leave it to cool down, had 6 concurrent |
22:18
🔗
|
arkiver |
we need to go 500 items/min if we have 36 hours left |
22:19
🔗
|
arkiver |
looks like we can make it |
22:19
🔗
|
arkiver |
I have 10 concurrent |
22:19
🔗
|
arkiver |
we're now doing 600+ items/min |
22:20
🔗
|
arkiver |
750+ actually |
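The rate arithmetic behind these numbers can be checked quickly. The sketch below assumes a round 1,000,000 items (the log says "slightly over 1 million") and the 36 hours arkiver mentions; the function names are hypothetical.

```python
import math


def required_rate(items: int, hours: float) -> float:
    """Items per minute needed to finish `items` within `hours`."""
    return items / (hours * 60)


def hours_to_finish(items: int, items_per_minute: float) -> float:
    """Hours needed to finish `items` at a steady rate."""
    return items / items_per_minute / 60


ITEMS = 1_000_000  # assumption: rounded down from "slightly over 1 million"

print(math.ceil(required_rate(ITEMS, 36)))   # minimum rate for a 36-hour window
print(round(hours_to_finish(ITEMS, 750), 1)) # duration at 750 items/min
```

With these assumptions the minimum is about 463 items/min, so the 500 items/min target leaves a small margin, and 750 items/min would finish in roughly 22 hours if the rate held.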
22:20
🔗
|
JAA |
Oh yeah, I'm seeing a few 500s currently. But very few really. |
22:20
🔗
|
arkiver |
ah I have 500s now too |
22:20
🔗
|
mls |
It unstuck for me |
22:21
🔗
|
JAA |
Hmm, some interesting URLs in there... Stuff like 47=500 https://www.roblox.com/request-error?id=a3c10460-e278-4aff-81e4-fc05d727c04a&mode=&code=500 |
22:21
🔗
|
arkiver |
I'll make the code abort on a 500 |
22:21
🔗
|
arkiver |
^ JAA do we want that> |
22:21
🔗
|
arkiver |
? |
22:21
🔗
|
arkiver |
else we might miss stuff |
22:21
🔗
|
JAA |
Hm, idk |
22:22
🔗
|
mls |
That's what I got after a bunch of regular ones JAA, those URLs |
22:23
🔗
|
arkiver |
we're now aborting on retries after 500 |
22:23
🔗
|
arkiver |
20170625.02 is now minimum |
22:23
🔗
|
arkiver |
and we're at 900+ items/min now |
22:24
🔗
|
arkiver |
Limiting at 800 items/min |
22:24
🔗
|
arkiver |
let me know if we should go higher |
22:25
🔗
|
JAA |
Do we want to requeue the previous items? |
22:26
🔗
|
arkiver |
yeah, requeuing all items now |
22:26
🔗
|
arkiver |
done. |
22:27
🔗
|
arkiver |
are there also parts of the forum that are not going down that we are not archiving? |
22:27
🔗
|
ld1 |
Throws a "project code out of date" after having re-pulled via git. |
22:28
🔗
|
JAA |
arkiver: My item list covers all forums. But yes, it appears that some parts will stay. |
22:28
🔗
|
JAA |
It's just not very clear which ones... |
22:28
🔗
|
JAA |
ld1: You need to restart the pipeline after updating the code. |
22:29
🔗
|
ld1 |
I did, but I'll purge `data/` now. |
22:30
🔗
|
ld1 |
That did it. Thanks. |
22:43
🔗
|
JAA |
"Server returned 0 (HERR). Sleeping." -- That took longer than I thought it would. |
22:44
🔗
|
bitspill |
Does that just mean roblox got overwhelmed and is now timing out |
22:44
🔗
|
bitspill |
? |
22:45
🔗
|
JAA |
Likely |
22:45
🔗
|
JAA |
Better than 302 redirects to an error 500 page. |
22:46
🔗
|
JAA |
arkiver: We killed it. |
22:47
🔗
|
bitspill |
You mean like this, which is now looping the 500 on mine: |
22:47
🔗
|
bitspill |
12=302 https://forum.roblox.com/Forum/ShowPost.aspx?PostID=114497378 |
22:47
🔗
|
bitspill |
13=500 https://www.roblox.com/request-error?id=a0201d10-19d4-4522-bf80-b39f6a804c98&mode=&code=500 |
22:47
🔗
|
bitspill |
Server returned 500 (RETRFINISHED). Sleeping. |
22:48
🔗
|
JAA |
Yeah |
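Since the failure shows up as a 302 to a `request-error` page rather than a direct 500 on the forum URL, a grab script has to recognise the redirect target. A rough sketch, based only on the two error URLs pasted in this log; the function name `error_code_from_url` is hypothetical and this is not the check used in the project.

```python
from typing import Optional
from urllib.parse import parse_qs, urlsplit


def error_code_from_url(url: str) -> Optional[int]:
    """Return the code carried by a /request-error URL, or None for any other URL."""
    parts = urlsplit(url)
    if parts.path.rstrip("/").lower() != "/request-error":
        return None
    codes = parse_qs(parts.query).get("code")
    if not codes or not codes[0].isdigit():
        return None
    return int(codes[0])
```

A caller could treat any non-`None` result as a failed fetch and retry or abort the item, instead of saving the error page as if it were the thread.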
22:52
🔗
|
|
Honno has quit IRC (Read error: Operation timed out) |
22:56
🔗
|
JAA |
Lovely, their server is case-insensitive too. It doesn't matter whether you access https://forum.roblox.com/Forum/ or /forum/ or /fORuM/. |
22:59
🔗
|
JAA |
At least I couldn't find any other spellings than /Forum on their website. |
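If differently cased paths ever did turn up, one way to avoid fetching the same page twice would be to compare URLs by a case-folded key. This is a sketch only: it assumes the whole path is case-insensitive (JAA only reports testing the /Forum/ segment) and leaves the query string untouched, since nothing in the log says whether parameter names or values are case-sensitive. The name `dedupe_key` is hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def dedupe_key(url: str) -> str:
    """Build a comparison key with scheme, host and path lowercased; query kept as-is."""
    parts = urlsplit(url)
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path.lower(), parts.query, "")
    )


seen = set()
for candidate in (
    "https://forum.roblox.com/Forum/ShowForum.aspx?ForumID=13",
    "https://forum.roblox.com/fORuM/ShowForum.aspx?ForumID=13",
):
    key = dedupe_key(candidate)
    if key not in seen:
        seen.add(key)  # only the first spelling would be queued
```

The key is meant for comparison only; the URL actually requested should keep the spelling found on the site.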
23:04
🔗
|
|
schbirid2 has joined #archiveteam-bs |
23:07
🔗
|
|
schbirid has quit IRC (Read error: Operation timed out) |
23:07
🔗
|
|
username1 has joined #archiveteam-bs |
23:09
🔗
|
|
schbirid2 has quit IRC (Read error: Operation timed out) |
23:10
🔗
|
|
ld1 has quit IRC (Remote host closed the connection) |
23:10
🔗
|
|
ld1 has joined #archiveteam-bs |
23:11
🔗
|
|
Ravenloft has joined #archiveteam-bs |
23:11
🔗
|
Ravenloft |
so, what are the cool kids using to back up content from netflix these days? |
23:19
🔗
|
|
cog has quit IRC (Ping timeout: 268 seconds) |
23:54
🔗
|
|
BlueMaxim has joined #archiveteam-bs |
23:56
🔗
|
|
schbirid2 has joined #archiveteam-bs |
23:58
🔗
|
|
username1 has quit IRC (Read error: Operation timed out) |
23:59
🔗
|
|
schbirid has joined #archiveteam-bs |