#archiveteam-bs 2017-07-25,Tue


Time Nickname Message
00:09 🔗 wp494 reminder that d-day for roblox forums is the 27th
00:10 🔗 odemg has joined #archiveteam-bs
00:10 🔗 ReimuHaku has quit IRC (Ping timeout: 245 seconds)
00:13 🔗 ReimuHaku has joined #archiveteam-bs
00:17 🔗 BlueMaxim has quit IRC (Quit: Leaving)
00:18 🔗 bitBaron has joined #archiveteam-bs
00:19 🔗 qw3rty2 has quit IRC (Read error: Operation timed out)
00:22 🔗 qw3rty has joined #archiveteam-bs
00:24 🔗 bitBaron has quit IRC (Read error: Operation timed out)
00:24 🔗 GLaDOS has quit IRC (Remote host closed the connection)
00:26 🔗 tsuckow has joined #archiveteam-bs
00:31 🔗 GLaDOS has joined #archiveteam-bs
00:35 🔗 GLaDOS ..so i figured out why i wasnt able to log into my wiki account
00:35 🔗 GLaDOS my password manager fucking urlencoded the password when it saved it
00:36 🔗 tsuckow has quit IRC (Ping timeout: 268 seconds)
00:41 🔗 xmc hahahaha
00:41 🔗 xmc computers are just great aren't they
01:20 🔗 GLaDOS "what's that? this string has to be kept the exact same? well i better encode it then!"
01:23 🔗 bitBaron has joined #archiveteam-bs
01:33 🔗 bitBaron has quit IRC (Read error: Operation timed out)
01:39 🔗 username1 has joined #archiveteam-bs
01:41 🔗 schbirid2 has quit IRC (Read error: Operation timed out)
02:02 🔗 GLaDOS has quit IRC (Remote host closed the connection)
02:08 🔗 Odd0002 has quit IRC (Remote host closed the connection)
02:21 🔗 GLaDOS has joined #archiveteam-bs
02:28 🔗 bitBaron has joined #archiveteam-bs
02:36 🔗 bitBaron has quit IRC (Read error: Operation timed out)
02:41 🔗 GLaDOS has quit IRC (Remote host closed the connection)
02:42 🔗 GLaDOS has joined #archiveteam-bs
02:50 🔗 pizzaiolo has quit IRC (pizzaiolo)
03:14 🔗 qw3rty2 has joined #archiveteam-bs
03:21 🔗 qw3rty has quit IRC (Read error: Operation timed out)
03:31 🔗 bitBaron has joined #archiveteam-bs
03:40 🔗 bitBaron has quit IRC (Read error: Operation timed out)
03:43 🔗 Stiletto has quit IRC (Read error: Operation timed out)
04:15 🔗 BlueMaxim has joined #archiveteam-bs
04:37 🔗 bitBaron has joined #archiveteam-bs
04:48 🔗 bitBaron has quit IRC (Ping timeout: 633 seconds)
04:49 🔗 Sk1d has quit IRC (Ping timeout: 250 seconds)
04:56 🔗 Sk1d has joined #archiveteam-bs
05:40 🔗 bitBaron has joined #archiveteam-bs
05:51 🔗 bitBaron has quit IRC (Ping timeout: 633 seconds)
05:54 🔗 Yoshimura has joined #archiveteam-bs
06:05 🔗 j08nY has joined #archiveteam-bs
06:29 🔗 Fletcher| has quit IRC (Remote host closed the connection)
06:38 🔗 j08nY has quit IRC (Quit: Leaving)
06:45 🔗 bitBaron has joined #archiveteam-bs
06:45 🔗 Fletcher has joined #archiveteam-bs
06:52 🔗 bitBaron has quit IRC (Read error: Operation timed out)
08:33 🔗 zhongfu has quit IRC (Ping timeout: 260 seconds)
08:41 🔗 BlueMaxim has quit IRC (Read error: Operation timed out)
08:51 🔗 bitBaron has joined #archiveteam-bs
09:01 🔗 Stiletti has joined #archiveteam-bs
09:02 🔗 Stiletti is now known as Stiletto
09:02 🔗 bitBaron has quit IRC (Read error: Operation timed out)
09:56 🔗 bitBaron has joined #archiveteam-bs
10:08 🔗 bitBaron has quit IRC (Read error: Operation timed out)
10:43 🔗 zhongfu has joined #archiveteam-bs
11:03 🔗 bitBaron has joined #archiveteam-bs
11:40 🔗 pizzaiolo has joined #archiveteam-bs
12:01 🔗 odemg has quit IRC (Ping timeout: 260 seconds)
12:37 🔗 odemg has joined #archiveteam-bs
12:56 🔗 ld1 has quit IRC (Quit: WeeChat 1.9)
13:02 🔗 brayden has joined #archiveteam-bs
13:02 🔗 swebb sets mode: +o brayden
13:02 🔗 brayden has quit IRC (Connection closed)
13:03 🔗 brayden has joined #archiveteam-bs
13:03 🔗 swebb sets mode: +o brayden
13:10 🔗 ld1 has joined #archiveteam-bs
13:10 🔗 ld1 has quit IRC (Client Quit)
13:13 🔗 ld1 has joined #archiveteam-bs
13:59 🔗 ld1 has quit IRC (Quit: WeeChat 1.9)
14:14 🔗 ld1 has joined #archiveteam-bs
14:14 🔗 qw3rty has joined #archiveteam-bs
14:17 🔗 qw3rty2 has quit IRC (Ping timeout: 600 seconds)
14:33 🔗 qw3rty2 has joined #archiveteam-bs
14:41 🔗 qw3rty has quit IRC (Read error: Operation timed out)
14:58 🔗 ZexaronS has quit IRC (Quit: Leaving)
15:30 🔗 JAA arkiver: For example, there would be an item forumpage:13_186637 which translates to retrieving https://forum.roblox.com/Forum/ShowForum.aspx?ForumID=13&PageIndex=186637 and all /Forum/ShowPost.aspx?PostID=\d+$ links on that page plus the relevant &PageIndex= URLs for multi-paged threads.
15:33 🔗 JAA (In this particular case, there are no multi-paged threads though.)
15:34 🔗 JAA For forumpage:13_186638, it would for example fetch https://forum.roblox.com/Forum/ShowPost.aspx?PostID=24086454 and https://forum.roblox.com/Forum/ShowPost.aspx?PostID=24086454&PageIndex=2
15:37 🔗 JAA I can give you my code for extracting the post IDs and maximum page index for each thread on a forum index page, if you want.
15:38 🔗 j08nY has joined #archiveteam-bs
16:11 🔗 JAA Also, forumpage:N_1 should retrieve https://forum.roblox.com/Forum/ShowForum.aspx?ForumID=N, i.e. no PageIndex parameter.
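The item naming scheme JAA describes above can be sketched roughly as follows. This is only an illustration, not the code from the warrior project: the function names `forum_page_url` and `thread_urls` are hypothetical, and the sketch assumes item names always have the form `forumpage:<ForumID>_<PageIndex>`.

```python
# Hypothetical helpers; the names are illustrative and not taken from roblox-grab.
FORUM_URL = "https://forum.roblox.com/Forum/ShowForum.aspx"
POST_URL = "https://forum.roblox.com/Forum/ShowPost.aspx"


def forum_page_url(item: str) -> str:
    """Map an item name such as 'forumpage:13_186637' to its forum index URL."""
    kind, _, rest = item.partition(":")
    if kind != "forumpage":
        raise ValueError(f"unexpected item type: {item!r}")
    forum_id, page = (int(part) for part in rest.split("_"))
    if page == 1:
        # Page 1 is fetched without the PageIndex parameter.
        return f"{FORUM_URL}?ForumID={forum_id}"
    return f"{FORUM_URL}?ForumID={forum_id}&PageIndex={page}"


def thread_urls(post_id: int, max_page: int) -> list[str]:
    """List the URLs for a thread: the bare PostID URL plus &PageIndex=2..max_page."""
    urls = [f"{POST_URL}?PostID={post_id}"]
    urls += [f"{POST_URL}?PostID={post_id}&PageIndex={n}" for n in range(2, max_page + 1)]
    return urls
```

For example, `thread_urls(24086454, 2)` gives the two URLs quoted for forumpage:13_186638.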
16:12 🔗 JAA We might miss some things if people post on a forum while we're grabbing it, but we should be able to capture most of it.
16:12 🔗 JAA My wpull index grab is only at roughly 1/4 by the way...
16:22 🔗 arkiver ok
16:22 🔗 arkiver JAA: can you please prepare the items list
16:23 🔗 arkiver preparing the warrior project now
16:23 🔗 arkiver with this idea
16:26 🔗 JAA Will do
16:28 🔗 JAA Oh dear
16:29 🔗 JAA Can anyone please find and kill the guy who produced this bullshit forum software?
16:29 🔗 username1 may i propose lennart poettering?
16:29 🔗 username1 is now known as schbirid
16:29 🔗 JAA https://forum.roblox.com/Forum/ShowForum.aspx?ForumID=46&PageIndex=69161 returns threads although there are only 69160 pages in that forum.
16:30 🔗 JAA Also, each page is supposed to contain 25 threads, but that's rarely the case.
16:30 🔗 JAA And yeah, we can go after Lennart as well.
16:42 🔗 pikhq has quit IRC (Read error: Operation timed out)
16:45 🔗 JAA arkiver: Done. It's a 19.8 MiB text file with one item name per line. Compresses down to 2.66 MiB with gzip. Where do you want me to put it?
16:45 🔗 JAA Slightly over 1 million items, by the way.
16:45 🔗 arkiver 1 million items is good
16:50 🔗 pikhq has joined #archiveteam-bs
16:53 🔗 JAA arkiver: Do you want my code for determining the number of thread pages?
16:54 🔗 arkiver sure
16:54 🔗 JAA (It's really just a few string searches in the forum listing HTML.)
16:55 🔗 arkiver yeah, I'm almost done anyway, but show me
16:55 🔗 arkiver haha
16:55 🔗 arkiver there's one really good way to make a website not playable by the wayback machine
16:55 🔗 arkiver make the website one big POST request and it will not be in the wayback machine :P
16:56 🔗 JAA I first did it with an HTML parser and XPath. More beautiful and way slower.
16:56 🔗 JAA Haha yeah.
16:56 🔗 arkiver ah yeah, not using that
16:56 🔗 JAA I'll remember that in case I ever want to make a page unarchivable.
16:56 🔗 arkiver :P
17:01 🔗 JAA arkiver: https://gist.github.com/JustAnotherArchivist/85ef5c0e9d874791ee485fa69d08ac62#file-hook-py-L61-L82
18:07 🔗 ld1 has quit IRC (Quit: WeeChat 1.9)
18:09 🔗 ld1 has joined #archiveteam-bs
18:14 🔗 ld1 has quit IRC (Quit: WeeChat 1.9)
18:20 🔗 ld1 has joined #archiveteam-bs
18:32 🔗 odemg has quit IRC (Read error: Operation timed out)
19:19 🔗 ZexaronS has joined #archiveteam-bs
19:38 🔗 odemg has joined #archiveteam-bs
20:26 🔗 cog has joined #archiveteam-bs
20:36 🔗 qw3rty has joined #archiveteam-bs
20:39 🔗 qw3rty3 has joined #archiveteam-bs
20:39 🔗 qw3rty2 has quit IRC (Read error: Operation timed out)
20:46 🔗 qw3rty has quit IRC (Read error: Operation timed out)
20:52 🔗 JAA arkiver: How is it going? We need to get this up and running ASAP...
20:54 🔗 wp494 ^^^
20:54 🔗 wp494 we have a bit under 48 hrs to go
20:59 🔗 qw3rty3 has quit IRC (Read error: Operation timed out)
21:01 🔗 Sue has quit IRC (Quit: leaving)
21:01 🔗 Sue_ is now known as Sue
21:10 🔗 arkiver almost done
21:10 🔗 arkiver hold on for a little bit longer
21:10 🔗 arkiver when was it announced this would go away?
21:11 🔗 odemg has quit IRC (Read error: Operation timed out)
21:11 🔗 odemg has joined #archiveteam-bs
21:16 🔗 JAA arkiver: Ten days ago or so.
21:29 🔗 JAA arkiver: Any ETA? I have to leave soonish.
21:29 🔗 arkiver under an hour hopefully
21:30 🔗 arkiver please send me the itemlist someway so I have that in case you're offline
21:31 🔗 JAA Right. How?
21:32 🔗 arkiver I guess you can put it on github
21:32 🔗 arkiver not sure where :P
21:32 🔗 JAA I'll try Gist.
21:33 🔗 arkiver i'm using the <span class="normalTextSmallBold">Page 1 of 3</span> to get max page
21:35 🔗 arkiver some final testing now
21:38 🔗 JAA arkiver: https://gist.github.com/JustAnotherArchivist/9cdc31f75e995e228014c97876d29350/raw/50df623f504addf9eed79cc113de439cd926cbe5/roblox-items.txt
21:40 🔗 arkiver do we also want pageindex=1
21:40 🔗 arkiver let's skip that
21:41 🔗 JAA Agreed
21:41 🔗 JAA There are links to that on the forum index, but since you get the same content as without the parameter, there is no point.
21:41 🔗 arkiver I'll have a log for you to check in a few minutes
21:45 🔗 arkiver JAA: https://paste.fedoraproject.org/paste/zsD2uqj8jDV7RiZRxXLGmA/raw
21:47 🔗 arkiver JAA: another one https://paste.fedoraproject.org/paste/cpjj7-uF62HgF2TzpJsIgg/raw
21:47 🔗 arkiver looks like new threads are still being created, so the threads on a page shift over time
21:48 🔗 JAA Yeah, and some old threads might become active again. We'll certainly miss some threads with this strategy, but I don't think anything else is feasible in this time frame and without massive data duplication.
21:49 🔗 arkiver we'll miss the newest threads
21:49 🔗 arkiver uh nvm, we're not going through items sequentially
21:49 🔗 arkiver (wanted to say we're archiving some double)
21:49 🔗 arkiver does this look good?
21:54 🔗 JAA Yeah, looks good. Does your code for the max page retrieval account for number formatting? I believe they print numbers over 1000 as "1,000". At least they do on the pager in the forums.
21:54 🔗 arkiver no, do you have an example?
21:54 🔗 JAA I haven't come across a thread yet, but I'm sure they exist.
21:54 🔗 arkiver ah, simply see bottom of https://forum.roblox.com/Forum/ShowForum.aspx?ForumID=13&PageIndex=186638
21:54 🔗 arkiver I'll add support for that
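A minimal sketch of reading the maximum page from the pager span while tolerating thousands separators such as "1,000". It assumes the span looks exactly like the `normalTextSmallBold` example quoted earlier; the name `max_page` and the regular expression are illustrative and are not the project's actual code.

```python
import re

# Assumed pager markup, based on the span quoted in the channel.
PAGER_RE = re.compile(
    r'<span class="normalTextSmallBold">Page\s+([\d,]+)\s+of\s+([\d,]+)</span>'
)


def max_page(html: str) -> int:
    """Return the last page number from the pager, or 1 if no pager is found."""
    match = PAGER_RE.search(html)
    if match is None:
        return 1
    # Strip thousands separators before converting, so "1,000" becomes 1000.
    return int(match.group(2).replace(",", ""))
```

Falling back to 1 when no pager is present is a design choice of this sketch, made on the assumption that single-page threads carry no pager.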
21:55 🔗 JAA Yeah. I *think* it will look the same for threads, but it'd be nice to check.
21:55 🔗 JAA But of course that stupid forum software doesn't allow for sorting threads by post number.
21:59 🔗 JAA arkiver: Found one: https://forum.roblox.com/Forum/ShowPost.aspx?PostID=80624889
21:59 🔗 JAA You can test it at forumpage:23_1.
22:00 🔗 arkiver yep, testing now
22:02 🔗 arkiver it's doing this, so I guess it's working https://paste.fedoraproject.org/paste/CKurQ~alUWo8D5G8mpPHzw/raw
22:03 🔗 JAA https://forum.roblox.com/Forum/ShowForum.aspx?ForumID=23&PageIndex=1 << I'd remove the PageIndex parameter here as well.
22:03 🔗 JAA Otherwise, looks good. :-)
22:03 🔗 arkiver it's just a few, let's save those without pageindex with save page now
22:03 🔗 arkiver I'm starting the project
22:05 🔗 arkiver actually added it anyway
22:06 🔗 arkiver JAA this good? https://paste.fedoraproject.org/paste/NbFqTlMLR2CLO5syAaYIqQ/raw
22:06 🔗 arkiver doing both URLs
22:06 🔗 JAA Perfect
22:07 🔗 svchfoo3 has quit IRC (Quit: Closing)
22:07 🔗 arkiver it's online https://github.com/ArchiveTeam/roblox-grab
22:08 🔗 JAA Hooray
22:08 🔗 arkiver in the warrior now
22:10 🔗 svchfoo3 has joined #archiveteam-bs
22:11 🔗 svchfoo1 sets mode: +o svchfoo3
22:14 🔗 arkiver project is started
22:14 🔗 arkiver it's warrior default now
22:17 🔗 ld1 Limits for concurrency?
22:18 🔗 mls Getting 500s atm
22:18 🔗 mls For Roblox
22:18 🔗 arkiver I'm not getting 500
22:18 🔗 arkiver all working fine here
22:18 🔗 JAA Me neither
22:18 🔗 mls I'll leave it to cool down, had 6 concurrent
22:18 🔗 arkiver we need to go 500 items/min if we have 36 hours left
22:19 🔗 arkiver looks like we can make it
22:19 🔗 arkiver I have 10 concurrent
22:19 🔗 arkiver we're now doing 600+ items/min
22:20 🔗 arkiver 750+ actually
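The rate estimate can be checked with a little arithmetic. This sketch takes the item count as roughly 1 million (JAA said "slightly over") and the 36 hours arkiver mentions; the function names are hypothetical.

```python
def required_rate(items: int, hours_left: float) -> float:
    """Items per minute needed to finish `items` within `hours_left` hours."""
    return items / (hours_left * 60)


def hours_needed(items: int, items_per_min: float) -> float:
    """Hours required to finish `items` at a steady rate of `items_per_min`."""
    return items / items_per_min / 60


if __name__ == "__main__":
    # About 463 items/min, in line with the rounded figure of 500.
    print(round(required_rate(1_000_000, 36)))
    # About 22 hours at 750 items/min, ignoring retries and requeues.
    print(round(hours_needed(1_000_000, 750), 1))
```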
22:20 🔗 JAA Oh yeah, I'm seeing a few 500s currently. But very few really.
22:20 🔗 arkiver ah I have 500s now too
22:20 🔗 mls It unstuck for me
22:21 🔗 JAA Hmm, some interesting URLs in there... Stuff like 47=500 https://www.roblox.com/request-error?id=a3c10460-e278-4aff-81e4-fc05d727c04a&mode=&code=500
22:21 🔗 arkiver I'll make the code abort on a 500
22:21 🔗 arkiver ^ JAA do we want that>
22:21 🔗 arkiver ?
22:21 🔗 arkiver else we might miss stuff
22:21 🔗 JAA Hm, idk
22:22 🔗 mls That's what I got after a bunch of regular ones JAA, those URLs
22:23 🔗 arkiver we're now aborting on retries after 500
22:23 🔗 arkiver 20170625.02 is now minimum
22:23 🔗 arkiver and we're at 900+ items/min now
22:24 🔗 arkiver Limiting at 800 items/min
22:24 🔗 arkiver let me know if we should go higher
22:25 🔗 JAA Do we want to requeue the previous items?
22:26 🔗 arkiver yeah, requeuing all items now
22:26 🔗 arkiver done.
22:27 🔗 arkiver are there also parts of the forum that are not going down that we are not archiving?
22:27 🔗 ld1 Throws a "project code out of date" after having re-pulled via git.
22:28 🔗 JAA arkiver: My item list covers all forums. But yes, it appears that some parts will stay.
22:28 🔗 JAA It's just not very clear which ones...
22:28 🔗 JAA ld1: You need to restart the pipeline after updating the code.
22:29 🔗 ld1 I did, but I'll purge `data/` now.
22:30 🔗 ld1 That did it. Thanks.
22:43 🔗 JAA "Server returned 0 (HERR). Sleeping." -- That took longer than I thought it would.
22:44 🔗 bitspill That just means roblox got overwhelmed and is now timing out
22:44 🔗 bitspill ?
22:45 🔗 JAA Likely
22:45 🔗 JAA Better than 302 redirects to an error 500 page.
22:46 🔗 JAA arkiver: We killed it.
22:47 🔗 bitspill You mean like this, which is now looping the 500 on mine:
22:47 🔗 bitspill 12=302 https://forum.roblox.com/Forum/ShowPost.aspx?PostID=114497378
22:47 🔗 bitspill 13=500 https://www.roblox.com/request-error?id=a0201d10-19d4-4522-bf80-b39f6a804c98&mode=&code=500
22:47 🔗 bitspill Server returned 500 (RETRFINISHED). Sleeping.
22:48 🔗 JAA Yeah
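One way to recognise the redirect pattern shown above (a 302 from a ShowPost URL to a `/request-error` page with `code=500`) is to inspect the redirect target. This is a sketch under assumptions, not how the project's hook handles it: the name `is_error_redirect` is hypothetical, and the check relies only on the URL shape seen in these two log lines.

```python
from urllib.parse import parse_qs, urlsplit


def is_error_redirect(location: str) -> bool:
    """Return True if a redirect target looks like the request-error 500 page."""
    parts = urlsplit(location)
    # Compare the path case-insensitively, as a precaution against case variants.
    if parts.path.lower() != "/request-error":
        return False
    # parse_qs drops blank values such as "mode=", which does not matter here.
    return parse_qs(parts.query).get("code") == ["500"]
```

A grab script could treat such a redirect as a server error and retry the original URL later instead of saving the error page as if it were the thread.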
22:52 🔗 Honno has quit IRC (Read error: Operation timed out)
22:56 🔗 JAA Lovely, their server is case-insensitive too. It doesn't matter whether you access https://forum.roblox.com/Forum/ or /forum/ or /fORuM/.
22:59 🔗 JAA At least I couldn't find any other spellings than /Forum on their website.
23:04 🔗 schbirid2 has joined #archiveteam-bs
23:07 🔗 schbirid has quit IRC (Read error: Operation timed out)
23:07 🔗 username1 has joined #archiveteam-bs
23:09 🔗 schbirid2 has quit IRC (Read error: Operation timed out)
23:10 🔗 ld1 has quit IRC (Remote host closed the connection)
23:10 🔗 ld1 has joined #archiveteam-bs
23:11 🔗 Ravenloft has joined #archiveteam-bs
23:11 🔗 Ravenloft so, what the cool kids are using to backup content from netflix these days?
23:19 🔗 cog has quit IRC (Ping timeout: 268 seconds)
23:54 🔗 BlueMaxim has joined #archiveteam-bs
23:56 🔗 schbirid2 has joined #archiveteam-bs
23:58 🔗 username1 has quit IRC (Read error: Operation timed out)
23:59 🔗 schbirid has joined #archiveteam-bs
