[00:09] reminder that d-day for roblox forums is the 27th
[00:10] *** odemg has joined #archiveteam-bs
[00:10] *** ReimuHaku has quit IRC (Ping timeout: 245 seconds)
[00:13] *** ReimuHaku has joined #archiveteam-bs
[00:17] *** BlueMaxim has quit IRC (Quit: Leaving)
[00:18] *** bitBaron has joined #archiveteam-bs
[00:19] *** qw3rty2 has quit IRC (Read error: Operation timed out)
[00:22] *** qw3rty has joined #archiveteam-bs
[00:24] *** bitBaron has quit IRC (Read error: Operation timed out)
[00:24] *** GLaDOS has quit IRC (Remote host closed the connection)
[00:26] *** tsuckow has joined #archiveteam-bs
[00:31] *** GLaDOS has joined #archiveteam-bs
[00:35] ..so i figured out why i wasnt able to log into my wiki account
[00:35] my password manager fucking urlencoded the password when it saved it
[00:36] *** tsuckow has quit IRC (Ping timeout: 268 seconds)
[00:41] hahahaha
[00:41] computers are just great aren't they
[01:20] "what's that? this string has to be kept the exact same? well i better encode it then!"
[01:23] *** bitBaron has joined #archiveteam-bs
[01:33] *** bitBaron has quit IRC (Read error: Operation timed out)
[01:39] *** username1 has joined #archiveteam-bs
[01:41] *** schbirid2 has quit IRC (Read error: Operation timed out)
[02:02] *** GLaDOS has quit IRC (Remote host closed the connection)
[02:08] *** Odd0002 has quit IRC (Remote host closed the connection)
[02:21] *** GLaDOS has joined #archiveteam-bs
[02:28] *** bitBaron has joined #archiveteam-bs
[02:36] *** bitBaron has quit IRC (Read error: Operation timed out)
[02:41] *** GLaDOS has quit IRC (Remote host closed the connection)
[02:42] *** GLaDOS has joined #archiveteam-bs
[02:50] *** pizzaiolo has quit IRC (pizzaiolo)
[03:14] *** qw3rty2 has joined #archiveteam-bs
[03:21] *** qw3rty has quit IRC (Read error: Operation timed out)
[03:31] *** bitBaron has joined #archiveteam-bs
[03:40] *** bitBaron has quit IRC (Read error: Operation timed out)
[03:43] *** Stiletto has quit IRC (Read error: Operation timed out)
[04:15] *** BlueMaxim has joined #archiveteam-bs
[04:37] *** bitBaron has joined #archiveteam-bs
[04:48] *** bitBaron has quit IRC (Ping timeout: 633 seconds)
[04:49] *** Sk1d has quit IRC (Ping timeout: 250 seconds)
[04:56] *** Sk1d has joined #archiveteam-bs
[05:40] *** bitBaron has joined #archiveteam-bs
[05:51] *** bitBaron has quit IRC (Ping timeout: 633 seconds)
[05:54] *** Yoshimura has joined #archiveteam-bs
[06:05] *** j08nY has joined #archiveteam-bs
[06:29] *** Fletcher| has quit IRC (Remote host closed the connection)
[06:38] *** j08nY has quit IRC (Quit: Leaving)
[06:45] *** bitBaron has joined #archiveteam-bs
[06:45] *** Fletcher has joined #archiveteam-bs
[06:52] *** bitBaron has quit IRC (Read error: Operation timed out)
[08:33] *** zhongfu has quit IRC (Ping timeout: 260 seconds)
[08:41] *** BlueMaxim has quit IRC (Read error: Operation timed out)
[08:51] *** bitBaron has joined #archiveteam-bs
[09:01] *** Stiletti has joined #archiveteam-bs
[09:02] *** Stiletti is now known as Stiletto
[09:02] *** bitBaron has quit IRC (Read error: Operation timed out)
[09:56] *** bitBaron has joined #archiveteam-bs
[10:08] *** bitBaron has quit IRC (Read error: Operation timed out)
[10:43] *** zhongfu has joined #archiveteam-bs
[11:03] *** bitBaron has joined #archiveteam-bs
[11:40] *** pizzaiolo has joined #archiveteam-bs
[12:01] *** odemg has quit IRC (Ping timeout: 260 seconds)
[12:37] *** odemg has joined #archiveteam-bs
[12:56] *** ld1 has quit IRC (Quit: WeeChat 1.9)
[13:02] *** brayden has joined #archiveteam-bs
[13:02] *** swebb sets mode: +o brayden
[13:02] *** brayden has quit IRC (Connection closed)
[13:03] *** brayden has joined #archiveteam-bs
[13:03] *** swebb sets mode: +o brayden
[13:10] *** ld1 has joined #archiveteam-bs
[13:10] *** ld1 has quit IRC (Client Quit)
[13:13] *** ld1 has joined #archiveteam-bs
[13:59] *** ld1 has quit IRC (Quit: WeeChat 1.9)
[14:14] *** ld1 has joined #archiveteam-bs
[14:14] *** qw3rty has joined #archiveteam-bs
[14:17] *** qw3rty2 has quit IRC (Ping timeout: 600 seconds)
[14:33] *** qw3rty2 has joined #archiveteam-bs
[14:41] *** qw3rty has quit IRC (Read error: Operation timed out)
[14:58] *** ZexaronS has quit IRC (Quit: Leaving)
[15:30] arkiver: For example, there would be an item forumpage:13_186637 which translates to retrieving https://forum.roblox.com/Forum/ShowForum.aspx?ForumID=13&PageIndex=186637 and all /Forum/ShowPost.aspx?PostID=\d+$ links on that page plus the relevant &PageIndex= URLs for multi-paged threads.
[15:33] (In this particular case, there are no multi-paged threads though.)
[15:34] For forumpage:13_186638, it would for example fetch https://forum.roblox.com/Forum/ShowPost.aspx?PostID=24086454 and https://forum.roblox.com/Forum/ShowPost.aspx?PostID=24086454&PageIndex=2
[15:37] I can give you my code for extracting the post IDs and maximum page index for each thread on a forum index page, if you want.
[15:38] *** j08nY has joined #archiveteam-bs
[16:11] Also, forumpage:N_1 should retrieve https://forum.roblox.com/Forum/ShowForum.aspx?ForumID=N, i.e. no PageIndex parameter.
[16:12] We might miss some things if people post on a forum while we're grabbing it, but we should be able to capture most of it.
[16:12] My wpull index grab is only at roughly 1/4 by the way...
[16:22] ok
[16:22] JAA: can you please prepare the items list
[16:23] preparing the warrior project now
[16:23] with this idea
[16:26] Will do
[16:28] Oh dear
[16:29] Can anyone please find and kill the guy who produced this bullshit forum software?
[16:29] may i propose lennart poettering?
[16:29] *** username1 is now known as schbirid
[16:29] https://forum.roblox.com/Forum/ShowForum.aspx?ForumID=46&PageIndex=69161 returns threads although there are only 69160 pages in that forum.
[16:30] Also, each page is supposed to contain 25 threads, but that's rarely the case.
[16:30] And yeah, we can go after Lennart as well.
[16:42] *** pikhq has quit IRC (Read error: Operation timed out)
[16:45] arkiver: Done. It's a 19.8 MiB text file with one item name per line. Compresses down to 2.66 MiB with gzip. Where do you want me to put it?
[16:45] Slightly over 1 million items, by the way.
[16:45] 1 million items is good
[16:50] *** pikhq has joined #archiveteam-bs
[16:53] arkiver: Do you want my code for determining the number of thread pages?
[16:54] sure
[16:54] (It's really just a few string searches in the forum listing HTML.)
[16:55] yeah, I'm almost done anyway, but show me
[16:55] haha
[16:55] there's one way really good way to make a website not playable by the wayback machine
[16:55] make the website one big POST request and will not be in the wayback machine :P
[16:56] I first did it with an HTML parser and XPath. More beautiful and way slower.
[16:56] Haha yeah.
[16:56] ah yeah, not using that
[16:56] I'll remember that in case I ever want to make a page unarchivable.
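A minimal sketch of the item naming scheme discussed in the log so far: an item forumpage:N_P maps to a ShowForum.aspx URL for forum N and page P, and page 1 is fetched without the PageIndex parameter. The names BASE_URL and item_to_url are hypothetical and chosen for illustration only; this is not the code used in the actual grab.

```python
# Hypothetical helper illustrating the "forumpage:N_P" item scheme from the log.
BASE_URL = "https://forum.roblox.com/Forum/ShowForum.aspx"


def item_to_url(item: str) -> str:
    """Translate an item name like 'forumpage:13_186637' into a forum index URL."""
    prefix, sep, rest = item.partition(":")
    if prefix != "forumpage" or not sep:
        raise ValueError(f"unexpected item name: {item!r}")
    forum_id_text, sep, page_text = rest.partition("_")
    if not sep:
        raise ValueError(f"missing page number in item: {item!r}")
    forum_id = int(forum_id_text)
    page = int(page_text)
    if page == 1:
        # Page 1 is retrieved without the PageIndex parameter, as noted in the log.
        return f"{BASE_URL}?ForumID={forum_id}"
    return f"{BASE_URL}?ForumID={forum_id}&PageIndex={page}"
```

For example, item_to_url("forumpage:13_186637") yields the ForumID=13&PageIndex=186637 URL quoted in the log.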
[16:56] :P
[17:01] arkiver: https://gist.github.com/JustAnotherArchivist/85ef5c0e9d874791ee485fa69d08ac62#file-hook-py-L61-L82
[18:07] *** ld1 has quit IRC (Quit: WeeChat 1.9)
[18:09] *** ld1 has joined #archiveteam-bs
[18:14] *** ld1 has quit IRC (Quit: WeeChat 1.9)
[18:20] *** ld1 has joined #archiveteam-bs
[18:32] *** odemg has quit IRC (Read error: Operation timed out)
[19:19] *** ZexaronS has joined #archiveteam-bs
[19:38] *** odemg has joined #archiveteam-bs
[20:26] *** cog has joined #archiveteam-bs
[20:36] *** qw3rty has joined #archiveteam-bs
[20:39] *** qw3rty3 has joined #archiveteam-bs
[20:39] *** qw3rty2 has quit IRC (Read error: Operation timed out)
[20:46] *** qw3rty has quit IRC (Read error: Operation timed out)
[20:52] arkiver: How is it going? We need to get this up and running ASAP...
[20:54] ^^^
[20:54] we have a bit under 48 hrs to go
[20:59] *** qw3rty3 has quit IRC (Read error: Operation timed out)
[21:01] *** Sue has quit IRC (Quit: leaving)
[21:01] *** Sue_ is now known as Sue
[21:10] almost done
[21:10] hold on for a little bit longer
[21:10] when was it announced this would go away?
[21:11] *** odemg has quit IRC (Read error: Operation timed out)
[21:11] *** odemg has joined #archiveteam-bs
[21:16] arkiver: Ten days ago or so.
[21:29] arkiver: Any ETA? I have to leave soonish.
[21:29] under hour hopefully
[21:30] please send me the itemlist someway so I have that in case you're offline
[21:31] Right. How?
[21:32] I guess you can put it on github
[21:32] not sure where :P
[21:32] I'll try Gist.
[21:33] i'm using the Page 1 of 3 to get max page
[21:35] some final testing now
[21:38] arkiver: https://gist.github.com/JustAnotherArchivist/9cdc31f75e995e228014c97876d29350/raw/50df623f504addf9eed79cc113de439cd926cbe5/roblox-items.txt
[21:40] do we also want pageindex=1
[21:40] let's skip that
[21:41] Agreed
[21:41] There are links to that on the forum index, but since you get the same content as without the parameter, there is no point.
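A rough sketch of the "Page 1 of 3" approach to finding a thread's maximum page, and of building the thread URLs while skipping PageIndex=1. The exact pager markup is not shown in the log, so this assumes the text "Page X of Y" appears somewhere in the HTML; it also tolerates thousands separators such as "1,000", which the forum pager is reported to use. The names PAGER_RE, max_page and thread_urls are hypothetical.

```python
import re

# Assumed pager text: "Page 1 of 3" or "Page 1 of 1,234" (markup around it unknown).
PAGER_RE = re.compile(r"Page\s+(\d[\d,]*)\s+of\s+(\d[\d,]*)")

THREAD_URL = "https://forum.roblox.com/Forum/ShowPost.aspx"


def max_page(html: str) -> int:
    """Return the highest page number from the pager, or 1 if no pager is found."""
    match = PAGER_RE.search(html)
    if match is None:
        return 1
    # Strip thousands separators before converting, e.g. "1,234" -> 1234.
    return int(match.group(2).replace(",", ""))


def thread_urls(post_id: int, pages: int) -> list[str]:
    """First page without PageIndex, then PageIndex=2..pages (PageIndex=1 is skipped)."""
    urls = [f"{THREAD_URL}?PostID={post_id}"]
    urls.extend(
        f"{THREAD_URL}?PostID={post_id}&PageIndex={page}"
        for page in range(2, pages + 1)
    )
    return urls
```

With a two-page thread, thread_urls(24086454, 2) gives the two ShowPost.aspx URLs quoted earlier in the log.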
[21:41] I'll have a log for you to check in a few minutes
[21:45] JAA: https://paste.fedoraproject.org/paste/zsD2uqj8jDV7RiZRxXLGmA/raw
[21:47] JAA: another one https://paste.fedoraproject.org/paste/cpjj7-uF62HgF2TzpJsIgg/raw
[21:47] looks like new treads are still being created, so the threads on a page shift over time
[21:48] Yeah, and some old threads might become active again. We'll certainly miss some threads with this strategy, but I don't think anything else is feasible in this time frame and without massive data duplication.
[21:49] we'll miss the newest threads
[21:49] uh nvm, we're not going through items sequentially
[21:49] (wanted to say we're archiving some double)
[21:49] does this look good?
[21:54] Yeah, looks good. Does your code for the max page retrieval account for number formatting? I believe they print numbers over 1000 as "1,000". At least they do on the pager in the forums.
[21:54] no, do you have an example?
[21:54] I haven't come across a thread yet, but I'm sure they exist.
[21:54] ah, simply see bottom of https://forum.roblox.com/Forum/ShowForum.aspx?ForumID=13&PageIndex=186638
[21:54] I'll add support for that
[21:55] Yeah. I *think* it will look the same for threads, but it'd be nice to check.
[21:55] But of course that stupid forum software doesn't allow for sorting threads by post number.
[21:59] arkiver: Found one: https://forum.roblox.com/Forum/ShowPost.aspx?PostID=80624889
[21:59] You can test it at forumpage:23_1.
[22:00] yep, testing now
[22:02] it's doing this, so I guess it's working https://paste.fedoraproject.org/paste/CKurQ~alUWo8D5G8mpPHzw/raw
[22:03] https://forum.roblox.com/Forum/ShowForum.aspx?ForumID=23&PageIndex=1 << I'd remove the PageIndex parameter here as well.
[22:03] Otherwise, looks good. :-)
[22:03] it's just a few, let's save those without pageindex with save page now
[22:03] I'm starting the project
[22:05] actually added it anyway
[22:06] JAA this good? https://paste.fedoraproject.org/paste/NbFqTlMLR2CLO5syAaYIqQ/raw
[22:06] doing both URLs
[22:06] Perfect
[22:07] *** svchfoo3 has quit IRC (Quit: Closing)
[22:07] it's online https://github.com/ArchiveTeam/roblox-grab
[22:08] Hooray
[22:08] in the warrior now
[22:10] *** svchfoo3 has joined #archiveteam-bs
[22:11] *** svchfoo1 sets mode: +o svchfoo3
[22:14] project is started
[22:14] it's warrior default now
[22:17] Limits for concurrency?
[22:18] Getting 500s atm
[22:18] For Roblox
[22:18] I'm not getting 500
[22:18] all working fine here
[22:18] Me neither
[22:18] I'll leave it to cool down, had 6 concurrent
[22:18] we need to go 500 items/min if we have 36 hours left
[22:19] looks like we can make it
[22:19] I have 10 concurrent
[22:19] we're now doing 600+ items/min
[22:20] 750+ actually
[22:20] Oh yeah, I'm seeing a few 500s currently. But very few really.
[22:20] ah I have 500s now too
[22:20] It unstuck for me
[22:21] Hmm, some interesting URLs in there... Stuff like 47=500 https://www.roblox.com/request-error?id=a3c10460-e278-4aff-81e4-fc05d727c04a&mode=&code=500
[22:21] I'll make the code abort on a 500
[22:21] ^ JAA do we want that>
[22:21] ?
[22:21] else we might miss stuff
[22:21] Hm, idk
[22:22] That's what I got after a bunch of regular ones JAA, those URLs
[22:23] we're now aborting on retries after 500
[22:23] 20170625.02 is now minimum
[22:23] and we're at 900+ items/min now
[22:24] Limiting at 800 items/mmin
[22:24] let me know if we should go higher
[22:25] Do we want to requeue the previous items?
[22:26] yeah, requeuing all items now
[22:26] done.
[22:27] is there also parts of the forum that are not going down that we are not archiving?
[22:27] Throws a "project code out of date" after having re-pulled via git.
[22:28] arkiver: My item list covers all forums. But yes, it appears that some parts will stay.
[22:28] It's just not very clear which ones...
[22:28] ld1: You need to restart the pipeline after updating the code.
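The "500 items/min if we have 36 hours left" figure can be checked with a little arithmetic. The sketch below assumes a round 1,000,000 items (the log only says "slightly over 1 million"), so the results are approximate; the function names are hypothetical.

```python
def required_rate(items: int, hours_left: float) -> float:
    """Items per minute needed to finish all items within hours_left."""
    return items / (hours_left * 60)


def hours_needed(items: int, items_per_minute: float) -> float:
    """Hours required to process all items at a steady rate."""
    return items / items_per_minute / 60


ITEMS = 1_000_000  # assumption: rounded down from "slightly over 1 million"

print(f"needed for 36 h: {required_rate(ITEMS, 36):.0f} items/min")
print(f"at 800 items/min: {hours_needed(ITEMS, 800):.1f} h")
```

Under that assumption, roughly 463 items/min would be needed, which is consistent with the rounded 500 items/min target, and the 800 items/min limit would finish in about 21 hours if the server keeps up.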
[22:29] I did, but I'll purge `data/` now.
[22:30] That did it. Thanks.
[22:43] "Server returned 0 (HERR). Sleeping." -- That took longer than I thought it would.
[22:44] That just mean roblox got overwhelmed and is now timing out
[22:44] ?
[22:45] Likely
[22:45] Better than 302 redirects to an error 500 page.
[22:46] arkiver: We killed it.
[22:47] You mean like this, which is now looping the 500 on mine:
[22:47] 12=302 https://forum.roblox.com/Forum/ShowPost.aspx?PostID=114497378
[22:47] 13=500 https://www.roblox.com/request-error?id=a0201d10-19d4-4522-bf80-b39f6a804c98&mode=&code=500
[22:47] Server returned 500 (RETRFINISHED). Sleeping.
[22:48] Yeah
[22:52] *** Honno has quit IRC (Read error: Operation timed out)
[22:56] Lovely, their server is case-insensitive too. It doesn't matter whether you access https://forum.roblox.com/Forum/ or /forum/ or /fORuM/.
[22:59] At least I couldn't find any other spellings than /Forum on their website.
[23:04] *** schbirid2 has joined #archiveteam-bs
[23:07] *** schbirid has quit IRC (Read error: Operation timed out)
[23:07] *** username1 has joined #archiveteam-bs
[23:09] *** schbirid2 has quit IRC (Read error: Operation timed out)
[23:10] *** ld1 has quit IRC (Remote host closed the connection)
[23:10] *** ld1 has joined #archiveteam-bs
[23:11] *** Ravenloft has joined #archiveteam-bs
[23:11] so, what the cool kids are using to backup content from netflix these days?
[23:19] *** cog has quit IRC (Ping timeout: 268 seconds)
[23:54] *** BlueMaxim has joined #archiveteam-bs
[23:56] *** schbirid2 has joined #archiveteam-bs
[23:58] *** username1 has quit IRC (Read error: Operation timed out)
[23:59] *** schbirid has joined #archiveteam-bs