#archiveteam-bs 2017-07-25,Tue


Who What When
wp494reminder that d-day for roblox forums is the 27th [00:09]
***odemg has joined #archiveteam-bs
ReimuHaku has quit IRC (Ping timeout: 245 seconds)
ReimuHaku has joined #archiveteam-bs
BlueMaxim has quit IRC (Quit: Leaving)
bitBaron has joined #archiveteam-bs
qw3rty2 has quit IRC (Read error: Operation timed out)
qw3rty has joined #archiveteam-bs
bitBaron has quit IRC (Read error: Operation timed out)
GLaDOS has quit IRC (Remote host closed the connection)
tsuckow has joined #archiveteam-bs
[00:10]
GLaDOS has joined #archiveteam-bs [00:31]
GLaDOS..so i figured out why i wasnt able to log into my wiki account
my password manager fucking urlencoded the password when it saved it
[00:35]
***tsuckow has quit IRC (Ping timeout: 268 seconds) [00:36]
xmchahahaha
computers are just great aren't they
[00:41]
........ (idle for 39mn)
GLaDOS"what's that? this string has to be kept the exact same? well i better encode it then!" [01:20]
***bitBaron has joined #archiveteam-bs [01:23]
bitBaron has quit IRC (Read error: Operation timed out) [01:33]
username1 has joined #archiveteam-bs
schbirid2 has quit IRC (Read error: Operation timed out)
[01:39]
..... (idle for 21mn)
GLaDOS has quit IRC (Remote host closed the connection) [02:02]
Odd0002 has quit IRC (Remote host closed the connection) [02:08]
GLaDOS has joined #archiveteam-bs [02:21]
bitBaron has joined #archiveteam-bs [02:28]
bitBaron has quit IRC (Read error: Operation timed out) [02:36]
GLaDOS has quit IRC (Remote host closed the connection)
GLaDOS has joined #archiveteam-bs
[02:41]
pizzaiolo has quit IRC (pizzaiolo) [02:50]
..... (idle for 24mn)
qw3rty2 has joined #archiveteam-bs [03:14]
qw3rty has quit IRC (Read error: Operation timed out) [03:21]
bitBaron has joined #archiveteam-bs [03:31]
bitBaron has quit IRC (Read error: Operation timed out)
Stiletto has quit IRC (Read error: Operation timed out)
[03:40]
....... (idle for 32mn)
BlueMaxim has joined #archiveteam-bs [04:15]
..... (idle for 22mn)
bitBaron has joined #archiveteam-bs [04:37]
bitBaron has quit IRC (Ping timeout: 633 seconds)
Sk1d has quit IRC (Ping timeout: 250 seconds)
[04:48]
Sk1d has joined #archiveteam-bs [04:56]
......... (idle for 44mn)
bitBaron has joined #archiveteam-bs [05:40]
bitBaron has quit IRC (Ping timeout: 633 seconds)
Yoshimura has joined #archiveteam-bs
[05:51]
j08nY has joined #archiveteam-bs [06:05]
..... (idle for 24mn)
Fletcher| has quit IRC (Remote host closed the connection) [06:29]
j08nY has quit IRC (Quit: Leaving) [06:38]
bitBaron has joined #archiveteam-bs
Fletcher has joined #archiveteam-bs
[06:45]
bitBaron has quit IRC (Read error: Operation timed out) [06:52]
..................... (idle for 1h41mn)
zhongfu has quit IRC (Ping timeout: 260 seconds) [08:33]
BlueMaxim has quit IRC (Read error: Operation timed out) [08:41]
bitBaron has joined #archiveteam-bs [08:51]
Stiletti has joined #archiveteam-bs
Stiletti is now known as Stiletto
bitBaron has quit IRC (Read error: Operation timed out)
[09:01]
........... (idle for 54mn)
bitBaron has joined #archiveteam-bs [09:56]
bitBaron has quit IRC (Read error: Operation timed out) [10:08]
........ (idle for 35mn)
zhongfu has joined #archiveteam-bs [10:43]
..... (idle for 20mn)
bitBaron has joined #archiveteam-bs [11:03]
........ (idle for 37mn)
pizzaiolo has joined #archiveteam-bs [11:40]
..... (idle for 21mn)
odemg has quit IRC (Ping timeout: 260 seconds) [12:01]
........ (idle for 36mn)
odemg has joined #archiveteam-bs [12:37]
.... (idle for 19mn)
ld1 has quit IRC (Quit: WeeChat 1.9) [12:56]
brayden has joined #archiveteam-bs
swebb sets mode: +o brayden
brayden has quit IRC (Connection closed)
brayden has joined #archiveteam-bs
swebb sets mode: +o brayden
[13:02]
ld1 has joined #archiveteam-bs
ld1 has quit IRC (Client Quit)
ld1 has joined #archiveteam-bs
[13:10]
.......... (idle for 46mn)
ld1 has quit IRC (Quit: WeeChat 1.9) [13:59]
.... (idle for 15mn)
ld1 has joined #archiveteam-bs
qw3rty has joined #archiveteam-bs
qw3rty2 has quit IRC (Ping timeout: 600 seconds)
[14:14]
.... (idle for 16mn)
qw3rty2 has joined #archiveteam-bs [14:33]
qw3rty has quit IRC (Read error: Operation timed out) [14:41]
.... (idle for 17mn)
ZexaronS has quit IRC (Quit: Leaving) [14:58]
....... (idle for 32mn)
JAAarkiver: For example, there would be an item forumpage:13_186637 which translates to retrieving https://forum.roblox.com/Forum/ShowForum.aspx?ForumID=13&PageIndex=186637 and all /Forum/ShowPost.aspx?PostID=\d+$ links on that page plus the relevant &PageIndex= URLs for multi-paged threads.
(In this particular case, there are no multi-paged threads though.)
For forumpage:13_186638, it would for example fetch https://forum.roblox.com/Forum/ShowPost.aspx?PostID=24086454 and https://forum.roblox.com/Forum/ShowPost.aspx?PostID=24086454&PageIndex=2
I can give you my code for extracting the post IDs and maximum page index for each thread on a forum index page, if you want.
[15:30]
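(Editor's sketch, not from the channel: one way the thread URLs described above could be derived. The link pattern and the two example URLs come from JAA's messages; the function names `post_ids` and `thread_urls` are hypothetical, and the hrefs are assumed to have been pulled out of the forum index HTML already.)

```python
import re

BASE = "https://forum.roblox.com"
# Pattern quoted in the log: /Forum/ShowPost.aspx?PostID=\d+$
POST_LINK = re.compile(r"/Forum/ShowPost\.aspx\?PostID=(\d+)$")


def post_ids(hrefs):
    """Return post IDs of links matching the thread pattern, in order, deduplicated."""
    seen = []
    for href in hrefs:
        match = POST_LINK.search(href)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


def thread_urls(post_id, max_page=1):
    """Build the first-page URL plus one &PageIndex= URL per additional page."""
    first = f"{BASE}/Forum/ShowPost.aspx?PostID={post_id}"
    return [first] + [f"{first}&PageIndex={n}" for n in range(2, max_page + 1)]


if __name__ == "__main__":
    for url in thread_urls("24086454", max_page=2):
        print(url)
```

For the two-page thread mentioned above this yields the plain ShowPost URL followed by the `&PageIndex=2` variant.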
***j08nY has joined #archiveteam-bs [15:38]
....... (idle for 33mn)
JAAAlso, forumpage:N_1 should retrieve https://forum.roblox.com/Forum/ShowForum.aspx?ForumID=N, i.e. no PageIndex parameter.
We might miss some things if people post on a forum while we're grabbing it, but we should be able to capture most of it.
My wpull index grab is only at roughly 1/4 by the way...
[16:11]
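(Editor's sketch, not from the channel: the item-name-to-URL mapping as JAA proposes it at this point, with page 1 fetched without a PageIndex parameter. The function name `index_url` is hypothetical, and the actual project code may differ.)

```python
def index_url(item):
    """Translate an item name like 'forumpage:13_186637' into a forum index URL."""
    kind, _, rest = item.partition(":")
    if kind != "forumpage":
        raise ValueError(f"unexpected item type: {item!r}")
    # Raises ValueError if the part after the colon is not '<forum>_<page>'.
    forum_id, page = (int(part) for part in rest.split("_"))
    url = f"https://forum.roblox.com/Forum/ShowForum.aspx?ForumID={forum_id}"
    if page > 1:
        # Page 1 is requested without PageIndex, as suggested in the log.
        url += f"&PageIndex={page}"
    return url


if __name__ == "__main__":
    print(index_url("forumpage:13_186637"))
    print(index_url("forumpage:23_1"))
```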
arkiverok
JAA: can you please prepare the items list
preparing the warrior project now
with this idea
[16:22]
JAAWill do
Oh dear
Can anyone please find and kill the guy who produced this bullshit forum software?
[16:26]
username1may i propose lennart poettering? [16:29]
***username1 is now known as schbirid [16:29]
JAAhttps://forum.roblox.com/Forum/ShowForum.aspx?ForumID=46&PageIndex=69161 returns threads although there are only 69160 pages in that forum.
Also, each page is supposed to contain 25 threads, but that's rarely the case.
And yeah, we can go after Lennart as well.
[16:29]
***pikhq has quit IRC (Read error: Operation timed out) [16:42]
JAAarkiver: Done. It's a 19.8 MiB text file with one item name per line. Compresses down to 2.66 MiB with gzip. Where do you want me to put it?
Slightly over 1 million items, by the way.
[16:45]
arkiver1 million items is good [16:45]
***pikhq has joined #archiveteam-bs [16:50]
JAAarkiver: Do you want my code for determining the number of thread pages? [16:53]
arkiversure [16:54]
JAA(It's really just a few string searches in the forum listing HTML.) [16:54]
arkiveryeah, I'm almost done anyway, but show me
haha
there's one really good way to make a website not playable by the wayback machine
make the website one big POST request and it will not be in the wayback machine :P
[16:55]
JAAI first did it with an HTML parser and XPath. More beautiful and way slower.
Haha yeah.
[16:56]
arkiverah yeah, not using that [16:56]
JAAI'll remember that in case I ever want to make a page unarchivable. [16:56]
arkiver:P [16:56]
JAAarkiver: https://gist.github.com/JustAnotherArchivist/85ef5c0e9d874791ee485fa69d08ac62#file-hook-py-L61-L82 [17:01]
.............. (idle for 1h6mn)
***ld1 has quit IRC (Quit: WeeChat 1.9)
ld1 has joined #archiveteam-bs
[18:07]
ld1 has quit IRC (Quit: WeeChat 1.9) [18:14]
ld1 has joined #archiveteam-bs [18:20]
odemg has quit IRC (Read error: Operation timed out) [18:32]
.......... (idle for 47mn)
ZexaronS has joined #archiveteam-bs [19:19]
.... (idle for 19mn)
odemg has joined #archiveteam-bs [19:38]
.......... (idle for 48mn)
cog has joined #archiveteam-bs [20:26]
qw3rty has joined #archiveteam-bs
qw3rty3 has joined #archiveteam-bs
qw3rty2 has quit IRC (Read error: Operation timed out)
[20:36]
qw3rty has quit IRC (Read error: Operation timed out) [20:46]
JAAarkiver: How is it going? We need to get this up and running ASAP... [20:52]
wp494^^^
we have a bit under 48 hrs to go
[20:54]
***qw3rty3 has quit IRC (Read error: Operation timed out)
Sue has quit IRC (Quit: leaving)
Sue_ is now known as Sue
[20:59]
arkiveralmost done
hold on for a little bit longer
when was it announced this would go away?
[21:10]
***odemg has quit IRC (Read error: Operation timed out)
odemg has joined #archiveteam-bs
[21:11]
JAAarkiver: Ten days ago or so. [21:16]
arkiver: Any ETA? I have to leave soonish. [21:29]
arkiverunder an hour hopefully
please send me the itemlist someway so I have that in case you're offline
[21:29]
JAARight. How? [21:31]
arkiverI guess you can put it on github
not sure where :P
[21:32]
JAAI'll try Gist. [21:32]
arkiveri'm using the <span class="normalTextSmallBold">Page 1 of 3</span> to get max page
some final testing now
[21:33]
JAAarkiver: https://gist.github.com/JustAnotherArchivist/9cdc31f75e995e228014c97876d29350/raw/50df623f504addf9eed79cc113de439cd926cbe5/roblox-items.txt [21:38]
arkiverdo we also want pageindex=1
let's skip that
[21:40]
JAAAgreed
There are links to that on the forum index, but since you get the same content as without the parameter, there is no point.
[21:41]
arkiverI'll have a log for you to check in a few minutes
JAA: https://paste.fedoraproject.org/paste/zsD2uqj8jDV7RiZRxXLGmA/raw
JAA: another one https://paste.fedoraproject.org/paste/cpjj7-uF62HgF2TzpJsIgg/raw
looks like new threads are still being created, so the threads on a page shift over time
[21:41]
JAAYeah, and some old threads might become active again. We'll certainly miss some threads with this strategy, but I don't think anything else is feasible in this time frame and without massive data duplication. [21:48]
arkiverwe'll miss the newest threads
uh nvm, we're not going through items sequentially
(wanted to say we're archiving some double)
does this look good?
[21:49]
JAAYeah, looks good. Does your code for the max page retrieval account for number formatting? I believe they print numbers over 1000 as "1,000". At least they do on the pager in the forums. [21:54]
arkiverno, do you have an example? [21:54]
JAAI haven't come across a thread yet, but I'm sure they exist. [21:54]
arkiverah, simply see bottom of https://forum.roblox.com/Forum/ShowForum.aspx?ForumID=13&PageIndex=186638
I'll add support for that
[21:54]
JAAYeah. I *think* it will look the same for threads, but it'd be nice to check.
But of course that stupid forum software doesn't allow for sorting threads by post number.
arkiver: Found one: https://forum.roblox.com/Forum/ShowPost.aspx?PostID=80624889
You can test it at forumpage:23_1.
[21:55]
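(Editor's sketch, not from the channel: reading the maximum page from the pager span arkiver quotes, tolerating thousands separators such as "1,000". The span markup is taken from the log. The function name `max_page` and the fallback to 1 when no pager is found are assumptions, not the project's actual behaviour.)

```python
import re

# Span quoted in the log: <span class="normalTextSmallBold">Page 1 of 3</span>
PAGER = re.compile(
    r'<span class="normalTextSmallBold">Page ([\d,]+) of ([\d,]+)</span>'
)


def max_page(html):
    """Return the total page count from the pager, or 1 if no pager is present."""
    match = PAGER.search(html)
    if match is None:
        return 1  # assumption: a missing pager means a single page
    # Strip thousands separators before converting ("1,000" -> 1000).
    return int(match.group(2).replace(",", ""))


if __name__ == "__main__":
    print(max_page('<span class="normalTextSmallBold">Page 1 of 3</span>'))
    print(max_page('<span class="normalTextSmallBold">Page 1 of 1,234</span>'))
```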
arkiveryep, testing now
it's doing this, so I guess it's working https://paste.fedoraproject.org/paste/CKurQ~alUWo8D5G8mpPHzw/raw
[22:00]
JAAhttps://forum.roblox.com/Forum/ShowForum.aspx?ForumID=23&PageIndex=1 << I'd remove the PageIndex parameter here as well.
Otherwise, looks good. :-)
[22:03]
arkiverit's just a few, let's save those without pageindex with save page now
I'm starting the project
actually added it anyway
JAA this good? https://paste.fedoraproject.org/paste/NbFqTlMLR2CLO5syAaYIqQ/raw
doing both URLs
[22:03]
JAAPerfect [22:06]
***svchfoo3 has quit IRC (Quit: Closing) [22:07]
arkiverit's online https://github.com/ArchiveTeam/roblox-grab [22:07]
JAAHooray [22:08]
arkiverin the warrior now [22:08]
***svchfoo3 has joined #archiveteam-bs
svchfoo1 sets mode: +o svchfoo3
[22:10]
arkiverproject is started
it's warrior default now
[22:14]
ld1Limits for concurrency? [22:17]
mlsGetting 500s atm
For Roblox
[22:18]
arkiverI'm not getting 500
all working fine here
[22:18]
JAAMe neither [22:18]
mlsI'll leave it to cool down, had 6 concurrent [22:18]
arkiverwe need to go 500 items/min if we have 36 hours left
looks like we can make it
I have 10 concurrent
we're now doing 600+ items/min
750+ actually
[22:18]
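(Editor's sketch, not from the channel: the arithmetic behind the rate target. The item count is rounded to 1,000,000 from JAA's "slightly over 1 million", and the 36 hours is arkiver's figure; the function names are hypothetical.)

```python
import math


def required_rate(items, hours):
    """Minimum whole items per minute needed to finish `items` within `hours`."""
    return math.ceil(items / (hours * 60))


def hours_needed(items, items_per_minute):
    """Hours required to finish `items` at a steady rate."""
    return items / items_per_minute / 60


if __name__ == "__main__":
    print(required_rate(1_000_000, 36))          # items/min needed for 36 hours
    print(round(hours_needed(1_000_000, 750), 1))  # hours at 750 items/min
```

With these rounded inputs the floor is about 463 items/min, so the quoted 500 items/min leaves some margin, and a sustained 750 items/min would finish in roughly 22 hours.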
JAAOh yeah, I'm seeing a few 500s currently. But very few really. [22:20]
arkiverah I have 500s now too [22:20]
mlsIt unstuck for me [22:20]
JAAHmm, some interesting URLs in there... Stuff like 47=500 https://www.roblox.com/request-error?id=a3c10460-e278-4aff-81e4-fc05d727c04a&mode=&code=500 [22:21]
arkiverI'll make the code abort on a 500
^ JAA do we want that>
?
else we might miss stuff
[22:21]
JAAHm, idk [22:21]
mlsThat's what I got after a bunch of regular ones JAA, those URLs [22:22]
arkiverwe're now aborting on retries after 500
20170625.02 is now minimum
and we're at 900+ items/min now
Limiting at 800 items/min
let me know if we should go higher
[22:23]
JAADo we want to requeue the previous items? [22:25]
arkiveryeah, requeuing all items now
done.
are there also parts of the forum that are not going down that we are not archiving?
[22:26]
ld1Throws a "project code out of date" after having re-pulled via git. [22:27]
JAAarkiver: My item list covers all forums. But yes, it appears that some parts will stay.
It's just not very clear which ones...
ld1: You need to restart the pipeline after updating the code.
[22:28]
ld1I did, but I'll purge `data/` now.
That did it. Thanks.
[22:29]
JAA"Server returned 0 (HERR). Sleeping." -- That took longer than I thought it would. [22:43]
bitspillThat just means roblox got overwhelmed and is now timing out
?
[22:44]
JAALikely
Better than 302 redirects to an error 500 page.
arkiver: We killed it.
[22:45]
bitspillYou mean like this, which is now looping the 500 on mine:
12=302 https://forum.roblox.com/Forum/ShowPost.aspx?PostID=114497378
13=500 https://www.roblox.com/request-error?id=a0201d10-19d4-4522-bf80-b39f6a804c98&mode=&code=500
Server returned 500 (RETRFINISHED). Sleeping.
[22:47]
JAAYeah [22:48]
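(Editor's sketch, not from the channel: one way to recognise the error page that the 302 redirects lead to, so a grab script could treat it as a failure rather than as content. The URL shape comes from the log lines above; the function name `is_error_page` is hypothetical and this is not the check used by the actual project code.)

```python
from urllib.parse import parse_qs, urlsplit


def is_error_page(url):
    """True if the URL looks like roblox.com's request-error page with code=500."""
    parts = urlsplit(url)
    if parts.path != "/request-error":
        return False
    # parse_qs drops blank values such as 'mode=' by default.
    return parse_qs(parts.query).get("code") == ["500"]


if __name__ == "__main__":
    print(is_error_page(
        "https://www.roblox.com/request-error"
        "?id=a0201d10-19d4-4522-bf80-b39f6a804c98&mode=&code=500"
    ))
    print(is_error_page(
        "https://forum.roblox.com/Forum/ShowPost.aspx?PostID=114497378"
    ))
```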
***Honno has quit IRC (Read error: Operation timed out) [22:52]
JAALovely, their server is case-insensitive too. It doesn't matter whether you access https://forum.roblox.com/Forum/ or /forum/ or /fORuM/.
At least I couldn't find any other spellings than /Forum on their website.
[22:56]
***schbirid2 has joined #archiveteam-bs
schbirid has quit IRC (Read error: Operation timed out)
username1 has joined #archiveteam-bs
schbirid2 has quit IRC (Read error: Operation timed out)
ld1 has quit IRC (Remote host closed the connection)
ld1 has joined #archiveteam-bs
Ravenloft has joined #archiveteam-bs
[23:04]
Ravenloftso, what are the cool kids using to back up content from netflix these days? [23:11]
***cog has quit IRC (Ping timeout: 268 seconds) [23:19]
........ (idle for 35mn)
BlueMaxim has joined #archiveteam-bs
schbirid2 has joined #archiveteam-bs
username1 has quit IRC (Read error: Operation timed out)
schbirid has joined #archiveteam-bs
[23:54]
