[00:08] adinbied: I think you need to inherit from WgetDownload and implement it there. Let me see if I can find an example.
[00:09] Not quite, but similar: https://github.com/ArchiveTeam/tindeck-grab/blob/0e686b117ab4cd4a3606e3abd7a23d357cb570dc/pipeline.py#L229-L283
[00:10] (And a bit further down for the usage with WgetDownload.)
[00:14] OK, I'll give it another go in a bit with that in mind. Thanks!
[00:22] *** adinbied has quit IRC (Read error: Operation timed out)
[01:43] adinbied: what are you trying to create?
[01:44] I'm not sure on which project you are basing this, but https://github.com/adinbied/quizlet-grabv2/blob/master/quizlet.lua seems very empty
[01:45] nvm, I see adinbied quit
[03:26] *** adinbied has joined #warrior
[03:27] @arkiver, I just glanced at the IRC logs and saw your response. I was basing it off of halo2-grab, which seemed to be one of the more basic seesaw scripts to use as a starting point.
[03:30] I'm attempting to write a warrior grab script for quizlet.com via their API, as the normal version of the site is very JS-dependent and would be less than ideal to grab.
[04:00] *** adinbied has quit IRC (Quit: Leaving)
[07:19] Are any warrior projects currently running? JAA? Astrid?
[07:34] Or should I just shut down my warrior for the time being?
[08:50] *** mls has quit IRC (Ping timeout: 268 seconds)
[08:51] *** mls has joined #warrior
[09:23] *** mls has quit IRC (Quit: leaving)
[10:29] adinbied: We should probably grab the website anyway, including anything requested by the browser through JS. While playback might not work now, it'll only get better at handling JS (if anyone invests the time and effort), so it might work eventually.
[10:34] Flashfire: No idea what's active currently. Check the channels and tracker, I guess.
[10:34] Tindeck seems to be active, but not sure if it works in the warrior.
[12:25] He quit, JAA
[12:25] (adinbied)
[12:26] eientei95: I know, but he/she knows about the logs, see above.
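[Editor's note: the tindeck-grab lines linked above use a common ArchiveTeam pattern: instead of giving WgetDownload a static argument list, the pipeline passes an object whose realize(item) method builds the wget-lua arguments per item. A minimal standalone sketch of that pattern follows; the flags, URL template, and item format here are illustrative placeholders, not the real project's values, and in a real pipeline.py the instance is handed to seesaw's WgetDownload.]

```python
# Sketch of the per-item argument-builder pattern referenced above.
# All names and flags are illustrative; in a real ArchiveTeam pipeline
# this object replaces the static argument list given to WgetDownload,
# and seesaw invokes realize(item) once for each work item.

class WgetArgs(object):
    def realize(self, item):
        # Static arguments shared by every item.
        wget_args = [
            './wget-lua',
            '--output-document', 'data.warc.gz',
            '--no-check-certificate',
        ]
        # Per-item arguments: append one API URL per set ID in the item.
        for set_id in item['item_name'].split(','):
            wget_args.append(
                'https://api.quizlet.com/2.0/sets/{}'.format(set_id))
        return wget_args
```

Because realize() runs once per item, anything computed inside it (such as a randomly chosen API key) is re-evaluated for every item rather than fixed at startup.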
[12:26] oh
[16:54] adinbied: The recent halo2 scripts? They were never actually used, since the project was cancelled. I see you changed a lot in the pipeline.py scripts. Feel free of course to do this the way you want, but the only thing you probably have to change in pipeline.py is where URLs are appended to the wget-lua arguments, and the part containing project names.
[16:55] Is the website going away?
[16:55] Or do you really want to just get the API data? In that case I could help set up a project for that pretty quick.
[16:56] arkiver: From #archivebot: 2018-08-06 23:11:56 UTC < adinbied> Not currently, but it holds a lot of learning material and info that AFAIK isn't backed up
[16:56] right
[16:56] Was there any reason for the substantial changes to pipeline.py?
[16:57] I haven't looked at it in detail, but adinbied was trying to implement support for multiple API keys which would be used randomly.
[16:59] In that case that random key selection part can just be added here, I think: https://github.com/ArchiveTeam/tindeck-grab/blob/master/pipeline.py#L264-L272
[17:00] *** adinbied has joined #warrior
[17:00] Yep, that's what I linked as well.
[17:01] adinbied: what URLs exactly are you trying to archive?
[17:02] only https://github.com/adinbied/quizlet-grabv2/blob/master/pipeline.py#L220 ?
[17:02] and you don't have to parse data or anything?
[17:02] Hello all - back again. Glanced at the logs. JAA, would archiving the entire site be better, or just the API responses? @arkiver, yeah, my main changes were to archive all of the API responses using a selection of API client IDs.
[17:03] Yeah, so an example would be https://api.quizlet.com/2.0/sets/10000000?client_id=BNCkwdk2dm
[17:04] No parsing of data and extracting more URLs?
[17:04] adinbied: Why not both?
:-)
[17:04] The API returns a JSON response as linked above - just archiving the JSON response from the API would be a lot more useful for importing into other things (IMO), and lightweight as far as filesize goes
[17:05] No, everything is incremental
[17:05] OK, do you want help setting this up?
[17:07] Yeah, this is my first time actually writing a Warrior script outside of just messing with it myself. What was tripping me up was https://github.com/adinbied/quizlet-grabv2/blob/master/pipeline.py#L220 which seems to pick a random client ID from the list when the script is first run, but then keeps using it for all future requests
[17:08] Got it
[17:08] On the tracker side, the queue was just incrementally increasing URLs in the format api.quizlet.com/2.0/sets/460377, api.quizlet.com/2.0/sets/460378, etc.
[17:10] Would it help if I create a project which you can try to use for this and possible future incremental-URL projects?
[17:10] I can make a group on ArchiveTeam and give you access to the repo
[17:13] Sure, that would be immensely helpful - if you could help me out with the API client ID issue as well, that would be great. A possible solution I thought of was just randomly appending the client IDs to the URLs on the tracker side, but I'm not sure if that would cause other problems down the line or something
[17:13] should have something up in a bit
[17:14] OK, thank you so much!
[17:14] Yeah, that sounds very useful.
[17:16] any special status codes?
[17:17] in case of bans for example
[17:20] 403: https://api.quizlet.com/2.0/sets/10000003?client_id=QTTg3wuA6D
[17:20] and 404 if not existing
[17:20] I guess we can skip 403?
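[Editor's note: the issue described at [17:07] is a classic Python pitfall: random.choice evaluated at module level runs exactly once, at import time, so the "random" client ID is fixed for the whole run. A sketch of the broken and fixed versions side by side; the CLIENT_IDS values and function names are placeholders, not real Quizlet keys or the project's actual code.]

```python
import random

CLIENT_IDS = ['keyA', 'keyB', 'keyC']  # placeholder values, not real keys

# Broken: evaluated exactly once, when the module is imported, so every
# URL built for the lifetime of the pipeline reuses the same client ID.
CLIENT_ID = random.choice(CLIENT_IDS)

def url_static(set_id):
    return 'https://api.quizlet.com/2.0/sets/{}?client_id={}'.format(
        set_id, CLIENT_ID)

# Fixed: the choice happens inside the function, so each call (e.g. each
# item, or each URL) can pick a different client ID.
def url_random(set_id):
    return 'https://api.quizlet.com/2.0/sets/{}?client_id={}'.format(
        set_id, random.choice(CLIENT_IDS))
```

The fix is simply a matter of where the random.choice call sits: move it from module scope into the code path that runs per item.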
[17:20] From my testing, an outright banned/invalid client ID gives a 401, a protected set that requires owner permission gives 403, and
[17:21] Okay, so let's abort on 401
[17:21] yep, 404 for not existing and 200 for OK - skipping 403 sounds good, otherwise we'd just get a lot of duplicate error messages
[17:22] and keep the client ID the same throughout multiple items?
[17:24] That's the part I wasn't sure on. My initial thought was to have the client ID vary for each request to stop IP & client ID based rate limiting, although that might not end up being an issue, idk
[17:25] Right now I've got it with the client ID staying the same, but we can change it every request too
[17:25] example output: https://gist.github.com/Arkiver2/d48b85196456efb59ffb20321f24c2e8
[17:25] https://github.com/ArchiveTeam/quizlet-grab
[17:25] Note that this is very basic and not doing any data parsing, since we didn't need that for this.
[17:26] It would be most perfect if we could archive every URL with each client ID for best support in the Wayback Machine, but since the website is not at risk currently, this is probably fine.
[17:27] Is the API actually used on the website?
[17:27] And if so, with what client_id?
[17:27] Should I run the script?
[17:28] I was under the impression that it was a separate thing.
[17:28] Gives invalid request without client_id
[17:29] Script updated, now with a random client_id for each URL.
[17:30] example output: https://gist.github.com/Arkiver2/94c44681f6ca479053e1a2922e3d64d0
[17:30] Perfect - that looks good. The API isn't used on the regular version of the site (quizlet.com) and therefore doesn't require a client_id.
[17:30] The website has a separate "webapi" thing, but that's only used for spyi... uh, logging.
[17:30] So yeah, what client_id is in the URLs isn't really relevant.
[17:31] arkiver, that's perfect! Mind if I create an IRC channel and add the source to the wiki?
[17:31] sure
[17:31] Yeah, this is the point where we need a proper channel.
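[Editor's note: in the actual project this decision logic lives in the wget-lua Lua callbacks, not in Python, but the status-code policy agreed above (200 OK, 403 and 404 skipped, 401 aborts the item) can be sketched as a simple lookup. The function name, return values, and the catch-all retry case are illustrative assumptions.]

```python
# Illustrative sketch of the status-code policy agreed in the chat above.
# In the real quizlet-grab project this lives in the wget-lua callback
# script, not in Python; names and return values here are hypothetical.

def handle_status(code):
    if code == 200:
        return 'ok'     # set exists; response gets archived
    if code in (403, 404):
        return 'skip'   # protected set / nonexistent set: move on
    if code == 401:
        return 'abort'  # banned or invalid client ID: stop the item
    return 'retry'      # anything else (e.g. 5xx): assume transient
```

Skipping 403 rather than retrying it matters because, as noted above, retrying protected sets would only archive a lot of duplicate error messages.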
[17:31] hold on, creating group
[17:31] and adding to tracker
[17:32] Project IRC: #quizletusin
[17:33] https://github.com/orgs/ArchiveTeam/teams/quizlet
[17:33] JAA: adinbied: both added to the team
[17:33] also project is on tracker
[17:34] Hmm, getting a 404 on the team page...
[17:34] Accept the invite
[17:34] I think you both should be able to edit the scripts now
[17:35] Sweet - looks good
[17:39] Do you need me to send a URL list to add to the tracker (generated via an auto-incrementing Python script), or do you have that covered?
[17:40] We won't have a URL list on the tracker for this, just items like "1-1000" covering a certain range of IDs.
[17:41] Ah, OK. Is arkiver the only one with tracker admin privs for this project?
[17:42] Let's discuss anything project-specific over in #quizletusin.
[17:47] there are a bunch of people with global admin who aren't active any more
[17:47] should we audit this list sometime?
[17:48] Yes
[21:16] *** Atom__ has joined #warrior
[21:20] *** Atom-- has quit IRC (Read error: Operation timed out)
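[Editor's note: to illustrate the 'items like "1-1000"' approach mentioned at [17:40]: instead of queuing individual URLs, the tracker hands out a range, and the pipeline expands it into one API URL per set ID, attaching a random client ID to each. A sketch under the same placeholder-key assumption as before; the function name is hypothetical.]

```python
import random

CLIENT_IDS = ['keyA', 'keyB', 'keyC']  # placeholders, not real Quizlet keys

def expand_item(item_name):
    """Expand a tracker item like '1-1000' into one API URL per set ID,
    with a randomly chosen client ID appended to each URL."""
    start, end = (int(part) for part in item_name.split('-'))
    return [
        'https://api.quizlet.com/2.0/sets/{}?client_id={}'.format(
            set_id, random.choice(CLIENT_IDS))
        for set_id in range(start, end + 1)
    ]
```

Range items keep the tracker's queue small (one entry per thousand sets instead of per set) while still letting each worker vary the client ID per request.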