#warrior 2018-08-08,Wed


Time Nickname Message
00:08 πŸ”— JAA adinbied: I think you need to inherit from WgetDownload and implement it there. Let me see if I can find an example.
00:09 πŸ”— JAA Not quite, but similar: https://github.com/ArchiveTeam/tindeck-grab/blob/0e686b117ab4cd4a3606e3abd7a23d357cb570dc/pipeline.py#L229-L283
00:10 πŸ”— JAA (And a bit further down for the usage with WgetDownload.)
00:14 πŸ”— adinbied OK, I'll give it another go in a bit with that in mind. Thanks!
00:22 πŸ”— adinbied has quit IRC (Read error: Operation timed out)
01:43 πŸ”— arkiver adinbied: what are you trying to create?
01:44 πŸ”— arkiver I'm not sure which project you're basing this on, but https://github.com/adinbied/quizlet-grabv2/blob/master/quizlet.lua seems very empty
01:45 πŸ”— arkiver nvm, see adinbied quit
03:26 πŸ”— adinbied has joined #warrior
03:27 πŸ”— adinbied @arkiver, I just glanced at the IRC logs and saw your response. I was basing it off the halo2-grab, which seemed to be one of the more basic seesaw scripts to use as a starting point
03:30 πŸ”— adinbied I'm attempting to write a warrior grab script for quizlet.com via their API as the normal version of the site is very JS-dependent and would be less than ideal to grab.
04:00 πŸ”— adinbied has quit IRC (Quit: Leaving)
07:19 πŸ”— Flashfire Are any warrior projects currently running? JAA? Astrid?
07:34 πŸ”— Flashfire Or should I just shut down my warrior for the time being?
08:50 πŸ”— mls has quit IRC (Ping timeout: 268 seconds)
08:51 πŸ”— mls has joined #warrior
09:23 πŸ”— mls has quit IRC (Quit: leaving)
10:29 πŸ”— JAA adinbied: We should probably grab the website anyway, including anything requested by the browser through JS. While playback might not work now, it'll only get better at handling JS (if anyone invests the time and effort), so it might work eventually.
10:34 πŸ”— JAA Flashfire: No idea what's active currently. Check the channels and tracker, I guess.
10:34 πŸ”— JAA Tindeck seems to be active, but not sure if it works in the warrior.
12:25 πŸ”— eientei95 He quit JAA
12:25 πŸ”— eientei95 (adinbied)
12:26 πŸ”— JAA eientei95: I know, but he/she knows about the logs, see above.
12:26 πŸ”— eientei95 oh
16:54 πŸ”— arkiver adinbied: The recent halo2 scripts? They were never actually used, since the project was cancelled. I see you changed a lot in the pipeline.py scripts. Feel free of course to do this the way you want, but the only thing you probably have to change in pipeline.py is where URLs are appended to the wget-lua arguments and the part containing project names.
16:55 πŸ”— arkiver Is the website going away?
16:55 πŸ”— arkiver Or do you really want to just get the API data? In that case I could help set up a project for that pretty quick.
16:56 πŸ”— JAA arkiver: From #archivebot: 2018-08-06 23:11:56 UTC < adinbied> Not currently, but it holds a lot of learning material and info that AFAIK isn't backed up
16:56 πŸ”— arkiver right
16:56 πŸ”— arkiver Was there any reason for the substantial changes to pipeline.py?
16:57 πŸ”— JAA I haven't looked at it in detail, but adinbied was trying to implement support for multiple API keys which would be used randomly.
16:59 πŸ”— arkiver In that case that random key selection part can just be added here I think https://github.com/ArchiveTeam/tindeck-grab/blob/master/pipeline.py#L264-L272
17:00 πŸ”— adinbied has joined #warrior
17:00 πŸ”— JAA Yep, that's what I linked as well.
17:01 πŸ”— arkiver adinbied: what URLs exactly are you trying to archive?
17:02 πŸ”— arkiver only https://github.com/adinbied/quizlet-grabv2/blob/master/pipeline.py#L220 ?
17:02 πŸ”— arkiver and you don't have to parse data or anything?
17:02 πŸ”— adinbied Hello all - back again. Glanced at the logs. JAA, would archiving the entire site be better, or just the API responses? @arkiver, yeah, my main changes were to archive all of the API responses using a selection of API client IDs
17:03 πŸ”— adinbied Yeah, so an example would be https://api.quizlet.com/2.0/sets/10000000?client_id=BNCkwdk2dm
17:04 πŸ”— arkiver No parsing of data and extracting more URLs?
17:04 πŸ”— JAA adinbied: Why not both? :-)
17:04 πŸ”— adinbied The API returns a JSON response as linked above - just archiving the JSON response from the API would be a lot more useful for importing into other things (IMO) and lightweight as far as file size goes
17:05 πŸ”— adinbied No, everything is incremental
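
For reference, a minimal sketch of what fetching one of these API URLs might look like. This is only an illustration under assumptions: EXAMPLE_CLIENT_ID is a placeholder rather than a real key from the log, and only the 200/404 cases are handled here.

```python
import requests

# Illustrative sketch only: EXAMPLE_CLIENT_ID is a placeholder, not a real key.
API_URL = "https://api.quizlet.com/2.0/sets/{set_id}?client_id={client_id}"
CLIENT_ID = "EXAMPLE_CLIENT_ID"

def fetch_set(set_id):
    """Fetch one set from the Quizlet 2.0 API and return the parsed JSON, or None."""
    resp = requests.get(API_URL.format(set_id=set_id, client_id=CLIENT_ID), timeout=30)
    if resp.status_code == 200:
        return resp.json()
    # 404 means the set ID does not exist; other status codes are discussed below.
    return None

if __name__ == "__main__":
    data = fetch_set(10000000)
    if data is not None:
        print(sorted(data.keys()))
```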
17:05 πŸ”— arkiver Ok, do you want help setting this up?
17:07 πŸ”— adinbied Yeah, this is my first time actually writing a Warrior script outside of just messing with it myself. What was tripping me up was https://github.com/adinbied/quizlet-grabv2/blob/master/pipeline.py#L220 which seems to pick a random client ID from the list when the script is first run, but then keeps using it for all future requests
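
A rough sketch of the per-item pattern from the linked tindeck-grab pipeline (an object with a realize() method passed to WgetDownload), which avoids the one-time random.choice problem described above. The CLIENT_IDS list, the wget arguments, and the single-set-ID handling of item_name are placeholders for illustration, not the real project values.

```python
import random

# Placeholder client IDs, not the project's real keys.
CLIENT_IDS = ["client_id_one", "client_id_two", "client_id_three"]

class WgetArgs(object):
    def realize(self, item):
        # random.choice() runs here, once per item, instead of once at module
        # level when pipeline.py is imported - so the chosen client ID no
        # longer sticks for the whole run.
        client_id = random.choice(CLIENT_IDS)
        wget_args = [
            "./wget-lua",
            "--timeout", "30",
            # Simplified; the real pipelines use seesaw's ItemInterpolation here.
            "--warc-file", item["item_dir"] + "/" + item["warc_file_base"],
        ]
        # Simplified: item_name is treated as a single set ID for brevity.
        wget_args.append(
            "https://api.quizlet.com/2.0/sets/%s?client_id=%s"
            % (item["item_name"], client_id)
        )
        return wget_args

# In pipeline.py an instance of this class would replace the plain argument
# list: WgetDownload(WgetArgs(), ...)
```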
17:08 πŸ”— arkiver Got it
17:08 πŸ”— adinbied On the tracker side, the queue was just incrementally increasing URLs in the format api.quizlet.com/2.0/sets/460377, api.quizlet.com/2.0/sets/460378, etc.
17:10 πŸ”— arkiver Would it help if I create a project which you can try to use for this and possible future incremental URLs projects?
17:10 πŸ”— arkiver I can make a group on ArchiveTeam and give you access to the repo
17:13 πŸ”— adinbied Sure, that would be immensely helpful - if you could help me out with the API client ID issue as well, that would be great. A possible solution I thought of was just randomly appending the client IDs to the URLs on the tracker side, but I'm not sure if that would cause other problems down the line or something
17:13 πŸ”— arkiver should have something up in a bit
17:14 πŸ”— adinbied OK, thank you so much!
17:14 πŸ”— JAA Yeah, that sounds very useful.
17:16 πŸ”— arkiver any special status codes?
17:17 πŸ”— arkiver in case of bans for example
17:20 πŸ”— arkiver 403: https://api.quizlet.com/2.0/sets/10000003?client_id=QTTg3wuA6D
17:20 πŸ”— arkiver and 404 if not existing
17:20 πŸ”— arkiver I guess we can skip 403?
17:20 πŸ”— adinbied From my testing, an outright banned/invalid client ID gives a 401, a protected set that requires owner permission gives 403, and
17:21 πŸ”— arkiver Okay, so let's abort on 401
17:21 πŸ”— adinbied yep, 404 for not existing and 200 for OK - skipping 403 sounds good, otherwise we'd just get a lot of duplicate error messages
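
A minimal plain-Python sketch of the status-code policy just agreed on (200 OK, 404 and 403 skipped, 401 aborts); in the actual project this logic would live in the wget-lua callback / pipeline rather than in a helper like this.

```python
# Actions as simple string constants, purely for illustration.
ABORT, SKIP, OK = "abort", "skip", "ok"

def classify(status_code):
    """Map a Quizlet API HTTP status to the action discussed above."""
    if status_code == 200:
        return OK     # set exists, archive normally
    if status_code == 404:
        return SKIP   # set ID does not exist
    if status_code == 403:
        return SKIP   # private set that needs owner permission
    if status_code == 401:
        return ABORT  # banned or invalid client ID: stop rather than record junk
    return ABORT      # anything unexpected: fail safe

assert classify(401) == ABORT and classify(403) == SKIP and classify(200) == OK
```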
17:22 πŸ”— arkiver and keep the client ID the same throughout multiple items?
17:24 πŸ”— adinbied That's the part I wasn't sure on. My initial thought was to have the client ID vary for each request to stop IP- and client-ID-based rate limiting, although that might not end up being an issue, idk
17:25 πŸ”— arkiver Got it. The client ID stays the same right now, but we can change it every request too
17:25 πŸ”— arkiver example output https://gist.github.com/Arkiver2/d48b85196456efb59ffb20321f24c2e8
17:25 πŸ”— arkiver https://github.com/ArchiveTeam/quizlet-grab
17:25 πŸ”— arkiver Note that this is very basic and not doing any data parsing since we didn't need that for this.
17:26 πŸ”— arkiver Ideally we would archive every URL with each client ID for the best support in the Wayback Machine, but since the website is not currently at risk, this is probably fine.
17:27 πŸ”— JAA Is the API actually used on the website?
17:27 πŸ”— JAA And if so, with what client_id?
17:27 πŸ”— jut Should I run the script?
17:28 πŸ”— JAA I was under the impression that it was a separate thing.
17:28 πŸ”— arkiver Gives invalid request without client_id
17:29 πŸ”— arkiver Script updated; it now picks a random client_id for each URL.
17:30 πŸ”— arkiver example output: https://gist.github.com/Arkiver2/94c44681f6ca479053e1a2922e3d64d0
17:30 πŸ”— adinbied Perfect - that looks good. The API isn't used on the regular version of the site (quizlet.com), so the site itself doesn't require a client_id.
17:30 πŸ”— JAA The website has a separate "webapi" thing, but that's only used for spyi... uh, logging.
17:30 πŸ”— JAA So yeah, what client_id is in the URLs isn't really relevant.
17:31 πŸ”— adinbied arkiver, that's perfect! Mind if I create an IRC channel and add the source to the wiki?
17:31 πŸ”— arkiver sure
17:31 πŸ”— JAA Yeah, this is the point where we need a proper channel.
17:31 πŸ”— arkiver hold on, creating group
17:31 πŸ”— arkiver and adding to tracker
17:32 πŸ”— adinbied Project IRC: #quizletusin
17:33 πŸ”— arkiver https://github.com/orgs/ArchiveTeam/teams/quizlet
17:33 πŸ”— arkiver JAA: adinbied: both added to the team
17:33 πŸ”— arkiver also project is on tracker
17:34 πŸ”— adinbied Hmm.. getting a 404 on the team page....
17:34 πŸ”— arkiver Accept the invite
17:34 πŸ”— arkiver I think you both should be able to edit the scripts now
17:35 πŸ”— adinbied Sweet - looks good
17:39 πŸ”— adinbied Do you need me to send a URL list to add to the tracker (generated via an auto-incrementing Python script) or do you have that covered?
17:40 πŸ”— JAA We won't have a URL list on the tracker for this, just items like "1-1000" covering a certain range of IDs.
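
As an illustration of that item format, a small sketch of expanding a hypothetical range item into API URLs, with a random client_id per URL as set up earlier; CLIENT_IDS is a placeholder list, not the project's real keys.

```python
import random

# Placeholder client IDs for the sketch.
CLIENT_IDS = ["client_id_one", "client_id_two", "client_id_three"]

def item_to_urls(item_name):
    """Expand a tracker item like "460377-460476" into one API URL per set ID."""
    start, end = (int(part) for part in item_name.split("-"))
    return [
        "https://api.quizlet.com/2.0/sets/%d?client_id=%s"
        % (set_id, random.choice(CLIENT_IDS))
        for set_id in range(start, end + 1)
    ]

print(len(item_to_urls("460377-460476")))  # 100 URLs for a 100-ID item
```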
17:41 πŸ”— adinbied Ah, OK. Is arkiver the only one with tracker admin privs for this project?
17:42 πŸ”— JAA Let's discuss anything project-specific over in #quizletusin.
17:47 πŸ”— astrid there are a bunch of people with global-admin who aren't active any more
17:47 πŸ”— astrid should we audit this list sometime?
17:48 πŸ”— arkiver Yes
21:16 πŸ”— Atom__ has joined #warrior
21:20 πŸ”— Atom-- has quit IRC (Read error: Operation timed out)
