#warrior 2018-08-08,Wed


Time	Nickname	Message
00:08 ^🔗	JAA	adinbied: I think you need to inherit from WgetDownload and implement it there. Let me see if I can find an example.
00:09 ^🔗	JAA	Not quite, but similar: https://github.com/ArchiveTeam/tindeck-grab/blob/0e686b117ab4cd4a3606e3abd7a23d357cb570dc/pipeline.py#L229-L283
00:10 ^🔗	JAA	(And a bit further down for the usage with WgetDownload.)
00:14 ^🔗	adinbied	OK, I'll give it another go in a bit with that in mind. Thanks!
00:22 ^🔗		adinbied has quit IRC (Read error: Operation timed out)
01:43 ^🔗	arkiver	adinbied: what are you trying to create?
01:44 ^🔗	arkiver	I'm not sure which project you are basing this on, but https://github.com/adinbied/quizlet-grabv2/blob/master/quizlet.lua seems very empty
01:45 ^🔗	arkiver	nvm, I see adinbied quit
03:26 ^🔗		adinbied has joined #warrior
03:27 ^🔗	adinbied	@arkiver, I just glanced at the IRC logs and saw your response, I was basing it off of the halo2-grab, which seemed to be one of the more basic seesaw scripts to use as a starting point
03:30 ^🔗	adinbied	I'm attempting to write a warrior grab script for quizlet.com via their API as the normal version of the site is very JS-dependent and would be less than ideal to grab.
04:00 ^🔗		adinbied has quit IRC (Quit: Leaving)
07:19 ^🔗	Flashfire	Are any warrior projects currently running? JAA? Astrid?
07:34 ^🔗	Flashfire	Or should I just shut down my warrior for the time being?
08:50 ^🔗		mls has quit IRC (Ping timeout: 268 seconds)
08:51 ^🔗		mls has joined #warrior
09:23 ^🔗		mls has quit IRC (Quit: leaving)
10:29 ^🔗	JAA	adinbied: We should probably grab the website anyway, including anything requested by the browser through JS. While playback might not work now, it'll only get better at handling JS (if anyone invests the time and effort), so it might work eventually.
10:34 ^🔗	JAA	Flashfire: No idea what's active currently. Check the channels and tracker, I guess.
10:34 ^🔗	JAA	Tindeck seems to be active, but not sure if it works in the warrior.
12:25 ^🔗	eientei95	He quit JAA
12:25 ^🔗	eientei95	(adinbied)
12:26 ^🔗	JAA	eientei95: I know, but he/she knows about the logs, see above.
12:26 ^🔗	eientei95	oh
16:54 ^🔗	arkiver	adinbied: The recent halo2 scripts? They were never actually used, since the project was cancelled. I see you changed a lot in the pipeline.py scripts. Feel free of course to do this the way you want, but the only thing you probably have to change in pipeline.py is where URLs are appended to the wget-lua arguments and the part containing project names.
16:55 ^🔗	arkiver	Is the website going away?
16:55 ^🔗	arkiver	Or do you really want to just get the API data? In that case I could help set up a project for that pretty quick.
16:56 ^🔗	JAA	arkiver: From #archivebot: 2018-08-06 23:11:56 UTC < adinbied> Not currently, but it holds alot of learning material and info that AFAIK isn't backed up
16:56 ^🔗	arkiver	right
16:56 ^🔗	arkiver	Was there any reason for the substantial changes to pipeline.py?
16:57 ^🔗	JAA	I haven't looked at it in detail, but adinbied was trying to implement support for multiple API keys which would be used randomly.
16:59 ^🔗	arkiver	In that case that random key selection part can just be added here I think https://github.com/ArchiveTeam/tindeck-grab/blob/master/pipeline.py#L264-L272
17:00 ^🔗		adinbied has joined #warrior
17:00 ^🔗	JAA	Yep, that's what I linked as well.
17:01 ^🔗	arkiver	adinbied: what URLs exactly are you trying to archive?
17:02 ^🔗	arkiver	only https://github.com/adinbied/quizlet-grabv2/blob/master/pipeline.py#L220 ?
17:02 ^🔗	arkiver	and you don't have to parse data or anything?
17:02 ^🔗	adinbied	Hello all - back again. Glanced at the logs. JAA, would archiving the entire site be better or just the API responses? @arkiver, Yeah, my main changes were to archive all of the API responses using a selection of API client IDs
17:03 ^🔗	adinbied	Yeah, so an example would be https://api.quizlet.com/2.0/sets/10000000?client_id=BNCkwdk2dm
17:04 ^🔗	arkiver	No parsing of data and extracting more URLs?
17:04 ^🔗	JAA	adinbied: ¿Por qué no los dos? :-)
17:04 ^🔗	adinbied	The API returns a JSON response as linked above - just archiving the JSON response from the API would be a lot more useful for importing into other things (IMO) and lightweight as far as filesize goes
17:05 ^🔗	adinbied	No, everything is incremental
17:05 ^🔗	arkiver	Ok, do you want help setting this up?
17:07 ^🔗	adinbied	Yeah, this is my first time actually writing a Warrior script outside of just messing with it myself. What was tripping me up was https://github.com/adinbied/quizlet-grabv2/blob/master/pipeline.py#L220 which seems to pick a random client ID from the list when the script is first run, but then keeps using it for all future requests
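
The behaviour adinbied describes (one client ID picked when the script starts, then reused for every request) is what happens when the random choice is evaluated once while the argument list is built, rather than once per URL. The following is a minimal sketch of that difference in plain Python, not the actual pipeline.py code. `CLIENT_IDS`, `url_fixed`, and `url_random` are hypothetical names, and the IDs are placeholders rather than real Quizlet client IDs.

```python
import random

# Hypothetical placeholder values, not real Quizlet client IDs.
CLIENT_IDS = ["exampleId1", "exampleId2", "exampleId3"]

# Evaluated once, when the module is loaded: every URL built later
# reuses this single value for the lifetime of the process.
FIXED_CLIENT_ID = random.choice(CLIENT_IDS)


def url_fixed(set_id):
    """Build an API URL using the ID that was chosen once at load time."""
    return "https://api.quizlet.com/2.0/sets/{}?client_id={}".format(
        set_id, FIXED_CLIENT_ID
    )


def url_random(set_id):
    """Build an API URL, drawing a fresh client ID on every call."""
    return "https://api.quizlet.com/2.0/sets/{}?client_id={}".format(
        set_id, random.choice(CLIENT_IDS)
    )
```

Moving the `random.choice` call inside the function that builds each URL is the whole change; the list of IDs itself stays the same.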
17:08 ^🔗	arkiver	Got it
17:08 ^🔗	adinbied	On the tracker side, the queue was just incrementally increasing URLs in the format api.quizlet.com/2.0/sets/460377, api.quizlet.com/2.0/sets/460378, etc.
17:10 ^🔗	arkiver	Would it help if I create a project which you can try to use for this and possible future incremental URLs projects?
17:10 ^🔗	arkiver	I can make a group on ArchiveTeam and give you access to the repo
17:13 ^🔗	adinbied	Sure, that would be immensely helpful - if you could help me out with the API client ID issue as well, that would be great. A possible solution I thought of was just randomly appending the client IDs to the URLs on the tracker side, but I'm not sure if that would cause other problems down the line or something
17:13 ^🔗	arkiver	should have something up in a bit
17:14 ^🔗	adinbied	OK, thank you so much!
17:14 ^🔗	JAA	Yeah, that sounds very useful.
17:16 ^🔗	arkiver	any special status codes?
17:17 ^🔗	arkiver	in case of bans for example
17:20 ^🔗	arkiver	403: https://api.quizlet.com/2.0/sets/10000003?client_id=QTTg3wuA6D
17:20 ^🔗	arkiver	and 404 if not existing
17:20 ^🔗	arkiver	I guess we can skip 403?
17:20 ^🔗	adinbied	From my testing, an outright banned/invalid Client ID gives a 401, A protected set that requires owner permission gives 403, and
17:21 ^🔗	arkiver	Okay, so let's abort on 401
17:21 ^🔗	adinbied	yep, 404 for not existing and 200 for OK - skipping 403 sounds good, otherwise we'd just get a lot of duplicate error messages
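
The status code handling agreed on above can be summarised as a small lookup. This is only a sketch of the decision table from the conversation, not the code in the quizlet-grab scripts. The function name and the action labels are hypothetical, and the "unknown" fallback for other codes is an assumption, since the discussion does not cover them.

```python
# Decision table taken from the discussion:
#   200 -> set exists and was returned
#   404 -> set does not exist
#   403 -> protected set, skip it
#   401 -> banned or invalid client ID, abort
STATUS_ACTIONS = {
    200: "ok",
    404: "not_found",
    403: "skip",
    401: "abort",
}


def action_for_status(status_code):
    """Return the agreed action label for an HTTP status code.

    Codes that were not discussed map to "unknown" (an assumption).
    """
    return STATUS_ACTIONS.get(status_code, "unknown")
```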
17:22 ^🔗	arkiver	and keep client id the same throughout multiple items?
17:24 ^🔗	adinbied	That's the part I wasn't sure on. My initial thought was to have the client ID vary for each request to stop IP & client ID based rate limiting, although that might not end up being an issue, idk
17:25 ^🔗	arkiver	I have it with the client ID staying the same right now, but we can change it every request too
17:25 ^🔗	arkiver	example output https://gist.github.com/Arkiver2/d48b85196456efb59ffb20321f24c2e8
17:25 ^🔗	arkiver	https://github.com/ArchiveTeam/quizlet-grab
17:25 ^🔗	arkiver	Note that this is very basic and not doing any data parsing since we didn't need that for this.
17:26 ^🔗	arkiver	It would be ideal if we could archive every URL with each client ID for best support in the Wayback Machine, but since the website is not at risk currently, this is probably fine.
17:27 ^🔗	JAA	Is the API actually used on the website?
17:27 ^🔗	JAA	And if so, with what client_id?
17:27 ^🔗	jut	Should i run the script?
17:28 ^🔗	JAA	I was under the impression that it was a separate thing.
17:28 ^🔗	arkiver	Gives invalid request without client_id
17:29 ^🔗	arkiver	Script updated, now with a random client_id for each URL.
17:30 ^🔗	arkiver	example output: https://gist.github.com/Arkiver2/94c44681f6ca479053e1a2922e3d64d0
17:30 ^🔗	adinbied	Perfect - that looks good. The API isn't used on the regular version of the site (quizlet.com), which therefore doesn't require a client_id.
17:30 ^🔗	JAA	The website has a separate "webapi" thing, but that's only used for spyi... uh, logging.
17:30 ^🔗	JAA	So yeah, what client_id is in the URLs isn't really relevant.
17:31 ^🔗	adinbied	arkiver, that's perfect! Mind if I create an IRC channel and add the source to the wiki?
17:31 ^🔗	arkiver	sure
17:31 ^🔗	JAA	Yeah, this is the point where we need a proper channel.
17:31 ^🔗	arkiver	hold on, creating group
17:31 ^🔗	arkiver	and adding to tracker
17:32 ^🔗	adinbied	Project IRC: #quizletusin
17:33 ^🔗	arkiver	https://github.com/orgs/ArchiveTeam/teams/quizlet
17:33 ^🔗	arkiver	JAA: adinbied: both added to the team
17:33 ^🔗	arkiver	also project is on tracker
17:34 ^🔗	adinbied	Hmm.. getting a 404 on the team page....
17:34 ^🔗	arkiver	Accept the invite
17:34 ^🔗	arkiver	I think you both should be able to edit the scripts now
17:35 ^🔗	adinbied	Sweet - looks good
17:39 ^🔗	adinbied	Do you need me to send a URL list to add to the tracker (generated via an auto-incrementing Python script) or do you have that covered?
17:40 ^🔗	JAA	We won't have a URL list on the tracker for this, just items like "1-1000" covering a certain range of IDs.
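
An item such as "1-1000" then has to be expanded into individual set URLs on the client side. A minimal sketch of that expansion follows. `expand_item` is a hypothetical helper, not a function from the project scripts, and treating both ends of the range as inclusive is an assumption.

```python
def expand_item(item_name):
    """Expand a range item like "1-1000" into one API URL per set ID.

    Assumes both ends of the range are inclusive.
    """
    start_text, end_text = item_name.split("-", 1)
    start, end = int(start_text), int(end_text)
    if start > end:
        raise ValueError("range start must not exceed range end")
    return [
        "https://api.quizlet.com/2.0/sets/{}".format(set_id)
        for set_id in range(start, end + 1)
    ]
```

Keeping only ranges on the tracker means the queue holds one short item name per thousand IDs instead of a thousand URLs.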
17:41 ^🔗	adinbied	Ah, OK. Is arkiver the only one with tracker admin privs for this project?
17:42 ^🔗	JAA	Let's discuss anything project-specific over in #quizletusin.
17:47 ^🔗	astrid	there are a bunch of people with global-admin who aren't active any more
17:47 ^🔗	astrid	should we audit this list sometime?
17:48 ^🔗	arkiver	Yes
21:16 ^🔗		Atom__ has joined #warrior
21:20 ^🔗		Atom-- has quit IRC (Read error: Operation timed out)
