#archiveteam 2013-05-15,Wed

Time	Nickname	Message
01:18 ^🔗	SketchCow	https://www.facebook.com/ArchiveTeam
01:18 ^🔗	SketchCow	OK, hop on.
01:18 ^🔗	SketchCow	Let's fuck with them
01:21 ^🔗	DFJustin	oh no my identity
01:35 ^🔗	BlueMax	Alright I'm on
06:49 ^🔗	GLaDOS	I'm on there
08:08 ^🔗	SketchCow	PIXPD
12:03 ^🔗	omf_	WiK, I think I remember you saying you just incremented the id to find repos on github. Was 'GET https://api.github.com/repositories' not useful? What happens when you request a repo that no longer exists?
12:06 ^🔗	omf_	I want to data mine repos that are not forked for documentation
12:07 ^🔗	omf_	I asked about the above API because it seems like it would be faster to use the API in this case to get blocks of ids but if there is no penalty for hitting a bad id then just starting with the highest id would be easiest
12:17 ^🔗	omf_	I am trying to identify problems with the software we are using right. I devised a simple test that already found a problem site (ours lol) and I would like others if they have a moment to try it out http://pad.archivingyoursh.it/p/getting_pages
12:51 ^🔗	godane	another site gone: http://www.techdirt.com/articles/20130514/10145123081/critic-chinese-censorship-censored-microblog-with-11-million-followers-deleted.shtml
13:03 ^🔗	WiK	omf_: thats exactly what im using
13:04 ^🔗	WiK	im getting a block, saving the 'last seen' and then getting another block
13:04 ^🔗	omf_	but you are rate limited to 5000 I believe
13:04 ^🔗	omf_	# of calls that is. How many results per call?
13:05 ^🔗	WiK	idk, approx 3-4 hundred or so
13:05 ^🔗	WiK	but ive RARELY hit that limit due to my download speeds when i thread 10 downloads at a time
13:06 ^🔗	omf_	yeah their documentation is light on # of results returned for most calls
13:06 ^🔗	WiK	i just dont store any of the repo data like 'fork': false or anything
13:06 ^🔗	WiK	i only pull out name/full_name
13:07 ^🔗	WiK	i can always use that api to find out if its a fork or not later
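
The block-by-block walk WiK describes (get a block, save the 'last seen' id, get the next block, keep only name/full_name) might look roughly like the sketch below. It assumes the `since` query parameter of `GET https://api.github.com/repositories` and the `id`, `name` and `full_name` fields of each returned repository object. The helper names `fetch_block` and `iter_repos` and the `max_blocks` cap are hypothetical, and this is not taken from gitDigger itself.

```python
import json
import urllib.request
from typing import Callable, Iterator

API_URL = "https://api.github.com/repositories"


def fetch_block(since: int) -> list:
    """Fetch one block of public repositories with an id greater than `since`."""
    url = f"{API_URL}?since={since}"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def iter_repos(fetch: Callable[[int], list] = fetch_block,
               since: int = 0,
               max_blocks: int = 1) -> Iterator[dict]:
    """Yield id/name/full_name for each repo, one block at a time.

    `fetch` is injectable so the loop can be exercised without network access.
    `max_blocks` is a hypothetical safety cap to stay under the rate limit.
    """
    last_seen = since
    for _ in range(max_blocks):
        block = fetch(last_seen)
        if not block:
            break  # an empty block means there is nothing newer to list
        for repo in block:
            yield {
                "id": repo["id"],
                "name": repo["name"],
                "full_name": repo["full_name"],
            }
        # Remember the highest id seen so the next request resumes after it.
        last_seen = block[-1]["id"]
```

Calling `list(iter_repos(since=1691303, max_blocks=2))` would request two blocks starting after that id; unauthenticated requests are rate limited far more tightly than the 5000 calls mentioned above.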
13:08 ^🔗	omf_	I was looking at your gitDigger repo. I see the different data lists but where is list of working repos?
13:08 ^🔗	omf_	or does that come after the talk?
13:10 ^🔗	omf_	or is everything in here https://raw.github.com/wick2o/gitDigger/master/github_projectnames.txt already tested as working
13:10 ^🔗	WiK	omf_: everything in that list has already been downloaded
13:11 ^🔗	WiK	ive got alot more, just havent updated it for awhile, AND thats just project names
13:11 ^🔗	WiK	i havent published the name/repo_name data
13:12 ^🔗	omf_	Does the API only return valid repos?
13:14 ^🔗	WiK	yes
13:14 ^🔗	WiK	it also doesnt return private repos
13:14 ^🔗	omf_	That is fine for my first attempt
13:15 ^🔗	WiK	why not just wait till i hand over my data, and then you will have it
13:15 ^🔗	WiK	all you will have to do is use the api to query if its a fork or not
13:15 ^🔗	omf_	Because I might be able to get a talk in next month on this idea
13:16 ^🔗	omf_	Nothing security related like yours
13:16 ^🔗	omf_	NLP of the documentation
13:17 ^🔗	omf_	I am only going to do like 5000 repos now
13:17 ^🔗	omf_	not the whole thing, I don't have that kinda time
13:17 ^🔗	WiK	where are you gonna submit the talk?
13:21 ^🔗	WiK	omf_: ive cloned 636273 projects so far, and have filled 8tb worth of drives
13:21 ^🔗	omf_	Yeah that is sweet
13:59 ^🔗	WiK	omf_: 1691303 last id i scanned
13:59 ^🔗	omf_	thanks
13:59 ^🔗	WiK	at least so far :)
14:00 ^🔗	WiK	you writing your downloader yourself or tweaking mine to see if its forked or not?
14:02 ^🔗	omf_	I haven't decided yet. It is pretty straight forward to tell if a repo is a fork by checking the fork flag and then the forks count
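
One possible reading of the check omf_ mentions, sketched under assumptions: it treats a repository as "not forked" when its `fork` flag is false and, optionally, when its `forks_count` is zero. Both keys are fields of the GitHub repository object, although `forks_count` may only be present in the full per-repository response. The function name `is_unforked` and the `require_no_forks` switch are hypothetical.

```python
def is_unforked(repo: dict, require_no_forks: bool = False) -> bool:
    """Return True if `repo` is not itself a fork.

    With `require_no_forks=True`, also require that nobody has forked it,
    which only works when the object carries a `forks_count` field.
    """
    if repo.get("fork", False):
        return False  # the repo is a fork of something else
    if require_no_forks:
        return repo.get("forks_count", 0) == 0
    return True
```

Filtering a fetched block is then a one-liner such as `[r for r in block if is_unforked(r)]`.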
14:37 ^🔗	omf_	I am still testing the NLP parts
14:50 ^🔗	WiK	ah
15:19 ^🔗	omf_	WiK, No perl in your cat/language/ ?
15:20 ^🔗	WiK	huh?
15:20 ^🔗	omf_	aaah it is just pl
15:21 ^🔗	WiK	what other extension do you suggest i add?
15:21 ^🔗	WiK	all tho they are all in all_files.txt
15:21 ^🔗	omf_	for Perl you got the main one which is .pl
15:21 ^🔗	omf_	but for docs there is .pod
15:22 ^🔗	omf_	and .pm for Perl library files
15:23 ^🔗	WiK	well you can always grep them outta all_files.txt
15:23 ^🔗	omf_	How big is your whole repo when I check it out? The web interface says the blob is too big
15:23 ^🔗	WiK	egrep -i '\.pod$' all_files.txt
15:24 ^🔗	WiK	1.05 GB
15:24 ^🔗	omf_	oh that is not bad at all
15:24 ^🔗	omf_	I am going to clone it now
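
The egrep filter WiK suggests at 15:23 could be widened to all three Perl extensions mentioned above (.pl, .pm, .pod). The sketch below assumes all_files.txt holds one path per line, which the log does not confirm; the name `perl_files` is hypothetical.

```python
import re
from typing import Iterable, Iterator

# Case-insensitive match on a trailing .pl, .pm or .pod, like `egrep -i`.
PERL_EXT = re.compile(r"\.(pl|pm|pod)$", re.IGNORECASE)


def perl_files(lines: Iterable[str]) -> Iterator[str]:
    """Yield the paths whose extension marks Perl code, modules or docs."""
    for line in lines:
        path = line.rstrip("\r\n")  # drop the line ending before matching
        if PERL_EXT.search(path):
            yield path
```

Usage would be along the lines of `with open("all_files.txt") as fh: matches = list(perl_files(fh))`.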
18:49 ^🔗	ersi	SketchCow: Ping
18:59 ^🔗	omf_	Twitter puts SketchCow at Red Lobster ersi. Not sure if he has irc on his phone
18:59 ^🔗	ersi	It's not life critical. And IRC doesn't have to be instant.
20:55 ^🔗	SketchCow	What
