[01:18] https://www.facebook.com/ArchiveTeam
[01:18] OK, hop on.
[01:18] Let's fuck with them
[01:21] oh no my identity
[01:35] Alright I'm on
[06:49] I'm on there
[08:08] PIXPD
[12:03] WiK, I think I remember you saying you just incremented the id to find repos on GitHub. Was 'GET https://api.github.com/repositories' not useful? What happens when you request a repo that no longer exists?
[12:06] I want to data mine repos that are not forked, for documentation
[12:07] I asked about the above API because it seems like it would be faster to use the API in this case to get blocks of ids, but if there is no penalty for hitting a bad id then just starting with the highest id would be easiest
[12:17] I am trying to identify problems with the software we are using right now. I devised a simple test that already found a problem site (ours lol) and I would like others, if they have a moment, to try it out http://pad.archivingyoursh.it/p/getting_pages
[12:51] another site gone: http://www.techdirt.com/articles/20130514/10145123081/critic-chinese-censorship-censored-microblog-with-11-million-followers-deleted.shtml
[13:03] omf_: that's exactly what I'm using
[13:04] I'm getting a block, saving the 'last seen' id, and then getting another block
[13:04] but you are rate limited to 5000 I believe
[13:04] # of calls that is. How many results per call?
[13:05] idk, approx 3-4 hundred or so
[13:05] but I've RARELY hit that limit due to my download speeds when I thread 10 downloads at a time
[13:06] yeah, their documentation is light on the # of results returned for most calls
[13:06] I just don't store any of the repo data like 'fork': false or anything
[13:06] I only pull out name/full_name
[13:07] I can always use that API to find out if it's a fork or not later
[13:08] I was looking at your gitDigger repo. I see the different data lists, but where is the list of working repos?
[13:08] or does that come after the talk?
[13:10] or is everything in here https://raw.github.com/wick2o/gitDigger/master/github_projectnames.txt already tested as working?
[13:10] omf_: everything in that list has already been downloaded
[13:11] I've got a lot more, just haven't updated it for a while, AND that's just project names
[13:11] I haven't published the name/repo_name data
[13:12] Does the API only return valid repos?
[13:14] yes
[13:14] it also doesn't return private repos
[13:14] That is fine for my first attempt
[13:15] why not just wait till I hand over my data, and then you will have it
[13:15] all you will have to do is use the API to query if it's a fork or not
[13:15] Because I might be able to get a talk in next month on this idea
[13:16] Nothing security related like yours
[13:16] NLP of the documentation
[13:17] I am only going to do like 5000 repos now
[13:17] not the whole thing, I don't have that kinda time
[13:17] where are you gonna submit the talk?
[13:21] omf_: I've cloned 636273 projects so far, and have filled 8 TB worth of drives
[13:21] Yeah that is sweet
[13:59] omf_: 1691303 is the last id I scanned
[13:59] thanks
[13:59] at least so far :)
[14:00] are you writing your downloader yourself, or tweaking mine to see if it's forked or not?
[14:02] I haven't decided yet. It is pretty straightforward to tell if a repo is a fork by checking the fork flag and then the forks count
[14:37] I am still testing the NLP parts
[14:50] ah
[15:19] WiK, no Perl in your cat/language/ ?
[15:20] huh?
[15:20] aaah, it is just .pl
[15:21] what other extensions do you suggest I add?
[15:21] although they are all in all_files.txt
[15:21] for Perl you've got the main one, which is .pl
[15:21] but for docs there is .pod
[15:22] and .pm for Perl library files
[15:23] well, you can always grep them out of all_files.txt
[15:23] How big is your whole repo when I check it out? The web interface says the blob is too big
[15:23] egrep -i '\.pod$' all_files.txt
[15:24] 1.05 GB
[15:24] oh, that is not bad at all
[15:24] I am going to clone it now
[18:49] SketchCow: Ping
[18:59] Twitter puts SketchCow at Red Lobster, ersi. Not sure if he has IRC on his phone
[18:59] It's not life critical. And IRC doesn't have to be instant.
[20:55] What
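
The block-by-block enumeration and fork check discussed above ([12:03] through [14:02]) could look roughly like the following Python sketch. It assumes the documented GitHub v3 endpoints GET /repositories?since=<id> (public repos in ascending id order) and GET /repos/{owner}/{repo} (which carries the fork flag and forks count); the requests library, the pause on rate limiting, and the small sample cap are illustrative choices, and none of this is gitDigger's actual code.

    # Sketch: enumerate public GitHub repositories in id order and check the fork flag.
    # Mirrors the "get a block, save the last seen id, get another block" approach
    # described in the channel; illustrative only, not gitDigger's code.
    import time
    import requests  # assumed available; any HTTP client works

    API = "https://api.github.com"

    def list_repositories(since_id=0):
        """Yield public repos in ascending id order, paginating with ?since=."""
        while True:
            resp = requests.get(f"{API}/repositories", params={"since": since_id})
            if resp.status_code == 403:      # rate limited (5000/hr only with an auth token)
                time.sleep(60)
                continue
            resp.raise_for_status()
            block = resp.json()
            if not block:                    # no more repos past this id
                return
            for repo in block:
                yield repo                   # keep only name/full_name if storage matters
            since_id = block[-1]["id"]       # the "last seen" id for the next block

    def is_fork(full_name):
        """Query a single repo and report whether it is a fork (None if it 404s)."""
        resp = requests.get(f"{API}/repos/{full_name}")
        if resp.status_code == 404:          # deleted or private repos simply 404
            return None
        resp.raise_for_status()
        data = resp.json()
        # data also carries "forks_count" if you want the second check mentioned at [14:02]
        return data.get("fork", False)

    if __name__ == "__main__":
        for i, repo in enumerate(list_repositories(since_id=0)):
            print(repo["id"], repo["full_name"], "fork" if repo.get("fork") else "source")
            if i >= 100:                     # stop after a small sample
                break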
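
Similarly, the Perl-extension filtering of all_files.txt mentioned at [15:19] through [15:23] could be done in a few lines. The filename all_files.txt and the .pl/.pm/.pod extensions come from the conversation; the rest is an illustrative sketch doing the same job as the quoted egrep one-liner.

    # Sketch: filter Perl-related paths (.pl, .pm, .pod) out of all_files.txt,
    # the same job as the egrep one-liner quoted above.
    import re

    PERL_EXTENSIONS = re.compile(r"\.(pl|pm|pod)$", re.IGNORECASE)

    with open("all_files.txt", encoding="utf-8", errors="replace") as infile:
        for line in infile:
            path = line.rstrip("\n")
            if PERL_EXTENSIONS.search(path):
                print(path)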