#archiveteam 2013-05-15,Wed

Time Nickname Message
01:18 🔗 SketchCow https://www.facebook.com/ArchiveTeam
01:18 🔗 SketchCow OK, hop on.
01:18 🔗 SketchCow Let's fuck with them
01:21 🔗 DFJustin oh no my identity
01:35 🔗 BlueMax Alright I'm on
06:49 🔗 GLaDOS I'm on there
08:08 🔗 SketchCow PIXPD
12:03 🔗 omf_ WiK, I think I remember you saying you just incremented the id to find repos on github. Was 'GET https://api.github.com/repositories' not useful? What happens when you request a repo that no longer exists?
12:06 🔗 omf_ I want to data mine repos that are not forked for documentation
12:07 🔗 omf_ I asked about the above API because it seems like it would be faster to use the API in this case to get blocks of ids but if there is no penalty for hitting a bad id then just starting with the highest id would be easiest
12:17 🔗 omf_ I am trying to identify problems with the software we are using right now. I devised a simple test that already found a problem site (ours lol) and I would like others if they have a moment to try it out http://pad.archivingyoursh.it/p/getting_pages
12:51 🔗 godane another site gone: http://www.techdirt.com/articles/20130514/10145123081/critic-chinese-censorship-censored-microblog-with-11-million-followers-deleted.shtml
13:03 🔗 WiK omf_: thats exactly what im using
13:04 🔗 WiK im getting a block, saving the 'last seen' and then getting another block
13:04 🔗 omf_ but you are rate limited to 5000 I believe
13:04 🔗 omf_ # of calls that is. How many results per call?
13:05 🔗 WiK idk, approx 3-4 hundred or so
13:05 🔗 WiK but ive RARELY hit that limit due to my download speeds when i thread 10 downloads at a time
13:06 🔗 omf_ yeah their documentation is light on # of results returned for most calls
13:06 🔗 WiK i just dont store any of the repo data like 'fork': false or anything
13:06 🔗 WiK i only pull out name/full_name
13:07 🔗 WiK i can always use that api to find out if its a fork or not later
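
The block-by-block approach WiK describes (fetch a block, save the "last seen" id, fetch the next block) could look roughly like the sketch below. It assumes the `since` query parameter of `GET https://api.github.com/repositories` and the `id` and `full_name` fields in each returned record; the function names `fetch_block` and `walk_repositories` and the `max_blocks` cap are made up for illustration and are not taken from gitDigger.

```python
import json
import urllib.request

API_URL = "https://api.github.com/repositories"


def fetch_block(since):
    """Fetch one block of public repositories with an id greater than `since`."""
    request = urllib.request.Request(
        f"{API_URL}?since={since}",
        headers={"Accept": "application/vnd.github+json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def walk_repositories(start_id=0, max_blocks=3, fetch=fetch_block):
    """Yield (id, full_name) pairs block by block, tracking the last id seen.

    `fetch` is injectable so the loop can be exercised without network access.
    `max_blocks` is a hypothetical safety cap to stay under the rate limit.
    """
    last_seen = start_id
    for _ in range(max_blocks):
        block = fetch(last_seen)
        if not block:
            break  # no more repositories after last_seen
        for repo in block:
            # Keep only the identifying fields, as in the conversation above.
            yield repo["id"], repo["full_name"]
        last_seen = block[-1]["id"]
```

Calling `list(walk_repositories(start_id=1691303, max_blocks=1))` would request a single block starting after the given id. Each block costs one API call, so the call limit rather than the result count is what the rate limit constrains.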
13:08 🔗 omf_ I was looking at your gitDigger repo. I see the different data lists but where is list of working repos?
13:08 🔗 omf_ or does that come after the talk?
13:10 🔗 omf_ or is everything in here https://raw.github.com/wick2o/gitDigger/master/github_projectnames.txt already tested as working
13:10 🔗 WiK omf_: everything in that list has already been downloaded
13:11 🔗 WiK ive got alot more, just havent updated it for awhile, AND thats just project names
13:11 🔗 WiK i havent published the name/repo_name data
13:12 🔗 omf_ Does the API only return valid repos?
13:14 🔗 WiK yes
13:14 🔗 WiK it also doesnt return private repos
13:14 🔗 omf_ That is fine for my first attempt
13:15 🔗 WiK why not just wait till i hand over my data, and then you will have it
13:15 🔗 WiK all you will have to do is use the api to query if its a fork or not
13:15 🔗 omf_ Because I might be able to get a talk in next month on this idea
13:16 🔗 omf_ Nothing security related like yours
13:16 🔗 omf_ NLP of the documentation
13:17 🔗 omf_ I am only going to do like 5000 repos now
13:17 🔗 omf_ not the whole thing, I don't have that kinda time
13:17 🔗 WiK where are you gonna submit the talk?
13:21 🔗 WiK omf_: ive cloned 636273 projects so far, and have filled 8tb worth of drives
13:21 🔗 omf_ Yeah that is sweet
13:59 🔗 WiK omf_: 1691303 last id i scanned
13:59 🔗 omf_ thanks
13:59 🔗 WiK at least so far :)
14:00 🔗 WiK you writing your downloader yourself or tweaking mine to see if its forked or not?
14:02 🔗 omf_ I haven't decided yet. It is pretty straightforward to tell if a repo is a fork by checking the fork flag and then the forks count
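
One possible reading of the fork check omf_ mentions, as a minimal sketch: keep records whose `fork` flag is explicitly false, and optionally require a minimum `forks_count`. The field names follow the GitHub repository JSON; the helper name `select_originals`, the `min_forks` parameter, and the sample records are assumptions for illustration only.

```python
def select_originals(repos, min_forks=0):
    """Return repository records that are confirmed non-forks.

    A record without a `fork` field is skipped, since its status is unknown
    (for example, a list that stored only name/full_name).
    """
    selected = []
    for repo in repos:
        if repo.get("fork") is not False:
            continue  # a fork, or fork status not recorded
        if repo.get("forks_count", 0) < min_forks:
            continue  # fewer forks than requested
        selected.append(repo)
    return selected


# Made-up sample records, not real repositories.
sample = [
    {"full_name": "alice/tool", "fork": False, "forks_count": 4},
    {"full_name": "bob/tool", "fork": True, "forks_count": 0},
    {"full_name": "carol/lib", "fork": False, "forks_count": 0},
    {"full_name": "dave/unknown"},
]
originals = [repo["full_name"] for repo in select_originals(sample)]
```

With the sample above, `originals` holds `alice/tool` and `carol/lib`; passing `min_forks=1` would narrow it to `alice/tool`.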
14:37 🔗 omf_ I am still testing the NLP parts
14:50 🔗 WiK ah
15:19 🔗 omf_ WiK, No perl in your cat/language/ ?
15:20 🔗 WiK huh?
15:20 🔗 omf_ aaah it is just pl
15:21 🔗 WiK what other extension do you suggest i add?
15:21 🔗 WiK all tho they are all in all_files.txt
15:21 🔗 omf_ for Perl you got the main one which is .pl
15:21 🔗 omf_ but for docs there is .pod
15:22 🔗 omf_ and .pm for Perl library files
15:23 🔗 WiK well you can always grep them outta all_files.txt
15:23 🔗 omf_ How big is your whole repo when I check it out? The web interface says the blob is too big
15:23 🔗 WiK egrep -i '\.pod$' all_files.txt
15:24 🔗 WiK 1.05 GB
15:24 🔗 omf_ oh that is not bad at all
15:24 🔗 omf_ I am going to clone it now
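
The egrep suggestion above covers one extension at a time. A small sketch that pulls all three Perl extensions mentioned (.pl, .pm, .pod) in one pass, assuming all_files.txt holds one file path per line (the actual layout of that file is not confirmed here); `perl_files` and `PERL_EXTENSIONS` are hypothetical names.

```python
import os

# Extensions named in the conversation: scripts, library modules, docs.
PERL_EXTENSIONS = {".pl", ".pm", ".pod"}


def perl_files(lines, extensions=PERL_EXTENSIONS):
    """Yield paths whose final extension matches, ignoring case (like egrep -i)."""
    for line in lines:
        path = line.strip()
        if not path:
            continue  # skip blank lines
        extension = os.path.splitext(path)[1].lower()
        if extension in extensions:
            yield path


# Made-up sample lines standing in for all_files.txt.
sample_lines = [
    "bin/run.pl\n",
    "lib/Foo/Bar.PM\n",
    "docs/manual.pod\n",
    "README\n",
    "old/script.pl.bak\n",
]
matches = list(perl_files(sample_lines))
```

To run it against the real list, something like `with open("all_files.txt") as handle: matches = list(perl_files(handle))` would stream the file line by line instead of loading it whole.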
18:49 🔗 ersi SketchCow: Ping
18:59 🔗 omf_ Twitter puts SketchCow at Red Lobster ersi. Not sure if he has irc on his phone
18:59 🔗 ersi It's not life critical. And IRC doesn't have to be instant.
20:55 🔗 SketchCow What
