[02:24] omf_ firehose or ?
[02:25] spritzer and searches. Very few groups have access to the full firehose
[02:36] omf_ sure, costs $ from gnip i guess. are there any estimates of how complete the coverage is using spritzer?
[02:36] presumably popular urls get covered, but lots don't, right?
[02:37] it also depends on the shortener
[02:37] some we just increment the value and find more urls
[02:38] From my own observations, more and more companies are using a shortener that is just an alias to bitly
[02:38] yeah
[02:38] We are always looking for new ways to discover urls
[02:39] hmm, bit.ly custom domains are cnames, right?
[02:39] no idea
[02:39] just look one up from our list on the wiki
[02:40] it looks like they are actually A records to a bit.ly ip block
[02:40] j.mp -> A 69.58.188.45
[02:40] just saying, that may be a good way to tell that it is indeed bit.ly
[02:41] the fastest is to just replace their domain name with bit.ly and try to load the page
[02:41] it uses the same hash value pool
[02:41] oh, huh
[02:43] yep, simple for us
[02:43] so i know python and webby stuff.
[02:43] is there a todo or issues list somewhere?
[02:44] the always ongoing need to run the warrior on urlteam is the only task I know of. You could check the github repos, let me link you
[02:45] 'if ps.hostname in bitly_pro_hosts or ps.hostname in ["bit.ly", "j.mp", "bitly.com"]:'
[02:45] heh
[02:46] https://github.com/ArchiveTeam/urlteam-stuff https://github.com/chronomex/urlteam https://github.com/ArchiveTeam/tinyback
[02:47] cool, thanks for the pointers
[02:48] There might be more code repos but those are the only ones I know of
[02:49] librarians would hate you guys :)
[02:50] beautiful mess.
[02:50] (meant in the best way possible)
[02:51] librarians are always up our ass
[02:51] Fuck them, they never contribute shit
[02:52] "we cannot use this because it does not have metadata." The common response: metadata comes later. If they do not "get" that, then I usually follow up with: your bitching does not help this process
[02:52] bunch of sheltered fuckers
[02:53] as i said earlier, i'm glad you exist.
[02:53] I am glad to have found this group so I could help
[02:53] i think some day librarians may realize that access to knowledge has little to do with books on shelves.
[02:54] until then, a pirates life for me
[02:54] yeah, in a decade or two
[02:54] all I ever hear out of libraries is how their budget got cut or how they are lucky to have kept access to some shitty web database
[02:55] It is disgusting how behind the tech curve the whole field is. I, as a non-academic, can release papers all I want. I can include the code and data so people can reproduce my results
[02:56] and there is nothing they can do about it. The closed circle of academic papers is being beaten down. I love this because I was told for years I could never break in since I am a slacker
[02:57] now they are the old fossils
[02:57] take urlteam for example
[02:57] you could download the full url set and start a search engine with that
[02:57] or study what is popular
[02:58] bit.ly studies it and i think does publish :P
[02:58] or anything you can think of, for free. That is the gift these kinds of save-the-world projects hand out
[02:58] but for sure, i think the inability to integrate tech into society as quickly as tech is changing.. it's a critical problem for society.
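A minimal sketch in Python of the domain-swap check described at [02:41], assuming the `requests` library. The host names and short code used here are hypothetical examples, not values taken from the log:

```python
import requests

def is_bitly_alias(custom_host, code):
    """Heuristic from the chat above: request the same short code on the
    custom domain and on bit.ly itself. If both redirect to the same long
    URL, the custom domain very likely shares bit.ly's hash value pool."""
    try:
        custom = requests.head(f"http://{custom_host}/{code}",
                               allow_redirects=False, timeout=10)
        bitly = requests.head(f"http://bit.ly/{code}",
                              allow_redirects=False, timeout=10)
    except requests.RequestException:
        return False
    return (custom.status_code in (301, 302)
            and custom.headers.get("Location") is not None
            and custom.headers.get("Location") == bitly.headers.get("Location"))

# Hypothetical usage; "abc123" is a made-up short code.
print(is_bitly_alias("j.mp", "abc123"))
```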
[02:59] How come people are so quick to use the newest smartphone but then go back to fucking crap spreadsheets for data
[02:59] how does tracker deployment work? who has the keys?
[02:59] driver vs. mechanic.
[02:59] they don't see that it's a tool for their use.
[02:59] yeah
[03:00] for this I am guessing swebb, soultcer and maybe alard. Alard sets up the trackers for the other projects
[03:01] I think it is more the human-nature problem of people resisting change and not wanting to learn new things
[03:05] Every time I use newish tech to save a company money, they always ask themselves why they didn't do that before.
[03:05] Like taking a paperwork process that took 3 days and automating it into a report that takes 90 seconds to generate
[09:35] For the record: soultcer runs tracker.tinyarchive.org and has done most/all of the tinyarchive (tracker) and tinyback (client) repositories
[09:37] And no, none of the work in the URLTeam tinyarchive/tinyback combo uses the twitter data - it's all generated short codes, divided into chunks and distributed to the workers to look up.
[09:37] swebb is however slurping twitter data and unrolling shortener links - that data gets shared and aggregated into the tinyarchive data dumps released every year (or at least into this latest dump, if I'm not mistaken)
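A rough illustration of the generate-and-chunk scheme mentioned at [09:37]. The alphabet, code length, and chunk size below are assumptions for the sake of the sketch, not tinyback's actual parameters:

```python
import itertools
import string

# Assumed base-62 alphabet; many shorteners use this, but the real
# per-shortener alphabets live in the tinyback service definitions.
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase

def generate_codes(length):
    """Yield every possible short code of a given length, in order."""
    for combo in itertools.product(ALPHABET, repeat=length):
        yield "".join(combo)

def chunks(codes, size):
    """Split the code stream into fixed-size work units for the workers."""
    it = iter(codes)
    while True:
        batch = list(itertools.islice(it, size))
        if not batch:
            return
        yield batch

# Each worker would receive one chunk, look up every code against the
# shortener, and report back which codes resolve to a long URL.
for i, unit in enumerate(chunks(generate_codes(2), 1000)):
    print(f"chunk {i}: {unit[0]} .. {unit[-1]} ({len(unit)} codes)")
```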