[00:45] *** JesseW has joined #archiveteam-bs
[01:52] *** toad2 has joined #archiveteam-bs
[01:54] *** no2pencil has quit IRC (Read error: Operation timed out)
[01:54] *** toad1 has quit IRC (Read error: Operation timed out)
[02:20] ivan`: I'm going to dump a bunch of Youtube channels of folk music into your form: https://docs.google.com/forms/d/1_kkpBe6abFQ5sznrMfWHhP7ZhdktKejJEpvCCcqVues/viewform -- lemme know if you'd like them as a single email instead (probably about a dozen channels or so).
[02:23] ivan`, are those being mirrored to IA? Is a list of items you've archived available, so that I could mirror them to IA off youtube if I wanted?
[02:24] kyan: I know they aren't being mirrored to IA, because one of the points was to avoid burdening IA's servers with stuff they can't/don't want to hold.
[02:25] Ah, hmm
[02:25] I didn't know they didn't want to hold things
[02:25] I don't know if a public list is available, but I'd be surprised if ivan` would mind privately sending you a list of what he has.
[02:25] thought they were more like, expanding as use expanded, or something
[02:26] Also, some of my most viewed uploads have been youtube videos I've mirrored to IA
[02:26] so I think that it'd at least generate more traffic for IA to have discoverable content?
[02:26] Then again, they don't have ads so I guess traffic ≠ money
[02:27] I don't know that IA minds -- more that I remember seeing ivan` mention he was specifically intending to provide a home for stuff unable to be mirrored at IA.
[02:28] Huh ok
[02:28] I'm not sure how "unable" anything could be, but whatever
[02:28] I mean, if the issue is with copyright, I guess
[02:34] *** schbirid2 has joined #archiveteam-bs
[02:35] I have no idea about why.
[02:36] ah :P
[02:37] *** schbirid has quit IRC (Read error: Operation timed out)
[02:39] Also a question -- how to sort search results on IA by size?
[02:39] I'll be curious what Ivan has to say.
[02:39] size of what, individual files, whole items, something else?
[02:40] Whole items
[02:40] I don't *think* that's available through the Advanced Search.
[02:40] Probably extracting it from the census data is your best bet.
[02:40] I don't see anything that looks promising there
[02:40] Ah, ok. Thanks.
[02:40] * kyan can't be bothered atm
[02:40] Heh
[02:40] What were you interested in looking for?
[02:40] Also the census wouldn't fit on my drive lol
[02:41] I've got a bunch of WARCs uploaded
[02:41] the ones from one account have lots of views (10000+ per item, generally)
[02:41] while the ones from the other account have like 10–50 views
[02:41] I'd like to see if there's something wrong with the ones from the other account
[02:42] but a lot of the items are small and only have a few URLs in them, making it understandable that they'd have few views.
[02:42] By sorting by item size, I could see which ones have tens of thousands of URLs and try to see if there's something about them that's making them have few views.
[02:43] Namely these: https://archive.org/search.php?query=uploader%3A%22worldpeacehaven%40gmail.com%22+mediatype%3Aweb&sort=-downloads&page=2
[02:43] 46 views for the most viewed item, and going down from there
[02:43] Ah, if you are only interested in a limited number of identifiers, I'd just hack up curl to download http://archive.org/metadata/{id} for each one, then sort them locally.
[02:43] I thought you wanted to sort the whole corpus
[02:44] s/hack up curl/hack up a shell script *using* curl/
[02:44] Compare to my other account https://archive.org/search.php?query=uploader%3A%22kolubat%40gmail.com%22+mediatype%3Aweb&sort=-downloads
[02:44] most views is 143K
[02:44] makes me think something might be wrong.
[02:44] * JesseW needs to get around to uploading my census results
[02:45] JesseW, cool, that sounds promising! Thanks! :D
[02:45] (and various shell commands)
[02:45] but I need to figure out what exactly the next step is, too.
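The suggestion above (download http://archive.org/metadata/{id} for each identifier, then sort locally) could look roughly like the following. This is a minimal sketch in Python rather than the curl shell script mentioned in the log. It assumes the metadata JSON carries a top-level "item_size" field in bytes; the function names are made up for illustration, and any identifiers passed in are the caller's own.

```python
import json
import urllib.request

METADATA_URL = "http://archive.org/metadata/{id}"  # URL pattern quoted in the log


def fetch_metadata(identifier):
    """Download and decode the metadata JSON for one item (needs network)."""
    with urllib.request.urlopen(METADATA_URL.format(id=identifier)) as resp:
        return json.load(resp)


def sort_by_size(records):
    """Return (identifier, size) pairs, largest first.

    `records` maps identifier -> decoded metadata dict. Assumption: the
    size lives in a top-level "item_size" key; items without it count as 0.
    """
    sizes = {ident: int(meta.get("item_size", 0)) for ident, meta in records.items()}
    return sorted(sizes.items(), key=lambda pair: pair[1], reverse=True)


def main(identifiers):
    """Fetch every identifier, then print sizes in descending order."""
    records = {ident: fetch_metadata(ident) for ident in identifiers}
    for ident, size in sort_by_size(records):
        print(size, ident)
```

Calling main() with a list of identifiers taken from the search results would print the largest items first, which is enough to pick out the WARCs holding tens of thousands of URLs.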
[02:57] *** JetBalsa has quit IRC (hub.efnet.us irc.colosolutions.net)
[02:57] *** SadDM has quit IRC (hub.efnet.us irc.colosolutions.net)
[02:57] *** jspiros has quit IRC (hub.efnet.us irc.colosolutions.net)
[02:57] *** matthusby has quit IRC (hub.efnet.us irc.colosolutions.net)
[03:00] *** JesseW has quit IRC (Quit: Leaving.)
[03:59] *** SN4T14 has quit IRC (Read error: Operation timed out)
[03:59] *** SN4T14 has joined #archiveteam-bs
[03:59] *** MrRadar has quit IRC (Read error: Operation timed out)
[03:59] *** arkiver has quit IRC (Ping timeout: 360 seconds)
[04:00] *** signius has quit IRC (Read error: Operation timed out)
[04:00] *** joepie91 has quit IRC (Read error: Operation timed out)
[04:01] *** phuzion has quit IRC (Read error: Operation timed out)
[04:01] *** phuzion has joined #archiveteam-bs
[04:01] *** zenguy has quit IRC (Ping timeout: 360 seconds)
[04:01] *** dashcloud has quit IRC (Read error: Operation timed out)
[04:01] *** atlogbot has quit IRC (Ping timeout: 360 seconds)
[04:02] *** arkiver has joined #archiveteam-bs
[04:02] *** joepie91 has joined #archiveteam-bs
[04:02] *** signius has joined #archiveteam-bs
[04:03] *** atlogbot has joined #archiveteam-bs
[04:04] *** zenguy has joined #archiveteam-bs
[04:04] *** dashcloud has joined #archiveteam-bs
[04:04] *** phuzion has quit IRC (Read error: Operation timed out)
[04:04] *** beardicus has quit IRC (Read error: Operation timed out)
[04:06] *** phuzion has joined #archiveteam-bs
[04:09] *** beardicus has joined #archiveteam-bs
[04:14] *** kvieta has quit IRC (Ping timeout: 633 seconds)
[04:14] *** kvieta has joined #archiveteam-bs
[04:18] *** RedType has quit IRC (Remote host closed the connection)
[04:21] *** MrRadar has joined #archiveteam-bs
[04:22] *** beardicus has quit IRC (Read error: Operation timed out)
[04:26] *** kvieta has quit IRC (Read error: Operation timed out)
[04:27] *** SimpBrain has quit IRC (Ping timeout: 633 seconds)
[04:36] *** SimpBrain has joined #archiveteam-bs
[04:44] *** JesseW has joined #archiveteam-bs
[04:46] *** toad2 has quit IRC (Ping timeout: 864 seconds)
[04:47] *** kvieta has joined #archiveteam-bs
[04:47] *** toad1 has joined #archiveteam-bs
[04:47] *** beardicus has joined #archiveteam-bs
[04:50] *** Swizzle has joined #archiveteam-bs
[04:54] *** zerkalo has joined #archiveteam-bs
[04:54] *** lbft_ has joined #archiveteam-bs
[04:57] *** zerkalo_ has quit IRC (hub.efnet.us irc.Prison.NET)
[04:57] *** chfoo has quit IRC (hub.efnet.us irc.Prison.NET)
[04:57] *** achip has quit IRC (hub.efnet.us irc.Prison.NET)
[04:57] *** lbft has quit IRC (hub.efnet.us irc.Prison.NET)
[04:59] *** chfoo0 has joined #archiveteam-bs
[05:04] *** achip has joined #archiveteam-bs
[05:05] *** pikhq_ has quit IRC (hub.dk irc.homelien.no)
[05:05] *** PurpleSym has quit IRC (hub.dk irc.homelien.no)
[05:05] *** PotcFdk has quit IRC (hub.dk irc.homelien.no)
[05:05] *** coretx has quit IRC (hub.dk irc.homelien.no)
[05:05] *** altlabel has quit IRC (hub.dk irc.homelien.no)
[05:05] *** limebyte has quit IRC (hub.dk irc.homelien.no)
[05:05] *** i0npulse has quit IRC (hub.dk irc.homelien.no)
[05:06] *** Rotab has quit IRC (hub.se irc.du.se)
[05:10] *** coretx_ has joined #archiveteam-bs
[05:11] *** vitzli has joined #archiveteam-bs
[05:16] DFJustin: does wayback not support ftp at all, or can you construct ftp warcs and access them somehow
[05:35] *** SmileyG has quit IRC (Read error: Connection reset by peer)
[05:35] *** Smiley has joined #archiveteam-bs
[05:35] *** will has quit IRC (Ping timeout: 252 seconds)
[05:35] *** Rye has quit IRC (Ping timeout: 252 seconds)
[05:38] *** will has joined #archiveteam-bs
[05:40] *** useretail has quit IRC (Ping timeout: 252 seconds)
[05:43] *** will has quit IRC (Ping timeout: 252 seconds)
[05:45] *** will has joined #archiveteam-bs
[05:45] *** Rye has joined #archiveteam-bs
[05:45] *** Sk1d has quit IRC (Ping timeout: 250 seconds)
[05:47] Regarding ftp.esri.com, there are 11 wayback machine records from 2013, all of which returned 502 status codes.
[05:47] (there are *only* those 11 records)
[05:48] *** useretail has joined #archiveteam-bs
[05:52] *** Sk1d has joined #archiveteam-bs
[05:53] *** Swizzle has quit IRC (Quit: Leaving)
[06:14] ok...
[06:15] i get my nl vps suspended due to high loads, they send out a message saying they are going to do some work on the server. they send out another update saying they will put that server on a new server
[06:15] win win for them i think
[06:37] *** pikhq has joined #archiveteam-bs
[06:37] *** i0npulse has joined #archiveteam-bs
[06:37] *** altlabel has joined #archiveteam-bs
[06:37] *** PurpleSym has joined #archiveteam-bs
[06:37] *** PotcFdk has joined #archiveteam-bs
[06:37] *** limebyte has joined #archiveteam-bs
[06:38] JesseW, on IA census, I did IA.BAK census and found something, I don't know if it is only my 'ia search'/my mining script bug or it is common to all - a) both ia-mine and "ia search" return duplicate items b) they miss some items (about 10 or 15 on 600 item collection). I found this when I was doing "parallel --jobs 1" requests, and it happened to cli calls of ia too (week ago).
[06:39] Right now ia-mine --search --itemlist seems to behave better - no item drops, but ia search returned one duplicate record
[06:40] vitzli: https://archive.org/download/ia-bak-census_20150304/metamgr-norm-ids-20150304205357.txt.gz has a single duplicate (see http://archiveteam.org/index.php?title=Internet_Archive_Census#Contents_of_the_Census )
[06:40] The other census files do seem to have a bunch of duplication -- I'm not sure why.
[06:40] I found that IA search was ... unreliable.
[06:41] it dropped one item and returned one duplicate, to be precise
[06:41] For getting a definitive census of items in larger collections.
[06:41] BUT - doing search multiple times and then sort|uniq it - worked and returned all elements in the collection
[06:41] Feel free to drop a note to jake about it -- I can certainly confirm I've seen the same issue.
[06:42] How many searches did you need to do?
[06:43] I tried to get all the items with addeddates in a particular year, and gave up when I couldn't get consistent results from the search. I should hack up something to retry and combine results until the total matches the provided number (because searches do generate a total number even before any individual results are requested).
[06:44] maybe 3 on 162 item collection, 3 or 4 on bigger collections (I think it was walnutcreekcdrom collection)
[06:44] hm
[06:46] got 162 items on the first 'ia search' run, but maybe 5 were duplicates, and did it again
[06:46] xmc: as far as I know wayback doesn't support it at all. it is possible to construct ftp warcs but I don't know what tools are able to use them
[06:46] *** chfoo0 is now known as chfoo
[07:31] JesseW, right now: text file from ia search --itemlist 'collection:(prelingeritems)' :
[07:31] sort prelingeritems.txt | wc -l : 6533
[07:31] sort -u prelingeritems.txt | wc -l: 4895
[07:31] ha
[07:31] yeah, that's ... less than ideal
[07:33] uh, just prelinger collection, not prelingeritems
[07:42] vitzli: I got all 6533 distinct values the first time I made the search
[07:43] 'ia search'?
[07:45] JesseW, https://paste.ee/p/vkBQU
[07:46] I was using the python interface.
[07:50] *** robink has quit IRC (Ping timeout: 190 seconds)
[07:50] a is a list of identifiers in collection, len(a): 6533; len(set(a)): 5071
[07:51] is my install somehow broken?
[07:51] *** robink has joined #archiveteam-bs
[07:52] I'm not sure. I have to head to sleep now. Good luck.
[07:52] good night
[07:52] *** JesseW has quit IRC (Quit: Leaving.)
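The "retry and combine results until the total matches" idea discussed above can be sketched as follows. This is only an illustration of the technique, not anyone's actual script: `search` is a hypothetical zero-argument callable that wraps whatever search client is in use and returns identifiers, and `expected_total` is the count the search reports up front.

```python
def collect_until_complete(search, expected_total, max_attempts=10):
    """Union the results of repeated searches until the expected count is reached.

    `search` is any zero-argument callable returning an iterable of
    identifiers (hypothetical; wrap whatever search client you use).
    Returns (sorted identifiers, attempts used). Raises RuntimeError if
    the set is still incomplete after `max_attempts` runs.
    """
    seen = set()
    for attempt in range(1, max_attempts + 1):
        seen.update(search())  # the set drops duplicates, like sort -u
        if len(seen) >= expected_total:
            return sorted(seen), attempt
    raise RuntimeError(
        f"only {len(seen)} of {expected_total} identifiers after {max_attempts} attempts"
    )
```

Using a set means duplicates within or across runs never inflate the count, so the loop stops only when enough distinct identifiers have been seen. The attempt cap keeps a search that never returns the missing items from looping forever.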
[07:53] JesseW, on python/IA search results: https://paste.ee/p/iXqQo
[08:23] *** kyan has quit IRC (Ping timeout: 260 seconds)
[08:47] *** RedType has joined #archiveteam-bs
[09:58] Hmm. Best way of getting a csv of a collection, listing the file name and the date it was uploaded, as well as the view
[10:00] *** Rotab has joined #archiveteam-bs
[10:41] *** lytv has quit IRC (Ping timeout: 250 seconds)
[10:41] *** vtyl has joined #archiveteam-bs
[11:00] *** achip has quit IRC (hub.efnet.us irc.Prison.NET)
[11:07] *** signius has quit IRC (Read error: Operation timed out)
[11:17] *** achip has joined #archiveteam-bs
[11:20] *** signius has joined #archiveteam-bs
[13:01] *** arkiver3 has joined #archiveteam-bs
[13:28] SketchCow: https://www.youtube.com/watch?v=cPaij2G3wTQ
[13:32] *** arkiver3 has quit IRC (Ping timeout: 252 seconds)
[14:13] *** arkiver3 has joined #archiveteam-bs
[14:24] *** arkiver3 has quit IRC (Ping timeout: 252 seconds)
[14:29] *** arkiver3 has joined #archiveteam-bs
[14:56] Yeah, I've seen it.
[15:05] what an annoying voice
[15:05] but yes yes and yes for everything in it
[15:11] *** Start has quit IRC (Quit: Disconnected.)
[15:17] ersi: haha, exactly my thoughts
[15:17] watched a few eps so far
[15:17] "jesus that voice is annoying, but he is so damn right about every single thing he says"
[15:17] - every ep
[15:31] I wouldn't watch more than that single episode
[15:36] *** wednesday has quit IRC (Ping timeout: 252 seconds)
[15:37] *** wednesday has joined #archiveteam-bs
[15:49] *** Start has joined #archiveteam-bs
[15:50] *** wednesday has quit IRC (Ping timeout: 252 seconds)
[15:53] *** arkiver3 has quit IRC (Quit: Nettalk6 - www.ntalk.de)
[16:48] https://chrome.google.com/webstore/detail/cookiestxt/njabckikapfpffapmjgojcnbfjonfjfg?hl=en cookies.txt
[16:51] schbirid2: handy
[16:51] btw if you wget -x the same url you spent the night downloading it will start from 0 again \o/
[17:07] *** Start has quit IRC (Quit: Disconnected.)
[17:19] *** Start has joined #archiveteam-bs
[17:24] *** Start has quit IRC (Quit: Disconnected.)
[17:51] *** vitzli has quit IRC (Leaving)
[17:52] *** dashcloud has quit IRC (Read error: Operation timed out)
[17:55] *** dashcloud has joined #archiveteam-bs
[18:20] *** Swizzle has joined #archiveteam-bs
[18:43] *** Start has joined #archiveteam-bs
[19:00] *** signius has quit IRC (Ping timeout: 300 seconds)
[19:08] *** espes__ has quit IRC (Ping timeout: 252 seconds)
[19:12] *** signius has joined #archiveteam-bs
[19:14] *** Start has quit IRC (Quit: Disconnected.)
[19:20] *** Start has joined #archiveteam-bs
[19:24] *** acridAxid has quit IRC (Quit: marauder)
[19:29] *** kyan has joined #archiveteam-bs
[19:29] my grab-site server is out of disk space :(
[19:29] downloads faster than it can upload
[19:33] Also TIL don't leave a grab-site --1 of a page that mentions Pinterest running over night unattended if you turned off the dupechecker. 56.5GB downloaded, 193k responses, almost all Pinterest 404s
[19:33] ffs
[19:33] *** acridAxid has joined #archiveteam-bs
[19:34] SketchCow: i'm grabbing more of Network World from google books
[19:34] cause in part of what you said about googlebooks twitter account going private
[19:42] http://trumpdonald.org/
[19:47] kyan: https://gist.github.com/ivan/5779ac8d43817092aca6
[19:47] *** Swizzle has quit IRC (Quit: Leaving)
[19:48] verify the df line before deploying
[19:48] ivan`: Ooh, cool, thanks! :D
[19:49] kyan: not mirroring my 2M YouTube videos to IA. The plan is to scan my collection for deleted/private/unlisted videos and upload those. Just need to write software to check all the IDs and upload to IA.
[19:50] lol joepie91
[19:50] ivan`, Aah, cool, that's a good solution
[19:51] Might make sense to add that gist to the readme for grab-site too, that's handy
[19:51] yeah
[19:53] does anyone have some existing infrastructure to hit a site through many proxies?
[19:53] http://crawlera.com/ besides this commercial offering that I don't want to pay for
[19:54] *** Silvan has quit IRC (Read error: Operation timed out)
[19:54] (Well, the warrior kind of does that)
[19:55] Wow, $25 for 150k requests per month. That's pretty expensive
[20:01] lol
[20:01] kyan: you want to see expensive?
[20:01] kyan: https://luminati.io/
[20:02] HAHAHAHAhaha ha .... ha?
[20:02] do they get any customers?
[20:03] kyan: yeah.
[20:03] quite a few
[20:04] kyan: it's used by companies scraping prices and shit
[20:04] from competitors
[20:04] their peers are Hola users
[20:04] so, almost all residential
[20:04] Hm, interesting
[20:04] I'm not sure how well it would work against sophisticated crawler prevention
[20:05] e.g. if they're scraping sequential IDs, that could be tracked between IP addresses
[20:05] captchas could be required on suspicious requests
[20:05] and also Bing would have search results as good as Google if it worked
[20:17] *** SilSte has joined #archiveteam-bs
[20:22] https://github.com/ludios/grab-site#automatically-pausing-grab-site-processes-when-free-disk-is-low
[20:26] I have a 3.5TB grab-site of http://digitalcollections.nypl.org/ going
[20:26] and 2.5TB of http://downloads.dell.com/
[20:28] nice old driver/software downloads is good to have
[20:28] only a matter of time before they remove old downloads
[20:36] joepie91: wtf on luminati
[20:37] oh it's Hola
[20:38] *** kyan has quit IRC (Quit: This computer has gone to sleep)
[20:38] I thought they were actively infecting computers or some shit
[20:41] *** JW_work has joined #archiveteam-bs
[20:42] *** kyan has joined #archiveteam-bs
[20:43] ivan`: Here is a list of 127 youtube channels of contra dance music (with various other random crap mixed in), if you'd like to archive them: https://0bin.net/paste/0MMv2M-eh1hSTydI#SbPWz+5z+HxWt4YurJDQQUEjI7iWskKNDbNLGOBF0ik
[20:44] *** Start has quit IRC (Quit: Disconnected.)
[20:44] It's in OPML XML format -- I'm glad to work on transforming it into an easier to use format if that'd be helpful.
[20:45] (the random crap is because various of the channels are their owner's personal channels, so they also uploaded various home video-type stuff -- all the channels should have at least some contra dance music, and there aren't any channels focused on other topics, IIRC)
[20:48] I can transform XML with my mad sublime text skills, don't worry about that part
[20:48] that sure is a lot of channels
[20:49] yep, I've been collecting them for a while
[20:50] I've been (very slowly) working on indexing the contra dance videos on them on to MusicBrainz -- when I heard about your archiving effort, I thought I'd send it over.
[20:50] I can also give you a smaller list of higher value ones, if you'd like.
[20:51] currently listening to https://www.youtube.com/watch?v=pthkg4f2HAo
[20:51] the more I use curl, the more I am disgusted by HTTP libraries
[20:52] JW_work: I will add all of them if you think they're all worth archiving
[20:52] I think they are all worth archiving.
[20:52] my script will work through them over about a month
[20:53] Great! None of them are particularly in danger of vanishing right now, so a month should work fine.
[20:53] * ivan` goes to write a program to turn channels into usernames
[20:54] Yeah, the XML is just the output of https://www.youtube.com/subscription_manager?action_takeout=1
[20:55] so if you write something to handle it, it will likely be generally useful
[21:21] JW_work: OK, all of your subscriptions and spreadsheet submissions are queued
[21:22] beware my youtube archiver is a stochastic process
[21:22] and something like 1% of videos fail to download without manual intervention which I almost never bother with
[21:22] youtube is great. announces formats that it fails to serve.
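For the OPML export discussed above, a first step could be pulling the channel IDs out of the XML. The sketch below is a guess at how that might be done, not the program ivan` wrote. It assumes each subscription is an <outline> element whose xmlUrl attribute is a feed URL carrying a channel_id query parameter; the log does not show the file, so that layout is an assumption. It stops at channel IDs and does not attempt the channel-to-username lookup mentioned in the log.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse, parse_qs


def channel_ids_from_opml(opml_text):
    """Extract channel IDs from OPML subscription data.

    Assumption: each subscription is an <outline> element whose xmlUrl
    attribute is a feed URL with a channel_id query parameter.
    Outlines without such a URL (e.g. a grouping outline) are skipped.
    """
    root = ET.fromstring(opml_text)
    ids = []
    for outline in root.iter("outline"):
        query = parse_qs(urlparse(outline.get("xmlUrl", "")).query)
        ids.extend(query.get("channel_id", []))
    return ids
```

The function returns the IDs in document order, so the result can be written out one per line and fed to whatever queueing script is in use.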
[21:28] that shouldn't be a problem -- the ones I *have* indexed on MusicBrainz should have already been grabbed by the musicbrainz external links warrior project recently, and I'll likely grab my high-value targets myself too; but it's very good to have another copy elsewhere, so thank you!
[21:29] np
[21:44] yipdw: they are
[21:44] ivan`: Jeez, that's some huge fucking grabs!
[21:44] yipdw: with hola
[21:44] lol
[21:44] yipdw: http://adios-hola.org/
[21:45] yipdw: What in particular are you disgusted about with curl?
[21:51] *** kyan_ has joined #archiveteam-bs
[21:53] *** kyan has quit IRC (Ping timeout: 258 seconds)
[21:53] *** kyan_ is now known as kyan
[22:04] don't have time to look into it right now but these ftp://ftp.us.dell.com/video/ just got posted to /r/opendirectories. Lots of drivers
[22:04] ooooh
[22:05] will check that in for the ftp project
[22:05] might be stuff in the parent dir too
[22:06] yep, will get that too
[22:16] http://www.bloomberg.com/features/2016-solar-power-buffett-vs-musk/img/buffett_vs_musk.gif
[22:16] hehehe
[22:27] *** kyan has quit IRC (This computer has gone to sleep)
[22:28] is there any chance of getting major to autovoice me in #archivebot ?
[22:46] Currently watching my warrior archive the Friends Reunited page for my old school is so satisfying
[22:53] wow
[22:53] when you think you've seen everything
[22:53] http://www.nieuwsbladtransport.nl/Nieuws/Article/tabid/85/ArticleID/40874/ArticleName/Samskipgaatreorganiseren/Default.aspx
[22:53] cc arkiver
[22:54] "Dear reader, After one year, the pictures in our articles are removed from the site. The texts themselves, however, will remain unchanged."
[22:54] ....
[22:54] wow
[22:54] so yeah, one for your newsbot
[22:54] lol
[22:54] amazing, though
[22:54] never seen this before, boggles the mind
[22:54] joepie91, I'll add it soon. Not atm though
[22:55] I'm not popular atm with the datacenter
[22:55] HCross: haha, how come
[22:55] Bandwidth, ALL OF THE BANDWIDTH
[22:55] lol
[22:55] HCross: oh, they paywall too
[22:55] feck
[22:55] might need to make sure you're grabbing it without cookies
[22:56] *** vtyl has quit IRC (hub.efnet.us irc.servercentral.net)
[22:56] *** RedType has quit IRC (hub.efnet.us irc.servercentral.net)
[22:56] *** SimpBrain has quit IRC (hub.efnet.us irc.servercentral.net)
[22:56] *** phuzion has quit IRC (hub.efnet.us irc.servercentral.net)
[22:56] *** atlogbot has quit IRC (hub.efnet.us irc.servercentral.net)
[22:56] *** schbirid2 has quit IRC (hub.efnet.us irc.servercentral.net)
[22:56] *** Infreq has quit IRC (hub.efnet.us irc.servercentral.net)
[22:56] *** JW_work has quit IRC (hub.efnet.us irc.servercentral.net)
[22:56] *** mistym has quit IRC (hub.efnet.us irc.servercentral.net)
[22:56] *** dxrt has quit IRC (hub.efnet.us irc.servercentral.net)
[22:56] *** swebb has quit IRC (hub.efnet.us irc.servercentral.net)
[22:56] *** slyphic has quit IRC (hub.efnet.us irc.servercentral.net)
[22:56] *** chazchaz has quit IRC (hub.efnet.us irc.servercentral.net)
[22:56] yeah, this site is a bit special
[22:56] lol
[22:57] *** RedType_ has joined #archiveteam-bs
[22:57] *** phuzion_ has joined #archiveteam-bs
[22:59] *** mistym- has joined #archiveteam-bs
[23:00] *** lytv has joined #archiveteam-bs
[23:00] *** dxrt_ has joined #archiveteam-bs
[23:01] joepie91: I guess they.. only license the images for one year?
[23:01] *** Infreq_ has joined #archiveteam-bs
[23:01] Incredibly stupid though
[23:01] *** SimpBrai1 has joined #archiveteam-bs
[23:01] very much so
[23:01] lol
[23:01] ersi: also, image licensing for news in NL is not usually time-limited...
[23:02] *** schbirid has joined #archiveteam-bs
[23:02] they must've gotten the short end of the stick with their licensing agency :P
[23:02] or they just wanted cheaper pics
[23:06] *** swebb has joined #archiveteam-bs
[23:07] *** JW_work2 has joined #archiveteam-bs
[23:08] *** chazchaz has joined #archiveteam-bs
[23:09] *** slyphic has joined #archiveteam-bs
[23:10] *** Start has joined #archiveteam-bs
[23:11] from argparse import ArgumentParser
[23:11] ImportError: No module named argparse
[23:11] arkiver, ^^
[23:12] nvm
[23:22] or they don't like paying for storage
[23:45] k, my warrior stats are miles off
[23:45] I'm on 30Mbit
[23:45] it's telling me 280MB/s
[23:45] D:
[23:45] Oh wait reading the total XD
[23:46] I'm waiting to get Debian installed on this server then I will have a clue what I am doing
[23:50] *** dxrt_ is now known as dxrt