[00:20] why can you not search Google Apps for the App Passwords page
[00:20] come on Google, index yourself
[00:21] * yipdw can never friggin find this damn thing
[00:22] who indexes the indexer
[00:23] interestingly, you can search in the Google Apps Admin site for settings
[00:23] I wonder if nobody at Google uses the App Passwords page enough for it to matter
[00:23] because why in the world would you ever use an app that wasn't web-based
[00:34] *** godane has joined #archiveteam-bs
[01:00] *** powerKit2 has joined #archiveteam-bs
[01:00] https://catalogd.archive.org/log/599682290 ...did this task break?
[01:00] should be still running
[01:02] -shrug- it just seemed to be taking longer than it should
[01:03] when they break, the row in /history/ turns red and there's an error message in the log
[01:03] unless something is deeply wrong
[01:04] if the derive for a 20 minute video doesn't complete in six hours, then that's cause for worry
[01:04] but an hour? ehhh
[01:05] I think the longest video in the item is 3 hours and 20 minutes
[01:05] *** nickname_ has quit IRC (Read error: Operation timed out)
[01:06] oh, well, then that's... yeah
[01:06] take two aspirin and call me in the morning
[01:08] I'm guessing this is why people don't typically upload 39 gigabytes of video onto the Internet Archive.
[01:09] I just figured it'd be kinda mean to dump 121+ individual items into community video.
[01:20] Anyway, I've been meaning to start recording my videos in FFV1 from now on. Can the archive derive from video encoded that way?
[01:23] well. an item should be a work that stands on its own
[01:23] not three works, not half a work
[01:23] how you define this ... hard to say
[01:26] *** Yoshimura has quit IRC (Remote host closed the connection)
[01:28] Honestly, I just didn't want to go through 121 random videos with non-descriptive names and figure out what each one was.
[01:28] fair
[01:30] Anyway, before I start recording my future videos in FFV1, can the archive actually derive from them?
[01:30] what is ffv1
[01:30] https://en.wikipedia.org/wiki/FFV1
[01:30] i suggest you make a short test video and upload it into a test item and see what happens
[01:30] test items get deleted after a month
[01:31] I think it'd work; it looks like derive.php uses libavcodec, which includes FFV1.
[01:33] Yeah, I'll just make a test video later and see.
[02:12] *** powerKit2 has quit IRC (Quit: Page closed)
[02:46] *** zenguy has quit IRC (Ping timeout: 370 seconds)
[02:57] *** Yoshimura has joined #archiveteam-bs
[03:03] *** zenguy has joined #archiveteam-bs
[03:12] *** n00b184 has joined #archiveteam-bs
[04:05] *** Ravenloft has quit IRC (Read error: Connection reset by peer)
[04:55] *** krazedkat has quit IRC (Leaving)
[05:06] *** Sk1d has quit IRC (Ping timeout: 250 seconds)
[05:07] i'm at 995k items now
[05:07] less than 5k items away from 1 million items
[05:08] also NASA docs are almost done
[05:13] *** Sk1d has joined #archiveteam-bs
[05:26] *** mst__ has joined #archiveteam-bs
[05:35] *** mst__ has quit IRC (Quit: bye)
[06:49] *** Asparagir has joined #archiveteam-bs
[07:08] National Library of Australia's PANDORA internet archive... I had no idea this thing existed - http://pandora.nla.gov.au/
[07:08] 485,506,170 files and 25.66 TB
[07:26] *** turnkit has joined #archiveteam-bs
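Following up the FFV1 derive question above, here is a minimal sketch of the "short test video in a test item" suggestion. It assumes ffmpeg and the internetarchive Python package are installed and that ia configure has been run; the item identifier ffv1-derive-test-2016, the output filename, and the use of test_collection (whose items are cleaned up automatically) are illustrative assumptions, not anything confirmed in the log.

```python
import subprocess

from internetarchive import upload

# Generate a short FFV1-encoded clip from ffmpeg's built-in test pattern.
subprocess.run(
    ["ffmpeg", "-f", "lavfi", "-i", "testsrc=duration=10:size=640x480:rate=30",
     "-c:v", "ffv1", "ffv1-test.mkv"],
    check=True,
)

# Upload it into a throwaway test item, then watch the item's /history/ page:
# if the derive finishes and produces the usual derivatives, FFV1 input should be fine.
upload(
    "ffv1-derive-test-2016",              # hypothetical identifier
    files=["ffv1-test.mkv"],
    metadata={
        "mediatype": "movies",
        "collection": "test_collection",  # assumed: test items get deleted after about a month
        "title": "FFV1 derive test",
    },
)
```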
[07:27] Anyone heard of college "viewbooks"? They are basically mini booklets describing a college for prospective students. I'm considering trying to create a large collection of them.
[07:27] I found a site that has them sort of aggregated already: https://issuu.com/search?q=viewbook
[07:28] but many of them are marked "no download"
[07:28] and if I go to different college sites I can find them. But I think it'd basically be a lot of manual searching to get one from each college for each year that they were available.
[07:29] The part I am interested in is finding what college clubs each college had each year.
[07:29] I would think someone has already indexed this, but I haven't found an index of college clubs yet.
[07:30] anyone happen to have already stumbled on a college viewbook pdf collection that I could use to extract that info?
[07:30] i guess... https://www.google.com/search?q=college+viewbook+type%3A.pdf
[07:31] Can I just run that into wget somehow? (time to listen to the man)
[07:33] *** ravetcofx has quit IRC (Read error: Operation timed out)
[07:49] turnkit: if you've got a list of URLs, yeah, you can feed those into wget/wpull/whatever
[07:50] this is a pretty dumb question, but do you know an easy way to get Google results into a list? I guess I could save the whole page then grep or sed for http://, but it seems like there should be a simpler way
[07:50] unfortunately I don't know of any Google search scraper offhand that'll do this
[07:50] the main difficulty is that Google builds a lot of bot checks into the search
[07:51] I found an SEO plugin that claims to save Google results as CSV, but it was bloaty
[07:51] Well, I found how to change the Google setting to get 100 results per page -- that sort of helps
[07:52] ? http://www.labnol.org/internet/google-web-scraping/28450/
[07:53] oh, that doesn't work -- I found that last week and couldn't figure it out
[07:54] i guess this is more basic than I thought.... stumbling around. https://www.google.com/search?&q=scrape+google+search+results+into+links
[07:56] so, the basics are not too bad; if you keep a human-like pace and don't give yourself away obviously (e.g. by using the default curl/wget user-agent or whatever), you'll probably be fine just grabbing each search page
[07:56] and parsing out the links with nokogiri/beautifulsoup/whatever
[07:57] the problem comes when people go "oh, one process is good, let me scale up to 47"
[07:57] and then they wonder why they are getting no results
[07:59] you will have to deal with getting the URL out of the Google URL redirect thingy
[08:06] turnkit: e.g. https://gitlab.peach-bun.com/snippets/44, quick scripting
[08:26] I'll check that out. Thanks!
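On the "get Google results into a list" question above, a rough sketch of the save-the-page-and-parse approach suggested in the log (BeautifulSoup, plus unwrapping Google's /url?q= redirect). The filenames are placeholders I've made up, and Google's result markup changes frequently, so treat this as a starting point rather than something guaranteed to keep working.

```python
from urllib.parse import parse_qs, urlparse

from bs4 import BeautifulSoup

# Parse a results page saved from the browser (the 100-results-per-page setting helps here).
with open("search-results.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")

urls = []
for a in soup.find_all("a", href=True):
    href = a["href"]
    # Google wraps result links in a redirect of the form /url?q=<target>&sa=...
    if href.startswith("/url?"):
        target = parse_qs(urlparse(href).query).get("q", [""])[0]
        if target.startswith("http") and target not in urls:
            urls.append(target)

# One URL per line, ready to feed to a downloader.
with open("urls.txt", "w") as out:
    out.write("\n".join(urls) + "\n")
```

The resulting urls.txt can then be handed to wget with wget -i urls.txt (wpull should accept the same option), keeping the human-like pacing advice above in mind.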
[08:32] *** turnkit_ has joined #archiveteam-bs
[09:10] *** krazedkat has joined #archiveteam-bs
[09:11] *** GE has joined #archiveteam-bs
[09:51] *** turnkit_ has quit IRC (Ping timeout: 268 seconds)
[10:05] *** Smiley has joined #archiveteam-bs
[10:07] *** SmileyG has quit IRC (Ping timeout: 250 seconds)
[10:38] *** GE has quit IRC (Quit: zzz)
[10:42] *** turnkit has quit IRC (Quit: Page closed)
[10:44] *** BlueMaxim has quit IRC (Quit: Leaving)
[11:21] *** n00b184 has quit IRC (Ping timeout: 268 seconds)
[12:35] *** GE has joined #archiveteam-bs
[14:00] *** SilSte has joined #archiveteam-bs
[14:08] *** tfgbd_znc has quit IRC (Read error: Operation timed out)
[14:09] *** tfgbd_znc has joined #archiveteam-bs
[14:11] *** SilSte has quit IRC (Read error: Connection reset by peer)
[14:12] *** SilSte has joined #archiveteam-bs
[14:49] *** sep332_ has quit IRC (konversation out)
[14:51] *** sep332_ has joined #archiveteam-bs
[14:54] *** Start has quit IRC (Quit: Disconnected.)
[15:50] *** Ravenloft has joined #archiveteam-bs
[16:07] *** ravetcofx has joined #archiveteam-bs
[16:25] *** Shakespea has joined #archiveteam-bs
[16:25] Afternoon
[16:26] I found an interesting site
[16:27] www.oldapps.com -- any possibility of getting it archived? I would mention this on the website, but owing to some unfortunate misunderstandings I can't raise the matter there at the moment.
[16:28] "This web page at download.oldapps.com has been reported to contain unwanted software and has been blocked"
[16:28] thanks firefox
[16:28] Are you using an ad-blocker?
[16:28] It loaded fine for me
[16:29] http://www.oldapps.com/index.php being the full URL
[16:29] Loads fine, just doesn't let me download anything. Weird
[16:30] The useful thing is that it seems to have older versions of some 'sharing' tools... ;)
[16:30] :D
[16:30] I also noted - www.mdgx.com
[16:31] Which has support files going back nearly 20 years
[16:31] (And which probably should be mirrored at some point)
[16:32] And I'm down by 2 on my 3 suggestions this month :(
[16:39] mdgx was grabbed by archivebot in 2015
[16:39] http://archive.fart.website/archivebot/viewer/job/49n9f
[16:40] oldapps.com in 2014: http://archive.fart.website/archivebot/viewer/job/7dvez
[16:42] Aoede: Thanks... mdgx gets updates quite a bit though... so I hope it's on a regular schedule :)
[16:42] Want me to throw it in Archivebot?
[16:43] Feel free, if it's possible to do an incremental
[16:43] The one thing I can never find online is old sewing patterns though....
[16:44] Dunno if incremental is possible
[16:48] I think OldApps may be covered, not sure though.
[17:09] Aoede: My next query would be to look into whether wget has an 'incremental' option, as it saves bandwidth if you only have to add a few new files vs the whole site.
[17:10] If you want to throw it in the bot anyway, don't let me stop you :)
[17:10] wget does
[17:10] --continue
[17:11] xmc: I meant "date incremental", i.e. grab everything that's changed since we last took a sample...
[17:11] yep
[17:11] --warc-dedup?
[17:11] Aoede: Possibly...
[17:12] --continue --mirror will crawl the site but only download files that are different
[17:12] i'm not sure exactly how it works, to be honest
[17:12] Thanks
[17:12] /topic unofficial wget user group
[17:12] anyway. wget --continue --mirror will probably do what you want, but test first
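A sketch of the wget invocation suggested just above, wrapped in a small script so the re-crawl can be re-run periodically. --mirror turns on timestamping (-N), which is what skips files the server reports as unchanged; the target URL, the one-second wait, and --no-parent are my own illustrative assumptions, and how well this works depends on the site sending sane Last-Modified headers, so test it first as advised.

```python
import subprocess

# Re-crawl into the current directory. --mirror implies -r -N -l inf, so files whose
# server timestamps haven't changed since the last run are skipped rather than re-downloaded.
subprocess.run(
    [
        "wget",
        "--mirror",
        "--continue",    # resume any partially downloaded files
        "--no-parent",   # stay under the starting path
        "--wait=1",      # keep a polite, human-ish pace
        "http://www.mdgx.com/",
    ],
    check=True,
)
```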
[17:13] My third suggestion for this month would be to ask who's archiving "adult" fiction sites like asstr, Fictionmania, etc.
[17:13] These can apparently vanish without warning ...
[17:13] i'm not aware of an active project for those sites
[17:13] you're welcome to start one
[17:14] I can't use the wiki at the moment, owing to some unfortunate misunderstandings...
[17:14] archivebot's crawlers support incremental fetch to the degree the site itself makes it possible to determine what's changed
[17:14] archivebot itself does not
[17:14] good news is you can use wget/wpull to do that manually until that situation's resolved
[17:14] Thank you for that explanation.
[17:15] doesn't it use the If-Modified-Since: header?
[17:16] wget can use that, yeah
[17:16] but a website doesn't have to send that, or it can send one that makes no sense
[17:16] er, sorry, wget uses Last-Modified
[17:17] sent by the client
[17:17] ah
[17:17] it's not clear to me whether wget does conditional GETs yet
[17:17] yes. the web is garbage, and we try to layer useful things over that
[17:17] yeah, whoops
[17:18] I confused If-Modified-Since with Last-Modified, go me
[17:18] np
[17:18] they're only different parts of the request
[17:20] But still, in theory, it's possible not to have to grab a whole site multiple times...
[17:20] (which some may still want to do for other reasons, of course...)
[17:21] Thanks ....
[17:21] BTW my fourth of 3 suggestions to archive this month (sorry) would be pre-election news coverage of Trump, before his lawyers get to it ;)
[17:22] * Shakespea out
[17:22] *** Shakespea has left
[17:22] uh
[17:28] typing on the edge of chaos
[17:44] *** computerf has quit IRC (Read error: Operation timed out)
[18:13] *** computerf has joined #archiveteam-bs
[19:42] *** Start has joined #archiveteam-bs
[19:46] *** kristian_ has joined #archiveteam-bs
[19:53] *** krazedkat has quit IRC (Read error: Operation timed out)
[20:09] *** Start has quit IRC (Remote host closed the connection)
[20:50] *** Start has joined #archiveteam-bs
[20:52] *** Start has quit IRC (Client Quit)
[20:54] *** Start has joined #archiveteam-bs
[20:59] *** Start has quit IRC (Client Quit)
[21:03] *** Start has joined #archiveteam-bs
[21:11] *** Yoshimura has quit IRC (Remote host closed the connection)
[21:44] *** BartoCH has quit IRC (Ping timeout: 260 seconds)
[21:49] *** BartoCH has joined #archiveteam-bs
[21:54] *** Start has quit IRC (Remote host closed the connection)
[22:24] *** krazedkat has joined #archiveteam-bs
[23:04] *** Stiletto has quit IRC (Read error: Operation timed out)
[23:14] *** BlueMaxim has joined #archiveteam-bs
[23:17] *** GE has quit IRC (Quit: zzz)
[23:17] so i should be past 1 million items by the morning
[23:17] wow!
[23:29] *** Start has joined #archiveteam-bs
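Going back to the If-Modified-Since / Last-Modified exchange earlier in the log, here is a small illustration of what a conditional GET looks like at the HTTP level, using the Python requests library. The URL is just an example taken from the sites discussed above: Last-Modified arrives in the server's response, and If-Modified-Since is what the client sends back on the next request.

```python
import requests

url = "http://www.oldapps.com/index.php"  # example URL from the log; any page will do

# First fetch: remember the Last-Modified header, if the server sends one.
first = requests.get(url)
last_modified = first.headers.get("Last-Modified")

if last_modified is None:
    print("No Last-Modified header; timestamp-based incremental fetching won't help here.")
else:
    # Later fetch: ask for the body only if it has changed since that timestamp.
    second = requests.get(url, headers={"If-Modified-Since": last_modified})
    if second.status_code == 304:
        print("304 Not Modified: the copy already on disk is still current.")
    else:
        print("Content changed (or the server ignores conditional requests): re-save it.")
```

This is the same mechanism wget's -N timestamping builds on, and it only works as well as the headers the site chooses to send.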