[00:20] <yipdw> why can you not search Google Apps for the App Passwords page
[00:20] <yipdw> come on Google, index yourself
[00:21] * yipdw gcan never friggin find this damn thin
[00:22] <xmc> who indexes the indexer
[00:23] <yipdw> interestingly, you can search in the Google Apps Admin site for settings
[00:23] <yipdw> I wonder if nobody at Google uses the App Passwords page enough for it to matter
[00:23] <yipdw> because why in the world would you ever use an app that wasn't web-based
[00:34] *** godane has joined #archiveteam-bs
[01:00] *** powerKit2 has joined #archiveteam-bs
[01:00] <powerKit2> https://catalogd.archive.org/log/599682290 ...did this task break?
[01:00] <xmc> should be still running
[01:02] <powerKit2> -shrug- it just seemed to be taking longer than it should
[01:03] <xmc> when they break the row in /history/ turns red and there's an error message in the log
[01:03] <xmc> unless something is deeply wrong
[01:04] <xmc> if the derive for a 20 minute video doesn't complete in six hours, then that's cause for worry
[01:04] <xmc> but an hour? ehhh
[01:05] <powerKit2> I think the longest video in the item is 3 hours and 20 minutes
[01:05] *** nickname_ has quit IRC (Read error: Operation timed out)
[01:06] <xmc> oh, well, then that's yeah
[01:06] <xmc> take two aspirin and call me in the morning
[01:08] <powerKit2> I'm guessing this is why people don't typically upload 39Gigabytes of video onto the Internet Archive.
[01:09] <powerKit2> I just figured it'd be kinda mean to dump 121+ individual items into community video.
[01:20] <powerKit2> Anyway, I've been meaning to start recording my videos in FFv1 from now on. Can the archive derive from video encoded that way?
[01:23] <xmc> well. an item should be a work that stands on its own
[01:23] <xmc> not three works, not half a work
[01:23] <xmc> how you define this ... hard to say
[01:26] *** Yoshimura has quit IRC (Remote host closed the connection)
[01:28] <powerKit2> Honestly, I just didn't want to go through 121 random videos with non descriptive names and figure out what each one.
[01:28] <xmc> fair
[01:28] <powerKit2> *what each one was.
[01:30] <powerKit2> Anyway, before I start recording my future videos in FFv1, can the archive actually derive from them?
[01:30] <xmc> what is ffv1
[01:30] <powerKit2> https://en.wikipedia.org/wiki/FFV1
[01:30] <xmc> i suggest you make a short test video and upload it into a test item and see what happens
[01:30] <xmc> test items get deleted after a month
[01:31] <powerKit2> I think it'd work, it looks like derive.php uses llibavcodec which included FFV1.
[01:31] <powerKit2> *libavcodec
[01:33] <powerKit2> Yeah, I'll just make a test video later and see.
[02:12] *** powerKit2 has quit IRC (Quit: Page closed)
[02:46] *** zenguy has quit IRC (Ping timeout: 370 seconds)
[02:57] *** Yoshimura has joined #archiveteam-bs
[03:03] *** zenguy has joined #archiveteam-bs
[03:12] *** n00b184 has joined #archiveteam-bs
[04:05] *** Ravenloft has quit IRC (Read error: Connection reset by peer)
[04:55] *** krazedkat has quit IRC (Leaving)
[05:06] *** Sk1d has quit IRC (Ping timeout: 250 seconds)
[05:07] <godane> i'm at 995k items now
[05:07] <godane> less then 5k items away from 1 million items
[05:08] <godane> also nasa docs are almost done
[05:13] *** Sk1d has joined #archiveteam-bs
[05:26] *** mst__ has joined #archiveteam-bs
[05:35] *** mst__ has quit IRC (Quit: bye)
[06:49] *** Asparagir has joined #archiveteam-bs
[07:08] <whopper> National Library of Australis PANDORA internet archive.. I had no idea this thing existed - http://pandora.nla.gov.au/
[07:08] <whopper> 485,506,170 files and 25.66 TB
[07:26] *** turnkit has joined #archiveteam-bs
[07:27] <turnkit> Anyone heard of college "viewbooks" -- they are basically mini booklets describing a college for prospective students.  I'm considering trying to create a large collection of them.
[07:27] <turnkit> I found a site that has them sort of aggregated already: https://issuu.com/search?q=viewbook
[07:28] <turnkit> but many of them are marked "no download"
[07:28] <turnkit> and if I go to different college sites I can find them.  But I think it'd be basically a lot of just manual searching to get one from each college for each year that they were available.
[07:29] <turnkit> The part I am interested in is finding what college clubs each college had each year.
[07:29] <turnkit> I would think someone already has indexed this but I haven't found an index of college clubs yet.
[07:30] <turnkit> anyone happen to already stumble on a college viewbook pdf collection that I could use to extract that info?
[07:30] <turnkit> i guess... https://www.google.com/search?q=college+viewbook+type%3A.pdf
[07:31] <turnkit> Can I just run that into wget somehow?  (time to listen to the man)
[07:33] *** ravetcofx has quit IRC (Read error: Operation timed out)
[07:49] <yipdw> turnkit: if you've got a list of URLs, yeah, you can feed those into wget/wpull/whatever
[07:50] <turnkit>  this is a pretty dumb question but do you know an easy way to get google results into a list?  I guess I could save the whole page then grep or sed for http:// but seems like there should be an simplier way
[07:50] <yipdw> unfortunately I don't know of any Google search scraper offhand that'll do this
[07:50] <yipdw> the main difficulty is that Google builds a lot of bot checks into the search
[07:51] <turnkit> I found a SEO plugin that claims to save Google results as CSV but it was bloaty
[07:51] <turnkit> Well I found how to change the Google setting to get 100 results per page -- that sort of helps
[07:52] <turnkit> ? http://www.labnol.org/internet/google-web-scraping/28450/
[07:53] <turnkit> oh that doesn't work -- I found that last week and couldn't figure it out
[07:54] <turnkit> i guess this is more basic than I thought.... stumbling around.  https://www.google.com/search?&q=scrape+google+search+results+into+links
[07:56] <yipdw> so, the basics are not too bad; if you keep a human-like pace and don't give yourself away obviously (e.g. use the default curl/wget user-agent or whatever) you'll probably be fine just grabbing each search page
[07:56] <yipdw> and parsing out the links with nokogiri/beautifulsoup/whatever
[07:57] <yipdw> the problem comes when people go "oh, one process is good, let me scale up to 47"
[07:57] <yipdw> and then they wonder why they are getting no result
[07:57] <yipdw> s
[07:59] <yipdw> you will have to deal with getting the URL out of the Google URL redirect thingy
[08:06] <yipdw> turnkit: e.g. https://gitlab.peach-bun.com/snippets/44, quick scripting
[08:26] <turnkit> I'll check that out.  Thanks!
[08:32] *** turnkit_ has joined #archiveteam-bs
[09:10] *** krazedkat has joined #archiveteam-bs
[09:11] *** GE has joined #archiveteam-bs
[09:51] *** turnkit_ has quit IRC (Ping timeout: 268 seconds)
[10:05] *** Smiley has joined #archiveteam-bs
[10:07] *** SmileyG has quit IRC (Ping timeout: 250 seconds)
[10:38] *** GE has quit IRC (Quit: zzz)
[10:42] *** turnkit has quit IRC (Quit: Page closed)
[10:44] *** BlueMaxim has quit IRC (Quit: Leaving)
[11:21] *** n00b184 has quit IRC (Ping timeout: 268 seconds)
[12:35] *** GE has joined #archiveteam-bs
[14:00] *** SilSte has joined #archiveteam-bs
[14:08] *** tfgbd_znc has quit IRC (Read error: Operation timed out)
[14:09] *** tfgbd_znc has joined #archiveteam-bs
[14:11] *** SilSte has quit IRC (Read error: Connection reset by peer)
[14:12] *** SilSte has joined #archiveteam-bs
[14:49] *** sep332_ has quit IRC (konversation out)
[14:51] *** sep332_ has joined #archiveteam-bs
[14:54] *** Start has quit IRC (Quit: Disconnected.)
[15:50] *** Ravenloft has joined #archiveteam-bs
[16:07] *** ravetcofx has joined #archiveteam-bs
[16:25] *** Shakespea has joined #archiveteam-bs
[16:25] <Shakespea> Afternoon
[16:26] <Shakespea> I found an intresting site 
[16:27] <Shakespea> www.oldapps.com any possibility of getting it archived?  I would mention this on the Web site, but owing to some unfrotunate misunderstandings I can't raise the matter there at the moment.
[16:28] <Aoede> "This web page at download.oldapps.com has been reported to contain unwanted software and has been blocked "
[16:28] <Aoede> thanks firefox
[16:28] <Shakespea> Are you using an ad-blocker?
[16:28] <Shakespea> It loaded fine for me
[16:29] <Shakespea> http://www.oldapps.com/index.php being the full URL
[16:29] <Aoede> Loads fine, just doesn't let me download anything. Weird
[16:30] <Shakespea> The useful thing is that seems to have older versions of some 'sharing' tools... ;)
[16:30] <Aoede> :D
[16:30] <Shakespea> I also noted - www.mdgx.com
[16:31] <Shakespea> Which has suport files going back nearly 20 years
[16:31] <Shakespea> (And which probably should be mirrored at some point)
[16:32] <Shakespea> And I'm down by2 on my 3 suggestions this month :(
[16:39] <Aoede> mdgx was grabbed by archivebot in 2015
[16:39] <Aoede> http://archive.fart.website/archivebot/viewer/job/49n9f
[16:40] <Aoede> oldapps.com in 2014 http://archive.fart.website/archivebot/viewer/job/7dvez
[16:42] <Shakespea> Aoede: Thanks...  mgdx gets updates a quite a bit though... so I hope it's on a regular schedule :)
[16:42] <Shakespea> *mdgx
[16:42] <Aoede> Want me to throw it in Archivebot?
[16:43] <Shakespea> Feel free, if it's possible to do an incremental 
[16:43] <Shakespea> The one thing I can never find online is old sewing patterns though....
[16:44] <Aoede> Dunno if incremental is possible
[16:48] <Sanqui> I think OldApps may be covered, not sure though.
[17:09] <Shakespea> Aoede:  My next query would be to look into whether wget has an 'incremental' option in it, as it save badnwidth if you only have to add a few new files vs the whole site.
[17:10] <Shakespea> If you want to throw it in the bot anyway , don't let me stop you :)
[17:10] <xmc> wget does
[17:10] <xmc> --continue
[17:11] <Shakespea> xmc: I meant "date incremental" I.e grab evreything that's changed since we last took a sample...
[17:11] <xmc> yep
[17:11] <Aoede> --warc-dedup?
[17:11] <Shakespea> Aoede: Possibly...
[17:12] <xmc> --continue --mirror will crawl the site but only download files that are different
[17:12] <xmc> i'm not sure exactly how it works, to be honest
[17:12] <Shakespea> Thanks
[17:12] <xmc> /topic unofficial wget user group
[17:12] <xmc> anyway. wget --continue --mirror will probably do what you want. but test first
[17:13] <Shakespea> My third suggestion for this month would be  to ask whose archiving "adult" fiction sites like asstr, Fictionmania etc
[17:13] <Shakespea> These can apparently vanish without warning ...
[17:13] <xmc> i'm not aware of an active project for those sites
[17:13] <xmc> you're welcome to start one
[17:14] <Shakespea> I can't use the wiki at the moment, owing to some unfortunate misunderstandings...
[17:14] <yipdw> archivebot's crawlers support incremental fetch to the degree the site itself makes it possible to determine what's changed
[17:14] <yipdw> archivebot itself does not
[17:14] <yipdw> good news is you can use wget/wpull to do that manually until that situation's resolved
[17:14] <Shakespea> Thank you for that explanation.
[17:15] <xmc> doesn't it use the If-Modified-Since: header ?
[17:16] <yipdw> wget can use that yeah
[17:16] <yipdw> but a website doesn't have to send that or send one that makes any sense
[17:16] <yipdw> er, sorry, wget uses Last-Modified
[17:17] <xmc> sent by the client
[17:17] <xmc> ah
[17:17] <yipdw> it's not clear to me whether wget does conditional GETs yet
[17:17] <xmc> yes. the web is garbage, and we try to layer useful things over that
[17:17] <yipdw> yeah whoops
[17:18] <yipdw> I confused If-Modified-Since with Last-Modified, go me
[17:18] <xmc> np
[17:18] <yipdw> they're only different parts of the request
[17:20] <Shakespea> But still in theory possible not to have to grab a whole site multiple times...
[17:20] <Shakespea> ( which some may still want to do for other reasons, of course...)
[17:21] <Shakespea> Thanks ....
[17:21] <Shakespea> BTW My forurth of 3 suggestions for archive this month (Sorry) would be news sites on Trump that are pre-election before his lawyers get to them ;) )
[17:22] * Shakespea out
[17:22] *** Shakespea has left 
[17:22] <xmc> uh
[17:28] <yipdw> typing on the edge of chaos
[17:44] *** computerf has quit IRC (Read error: Operation timed out)
[18:13] *** computerf has joined #archiveteam-bs
[19:42] *** Start has joined #archiveteam-bs
[19:46] *** kristian_ has joined #archiveteam-bs
[19:53] *** krazedkat has quit IRC (Read error: Operation timed out)
[20:09] *** Start has quit IRC (Remote host closed the connection)
[20:50] *** Start has joined #archiveteam-bs
[20:52] *** Start has quit IRC (Client Quit)
[20:54] *** Start has joined #archiveteam-bs
[20:59] *** Start has quit IRC (Client Quit)
[21:03] *** Start has joined #archiveteam-bs
[21:11] *** Yoshimura has quit IRC (Remote host closed the connection)
[21:44] *** BartoCH has quit IRC (Ping timeout: 260 seconds)
[21:49] *** BartoCH has joined #archiveteam-bs
[21:54] *** Start has quit IRC (Remote host closed the connection)
[22:24] *** krazedkat has joined #archiveteam-bs
[23:04] *** Stiletto has quit IRC (Read error: Operation timed out)
[23:14] *** BlueMaxim has joined #archiveteam-bs
[23:17] *** GE has quit IRC (Quit: zzz)
[23:17] <godane> so i should be past 1 million items by the morning
[23:17] <xmc> wow!
[23:29] *** Start has joined #archiveteam-bs