Time | Nickname | Message
00:20 | yipdw | why can you not search Google Apps for the App Passwords page
00:20 | yipdw | come on Google, index yourself
00:21 | * | yipdw can never friggin find this damn thing
00:22 | xmc | who indexes the indexer
00:23 | yipdw | interestingly, you can search in the Google Apps Admin site for settings
00:23 | yipdw | I wonder if nobody at Google uses the App Passwords page enough for it to matter
00:23 | yipdw | because why in the world would you ever use an app that wasn't web-based
00:34 | | godane has joined #archiveteam-bs
01:00 | | powerKit2 has joined #archiveteam-bs
01:00 | powerKit2 | https://catalogd.archive.org/log/599682290 ...did this task break?
01:00 | xmc | should be still running
01:02 | powerKit2 | -shrug- it just seemed to be taking longer than it should
01:03 | xmc | when they break, the row in /history/ turns red and there's an error message in the log
01:03 | xmc | unless something is deeply wrong
01:04 | xmc | if the derive for a 20-minute video doesn't complete in six hours, then that's cause for worry
01:04 | xmc | but an hour? ehhh
01:05 | powerKit2 | I think the longest video in the item is 3 hours and 20 minutes
01:05 | | nickname_ has quit IRC (Read error: Operation timed out)
01:06 | xmc | oh, well, then that's... yeah
01:06 | xmc | take two aspirin and call me in the morning
01:08 | powerKit2 | I'm guessing this is why people don't typically upload 39 gigabytes of video onto the Internet Archive.
01:09 | powerKit2 | I just figured it'd be kinda mean to dump 121+ individual items into community video.
01:20 | powerKit2 | Anyway, I've been meaning to start recording my videos in FFv1 from now on. Can the archive derive from video encoded that way?
01:23 | xmc | well. an item should be a work that stands on its own
01:23 | xmc | not three works, not half a work
01:23 | xmc | how you define this ... hard to say
01:26 | | Yoshimura has quit IRC (Remote host closed the connection)
01:28 | powerKit2 | Honestly, I just didn't want to go through 121 random videos with non-descriptive names and figure out what each one was.
01:28 | xmc | fair
01:30 | powerKit2 | Anyway, before I start recording my future videos in FFv1, can the archive actually derive from them?
01:30 | xmc | what is ffv1
01:30 | powerKit2 | https://en.wikipedia.org/wiki/FFV1
01:30 | xmc | i suggest you make a short test video and upload it into a test item and see what happens
01:31 | xmc | test items get deleted after a month
01:31 | powerKit2 | I think it'd work, it looks like derive.php uses libavcodec, which includes FFV1.
01:33 | powerKit2 | Yeah, I'll just make a test video later and see.
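A quick sketch of how such a test clip could be produced, assuming a stock ffmpeg build (the testsrc lavfi source and the ffv1 encoder ship with it; the output filename is illustrative):

    ffmpeg -f lavfi -i testsrc=duration=10:size=640x480:rate=30 -c:v ffv1 -level 3 ffv1-test.mkv

That generates ten seconds of synthetic video and encodes it as FFV1 version 3 in a Matroska container, small enough to throw into a disposable test item.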
02:12 | | powerKit2 has quit IRC (Quit: Page closed)
02:46 | | zenguy has quit IRC (Ping timeout: 370 seconds)
02:57 | | Yoshimura has joined #archiveteam-bs
03:03 | | zenguy has joined #archiveteam-bs
03:12 | | n00b184 has joined #archiveteam-bs
04:05 | | Ravenloft has quit IRC (Read error: Connection reset by peer)
04:55 | | krazedkat has quit IRC (Leaving)
05:06 | | Sk1d has quit IRC (Ping timeout: 250 seconds)
05:07 | godane | i'm at 995k items now
05:07 | godane | less than 5k items away from 1 million items
05:08 | godane | also nasa docs are almost done
05:13 | | Sk1d has joined #archiveteam-bs
05:26 | | mst__ has joined #archiveteam-bs
05:35 | | mst__ has quit IRC (Quit: bye)
06:49 | | Asparagir has joined #archiveteam-bs
07:08 | whopper | National Library of Australia's PANDORA internet archive... I had no idea this thing existed - http://pandora.nla.gov.au/
07:08 | whopper | 485,506,170 files and 25.66 TB
07:26 | | turnkit has joined #archiveteam-bs
07:27 | turnkit | Anyone heard of college "viewbooks" -- they are basically mini booklets describing a college for prospective students. I'm considering trying to create a large collection of them.
07:27 | turnkit | I found a site that has them sort of aggregated already: https://issuu.com/search?q=viewbook
07:28 | turnkit | but many of them are marked "no download"
07:28 | turnkit | and if I go to different college sites I can find them. But I think it'd basically be a lot of manual searching to get one from each college for each year that they were available.
07:29 | turnkit | The part I am interested in is finding what college clubs each college had each year.
07:29 | turnkit | I would think someone already has indexed this but I haven't found an index of college clubs yet.
07:30 | turnkit | anyone happen to already stumble on a college viewbook pdf collection that I could use to extract that info?
07:30 | turnkit | i guess... https://www.google.com/search?q=college+viewbook+type%3A.pdf
07:31 | turnkit | Can I just run that into wget somehow? (time to listen to the man)
07:33 | | ravetcofx has quit IRC (Read error: Operation timed out)
07:49 | yipdw | turnkit: if you've got a list of URLs, yeah, you can feed those into wget/wpull/whatever
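A sketch of that wget route, assuming a hypothetical urls.txt with one viewbook URL per line (all flags shown are standard wget options):

    wget --input-file=urls.txt --warc-file=viewbooks --wait=1 --random-wait

--input-file reads the URL list, --warc-file additionally records the responses into a WARC for archiving, and the two wait flags space the requests out politely.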
07:50 | turnkit | this is a pretty dumb question but do you know an easy way to get google results into a list? I guess I could save the whole page then grep or sed for http:// but it seems like there should be a simpler way
07:50 | yipdw | unfortunately I don't know of any Google search scraper offhand that'll do this
07:50 | yipdw | the main difficulty is that Google builds a lot of bot checks into the search
07:51 | turnkit | I found an SEO plugin that claims to save Google results as CSV but it was bloaty
07:51 | turnkit | Well I found how to change the Google setting to get 100 results per page -- that sort of helps
07:52 | turnkit | ? http://www.labnol.org/internet/google-web-scraping/28450/
07:53 | turnkit | oh that doesn't work -- I found that last week and couldn't figure it out
07:54 | turnkit | i guess this is more basic than I thought.... stumbling around. https://www.google.com/search?&q=scrape+google+search+results+into+links
07:56 | yipdw | so, the basics are not too bad; if you keep a human-like pace and don't give yourself away obviously (e.g. by using the default curl/wget user-agent) you'll probably be fine just grabbing each search page
07:56 | yipdw | and parsing out the links with nokogiri/beautifulsoup/whatever
07:57 | yipdw | the problem comes when people go "oh, one process is good, let me scale up to 47"
07:57 | yipdw | and then they wonder why they are getting no results
07:59 | yipdw | you will have to deal with getting the URL out of the Google URL redirect thingy
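A minimal sketch of that whole approach in Python, assuming the requests and beautifulsoup4 libraries are installed; Google's result markup changes often, so treat the link extraction as illustrative rather than stable:

    import time
    import urllib.parse

    import requests
    from bs4 import BeautifulSoup

    def google_result_links(query, pages=3, pause=30):
        # Browser-like User-Agent, so we don't announce ourselves with
        # the default requests/curl/wget one.
        headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"}
        for page in range(pages):
            resp = requests.get(
                "https://www.google.com/search",
                params={"q": query, "start": page * 10},
                headers=headers,
            )
            resp.raise_for_status()
            soup = BeautifulSoup(resp.text, "html.parser")
            for a in soup.find_all("a", href=True):
                # Google wraps each result in a /url?q=<real URL>&... redirect;
                # pull the real URL back out of the q parameter.
                if a["href"].startswith("/url?"):
                    qs = urllib.parse.parse_qs(urllib.parse.urlparse(a["href"]).query)
                    if "q" in qs:
                        yield qs["q"][0]
            time.sleep(pause)  # keep the human-like pace described above

Something like `for url in google_result_links("college viewbook filetype:pdf"): print(url)` would then produce a list that can be fed to wget as sketched earlier.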
08:06 | yipdw | turnkit: e.g. https://gitlab.peach-bun.com/snippets/44, quick scripting
08:26 | turnkit | I'll check that out. Thanks!
08:32 | | turnkit_ has joined #archiveteam-bs
09:10 | | krazedkat has joined #archiveteam-bs
09:11 | | GE has joined #archiveteam-bs
09:51 | | turnkit_ has quit IRC (Ping timeout: 268 seconds)
10:05 | | Smiley has joined #archiveteam-bs
10:07 | | SmileyG has quit IRC (Ping timeout: 250 seconds)
10:38 | | GE has quit IRC (Quit: zzz)
10:42 | | turnkit has quit IRC (Quit: Page closed)
10:44 | | BlueMaxim has quit IRC (Quit: Leaving)
11:21 | | n00b184 has quit IRC (Ping timeout: 268 seconds)
12:35 | | GE has joined #archiveteam-bs
14:00 | | SilSte has joined #archiveteam-bs
14:08 | | tfgbd_znc has quit IRC (Read error: Operation timed out)
14:09 | | tfgbd_znc has joined #archiveteam-bs
14:11 | | SilSte has quit IRC (Read error: Connection reset by peer)
14:12 | | SilSte has joined #archiveteam-bs
14:49 | | sep332_ has quit IRC (konversation out)
14:51 | | sep332_ has joined #archiveteam-bs
14:54 | | Start has quit IRC (Quit: Disconnected.)
15:50 | | Ravenloft has joined #archiveteam-bs
16:07 | | ravetcofx has joined #archiveteam-bs
16:25 | | Shakespea has joined #archiveteam-bs
16:25 | Shakespea | Afternoon
16:26 | Shakespea | I found an interesting site
16:27 | Shakespea | www.oldapps.com any possibility of getting it archived? I would mention this on the website, but owing to some unfortunate misunderstandings I can't raise the matter there at the moment.
16:28 | Aoede | "This web page at download.oldapps.com has been reported to contain unwanted software and has been blocked"
16:28 | Aoede | thanks firefox
16:28 | Shakespea | Are you using an ad-blocker?
16:29 | Shakespea | It loaded fine for me
16:29 | Shakespea | http://www.oldapps.com/index.php being the full URL
16:30 | Aoede | Loads fine, just doesn't let me download anything. Weird
16:30 | Shakespea | The useful thing is that it seems to have older versions of some 'sharing' tools... ;)
16:30 | Aoede | :D
16:31 | Shakespea | I also noted - www.mdgx.com
16:31 | Shakespea | Which has support files going back nearly 20 years
16:32 | Shakespea | (And which probably should be mirrored at some point)
16:39 | Shakespea | And I'm down by 2 on my 3 suggestions this month :(
16:39 | Aoede | mdgx was grabbed by archivebot in 2015
16:40 | Aoede | http://archive.fart.website/archivebot/viewer/job/49n9f
16:42 | Aoede | oldapps.com in 2014 http://archive.fart.website/archivebot/viewer/job/7dvez
16:42 | Shakespea | Aoede: Thanks... mdgx gets updated quite a bit though... so I hope it's on a regular schedule :)
16:43 | Aoede | Want me to throw it in Archivebot?
16:43 | Shakespea | Feel free, if it's possible to do an incremental
16:44 | Shakespea | The one thing I can never find online is old sewing patterns though....
16:48 | Aoede | Dunno if incremental is possible
16:48
🔗
|
Sanqui |
I think OldApps may be covered, not sure though. |
17:09
🔗
|
Shakespea |
Aoede: My next query would be to look into whether wget has an 'incremental' option in it, as it save badnwidth if you only have to add a few new files vs the whole site. |
17:10
🔗
|
Shakespea |
If you want to throw it in the bot anyway , don't let me stop you :) |
17:10 | xmc | wget does
17:11 | xmc | --continue
17:11 | Shakespea | xmc: I meant "date incremental", i.e. grab everything that's changed since we last took a sample...
17:11 | xmc | yep
17:11 | Aoede | --warc-dedup?
17:12 | Shakespea | Aoede: Possibly...
17:12 | xmc | --continue --mirror will crawl the site but only download files that are different
17:12 | xmc | i'm not sure exactly how it works, to be honest
17:12 | Shakespea | Thanks
17:12 | xmc | /topic unofficial wget user group
17:13 | xmc | anyway. wget --continue --mirror will probably do what you want. but test first
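In wget's terms, --mirror is shorthand for -r -N -l inf --no-remove-listing, and the -N (timestamping) part is what skips unchanged files: wget compares the server's Last-Modified date against the local copy, while --continue resumes partially downloaded files. A sketch of the two variants discussed here, with illustrative WARC names:

    wget --continue --mirror http://www.mdgx.com/
    wget --mirror --warc-file=mdgx-new --warc-cdx --warc-dedup=mdgx-old.cdx http://www.mdgx.com/

The second line is one reading of the --warc-dedup suggestion: URLs whose payload already appears in the older CDX index are written to the new WARC as revisit records instead of being stored again.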
17:13 | Shakespea | My third suggestion for this month would be to ask who's archiving "adult" fiction sites like asstr, Fictionmania etc
17:13 | Shakespea | These can apparently vanish without warning ...
17:13 | xmc | i'm not aware of an active project for those sites
17:14 | xmc | you're welcome to start one
17:14 | Shakespea | I can't use the wiki at the moment, owing to some unfortunate misunderstandings...
17:14 | yipdw | archivebot's crawlers support incremental fetch to the degree the site itself makes it possible to determine what's changed
17:14 | yipdw | archivebot itself does not
17:14 | yipdw | good news is you can use wget/wpull to do that manually until that situation's resolved
17:15 | Shakespea | Thank you for that explanation.
17:16 | xmc | doesn't it use the If-Modified-Since: header?
17:16 | yipdw | wget can use that yeah
17:16 | yipdw | but a website doesn't have to send that or send one that makes any sense
17:17 | yipdw | er, sorry, wget uses Last-Modified
17:17 | xmc | sent by the client
17:17 | xmc | ah
17:17 | yipdw | it's not clear to me whether wget does conditional GETs yet
17:17 | xmc | yes. the web is garbage, and we try to layer useful things over that
17:18 | yipdw | yeah whoops
17:18 | yipdw | I confused If-Modified-Since with Last-Modified, go me
17:18 | xmc | np
17:20 | yipdw | they're only different parts of the request
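To untangle the two headers: Last-Modified is a response header, and If-Modified-Since is the request header a client echoes it back in. A minimal sketch of the conditional GET being discussed, in Python with requests, assuming the server actually implements it:

    import requests

    url = "http://www.mdgx.com/"  # example target from the discussion above

    # First fetch: remember the server's Last-Modified response header.
    first = requests.get(url)
    stamp = first.headers.get("Last-Modified")

    # Later fetch: echo it back as If-Modified-Since. An unchanged page
    # comes back as 304 Not Modified with an empty body, so nothing is
    # re-downloaded.
    if stamp:
        again = requests.get(url, headers={"If-Modified-Since": stamp})
        print(again.status_code)  # 304 if unchanged, 200 if it changed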
17:20 | Shakespea | But still in theory possible not to have to grab a whole site multiple times...
17:21 | Shakespea | (which some may still want to do for other reasons, of course...)
17:21 | Shakespea | Thanks...
17:22 | Shakespea | BTW my fourth of 3 suggestions for archiving this month (sorry) would be news sites on Trump that are pre-election, before his lawyers get to them ;)
17:22 | * | Shakespea out
17:22 | | Shakespea has left
17:28 | xmc | uh
17:44 | yipdw | typing on the edge of chaos
17:44
🔗
|
|
computerf has quit IRC (Read error: Operation timed out) |
18:13
🔗
|
|
computerf has joined #archiveteam-bs |
19:42
🔗
|
|
Start has joined #archiveteam-bs |
19:46
🔗
|
|
kristian_ has joined #archiveteam-bs |
19:53
🔗
|
|
krazedkat has quit IRC (Read error: Operation timed out) |
20:09
🔗
|
|
Start has quit IRC (Remote host closed the connection) |
20:50
🔗
|
|
Start has joined #archiveteam-bs |
20:52
🔗
|
|
Start has quit IRC (Client Quit) |
20:54
🔗
|
|
Start has joined #archiveteam-bs |
20:59
🔗
|
|
Start has quit IRC (Client Quit) |
21:03
🔗
|
|
Start has joined #archiveteam-bs |
21:11
🔗
|
|
Yoshimura has quit IRC (Remote host closed the connection) |
21:44
🔗
|
|
BartoCH has quit IRC (Ping timeout: 260 seconds) |
21:49
🔗
|
|
BartoCH has joined #archiveteam-bs |
21:54
🔗
|
|
Start has quit IRC (Remote host closed the connection) |
22:24
🔗
|
|
krazedkat has joined #archiveteam-bs |
23:04
🔗
|
|
Stiletto has quit IRC (Read error: Operation timed out) |
23:14
🔗
|
|
BlueMaxim has joined #archiveteam-bs |
23:17
🔗
|
|
GE has quit IRC (Quit: zzz) |
23:17 | godane | so i should be past 1 million items by the morning
23:29 | xmc | wow!
23:29 | | Start has joined #archiveteam-bs