[01:03] *** BlueMaxim has joined #archiveteam-bs
[01:25] *** BlueMaxim has quit IRC (hub.efnet.us irc.servercentral.net)
[01:25] *** schbirid has quit IRC (hub.efnet.us irc.servercentral.net)
[01:25] *** superkuh has quit IRC (hub.efnet.us irc.servercentral.net)
[01:25] *** HCross has quit IRC (hub.efnet.us irc.servercentral.net)
[01:25] *** ohhdemgir has quit IRC (hub.efnet.us irc.servercentral.net)
[01:25] *** signius has quit IRC (hub.efnet.us irc.servercentral.net)
[01:25] *** Sanqui has quit IRC (hub.efnet.us irc.servercentral.net)
[01:25] *** balrog has quit IRC (hub.efnet.us irc.servercentral.net)
[01:25] *** slyphic|a has quit IRC (hub.efnet.us irc.servercentral.net)
[01:25] *** jk[SVP] has quit IRC (hub.efnet.us irc.servercentral.net)
[01:25] *** SN4T14 has quit IRC (hub.efnet.us irc.servercentral.net)
[01:25] *** Infreq_ has quit IRC (hub.efnet.us irc.servercentral.net)
[01:25] *** rduser has quit IRC (hub.efnet.us irc.servercentral.net)
[01:25] *** w0rp has quit IRC (hub.efnet.us irc.servercentral.net)
[01:25] *** swebb has quit IRC (hub.efnet.us irc.servercentral.net)
[01:25] *** chazchaz has quit IRC (hub.efnet.us irc.servercentral.net)
[01:25] *** dcmorton has quit IRC (hub.efnet.us irc.servercentral.net)
[01:25] *** dxrt has quit IRC (hub.efnet.us irc.servercentral.net)
[01:25] *** atlogbot has quit IRC (hub.efnet.us irc.servercentral.net)
[01:25] *** swebb_ has joined #archiveteam-bs
[01:25] *** rduser` has joined #archiveteam-bs
[01:25] *** superkuh_ has joined #archiveteam-bs
[01:25] *** Infreq has joined #archiveteam-bs
[01:26] *** w0rp_ has joined #archiveteam-bs
[01:27] *** Sanky has joined #archiveteam-bs
[01:28] *** balrog_ has joined #archiveteam-bs
[01:40] *** rduser` is now known as rduser
[01:40] *** w0rp_ is now known as w0rp
[01:40] *** balrog_ is now known as balrog
[01:40] *** swebb_ is now known as swebb
[01:44] *** slyphic has joined #archiveteam-bs
[01:45] *** chazchaz has joined #archiveteam-bs
[01:48] *** jk[SVP] has joined #archiveteam-bs
[02:00] *** dxrt has joined #archiveteam-bs
[02:18] *** JesseW has quit IRC (Read error: Operation timed out)
[02:22] *** JesseW has joined #archiveteam-bs
[02:27] *** schbirid2 has joined #archiveteam-bs
[02:59] *** ohhdemgir has joined #archiveteam-bs
[04:04] https://ia600305.us.archive.org/29/items/dn2001-0926_vid/dn2001-0926_vid_files.xml contains a claim that the md5 of *itself* is 8bb5e8561b541b2cb205b9415278a870
[04:04] That can't work, can it?
[04:05] and it isn't correct, in any case...
[04:07] At this point MD5 is so broken I think it would be possible to generate files containing their own MD5 sum (possibly with some garbage at the end)
[04:07] But I doubt the Internet Archive would put the effort into doing that
[04:08] lol
[04:08] it does seem to be *present* in all the files I look at, though...
[04:08] very strange
[04:08] More likely they're just running md5sum on all the files and then updating the metadata
[04:09] Which obviously changes the metadata's MD5
[04:11] lol
[04:12] hm, I wonder if the provided one is correct if that line is removed, then
[04:22] nope, the provided one does not match any variations on the actual file that I've tried
[04:30] there's also the oddity that in this old item I'm looking at https://archive.org/metadata/dn2001-0926_vid -- it claims the _files.xml file is "original" while in the more recent item https://archive.org/metadata/nasa -- it is more correctly identified as "metadata".
[04:38] I bet someone at IA would know why
[04:39] I've asked in the stacks slack channel.
[04:49] *** superkuh_ has quit IRC (Remote host closed the connection)
[05:20] next interesting question -- what's the _meta.sqlite file, and why is *it* considered original Metadata?
[05:22] *** acridAxid has quit IRC (Read error: Operation timed out)
[05:45] *** coretx has joined #archiveteam-bs
[05:46] * JesseW has now written a jq filter that will extract the md5s from both the census file and live data. Now to compare them (at least for some items to start)
[05:46] hi coretx, welcome
[05:46] ty ^_^
[05:46] (I don't know who you are)
[05:46] Some people do, and many don't need to know :)
[05:46] (I just noticed you trying to make it in here from #archiveteam)
[05:47] Yeah, nasty bug in my quassel client.
[05:47] It get's even worse when you msg nickserv for identification.
[05:47] oh badass, I can chat from a Wayland session
[05:48] I gotta reconfigure my system to do this without having to do X -> Weston crap
[06:22] *** vitzli has joined #archiveteam-bs
[06:41] coretx: Which version of Quassel are you running that has that problem?
[07:21] https://archive.org/details/Galactic_Video_Review_1989_VHSRip
[07:23] Hm, anyone have any ideas why IA's search apparently returns multiple entries for the same item sometimes?
[07:29] godane: Don't add covers
[07:30] ok
[07:30] why did you add covers?
[07:30] I did a range, and will do another
[07:30] because it looked like crap the other way
[07:31] Now when you go to an individual day of broadcast, it's the logos and the waveforms as you click on them.
[07:31] vs just the waveforms
[07:32] the collection cover was there before derive
[07:32] It'll work out.
[07:32] i really didn't like
[07:33] plus side is my cover adding redrived like 5 items in 2004
[07:33] that you missed
[07:35] look like 2004-07-29 and 2004-07-31 was derive
[07:35] *wasn't
[07:35] are you all talking about ones like https://archive.org/details/kpfa-archives-radio-podcast-2007-02-15 ?
[07:36] (which has neither waveforms or a cover, FWIW)
[07:38] it as waveforms when i'm on the item and on collection page
[07:39] i get waveforms when i'm logged out too
[07:40] *** Stiletto has quit IRC (Ping timeout: 246 seconds)
[07:40] SketchCow: you only have to add covers to items past march 2005 i think now
[07:40] godane: for the item I linked to? Strange.
[07:41] thats what i thought
[07:41] i thought it was cause i was login or something
[07:41] but its not that
[07:42] I see cover images or waveforms for many of them on https://archive.org/details/kpfa_podcasts
[07:43] including the one I linked to. But nothing on the item page itself...
[07:43] just "There Is No Preview Available For This Item"
[07:44] oh
[07:44] i only noticed there was no differents on search page
[07:45] SketchCow: i think this was the podcasts collection problem i told you about
[07:45] where items are not viewable if your not login
[07:46] same problem with a cbsnews.com page: https://archive.org/details/cbsnews.com-video-2007-06-29
[07:47] i think when there is more then one file in a item it cause this problem
[07:49] but some things i have to that with just so i can get it uploaded
[07:49] * JesseW is about to finish downloading the IA census data :-)
[07:53] nevermind
[07:53] the problem is on 1 file per item collections too: https://archive.org/details/Martin_Yan_Quick_and_Easy_S01E03
[07:54] good news is there is no dmca noticed on it yet
[07:54] the collection was saying 0 items on the podcasts collection page
[07:56] SketchCow: so if you want to know there is a bug the plagues items to not display right inless your login
[07:56] They don't display for me even when I log in...
[07:56] when i'm login it works
[07:57] SketchCow: correction *only when user is login does the item display right
[07:58] SketchCow: best idea is to take one of the smaller collections out to see if the problem exists anymore
[07:59] my vote goes for Martin Yan's shows: https://archive.org/details/Martin_Yan_Shows
[08:01] cause its small and it doesn't belong in the podcasts collection in the first place
[08:01] JesseW: does this item work when logged out: https://archive.org/details/Joystiq_Xbox_360_Fancast_212
[08:03] *** JesseW has quit IRC (Ping timeout: 246 seconds)
[09:31] *** Sanky is now known as Sanqui
[09:38] *** RichardG has quit IRC (Ping timeout: 260 seconds)
[10:03] *** vitzli has quit IRC (Leaving)
[10:28] *** superkuh has joined #archiveteam-bs
[10:33] *** signius has joined #archiveteam-bs
[11:29] *** vtyl has quit IRC (Ping timeout: 252 seconds)
[11:41] *** lytv has joined #archiveteam-bs
[13:29] *** RichardG has joined #archiveteam-bs
[13:43] *** HCross has joined #archiveteam-bs
[14:44] :)
[14:44] You two pairing up is a dangerous thing
[14:49] Friends Reunited is closing. http://www.bbc.co.uk/news/technology-35343091 Shall we go for it?
[17:01] Im grabbing http://parliamentlive.tv/Event/Index/83208344-218d-4c43-9300-ca78c374b875
[17:06] *** JesseW has joined #archiveteam-bs
[17:27] *** JesseW has quit IRC (Read error: Operation timed out)
[17:45] *** JW_work has quit IRC (Read error: Operation timed out)
[17:48] *** JW_work has joined #archiveteam-bs
[17:48] *** Start has quit IRC (Quit: Disconnected.)
[17:48] godane: I do see a waveform for https://archive.org/details/Joystiq_Xbox_360_Fancast_212 when logged in or logged out.
[18:24] i think its cause the item is not part of the podcasts collection
[18:42] *** Start has joined #archiveteam-bs
[19:17] *** Start has quit IRC (Quit: Disconnected.)
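Note on the 04:04 to 04:22 exchange about a `_files.xml` that lists its own md5: a minimal sketch of why the "run md5sum, then write the result into the metadata" workflow guessed at in the chat cannot produce a matching self-checksum. The XML template below is a made-up stand-in, not the real Internet Archive schema, and the workflow itself is only the chat's guess.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the hex MD5 digest of a byte string."""
    return hashlib.md5(data).hexdigest()


# Hypothetical stand-in for a _files.xml that records its own checksum.
TEMPLATE = (
    "<files>\n"
    '  <file name="item_files.xml" source="metadata">\n'
    "    <md5>{digest}</md5>\n"
    "  </file>\n"
    "</files>\n"
)

# Step 1: hash the file as it exists before the checksum is filled in.
before = TEMPLATE.format(digest="0" * 32)
digest_before = md5_hex(before.encode("utf-8"))

# Step 2: write that digest into the file, which changes the file's bytes.
after = TEMPLATE.format(digest=digest_before)
digest_after = md5_hex(after.encode("utf-8"))

# The recorded digest describes the old bytes, not the file that now holds it.
print("recorded:", digest_before)
print("actual:  ", digest_after)
print("match:   ", digest_before == digest_after)
```

A file whose embedded digest really matched would need to be a deliberately constructed fixed point, which is the case raised at 04:07; an ordinary hash-then-update pass does not get there.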
[19:34] *** unstable has quit IRC (Ping timeout: 260 seconds)
[19:34] *** unstable has joined #archiveteam-bs
[19:38] *** Stiletto has joined #archiveteam-bs
[19:43] *** Start has joined #archiveteam-bs
[20:28] Its all grouped around a school it seems. http://www.friendsreunited.co.uk/chancellor-s-school/People/a445cb53-6c27-4ce8-b39e-830ce1efc15d is an example URL for a school
[20:30] would be great if it didnt return to login upon brosing
[20:30] Yea, and their email validation worked
[20:30] http://www.friendsreunited.co.uk/Home/Login?ReturnUrl=
[20:31] you able to get email from them?
[20:31] not yet, doing a support account recovery
[20:31] might know tomorrow
[20:32] Cant seem to get email validation from them
[20:34] http://www.friendsreunited.co.uk/Memory/a445cb53-6c27-4ce8-b39e-830ce1efc15d/Group/0?nullableid=f91241a8-fb0c-4285-ace2-d9852c2b48eb is an example URL for a "memory" or picture
[20:35] gah return url
[20:35] you also have "discussions" http://www.friendsreunited.co.uk/Discussion/View?topicId=e2b0fec9-df39-4c6a-b39f-8e46b63ff255
[20:36] its all about breaking that hash at the end
[20:36] yea
[20:37] this is a mammoth task
[20:37] since it's like archiving multi million webpages
[20:37] and probably hundred of thousand images
[20:37] it is, and we dont have long
[20:38] on a site that seems to be aging and slow. https://harrycross.me/aea.png
[20:38] in a very bad state too
[20:41] I cant get to a lot of places as it requires validation and their mail servers are already down or something
[20:42] the hash at the end is a uuid
[20:42] Which if they are using an external provider, sounds very likely
[20:42] 126 random bits
[20:42] uuid -d f91241a8-fb0c-4285-ace2-d9852c2b48eb
[20:42] encode: STR:     f91241a8-fb0c-4285-ace2-d9852c2b48eb
[20:42]         SIV:     331072564038548777722403356915675121899
[20:42] decode: variant: DCE 1.1, ISO/IEC 11578:1996
[20:42]         version: 4 (random data based)
[20:42]         content: F9:12:41:A8:FB:0C:02:85:2C:E2:D9:85:2C:2B:48:EB
[20:42]                  (no semantics: random data only)
[20:45] *** Start has quit IRC (Quit: Disconnected.)
[20:45] yeah need validation email for my new account
[20:47] images stored elsewhere
[20:47] e.g. http://www.assetstorage.co.uk/AssetStorageService.svc/GetImageFriendly/666573312/400/600/0/0/1/80/ResizeBestFit/0/FRU/577273B6AF61D4F1F65A6C5DD1B0E3C4/these-are-my-old-school-mates-im-not-in-the-photo.jpg
[20:48] GHood
[20:48] Good
[20:48] If someone has a working friendsunited accounts, please let me know
[20:48] I'll only use the account to investigate the website. It will not be used to grab the project with
[20:48] ^repost fo #archiveteam
[20:48] HCross, github url?
[20:48] https://github.com/ArchiveTeam/gamefront-grab
[20:48] at least the site can be done 2 pronged, site and images
[20:49] arkiver, SimpBrain shall we get a chan for Friends Ununited?
[20:49] let's do that
[20:49] #friendsununited
[20:49] ?
[20:49] *** Start has joined #archiveteam-bs
[20:51] yes
[20:56] hmm twitter feed is old, never used
[21:02] Preparing something for the Wiki
[21:05] ! Whoa! Your concurrency level is at 10. !
[21:05] ! Please check if this is what you wanted. !
[21:05] ! Continuing anyway... !
[21:05] !
[21:05] Traceback (most recent call last):
[21:05]   File "/usr/local/bin/run-pipeline", line 6, in <module>
[21:05]     main()
[21:05]   File "/usr/local/lib/python2.7/dist-packages/seesaw/script/run_pipeline.py", line 223, in main
[21:05]     runner = init_runner(args)
[21:05]   File "/usr/local/lib/python2.7/dist-packages/seesaw/script/run_pipeline.py", line 252, in init_runner
[21:05]     (project, pipeline) = load_pipeline(args.pipeline, context)
[21:05]   File "/usr/local/lib/python2.7/dist-packages/seesaw/script/run_pipeline.py", line 39, in load_pipeline
[21:05]     exec(pipeline_str, local_context, global_context)
[21:05]   File "<string>", line 18, in <module>
[21:06] ImportError: No module named requests
[21:06] such fuckup
[21:06] pip install requests
[21:07] *** bzc6p has joined #archiveteam-bs
[21:09] So I've read on the Digitize wiki that for scanning paper media, 600 DPI and TIFF format is preferred.
[21:09] yep
[21:10] Am I doing something wrong if that makes an A4 (color) page ~30 MB? Thus, a 50-page magazine will be 1,5 GB. Isn't that much?
[21:10] that's what happens
[21:10] it seems like a reasonable amount of disk space to me tbh
[21:11] but you can turn on compression i guess
[21:11] try not to make it look like shite :P
[21:11] Also, I was looking at some magazines, books on IA, and they are surprisingly little in terms of file size
[21:11] bzc6p: TIFF is lossless and uncompressed, so yeah, 30MB for a color A4 is totally reasonable.
[21:11] IA uses jpeg2000 for derived images
[21:12] Here's one of SketchCow's knitting magazines: https://ia601509.us.archive.org/17/items/Knit_Simple_2013-11/
[21:12] Which was the original file?
[21:13] They are in the 100 MB range, although 67 pages in 600 dpi. How's that?
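Note on the page-size figures being compared here: a rough back-of-the-envelope sketch of what an uncompressed A4 scan weighs. It assumes 24-bit RGB (3 bytes per pixel), no compression, and no TIFF header overhead; the log does not state the bit depth or compression that xsane was actually using, so treat the numbers as estimates only.

```python
MM_PER_INCH = 25.4


def uncompressed_bytes(width_mm: float, height_mm: float, dpi: int,
                       bytes_per_pixel: int = 3) -> int:
    """Raw pixel data size for a scan, ignoring headers and compression."""
    width_px = round(width_mm / MM_PER_INCH * dpi)
    height_px = round(height_mm / MM_PER_INCH * dpi)
    return width_px * height_px * bytes_per_pixel


# A4 is 210 mm x 297 mm.
a4_600 = uncompressed_bytes(210, 297, 600)
a4_300 = uncompressed_bytes(210, 297, 300)

print(f"A4 @ 600 dpi, 24-bit: {a4_600 / 1e6:.1f} MB per page")
print(f"A4 @ 300 dpi, 24-bit: {a4_300 / 1e6:.1f} MB per page")

# The 86 MB, 67-page PDF mentioned in the chat, averaged per page.
pdf_per_page = 86e6 / 67
print(f"Knit Simple PDF:      {pdf_per_page / 1e6:.2f} MB per page")
```

Under these assumptions a raw 600 dpi colour A4 page comes to roughly 104 MB and a 300 dpi page to roughly 26 MB, so a ~30 MB page would suggest either a lower effective resolution or some compression already being applied. The PDF's average of a bit over 1 MB per page is only plausible with lossy compression.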
[21:13] bzc6p: you can tell which are original by looking at the metadata
[21:13] Guyz
[21:13] good news
[21:13] Belgium works
[21:13] pulling data for ya now
[21:14] https://ia601509.us.archive.org/17/items/Knit_Simple_2013-11/Knit_Simple_2013-11_files.xml
[21:15] JW_work: thanks
[21:15] So 86 MB? That's like 3 pages for me.
[21:15] Did SketchCow use JPG or did some other magic?
[21:17] probably pdf
[21:17] er
[21:17] jpeg in pdf
[21:17] This is the original file: https://ia601509.us.archive.org/17/items/Knit_Simple_2013-11/Knit_Simple_2013-11.pdf
[21:17] I see
[21:17] That can't be TIFF
[21:18] Original files might have been scanned as TIFF, then converted to JPEG in PDF before being uploaded to IA.
[21:18] jpeg in pdf is pretty normal
[21:18] Yeah
[21:20] Is there a difference between scanning in TIFF than converting to JPG and scanning directly to JPG? I thought conversion is done when exporting, not right in the scanning software.
[21:20] MIght depend on the software though. I use xsane
[21:20] *then
[21:21] there is no difference except maybe the settings on the jpeg encoder
[21:23] OH. Now I know why I'm confused.
[21:23] I though the TIFF 600dpi was written by SketchCow. (I wondered why did he use jpg then) It wasn't.
[21:24] ah yeah it's probably a thing he found somewhere
[21:24] The TIFF 600 dpi was written by "Savetz"
[21:25] the document saying to choose it, or the image?
[21:25] http://digitize.archiveteam.org/index.php/Paper_Media Here Savetz emphasizes 600 dpi TIFF.
[21:26] yes
[21:26] use 600dpi tiff
[21:26] I thought it was written by SketchCow, that's why I didn't get why SketchCow uploaded 600dpi jpg.
[21:26] what is the purpose of this conversation though
[21:26] I'm planning to scan some magazines and want to decide the format.
[21:27] and quality
[21:27] pile of sequentially numbered tiffs in a zipfile is good idea
[21:27] Doesn't compress well.
[21:27] ok?
[21:28] what is your goal though
[21:28] what does that not do for you
[21:29] Savetz, btw, is Kevin Savetz — collector of computer magazines & Atari material, big supporter of IA
[21:29] It just seemed a bit too big for me that a year of a magazine can take up 50–100 GBs, but if IA can hold it, I don't mind.
[21:29] you can make tiff with jpeg compression if you can't stand the size of tiff normally
[21:31] I'll think it over. Thanks for the answers.
[21:32] bzc6p: Why do you care if it compresses well?
[21:32] It doesn't.
[21:32] Are you an it?
[21:32] so i made my grab script for Aviation Week grab better
[21:33] bzc6p: i agree
[21:33] ersi_: I think I'll check the scanner software settings again, but 7z-ing the TIFF didn't decrease the size notably.
[21:34] I'm asking why you care, not if it's possible
[21:34] godane: with what?
[21:34] about magazines being 100gb for 1 year
[21:35] Just scan it in the highest possible resolution in a lossless format (TIFF?). If you scan it with as high resolution as you can (and is of quality) and losslessy, you can always convert it into some other kind of format later on - if you want it in lower resolution, or worse quality (for size)
[21:35] i have noticed that sometimes archive.org derives comic book zips right but not all the time
[21:35] take the pdf from this: https://archive.org/details/NextGeneration36Dec1997
[21:36] that pdf is like thumbnails
[21:36] I thought IA might mind that big size compared to the amount of information such a thing holds.
[21:36] ersi_ ^
[21:37] but the watt meter pdf redrive is very good: https://archive.org/details/Instruction_Book_for_the_Thruline_Wattmeter_Model_43
[21:38] one theory i have is it doesn't redrive cbr/rar files right
[21:40] I think I'll just scan it in 600dpi JPG so I can sleep well too. Some of the stuff is not that important anyway. And I don't really fear OCR's won't be able to cope with it.
[21:41] high quality jpeg is pretty good
[21:41] Also, if SketchCow did that, then it must be fair.
[21:41] after all, jason scott can do no wrong
[21:43] Well, then thanks again, and good night/afternoon
[21:43] bzc6p: No worries - but it's always good to check the scans. Don't use higher resolution if it doesn't provide any gain
[21:43] But still, high resolution is good - as you can always shrink it afterwards if it's "too big" :)
[21:43] That's what I've been considering!
[21:43] Oh, right. Nighty :)
[21:43] Ah, yeah
[21:45] *** antomati_ has joined #archiveteam-bs
[21:46] 600dpi jpeg is better than 300dpi tiff, imo
[21:46] Agreed
[21:46] It's hard to say what the perfect resolution is. But if in doubt, use higher than lower IMO :)
[21:46] *** lytv has quit IRC (Read error: Connection reset by peer)
[21:46] *** antomatic has quit IRC (Read error: Connection reset by peer)
[21:46] *** wednesday has quit IRC (Ping timeout: 252 seconds)
[21:47] *** wednesday has joined #archiveteam-bs
[21:47] *** lytv has joined #archiveteam-bs
[21:56] *** bzc6p has left
[22:16] *** Start has quit IRC (Quit: Disconnected.)
[23:08] *** Start has joined #archiveteam-bs
[23:58] *** superkuh has quit IRC (Remote host closed the connection)
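Note on the 20:42 `uuid -d` output for the Friends Reunited "memory" ID: a small sketch that reproduces the decode with Python's standard `uuid` module and counts the random bits. By the usual version-4 layout, 4 bits carry the version and 2 bits the variant, which leaves 122 random bits rather than the 126 mentioned in the chat. The masking step is only an illustration of why the tool's "content" line differs from the raw string in two bytes.

```python
import uuid

# The ID from the example "memory" URL in the log.
u = uuid.UUID("f91241a8-fb0c-4285-ace2-d9852c2b48eb")

print("version:", u.version)                    # version field of the UUID
print("variant:", u.variant)                    # RFC 4122 (a.k.a. DCE 1.1) layout
print("integer:", u.int)                        # the 128-bit value as one integer

# Version 4 fixes 4 version bits and 2 variant bits; the rest is random.
fixed_bits = 4 + 2
random_bits = 128 - fixed_bits
print("random bits:", random_bits)

# Clearing the fixed bits leaves only the random content, which is what
# the "content:" line of `uuid -d` shows (0x42 -> 0x02, 0xAC -> 0x2C).
masked = u.int & ~(0xF << 76) & ~(0x3 << 62)
content = format(masked, "032x")
print("content:", ":".join(content[i:i + 2].upper() for i in range(0, 32, 2)))
```

With that many random bits per ID, the IDs cannot realistically be guessed or enumerated, so a grab would have to discover them by crawling pages that link to them.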