#archiveteam-bs 2016-01-18,Mon

Time Nickname Message
01:03 🔗 BlueMaxim has joined #archiveteam-bs
01:25 🔗 BlueMaxim has quit IRC (hub.efnet.us irc.servercentral.net)
01:25 🔗 schbirid has quit IRC (hub.efnet.us irc.servercentral.net)
01:25 🔗 superkuh has quit IRC (hub.efnet.us irc.servercentral.net)
01:25 🔗 HCross has quit IRC (hub.efnet.us irc.servercentral.net)
01:25 🔗 ohhdemgir has quit IRC (hub.efnet.us irc.servercentral.net)
01:25 🔗 signius has quit IRC (hub.efnet.us irc.servercentral.net)
01:25 🔗 Sanqui has quit IRC (hub.efnet.us irc.servercentral.net)
01:25 🔗 balrog has quit IRC (hub.efnet.us irc.servercentral.net)
01:25 🔗 slyphic|a has quit IRC (hub.efnet.us irc.servercentral.net)
01:25 🔗 jk[SVP] has quit IRC (hub.efnet.us irc.servercentral.net)
01:25 🔗 SN4T14 has quit IRC (hub.efnet.us irc.servercentral.net)
01:25 🔗 Infreq_ has quit IRC (hub.efnet.us irc.servercentral.net)
01:25 🔗 rduser has quit IRC (hub.efnet.us irc.servercentral.net)
01:25 🔗 w0rp has quit IRC (hub.efnet.us irc.servercentral.net)
01:25 🔗 swebb has quit IRC (hub.efnet.us irc.servercentral.net)
01:25 🔗 chazchaz has quit IRC (hub.efnet.us irc.servercentral.net)
01:25 🔗 dcmorton has quit IRC (hub.efnet.us irc.servercentral.net)
01:25 🔗 dxrt has quit IRC (hub.efnet.us irc.servercentral.net)
01:25 🔗 atlogbot has quit IRC (hub.efnet.us irc.servercentral.net)
01:25 🔗 swebb_ has joined #archiveteam-bs
01:25 🔗 rduser` has joined #archiveteam-bs
01:25 🔗 superkuh_ has joined #archiveteam-bs
01:25 🔗 Infreq has joined #archiveteam-bs
01:26 🔗 w0rp_ has joined #archiveteam-bs
01:27 🔗 Sanky has joined #archiveteam-bs
01:28 🔗 balrog_ has joined #archiveteam-bs
01:40 🔗 rduser` is now known as rduser
01:40 🔗 w0rp_ is now known as w0rp
01:40 🔗 balrog_ is now known as balrog
01:40 🔗 swebb_ is now known as swebb
01:44 🔗 slyphic has joined #archiveteam-bs
01:45 🔗 chazchaz has joined #archiveteam-bs
01:48 🔗 jk[SVP] has joined #archiveteam-bs
02:00 🔗 dxrt has joined #archiveteam-bs
02:18 🔗 JesseW has quit IRC (Read error: Operation timed out)
02:22 🔗 JesseW has joined #archiveteam-bs
02:27 🔗 schbirid2 has joined #archiveteam-bs
02:59 🔗 ohhdemgir has joined #archiveteam-bs
04:04 🔗 JesseW https://ia600305.us.archive.org/29/items/dn2001-0926_vid/dn2001-0926_vid_files.xml contains a claim that the md5 of *itself* is 8bb5e8561b541b2cb205b9415278a870
04:04 🔗 JesseW That can't work, can it?
04:05 🔗 JesseW and it isn't correct, in any case...
04:07 🔗 MrRadar At this point MD5 is so broken I think it would be possible to generate files containing their own MD5 sum (possibly with some garbage at the end)
04:07 🔗 MrRadar But I doubt the Internet Archive would put the effort into doing that
04:08 🔗 JesseW lol
04:08 🔗 JesseW it does seem to be *present* in all the files I look at, though...
04:08 🔗 JesseW very strange
04:08 🔗 MrRadar More likely they're just running md5sum on all the files and then updating the metadata
04:09 🔗 MrRadar Which obviously changes the metadata's MD5
04:11 🔗 JesseW lol
04:12 🔗 JesseW hm, I wonder if the provided one is correct if that line is removed, then
04:22 🔗 JesseW nope, the provided one does not match any variations on the actual file that I've tried
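[Editor's sketch] The check JesseW describes can be reproduced roughly as below: hash the _files.xml as downloaded, then hash it again with its own md5 element removed, and compare both against the claimed value. The XML layout in the regular expressions (a `<file name="...">` entry containing an `<md5>` element) and the function names are assumptions made for illustration, not a confirmed description of how the Internet Archive writes or hashes these files.

```python
import hashlib
import re


def md5_hex(data: bytes) -> str:
    """Return the hex MD5 digest of a byte string."""
    return hashlib.md5(data).hexdigest()


def check_self_claim(xml_bytes: bytes, own_name: str):
    """Compare the md5 a _files.xml claims for itself with computed digests.

    Assumes (hypothetically) that the file's own entry looks like
    <file name="NAME" ...> ... <md5>HEX</md5> ... </file>.
    Returns None if no such entry or md5 element is found.
    """
    entry = re.search(
        rb'<file name="' + re.escape(own_name.encode()) + rb'".*?</file>',
        xml_bytes,
        re.DOTALL,
    )
    if entry is None:
        return None
    md5_el = re.search(rb"<md5>([0-9a-f]{32})</md5>", entry.group(0))
    if md5_el is None:
        return None
    claimed = md5_el.group(1).decode()
    # Cut the md5 element out of the original bytes to test the
    # "hash was taken before this line was added" theory.
    start = entry.start() + md5_el.start()
    end = entry.start() + md5_el.end()
    stripped = xml_bytes[:start] + xml_bytes[end:]
    return {
        "claimed": claimed,
        "matches_as_is": md5_hex(xml_bytes) == claimed,
        "matches_without_md5_element": md5_hex(stripped) == claimed,
    }
```

Writing a digest into a file changes the file's digest, so `matches_as_is` is expected to be false in practice. The second comparison only covers one of the "variations" mentioned above.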
04:30 🔗 JesseW there's also the oddity that in this old item I'm looking at https://archive.org/metadata/dn2001-0926_vid -- it claims the _files.xml file is "original" while in the more recent item https://archive.org/metadata/nasa -- it is more correctly identified as "metadata".
04:38 🔗 yipdw I bet someone at IA would know why
04:39 🔗 JesseW I've asked in the stacks slack channel.
04:49 🔗 superkuh_ has quit IRC (Remote host closed the connection)
05:20 🔗 JesseW next interesting question -- what's the _meta.sqlite file, and why is *it* considered original Metadata?
05:22 🔗 acridAxid has quit IRC (Read error: Operation timed out)
05:45 🔗 coretx has joined #archiveteam-bs
05:46 🔗 * JesseW has now written a jq filter that will extract the md5s from both the census file and live data. Now to compare them (at least for some items to start)
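[Editor's sketch] The jq filter itself is not shown in the log. The comparison it feeds could look roughly like this in Python. It assumes the live metadata JSON has a top-level "files" list whose entries carry "name" and "md5" keys, and that the census data for one item has already been reduced to a name-to-md5 mapping. The census layout and the function names are hypothetical.

```python
def md5s_from_metadata(doc: dict) -> dict:
    """Map file name -> md5 for every file entry that has an md5."""
    return {f["name"]: f["md5"] for f in doc.get("files", []) if "md5" in f}


def diff_md5s(census: dict, live: dict) -> dict:
    """Report files whose md5 differs or that exist on only one side."""
    shared = census.keys() & live.keys()
    return {
        "changed": sorted(n for n in shared if census[n] != live[n]),
        "only_in_census": sorted(census.keys() - live.keys()),
        "only_in_live": sorted(live.keys() - census.keys()),
    }


if __name__ == "__main__":
    # Made-up sample data, not real census or archive.org values.
    live_doc = {
        "files": [
            {"name": "a.mp4", "md5": "1" * 32},
            {"name": "a_meta.xml", "md5": "2" * 32},
            {"name": "a_files.xml"},  # entries without an md5 are skipped
        ]
    }
    census = {"a.mp4": "1" * 32, "a_meta.xml": "3" * 32, "old.txt": "4" * 32}
    print(diff_md5s(census, md5s_from_metadata(live_doc)))
```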
05:46 🔗 JesseW hi coretx, welcome
05:46 🔗 coretx ty ^_^
05:46 🔗 JesseW (I don't know who you are)
05:46 🔗 coretx Some people do, and many don't need to know :)
05:46 🔗 JesseW (I just noticed you trying to make it in here from #archiveteam)
05:47 🔗 coretx Yeah, nasty bug in my quassel client.
05:47 🔗 coretx It gets even worse when you msg nickserv for identification.
05:47 🔗 yipdw oh badass, I can chat from a Wayland session
05:48 🔗 yipdw I gotta reconfigure my system to do this without having to do X -> Weston crap
06:22 🔗 vitzli has joined #archiveteam-bs
06:41 🔗 phuzion coretx: Which version of Quassel are you running that has that problem?
07:21 🔗 godane https://archive.org/details/Galactic_Video_Review_1989_VHSRip
07:23 🔗 JesseW Hm, anyone have any ideas why IA's search apparently returns multiple entries for the same item sometimes?
07:29 🔗 SketchCow godane: Don't add covers
07:30 🔗 godane ok
07:30 🔗 godane why did you add covers?
07:30 🔗 SketchCow I did a range, and will do another
07:30 🔗 SketchCow because it looked like crap the other way
07:31 🔗 SketchCow Now when you go to an individual day of broadcast, it's the logos and the waveforms as you click on them.
07:31 🔗 godane vs just the waveforms
07:32 🔗 godane the collection cover was there before derive
07:32 🔗 SketchCow It'll work out.
07:32 🔗 godane i really didn't like
07:33 🔗 godane plus side is my cover adding re-derived like 5 items from 2004
07:33 🔗 godane that you missed
07:35 🔗 godane looks like 2004-07-29 and 2004-07-31 weren't derived
07:35 🔗 JesseW are you all talking about ones like https://archive.org/details/kpfa-archives-radio-podcast-2007-02-15 ?
07:36 🔗 JesseW (which has neither waveforms or a cover, FWIW)
07:38 🔗 godane it has waveforms when i'm on the item and on collection page
07:39 🔗 godane i get waveforms when i'm logged out too
07:40 🔗 Stiletto has quit IRC (Ping timeout: 246 seconds)
07:40 🔗 godane SketchCow: you only have to add covers to items past march 2005 i think now
07:40 🔗 JesseW godane: for the item I linked to? Strange.
07:41 🔗 godane thats what i thought
07:41 🔗 godane i thought it was cause i was logged in or something
07:41 🔗 godane but its not that
07:42 🔗 JesseW I see cover images or waveforms for many of them on https://archive.org/details/kpfa_podcasts
07:43 🔗 JesseW including the one I linked to. But nothing on the item page itself...
07:43 🔗 JesseW just "There Is No Preview Available For This Item"
07:44 🔗 godane oh
07:44 🔗 godane i only noticed there was no difference on search page
07:45 🔗 godane SketchCow: i think this was the podcasts collection problem i told you about
07:45 🔗 godane where items are not viewable if you're not logged in
07:46 🔗 godane same problem with a cbsnews.com page: https://archive.org/details/cbsnews.com-video-2007-06-29
07:47 🔗 godane i think when there is more than one file in an item it causes this problem
07:49 🔗 godane but some things i have to do that with just so i can get it uploaded
07:49 🔗 * JesseW is about to finish downloading the IA census data :-)
07:53 🔗 godane nevermind
07:53 🔗 godane the problem is on 1 file per item collections too: https://archive.org/details/Martin_Yan_Quick_and_Easy_S01E03
07:54 🔗 godane good news is there is no dmca notice on it yet
07:54 🔗 godane the collection was saying 0 items on the podcasts collection page
07:56 🔗 godane SketchCow: so if you want to know, there is a bug that keeps items from displaying right unless you're logged in
07:56 🔗 JesseW They don't display for me even when I log in...
07:56 🔗 godane when i'm logged in it works
07:57 🔗 godane SketchCow: correction *only when the user is logged in does the item display right
07:58 🔗 godane SketchCow: best idea is to take one of the smaller collections out to see if the problem exists anymore
07:59 🔗 godane my vote goes for Martin Yan's shows: https://archive.org/details/Martin_Yan_Shows
08:01 🔗 godane cause its small and it doesn't belong in the podcasts collection in the first place
08:01 🔗 godane JesseW: does this item work when logged out: https://archive.org/details/Joystiq_Xbox_360_Fancast_212
08:03 🔗 JesseW has quit IRC (Ping timeout: 246 seconds)
09:31 🔗 Sanky is now known as Sanqui
09:38 🔗 RichardG has quit IRC (Ping timeout: 260 seconds)
10:03 🔗 vitzli has quit IRC (Leaving)
10:28 🔗 superkuh has joined #archiveteam-bs
10:33 🔗 signius has joined #archiveteam-bs
11:29 🔗 vtyl has quit IRC (Ping timeout: 252 seconds)
11:41 🔗 lytv has joined #archiveteam-bs
13:29 🔗 RichardG has joined #archiveteam-bs
13:43 🔗 HCross has joined #archiveteam-bs
14:44 🔗 SketchCow :)
14:44 🔗 SketchCow You two pairing up is a dangerous thing
14:49 🔗 HCross Friends Reunited is closing. http://www.bbc.co.uk/news/technology-35343091 Shall we go for it?
17:01 🔗 HCross I'm grabbing http://parliamentlive.tv/Event/Index/83208344-218d-4c43-9300-ca78c374b875
17:06 🔗 JesseW has joined #archiveteam-bs
17:27 🔗 JesseW has quit IRC (Read error: Operation timed out)
17:45 🔗 JW_work has quit IRC (Read error: Operation timed out)
17:48 🔗 JW_work has joined #archiveteam-bs
17:48 🔗 Start has quit IRC (Quit: Disconnected.)
17:48 🔗 JW_work godane: I do see a waveform for https://archive.org/details/Joystiq_Xbox_360_Fancast_212 when logged in or logged out.
18:24 🔗 godane i think its cause the item is not part of the podcasts collection
18:42 🔗 Start has joined #archiveteam-bs
19:17 🔗 Start has quit IRC (Quit: Disconnected.)
19:34 🔗 unstable has quit IRC (Ping timeout: 260 seconds)
19:34 🔗 unstable has joined #archiveteam-bs
19:38 🔗 Stiletto has joined #archiveteam-bs
19:43 🔗 Start has joined #archiveteam-bs
20:28 🔗 HCross It's all grouped around schools, it seems. http://www.friendsreunited.co.uk/chancellor-s-school/People/a445cb53-6c27-4ce8-b39e-830ce1efc15d is an example URL for a school
20:30 🔗 SimpBrain would be great if it didn't return to login upon browsing
20:30 🔗 HCross Yea, and their email validation worked
20:30 🔗 SimpBrain http://www.friendsreunited.co.uk/Home/Login?ReturnUrl=
20:31 🔗 HCross you able to get email from them?
20:31 🔗 SimpBrain not yet, doing a support account recovery
20:31 🔗 SimpBrain might know tomorrow
20:32 🔗 HCross Cant seem to get email validation from them
20:34 🔗 HCross http://www.friendsreunited.co.uk/Memory/a445cb53-6c27-4ce8-b39e-830ce1efc15d/Group/0?nullableid=f91241a8-fb0c-4285-ace2-d9852c2b48eb is an example URL for a "memory" or picture
20:35 🔗 SimpBrain gah return url
20:35 🔗 HCross you also have "discussions" http://www.friendsreunited.co.uk/Discussion/View?topicId=e2b0fec9-df39-4c6a-b39f-8e46b63ff255
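[Editor's sketch] The three URL shapes HCross lists (school, memory, discussion) can be told apart and their identifiers pulled out with a few lines of Python. The classification rules below are inferred only from the three example URLs in this log, so they are assumptions and may not cover the whole site. `classify` and `extract_uuids` are hypothetical helper names.

```python
import re
from urllib.parse import urlparse

# Canonical 8-4-4-4-12 hex layout of a UUID string.
UUID_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}",
    re.IGNORECASE,
)


def extract_uuids(url: str) -> list:
    """Return every UUID-shaped token in the URL, path and query included."""
    return UUID_RE.findall(url)


def classify(url: str) -> str:
    """Guess the page type from the path, based on the examples in the log."""
    path = urlparse(url).path
    if path.startswith("/Memory/"):
        return "memory"
    if path.startswith("/Discussion/"):
        return "discussion"
    if "/People/" in path:
        return "school"
    return "unknown"


if __name__ == "__main__":
    examples = [
        "http://www.friendsreunited.co.uk/chancellor-s-school/People/a445cb53-6c27-4ce8-b39e-830ce1efc15d",
        "http://www.friendsreunited.co.uk/Memory/a445cb53-6c27-4ce8-b39e-830ce1efc15d/Group/0?nullableid=f91241a8-fb0c-4285-ace2-d9852c2b48eb",
        "http://www.friendsreunited.co.uk/Discussion/View?topicId=e2b0fec9-df39-4c6a-b39f-8e46b63ff255",
    ]
    for url in examples:
        print(classify(url), extract_uuids(url))
```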
20:36 🔗 SimpBrain its all about breaking that hash at the end
20:36 🔗 HCross yea
20:37 🔗 SimpBrain this is a mammoth task
20:37 🔗 SimpBrain since it's like archiving millions of webpages
20:37 🔗 SimpBrain and probably hundreds of thousands of images
20:37 🔗 HCross it is, and we dont have long
20:38 🔗 HCross on a site that seems to be aging and slow. https://harrycross.me/aea.png
20:38 🔗 HCross in a very bad state too
20:41 🔗 HCross I cant get to a lot of places as it requires validation and their mail servers are already down or something
20:42 🔗 xmc the hash at the end is a uuid
20:42 🔗 HCross Which if they are using an external provider, sounds very likely
20:42 🔗 xmc 122 random bits
20:42 🔗 xmc uuid -d f91241a8-fb0c-4285-ace2-d9852c2b48eb
20:42 🔗 xmc encode: STR: f91241a8-fb0c-4285-ace2-d9852c2b48eb
20:42 🔗 xmc SIV: 331072564038548777722403356915675121899
20:42 🔗 xmc decode: variant: DCE 1.1, ISO/IEC 11578:1996
20:42 🔗 xmc version: 4 (random data based)
20:42 🔗 xmc content: F9:12:41:A8:FB:0C:02:85:2C:E2:D9:85:2C:2B:48:EB
20:42 🔗 xmc (no semantics: random data only)
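[Editor's sketch] The same decoding can be done with Python's standard `uuid` module instead of the `uuid -d` command line tool. In a version-4 UUID, 4 bits encode the version and 2 bits encode the variant, which leaves 122 of the 128 bits for random data. The `describe` helper is a hypothetical name used for illustration.

```python
import uuid

# 128 bits total, minus 4 version bits and 2 variant bits.
RANDOM_BITS_V4 = 128 - 4 - 2


def describe(uuid_str: str) -> dict:
    """Return the version and variant fields of a UUID string."""
    u = uuid.UUID(uuid_str)
    return {
        "version": u.version,  # 4 means "random data based"
        "is_rfc4122_variant": u.variant == uuid.RFC_4122,
        "hex": u.hex,
    }


if __name__ == "__main__":
    print(describe("f91241a8-fb0c-4285-ace2-d9852c2b48eb"))
    print("random bits in a v4 UUID:", RANDOM_BITS_V4)
```

With that much randomness there is nothing to enumerate or "break". The identifiers have to be discovered by crawling pages that link to them.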
20:45 🔗 Start has quit IRC (Quit: Disconnected.)
20:45 🔗 SimpBrain yeah need validation email for my new account
20:47 🔗 SimpBrain images stored elsewhere
20:47 🔗 SimpBrain e.g. http://www.assetstorage.co.uk/AssetStorageService.svc/GetImageFriendly/666573312/400/600/0/0/1/80/ResizeBestFit/0/FRU/577273B6AF61D4F1F65A6C5DD1B0E3C4/these-are-my-old-school-mates-im-not-in-the-photo.jpg
20:48 🔗 HCross Good
20:48 🔗 arkiver <arkiver>If someone has a working Friends Reunited account, please let me know
20:48 🔗 arkiver <arkiver>I'll only use the account to investigate the website. It will not be used to grab the project with
20:48 🔗 arkiver ^repost fo #archiveteam
20:48 🔗 limebyte HCross, github url?
20:48 🔗 HCross https://github.com/ArchiveTeam/gamefront-grab
20:48 🔗 SimpBrain at least the site can be done 2 pronged, site and images
20:49 🔗 HCross arkiver, SimpBrain shall we get a chan for Friends Ununited?
20:49 🔗 arkiver let's do that
20:49 🔗 HCross #friendsununited
20:49 🔗 HCross ?
20:49 🔗 Start has joined #archiveteam-bs
20:51 🔗 arkiver yes
20:56 🔗 SimpBrain hmm twitter feed is old, never used
21:02 🔗 HCross Preparing something for the Wiki
21:05 🔗 limebyte ! Whoa! Your concurrency level is at 10. !
21:05 🔗 limebyte ! Please check if this is what you wanted. !
21:05 🔗 limebyte ! Continuing anyway... !
21:05 🔗 limebyte !
21:05 🔗 limebyte Traceback (most recent call last):
21:05 🔗 limebyte File "/usr/local/bin/run-pipeline", line 6, in <module>
21:05 🔗 limebyte main()
21:05 🔗 limebyte File "/usr/local/lib/python2.7/dist-packages/seesaw/script/run_pipeline.py", line 223, in main
21:05 🔗 limebyte runner = init_runner(args)
21:05 🔗 limebyte File "/usr/local/lib/python2.7/dist-packages/seesaw/script/run_pipeline.py", line 252, in init_runner
21:05 🔗 limebyte (project, pipeline) = load_pipeline(args.pipeline, context)
21:05 🔗 limebyte File "/usr/local/lib/python2.7/dist-packages/seesaw/script/run_pipeline.py", line 39, in load_pipeline
21:05 🔗 limebyte exec(pipeline_str, local_context, global_context)
21:05 🔗 limebyte File "<string>", line 18, in <module>
21:06 🔗 limebyte ImportError: No module named requests
21:06 🔗 limebyte such fuckup
21:06 🔗 HCross pip install requests
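[Editor's sketch] The traceback above ends in an ImportError because the `requests` module was not installed. A pre-flight check like the one below reports missing modules before a long-running script starts. The default list of names is an assumption based only on this traceback, and `missing_modules` is a hypothetical helper, not part of seesaw.

```python
import importlib.util


def missing_modules(names=("seesaw", "requests")) -> list:
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing modules, try: pip install " + " ".join(missing))
    else:
        print("All required modules were found.")
```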
21:07 🔗 bzc6p has joined #archiveteam-bs
21:09 🔗 bzc6p So I've read on the Digitize wiki that for scanning paper media, 600 DPI and TIFF format is preferred.
21:09 🔗 xmc yep
21:10 🔗 bzc6p Am I doing something wrong if that makes an A4 (color) page ~30 MB? Thus, a 50-page magazine will be 1,5 GB. Isn't that much?
21:10 🔗 xmc that's what happens
21:10 🔗 xmc it seems like a reasonable amount of disk space to me tbh
21:11 🔗 xmc but you can turn on compression i guess
21:11 🔗 xmc try not to make it look like shite :P
21:11 🔗 bzc6p Also, I was looking at some magazines, books on IA, and they are surprisingly small in terms of file size
21:11 🔗 phuzion bzc6p: TIFF is lossless and uncompressed, so yeah, 30MB for a color A4 is totally reasonable.
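[Editor's sketch] The size of an uncompressed scan follows directly from page size, resolution and colour depth. The arithmetic below assumes an A4 page of 210 x 297 mm and 24-bit RGB (3 bytes per pixel) and ignores the TIFF header. Under those assumptions an A4 page comes to about 26 MB at 300 dpi and about 104 MB at 600 dpi. A 30 MB file is much closer to the 300 dpi figure, so the scan under discussion may have been compressed or made at a lower effective resolution. That is a guess, not something stated in the log.

```python
MM_PER_INCH = 25.4


def raw_scan_bytes(width_mm: float, height_mm: float, dpi: int,
                   bytes_per_pixel: int = 3) -> int:
    """Uncompressed pixel data size for a scan of the given page and dpi."""
    width_px = round(width_mm / MM_PER_INCH * dpi)
    height_px = round(height_mm / MM_PER_INCH * dpi)
    return width_px * height_px * bytes_per_pixel


if __name__ == "__main__":
    for dpi in (300, 600):
        size = raw_scan_bytes(210, 297, dpi)
        print(f"A4 at {dpi} dpi: about {size / 1e6:.0f} MB uncompressed")
```

Doubling the resolution quadruples the pixel count, which is why the 600 dpi figure is four times the 300 dpi one.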
21:11 🔗 xmc IA uses jpeg2000 for derived images
21:12 🔗 bzc6p Here's one of SketchCow's knitting magazines: https://ia601509.us.archive.org/17/items/Knit_Simple_2013-11/
21:12 🔗 bzc6p Which was the original file?
21:13 🔗 bzc6p They are in the 100 MB range, although 67 pages in 600 dpi. How's that?
21:13 🔗 JW_work bzc6p: you can tell which are original by looking at the metadata
21:13 🔗 limebyte Guyz
21:13 🔗 limebyte good news
21:13 🔗 limebyte Belgium works
21:13 🔗 limebyte pulling data for ya now
21:14 🔗 JW_work https://ia601509.us.archive.org/17/items/Knit_Simple_2013-11/Knit_Simple_2013-11_files.xml
21:15 🔗 bzc6p JW_work: thanks
21:15 🔗 bzc6p So 86 MB? That's like 3 pages for me.
21:15 🔗 bzc6p Did SketchCow use JPG or did some other magic?
21:17 🔗 xmc probably pdf
21:17 🔗 xmc er
21:17 🔗 xmc jpeg in pdf
21:17 🔗 phuzion This is the original file: https://ia601509.us.archive.org/17/items/Knit_Simple_2013-11/Knit_Simple_2013-11.pdf
21:17 🔗 bzc6p I see
21:17 🔗 bzc6p That can't be TIFF
21:18 🔗 phuzion Original files might have been scanned as TIFF, then converted to JPEG in PDF before being uploaded to IA.
21:18 🔗 xmc jpeg in pdf is pretty normal
21:18 🔗 phuzion Yeah
21:20 🔗 bzc6p Is there a difference between scanning in TIFF then converting to JPG and scanning directly to JPG? I thought conversion is done when exporting, not right in the scanning software.
21:20 🔗 bzc6p Might depend on the software though. I use xsane
21:21 🔗 xmc there is no difference except maybe the settings on the jpeg encoder
21:23 🔗 bzc6p OH. Now I know why I'm confused.
21:23 🔗 bzc6p I thought the TIFF 600dpi was written by SketchCow. (I wondered why he used jpg then.) It wasn't.
21:24 🔗 xmc ah yeah it's probably a thing he found somewhere
21:24 🔗 bzc6p The TIFF 600 dpi was written by "Savetz"
21:25 🔗 xmc the document saying to choose it, or the image?
21:25 🔗 bzc6p http://digitize.archiveteam.org/index.php/Paper_Media Here Savetz emphasizes 600 dpi TIFF.
21:26 🔗 xmc yes
21:26 🔗 xmc use 600dpi tiff
21:26 🔗 bzc6p I thought it was written by SketchCow, that's why I didn't get why SketchCow uploaded 600dpi jpg.
21:26 🔗 xmc what is the purpose of this conversation though
21:26 🔗 bzc6p I'm planning to scan some magazines and want to decide the format.
21:27 🔗 bzc6p and quality
21:27 🔗 xmc pile of sequentially numbered tiffs in a zipfile is good idea
21:27 🔗 bzc6p Doesn't compress well.
21:27 🔗 xmc ok?
21:27 🔗 xmc what is your goal though
21:28 🔗 xmc what does that not do for you
21:29 🔗 JW_work Savetz, btw, is Kevin Savetz — collector of computer magazines & Atari material, big supporter of IA
21:29 🔗 bzc6p It just seemed a bit too big for me that a year of a magazine can take up 50–100 GBs, but if IA can hold it, I don't mind.
21:29 🔗 xmc you can make tiff with jpeg compression if you can't stand the size of tiff normally
21:31 🔗 bzc6p I'll think it over. Thanks for the answers.
21:32 🔗 ersi_ bzc6p: Why do you care if it compresses well?
21:32 🔗 bzc6p It doesn't.
21:32 🔗 ersi_ Are you an it?
21:32 🔗 godane so i made my grab script for Aviation Week grab better
21:33 🔗 godane bzc6p: i agree
21:33 🔗 bzc6p ersi_: I think I'll check the scanner software settings again, but 7z-ing the TIFF didn't decrease the size notably.
21:34 🔗 ersi_ I'm asking why you care, not if it's possible
21:34 🔗 bzc6p godane: with what?
21:34 🔗 godane about magazines being 100gb for 1 year
21:35 🔗 ersi_ Just scan it in the highest possible resolution in a lossless format (TIFF?). If you scan it with as high resolution as you can (and is of quality) and losslessly, you can always convert it into some other kind of format later on - if you want it in lower resolution, or worse quality (for size)
21:35 🔗 godane i have noticed that sometimes archive.org derives comic book zips right but not all the time
21:35 🔗 godane take the pdf from this: https://archive.org/details/NextGeneration36Dec1997
21:36 🔗 godane that pdf is like thumbnails
21:36 🔗 bzc6p I thought IA might mind that big size compared to the amount of information such a thing holds.
21:36 🔗 bzc6p ersi_ ^
21:37 🔗 godane but the watt meter pdf redrive is very good: https://archive.org/details/Instruction_Book_for_the_Thruline_Wattmeter_Model_43
21:38 🔗 godane one theory i have is it doesn't redrive cbr/rar files right
21:40 🔗 bzc6p I think I'll just scan it in 600dpi JPG so I can sleep well too. Some of the stuff is not that important anyway. And I don't really fear that OCR won't be able to cope with it.
21:41 🔗 xmc high quality jpeg is pretty good
21:41 🔗 bzc6p Also, if SketchCow did that, then it must be fair.
21:41 🔗 xmc after all, jason scott can do no wrong
21:43 🔗 bzc6p Well, then thanks again, and good night/afternoon
21:43 🔗 ersi_ bzc6p: No worries - but it's always good to check the scans. Don't use higher resolution if it doesn't provide any gain
21:43 🔗 ersi_ But still, high resolution is good - as you can always shrink it afterwards if it's "too big" :)
21:43 🔗 bzc6p That's what I've been considering!
21:43 🔗 ersi_ Oh, right. Nighty :)
21:43 🔗 ersi_ Ah, yeah
21:45 🔗 antomati_ has joined #archiveteam-bs
21:46 🔗 xmc 600dpi jpeg is better than 300dpi tiff, imo
21:46 🔗 bzc6p Agreed
21:46 🔗 ersi_ It's hard to say what the perfect resolution is. But if in doubt, use higher than lower IMO :)
21:46 🔗 lytv has quit IRC (Read error: Connection reset by peer)
21:46 🔗 antomatic has quit IRC (Read error: Connection reset by peer)
21:46 🔗 wednesday has quit IRC (Ping timeout: 252 seconds)
21:47 🔗 wednesday has joined #archiveteam-bs
21:47 🔗 lytv has joined #archiveteam-bs
21:56 🔗 bzc6p has left
22:16 🔗 Start has quit IRC (Quit: Disconnected.)
23:08 🔗 Start has joined #archiveteam-bs
23:58 🔗 superkuh has quit IRC (Remote host closed the connection)
